hadoop - Filtering data after a join with Pig

Tags: hadoop apache-pig bigdata

I want to filter records after joining two files.

The file BX-Books.csv contains book data, and the file BX-Book-Ratings.csv contains book-rating data; ISBN is the column common to both files, and the inner join between them is done on this column.
I want to get the books published in 2002.

I used the following script, but I got 0 records.

grunt> BookXRecords = LOAD '/user/pradeep/BX-Books.csv'  USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray, Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray);
grunt> BookXRating = LOAD '/user/pradeep/BX-Book-Ratings.csv'  USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray);
grunt> BxJoin = JOIN BookXRecords BY ISBN, BookXRating BY ISBN;
grunt> BxJoin_Mod = FOREACH BxJoin GENERATE $0 AS ISBN, $1, $2, $3, $4;
grunt> FLTRBx2002 = FILTER BxJoin_Mod BY $3 == '2002';
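The join-then-filter logic itself is sound, which a plain-Python sketch can confirm. One common pitfall with the public BX dump (an assumption here, since the question does not show the raw rows) is that every field is wrapped in double quotes, and PigStorage(';') does not strip them, so YearOfPublication loads as '"2002"' and the equality filter matches nothing:

```python
# Hypothetical sample rows mimicking the BX dump layout (';'-separated,
# every field double-quoted). The ISBN and titles below are illustrative.
books_raw = '"0195153448";"Classical Mythology";"Mark P. O. Morford";"2002";"Oxford University Press";"u1";"u2";"u3"'
ratings_raw = '"276725";"0195153448";"0"'

books = [line.split(';') for line in books_raw.splitlines()]
ratings = [line.split(';') for line in ratings_raw.splitlines()]

# Inner join on ISBN (field 0 of books, field 1 of ratings), like the Pig JOIN.
joined = [b + r for b in books for r in ratings if b[0] == r[1]]

# FILTER ... BY $3 == '2002' finds nothing, because the field is '"2002"'.
print([row for row in joined if row[3] == '2002'])            # []
# Stripping the surrounding quotes first makes the filter match.
print([row for row in joined if row[3].strip('"') == '2002'])
```

If this is indeed the cause, stripping the quotes (e.g. with REPLACE, or by loading with a quote-aware loader) before filtering should fix the empty result.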

Best Answer

I created a test.csv and a test-rating.csv, along with a Pig script that uses them. It works fine.

test.csv

1;abc;author1;2002
2;xyz;author2;2003

test-rating.csv
user1;1;3
user2;2;5

Pig script:
A = LOAD 'test.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray);
describe A;
dump A;

B = LOAD 'test-rating.csv' USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray);
describe B;
dump B;

C = JOIN A BY ISBN, B BY ISBN;
describe C;
dump C;

D = FOREACH C GENERATE $0 as ISBN,$1,$2,$3;
describe D;
dump D;

E = FILTER D BY $3 == '2002';
describe E;
dump E;

Output:
A: {ISBN: chararray,BookTitle: chararray,BookAuthor: chararray,YearOfPublication: chararray}
(1,abc,author1,2002)
(2,xyz,author2,2003)
B: {user: chararray,ISBN: chararray,rating: chararray}
(user1,1,3)
(user2,2,5)
C: {A::ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray,B::user: chararray,B::ISBN: chararray,B::rating: chararray}
(1,abc,author1,2002,user1,1,3)
(2,xyz,author2,2003,user2,2,5)
D: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray}
(1,abc,author1,2002)
(2,xyz,author2,2003)
E: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray}
(1,abc,author1,2002)
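The answer's pipeline can be mirrored step for step in plain Python (a sketch for illustration, not part of the original answer) to confirm that the JOIN / GENERATE / FILTER sequence yields exactly the row shown in E:

```python
# Plain-Python mirror of the answer's Pig pipeline, using its test data.
books = [row.split(';') for row in ["1;abc;author1;2002", "2;xyz;author2;2003"]]
ratings = [row.split(';') for row in ["user1;1;3", "user2;2;5"]]

# C = JOIN A BY ISBN, B BY ISBN  (ISBN is books[0] and ratings[1])
c = [b + r for b in books for r in ratings if b[0] == r[1]]

# D = FOREACH C GENERATE $0 as ISBN, $1, $2, $3  (keep the first four fields)
d = [row[:4] for row in c]

# E = FILTER D BY $3 == '2002'
e = [row for row in d if row[3] == '2002']
print(e)  # [['1', 'abc', 'author1', '2002']]
```

Since the same script returns zero rows only on the real BX files, the difference must lie in how those files' fields are formatted rather than in the script's logic.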

Regarding "hadoop - Filtering data after a join with Pig", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43612081/
