hadoop - HADOOP/PIG-LATIN: Counting movie stars who frequently work together (Pig)

Tags: hadoop apache-pig

I have to count the stars who work together, first on a small IMDB sample and then at scale.
I only need actors who appear in movies, not in TV series.

-- Input: (actor, title, year, num, type, episode, billing, role)
raw = LOAD 'hdfs://cm:9000/uhadoop/shared/imdb/imdb-stars-example.tsv' USING PigStorage('\t') AS (actor, title, year, num, type, episode, billing, role);
-- Line 1: Filter raw to make sure type equals 'THEATRICAL_MOVIE'
movies = FILTER raw BY type == 'THEATRICAL_MOVIE';
-- Then I split stars from co-stars: every row with billing equal to 1 is the
-- movie's star, and every row with billing >= 2 is a co-star
c1 = FILTER movies BY billing == 1;
c2 = FILTER movies BY billing >= 2;
c3 = JOIN c1 BY title, c2 BY title;
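From here I need to count the pairs that appear together in movies most often, and my brain just froze. I have tried many things, but I always get errors.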
actor_coactors_freq_movies = GROUP c3 BY actor;
actor_coactors_freq_movies_count = FOREACH actor_coactors_freq_movies GENERATE COUNT($1) AS count, 
group AS actor_pair;
ordered_actor_pair_count = ORDER actor_movie_count BY count DESC;
Clearly I'm lost; I'm new to all this jazz.
Thanks for your help.

Best answer

-- Line 1: Filter raw to make sure type equals 'THEATRICAL_MOVIE'
movies = FILTER raw BY type == 'THEATRICAL_MOVIE';
-- Line 2: Generate new relations with the full movie name (title + '-' + year + '-' + num) and the actor
full_movies1 = FOREACH movies GENERATE CONCAT(title,'-',year,'-',num), actor;
full_movies2 = FOREACH movies GENERATE CONCAT(title,'-',year,'-',num), actor;
-- Line 3: Self-join on the full movie name so that actors in the same movie are paired
coactor_movies = JOIN full_movies1 BY $0, full_movies2 BY $0;
DUMP coactor_movies;
coactor_movies2 = FOREACH coactor_movies GENERATE $0 AS mv, $1 AS act1, $3 AS act2;
DUMP coactor_movies2;
-- Remove the pairs in which an actor is paired with themselves
coactor_movies3 = FILTER coactor_movies2 BY act1 != act2;
DUMP coactor_movies3;
-- Normalize the symmetric pairs
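-- Note: this statement was cut off after "FLATTEN((act1" in the source; the rest is a
-- reconstruction that orders each pair so (A, B) and (B, A) become the same tuple
coactor_movies4 = FOREACH coactor_movies3 GENERATE mv, FLATTEN((act1 < act2 ? (act1, act2) : (act2, act1))) AS (act1, act2);
DUMP coactor_movies4;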
-- Remove duplicates
ca1 = DISTINCT coactor_movies4;
DUMP ca1;
-- Keep only the actors
ca2 = FOREACH ca1 GENERATE act1, act2;
DUMP ca2;
actores = GROUP ca2 BY (act1, act2);
results = FOREACH actores GENERATE FLATTEN(group) AS (act1, act2), COUNT($1) AS count;
or_results = ORDER results BY count DESC;
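
To sanity-check the logic on a handful of rows without a cluster, the same steps can be mimicked in plain Python. This is only a sketch of the idea, not a replacement for the Pig script: the sample rows, the 'TV_SERIES' type value, and the function name count_costar_pairs are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample rows: (actor, title, year, num, type).
# 'TV_SERIES' is an assumed value standing in for any non-movie type.
rows = [
    ("Actor A", "Movie X", "2001", "I", "THEATRICAL_MOVIE"),
    ("Actor B", "Movie X", "2001", "I", "THEATRICAL_MOVIE"),
    ("Actor C", "Movie X", "2001", "I", "THEATRICAL_MOVIE"),
    ("Actor A", "Movie Y", "2005", "I", "THEATRICAL_MOVIE"),
    ("Actor B", "Movie Y", "2005", "I", "THEATRICAL_MOVIE"),
    ("Actor A", "Show Z", "2003", "I", "TV_SERIES"),
    ("Actor C", "Show Z", "2003", "I", "TV_SERIES"),
]


def count_costar_pairs(rows):
    """Return [((act1, act2), count), ...] sorted by count, descending."""
    # FILTER by type, then build the full movie key (title-year-num).
    # A set per movie plays the role of DISTINCT.
    cast = {}
    for actor, title, year, num, kind in rows:
        if kind != "THEATRICAL_MOVIE":
            continue
        cast.setdefault(f"{title}-{year}-{num}", set()).add(actor)

    # Self-join per movie: combinations over the sorted cast yields each
    # unordered pair exactly once, with act1 < act2 and no self-pairs.
    pairs = Counter()
    for actors in cast.values():
        for act1, act2 in combinations(sorted(actors), 2):
            pairs[(act1, act2)] += 1

    # GROUP BY (act1, act2) + COUNT + ORDER BY count DESC.
    return pairs.most_common()


for (act1, act2), count in count_costar_pairs(rows):
    print(act1, act2, count)
```

In this sample, Actor A and Actor B share two movies, so that pair comes out on top; the TV rows are ignored, so Actor A and Actor C are counted only once.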

For "hadoop - HADOOP/PIG-LATIN: Counting movie stars who frequently work together (Pig)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63084642/
