-- Pig program
User = LOAD 'path' USING PigStorage(',') as (id:int, reputation:int, displayname:chararray, loc:chararray, age:int);
Post = LOAD 'path' USING PigStorage(',') as (id:int, post_type:int, creationdate:chararray, score:int, viewcount:int, owneruser_id:int, title:chararray, answercount:chararray, commentcount:chararray);
a = JOIN User BY id, Post BY id;
DUMP a;
User_Group = Group a ALL;
Max_reputation = foreach User_Group Generate(User.displayname, User.reputation, Post.id), MAX(User.reputation), COUNT(Post.id);
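So basically I have two different tables, User and Post,
and I apply a JOIN to them.
Problem statement: find the display name and number of posts of the user with the highest reputation.
So basically I need the user's display name and reputation,
and also the id from Post.
I want to apply MAX(User.reputation) and COUNT(Post.id) on the join, i.e. on a.
Please help.
Which is more useful: applying the JOIN first and then performing MAX and COUNT, or
applying MAX and COUNT first and then performing the JOIN?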
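Best answer
Problem statement: find the display name and number of posts of the user with the highest reputation.
First, use the User relation to find the display name of the user with the highest reputation.
Then join with the Post relation to collect all posts by that user, group by user id, and count.
The following code should help you get there: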
User = LOAD 'path' USING PigStorage(',') AS (id:int, reputation:int, displayname:chararray, loc:chararray, age:int);
Post = LOAD 'path' USING PigStorage(',') AS (id:int, post_type:int, creationdate:chararray, score:int, viewcount:int, owneruser_id:int, title:chararray, answercount:chararray);
-- Put all users into one bag so ORDER/LIMIT picks the single highest-reputation user
-- (grouping BY id would yield one group per user and return every user).
User_grp = GROUP User ALL;
User_each = FOREACH User_grp
{
    User_order = ORDER User BY reputation DESC;
    User_limit = LIMIT User_order 1;
    User_nested = FOREACH User_limit GENERATE id, displayname;
    GENERATE FLATTEN(User_nested) AS (user_id, displayname);
};
-- Join on the post's owner id (assumed here to be owneruser_id), not on the post's own id.
User_join = JOIN User_each BY user_id, Post BY owneruser_id;
User_grouping = GROUP User_join BY user_id;
-- Aliases are case-sensitive in Pig, so the bag must be referenced as User_join.
User_output = FOREACH User_grouping GENERATE group AS user_id, MAX(User_join.displayname) AS displayname, COUNT(User_join.post_type) AS post_cnts;
DUMP User_output;
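On the question of ordering (JOIN first, or MAX and COUNT first), here is a rough sketch of the "aggregate first, then join" variant. It assumes the Post field holding the author's user id is named owneruser_id and reuses the 'path' placeholders from the question. The aliases Post_grp, Post_cnt, User_sorted, Top_user, Top_join and Result are made up for illustration. Treat it as one possible approach, not the only one.

```pig
User = LOAD 'path' USING PigStorage(',') AS (id:int, reputation:int, displayname:chararray, loc:chararray, age:int);
Post = LOAD 'path' USING PigStorage(',') AS (id:int, post_type:int, creationdate:chararray, score:int, viewcount:int, owneruser_id:int, title:chararray, answercount:chararray);

-- Count posts per author first, so the join input has one row per user.
-- owneruser_id is an assumed field name for the post's author.
Post_grp = GROUP Post BY owneruser_id;
Post_cnt = FOREACH Post_grp GENERATE group AS user_id, COUNT(Post) AS post_cnt;

-- Pick a single user with the highest reputation (ties are broken arbitrarily).
User_sorted = ORDER User BY reputation DESC;
Top_user = LIMIT User_sorted 1;

-- Left outer join keeps the top user even when that user has no posts.
Top_join = JOIN Top_user BY id LEFT OUTER, Post_cnt BY user_id;
Result = FOREACH Top_join GENERATE
    Top_user::displayname AS displayname,
    Top_user::reputation AS reputation,
    (Post_cnt::post_cnt IS NULL ? 0L : Post_cnt::post_cnt) AS post_cnt;
DUMP Result;
```

Aggregating before the join means the join only sees one row per user on each side, which is usually cheaper than joining every post and grouping afterwards. The left outer join plus the null check reports 0 posts instead of dropping a top user who never posted.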
Regarding "hadoop - How to use the MAX and COUNT functions together on two different tables using JOIN?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/37500482/