hadoop - Counting and grouping in Hive

I have a table in Hive as follows,

table1

Cola   | Colb  |  Colc |  Cold  |
---------------------------------
...etc
efo18    691     123     5692
efo18    691     345     5657
...etc
fsx31    950     291     23456
fsx31    950     404     23456
fsx31    950     343     23456
fsx31    950     182     23456
fsx31    950     120     45042
fsx31    950     161     23456
...etc
klz57    490     121     3330
klz57    490     113     3330
klz57    490     308     3330
klz57    490     411     3330
klz57    490     161     3330
klz57    386     108     3330
klz57    490     113     3330
klz57    490     125     3330
klz57    490     165     3330
klz57    490     166     3330
...etc
---------------------------------

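From the data in table1 I want another table in which rows with the same Cold value form a group; within that group, rows with the same Colb form a sub-group; and within that sub-group, rows with the same Cola are grouped together. In other words, every unique combination of Cola, Colb, Cold is one row, and repeated rows are counted up.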

create table table2 (Col1 string, Col2 string, Col3 string, Count int);
insert into table table2 select cola,colb,cold,count(*) from table1 group by cold,colb,cola;

I expected,

Col1   | Col2  |  Col3     |  Count  |
-------------------------------------
efo18    691     5692         1
efo18    691     5657         1
fsx31    950     23456        5   <-----1
fsx31    950     45042        1   <-----1
klz57    490     1234         9   <-----2
klz57    386     1234         1   <-----2
--------------------------------------

I got,

table2

Col1   | Col2  |  Col3     |  Count  |
-------------------------------------
efo18    691     5692         1
efo18    691     5657         1
fsx31    950     23456        4   <-----1
fsx31    950     25456        1   <-----1
fsx31    950     45042        1   <-----1
klz57    490     1234         8   <-----2
klz57    386     1234         1   <-----2
klz57    490     1234         1   <-----2
--------------------------------------

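What I don't understand: I am grouping first by Cold, then by Colb, then by Cola, so why do the rows marked (<-----1) end up as separate rows with separate counts even though they all belong to the same group? Colc differs between those rows, but I am not using it in the group by, so how are the rows different? The same goes for the rows marked (<-----2): what is the problem here?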

Update:

Binary01, I am trying the example you gave

hive> select * from xyz;
OK
x        y       z      zz
xxx     111     222     123 NULL    NULL    NULL
xxx     111     222     123 NULL    NULL    NULL
xxx     101     222     123 NULL    NULL    NULL
xux     111     422     123 NULL    NULL    NULL
xxx     111     522     323 NULL    NULL    NULL
xyx     111     622     123 NULL    NULL    NULL
xxx     115     322     123 NULL    NULL    NULL
xxx     111     122     123 NULL    NULL    NULL
xxx     111     223     123 NULL    NULL    NULL
xxy     111     212     143 NULL    NULL    NULL
xxx     117     222     123 NULL    NULL    NULL

What are those NULL values doing there? I copy-pasted your example line by line. I even created the table as

create table xyz(x string ,y string, z string , zz string) 
row format delimited fields terminated by ',';

The final query gives,

hive> select * from xyztemp;
OK
xux     111     422     123 NULL    NULL    1
xxx     101     222     123 NULL    NULL    1
xxx     111     122     123 NULL    NULL    1
xxx     111     222     123 NULL    NULL    2
xxx     111     223     123 NULL    NULL    1
xxx     111     522     323 NULL    NULL    1
xxx     115     322     123 NULL    NULL    1
xxx     117     222     123 NULL    NULL    1
xxy     111     212     143 NULL    NULL    1
xyx     111     622     123 NULL    NULL    1

Best answer

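You are definitely missing something. I tried the following data, which is similar to your table, and confirmed that the output is exactly what you expect.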

hive> set hive.cli.print.header=true;
hive> create table xyz(x string ,y string, z string , zz string) row format delimited fields terminated by ',';
hive> load data local inpath '/home/brdev/sudeep/testdata.txt' into table xyz;
hive> select * from xyz;
OK
x       y       z       zz
xxx     111     222     123
xxx     111     222     123
xxx     101     222     123
xux     111     422     123
xxx     111     522     323
xyx     111     622     123
xxx     115     322     123
xxx     111     122     123
xxx     111     223     123
xxy     111     212     143
xxx     117     222     123
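
The four clean columns above depend on testdata.txt being comma-separated, so that it matches `fields terminated by ','`. When the file uses a different separator, Hive finds no delimiter in the line, puts the whole line into the first column, and fills the remaining columns with NULL, which looks very much like the output shown in the update. A minimal sketch of how to line the two up, assuming the pasted file was tab-separated (a guess; the thread does not say) and using a hypothetical table name `xyz_tab` and a placeholder path:

```sql
-- Assumption: the data file is tab-separated rather than comma-separated.
-- xyz_tab and the path below are hypothetical names used for illustration.
create table xyz_tab (x string, y string, z string, zz string)
row format delimited fields terminated by '\t';

load data local inpath '/path/to/testdata.txt' into table xyz_tab;

-- Sanity check: count(y) counts only non-NULL values, so if rows_with_y
-- is 0 while total_rows is not, the declared delimiter still does not
-- match the file.
select count(*) as total_rows,
       count(y) as rows_with_y
from xyz_tab;
```

If the file is separated by runs of spaces instead of tabs, the simpler fix is probably to rewrite the file with commas and keep the original table definition.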

hive> create table xyztemp ( aa string,bb string,cc string , dd int);
hive> insert into table xyztemp select x,y,zz,count(*) from xyz group by zz,y,x;
hive> select * from xyztemp;
OK
aa      bb      cc      dd
xxx     101     123     1
xux     111     123     1
xxx     111     123     4
xyx     111     123     1
xxx     115     123     1
xxx     117     123     1
xxy     111     143     1
xxx     111     323     1

I think the above is the expected output you are looking for.
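
The thread does not say what was missed in table1. One common reason for rows that look identical to land in separate groups is an invisible difference in the string values, such as a trailing space; this is only a guess about the data, not something confirmed here. The following sketch, which uses the column names from the question and assumes the columns are strings, first makes such differences visible and then groups on trimmed values:

```sql
-- Show each distinct cold value in brackets together with its length,
-- so that stray leading or trailing spaces become visible.
select concat('[', cold, ']') as cold_shown,
       length(cold)           as cold_len,
       count(*)               as n
from table1
group by cold;

-- Group on trimmed values so that 'fsx31' and 'fsx31 ' fall into one group.
-- trim() removes spaces only; other invisible characters need other handling.
select trim(cola) as col1,
       trim(colb) as col2,
       trim(cold) as col3,
       count(*)   as cnt
from table1
group by trim(cola), trim(colb), trim(cold);
```

The actual output in the question also lists 25456 next to 23456 for fsx31. If that value really is in the data, it is a different Cold value and correctly forms its own group.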

This question, "hadoop - Counting and grouping in Hive", comes from Stack Overflow: https://stackoverflow.com/questions/17523680/
