Remove duplicates on the combination of name and age, and print the result using spark-sql
Name Age Location
Rajesh 21 London
Suresh 28 California
Sam 26 Delhi
Rajesh 21 Gurgaon
Manish 29 Bengaluru
CREATE TABLE DETAILS
(
NAME STRING,
AGE INT,
LOCATION STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
======================================================================
LOAD DATA INPATH '/FOLDER/TO/question.txt' INTO TABLE DETAILS;
======================================================================
CREATE TABLE DETAILS_FILTERED AS
SELECT NAME,AGE,LOCATION FROM DETAILS GROUP BY NAME,AGE;
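Best answer
Use row_number or a min/max aggregation. It would be better if you had a column such as a timestamp to pick the latest/first record for each name + age combination. In that case, you can include it in the order by clause of row_number.
Hive example: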
select Name,Age,Location
from
(
select Name,Age,Location,
row_number() over(partition by NAME,AGE order by Location) rn --order by makes function more deterministic
from details
) s --Hive requires an alias for a subquery in the FROM clause
where rn=1 --filter duplicates
or
select Name,Age,max(Location) Location
from details
group by Name,Age --aggregate
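If a timestamp column is available, the row_number approach can order by it to keep the latest record per name + age. The sketch below is only an illustration under stated assumptions: the UPDATED_AT column and its dates are hypothetical (the original data has no such column), and Python's built-in sqlite3 module stands in for Hive/Spark SQL so the query can be run locally. The select statement itself uses only standard window-function syntax and should carry over largely unchanged.

```python
import sqlite3

# Sample rows from the question, plus a HYPOTHETICAL updated_at value per row
# (not in the original data) so that "latest record" is well defined.
ROWS = [
    ("Rajesh", 21, "London", "2019-01-01"),
    ("Suresh", 28, "California", "2019-01-02"),
    ("Sam", 26, "Delhi", "2019-01-03"),
    ("Rajesh", 21, "Gurgaon", "2019-01-04"),
    ("Manish", 29, "Bengaluru", "2019-01-05"),
]

# Keep one row per (name, age): the one with the most recent updated_at.
DEDUP_SQL = """
select name, age, location
from
(
  select name, age, location,
         row_number() over (partition by name, age
                            order by updated_at desc) as rn
  from details
) s
where rn = 1
order by name
"""


def latest_per_name_age(rows):
    """Load rows into an in-memory table and return the deduplicated result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table details "
            "(name text, age integer, location text, updated_at text)"
        )
        conn.executemany("insert into details values (?, ?, ?, ?)", rows)
        return conn.execute(DEDUP_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in latest_per_name_age(ROWS):
        print(row)
```

With these made-up dates, the Gurgaon row is the one kept for Rajesh, 21, because it has the later timestamp. Changing desc to asc would keep the first record instead.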
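Regarding mysql - "From the sample data given below, remove duplicates on the combination of name and age and print the result", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57770033/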