Base_table (Day 01 load from source)
**Id Name City Country**
7682 Stuart Frankfurt Germany
8723 Micke Paris France
2355 Niki New york USA
2097 Deny Italy Rome
new_table (Day 02 load from source)
**Id Name City Country**
7682 Stuart *Darmstadt* Germany
8723 Micke Paris France
2355 Niki New york USA
*9057 Bony Prague Prague*
比较以上 2 个表格时,可以看到以下 3 个变化。
- Record Id 7682's City name changed to Darmstadt in Day 02 load
- Record Id 2097 is deleted in Day 02 load and it was present in Day 01 load
- New Record inserted with Id 9057 on Day 02 load
需要捕获所有上述 3 个更改并将其附加到 Base_table
Below 3 records should be captured out of comparision
7682 Stuart Darmstadt Germany
2097 Deny Italy Rome
9057 Bony Prague Prague
Base_table output after appending Day 02 changes
**Id Name City Country**
7682 Stuart Frankfurt Germany
8723 Micke Paris France
2355 Niki New york USA
2097 Deny Italy Rome
*7682 Stuart Darmstadt Germany*
*2097 Deny Italy Rome*
*9057 Bony Prague Prague*
我能够使用 SQL 连接获取插入和删除的记录,但无法获取更新的记录。为了获取更新的记录,我在本地将文件复制到 linux 并进行比较,但它不适用于大量数据。任何人都可以分享您处理此类情况的经验吗?
最佳答案
select inline
(
array
(
case
when n.id is null then struct(b.*)
else struct (n.*)
end
)
)
from base_table as b
full join new_table as n
on n.id = b.id
where b.id is null
or n.id is null
or struct(b.*) not in (struct(n.*))
+------+--------+-----------+---------+
| col1 | col2 | col3 | col4 |
+------+--------+-----------+---------+
| 2097 | Deny | Italy | Rome |
| 7682 | Stuart | Darmstadt | Germany |
| 9057 | Bony | Prague | Prague |
+------+--------+-----------+---------+
关于hadoop - 比较 2 个配置单元表以查找没有任何唯一列/时间戳的更新/插入/删除记录并将其附加到 Hadoop 中的基表,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/43802507/