hadoop - Compare 2 Hive tables to find updated/inserted/deleted records without any unique column/timestamp and append them to the base table in Hadoop

Base_table (Day 01 load from source)

 **Id    Name    City      Country** 
 7682   Stuart  Frankfurt   Germany
 8723   Micke   Paris       France
 2355   Niki    New york    USA
 2097   Deny    Italy       Rome

new_table (Day 02 load from source)

 **Id    Name    City      Country** 
 7682   Stuart  *Darmstadt*  Germany
 8723   Micke   Paris       France
 2355   Niki    New york    USA
 *9057  Bony    Prague      Prague*

Comparing the 2 tables above shows the following 3 changes.

Record Id 7682's City name changed to Darmstadt in Day 02 load

Record Id 2097 is deleted in Day 02 load and it was present in Day 01 load

New Record inserted with Id 9057 on Day 02 load

All 3 of the changes above need to be captured and appended to Base_table.

The following 3 records should be captured by the comparison

7682   Stuart  Darmstadt   Germany
2097   Deny    Italy       Rome
9057   Bony    Prague      Prague

Base_table output after appending Day 02 changes

**Id    Name    City      Country** 
 7682   Stuart  Frankfurt   Germany
 8723   Micke   Paris       France
 2355   Niki    New york    USA
 2097   Deny    Italy       Rome
*7682   Stuart  Darmstadt   Germany*   
*2097   Deny    Italy       Rome*
*9057   Bony    Prague      Prague*

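I was able to get the inserted and deleted records using SQL joins, but not the updated records. To get the updated records, I copied the files to a local Linux machine and compared them there, but that does not work for large volumes of data. Can anyone share their experience handling this kind of scenario?

Best answer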

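-- Full outer join on id; keep only the rows that differ between the two loads
select      inline
            (
                array
                (
                    case 
                        when n.id is null then struct(b.*)  -- deleted: row exists only in base_table
                        else struct(n.*)                    -- inserted or updated: take the new row
                    end
                )
            )

from                    base_table  as b
            full join   new_table   as n
            on          n.id = b.id

where       b.id is null                        -- inserted in the Day 02 load
        or  n.id is null                        -- deleted in the Day 02 load
        or  struct(b.*) not in (struct(n.*))    -- updated: at least one column differs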

+------+--------+-----------+---------+
| col1 |  col2  |   col3    |  col4   |
+------+--------+-----------+---------+
| 2097 | Deny   | Italy     | Rome    |
| 7682 | Stuart | Darmstadt | Germany |
| 9057 | Bony   | Prague    | Prague  |
+------+--------+-----------+---------+
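
To make the join logic easier to check against the sample data, here is a minimal sketch of the same comparison in plain Python. It is not part of the original answer and is not a replacement for the Hive query on large data. It assumes, as the query does by joining on `id`, that `id` identifies a record within each load. The names `Row` and `changed_rows` are hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# (id, name, city, country), matching the columns of the sample tables
Row = Tuple[int, str, str, str]


def changed_rows(base: List[Row], new: List[Row]) -> List[Row]:
    """Emulate the full outer join on id and keep only differing rows."""
    base_by_id: Dict[int, Row] = {row[0]: row for row in base}
    new_by_id: Dict[int, Row] = {row[0]: row for row in new}

    result: List[Row] = []
    # The union of ids plays the role of the full outer join.
    for key in sorted(base_by_id.keys() | new_by_id.keys()):
        b: Optional[Row] = base_by_id.get(key)
        n: Optional[Row] = new_by_id.get(key)
        if n is None:
            result.append(b)   # deleted: present only in the base load
        elif b is None or b != n:
            result.append(n)   # inserted, or updated in some column
    return result


base_table: List[Row] = [
    (7682, "Stuart", "Frankfurt", "Germany"),
    (8723, "Micke", "Paris", "France"),
    (2355, "Niki", "New york", "USA"),
    (2097, "Deny", "Italy", "Rome"),
]
new_table: List[Row] = [
    (7682, "Stuart", "Darmstadt", "Germany"),
    (8723, "Micke", "Paris", "France"),
    (2355, "Niki", "New york", "USA"),
    (9057, "Bony", "Prague", "Prague"),
]

delta = changed_rows(base_table, new_table)
appended = base_table + delta  # base table after appending the Day 02 changes

for row in delta:
    print(row)
```

On the sample data this yields the rows for ids 2097, 7682 and 9057, the same three records returned by the query, and appending them gives the 7-row Base_table shown in the question.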

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/43802507/
