java - Fastest way to compare millions of rows in one table with millions of rows in another

Tags: java database oracle performance optimization

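I want to compare two tables, each with millions of records, and get the matching data from the comparison.

A pair of rows matches when all three conditions hold: the name in table1 is not equal to the name in table2, the city in table1 is equal to the city in table2, and date_of_birth in table1 is within ±1 year of date_of_birth in table2.

A single row in table 1 can match multiple rows in table 2. In addition, each match needs a unique record ID, and all matches for the same table 1 row must share the same unique record ID.

I tried both Java code and a PL/SQL procedure, but each takes hours because it compares millions of rows against millions of rows. Is there a faster way to do this kind of matching?

Best answer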

"I tried using java by storing data from both tables in list via jdbc connection and then iterating one list with the other. But it was very slow and took many hours to complete, even got time out exception many times."

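Congratulations. This is the first step on the road to enlightenment. Databases are much better than Java at processing data. Java is a fine general-purpose programming language, but databases are optimized for relational data processing: they simply do it faster, using less CPU, less memory, and less network traffic.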

"I also created an sql procedure for the same, it was somewhat faster than java program but still took a lot of time (couple of hours) to complete."

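You are on the verge of the second step of enlightenment: row-by-row processing (i.e. procedural iteration) is slow. SQL is a set-based paradigm. Processing sets is much faster.

To give specific advice we would need some details about what you are actually doing, but as an example, this query gives you the set of rows that match on these columns in both tables: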
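As a rough illustration of the difference, here is a sketch that copies rows into a staging table two ways. The table `match_candidates` and the column `id` are hypothetical names invented for this example; they do not appear in the question.

```sql
-- Row-by-row (slow): one INSERT is executed for every row fetched by the loop.
-- match_candidates and huge_table_1.id are assumed names, for illustration only.
begin
  for r in (select id, city from huge_table_1) loop
    insert into match_candidates (ht1_id, city)
    values (r.id, r.city);
  end loop;
end;
/

-- Set-based (fast): a single statement that the database processes as one set.
insert into match_candidates (ht1_id, city)
select id, city
from huge_table_1
/
```

Both forms leave the same rows in the staging table; the second simply hands the whole job to the SQL engine in one statement instead of switching between PL/SQL and SQL once per row.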

select col1, col2, col3
from huge_table_1
INTERSECT
select col1, col2, col3
from huge_table_2

The MINUS operator gives you the rows in huge_table_1 that are not in huge_table_2. Flip the tables to get the obverse set.

select col1, col2, col3
from huge_table_1
MINUS
select col1, col2, col3
from huge_table_2

Embrace the joy of sets!

"we are first comparing the name in huge_table_1 should not be equal to name in huge_table_2. Then we are comparing city in huge_table_1 should be equal to city in huge_table_2 and then finally we are comparing date_of_birth in huge_table_1 should be within +-1 year range of date_of_birth in huge_table_2"

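Hmmm. Starting off with an inequality is usually bad, especially in large tables: most pairs of rows will have different names, so that condition filters out very little on its own. But you could try something like this: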

select * from huge_table_1 ht1
where exists
      ( select null from huge_table_2 ht2
        where ht2.city = ht1.city
        and ht1.date_of_birth between add_months(ht2.date_of_birth, -12) 
                                  and add_months(ht2.date_of_birth, 12) 
        and ht2.name != ht1.name)
/
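The EXISTS query returns only the huge_table_1 rows that have at least one match. The question also asks for the matching huge_table_2 rows and for an ID shared by all matches of the same huge_table_1 row. One possible sketch of that, assuming both tables have a unique key column called `id` (an assumption; the question does not describe the keys), is a join with an analytic DENSE_RANK:

```sql
-- Assumption: huge_table_1.id and huge_table_2.id are unique keys (hypothetical column names).
select dense_rank() over (order by ht1.id) as match_id,  -- same value for every match of one huge_table_1 row
       ht1.id   as ht1_id,
       ht1.name as ht1_name,
       ht2.id   as ht2_id,
       ht2.name as ht2_name,
       ht1.city
from huge_table_1 ht1
join huge_table_2 ht2
  on  ht2.city = ht1.city
  and ht1.date_of_birth between add_months(ht2.date_of_birth, -12)
                            and add_months(ht2.date_of_birth, 12)
  and ht2.name != ht1.name
order by match_id, ht2_id
/
```

If huge_table_1 already has a unique key, that key can serve directly as the shared record ID and the DENSE_RANK column is optional. Note that a row whose name is NULL never satisfies `!=`, so such rows would not appear in the result.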

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/43406210/
