python - Why are these hashes of identical values different between different Pandas DataFrames?

Tags: python pandas dataframe

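When I hash the same email address in two DataFrames, I get different hash values.

The two DataFrames, df1 and df2, each contain a column of email addresses that needs to be hashed so that the hashes can be compared after an inner join, as shown below: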

import pandas as pd

### Boring part to import the data ###

# define table 1 as df1
df1 = pd.DataFrame([
    [2, 'random.email@fjjd.com'],
    [6, 'different.email@u8888ueend.dskjs'],
    [7, 'random.email@dsju.c'],
    [8, 'same.email@dkicj.c'],
    [200, 'different.email@cjhs.oo'],
    [18, 'random.email@siidjd.dd'],
    [19, 'random.email@dsjds.j'],
])
df1 = df1.set_axis(['ID1', 'email 1'], axis=1)

# define table 2 as df2
df2 = pd.DataFrame([
    [100, 'new.email@jsscd.d'],
    [6, 'different.email@uueend.dskjs'],
    [99, 'new.email@djjsd.conm'],
    [10, 'new.email@jhhs.co'],
    [115, 'new.email@dsjjdsds.cod'],
    [116, 'new.email@dsjkjds.ckk'],
    [8, 'same.email@dkicj.c'],
    [200, 'different.email@jdsjd.co'],
])
df2 = df2.set_axis(['ID2', 'email 2'], axis=1)

### End part to import the data ###

### Fun part now... ###

# hash the emails in each row of df1?
df1['hash 1'] = pd.util.hash_pandas_object(df1['email 1'].astype(str))  

# hash the emails in each row of df2?
df2['hash 2'] = pd.util.hash_pandas_object(df2['email 2'].astype(str)) 

# perform an inner join of df1 and df2 about their IDs, ID1 and ID2 respectively
df3 = pd.merge(df1, df2, how='inner', left_on='ID1', right_on='ID2') 

# add an email comparison column
df3['same email'] = df3['email 1'] == df3['email 2']

# add a hash comparison column
df3['same hash'] = df3['hash 1'] == df3['hash 2']

# print the table...
print(df3)
 

The result shows that although the email addresses in row 1 are identical (as far as I can tell), their hashes are different:

   ID1                           email 1                hash 1  ID2                       email 2                hash 2  same email  same hash
0    6  different.email@u8888ueend.dskjs  18381560226251184406    6  different.email@uueend.dskjs  16113553761483526335       False      False
1    8                same.email@dkicj.c   5780217243550696535    8            same.email@dkicj.c   6939369575697951555        True      False
2  200           different.email@cjhs.oo  13252009090739560311  200      different.email@jdsjd.co   1942861278265138167       False      False

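Why do the hashes of the same email address differ when the address comes from different DataFrames?

Best answer

According to the documentation, pd.util.hash_pandas_object includes the index in the hash computation by default (index=True). So when two identical emails sit at different index labels, their hashes differ.

You can try: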

df1["hash 1"] = pd.util.hash_pandas_object(df1["email 1"].astype(str), index=False)
df2["hash 2"] = pd.util.hash_pandas_object(df2["email 2"].astype(str), index=False)

The result will then be:

   ID1                           email 1               hash 1  ID2                       email 2                hash 2
0    6  different.email@u8888ueend.dskjs  5185970979410096600    6  different.email@uueend.dskjs  18338061231746973003
1    8                same.email@dkicj.c  9881121729072933860    8            same.email@dkicj.c   9881121729072933860
2  200           different.email@cjhs.oo   742268446511091656  200      different.email@jdsjd.co    775994242592712264
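To see the effect of the index in isolation, here is a minimal sketch that hashes the same string stored at two different index labels. The labels 3 and 6 mirror the positions of the shared address in df1 and df2 from the question; the variable names are only illustrative.

```python
import pandas as pd

# The same email stored under two different index labels
# (3 and 6 are the row positions it has in df1 and df2).
s1 = pd.Series(["same.email@dkicj.c"], index=[3])
s2 = pd.Series(["same.email@dkicj.c"], index=[6])

# Default behaviour (index=True): the index label is mixed into each row's hash.
with_index_1 = pd.util.hash_pandas_object(s1)
with_index_2 = pd.util.hash_pandas_object(s2)
print("index=True, hashes equal: ", with_index_1.iloc[0] == with_index_2.iloc[0])

# index=False: only the values are hashed, so equal values give equal hashes.
no_index_1 = pd.util.hash_pandas_object(s1, index=False)
no_index_2 = pd.util.hash_pandas_object(s2, index=False)
print("index=False, hashes equal:", no_index_1.iloc[0] == no_index_2.iloc[0])
```

With index=False the two hashes should match regardless of where the row sits, which is what makes them comparable after the merge.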

Another way to compute the hashes is to use the built-in hash function:

df1["hash 1"] = df1["email 1"].apply(hash)
df2["hash 2"] = df2["email 2"].apply(hash)
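One caveat worth keeping in mind: Python's built-in hash for strings is salted per interpreter process by default (see PYTHONHASHSEED), so the values agree within a single run but generally change between runs. If the hashes need to be stored or compared across runs, a cryptographic digest is one option. The sketch below assumes SHA-256 is acceptable; stable_hash is a hypothetical helper name, not part of pandas or the original answer.

```python
import hashlib

import pandas as pd


def stable_hash(value: str) -> str:
    """Return the SHA-256 hex digest of a string (hypothetical helper)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Two sample addresses taken from the question's data.
emails = pd.Series(["same.email@dkicj.c", "different.email@cjhs.oo"])

# The digest depends only on the string contents, not on the index
# or on the interpreter's hash seed.
hashed = emails.apply(stable_hash)
print(hashed)
```

The same helper could be applied to df1["email 1"] and df2["email 2"] in place of hash.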

Regarding "python - Why are these hashes of identical values different between different Pandas DataFrames?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76936567/
