python - Does Pandas affect the results of Rapidfuzz matching?

Tags: python pandas

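I've hit a wall on this. Does Rapidfuzz give a different string similarity score when I run it inside a pandas DataFrame than when I run it on its own? Why do 'Address Similarity 2' and the result printed on the last line differ?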

from rapidfuzz import process, utils, fuzz
import pandas as pd
import numpy as np

address_a = 'high new technology development zones huainan city anhui province china anhui anhui any city'
address_b = 'industrial park of funan city'

test_anui_data = {'Processed Client Name': ['anhui jinhan clothing co ltd'], 'Processed Aruvio Name': ['anhui jinhan clothing co ltd'], 'Processed Client Address': [address_a], 'Processed Aruvio Address': [address_b],  'Name Similarity': [89.2857142857142],  'Address Similarity': [np.nan]}  
  
# Create DataFrame  
test_anui = pd.DataFrame(test_anui_data)  
test_anui

test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui['Processed Client Address']), str(test_anui['Processed Aruvio Address']))
print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))

Best answer

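The discrepancy arises because you pass the entire column to the fuzzy matcher: str() applied to a whole column returns the printed form of the Series, not the address string stored in it. If you apply the matcher to a single row instead, as below, you get the same result: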

test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.at[0,'Processed Client Address']), str(test_anui.at[0,'Processed Aruvio Address']))

print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))

Alternatively, use .loc:

test_anui= test_anui[(test_anui['Address Similarity'].isnull()) & (test_anui['Address Similarity']!='')]
test_anui['Address Similarity 2'] = fuzz.token_sort_ratio(str(test_anui.loc[0,'Processed Client Address']), str(test_anui.loc[0,'Processed Aruvio Address']))

print('the address similarity is different? ', fuzz.token_sort_ratio(address_a, address_b))

The output in the DataFrame is:

    Processed Client Name         Processed Aruvio Name  \
0  anhui jinhan clothing co ltd  anhui jinhan clothing co ltd   

                            Processed Client Address  \
0  high new technology development zones huainan ...   

        Processed Aruvio Address  Name Similarity  Address Similarity  \
0  industrial park of funan city        89.285714                 NaN   

   Address Similarity 2  
0             28.099174  

The value of fuzz.token_sort_ratio(address_a, address_b) is 28.099173553719012, which matches the DataFrame column.
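
To see why passing a whole column changes the score, it may help to look at what str() actually returns for a Series. The following is a minimal sketch, assuming a one-row frame called df (a hypothetical name used only for this illustration) that holds the client address from the question:

```python
import pandas as pd

address_a = 'high new technology development zones huainan city anhui province china anhui anhui any city'

# Hypothetical one-row frame mirroring the address column from the question.
df = pd.DataFrame({'Processed Client Address': [address_a]})

# str() on a whole column gives the printed form of the Series:
# the index label, the (possibly shortened) value, and a trailing
# "Name: ..." line. This is the text the matcher ends up comparing.
as_series_text = str(df['Processed Client Address'])

# Selecting a single cell gives back the plain Python string.
as_cell_text = df.loc[0, 'Processed Client Address']

print(repr(as_series_text))
print(repr(as_cell_text))
```

Comparing the two printed values shows that only the single-cell lookup hands the original address to the scorer.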

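In other words, you need to specify which row the strings come from. I assume your DataFrame has multiple rows, which means you have to do this for every row: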

# Iterate over the index labels and write each score into its own row.
for i in test_anui.index:
    test_anui.loc[i, 'Address Similarity 2'] = fuzz.token_sort_ratio(
        str(test_anui.loc[i, 'Processed Client Address']),
        str(test_anui.loc[i, 'Processed Aruvio Address']))

Source: the corresponding question on Stack Overflow, "Does Pandas affect the results of Rapidfuzz matching?": https://stackoverflow.com/questions/68570948/
