I have two data frames:
df1
district  place  year  votes  candidate
1         1      2000  25     bond james
1         1      2000  30     smith john peter
1         1      2000  10     caprio leonardo di
1         1      2001  5      bond james
df2
district  place  year  money  candidate
1         1      2000  500    bond james
1         1      2000  100    di caprio leonardo
1         1      2000  10     smith j.peter
1         1      2001  90     bond james
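I want to merge the two data frames. The "district", "place", and "year" columns match exactly, but the "candidate" column does not, because the names are not always written the same way. I tried the following code: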
import pandas as pd
import fuzzy_pandas as fpd

# Cast every join column to string in both frames
df1["district"] = df1["district"].astype(str)
df1["place"] = df1["place"].astype(str)
df1["year"] = df1["year"].astype(str)
df1["candidate"] = df1["candidate"].astype(str)
df2["district"] = df2["district"].astype(str)
df2["place"] = df2["place"].astype(str)
df2["year"] = df2["year"].astype(str)
df2["candidate"] = df2["candidate"].astype(str)

# All four columns are passed to the fuzzy comparison here
data = fpd.fuzzy_merge(df1, df2,
                       left_on=['district', 'place', 'year', 'candidate'],
                       right_on=['district', 'place', 'year', 'candidate'],
                       method='levenshtein',
                       threshold=0.6,
                       join='left-outer')
This is the result I want:
district  place  year  votes  candidate           money
1         1      2000  25     bond james          500
1         1      2000  30     smith john peter    10
1         1      2000  10     caprio leonardo di  100
1         1      2001  5      bond james          90
However, I sometimes get false matches because of the "district", "place", or "year" columns. How should I correct my code?
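Best answer
One possible solution is to use difflib.get_close_matches to find, for each candidate in df2, the closest name in df1's candidate column, and then merge the two data frames on all four columns: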
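import difflib

# Replace each df2 name with its closest df1 name.
# Note: [0] raises IndexError if no name passes the default cutoff of 0.6.
df2.candidate = df2.candidate.map(
    lambda x: difflib.get_close_matches(x, df1.candidate)[0])

# The names now agree, so an exact merge on all four columns works
df1.merge(df2, on=['district', 'place', 'year', 'candidate'])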
Output:
   district  place  year  votes  candidate           money
0  1         1      2000  25     bond james          500
1  1         1      2000  30     smith john peter    10
2  1         1      2000  10     caprio leonardo di  100
3  1         1      2001  5      bond james          90
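The snippet above compares each df2 name against every name in df1, regardless of district, place, or year. With larger data it may be safer to search only among the df1 names that share the same exact key. The following is a minimal sketch of that variant, not a definitive implementation. KEYS, names_by_key, and closest_name are illustrative names that do not appear in the question, and the 0.6 cutoff is simply difflib's default:

```python
import difflib
import pandas as pd

# Columns that must match exactly (illustrative constant).
KEYS = ["district", "place", "year"]

# Sample data copied from the question.
df1 = pd.DataFrame({
    "district": [1, 1, 1, 1],
    "place": [1, 1, 1, 1],
    "year": [2000, 2000, 2000, 2001],
    "votes": [25, 30, 10, 5],
    "candidate": ["bond james", "smith john peter",
                  "caprio leonardo di", "bond james"],
})
df2 = pd.DataFrame({
    "district": [1, 1, 1, 1],
    "place": [1, 1, 1, 1],
    "year": [2000, 2000, 2000, 2001],
    "money": [500, 100, 10, 90],
    "candidate": ["bond james", "di caprio leonardo",
                  "smith j.peter", "bond james"],
})

# df1 candidate names available for each exact (district, place, year) key.
names_by_key = df1.groupby(KEYS)["candidate"].agg(list).to_dict()


def closest_name(row, cutoff=0.6):
    """Return the most similar df1 name within the same key group, or None."""
    pool = names_by_key.get((row["district"], row["place"], row["year"]), [])
    hits = difflib.get_close_matches(row["candidate"], pool, n=1, cutoff=cutoff)
    return hits[0] if hits else None


# Rewrite df2 names to their df1 counterparts, then merge exactly on all columns.
matched = df2.assign(candidate=df2.apply(closest_name, axis=1))
result = df1.merge(matched, on=KEYS + ["candidate"], how="left")
print(result)
```

Rows of df2 with no sufficiently similar name in their key group get None as candidate and therefore do not join to any df1 row, instead of raising IndexError. If two df2 names in the same group map to the same df1 name, the merge will produce duplicate rows, so that case is worth checking on real data.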
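This question about merging two data frames in Python (fuzzy matching, where some columns match exactly and others do not) comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/77201460/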