Python: merging two dataframes (fuzzy matching, where some columns match exactly and others do not)

I have two dataframes:

df1

district place year votes candidate
1        1     2000 25    bond james
1        1     2000 30    smith john peter
1        1     2000 10    caprio leonardo di
1        1     2001 5     bond james

df2

district place year money candidate
1        1     2000 500   bond james
1        1     2000 100   di caprio leonardo
1        1     2000 10    smith j.peter
1        1     2001 90    bond james

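I want to match the two dataframes. The district, place, and year columns match exactly, but the candidate column does not: the names are not always written identically. I tried the following code: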

import pandas as pd
import fuzzy_pandas as fpd

df1["district"] = df1["district"].astype(str)
df1["place"] = df1["place"].astype(str)
df1["year"] = df1["year"].astype(str)
df1["candidate"] = df1["candidate"].astype(str)
df2["district"] = df2["district"].astype(str)
df2["place"] = df2["place"].astype(str)
df2["year"] = df2["year"].astype(str)
df2["candidate"] = df2["candidate"].astype(str)

data = fpd.fuzzy_merge(df1, df2,
                       left_on=['district', 'place', 'year', 'candidate'],
                       right_on=['district', 'place', 'year', 'candidate'],
                       method='levenshtein',
                       threshold=0.6,
                       join='left-outer')

This is the result I want to obtain:

district place year votes candidate           money
1        1     2000 25    bond james          500
1        1     2000 30    smith john peter    10
1        1     2000 10    caprio leonardo di  100
1        1     2001 5     bond james          90

However, this sometimes produces wrong matches because of the district, place, or year columns. How should I correct my code?

Best answer

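One possible solution is to use difflib.get_close_matches to find, for each name in the candidate column of df2, the closest match in the candidate column of df1, and then merge the two dataframes: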

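import difflib

# Replace each df2 name with its closest match among the df1 names.
# Note: [0] raises IndexError if no df1 name is similar enough (default cutoff is 0.6).
df2.candidate = df2.candidate.map(
    lambda x: difflib.get_close_matches(x, df1.candidate)[0])

# With the names aligned, an exact merge on all four columns works.
df1.merge(df2, on=['district', 'place', 'year', 'candidate'])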

Output:

   district  place  year  votes           candidate  money
0         1      1  2000     25          bond james    500
1         1      1  2000     30    smith john peter     10
2         1      1  2000     10  caprio leonardo di    100
3         1      1  2001      5          bond james     90
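
The snippet above searches all of df1's names at once, so a name could in principle be matched to a candidate from a different district, place, or year. As a minimal sketch (not part of the original answer), the lookup can be restricted to names that share the same exact-match keys, with None returned when nothing is close enough. The helper names keys, names_by_key, closest_name, and df2_matched are hypothetical, and the 0.6 cutoff is simply difflib's default, not a tuned value:

```python
import difflib

import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "district": [1, 1, 1, 1],
    "place": [1, 1, 1, 1],
    "year": [2000, 2000, 2000, 2001],
    "votes": [25, 30, 10, 5],
    "candidate": ["bond james", "smith john peter",
                  "caprio leonardo di", "bond james"],
})
df2 = pd.DataFrame({
    "district": [1, 1, 1, 1],
    "place": [1, 1, 1, 1],
    "year": [2000, 2000, 2000, 2001],
    "money": [500, 100, 10, 90],
    "candidate": ["bond james", "di caprio leonardo",
                  "smith j.peter", "bond james"],
})

# Columns that must match exactly (hypothetical helper name).
keys = ["district", "place", "year"]

# df1 candidate names, grouped by the exact-match keys.
names_by_key = {
    key: group["candidate"].tolist()
    for key, group in df1.groupby(keys)
}


def closest_name(row, cutoff=0.6):
    """Return the closest df1 name within the same key group, or None."""
    pool = names_by_key.get(tuple(row[k] for k in keys), [])
    matches = difflib.get_close_matches(
        row["candidate"], pool, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Work on a copy of df2 so the original names are preserved.
df2_matched = df2.assign(candidate=df2.apply(closest_name, axis=1))

# Left merge keeps every df1 row, even if no df2 row matched it.
result = df1.merge(df2_matched, on=keys + ["candidate"], how="left")
print(result)
```

Rows of df2 with no sufficiently similar name get None as candidate and therefore do not join to any df1 row; with how="left", unmatched df1 rows are kept with a missing money value instead of being dropped.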

For this question (merging two dataframes in Python with fuzzy matching on some columns), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/77201460/
