I want to merge these example DataFrames:
- How do I get the closest matches into a new df?
df1:
name age department
DJ Griffin 27 FD
Harris Smith 33 RD
df2:
name age department
D.J. Griffin III 27 FD
Harris Smith 33 RD
Miles Jones 58 RD
The result should look like this:
df3:
name age department name_y
DJ Griffin 27 FD D.J. Griffin III
Harris Smith 33 RD Harris Smith
I tried difflib but got an error, apparently because the DataFrames have different lengths.
import pandas as pd
import difflib
df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"]], columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Miles Jones", 58, "RD"]], columns=["name", "age", "department"])
df2['name_y'] = df2['name']
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
Result:
IndexError: list index out of range
- How do I find the closest match on two columns when there is another Harris Smith, aged 45?
For the duplicate Harris Smith case
df1:
name age department
DJ Griffin 27 FD
Harris Smith 33 RD
Harris Smith 45 BA
df2:
name age department
D.J. Griffin III 27 FD
Harris Smith 33 RD
Harris Smith 45 BA
Miles Jones 58 RD
The result should look like this:
df3:
name age department name_y
DJ Griffin 27 FD D.J. Griffin III
Harris Smith 33 RD Harris Smith
Harris Smith 45 BA Harris Smith
import pandas as pd
import difflib
df1 = pd.DataFrame([["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"]], columns=["name", "age", "department"])
df2 = pd.DataFrame([["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"], ["Miles Jones", 58, "RD"]], columns=["name", "age", "department"])
df2['name_y'] = df2['name']
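Thanks for your help.
Best answer
The problem occurs when there are zero matches: get_close_matches returns an empty list, so indexing it with [0] is not possible.
You can use: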
df2['name'].apply(lambda x: next(iter(difflib.get_close_matches(x, df1['name'])), pd.NA))
or
df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])).str[0]
Output (the first variant shows <NA> instead of NaN for the unmatched row):
0 DJ Griffin
1 Harris Smith
2 NaN
Name: name, dtype: object
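As a self-contained illustration of the zero-match handling above, here is a minimal sketch using the question's first pair of DataFrames. The helper name closest_name and the explicit cutoff argument are assumptions made for this example, not part of the original answer; 0.6 is difflib's documented default cutoff.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame(
    [["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"]],
    columns=["name", "age", "department"],
)
df2 = pd.DataFrame(
    [["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"], ["Miles Jones", 58, "RD"]],
    columns=["name", "age", "department"],
)


def closest_name(name, candidates, cutoff=0.6):
    """Return the best close match for `name`, or None when there is none.

    Hypothetical helper for this sketch: it guards against the empty list
    that get_close_matches returns when nothing clears the cutoff.
    """
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


candidates = df1["name"].tolist()
matched = [closest_name(n, candidates) for n in df2["name"]]

# Keep the original spelling in name_y, then drop the rows without a match.
result = df2.assign(name_y=df2["name"], name=matched).dropna(subset=["name"])
print(result)
```

Rows with no close match ("Miles Jones" here) end up with a missing name and are removed by dropna, so no IndexError is raised.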
Update:
df1.merge(df2[['name', 'age']]
             .assign(name_y=df2['name'],
                     name=df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])))
             .explode('name')
             .drop_duplicates(),
          on=['name', 'age']
          )
Output:
name age department name_y
0 DJ Griffin 27 FD D.J. Griffin III
1 Harris Smith 33 RD Harris Smith
2 Harris Smith 45 BA Harris Smith
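For readers who want to run the update end to end, the following sketch rebuilds the duplicate Harris Smith example and applies the same idea step by step: fuzzy-match the name, then merge on the matched name plus the exact age. The intermediate name lookup, the deduplicated candidates list, and the explicit dropna step are choices made for this sketch rather than part of the original answer.

```python
import difflib

import pandas as pd

df1 = pd.DataFrame(
    [["DJ Griffin", 27, "FD"], ["Harris Smith", 33, "RD"], ["Harris Smith", 45, "BA"]],
    columns=["name", "age", "department"],
)
df2 = pd.DataFrame(
    [["D.J. Griffin III", 27, "FD"], ["Harris Smith", 33, "RD"],
     ["Harris Smith", 45, "BA"], ["Miles Jones", 58, "RD"]],
    columns=["name", "age", "department"],
)

# Deduplicate the candidate names so each close match is returned once.
candidates = df1["name"].unique().tolist()

lookup = (
    df2[["name", "age"]]
    .rename(columns={"name": "name_y"})  # keep df2's original spelling
    .assign(name=lambda d: d["name_y"].apply(
        lambda x: difflib.get_close_matches(x, candidates)))
    .explode("name")          # one row per candidate match (NaN if none)
    .dropna(subset=["name"])  # discard df2 rows without any close match
    .drop_duplicates()
)

# The fuzzy step only resolves the name; age must still match exactly.
df3 = df1.merge(lookup, on=["name", "age"])
print(df3)
```

Merging on both name and age is what separates the two Harris Smith rows: the name match alone is ambiguous, and the exact age comparison picks the right partner for each.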
Regarding python-3.x - merging two DataFrames of different lengths on two columns by closest match, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/72612970/