I have two DataFrames:
import pandas as pd

df1 = pd.DataFrame({'name1': ['ID1', 'ID2','ID3','ID4','ID5','ID6','ID7'], 'col1': ["S3,S22,S44", "S133,S32,S334", "S13,S24,S45", "S1,S2,S4,S5", "S3,S4,S5", "S3,S2,S5", "S38,S42,S9"],'col2': ['ab', 'ac','ad','ae','af','an','ak'],})
df2 = pd.DataFrame({'name2': ['Ik3', 'Ik1','Ik2','Ik7','Ik5','Ik6','Ik5'], 'col1': ["S3, S44, S22,S54", "S133, S32,S334, S30", "S13, S24,S45", "S11, S21,S4, S5", "S3, S4,S5", "S3, S22,S5", "S3, S4,S9, S10, S13"],'col2': ['ab', 'ae','ac','ad','af','ab','ak']})
df1
name1 col1 col2
ID1 S3,S22,S44 ab
ID2 S133,S32,S334 ac
ID3 S13,S24,S45 ad
ID4 S1,S2,S4,S5 ae
ID5 S3,S4,S5 af
ID6 S3,S2,S5 an
ID7 S38,S42,S9 ak
df2
name2 col1 col2
Ik3 S3,S44,S22,S54 ab
Ik1 S133,S32,S334,S30 ae
Ik2 S13,S24,S45 ac
Ik7 S11,S21,S4,S5 ad
Ik5 S3,S4,S5 af
Ik6 S3,S22,S5 ab
Ik5 S3,S4,S9,S10,S13 ak
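I want to compare the two DataFrames on the col2 list column, merge the rows whose lists match by more than 50%, and leave the rest empty.
Expected output: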
name1 col1 col2 M_name2 M_col1 M_col2 percentage
ID1 S3,S22,S44 ab Ik3 S3,S44,S22,S54 ab 75
ID1 S3,S22,S44 ab Ik6 S3, S22,S5 ab 75
ID2 S133,S32,S334 ac Ik1 S133,S32,S334,S30 ae 50
ID3 S13,S24,S45 ad Ik2 S13,S24,S45 ac 100
ID4 S1,S2,S4,S5 ae Ik5 S3,S4,S5 af 50
ID5 S3,S4,S5 af Ik5 S3,S4,S5 af 100
ID6 S3,S2,S5 an Ik5 S3,S4,S5 af 75
ID7 S38,S42,S9 ak NaN NaN NaN NaN
I tried the isin function:
df1[df1.col2.isin(df2.col2)]
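But it did not give the desired output. Any suggestions would be appreciated.
Best answer
Your description and your expected output do not match. Still, here is some code that will hopefully help you get started.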
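def get_ratios(df1, df2):
    # Split a comma-separated string and strip whitespace around each item.
    clean = lambda s: list(map(str.strip, s.split(',')))
    # Rows are paired by position; col2 is the column holding the lists
    # (column names as in the output shown further down).
    for a, b in zip(df1.col2, df2.col2):
        vals1, vals2 = clean(a), clean(b)
        inter = set(vals1).intersection(vals2)
        # Shared items relative to the longer of the two lists.
        ratio = len(inter) / max(len(vals1), len(vals2))
        yield ratio

s = pd.Series(get_ratios(df1, df2))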
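To make the ratio concrete, here is a minimal standalone sketch of the same calculation for a single pair of strings. The helper name `overlap_ratio` is hypothetical and not part of the answer; the sample strings are the first rows of the question's two frames.

```python
def overlap_ratio(a: str, b: str) -> float:
    """Number of shared items divided by the length of the longer list."""
    vals_a = [item.strip() for item in a.split(',')]
    vals_b = [item.strip() for item in b.split(',')]
    shared = set(vals_a) & set(vals_b)
    return len(shared) / max(len(vals_a), len(vals_b))


# Three shared items (S3, S22, S44); the longer list has four items.
ratio = overlap_ratio("S3,S22,S44", "S3, S44, S22,S54")
print(ratio)  # 0.75
```

Dividing by the longer list means that extra items on either side lower the score, so a short list that is fully contained in a long one does not count as a 100% match.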
Then merge:
(df1.merge(df2.rename(columns={'col1': 'nma1'}),
           on=['col3'])
    .assign(percentage=s))  # .where(s > 0.5)
col1 col2_x col3 nma1 col2_y percentage
0 ID1 S3, S22,S44 ab Ik3 S3, S44, S22,S54 0.750000
1 ID2 S133, S32,S334 ac Ik1 S133, S32,S334, S30 0.750000
2 ID3 S13, S2 4,S45 ad Ik2 S13, S24,S45 0.666667
3 ID4 S1, S2,S4 S5 ae Ik7 S11, S21,S4, S5 0.000000
4 ID5 S3, S4,S5 af Ik5 S3, S4,S5 1.000000
5 ID6 S3, S2,S5 as Ik6 S3, S2,S5 1.000000
6 ID7 S3, S4,S9 ak Ik5 S3, S4,S9, S10, S13 0.600000
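The code above pairs rows by position, while the question's expected output appears to compare every row of df1 with every row of df2. The following is a rough sketch of that second reading, not the answer author's method. It assumes the question's column names (the lists live in `col1`), a strict "more than 50%" threshold, and pandas 1.2 or later for `how='cross'`. The helper `overlap_pct` and the `M_` prefix are illustrative choices. Because the expected output in the question is not fully consistent with its data, the result will not reproduce it exactly.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name1': ['ID1', 'ID2', 'ID3', 'ID4', 'ID5', 'ID6', 'ID7'],
    'col1': ["S3,S22,S44", "S133,S32,S334", "S13,S24,S45", "S1,S2,S4,S5",
             "S3,S4,S5", "S3,S2,S5", "S38,S42,S9"],
    'col2': ['ab', 'ac', 'ad', 'ae', 'af', 'an', 'ak'],
})
df2 = pd.DataFrame({
    'name2': ['Ik3', 'Ik1', 'Ik2', 'Ik7', 'Ik5', 'Ik6', 'Ik5'],
    'col1': ["S3, S44, S22,S54", "S133, S32,S334, S30", "S13, S24,S45",
             "S11, S21,S4, S5", "S3, S4,S5", "S3, S22,S5",
             "S3, S4,S9, S10, S13"],
    'col2': ['ab', 'ae', 'ac', 'ad', 'af', 'ab', 'ak'],
})


def overlap_pct(a, b):
    """Shared items as a percentage of the longer list (hypothetical helper)."""
    vals_a = [v.strip() for v in a.split(',')]
    vals_b = [v.strip() for v in b.split(',')]
    return 100 * len(set(vals_a) & set(vals_b)) / max(len(vals_a), len(vals_b))


# Pair every df1 row with every df2 row; df2 columns get an M_ prefix.
pairs = df1.merge(df2.add_prefix('M_'), how='cross')
pairs['percentage'] = [overlap_pct(a, b)
                       for a, b in zip(pairs['col1'], pairs['M_col1'])]

# Keep only pairs above the threshold (assumption: strictly more than 50).
matches = pairs[pairs['percentage'] > 50]

# Left-merge back onto df1 so rows without any match keep NaN values.
result = df1.merge(matches.drop(columns=['col1', 'col2']),
                   on='name1', how='left')
print(result)
```

Rows of df1 with no qualifying partner, such as ID7, stay in the result with NaN in the `M_` columns and in `percentage`. A cross join grows as the product of the two row counts, so this is only practical for small to medium frames.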
Regarding "python - merge two df list of strings columns matched over 50% in pandas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58456499/