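I am trying to analyze trends in my data with pandas. I have two tables, and I want to add a new binary column to one of them that indicates whether that row's UID and PID both exist in the other table. An example of the tables I currently have: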
>>> import pandas as pd
>>> df_a = pd.DataFrame({"PID": [12, 55, 56, 89], "TIM": [76, 54, 21, 25], "UID": [123, 456, 789, 12]})
>>> df_a
   PID  TIM  UID
0   12   76  123
1   55   54  456
2   56   21  789
3   89   25   12
>>> df_b = pd.DataFrame({"FOO": [2347, 32447, 3234, 7999], "PID": [17, 89, 51, 55], "UID": [221, 12, 653, 456]})
>>> df_b
     FOO  PID  UID
0   2347   17  221
1  32447   89   12
2   3234   51  653
3   7999   55  456
The end result I would like is:
>>> df_a
   PID  TIM  UID  PUR
0   12   76  123    0
1   55   54  456    1
2   56   21  789    0
3   89   25   12    1
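But I am not sure exactly how to do it. I think a left join is the way to go, but I am having trouble implementing that as well. Any help would be greatly appreciated.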
Best answer
You can use a left join with join or merge, then test the FOO column for non-NaN values with notnull to get a boolean mask, and convert the mask to 0/1 with astype:
# parentheses let the method chain continue on the next line
df_a['PUR'] = (df_a.join(df_b.set_index(['PID','UID']), on=['PID','UID'])['FOO']
                   .notnull().astype(int))
print(df_a)
   PID  TIM  UID  PUR
0   12   76  123    0
1   55   54  456    1
2   56   21  789    0
3   89   25   12    1
df_a['PUR'] = pd.merge(df_a, df_b, how='left', on=['PID','UID'])['FOO'].notnull().astype(int)
print(df_a)
   PID  TIM  UID  PUR
0   12   76  123    0
1   55   54  456    1
2   56   21  789    0
3   89   25   12    1
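Testing FOO for NaN assumes FOO is never missing in df_b itself. As a possible alternative that is not part of the original answer, here is a minimal sketch using the indicator parameter of merge, which labels matched rows directly. The intermediate names keys and merged are made up for illustration.

```python
import pandas as pd

df_a = pd.DataFrame({"PID": [12, 55, 56, 89],
                     "TIM": [76, 54, 21, 25],
                     "UID": [123, 456, 789, 12]})
df_b = pd.DataFrame({"FOO": [2347, 32447, 3234, 7999],
                     "PID": [17, 89, 51, 55],
                     "UID": [221, 12, 653, 456]})

# Keep only the key columns of df_b, without repeated pairs, so the left
# join returns exactly one row per row of df_a.
keys = df_b[["PID", "UID"]].drop_duplicates()

# indicator=True adds a "_merge" column: "both" where the pair was found
# in df_b, "left_only" where it was not.
merged = df_a.merge(keys, how="left", on=["PID", "UID"], indicator=True)

# to_numpy() assigns by position, so the result does not depend on index
# alignment between merged and df_a.
df_a["PUR"] = (merged["_merge"] == "both").astype(int).to_numpy()
print(df_a)
```

Because only the key columns take part in the merge, the flag stays correct even if df_b has missing values in its other columns.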
Another solution is to test membership with isin:
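# Series.isin compares values only and ignores the index, so test the
# (PID, UID) pairs through a MultiIndex instead of the UID column alone
df_a['PUR'] = (df_a.set_index(['PID','UID']).index
                   .isin(df_b.set_index(['PID','UID']).index)
                   .astype(int))
print(df_a)
   PID  TIM  UID  PUR
0   12   76  123    0
1   55   54  456    1
2   56   21  789    0
3   89   25   12    1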
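One point worth checking with isin: Series.isin looks at values only and ignores the index, so a test on the UID column alone also flags a row whose UID appears in the other table under a different PID. The sketch below uses hypothetical data (left and right are made up for illustration) to compare a UID-only test with a test on (PID, UID) pairs built as a MultiIndex.

```python
import pandas as pd

# Hypothetical data: UID 456 exists in `right`, but under PID 99, not PID 55.
left = pd.DataFrame({"PID": [55, 89], "UID": [456, 12]})
right = pd.DataFrame({"PID": [99, 89], "UID": [456, 12]})

# UID-only test: both rows are flagged, although (55, 456) is not in `right`.
uid_only = left["UID"].isin(right["UID"]).astype(int).tolist()

# Pair test: compare (PID, UID) tuples through a MultiIndex.
left_pairs = pd.MultiIndex.from_frame(left[["PID", "UID"]])
right_pairs = pd.MultiIndex.from_frame(right[["PID", "UID"]])
pair_test = left_pairs.isin(right_pairs).astype(int).tolist()

print(uid_only)   # [1, 1]
print(pair_test)  # [0, 1]
```

On the sample tables in the question both tests happen to agree, because no UID is shared across different PIDs there.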
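EDIT:
If df_b contains duplicated (PID, UID) pairs, the left join returns more rows than df_a has, so it seems you need drop_duplicates on both columns first: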
# added duplicates
df_b = pd.DataFrame({'FOO': [2347, 32447, 3234, 7999],
                     'PID': [17, 89, 55, 55],
                     'UID': [221, 12, 456, 456]})
print(df_b)
     FOO  PID  UID
0   2347   17  221
1  32447   89   12
2   3234   55  456   <- duplicated in both columns
3   7999   55  456   <- duplicated in both columns
# keep one row per (PID, UID) pair so the join returns one row per row of df_a
df_b = df_b.drop_duplicates(['PID','UID'])
df_a['PUR'] = (df_a.join(df_b.set_index(['PID','UID']), on=['PID','UID'])['FOO']
                   .notnull().astype(int))
print(df_a)
   PID  TIM  UID  PUR
0   12   76  123    0
1   55   54  456    1
2   56   21  789    0
3   89   25   12    1
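To see why the duplicates matter, the following sketch counts the rows produced by the left join before and after drop_duplicates. The names with_dups, deduped and without_dups are made up for illustration, and the validate argument is an optional guard that is not part of the original answer.

```python
import pandas as pd

df_a = pd.DataFrame({"PID": [12, 55, 56, 89],
                     "TIM": [76, 54, 21, 25],
                     "UID": [123, 456, 789, 12]})
df_b = pd.DataFrame({"FOO": [2347, 32447, 3234, 7999],
                     "PID": [17, 89, 55, 55],
                     "UID": [221, 12, 456, 456]})

# The pair (55, 456) occurs twice in df_b, so the matching row of df_a
# appears twice in the joined result.
with_dups = pd.merge(df_a, df_b, how="left", on=["PID", "UID"])

# After dropping repeated pairs, the join has one row per row of df_a.
# validate="many_to_one" makes pandas raise an error if the keys of the
# right-hand table are not unique.
deduped = df_b.drop_duplicates(["PID", "UID"])
without_dups = pd.merge(df_a, deduped, how="left", on=["PID", "UID"],
                        validate="many_to_one")

print(len(df_a), len(with_dups), len(without_dups))  # 4 5 4
```

With the extra row, the flag computed from the join no longer lines up one-to-one with the rows of df_a, which is why the pairs are deduplicated first.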
This question (python - How to create a binary label from two tables) originally appeared on Stack Overflow: https://stackoverflow.com/questions/43132104/