DataFrame
import pandas as pd
df = pd.DataFrame({'A': [['gener'], ['gener'], ['system'], ['system'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum', 'toledo']], 'B': [['gutter'], ['gutter'], ['gutter', 'system'], ['gutter', 'guard', 'system'], ['ohio', 'gutter'], ['gutter', 'toledo'], ['toledo', 'gutter'], ['gutter'], ['gutter'], ['gutter'], ['how', 'to', 'instal', 'aluminum', 'gutter'], ['aluminum', 'gutter'], ['aluminum', 'gutter', 'color'], ['aluminum', 'gutter'], ['aluminum', 'gutter', 'adrian', 'ohio'], ['aluminum', 'gutter', 'bowl', 'green', 'ohio'], ['aluminum', 'gutter', 'maume', 'ohio'], ['aluminum', 'gutter', 'perrysburg', 'ohio'], ['aluminum', 'gutter', 'tecumseh', 'ohio'], ['aluminum', 'gutter', 'toledo', 'ohio']]}, columns=['A', 'B'])
What it looks like
I have a DataFrame with two columns of lists.
    A                   B
0   [gener]             [gutter]
1   [gener]             [gutter]
2   [system]            [gutter, system]
3   [system]            [gutter, guard, system]
4   [gutter]            [ohio, gutter]
5   [gutter]            [gutter, toledo]
6   [gutter]            [toledo, gutter]
7   [gutter]            [gutter]
8   [gutter]            [gutter]
9   [gutter]            [gutter]
10  [aluminum]          [how, to, instal, aluminum, gutter]
11  [aluminum]          [aluminum, gutter]
12  [aluminum]          [aluminum, gutter, color]
13  [aluminum]          [aluminum, gutter]
14  [aluminum]          [aluminum, gutter, adrian, ohio]
15  [aluminum]          [aluminum, gutter, bowl, green, ohio]
16  [aluminum]          [aluminum, gutter, maume, ohio]
17  [aluminum]          [aluminum, gutter, perrysburg, ohio]
18  [aluminum]          [aluminum, gutter, tecumseh, ohio]
19  [aluminum, toledo]  [aluminum, gutter, toledo, ohio]
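Question
If I have columns of lists, is there a pandas function that lets me operate on the whole array of lists to check for an intersection and return either a bool value or the intersecting values as a new Series?
For example, I would like pandas to have something equivalent to this: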
def intersection(df, col1, col2, return_type='boolean'):
    df = df[[col1, col2]]
    if return_type == 'boolean':
        # True when any item of col2 also appears in col1
        s = []
        for _, row in df.iterrows():
            s.append(any(phrase in row[col1] for phrase in row[col2]))
        return pd.Series(s)
    elif return_type == 'word':
        # Comma-separated string of the items common to both lists
        s = []
        for _, row in df.iterrows():
            s.append(', '.join(set(row[col1]).intersection(row[col2])))
        return pd.Series(s)

# Create column C in df
df['C'] = intersection(df, 'A', 'B', 'word')
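... without having to write my own function or resort to for loops. I feel there must be a simpler way to compare the lists in two columns of the same row and see whether they intersect.
I can do it with a for loop, but it looks ugly to me.
A for loop that returns a boolean series:
for _, row in df.iterrows():
    any(phrase in row['A'] for phrase in row['B'])
which produces: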
False
False
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
Or, using sets to find the intersecting words:
for _, row in df.iterrows():
    ', '.join(set(row['A']).intersection(row['B']))
''
''
'system'
'system'
'gutter'
'gutter'
'gutter'
'gutter'
'gutter'
'gutter'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'aluminum'
'toledo, aluminum'
Best answer
To check whether every item in df.A is contained in df.B:
>>> df.apply(lambda row: all(i in row.B for i in row.A), axis=1)
# OR: ~(df['A'].apply(set) - df['B'].apply(set)).astype(bool)
0 False
1 False
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
19 True
dtype: bool
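Note that the expression above is a subset test (every item of A must be in B), whereas the loop in the question tests whether the two lists share at least one item. For the question's data the two happen to agree. A minimal sketch of both variants follows, using a small stand-in frame named demo; that frame is hypothetical and is not the question's data.

```python
import pandas as pd

# Hypothetical stand-in frame: two columns of lists, like the question's df.
demo = pd.DataFrame({
    'A': [['gener'], ['system'], ['aluminum', 'toledo'], ['aluminum', 'ohio']],
    'B': [['gutter'], ['gutter', 'system'], ['aluminum', 'gutter'],
          ['aluminum', 'gutter', 'ohio']],
})

# True when the two lists in a row share at least one item.
any_overlap = pd.Series(
    [not set(a).isdisjoint(b) for a, b in zip(demo['A'], demo['B'])],
    index=demo.index,
)

# True only when every item of A also appears in B (the subset test).
all_contained = pd.Series(
    [set(a).issubset(b) for a, b in zip(demo['A'], demo['B'])],
    index=demo.index,
)
```

The third row is where the two differ: 'aluminum' is shared, so any_overlap is True, but 'toledo' is missing from B, so all_contained is False.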
To get the intersection:
df['intersection'] = [list(set(a).intersection(set(b)))
                      for a, b in zip(df.A, df.B)]
>>> df
    A                   B                                      intersection
0   [gener]             [gutter]                               []
1   [gener]             [gutter]                               []
2   [system]            [gutter, system]                       [system]
3   [system]            [gutter, guard, system]                [system]
4   [gutter]            [ohio, gutter]                         [gutter]
5   [gutter]            [gutter, toledo]                       [gutter]
6   [gutter]            [toledo, gutter]                       [gutter]
7   [gutter]            [gutter]                               [gutter]
8   [gutter]            [gutter]                               [gutter]
9   [gutter]            [gutter]                               [gutter]
10  [aluminum]          [how, to, instal, aluminum, gutter]    [aluminum]
11  [aluminum]          [aluminum, gutter]                     [aluminum]
12  [aluminum]          [aluminum, gutter, color]              [aluminum]
13  [aluminum]          [aluminum, gutter]                     [aluminum]
14  [aluminum]          [aluminum, gutter, adrian, ohio]       [aluminum]
15  [aluminum]          [aluminum, gutter, bowl, green, ohio]  [aluminum]
16  [aluminum]          [aluminum, gutter, maume, ohio]        [aluminum]
17  [aluminum]          [aluminum, gutter, perrysburg, ohio]   [aluminum]
18  [aluminum]          [aluminum, gutter, tecumseh, ohio]     [aluminum]
19  [aluminum, toledo]  [aluminum, gutter, toledo, ohio]       [aluminum, toledo]
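If a comma-separated string is wanted rather than a list (the 'word' return type in the question), the same zip pattern can build it directly. The sketch below is one possible way, again on a hypothetical stand-in frame called demo; sorting the shared words is an added assumption to make the order repeatable, since iterating a set does not guarantee any particular order.

```python
import pandas as pd

# Hypothetical stand-in frame: two columns of lists, like the question's df.
demo = pd.DataFrame({
    'A': [['gener'], ['system'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['gutter', 'system'],
          ['aluminum', 'gutter', 'toledo', 'ohio']],
})

# Intersect the two lists in each row, sort the shared words so the
# output order is deterministic, then join them into one string.
demo['C'] = [
    ', '.join(sorted(set(a) & set(b)))
    for a, b in zip(demo['A'], demo['B'])
]
```

Rows with no shared words get an empty string, which matches the output of the question's set-based loop.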
Regarding "python - Pandas: how to compare columns of lists row-wise in a DataFrame with pandas (not a for loop)?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35616058/