python - 如何将 `str.contains` 的输出分配给 Pandas 列？

这一定在其他地方得到了回答，但我找不到链接。我有一个 df，其中包含一些任意文本和一个单词列表 W。我想为 df 分配一个新列，以便它包含 W 中匹配的单词。例如，给定 df

   T
   dog
   dog and meerkat
   cat

如果 W="dog"，那么我想要

   T
   dog                dog
   dog and meerkat    dog
   cat

我目前的情况是

df[df.T.str.contains('|'.join(W), case=False)]

但这只会给我匹配的行，即:

   T
   dog
   dog and meerkat

有什么想法、建议吗？

最佳答案

您可以使用 Series.where - 哪里不匹配得到 NaN:

W = 'dog'
df['new'] = df['T'].where(df['T'].str.contains('|'.join(W), case=False))
print (df)
                 T              new
0              dog              dog
1  dog and meerkat  dog and meerkat
2              cat              NaN

或DataFrame.loc :

W = 'dog'
df.loc[df['T'].str.contains('|'.join(W), case=False), 'new'] = df['T']
print (df)
                 T              new
0              dog              dog
1  dog and meerkat  dog and meerkat
2              cat              NaN

另一种可能的解决方案是 numpy.where如果不匹配，可以在哪里添加值:

W = 'dog'
df['new'] = np.where(df['T'].str.contains('|'.join(W), case=False), df['T'], 'nothing')
print (df)
                 T              new
0              dog              dog
1  dog and meerkat  dog and meerkat
2              cat          nothing

但是如果只需要匹配列表的值使用extract并为 groups 添加第一个和最后一个 ():

W = ['dog', 'rabbit']
df['new'] = df['T'].str.extract('('+'|'.join(W) + ')', expand=True)
print (df)
                 T  new
0              dog  dog
1  dog and meerkat  dog
2              cat  NaN

Extracting in docs .

关于python - 如何将 `str.contains` 的输出分配给 Pandas 列？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/41707015/

上一篇：python - Unittest : How to ensure type equality, 不只是值相等？

下一篇：python - 如何使用索引和 bool 索引有效地设置 numpy 数组的值？

相关文章：

python - 根据表中的现有属性在 Python 中填充 SQL 表

pandas - 将每 n 行连接成一行 Pandas

python - 计算总和的结果连续多少次为正(或负)

python - 对列多个文件 Pandas 的操作

python - 从单个 View 提供多个模板(或者我应该使用多个 View ？)

python - 如何转换 pandas 数据框，使索引是唯一的一组值，数据是每个值的计数？

python - ElementTree 命名空间字典不适用于 find() 或 findall()

python - web2py如何制作列表:reference of an other table in db. py

python - 保留最大绝对值并返回具有重复索引的行的平均值

python - 在 multiindex 的第二个索引上使用 .loc