Python: search a text column and return matches if any keyword from a word list is found

Tags: python pandas

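I have a DataFrame with two columns: message_id and msg_lower. I also have a list of keywords called terms. My goal is to search the msg_lower field for any entry in the terms list. When there is a match, I want to return a tuple containing the message_id and the keyword.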

The data looks like this:

|message_id|msg_lower                      |
|1116193453|text here that means something |
|9023746237|more text there meaning nothing|

terms = ['text', 'nothing', 'there meaning']

A term can also consist of more than one word.

For the given example, I would like to return:

[(1116193453, 'text'), (9023746237, 'text'), (9023746237, 'nothing'), (9023746237, 'there meaning')]

Ideally, I would like to do this as efficiently as possible.

Best answer

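You can zip the two columns to loop over (message_id, message) pairs, loop over the terms, and test each term for membership in the split words of the message: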

terms = ['text', 'nothing']
# Pair each message_id with every term that appears as a whole word in the message
a = [(x, i) for x, y in zip(df['message_id'], df['msg_lower']) for i in terms if i in y.split()]
print(a)
[(1116193453, 'text'), (9023746237, 'text'), (9023746237, 'nothing')]
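
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the sample rows in the question, and the helper name match_single_words is hypothetical (it is not part of the original answer). It also builds a set of words once per message, which is an assumption about what might help when the terms list is long, not something measured here.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "message_id": [1116193453, 9023746237],
    "msg_lower": ["text here that means something",
                  "more text there meaning nothing"],
})
terms = ["text", "nothing"]


def match_single_words(frame, keywords):
    """Return (message_id, keyword) pairs for keywords that appear as whole words."""
    pairs = []
    for msg_id, msg in zip(frame["message_id"], frame["msg_lower"]):
        words = set(msg.split())  # split once per message, then do set lookups
        pairs.extend((msg_id, kw) for kw in keywords if kw in words)
    return pairs


result = match_single_words(df, terms)
print(result)
```

Because the message is split on whitespace, this variant can only find single-word terms; a term such as 'there meaning' never equals one of the split words.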

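Edit: for terms that span more than one word, test for the term as a substring of the whole message instead of in the split words. Note that a substring test also matches inside longer words, for example 'text' inside 'context':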

terms = ['text', 'nothing', 'there meaning']

a = [(x, i) for x, y in zip(df['message_id'], df['msg_lower']) for i in terms if i in y]
print(a)
[(1116193453, 'text'), (9023746237, 'text'), 
 (9023746237, 'nothing'), (9023746237, 'there meaning')]

Another idea is to use re.findall with word boundaries (\b), so that a term only matches as a whole word or phrase:

import re

a = [(x, i) for x, y in zip(df['message_id'], df['msg_lower'])
            for i in terms if re.findall(r"\b{}\b".format(re.escape(i)), y)]
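
The word-boundary idea can be sketched as follows, using plain lists in place of the two DataFrame columns so the snippet runs with the standard library alone. The helper name match_whole_terms, the precompiled patterns dictionary, and the third row (id 1, "some context only") are assumptions added for illustration and are not part of the original question or answer.

```python
import re

terms = ["text", "nothing", "there meaning"]

# Compile each pattern once; re.escape guards against regex metacharacters in a term.
patterns = {t: re.compile(r"\b{}\b".format(re.escape(t))) for t in terms}


def match_whole_terms(ids, messages):
    """Return (message_id, term) pairs where the term occurs on word boundaries."""
    return [(msg_id, t)
            for msg_id, msg in zip(ids, messages)
            for t, pat in patterns.items()
            if pat.search(msg)]


ids = [1116193453, 9023746237, 1]  # the id 1 is a made-up row for illustration
messages = ["text here that means something",
            "more text there meaning nothing",
            "some context only"]

pairs = match_whole_terms(ids, messages)
substring_hit = "text" in messages[2]  # a plain substring test matches inside "context"
print(pairs)
print(substring_hit)
```

With a DataFrame, the same function could be called as match_whole_terms(df['message_id'], df['msg_lower']). The made-up third row produces no pair here, whereas the plain substring test reports a hit for it.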

A similar question to this one can be found on Stack Overflow: https://stackoverflow.com/questions/56239366/
