regex - Issue extracting "mentions" from Twitter data using a regular expression

I am trying to extract the mentions from tweets, i.e. @Google or @Apple.

This is my code so far. It extracts the mentions from one column and stores them in a new column.

df_bdtu['mentions'] = df_bdtu['tweet_text'].str.findall('(?:^|\s)[＠ @]{1}([^\s#<>[\]|{}]+)')

It works for the most part, but I run into problems with some edge cases. Take this tweet, for example:

Check out @Dreams_n_Songs and give them a follow! I can't recommend their hoodies enough!Shop now  👉…

The mentions stored in the mentions column, shown below, are wrong because for some reason they include the emoji.

['Dreams_n_Songs', '👉…']

The other problem appears when a mention is preceded by a ., as in this example:

.@ChelseaFC, @FCBayern, @VfL_Wolfsburg and more are among the latest names to be confirmed at -…

The resulting mentions do not include the first mention.

['FCBayern,', 'VfL_Wolfsburg']

How can I fix this regular expression?

Best answer

You can use

[＠@]([^][\s#<>|{}]+)

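See the regex demo. The fix has two parts: remove the (?:^|\s) group, which requires the start of the string or a whitespace character at the start of the match, and remove the space from the [＠ @] character class. The group is why .@ChelseaFC was skipped (its @ follows a dot, not whitespace), and the space inside the class is why the emoji was captured (the two spaces before 👉 match \s followed by a space).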
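To make the difference concrete, here is a minimal sketch that runs both patterns over the two tweets from the question using only the standard re module. The variable names are illustrative and not from the original post.

```python
import re

# Pattern from the question: leading (?:^|\s) group, and a space inside the class.
OLD_PATTERN = r'(?:^|\s)[＠ @]{1}([^\s#<>[\]|{}]+)'
# Pattern from the answer: no leading group, no space inside the class.
NEW_PATTERN = r'[＠@]([^][\s#<>|{}]+)'

# The first tweet has two spaces before the emoji, so they are written out explicitly.
tweet_emoji = ("Check out @Dreams_n_Songs and give them a follow! "
               "I can't recommend their hoodies enough!Shop now" + " " * 2 + "👉…")
tweet_dot = (".@ChelseaFC, @FCBayern, @VfL_Wolfsburg and more are among "
             "the latest names to be confirmed at -…")

old_emoji = re.findall(OLD_PATTERN, tweet_emoji)
new_emoji = re.findall(NEW_PATTERN, tweet_emoji)
old_dot = re.findall(OLD_PATTERN, tweet_dot)
new_dot = re.findall(NEW_PATTERN, tweet_dot)

print(old_emoji)  # ['Dreams_n_Songs', '👉…']
print(new_emoji)  # ['Dreams_n_Songs']
print(old_dot)    # ['FCBayern,', 'VfL_Wolfsburg']
print(new_dot)    # ['ChelseaFC,', 'FCBayern,', 'VfL_Wolfsburg']
```

Note that the new pattern still keeps a trailing comma on some names, because , is not among the characters excluded by the capture group.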

In Pandas code, you can use it like this:

df_bdtu['mentions'] = df_bdtu['tweet_text'].str.findall(r'[＠@]([^][\s#<>|{}]+)')

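Note the r'...' raw string literal notation, which keeps Python from interpreting backslash sequences such as \s before the regex engine sees them.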
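As a rough end-to-end sketch, the snippet below applies the pattern to a small stand-in DataFrame. Only the names df_bdtu, tweet_text and mentions come from the question. The sample rows, the mentions_clean column and the punctuation-stripping step are assumptions added for illustration, not part of the answer.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data; only the column name is from the question.
df_bdtu = pd.DataFrame({
    'tweet_text': [
        ".@ChelseaFC, @FCBayern, @VfL_Wolfsburg and more are among "
        "the latest names to be confirmed at -…",
        "No mentions here",
    ]
})

# Raw string so that \s reaches the regex engine unchanged.
pattern = r'[＠@]([^][\s#<>|{}]+)'
df_bdtu['mentions'] = df_bdtu['tweet_text'].str.findall(pattern)

# Optional clean-up (an assumption, not from the answer): strip trailing
# punctuation such as the comma in 'ChelseaFC,'.
df_bdtu['mentions_clean'] = df_bdtu['mentions'].apply(
    lambda names: [name.rstrip('.,:;!?') for name in names]
)

print(df_bdtu['mentions'].tolist())
print(df_bdtu['mentions_clean'].tolist())
```

Rows without any mention get an empty list. Whether to strip trailing punctuation depends on how the mentions are used downstream.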

For more on regex - Issue extracting "mentions" from Twitter data using a regular expression, see the similar question on Stack Overflow: https://stackoverflow.com/questions/66028983/
