python-3.x - Filtering usernames from a string

I am trying to filter out the usernames mentioned in a tweet, as in the following example:

Example:

tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'

The desired output would be:

rt_unames = ['uname1', 'uname6']
mt_unames = ['uname2', 'uname3', 'uname4', 'uname5']

I could do this with something like a for loop, as in the naive solution below:

Naive solution:

def find_end_idx(tw_part):
    # Return the index of the first delimiter (space, period or comma)
    # in tw_part, or len(tw_part) if none of them occurs.
    end_idx = len(tw_part)
    for delim in (' ', '.', ','):
        idx = tw_part.find(delim)
        if idx != -1:
            end_idx = min(end_idx, idx)
    return end_idx

tw = 'RT @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
acc = ''
rt_unames = []
mt_unames = []
for i, c in enumerate(tw):
    acc += c
    if acc.endswith('RT'):
        # Retweet marker: the '@' of the username sits two characters after the 'T'.
        start_idx = i+2
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx+end_idx]
        if uname not in mt_unames:
            rt_unames.append(uname)
        acc = ''
    elif acc.endswith('@'):
        # Any other mention: the username starts at the '@' itself.
        start_idx = i
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx+end_idx]
        # Skip names already recorded as retweets (their '@' is seen again here).
        if uname not in rt_unames:
            mt_unames.append(uname)
        acc = ''
rt_unames, mt_unames

Which outputs:

(['@uname1', '@uname6'], ['@uname2', '@uname3', '@uname4', '@uname5'])

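Question: Since I need to apply this to every tweet in a pandas.DataFrame, I am looking for a more elegant and faster way to get this result.

Any suggestions would be greatly appreciated.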

Best answer

Let's try re.findall with a regex pattern:

import re

rt_unames = re.findall(r'(?<=TR |RT )@([^,]+)', tw)
mt_unames = re.findall(r'(?<!TR |RT )@([^,]+)', tw)

In a similar way, you can use the str.findall method on a column of the dataframe:

df['rt_unames'] = df['tweet'].str.findall(r'(?<=TR |RT )@([^,]+)')
df['mt_unames'] = df['tweet'].str.findall(r'(?<!TR |RT )@([^,]+)')

Result:

['uname1', 'uname6']
['uname2', 'uname3', 'uname4', 'uname5']
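
Note that `[^,]+` captures everything up to the next comma, so a mention followed by a space and more text (rather than a comma) would pull that text into the result. A minimal sketch of a stricter variant, assuming usernames consist only of letters, digits and underscores; the helper name `split_mentions` and the sample tweet are made up for illustration and are not part of the original answer:

```python
import re

# Assumption: a username is a run of word characters (letters, digits, underscore).
# Both lookbehind alternatives are 3 characters wide, which Python's re requires.
RT_PATTERN = re.compile(r'(?<=TR |RT )@(\w+)')
MT_PATTERN = re.compile(r'(?<!TR |RT )@(\w+)')

def split_mentions(tweet):
    """Hypothetical helper: return (retweeted usernames, other mentioned usernames)."""
    return RT_PATTERN.findall(tweet), MT_PATTERN.findall(tweet)

# Sample tweet where mentions are followed by a space or a period, not only commas.
sample = 'RT @uname1 hello @uname2, text1. @uname3'
rt_unames, mt_unames = split_mentions(sample)
print(rt_unames)
print(mt_unames)
```

The same two pattern strings should also work with str.findall on a dataframe column, since only the capture group changed.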

Regarding python-3.x - filtering usernames from a string, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/65050413/
