I am trying to filter the usernames referenced in a tweet, as in the following example:
Example:
tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
The desired output would be:
rt_unames = ['uname1', 'uname6']
mt_unames = ['uname2', 'uname3', 'uname4', 'uname5']
I can do this with something like a for loop, as in the naive solution below.
Naive solution:
def find_end_idx(tw_part):
    # Return the index of the first space, period, or comma in tw_part,
    # or len(tw_part) if none of them occurs.
    end_space_idx = len(tw_part)
    try:
        end_space_idx = tw_part.index(' ')
    except ValueError:
        pass
    end_dot_idx = len(tw_part)
    try:
        end_dot_idx = tw_part.index('.')
    except ValueError:
        pass
    end_comma_idx = len(tw_part)
    try:
        end_comma_idx = tw_part.index(',')
    except ValueError:
        pass
    return min(end_space_idx, end_dot_idx, end_comma_idx)

tw = 'RT @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'
acc = ''
rt_unames = []
mt_unames = []
for i, c in enumerate(tw):
    acc += c
    if acc[-2:] == 'RT':
        # 'RT' is followed by a space, so the username starts two characters later.
        start_idx = i + 2
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx + end_idx]
        if uname not in mt_unames:
            rt_unames.append(uname)
        acc = ''
    elif acc[-1:] == '@':
        # A plain mention: the username starts at the '@' itself.
        start_idx = i
        end_idx = find_end_idx(tw_part=tw[start_idx:])
        uname = tw[start_idx:start_idx + end_idx]
        if uname not in rt_unames:
            mt_unames.append(uname)
        acc = ''
rt_unames, mt_unames
Which outputs:
(['@uname1', '@uname6'], ['@uname2', '@uname3', '@uname4', '@uname5'])
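Question:
Since I need to apply this to every tweet in a pandas.DataFrame, I am looking for a more elegant and faster way to get this result.
Any suggestions would be much appreciated.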
Best answer
Let's try re.findall with a regex pattern:
import re
rt_unames = re.findall(r'(?<=TR |RT )@([^,]+)', tw)
mt_unames = re.findall(r'(?<!TR |RT )@([^,]+)', tw)
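As a quick, self-contained check, the sketch below runs the two patterns above on the example tweet from the question. The second pair of patterns (RT_STRICT, MT_STRICT) and the sample string other are assumptions added for illustration, not part of the answer: they end a username at whitespace, a period, or a comma, which are the same delimiters the naive solution searches for, and may be useful when a mention is not followed by a comma.

```python
import re

tw = 'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6'

# Patterns from the answer: a username runs until the next comma.
RT_PATTERN = r'(?<=TR |RT )@([^,]+)'  # '@' directly preceded by 'TR ' or 'RT '
MT_PATTERN = r'(?<!TR |RT )@([^,]+)'  # '@' not preceded by 'TR ' or 'RT '

rt_unames = re.findall(RT_PATTERN, tw)
mt_unames = re.findall(MT_PATTERN, tw)
print(rt_unames)  # ['uname1', 'uname6']
print(mt_unames)  # ['uname2', 'uname3', 'uname4', 'uname5']

# Assumed variant (not from the answer): stop at whitespace, '.' or ','.
RT_STRICT = r'(?<=TR |RT )@([^\s.,]+)'
MT_STRICT = r'(?<!TR |RT )@([^\s.,]+)'

other = 'RT @alice hello @bob. bye'  # hypothetical tweet without commas
rt_other = re.findall(RT_STRICT, other)
mt_other = re.findall(MT_STRICT, other)
print(rt_other)  # ['alice']
print(mt_other)  # ['bob']
```

Both alternatives inside the lookbehind are three characters long, which matters because Python's re module only accepts fixed-width lookbehind patterns.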
In a similar way, you can use the str.findall method on a column of the dataframe:
df['rt_unames'] = df['tweet'].str.findall(r'(?<=TR |RT )@([^,]+)')
df['mt_unames'] = df['tweet'].str.findall(r'(?<!TR |RT )@([^,]+)')
Result:
['uname1', 'uname6']
['uname2', 'uname3', 'uname4', 'uname5']
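For the DataFrame case, a minimal runnable sketch might look like the following. The sample rows are made up for illustration, and the column name 'tweet' is taken from the answer's snippet; the second row is only there to show what happens when a tweet contains no mentions.

```python
import pandas as pd

# Hypothetical sample data; only the first row comes from the question.
df = pd.DataFrame({
    'tweet': [
        'TR @uname1, @uname2, @uname3, text1, text2, @uname4, text3, @uname5, RT @uname6',
        'no mentions here',
    ]
})

# Series.str.findall applies re.findall to each element and returns a list per row.
df['rt_unames'] = df['tweet'].str.findall(r'(?<=TR |RT )@([^,]+)')
df['mt_unames'] = df['tweet'].str.findall(r'(?<!TR |RT )@([^,]+)')

print(df[['rt_unames', 'mt_unames']])
```

Rows without any match get an empty list, so the new columns can be processed uniformly afterwards.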
Regarding python-3.x - filtering usernames from a string, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/65050413/