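I have a large number of word "groups". If any word from a group appears in both column A and column B, I want to remove that group's words from both columns. How do I loop over all the groups (that is, over the sublists of a list)?
The flawed code below only removes the shared words of the last group, not of all three groups (lists) in stuff. [I first create an indicator that is set if a word from the group is in the string, then a second indicator that is set if both strings contain a word from the group. Only for A/B pairs that both contain a word from the group do I remove that group's words.]
How do I specify the loop correctly?
Edit: In my code, every pass of the loop starts again from the original columns instead of continuing with the columns from which the previous group's words have already been removed.
The suggested solutions are more elegant and concise, but they also remove a word when it is part of another word (for example, "foo" is correctly removed from "foo hello", but it is also incorrectly removed from "foobar").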
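The "foo" versus "foobar" problem mentioned above comes down to substring matching versus whole-word matching. A minimal sketch of the difference follows; the helper names remove_substring and remove_whole_word are made up for this illustration and do not appear in the question or the answer.

```python
import re

def remove_substring(text, word):
    # Plain substring removal: also hits "foo" inside "foobar".
    return text.replace(word, "")

def remove_whole_word(text, word):
    # \b anchors the match at word boundaries, so "foobar" is left alone.
    return re.sub(r"\b" + re.escape(word) + r"\b", "", text)

print(remove_substring("foo hello foobar", "foo"))   # " hello bar"
print(remove_whole_word("foo hello foobar", "foo"))  # " hello foobar"
```

The same \b pattern is what the question's own helper functions already use, which is why they leave "autumnwind" untouched.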
import re
import pandas as pd

# Input data:
data = {'A': ['summer time third grey abc', 'yellow sky hello table', 'fourth autumnwind'],
        'B': ['defg autumn times fourth table', 'not red skies second garnet', 'first blue chair winter']
        }
df = pd.DataFrame(data, columns=['A', 'B'])
   A                           B
0  summer time third grey abc  defg autumn times fourth table
1  yellow sky hello table      not red skies second garnet
2  fourth autumnwind           first blue chair winter
# Groups of words to be removed:
colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']
stuff = [colors, seasons, numbers]
# Code below only removes the last list in stuff (numbers):
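# Note: all four helpers read the global variable `listed` (the current group).
def fA(S, y):
    # Indicator: 1 if any word of the current group occurs in S as a whole word.
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            y = 1
    return y

def fB(T, y):
    # Same indicator for column B.
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            y = 1
    return y

def fARemove(S):
    # Replace every whole-word occurrence of the group's words with a space.
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            S = re.sub(r'\b{}\b'.format(re.escape(word)), ' ', S)
    return S

def fBRemove(T):
    # Same removal for column B.
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            T = re.sub(r'\b{}\b'.format(re.escape(word)), ' ', T)
    return T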
for listed in stuff:
    df['A_Ind'] = 0
    df['B_Ind'] = 0
    df['A_Ind'] = df.apply(lambda x: fA(x.A, x.A_Ind), axis=1)
    df['B_Ind'] = df.apply(lambda x: fB(x.B, x.B_Ind), axis=1)
    df['inboth'] = 0
    df.loc[((df.A_Ind == 1) & (df.B_Ind == 1)), 'inboth'] = 1
    # A_new and B_new are re-created from the original columns on every pass.
    df['A_new'] = df['A']
    df['B_new'] = df['B']
    df.loc[df.inboth == 1, 'A_new'] = df.apply(lambda x: fARemove(x.A), axis=1)
    df.loc[df.inboth == 1, 'B_new'] = df.apply(lambda x: fBRemove(x.B), axis=1)
    del df['inboth']
    del df['A_Ind']
    del df['B_Ind']

# Collapse repeated whitespace (regex=True is required in recent pandas versions).
df['A_new'] = df['A_new'].str.replace(r'\s{2,}', ' ', regex=True)
df['A_new'] = df['A_new'].str.strip()
df['B_new'] = df['B_new'].str.replace(r'\s{2,}', ' ', regex=True)
df['B_new'] = df['B_new'].str.strip()
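The loop above rebuilds A_new and B_new from the original columns on every pass, so only the removals of the last group survive. One possible way around this, sketched here in plain Python on single strings, is to initialise the result once and keep updating it. The names has_word, strip_words and remove_shared_groups are hypothetical helpers introduced for this sketch; the sketch also assumes that longer phrases are listed before their shorter prefixes, as they are in the question's lists.

```python
import re

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue',
          'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time',
           'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']
stuff = [colors, seasons, numbers]

def has_word(text, words):
    # True if any entry of `words` occurs in `text` as a whole word or phrase.
    return any(re.search(r"\b" + re.escape(w) + r"\b", text) for w in words)

def strip_words(text, words):
    # Remove every whole-word occurrence, then collapse leftover whitespace.
    for w in words:
        text = re.sub(r"\b" + re.escape(w) + r"\b", " ", text)
    return re.sub(r"\s{2,}", " ", text).strip()

def remove_shared_groups(a, b, groups):
    # Start from the originals once, then keep updating the same two strings
    # so that removals made for earlier groups are not lost.
    new_a, new_b = a, b
    for words in groups:
        # Membership is tested on the original strings, as in the question.
        if has_word(a, words) and has_word(b, words):
            new_a = strip_words(new_a, words)
            new_b = strip_words(new_b, words)
    return new_a, new_b

print(remove_shared_groups('summer time third grey abc',
                           'defg autumn times fourth table', stuff))
# ('grey abc', 'defg table')
```

With the DataFrame from the question, the function could be applied row by row, for instance by building pairs = [remove_shared_groups(a, b, stuff) for a, b in zip(df['A'], df['B'])] and assigning the first and second elements of each pair to A_new and B_new.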
The expected output is:
   A_new        B_new
0  grey abc     defg table
1  hello table  not second garnet
2  autumnwind   blue chair winter
Best answer
import re

# Flatten a list of lists into a single list.
flatten_list = lambda l: [item for subl in l for item in subl]

def remove_recursive(s, l):
    # Remove each entry of l from s in turn (plain substring replacement).
    while len(l) > 0:
        s = s.replace(l[0], '')
        l = l[1:]
    return re.sub(r'\ +', ' ', s).strip()

df['A_new'] = df.apply(lambda x: remove_recursive(x.A, flatten_list([l for l in stuff if (len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0)])), axis = 1)
df['B_new'] = df.apply(lambda x: remove_recursive(x.B, flatten_list([l for l in stuff if (len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0)])), axis = 1)
df.head()
#    A_new        B_new
# 0  grey abc     defg table
# 1  hello table  not second garnet
# 2  wind         blue chair
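This is similar to the code in the comments: a flattening lambda collects the words of every group that has a match in both columns, and remove_recursive then strips those words from the string one after another.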
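Because the answer matches with `in` and `str.replace`, it works on substrings, which is why "autumnwind" becomes "wind" in its output. A word-boundary variant could look like the sketch below. It is not the answer author's code: compile_group and remove_shared are hypothetical names, and the group lists are a shortened subset of the question's lists, with 'summer' deliberately placed before 'summer time'. Each group is compiled once into a single alternation, which may also help when there are many groups.

```python
import re

colors = ['red skies', 'yellow sky', 'red', 'yellow', 'grey']
seasons = ['summer', 'summer time', 'autumn times', 'autumn', 'winter']
numbers = ['first', 'second', 'third', 'fourth']

def compile_group(words):
    # Longest entries first, so "summer time" is tried before "summer".
    ordered = sorted(words, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b")

def remove_shared(a, b, patterns):
    # Keep only the groups that match in both original strings as whole words.
    shared = [p for p in patterns if p.search(a) and p.search(b)]
    for p in shared:
        a, b = p.sub(" ", a), p.sub(" ", b)
    # Collapse the runs of spaces left behind by the removals.
    return " ".join(a.split()), " ".join(b.split())

patterns = [compile_group(g) for g in (colors, seasons, numbers)]
print(remove_shared('fourth autumnwind', 'first blue chair winter', patterns))
# ('autumnwind', 'blue chair winter')
```

Sorting by length inside compile_group means the result no longer depends on the order in which phrases are listed within a group.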
Regarding "python - How to remove common words from Python lists?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66344523/