python - How do I remove common words from Python lists?

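I have a large number of word "groups". If any word from a group appears in both column A and column B, I want to remove that group's words from both columns. How do I loop over all the groups (i.e. over the sublists in the list)?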

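The flawed code below removes the common words only for the last group, not for all three groups (lists) in stuff. [I first create an indicator for whether a string contains a word from the group, then a second indicator for whether both strings contain a word from that group. I remove the group's words only for A/B pairs in which both strings contain a word from the group.]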

How do I specify the loop correctly?

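Edit: In my proposed code, each loop iteration restarts from the original columns instead of continuing with the columns from which the previous group's words have already been removed.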

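The suggested solutions are more elegant and concise, but they also remove a word when it is part of another word (e.g. the word "foo" is correctly removed from "foo hello", but also incorrectly removed from "foobar").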
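To illustrate the difference described above, here is a minimal sketch comparing plain substring replacement with a word-boundary regular expression. The strings and variable names are made up for illustration only.

```python
import re

text_whole = "foo hello"   # "foo" appears as a whole word
text_part = "foobar"       # "foo" appears only inside a longer word

# Plain substring replacement also strips "foo" from inside "foobar".
removed_substring = text_part.replace("foo", "")

# A word-boundary pattern matches "foo" only as a whole word.
pattern = r"\b" + re.escape("foo") + r"\b"
removed_whole = re.sub(pattern, "", text_whole).strip()
untouched_part = re.sub(pattern, "", text_part)

print(removed_substring)  # bar
print(removed_whole)      # hello
print(untouched_part)     # foobar
```

The `\b` anchors are the same ones the question's own helper functions already use, which is why that code does not have the "foobar" problem.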


# Input data:

import pandas as pd

data = {'A': ['summer time third grey abc', 'yellow sky hello table', 'fourth autumnwind'],
        'B': ['defg autumn times fourth table', 'not red skies second garnet', 'first blue chair winter']
}
df = pd.DataFrame(data, columns=['A', 'B'])

                            A                               B
0  summer time third grey abc  defg autumn times fourth table
1      yellow sky hello table     not red skies second garnet
2           fourth autumnwind         first blue chair winter

# Groups of words to be removed:

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']

stuff = [colors, seasons, numbers]



# Code below only removes the last list in stuff (numbers):

import re

def fA(S,y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            y = 1
    return y


def fB(T,y):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            y = 1
    return y



def fARemove(S):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', S):
            S=re.sub(r'\b{}\b'.format(re.escape(word)), ' ', S)
    return S



def fBRemove(T):
    for word in listed:
        if re.search(r'\b' + re.escape(word) + r'\b', T):
            T=re.sub(r'\b{}\b'.format(re.escape(word)), ' ', T)
    return T

for listed in stuff:

    df['A_Ind'] = 0
    df['B_Ind'] = 0

    df['A_Ind'] = df.apply(lambda x: fA(x.A, x.A_Ind), axis=1)
    df['B_Ind'] = df.apply(lambda x: fB(x.B, x.B_Ind), axis=1)

    df['inboth'] = 0
    df.loc[((df.A_Ind == 1) & (df.B_Ind == 1)), 'inboth'] = 1

    df['A_new'] = df['A']
    df['B_new'] = df['B']

    df.loc[df.inboth == 1, 'A_new'] = df.apply(lambda x: fARemove(x.A), axis=1)
    df.loc[df.inboth == 1, 'B_new'] = df.apply(lambda x: fBRemove(x.B), axis=1)


    del df['inboth']
    del df['A_Ind']
    del df['B_Ind']
    
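    # Collapse runs of whitespace; regex=True is needed on recent pandas versions
    df['A_new'] = df['A_new'].str.replace(r'\s{2,}', ' ', regex=True)
    df['A_new'] = df['A_new'].str.strip()
    df['B_new'] = df['B_new'].str.replace(r'\s{2,}', ' ', regex=True)
    df['B_new'] = df['B_new'].str.strip()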

The expected output is:

         A_new              B_new
0     grey abc         defg table
1  hello table  not second garnet
2   autumnwind  blue chair winter

Best answer

import re

flatten_list = lambda l: [item for subl in l for item in subl]
def remove_recursive(s, l):
    while len(l) > 0:
        s = s.replace(l[0], '')
        l = l[1:]

    return re.sub(r'\ +', ' ', s).strip()


df['A_new'] = df.apply(lambda x: remove_recursive(x.A, flatten_list([l for l in stuff if (len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0)])), axis = 1)
df['B_new'] = df.apply(lambda x: remove_recursive(x.B, flatten_list([l for l in stuff if (len([e for e in l if e in x.A]) > 0 and len([e for e in l if e in x.B]) > 0)])), axis = 1)

df.head()

#            A_new              B_new
# 0  time grey abc         defg table
# 1    hello table  not second garnet
# 2           wind         blue chair

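This is similar to the code in the comments: it uses a recursive lambda to match the words, and a flattened list to collect the words of every group that matches in both columns.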
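The substring matching above is what produces "time grey abc" and "wind" instead of the expected output. As a hedged alternative, the following sketch combines the question's word-boundary regular expressions with a loop that carries each row's strings forward from one group to the next. The helper names `group_pattern` and `remove_common` are hypothetical, and the sketch assumes the groups do not share words, so testing the already-cleaned strings gives the same result as testing the originals.

```python
import re
import pandas as pd

data = {'A': ['summer time third grey abc', 'yellow sky hello table', 'fourth autumnwind'],
        'B': ['defg autumn times fourth table', 'not red skies second garnet', 'first blue chair winter']}
df = pd.DataFrame(data, columns=['A', 'B'])

colors = ['red skies', 'red sky', 'yellow sky', 'yellow skies', 'red', 'blue', 'black', 'yellow', 'green', 'grey']
seasons = ['summer times', 'summer time', 'autumn times', 'autumn time', 'spring', 'summer', 'winter', 'autumn']
numbers = ['first', 'second', 'third', 'fourth']
stuff = [colors, seasons, numbers]

def group_pattern(words):
    # Hypothetical helper: one whole-word alternation per group,
    # longest phrases first so 'summer time' is tried before 'summer'.
    ordered = sorted(words, key=len, reverse=True)
    return re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in ordered) + r')\b')

patterns = [group_pattern(group) for group in stuff]

def remove_common(a, b):
    # Hypothetical helper: for each group, strip its words from both strings
    # only when both strings contain at least one word of that group.
    for pat in patterns:
        if pat.search(a) and pat.search(b):
            a = pat.sub(' ', a)
            b = pat.sub(' ', b)
    # Collapse the leftover whitespace.
    return ' '.join(a.split()), ' '.join(b.split())

cleaned = [remove_common(a, b) for a, b in zip(df['A'], df['B'])]
df['A_new'] = [a for a, _ in cleaned]
df['B_new'] = [b for _, b in cleaned]

print(df[['A_new', 'B_new']])
```

Because the cleaned strings are passed from one group to the next inside `remove_common`, no iteration restarts from the original columns, which is the issue the question's edit describes.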

Regarding "python - How do I remove common words from Python lists?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66344523/
