python - Filtering text data in Python

I can't figure out what I'm doing wrong here. I have the code below (it's fairly simple).

def compileWordList(textList, wordDict):
    '''Function to extract words from text lines exc. stops,
        and add associated line nums'''
    i = 0;
    for row in textList:
        i = i + 1
        words = re.split('\W+', row)
        for wordPart in words:
            word = repr(wordPart)
            word = word.lower()
            if not any(word in s for s in stopsList):
                if word not in wordDict:
                    x = wordLineNumsContainer()
                    x.addLineNum(i)
                    wordDict[word] = x
                elif word in wordDict:
                    lineNumValues = wordDict[word]
                    lineNumValues.addLineNum(i)
                    wordDict[word] = lineNumValues
            elif any(word in s for s in stopsList):
                print(word)

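The code takes a string (a sentence) from a list. It then uses re.split() to break the string into whole words, which returns a list of strings (the words).

I then force each string to lowercase. Next I want to check whether the word exists in my list of stop words (English words too common to bother with). The part that checks whether word is in stopsList never seems to work, because the stop words end up in my wordDict every time. I also added the print(word) statement at the bottom to check whether it was catching them, but nothing is ever printed :(

The strings being passed in contain hundreds of stop words.

Could someone please enlighten me? Why are the strings never filtered out for being stop words?

Many thanks, Alex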

Best answer

How about this?

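from collections import defaultdict
import re

# Stop words live in a set so membership is an exact, fast lookup.
stop_words = set(['a', 'is', 'and', 'the', 'i'])
text = [ 'This is the first line in my text'
       , 'and this one is the second line in my text'
       , 'I like texts with three lines, so I added that one'
       ]
# Maps each kept word to the list of line numbers it appears on.
word_line_dict = defaultdict(list)

for line_no, line in enumerate(text, 1):
    # Split on runs of non-word characters and lowercase every piece.
    words = set(map(str.lower, re.split(r'\W+', line)))
    # re.split can yield '' when a line starts or ends with punctuation.
    words.discard('')
    words_ok = words.difference(stop_words)
    for wok in words_ok:
        word_line_dict[wok].append(line_no)

print(word_line_dict)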

Many thanks to Gnibbler for suggesting a better way to write the for loop and a more Pythonic way to handle the first insertion into the dictionary.
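To illustrate the first-insertion point, here is a minimal sketch comparing an explicit "is the key there yet?" check with collections.defaultdict. The function names and the sample pairs are made up for this illustration and are not part of the original answer.

```python
from collections import defaultdict

def index_with_check(pairs):
    """Build word -> line numbers using an explicit first-insertion check."""
    index = {}
    for word, line_no in pairs:
        if word not in index:
            index[word] = []          # first time this word is seen
        index[word].append(line_no)
    return index

def index_with_defaultdict(pairs):
    """Same result; defaultdict creates the empty list on first access."""
    index = defaultdict(list)
    for word, line_no in pairs:
        index[word].append(line_no)
    return dict(index)

# Hypothetical (word, line number) pairs, for illustration only.
pairs = [("text", 1), ("line", 1), ("text", 2)]
result_check = index_with_check(pairs)
result_default = index_with_defaultdict(pairs)
```

Both functions should build the same mapping; the defaultdict version simply removes the branch that handles the first insertion.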

Running the stop-word script prints the following (dictionary formatting aside):

{ 'added': [3]
, 'like': [3]
, 'that': [3]
, 'this': [1, 2]
, 'text': [1, 2]
, 'lines': [3]
, 'three': [3]
, 'one': [2, 3]
, 'texts': [3]
, 'second': [2]
, 'so': [3]
, 'in': [1, 2]
, 'line': [1, 2]
, 'my': [1, 2]
, 'with': [3]
, 'first': [1]
}
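As for why the check in the question never fires, a likely cause is the repr() call: repr() of a string returns the string with its quotes included, so the value being tested is "'the'" rather than "the", and it can never appear inside a plain stop word. The sketch below reproduces that behaviour. The contents of stops_list are an assumption, since the question does not show stopsList, and the helper names are made up for this illustration.

```python
import re

# Hypothetical stand-in for the asker's stopsList, which the question does not show.
stops_list = ["the", "is", "and"]

def is_stop_original(word_part):
    """Mirror the question's check: repr(), lower(), then substring tests."""
    word = repr(word_part).lower()    # "The" becomes "'the'", quotes included
    return any(word in s for s in stops_list)

def is_stop_fixed(word_part):
    """Lowercase the raw word and test exact membership instead."""
    return word_part.lower() in set(stops_list)

for part in re.split(r"\W+", "The cat is here"):
    print(part, is_stop_original(part), is_stop_fixed(part))
```

Note also that `word in s` on two strings is a substring test, so even without repr() a word such as "he" would match the stop word "the". Testing membership in a set, as the answer does, avoids both problems.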

For "python - Filtering text data in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/6528666/
