python - Removing a list of stop words from a Counter in Python

I have a function in NLTK that generates a concordance list, which looks like

concordanceList = ["this is a concordance string something", 
               "this is another concordance string blah"]

I also have another function that returns a Counter dictionary with the count of each word in concordanceList

from collections import Counter

def mostCommonWords(concordanceList):
  finalCount = Counter()
  for line in concordanceList:
    words = line.split(" ")
    currentCount = Counter(words)    # per-line word counts
    finalCount.update(currentCount)  # add them to the running total
  return finalCount

The problem I'm having is how best to remove stop words from the resulting Counter so that, when I call

mostCommonWords(concordanceList).most_common(10)

the result isn't just {"the": 100, "is": 78, "that": 57}.

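I think preprocessing the text to remove stop words is not an option, since I still need the concordance strings to be instances of grammatical language. Basically, I'm asking whether there is a simpler way to do this than creating a Counter for the stop words, setting the values low, and then making yet another Counter like so: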

stopWordCounter = Counter(the=1, that=1, so=1, and=1)
processedWordCounter = mostCommonWords(concordanceList) & stopWordCounter

That should set the count values of all the stop words to 1, but it seems hacky.

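Edit: Additionally, I'm having trouble actually making such a stopWordCounter, because if I want to include reserved words like "and", I get an invalid syntax error. Counters have easy-to-use union and intersection methods, which would make the task fairly simple; are there equivalent methods for dictionaries?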
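The syntax error comes from `and` being a Python keyword, which cannot be used as a keyword-argument name. A minimal sketch of one way around it is to build the Counter from a dict or from an iterable of words. The snake_case names and the sample counts below are made up for illustration. The sketch also shows what `&` does: it keeps only the keys present in both counters, each with the smaller of the two counts, so every word that is not a stop word is dropped.

```python
from collections import Counter

# Keyword arguments cannot be named after reserved words such as "and",
# so build the Counter from a dict or from an iterable of words instead.
stop_word_counter = Counter({"the": 1, "that": 1, "so": 1, "and": 1})
same_counter = Counter(["the", "that", "so", "and"])

# Illustrative counts, invented for this sketch.
word_counts = Counter({"the": 100, "is": 78, "concordance": 5, "and": 3})

# `&` is the multiset intersection: the minimum of the two counts per key,
# keeping only keys whose resulting count is positive.
capped = word_counts & stop_word_counter
print(capped)
```

Under these assumptions, `capped` contains only "the" and "and", which suggests the intersection approach would discard the content words rather than push the stop words down the ranking.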

Best answer

You can remove the stop words during tokenization...

from collections import Counter

stop_words = frozenset(['the', 'a', 'is'])
def mostCommonWords(concordanceList):
    finalCount = Counter()
    for line in concordanceList:
        # skip stop words while splitting each line into tokens
        words = [w for w in line.split(" ") if w not in stop_words]
        finalCount.update(words)  # update final count using the words list
    return finalCount
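If the unfiltered counts are also needed, an alternative is to filter after counting, which leaves the concordance strings untouched. The following is a sketch, not part of the original answer: `remove_stop_words` is a hypothetical helper, and the sample data is taken from the question. It also touches on the question about dictionaries: dict key views support set operations such as `-` and `&`.

```python
from collections import Counter

stop_words = frozenset(["the", "a", "is"])

def remove_stop_words(counts, stop_words):
    """Return a new Counter without the stop-word keys (hypothetical helper)."""
    return Counter({word: n for word, n in counts.items()
                    if word not in stop_words})

concordance_list = ["this is a concordance string something",
                    "this is another concordance string blah"]

# Count every word first, without touching the original strings.
counts = Counter(word for line in concordance_list for word in line.split())

# Filter afterwards; `counts` itself is left unchanged.
filtered = remove_stop_words(counts, stop_words)
print(filtered.most_common(3))

# Key views behave like sets, so the surviving words can also be
# computed with a set difference.
kept_words = counts.keys() - stop_words
```

Filtering after the fact costs one extra pass over the distinct words, but it keeps both the full and the filtered counts available.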

Regarding "python - Removing a list of stop words from a Counter in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20723133/
