python - Replacing stop words with NLTK

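I am using NLTK to replace all stop words with the string "QQQQQ". The problem is that if the input text (the one I am removing stop words from) contains more than one sentence, it does not work correctly.

I have the following code: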

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

ex_text='This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'

tokenized=word_tokenize(ex_text)

stop_words=set(stopwords.words('english'))
stop_words.add(".")  #Since I do not need punctuation, I added . and ,
stop_words.add(",")

stopword_pos=[]
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w in stop_words:
        stopword_pos.append(tokenized.index(w))

# Replacing stopwords with "QQQQQ"
for i in range(len(stopword_pos)):
    tokenized[stopword_pos[i]]='QQQQQ'

print(tokenized)

This code produces the following output:

['This', 'QQQQQ', 'QQQQQ', 'example', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'special', 'keywords', 'QQQQQ', 'sum', 'QQQQQ', 'QQQQQ', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'Another', 'list', 'is', 'QQQQQ', 'QQQQQ', 'special', 'one', 'QQQQQ', 'I', 'like', 'very', 'much', '.']

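As you may notice, it does not replace stop words such as 'is' and '.' (I added the period to the set because I do not need punctuation).

Note, however, that the 'is' and '.' in the first sentence are replaced, while the 'is' and '.' in the second sentence are not.

Another strange thing is that when I print stopword_pos, I get the following output: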

[0, 1, 2, 5, 6, 7, 10, 12, 13, 15, 16, 17, 18, 19, 20, 1, 24, 25, 0, 29, 25, 20]

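As you may notice, the numbers appear to be in ascending order, but then a '1' suddenly appears after '20' in the list, which is supposed to hold the positions of the stop words. There is also a '0' after '29' and a '20' after '25'. Perhaps this points to the problem.

So the problem is that after the first sentence, stop words are no longer replaced with "QQQQQ". Why is that?

Anything that points me in the right direction is much appreciated. I have no idea how to solve this.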

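Best answer

The problem is that .index does not return all indices of a value, only the index of its first occurrence, so you will need something similar to what is described in another question: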

stopword_pos_set = set() # creating set so that index is not added twice
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w.lower() in stop_words: 
        indices = [i for i, x in enumerate(tokenized) if x == w]
        stopword_pos_set.update(indices)

stopword_pos = list(stopword_pos_set) # convert to list

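Above, I created stopword_pos_set so that the same index is not added twice. A repeated index would only assign the same value twice, but if you print stopword_pos without using a set, you will see duplicate values.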
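To make the cause concrete, here is a minimal sketch that uses a hand-written token list (an assumption for illustration, not real word_tokenize output). It shows list.index always returning the first match, which is why later occurrences are never recorded and why repeated positions show up:

```python
# Hand-written tokens for illustration; the real list comes from word_tokenize.
tokens = ['list', 'is', 'short', '.', 'list', 'is', 'long', '.']
stop_words = {'is', '.'}

# list.index returns the position of the FIRST match only,
# so the second 'is' and the second '.' map back to positions 1 and 3.
positions_with_index = [tokens.index(w) for w in tokens if w in stop_words]
print(positions_with_index)      # [1, 3, 1, 3]

# enumerate yields the actual position of every token, so nothing is missed.
positions_with_enumerate = [i for i, w in enumerate(tokens) if w in stop_words]
print(positions_with_enumerate)  # [1, 3, 5, 7]
```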

One suggestion: in the code above I changed the check to if w.lower() in stop_words:, so that the stop word check is case-insensitive. Otherwise 'This' is not the same as 'this'.

Another suggestion is to use the .update method to add several items to the stop_words set at once, with stop_words.update([".", ","]), instead of calling .add multiple times.
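Both suggestions can be sketched in isolation. The small stop word set below is a hypothetical stand-in for stopwords.words('english'), so the snippet runs without NLTK:

```python
# Hypothetical stand-in for set(stopwords.words('english')).
stop_words = {'this', 'is', 'an'}

# set.update adds several items in one call, equivalent to repeated .add calls.
stop_words.update(['.', ','])

tokens = ['This', 'is', 'fine', '.']

# A case-sensitive check misses 'This' because the set only holds 'this'.
case_sensitive = [w in stop_words for w in tokens]
# Lowercasing each token first makes the check case-insensitive.
case_insensitive = [w.lower() in stop_words for w in tokens]

print(case_sensitive)    # [False, True, False, True]
print(case_insensitive)  # [True, True, False, True]
```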

You can try the following:

# Assumes the NLTK tokenizer and stopwords data have already been downloaded
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

ex_text = 'This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'

tokenized = word_tokenize(ex_text)
stop_words = set(stopwords.words('english'))
stop_words.update([".", ","])  # Since I do not need punctuation, I added . and ,

stopword_pos_set = set()  # a set, so that no index is stored twice
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w.lower() in stop_words:
        indices = [i for i, x in enumerate(tokenized) if x == w]
        stopword_pos_set.update(indices)

stopword_pos = sorted(stopword_pos_set)  # sorted list of positions

# Replacing stopwords with "QQQQQ"
for pos in stopword_pos:
    tokenized[pos] = 'QQQQQ'

print(tokenized)
print(stopword_pos)
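As an alternative sketch (not part of the original answer), the positions can be collected and the tokens replaced in a single pass with enumerate, which avoids rescanning the list for every stop word. The helper name replace_stopwords and the tiny stop word set are hypothetical; in practice you would pass in the tokens from word_tokenize and the NLTK stop word set:

```python
def replace_stopwords(tokens, stop_words, placeholder='QQQQQ'):
    """Return (new_tokens, positions) with every stop word replaced."""
    new_tokens = []
    positions = []
    for i, token in enumerate(tokens):
        # Lowercase so that 'This' matches the stop word 'this'.
        if token.lower() in stop_words:
            positions.append(i)
            new_tokens.append(placeholder)
        else:
            new_tokens.append(token)
    return new_tokens, positions


# Hand-written example input standing in for word_tokenize output.
tokens = ['Another', 'list', 'is', 'a', 'very', 'special', 'one', '.']
stop_words = {'is', 'a', 'very', '.'}

new_tokens, positions = replace_stopwords(tokens, stop_words)
print(new_tokens)
print(positions)
```

Because each position is visited exactly once, the positions come out in ascending order with no duplicates, so neither a set nor a sort is needed.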

For python - Replacing stop words with NLTK, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51686618/
