I'm new to Python and I'm trying to remove stop words from my file using NLTK. The code works, but it splits off punctuation: if my text is a tweet containing a mention (@user), I end up with "@" and "user" as separate tokens. Later I need to compute word frequencies, and I need mentions and hashtags to stay intact. My code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import codecs

arquivo = open('newfile.txt', encoding="utf8")
linha = arquivo.readline()
while linha:
    stop_word = set(stopwords.words("portuguese"))
    word_tokens = word_tokenize(linha)
    filtered_sentence = [w for w in word_tokens if not w in stop_word]
    filtered_sentence = []
    for w in word_tokens:
        if w not in stop_word:
            filtered_sentence.append(w)
    fp = codecs.open("stopwords.txt", "a", "utf-8")
    for words in filtered_sentence:
        fp.write(words + " ")
    fp.write("\n")
    linha = arquivo.readline()
EDIT: I'm not sure this is the best way to do it, but I fixed it like this:

for words in filtered_sentence:
    fp.write(words)
    if words not in string.punctuation:
        fp.write(" ")
fp.write("\n")
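The idea behind that workaround can be shown in isolation: by only writing a space after non-punctuation tokens, a leading "@" gets glued back onto the word that follows it. A minimal sketch (the token list here stands in for what `word_tokenize` would produce):

```python
import string

# Tokens as word_tokenize would split "@user bom dia"
tokens = ["@", "user", "bom", "dia"]

# Rebuild the line: skip the space after punctuation so "@" sticks to "user"
parts = []
for w in tokens:
    parts.append(w)
    if w not in string.punctuation:
        parts.append(" ")

line = "".join(parts).strip()
print(line)  # @user bom dia
```

Note this glues "@" to the *following* token, which works for mentions, but it would also glue ordinary punctuation (commas, periods) to the next word the same way.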
Best answer
Instead of `word_tokenize`, you can use the Twitter-aware tokenizer provided by nltk:
from nltk.tokenize import TweetTokenizer
...
tknzr = TweetTokenizer()
...
word_tokens = tknzr.tokenize(linha)
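`TweetTokenizer` keeps mentions and hashtags as single tokens, so the frequency count sees "@user" and "#tag" whole. A quick self-contained check (the example tweet is made up; no corpus download is needed for this tokenizer):

```python
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()

# Mentions and hashtags survive as single tokens
tokens = tknzr.tokenize("@maria bom dia #segunda")
print(tokens)  # ['@maria', 'bom', 'dia', '#segunda']
```

The stop-word filtering from the question works unchanged on these tokens; only the tokenization step differs.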
Regarding "Python - NLTK separating punctuation", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39402983/