I'm new to text processing in Python, and I'm trying to extract the words from a text document of about 5000 lines.
I wrote the script below:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # Import the stop word list
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def Description_to_words(raw_Description):
    # 1. Remove HTML
    Description_text = BeautifulSoup(raw_Description).get_text()
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text)
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    # 4. Remove stop words
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if not w in stops]
    # 5. Stem words
    words = ([stemmer.stem(w) for w in words])
    # 6. Join the words back into one string separated by space,
    #    and return the result.
    return " ".join(meaningful_words)
clean_Description = Description_to_words(train["Description"][15])
But when I test it, the resulting words are not stemmed. Can anyone help me figure out what the problem is and what I am doing wrong in the "Description_to_words" function?
Also, when I run the stemming commands on their own, as below, they work.
from nltk.tokenize import sent_tokenize, word_tokenize
>>> words = word_tokenize("MOBILE APP - Unable to add reading")
>>>
>>> for w in words:
... print(stemmer.stem(w))
...
mobil
app
-
unabl
to
add
read
Best Answer
Here is each step of your function, fixed.
Remove the HTML (passing an explicit parser avoids a warning from BeautifulSoup):
Description_text = BeautifulSoup(raw_Description, "html.parser").get_text()
Remove non-letters, but do not strip the whitespace yet. You can also simplify your regular expression a little:
letters_only = re.sub(r"[^\w\s]", " ", Description_text)
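As a quick standalone illustration of the difference (not from the original answer): \w also keeps digits and underscores, which the original [^a-zA-Z] pattern would have stripped.

```python
import re

text = "MOBILE APP v2.1 - can't add reading"

# Original pattern: everything except ASCII letters becomes a space,
# so the digits in "v2.1" disappear entirely.
strict = re.sub("[^a-zA-Z]", " ", text)
print(strict.split())  # ['MOBILE', 'APP', 'v', 'can', 't', 'add', 'reading']

# Simplified pattern: keep word characters (letters, digits, underscore)
# and whitespace; only punctuation is replaced.
loose = re.sub(r"[^\w\s]", " ", text)
print(loose.split())   # ['MOBILE', 'APP', 'v2', '1', 'can', 't', 'add', 'reading']
```

Whether that difference matters depends on your data; for ticket titles like "MOBILE APP v2.1", the looser pattern preserves version numbers.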
Convert to lower case and split into individual words. I suggest using word_tokenize again here:
from nltk.tokenize import word_tokenize
words = word_tokenize(letters_only.lower())
Remove stop words:
stops = set(stopwords.words("english"))
meaningful_words = [w for w in words if w not in stops]
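In isolation, the filtering step looks like this (a toy sketch; the hand-picked set below merely stands in for nltk's much longer English stop-word list):

```python
# Tiny stand-in for stopwords.words("english").
stops = {"to", "the", "a", "is"}

words = ["mobile", "app", "unable", "to", "add", "reading"]

# Keep only the words that are not stop words.
meaningful_words = [w for w in words if w not in stops]
print(meaningful_words)  # ['mobile', 'app', 'unable', 'add', 'reading']
```

Building the set once outside a loop matters for performance: membership tests on a set are O(1), so filtering 5000 lines stays cheap.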
Stemming. Here is the other problem: stem meaningful_words, not words:
return ' '.join(stemmer.stem(w) for w in meaningful_words)
Regarding python - Stemming words with NLTK (python), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45670532/