python - How to create a function that tokenizes and stems words

My code

def tokenize_and_stem(text):

    tokens = [sent for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(text)]

    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    stems = stemmer.stem(filtered_tokens)

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)

I get this error

AttributeError                            Traceback (most recent call last)
in <module>
     13     return stems
     14
---> 15 words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
     16 print(words_stemmed)

in tokenize_and_stem(text)
      9
     10     # extract filtered_tokens
---> 11     stems = stemmer.stem(filtered_tokens)
     12
     13     return stems

/usr/local/lib/python3.6/dist-packages/nltk/stem/snowball.py in stem(self, word)
   1415
   1416
-> 1417         word = word.lower()
   1418
   1419

AttributeError: 'list' object has no attribute 'lower'

Best answer

Your code

def tokenize_and_stem(text):

    tokens = [sent for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(text)]

    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    stems = stemmer.stem(filtered_tokens)

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)

The error message says: "word = word.lower() ... if word in self.stopwords or len(word) <= 2: 'list' object has no attribute 'lower'".
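To see why the call fails, here is a minimal sketch that needs no NLTK. The function toy_stem is a hypothetical stand-in for a stemmer's stem() method, not NLTK's implementation; the only thing it shares with the real one is that it starts by calling word.lower(), as the traceback shows.

```python
def toy_stem(word):
    # Hypothetical stand-in for stemmer.stem(): like the real method in
    # the traceback, it begins by lowercasing its argument.
    word = word.lower()
    # Crude illustration only: strip a trailing "ing" or "s" from longer words.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


tokens = ["Today", "is", "his", "wedding"]

# Passing the whole list fails: a list has no .lower() method.
try:
    toy_stem(tokens)
except AttributeError as exc:
    error_message = str(exc)

# Calling the function once per token works.
stems = [toy_stem(t) for t in tokens]

print(error_message)
print(stems)
```

The same shape applies to the real stemmer: call stem() once per token inside a list comprehension instead of handing it the list.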

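The error is not only about .lower(): stem() also checks len(word), so it expects a single string rather than a list. If you fix only the stemming line and leave the tokens line (line 3) as you wrote it, you will not get an error, but the output will look like this: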

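["Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding.",
 "Today (May 19, 2016) is his only daughter's wedding."]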
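The repetition comes from the nested comprehension: it keeps sent instead of word, and the inner loop tokenizes the full text instead of the current sentence. A rough sketch of both patterns follows. The helpers split_sentences and split_words are hypothetical, naive stand-ins for nltk.sent_tokenize and nltk.word_tokenize, used here only so the example runs without NLTK.

```python
text = "Hello world. Bye now."


def split_sentences(t):
    # Naive stand-in for nltk.sent_tokenize: split on periods.
    return [s.strip() + "." for s in t.split(".") if s.strip()]


def split_words(t):
    # Naive stand-in for nltk.word_tokenize: separate periods, split on spaces.
    return t.replace(".", " .").split()


# Pattern from the question: emits the sentence once per word of the full text.
wrong = [sent for sent in split_sentences(text) for word in split_words(text)]

# Corrected pattern: emits each word of each sentence.
right = [word for sent in split_sentences(text) for word in split_words(sent)]

print(len(wrong), wrong[0])
print(right)
```

With the question's pattern, every element is a whole sentence, so the later filtering and stemming steps never see individual words.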

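Here is your fixed code. As in the question, it assumes that re and nltk are already imported and that stemmer is an NLTK stemmer instance (the traceback points to snowball.py).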

def tokenize_and_stem(text):

    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]

    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    stems = [stemmer.stem(t) for t in filtered_tokens if len(t) > 0]

    return stems

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)

So I only changed lines 3 and 7.

A similar question about python - How to create a function that tokenizes and stems words can be found on Stack Overflow: https://stackoverflow.com/questions/58956995/
