python - word_tokenize not working after sent_tokenize in a Python dataframe

Tags: python nlp

I am trying to tokenize my data using sent_tokenize and word_tokenize.

Below is my dummy data:

**text**
Hello world, how are you
I am fine, thank you!

I am trying to tokenize it with the code below:

import pandas as pd
from nltk.tokenize import word_tokenize, sent_tokenize
Corpus=pd.read_csv(r"C:\Users\Desktop\NLP\corpus.csv",encoding='utf-8')

Corpus['text']=Corpus['text'].apply(sent_tokenize)
Corpus['text_new']=Corpus['text'].apply(word_tokenize)

But I get the following error:

Traceback (most recent call last):
  File "C:/Users/gunjit.bedi/Desktop/NLP Project/Topic Classification.py", line 24, in <module>
    Corpus['text_new']=Corpus['text'].apply(word_tokenize)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 3192, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/src\inference.pyx", line 1472, in pandas._libs.lib.map_infer
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\__init__.py", line 95, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1241, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1291, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1291, in <listcomp>
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1281, in span_tokenize
    for sl in slices:
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1322, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 313, in _pair_iter
    prev = next(it)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\tokenize\punkt.py", line 1295, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object

I have tried a number of things; for example, if I comment out the sent_tokenize line, word_tokenize works on its own, but the two do not work together.

Best Answer

You are getting the error because nltk.word_tokenize expects its input to be a string.

When you apply nltk.sent_tokenize to the text column, each cell is converted into a list of sentences.

import pandas as pd
from nltk.tokenize import sent_tokenize

df = pd.DataFrame({'text': ['Hey. Hello', 'hello world!! I am akshay', 'I m fine']})
df['text'] = df['text'].apply(sent_tokenize)
print(df['text'])

Output:

                           text
0                 [Hey., Hello]
1  [hello world!!, I am akshay]
2                    [I m fine]
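Passing one of those lists straight to word_tokenize reproduces the TypeError from the traceback; a minimal check, using the first list from the output above:

from nltk.tokenize import word_tokenize

# word_tokenize first calls sent_tokenize, whose Punkt regex requires a
# string, so a list input fails with
# "TypeError: expected string or bytes-like object".
word_tokenize(['Hey.', 'Hello'])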

Try this:

df['sent'] = df['text'].apply(lambda x: sent_tokenize(str(x)))

df['text_new'] = [word_tokenize(str(i)) for i in df['sent']]
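Note that str(x) stringifies the whole list, so the brackets and quotes end up as tokens. A minimal alternative sketch, assuming df['text'] already holds the sentence lists produced by sent_tokenize above, is to tokenize each sentence inside the list instead:

# Hypothetical alternative: run word_tokenize on every sentence in each
# cell's list, avoiding the str() round-trip and its stray bracket/quote tokens.
df['text_new'] = df['text'].apply(lambda sents: [word_tokenize(s) for s in sents])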

For this question about word_tokenize not working after sent_tokenize in a Python dataframe, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52832950/
