python - Removing multiple \n in Python before sentence tokenization

Tags: python web-scraping nlp nltk data-cleaning

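I'm new to programming and am teaching myself from books and Stack Overflow. I'm trying to remove the multiple instances of \n from a chat corpus and then tokenize the text into sentences. If I don't remove the \n, the string looks like this: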

['answers for 10-19-20sUser139 ... hi 10-19-20sUser101 ;)\n\n\n\n\n\n\n\n\n\nI like it when you do it, 10-19-20sUser83\n\n\n\n\n\n\n\n\n\n\n\niamahotnipwithpics\n\n\n\n10-19-20sUser20 go plan the wedding!']

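I've tried several different approaches, such as chomps, line, and rstrip, but none of them seems to work. I may be using them incorrectly. The full code looks like this: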

import nltk, re, pprint
from nltk.corpus import nps_chat
chat= nltk.Text(nps_chat.words())
from nltk.corpus import NPSChatCorpusReader
from bs4 import BeautifulSoup
chat=nltk.corpus.nps_chat.raw()
soup= BeautifulSoup(chat)
soup.get_text()
text =soup.get_text()
print(text[:40])
print(len(text))
from nltk.tokenize import sent_tokenize
sent_chat = sent_tokenize(text)
len(sent_chat)
text[:] = [line.rstrip('\n') for line in text]
print(len(sent_chat))
print(sent_chat[:40])

When I use the line approach, I get this error:

Traceback (most recent call last):
  File "C:\Python34\Lib\idlelib\testsubjects\sentencelen.py", line 57, in <module>
    text[:] = [line.rstrip('\n') for line in text]
TypeError: 'str' object does not support item assignment

Any help?

Best answer

>>> x = 'answers for 10-19-20sUser139 ... hi 10-19-20sUser101 ;)\n\n\n\n\n\n\n\n\n\nI like it when you do it, 10-19-20sUser83\n\n\n\n\n\n\n\n\n\n\n\niamahotnipwithpics\n\n\n\n10-19-20sUser20 go plan the wedding!'
>>> y = "".join([i if i !="\n" else "\t" for i in x])
>>> z = [i for i in y.split('\t') if i]
>>> z
['answers for 10-19-20sUser139 ... hi 10-19-20sUser101 ;)', 'I like it when you do it, 10-19-20sUser83', 'iamahotnipwithpics', '10-19-20sUser20 go plan the wedding!']
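The TypeError in the question comes from `text[:] = ...`: a Python `str` is immutable, so slice assignment is not allowed, and iterating over a string yields single characters rather than lines. As an alternative to the join/split approach above, here is a minimal sketch using the standard `re` module. The helper names `split_on_newlines` and `collapse_newlines` are made up for illustration, and the sample string is a shortened version of the one in the question.

```python
import re

# Shortened version of the sample string from the question.
text = ('answers for 10-19-20sUser139 ... hi 10-19-20sUser101 ;)\n\n\n'
        'I like it when you do it, 10-19-20sUser83\n\n\n\n'
        'iamahotnipwithpics\n\n'
        '10-19-20sUser20 go plan the wedding!')


def split_on_newlines(s):
    """Split on runs of one or more newlines and drop empty pieces."""
    return [piece for piece in re.split(r'\n+', s) if piece]


def collapse_newlines(s, replacement=' '):
    """Replace each run of newlines with one replacement string,
    then trim leading and trailing whitespace."""
    return re.sub(r'\n+', replacement, s).strip()


# Strings are immutable, so bind the result to a new name
# instead of assigning into text[:].
pieces = split_on_newlines(text)
cleaned = collapse_newlines(text)

print(pieces)
print(cleaned)
```

If the goal is sentence tokenization rather than one list item per chat line, the `cleaned` string could presumably be passed to `sent_tokenize` in place of the raw text from the question.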

Regarding "python - Removing multiple \n in Python before sentence tokenization", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26661256/
