python - Sentence tokenization for texts that contain quotes

Code:

from pprint import pprint
from nltk.tokenize import sent_tokenize
from unidecode import unidecode
pprint(sent_tokenize(unidecode(text)))

Output:

['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
 'Finally they pushed you out of the cold emergency room.',
 'I failed to protect you.',
 '"Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.']

Input:

After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

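The closing quotation mark should be included in the previous sentence, not attached to " Li.
The tokenizer fails at ." How can I fix this?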

Edit:
Explaining how the text is extracted.

html = open(path, "r").read()                           # reads the HTML source
article = extractor.extract(raw_html=html)              # extracts the article content
text = unidecode(article.cleaned_text)                  # transliterates unicode to ASCII

Here, article.cleaned_text is unicode. The idea behind using unidecode is to convert the curly quote character “ to the ASCII ".
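If only the quote characters matter, a lighter alternative to full transliteration can be sketched with the standard library. This is not how unidecode works internally; the mapping table and the name normalize_quotes are illustrative assumptions, and only the four common curly quote characters are covered.

```python
# Hypothetical helper: maps typographic quotes to their ASCII equivalents.
# Unlike unidecode, every other non-ASCII character is left untouched.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(QUOTE_MAP)


sample = "\u201cI failed to protect you.\u201d"
print(normalize_quotes(sample))
```

The trade-off is that accented letters and other symbols survive unchanged, which may or may not be what the rest of the pipeline expects.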

The solution from @alvas gives an incorrect result:

['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
 'Finally they pushed you out of the cold emergency room.',
 'I failed to protect you.',
 '"',
 'Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.'
]

Edit 2:
(Updated) nltk and python versions

python -c "import nltk; print nltk.__version__"
3.0.4
python -V
Python 2.7.9

Best answer

I'm not sure what the desired output is, but I think you may need some paragraph segmentation before nltk.sent_tokenize, i.e.:

>>> text = """After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
... 
... Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015."""
>>> from nltk import sent_tokenize
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     for sent in sent_tokenize(pg):
...             print sent
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
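The same paragraph-first idea can be wrapped in a small reusable function. The sketch below targets Python 3 and takes the sentence tokenizer as a parameter, so nltk.sent_tokenize can be passed in directly. The names sentences_by_paragraph and naive_tokenize are hypothetical, and naive_tokenize is only a crude stand-in so that the example runs without NLTK.

```python
import re


def sentences_by_paragraph(text, tokenize):
    """Split text on blank lines, then tokenize each paragraph separately.

    tokenize is any callable that maps a string to a list of sentences,
    for example nltk.sent_tokenize.
    """
    # A newline, optional whitespace, and another newline end a paragraph.
    paragraphs = re.split(r"\n\s*\n", text)
    return [tokenize(pg.strip()) for pg in paragraphs if pg.strip()]


def naive_tokenize(paragraph):
    # Stand-in only: splits on whitespace that follows ., ! or ?
    # (optionally followed by a closing double quote).
    return re.split(r'(?<=[.!?])\s+|(?<=[.!?]")\s+', paragraph)


demo = 'He said: "Go. Now."\n\nShe left. He stayed.'
for sentences in sentences_by_paragraph(demo, naive_tokenize):
    print(sentences)
```

Because each paragraph is tokenized on its own, a closing quote at the end of one paragraph can never be attached to the first sentence of the next.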

Possibly, you might want the strings within the double quotes as well; if so, you could try this:

>>> import re
>>> str_in_doublequotes = r'"([^"]*)"'
>>> re.findall(str_in_doublequotes, text)
['Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.']

Or maybe you need this:

>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: 
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
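A variation on the tab-placeholder trick, sketched for Python 3: re.split with a capturing group returns the quoted spans interleaved with the text around them, so no placeholder character is needed and a literal tab in the input cannot confuse the split. The function name split_keeping_quotes is hypothetical, and the tokenizer is passed in (nltk.sent_tokenize would be the natural choice); the lambda in the demo is a trivial stand-in.

```python
import re

# The capturing group makes re.split keep the quoted spans in its result.
QUOTED = re.compile(r'("[^"]*")')


def split_keeping_quotes(paragraph, tokenize):
    """Return sentences outside quotes plus each quoted span as one unit."""
    units = []
    # With one capturing group, odd indices hold the quoted spans and
    # even indices hold the text between them.
    for index, chunk in enumerate(QUOTED.split(paragraph)):
        chunk = chunk.strip()
        if not chunk:
            continue
        if index % 2 == 1:
            units.append(chunk)            # a whole quotation, kept intact
        else:
            units.extend(tokenize(chunk))  # ordinary text, tokenized
    return units


demo = 'He posted online: "Go. Now." Then he left.'
print(split_keeping_quotes(demo, lambda chunk: [chunk]))
```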

When reading from a file, try to use the io package:

alvas@ubi:~$ echo -e """After Du died of suffocation, her boyfriend posted a heartbreaking message online: \"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.\"\n\nLi Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.""" > in.txt
alvas@ubi:~$ cat in.txt 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from nltk import sent_tokenize
>>> text = io.open('in.txt', 'r', encoding='utf8').read()
>>> print text
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

>>> for sent in sent_tokenize(text):
...     print sent
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
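For readers on Python 3, the same read can be sketched as follows. There, io.open is the same function as the built-in open, so the explicit encoding argument is the part that matters. The helper name read_text and the temporary file are assumptions made only for this illustration.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf8"):
    """Read a whole file as text, decoding it with an explicit encoding."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Write a small UTF-8 file containing curly quotes, then read it back.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf8") as handle:
    handle.write("\u201cI failed to protect you.\u201d\n")

content = read_text(path)
os.remove(path)
print(repr(content))
```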

And with the paragraph and quote extraction trick:

>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: 
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

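For the magic that joins the sentence before the quotation to the quotation itself (don't blink, it looks almost exactly like the code above; the only difference is the trailing comma in print sent, which suppresses the newline in Python 2):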

>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent,
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online:  "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

The problem with the code above is that it is limited to sentences like this one:

After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

and it cannot handle:

"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you," her boyfriend posted a heartbreaking message online after Du died of suffocation.
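One possible way to cover both shapes is to post-process the tokenizer output instead of pre-splitting the text: keep joining consecutive sentences for as long as the number of double quotes seen so far is odd. This is a minimal sketch, not part of NLTK. The name merge_open_quotes is hypothetical, the input lists are hand-written stand-ins for tokenizer output, and the approach assumes straight double quotes that are balanced and whose closing quote stays with its own sentence.

```python
def merge_open_quotes(sentences, quote='"'):
    """Join consecutive sentences until every opened quotation is closed."""
    merged, pending, seen = [], [], 0
    for sent in sentences:
        pending.append(sent)
        seen += sent.count(quote)
        if seen % 2 == 0:
            # All quotes opened so far are closed: emit one unit.
            merged.append(" ".join(pending))
            pending, seen = [], 0
    if pending:
        # An unbalanced quote at the end: keep what is left as one unit.
        merged.append(" ".join(pending))
    return merged


quote_last = ['He posted: "Go.', 'Now.', 'Leave."', 'Li Na left.']
quote_first = ['"Go.', 'Now.', 'Leave," he posted.']
print(merge_open_quotes(quote_last))
print(merge_open_quotes(quote_first))
```

The result keeps each quotation together with its attribution as a single unit, which matches the joined output above rather than a strict sentence-by-sentence split.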

Just to be sure, my python/nltk versions are:

$ python -c "import nltk; print nltk.__version__"
'3.0.3'
$ python -V
Python 2.7.6

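Beyond the computational side of text processing, there is also something subtly different about the grammar of the text in the question.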

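The fact that the quotation is introduced by a colon (:) is atypical of traditional English grammar. It may have been popularized in Chinese news writing, because in Chinese: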

啊杜窒息死亡后，男友在网上发了令人心碎的消息: "..."

In traditional English, in a very prescriptive grammatical sense, it would be:

After Du died of suffocation, her boyfriend posted a heartbreaking message online, "..."

and a statement following the quotation would be signalled by a closing comma instead of a full stop, e.g.:

"...," her boyfriend posted a heartbreaking message online after Du died of suffocation.

Regarding python - Sentence tokenization for texts that contain quotes, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32003294/
