python - NLTK PunktSentenceTokenizer ellipsis splitting

Tags: python python-2.7 nltk tokenize

I'm using NLTK's PunktSentenceTokenizer, and I'm facing text that contains multiple sentences separated by an ellipsis character (...). Here is the example I'm working with:

>>> from nltk.tokenize import PunktSentenceTokenizer
>>> pst = PunktSentenceTokenizer()
>>> pst.sentences_from_text("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")
['Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...']

As you can see, the sentences are not split apart. Is there a way to make it work as I expect (i.e. return a list containing four items)?

Additional info: I tried using the debug_decisions function to understand why these decisions were made. I got the following results:

>>> g = pst.debug_decisions("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")

>>> [x for x in g]
[{'break_decision': None,
  'collocation': False,
  'period_index': 27,
  'reason': 'default decision',
  'text': 'service... Cashier',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'cashier',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 47,
  'reason': 'default decision',
  'text': 'rude... Drive',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'drive',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 72,
  'reason': 'default decision',
  'text': 'hours... The',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'the',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'}]

Unfortunately, I can't make sense of what these dictionaries mean. The tokenizer does seem to detect the ellipses, but for some reason it decides not to split the sentences at those symbols. Any ideas?

Thanks!

Best Answer

Why don't you just use the split function? str.split('...')
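For the exact text in the question, a plain string split is enough to get the four pieces (here the surrounding whitespace is stripped and the empty trailing piece is dropped):

```python
# The original review text from the question
text = ("Horrible customer service... Cashier was rude... "
        "Drive thru took hours... The tables were not clean...")

# Split on the ellipsis, strip whitespace, and drop the empty final piece
sentences = [s.strip() for s in text.split("...") if s.strip()]
print(sentences)
# → ['Horrible customer service', 'Cashier was rude',
#    'Drive thru took hours', 'The tables were not clean']
```

Note that this discards the ellipses themselves and would also split a mid-sentence "..." — it is a quick fix, not a general sentence tokenizer.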

Edit: I got it working by training the tokenizer on the Reuters corpus; I imagine you could train it on your own data instead:

from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import reuters

# Train Punkt's unsupervised sentence-boundary model on the Reuters corpus
pst = PunktSentenceTokenizer()
pst.train(reuters.raw())

text = "Batts did not take questions or give details of the report's findings... He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office. Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April."
print(pst.sentences_from_text(text))

Result:

["Batts did not take questions or give details of the report's findings...", "He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office.", 'Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April.']
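If you want the ellipsis to stay attached to each sentence (matching the four-item list the question asks for) without training a new model, a regex split on whitespace that follows "..." is one lightweight alternative — a sketch, not a general-purpose tokenizer:

```python
import re

text = ("Horrible customer service... Cashier was rude... "
        "Drive thru took hours... The tables were not clean...")

# Split on whitespace that is preceded by an ellipsis; the lookbehind
# keeps the "..." attached to the sentence it ends
sentences = [s.strip() for s in re.split(r"(?<=\.\.\.)\s+", text)]
print(sentences)
# → ['Horrible customer service...', 'Cashier was rude...',
#    'Drive thru took hours...', 'The tables were not clean...']
```

Like the str.split approach, this assumes every "..." marks a sentence boundary, which may not hold for text where ellipses appear mid-sentence.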

Regarding python - NLTK PunktSentenceTokenizer ellipsis splitting, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29970846/
