python - NLTK PunktSentenceTokenizer ellipsis splitting

Tags: python python-2.7 nltk tokenize

I'm using NLTK's PunktSentenceTokenizer, and I'm facing text that contains multiple sentences separated by an ellipsis character (...). Here is the example I'm working with:

>>> from nltk.tokenize import PunktSentenceTokenizer
>>> pst = PunktSentenceTokenizer()
>>> pst.sentences_from_text("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")
['Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...']

As you can see, the sentences are not split apart. Is there a way to make it work as I expect (i.e. return a list containing four items)?

Additional info: I tried using the debug_decisions function to understand why these decisions were made. I got the following results:

>>> g = pst.debug_decisions("Horrible customer service... Cashier was rude... Drive thru took hours... The tables were not clean...")

>>> [x for x in g]
[{'break_decision': None,
  'collocation': False,
  'period_index': 27,
  'reason': 'default decision',
  'text': 'service... Cashier',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'cashier',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 47,
  'reason': 'default decision',
  'text': 'rude... Drive',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'drive',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'},
 {'break_decision': None,
  'collocation': False,
  'period_index': 72,
  'reason': 'default decision',
  'text': 'hours... The',
  'type1': '...',
  'type1_in_abbrs': False,
  'type1_is_initial': False,
  'type2': 'the',
  'type2_is_sent_starter': False,
  'type2_ortho_contexts': set(),
  'type2_ortho_heuristic': 'unknown'}]

Unfortunately, I can't make sense of what these dictionaries mean. The tokenizer does seem to detect the ellipses, but for some reason it decides not to split the sentences at those symbols. Any ideas?

Thanks!

Best Answer

Why don't you just use the split function? str.split('...')
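For the exact text in the question, a plain string split is enough to get the four pieces (here the surrounding whitespace is stripped and the empty trailing piece is dropped):

```python
# The original review text from the question
text = ("Horrible customer service... Cashier was rude... "
        "Drive thru took hours... The tables were not clean...")

# Split on the ellipsis, strip whitespace, and drop the empty final piece
sentences = [s.strip() for s in text.split("...") if s.strip()]
print(sentences)
# → ['Horrible customer service', 'Cashier was rude',
#    'Drive thru took hours', 'The tables were not clean']
```

Note that this discards the ellipses themselves and would also split a mid-sentence "..." — it is a quick fix, not a general sentence tokenizer.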

Edit: I got it working by training the tokenizer on the Reuters corpus; I imagine you could train it on your own data instead:

from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import reuters

# Train Punkt's unsupervised sentence-boundary model on the Reuters corpus
pst = PunktSentenceTokenizer()
pst.train(reuters.raw())

text = "Batts did not take questions or give details of the report's findings... He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office. Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April."
print(pst.sentences_from_text(text))

Result:

["Batts did not take questions or give details of the report's findings...", "He did say that the city's police department would continue to work on the case under the direction of the prosecutor's office.", 'Gray was injured around the time he was arrested by Baltimore police and put in a police van on 12 April.']
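If you want the ellipsis to stay attached to each sentence (matching the four-item list the question asks for) without training a new model, a regex split on whitespace that follows "..." is one lightweight alternative — a sketch, not a general-purpose tokenizer:

```python
import re

text = ("Horrible customer service... Cashier was rude... "
        "Drive thru took hours... The tables were not clean...")

# Split on whitespace that is preceded by an ellipsis; the lookbehind
# keeps the "..." attached to the sentence it ends
sentences = [s.strip() for s in re.split(r"(?<=\.\.\.)\s+", text)]
print(sentences)
# → ['Horrible customer service...', 'Cashier was rude...',
#    'Drive thru took hours...', 'The tables were not clean...']
```

Like the str.split approach, this assumes every "..." marks a sentence boundary, which may not hold for text where ellipses appear mid-sentence.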

Regarding python - NLTK PunktSentenceTokenizer ellipsis splitting, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29970846/
