python - Splitting sentences from prose

Tags: python regex parsing text nlp

I'm trying to split prose into sentences while preserving the dialogue markers. So a sentence like

“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.” This is a non-dialogue sentence!

should return the list

[
    "“Dirty, Mr. Jones?”",
    "“Look at my shoes!”",
    "“Not a speck on them.”",
    "This is a non-dialogue sentence!"
]

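I'm struggling to preserve the end-of-sentence punctuation while also preserving the period on Mr. I'm also struggling to insert the quotation marks, because the list currently returned is ['“Dirty, Mr. Jones”', '“Look at my shoes”', '“Not a speck on them”', '“”', 'This is a non-dialogue sentence', ''] and I don't know why I'm getting the two empty elements. How can I fix these problems?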

Here is my code (eventually this will parse an entire book, but for now I'm testing it on a single passage):

def get_all_sentences(corpus):

  sentences_in_paragraph = []

  dialogue = False
  dialogue_sentences = ""
  other_sentences = ""

  example_paragraph = "“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.”  This is a non-dialogue sentence!"

  example_paragraph = example_paragraph.replace("\n", "") # remove newline

  for character in example_paragraph:
    if character == "“":
        dialogue = True
        continue
    if character == "”":
        dialogue = False
        continue

    if dialogue:
        dialogue_sentences += character
    else:
        other_sentences += character

  sentences_in_paragraph  = list(map(lambda x: "“" + x.strip() + "”", re.split("(?<!Mr|Ms)(?<!Mrs)[.!?]", dialogue_sentences))) 
  sentences_in_paragraph += list(map(lambda x: x.strip(), re.split("(?<!Mr|Ms)(?<!Mrs)[.!?]", other_sentences)))

  print(sentences_in_paragraph)

Best answer

If you add print statements that show the intermediate steps, you can see where the problem is introduced:

sentence_splitter_regex = "(?<!Mr|Ms)(?<!Mrs)[.!?]"
dialogue_sentences_list = re.split(sentence_splitter_regex, dialogue_sentences)
print("dialogue sentences:", dialogue_sentences_list)
other_sentences_list = re.split(sentence_splitter_regex, other_sentences)
print("other sentences:", other_sentences_list)

sentences_in_paragraph  = list(map(lambda x: "“" + x.strip() + "”", dialogue_sentences_list)) 
sentences_in_paragraph += list(map(lambda x: x.strip(), other_sentences_list))

This prints:

dialogue sentences: ['Dirty, Mr. Jones', ' Look at my shoes', ' Not a speck on them', '']
other sentences: ['    This is a non-dialogue sentence', '']

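re.split leaves an empty element at the end, because each string ends with a delimiter. You can fix this by processing the result with a list comprehension whose if clause filters out the empty strings: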

[sentence for sentence in sentences_with_whitespace if sentence.strip() != '']
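
To make the trailing empty string concrete, here is a minimal standalone sketch. It reuses the pattern from the question; the sample string and the variable names pieces and cleaned are chosen only for illustration.

```python
import re

# Same pattern as in the question: split on ., ! or ?
# unless the character is preceded by Mr, Ms or Mrs.
pattern = r"(?<!Mr|Ms)(?<!Mrs)[.!?]"

# The input ends with a delimiter, so re.split reports an
# empty string for the text after that final delimiter.
pieces = re.split(pattern, "Look at my shoes! Not a speck on them.")
print(pieces)

# Strip each piece and drop the ones that are empty after stripping.
cleaned = [piece.strip() for piece in pieces if piece.strip() != ""]
print(cleaned)
```

The first print shows the extra '' at the end of the list; the second shows the list after filtering.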

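You should put this code into a new function, split_sentences_into_list, to keep your code organized. It also makes sense to move the .strip() handling out of get_all_sentences and into this function, by changing the first part of the comprehension to sentence.strip().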

import re

def split_sentences_into_list(sentences_string):
    sentence_splitter_regex = "(?<!Mr|Ms)(?<!Mrs)[.!?]"
    sentences_with_whitespace = re.split(sentence_splitter_regex, sentences_string)
    return [sentence.strip() for sentence in sentences_with_whitespace if sentence.strip() != '']

def get_all_sentences(corpus):
    sentences_in_paragraph = []

    dialogue = False
    dialogue_sentences = ""
    other_sentences = ""

    example_paragraph = "“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.”    This is a non-dialogue sentence!"

    example_paragraph = example_paragraph.replace("\n", "") # remove newline

    for character in example_paragraph:
        if character == "“":
            dialogue = True
            continue
        if character == "”":
            dialogue = False
            continue

        if dialogue:
            dialogue_sentences += character
        else:
            other_sentences += character

    dialogue_sentences_list = split_sentences_into_list(dialogue_sentences)
    other_sentences_list = split_sentences_into_list(other_sentences)

    sentences_in_paragraph  = list(map(lambda x: "“" + x + "”", dialogue_sentences_list)) 
    sentences_in_paragraph += other_sentences_list

    print(sentences_in_paragraph)

get_all_sentences(None)

This produces the expected output, with no empty elements:

['“Dirty, Mr. Jones”', '“Look at my shoes”', '“Not a speck on them”', 'This is a non-dialogue sentence']
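
The end punctuation is still missing from this list, because re.split discards whatever the pattern matches. One possible way to keep it, offered as a sketch rather than as part of the original answer, is to split on the whitespace that follows the punctuation instead of on the punctuation itself. The helper name split_keeping_punctuation and the constant SENTENCE_BOUNDARY are hypothetical, and the pattern assumes that sentences are separated by whitespace and that Mr., Ms. and Mrs. are the only abbreviations to protect.

```python
import re

# Assumption: a sentence boundary is whitespace that follows ., ! or ?,
# except when the period belongs to the titles Mr., Ms. or Mrs.
# Each lookbehind has a fixed width, as Python's re module requires.
SENTENCE_BOUNDARY = re.compile(r"(?<!Mr\.)(?<!Ms\.)(?<!Mrs\.)(?<=[.!?])\s+")

def split_keeping_punctuation(sentences_string):
    """Hypothetical helper: split into sentences, keeping end punctuation."""
    # Only the whitespace is matched, so the punctuation stays
    # attached to the sentence that precedes it.
    pieces = SENTENCE_BOUNDARY.split(sentences_string.strip())
    return [piece for piece in pieces if piece != ""]

dialogue_sentences = "Dirty, Mr. Jones? Look at my shoes! Not a speck on them."
other_sentences = "    This is a non-dialogue sentence!"

result = ["“" + s + "”" for s in split_keeping_punctuation(dialogue_sentences)]
result += split_keeping_punctuation(other_sentences)
print(result)
```

With the example paragraph this yields the four sentences from the question, each with its punctuation and with Mr. Jones left intact. A real book will contain other abbreviations (Dr., St., initials), so the list of lookbehinds would need to grow, or a dedicated sentence tokenizer may be a better fit.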

By the way, standard Python style is to use a list comprehension rather than map with a lambda where possible. In this case it makes your code shorter:

# from
sentences_in_paragraph  = list(map(lambda x: "“" + x + "”", dialogue_sentences_list)) 
# to
sentences_in_paragraph  = ["“" + x + "”" for x in dialogue_sentences_list]

Source: the Stack Overflow question "python - Splitting sentences from prose": https://stackoverflow.com/questions/49780473/
