I'm trying to split sentences while preserving the dialogue markers. So a sentence like
“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.” This is a non-dialogue sentence!
should return the list
[
"“Dirty, Mr. Jones?”",
"“Look at my shoes!”",
"“Not a speck on them.”",
"This is a non-dialogue sentence!"
]
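I'm struggling to keep the sentence-ending punctuation while also preserving the period on Mr. I'm also struggling to insert the quotation marks, because the list currently returned is ['“Dirty, Mr. Jones”', '“Look at my shoes”', '“Not a speck on them”', '“”', 'This is a non-dialogue sentence', '']
I don't know why I'm getting the two empty elements. How can I fix these problems?
Here is my code (eventually this will parse a whole book, but for now I'm testing it on a single phrase):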
import re

def get_all_sentences(corpus):
    sentences_in_paragraph = []
    dialogue = False
    dialogue_sentences = ""
    other_sentences = ""
    example_paragraph = "“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.” This is a non-dialogue sentence!"
    example_paragraph = example_paragraph.replace("\n", "")  # remove newline
    for character in example_paragraph:
        if character == "“":
            dialogue = True
            continue
        if character == "”":
            dialogue = False
            continue
        if dialogue:
            dialogue_sentences += character
        else:
            other_sentences += character
    sentences_in_paragraph = list(map(lambda x: "“" + x.strip() + "”", re.split("(?<!Mr|Ms)(?<!Mrs)[.!?]", dialogue_sentences)))
    sentences_in_paragraph += list(map(lambda x: x.strip(), re.split("(?<!Mr|Ms)(?<!Mrs)[.!?]", other_sentences)))
    print(sentences_in_paragraph)
Best answer
If you add print statements to show the intermediate steps, you can see where the problem is introduced:
sentence_splitter_regex = "(?<!Mr|Ms)(?<!Mrs)[.!?]"
dialogue_sentences_list = re.split(sentence_splitter_regex, dialogue_sentences)
print("dialogue sentences:", dialogue_sentences_list)
other_sentences_list = re.split(sentence_splitter_regex, other_sentences)
print("other sentences:", other_sentences_list)
sentences_in_paragraph = list(map(lambda x: "“" + x.strip() + "”", dialogue_sentences_list))
sentences_in_paragraph += list(map(lambda x: x.strip(), other_sentences_list))
dialogue sentences: ['Dirty, Mr. Jones', ' Look at my shoes', ' Not a speck on them', '']
other sentences: [' This is a non-dialogue sentence', '']
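re.split leaves an empty element at the end, because each input string ends with a delimiter. You can fix this by processing the result with a list comprehension whose if clause excludes empty strings: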
[sentence for sentence in sentences_with_whitespace if sentence.strip() != '']
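The trailing empty string and the effect of the filter can be checked in isolation. The following is a minimal sketch that reuses the splitter pattern from the question; the sample string and the variable names `sample`, `parts` and `cleaned` are made up for illustration.

```python
import re

# Same splitter pattern as in the question.
sentence_splitter_regex = "(?<!Mr|Ms)(?<!Mrs)[.!?]"

# Made-up sample text that, like the example paragraph, ends with a delimiter.
sample = "One. Two!"

# re.split also returns whatever follows the last delimiter, here an empty string.
parts = re.split(sentence_splitter_regex, sample)
print(parts)  # ['One', ' Two', '']

# Filtering on the stripped value drops empty and whitespace-only pieces.
cleaned = [part.strip() for part in parts if part.strip() != '']
print(cleaned)  # ['One', 'Two']
```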
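You should move this filtering comprehension into a new function, split_sentences_into_list, to keep your code organized. It also makes sense to move the .strip() handling out of get_all_sentences and into this function, by changing the first part of the comprehension to sentence.strip().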
import re

def split_sentences_into_list(sentences_string):
    sentence_splitter_regex = "(?<!Mr|Ms)(?<!Mrs)[.!?]"
    sentences_with_whitespace = re.split(sentence_splitter_regex, sentences_string)
    return [sentence.strip() for sentence in sentences_with_whitespace if sentence.strip() != '']

def get_all_sentences(corpus):
    sentences_in_paragraph = []
    dialogue = False
    dialogue_sentences = ""
    other_sentences = ""
    example_paragraph = "“Dirty, Mr. Jones? Look at my shoes! Not a speck on them.” This is a non-dialogue sentence!"
    example_paragraph = example_paragraph.replace("\n", "")  # remove newline
    for character in example_paragraph:
        if character == "“":
            dialogue = True
            continue
        if character == "”":
            dialogue = False
            continue
        if dialogue:
            dialogue_sentences += character
        else:
            other_sentences += character
    dialogue_sentences_list = split_sentences_into_list(dialogue_sentences)
    other_sentences_list = split_sentences_into_list(other_sentences)
    sentences_in_paragraph = list(map(lambda x: "“" + x + "”", dialogue_sentences_list))
    sentences_in_paragraph += other_sentences_list
    print(sentences_in_paragraph)

get_all_sentences(None)
This produces the following output, with the empty elements gone:
['“Dirty, Mr. Jones”', '“Look at my shoes”', '“Not a speck on them”', 'This is a non-dialogue sentence']
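The output above still drops the sentence-ending punctuation that the question asks to keep, because re.split discards whatever the pattern matches. One possible way around this, sketched here as an assumption and not as part of the original answer, is to split on the whitespace that follows the punctuation instead of on the punctuation itself. The names `SENTENCE_BOUNDARY` and `split_keeping_punctuation` are hypothetical.

```python
import re

# Split on whitespace that comes right after ., ! or ?, unless that period
# belongs to Mr., Ms. or Mrs. Only the whitespace is consumed as the
# delimiter, so the punctuation stays attached to its sentence.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])(?<!Mr\.)(?<!Ms\.)(?<!Mrs\.)\s+")


def split_keeping_punctuation(sentences_string):
    # Hypothetical helper for illustration only.
    pieces = SENTENCE_BOUNDARY.split(sentences_string.strip())
    # An empty input yields [''], so drop empty pieces.
    return [piece for piece in pieces if piece != ""]


# The two strings that the character loop in get_all_sentences builds.
dialogue = "Dirty, Mr. Jones? Look at my shoes! Not a speck on them."
other = " This is a non-dialogue sentence!"

result = ["“" + s + "”" for s in split_keeping_punctuation(dialogue)]
result += split_keeping_punctuation(other)
print(result)
# ['“Dirty, Mr. Jones?”', '“Look at my shoes!”', '“Not a speck on them.”', 'This is a non-dialogue sentence!']
```

This sketch only splits where whitespace follows the punctuation, and it knows about the same three titles as the original pattern, so other abbreviations would still cause unwanted splits.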
By the way, standard Python style is to use a list comprehension rather than map and lambda where possible. In this case it makes your code shorter:
# from
sentences_in_paragraph = list(map(lambda x: "“" + x + "”", dialogue_sentences_list))
# to
sentences_in_paragraph = ["“" + x + "”" for x in dialogue_sentences_list]
Regarding python - splitting sentences from prose, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49780473/