What is the most efficient way to change text to the correct case based on its punctuation and to fix the formatting (whitespace, etc.)?
the qUiCk BROWN fox:: jumped. over , the lazy dog.
Desired result:
The quick brown fox: jumped. Over, the lazy dog.
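Best answer
You tagged the question "regex", but I don't recommend trying to solve this with regular expressions. It is best handled with a simple state machine.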
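To make the idea concrete before the full version, here is a minimal sketch of a two-state machine that handles capitalization only. The names `capitalize_sentences` and `at_start` are made up for illustration and are not part of the original answer, and the sketch deliberately ignores spacing and repeated punctuation.

```python
def capitalize_sentences(text):
    """Two-state sketch: uppercase the first letter of each sentence."""
    at_start = True  # state: still waiting for the first letter of a sentence
    out = []
    for ch in text.lower():
        if at_start and ch.isalpha():
            # First letter of a sentence: capitalize it and leave the start state.
            out.append(ch.upper())
            at_start = False
        else:
            out.append(ch)
            if ch in ".!?":
                # A sentence terminator puts the machine back in the start state.
                at_start = True
    return "".join(out)

print(capitalize_sentences("hello WORLD. how are you? fine."))
# -> Hello world. How are you? Fine.
```

Each character is examined once and the only thing remembered between characters is the current state, which is what keeps this style easy to extend with further states.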
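Here is a simple state machine that is sufficient to handle your example. If you try it on other text you may find cases it doesn't handle; I hope you will find its design clear and that you can easily modify it to suit your purposes.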
import string

s = "the qUiCk BROWN fox:: jumped. over , the lazy dog."
s_correct = "The quick brown fox: jumped. Over, the lazy dog."

def chars_from_lines(lines):
    # Flatten an iterable of lines (e.g. an open file) into single characters.
    for line in lines:
        for ch in line:
            yield ch

# The three states of the machine.
start, in_sentence, saw_space = range(3)

punct = set(string.punctuation)
# Punctuation that is collapsed when repeated; '.' and '-' are allowed to repeat.
punct_non_repeat = punct - {'.', '-'}
end_sentence_chars = {'.', '!', '?'}

def edit_sentences(seq):
    state = start
    ch_punct_last = None

    for ch in seq:
        ch = ch.lower()

        if ch == ch_punct_last:
            # Don't pass repeated punctuation.
            continue
        elif ch in punct_non_repeat:
            ch_punct_last = ch
        else:
            # Not punctuation to worry about, so forget the last.
            ch_punct_last = None

        if state == start and ch.isspace():
            # Skip whitespace before the first character of a sentence.
            continue
        elif state == start:
            state = in_sentence
            yield ch.upper()
        elif state == in_sentence and ch in end_sentence_chars:
            state = start
            yield ch
            yield ' '
        elif state == in_sentence and not ch.isspace():
            yield ch
        elif state == in_sentence and ch.isspace():
            # Hold the space back until we know what follows it.
            state = saw_space
            continue
        elif state == saw_space and ch.isspace():
            # stay in state saw_space
            continue
        elif state == saw_space and ch in punct:
            # stay in state saw_space
            yield ch
        elif state == saw_space and ch.isalnum():
            state = in_sentence
            yield ' '
            yield ch

#with open("input.txt") as f:
#    s_result = ''.join(edit_sentences(chars_from_lines(f)))

# A space is emitted after every sentence terminator, so strip the one
# left over at the very end of the text.
s_result = ''.join(edit_sentences(s)).rstrip()

print(s_result)
print(s_correct)
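Since other text may expose cases the machine does not handle, a small harness can make them visible while you modify it. The following is only a sketch: `find_failures`, `stand_in` and the second test case are hypothetical and not part of the original answer, and `stand_in` is a deliberately naive fixer included solely so the block runs on its own.

```python
def find_failures(fix, cases):
    """Return (input, expected, actual) for every case that `fix` gets wrong."""
    failures = []
    for text, expected in cases:
        actual = fix(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

# The first case is the example from the question; the second is hypothetical.
cases = [
    ("the qUiCk BROWN fox:: jumped. over , the lazy dog.",
     "The quick brown fox: jumped. Over, the lazy dog."),
    ("hello   world", "Hello world"),
]

def stand_in(text):
    # Naive stand-in: collapse whitespace and capitalize the first letter only.
    return " ".join(text.split()).capitalize()

for text, expected, actual in find_failures(stand_in, cases):
    print(repr(text), "->", repr(actual), "expected", repr(expected))
```

To check the state machine instead of the stand-in, you could pass something like `lambda t: ''.join(edit_sentences(t)).rstrip()` as `fix`.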
Regarding "python - casing text according to punctuation", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/13756794/