python - Correcting text case based on punctuation

Tags: python text formatting state-machine

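What is the most effective way to change text to the correct case based on its punctuation and to fix its formatting (whitespace and so on)?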

the qUiCk BROWN fox:: jumped. over , the lazy    dog.

Desired result:

The quick brown fox: jumped. Over, the lazy dog.

Best answer

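You tagged your question "regex", but I don't recommend trying to solve this with regular expressions. It is best handled with a simple state machine.

Here is a simple state machine that is sufficient to handle your example. If you try it on other text, you will probably find cases it does not handle; I hope you will find its design clear and that you can easily modify it to suit your purposes.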

import string

s = "the qUiCk BROWN fox:: jumped. over , the lazy    dog."
s_correct = "The quick brown fox: jumped. Over, the lazy dog."


def chars_from_lines(lines):
    for line in lines:
        for ch in line:
            yield ch

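# The three states of the machine.
start, in_sentence, saw_space = range(3)

punct = set(string.punctuation)
# Punctuation that is collapsed when repeated; '.' and '-' may repeat
# (for example "..." and "--").
punct_non_repeat = punct - {'.', '-'}
end_sentence_chars = {'.', '!', '?'}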

def edit_sentences(seq):
    state = start
    ch_punct_last = None

    for ch in seq:
        ch = ch.lower()

        if ch == ch_punct_last:
            # Don't pass repeated punctuation.
            continue
        elif ch in punct_non_repeat:
            ch_punct_last = ch
        else:
            # Not punctuation to worry about, so forget the last.
            ch_punct_last = None

        if state == start and ch.isspace():
            continue
        elif state == start:
            state = in_sentence
            yield ch.upper()

        elif state == in_sentence and ch in end_sentence_chars:
            state = start
            yield ch
            yield ' '
        elif state == in_sentence and not ch.isspace():
            yield ch
        elif state == in_sentence and ch.isspace():
            state = saw_space
            continue

        elif state == saw_space and ch.isspace():
            # stay in state saw_space
            continue
        elif state == saw_space and ch in punct:
            # stay in state saw_space
            yield ch
        elif state == saw_space and ch.isalnum():
            state = in_sentence
            yield ' '
            yield ch

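# To process a file instead, feed its characters through chars_from_lines:
#with open("input.txt") as f:
#    s_result = ''.join(edit_sentences(chars_from_lines(f)))

# The machine emits a space after every sentence terminator, so strip the
# one left over after the final period.
s_result = ''.join(edit_sentences(s)).rstrip()

print(s_result)
print(s_correct)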
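
To see the state idea in isolation, here is a minimal sketch that handles only the capitalization part, with two states instead of three. It is not part of the answer's code: the names State, END_SENTENCE_CHARS and capitalize_sentences are hypothetical, and it assumes that whitespace and repeated punctuation are left untouched.

```python
from enum import Enum, auto


class State(Enum):
    START = auto()        # waiting for the first letter of a sentence
    IN_SENTENCE = auto()  # inside a sentence


END_SENTENCE_CHARS = {'.', '!', '?'}


def capitalize_sentences(text):
    """Uppercase the first letter of each sentence, lowercase the rest."""
    state = State.START
    out = []
    for ch in text:
        if state is State.START and ch.isalpha():
            # First letter of a new sentence.
            out.append(ch.upper())
            state = State.IN_SENTENCE
        else:
            out.append(ch.lower())
            if ch in END_SENTENCE_CHARS:
                # A terminator sends the machine back to START.
                state = State.START
    return ''.join(out)


print(capitalize_sentences("hello WORLD. how ARE you? fine!"))
# prints: Hello world. How are you? Fine!
```

Adding the whitespace and repeated-punctuation handling to this sketch would mean adding a state like saw_space and tracking the last punctuation character, which is what the full version above does.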

Regarding python - Correcting text case based on punctuation, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/13756794/
