python - Regex to add a comma after abbreviations

Tags: python python-3.x regex regex-group

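I want to add a comma and a space (", ") after abbreviations. An abbreviation is defined as one or more letters followed by a dot, followed by one or more letters, with this pattern repeated two or more times. For example, these are considered abbreviations: A.b.C. a.b. ab.cd. ab.cde. ab. cd.ef.gh. These are not abbreviations: a.bA. B

I do not want to add a comma:

  • if the last dot of the abbreviation is the end of the given text,
  • if the abbreviation is followed by optional whitespace and a capital letter, or
  • if the abbreviation is followed by optional whitespace and another punctuation mark.

Given the following test sentences: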

test_str = """This is an example e.g. sentence and this is with i.e. text and two abbreviations S.T.R. and K.LM.NO.P. as example with acronym.
            but in here it shouldn't catch it because after that there is space and dot g.k. . Also here it shouldn't detect because the next sentence starts with capital A.BC.D.
            And this is a normal sentence. Followed by another normal sentence. This contains only one letter A. and is not abbreviation.
            This shouldn't match i.e., since it contains already a comma. I like to read books such as e.g. book 1 or i.e. book2.
            A.B.C. is an abbreviation that should match. A.B.! is an abbreviation that shouldn't match because it has ! after the abbreviation. 
            A.B.? is an abbreviation that shouldn't match because it has ? after the abbreviation. 
            A.B. ; is an abbreviation that shouldn't match because it has a space and ; after the abbreviation.
            a.b.c.d. is an abbreviation that should match.
            a.b.c., is an abbreviation that shouldn't match because it already has a comma. A.B is not an abbreviation because it contains only one dot.
            Another abbreviation that should not match j.j.L.o.U.h."""

I would like the output to be:

output_text = """This is an example e.g., sentence and this is with i.e., text and two abbreviations S.T.R., and K.LM.NO.P., as example with acronym.
            but in here it shouldn't catch it because after that there is space and dot g.k. . Also here it shouldn't detect because the next sentence starts with capital A.BC.D.
            And this is a normal sentence. Followed by another normal sentence. This contains only one letter A. and is not abbreviation.
            This shouldn't match i.e., since it contains already a comma. I like to read books such as e.g., book 1 or i.e., book2.
            A.B.C., is an abbreviation that should match. A.B.! is an abbreviation that shouldn't match because it has ! after the abbreviation. 
            A.B.? is an abbreviation that shouldn't match because it has ? after the abbreviation. 
            A.B. ; is an abbreviation that shouldn't match because it has a space and ; after the abbreviation.
            a.b.c.d., is an abbreviation that should match.
            a.b.c., is an abbreviation that shouldn't match because it already has a comma. A.B is not an abbreviation because it contains only one dot.
            Another abbreviation that should not match j.j.L.o.U.h."""

This is the regex I am currently using:

regex = r"(\b(?:[A-Za-z]\.){2,}(?!\s*[,.;?!-]))"

But it produces the following output:

This is an example e.g., sentence and this is with i.e., text and two abbreviations S.T.R., and K.LM.NO.P. as example with acronym. but in here it shouldn't catch it because after that there is space and dot g.k. . Also here it shouldn't detect because the next sentence starts with capital A.BC.D. And this is a normal sentence. Followed by another normal sentence. This contains only one letter A. and is not abbreviation. This shouldn't match i.e., since it contains already a comma. I like to read books such as e.g., book 1 or i.e., book2. A.B.C., is an abbreviation that should match. A.B.! is an abbreviation that shouldn't match because it has ! after the abbreviation. A.B.? is an abbreviation that shouldn't match because it has ? after the abbreviation. A.B. ; is an abbreviation that shouldn't match because it has a space and ; after the abbreviation. a.b.c.d., is an abbreviation that should match. a.b., c., is an abbreviation that shouldn't match because it already has a comma. A.B is not an abbreviation because it contains only one dot.
Another abbreviation that should not match j.j.L.o.U.h.,

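My regex fails in three cases (shown in bold in the original post). The expected results are "K.LM.NO.P.,", "a.b.c.," and "j.j.L.o.U.h.": the first should be detected as an abbreviation and get a comma, the second already has punctuation after its last dot, and the last is at the end of the given text.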

Is there a way to achieve this? Any help is much appreciated!

Accepted answer

You can use this regex for matching:

(?<=\.[a-zA-Z])\.(?=\s[a-z])

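and replace each match with the string ., (the matched dot plus a comma). The space after the abbreviation is left in place because the lookahead does not consume it.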
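A minimal sketch of how this match-and-replace could be applied in Python with re.sub. The function name add_commas and the sample strings are illustrative; the samples are fragments taken from the question's test text.

```python
import re

# Pattern from the answer: a dot preceded by ".<letter>" and followed by
# one whitespace character and a lowercase letter.
ABBREV_DOT = re.compile(r"(?<=\.[a-zA-Z])\.(?=\s[a-z])")


def add_commas(text: str) -> str:
    # Only the dot itself is consumed, so it is replaced by ".,";
    # the whitespace that follows is untouched.
    return ABBREV_DOT.sub(".,", text)


samples = [
    "an example e.g. sentence",   # comma expected after "e.g."
    "K.LM.NO.P. as example",      # comma expected after "K.LM.NO.P."
    "dot g.k. . Also",            # followed by space and dot: unchanged
    "i.e., since",                # already has a comma: unchanged
    "letter A. and is",           # only one dot: unchanged
    "j.j.L.o.U.h.",               # end of text: unchanged
]
for sample in samples:
    print(add_commas(sample))
```

Because the lookahead requires a lowercase letter after the whitespace, abbreviations followed by a capital letter, punctuation, or the end of the text are skipped without any extra negative conditions.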


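Regex details:

  • (?<=\.[a-zA-Z]) : asserts that a dot and a letter come immediately before the dot being matched
  • \. : matches a dot
  • (?=\s[a-z]) : asserts that the matched dot is followed by one whitespace character and a lowercase letter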
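One consequence of the lookbehind is that it inspects only a single letter before the final dot, so an abbreviation whose last segment has several letters (such as ab.cd. from the question's definition) is not matched. The sketch below contrasts the answer's pattern with a hypothetical variant that is not part of the answer: it matches the whole abbreviation with [A-Za-z]+ segments and keeps the same positive lookahead. The names VARIANT_REGEX and add_commas_variant are assumptions made for illustration.

```python
import re

# Pattern from the answer.
ANSWER_REGEX = re.compile(r"(?<=\.[a-zA-Z])\.(?=\s[a-z])")

# Hypothetical variant (not from the answer): two or more "letters + dot"
# segments, followed by one whitespace character and a lowercase letter.
VARIANT_REGEX = re.compile(r"\b(?:[A-Za-z]+\.){2,}(?=\s[a-z])")


def add_commas_variant(text: str) -> str:
    # \g<0> is the whole match, i.e. the abbreviation including its last dot.
    return VARIANT_REGEX.sub(r"\g<0>,", text)


# The answer's pattern leaves a multi-letter final segment alone ...
print(ANSWER_REGEX.sub(".,", "ab.cd. next"))
# ... while the variant adds the comma.
print(add_commas_variant("ab.cd. next"))

# The variant still skips the cases the question wants left untouched.
print(add_commas_variant("a.b.c., is"))
print(add_commas_variant("j.j.L.o.U.h."))
```

Requiring whitespace in the lookahead also prevents a backtracked partial match such as "a.b." inside "a.b.c.," because the shorter match would be followed by a letter rather than whitespace.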

Regarding "python - Regex to add a comma after abbreviations", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/76015812/
