python - Concatenating terms using the substitute method through regex

Tags: python regex string regex-group python-re

Summary of the problem: I have written a generic regex that captures two groups from a sentence. I then need to concatenate the third term of the second group onto the first group. The regex uses the word "and" as the separator between the two groups. For example:

Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'

Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'

The regex I have tried:

import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin." 
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))

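The regex captures the groups, but the line that calls the substitute method raises an error: TypeError: 'NoneType' object is not subscriptable. Any suggestions or help with the problem above would be greatly appreciated.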

Best answer

Split solution

Although this is not a regex solution, it does work:

from string import punctuation

x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
x = x.split()
for idx, word in enumerate(x):
    if word == "and":
        # strip punctuation or we will get skin. instead of skin
        x[idx] = x[idx + 2].strip(punctuation) + " and"
print(' '.join(x))

The output is:

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

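This solution avoids inserting into the list directly, because doing so while iterating would shift the indices. Instead, we replace the first "and" in the list with "synthesis and" and the second "and" with "skin and", then rejoin the split string.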
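The same idea can be sketched as a small function that builds a new list instead of overwriting entries in place. The function name `distribute_shared_word` and the bounds check are assumptions added for illustration; the check simply leaves an "and" untouched when fewer than two words follow it, a case where the loop above would raise an IndexError.

```python
from string import punctuation


def distribute_shared_word(sentence):
    """Copy the second word after each "and" in front of that "and"."""
    words = sentence.split()
    result = []
    for idx, word in enumerate(words):
        # Only rewrite when two more words actually follow this "and".
        if word == "and" and idx + 2 < len(words):
            # Strip punctuation so we insert "skin" rather than "skin."
            result.append(words[idx + 2].strip(punctuation))
        result.append(word)
    return " ".join(result)


text = ('Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused '
        'by WbC-2 of acnes in human face and animals skin.')
print(distribute_shared_word(text))
```

Because the result goes into a separate list, the indices of `words` never change during the loop.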

Regex solution

If you insist on a regex solution, I suggest using re.findall with a pattern that contains a single and, since that is more general for this problem:

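from string import punctuation
import re
# x became a list in the split solution above, so start again from the original string
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
# raw string, so that \s reaches the regex engine unchanged
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
result = ''.join([f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x)])
print(result)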

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

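We use strip(punctuation) again because skin. is captured: we do not want to lose the punctuation at the end of the sentence, but we do want to drop it from the copy inserted in the middle of the sentence.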
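As a quick illustration of what strip(punctuation) does to the captured words, the sketch below runs it on a few sample strings. The third string is a made-up example; the point is that only leading and trailing punctuation is removed, so the hyphen inside SAC-1 survives.

```python
from string import punctuation

print("skin.".strip(punctuation))    # skin
print("SAC-1".strip(punctuation))    # SAC-1 (interior hyphen is kept)
print("(face),".strip(punctuation))  # face
```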

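Here is our pattern:

(.*?)\sand\s(.*?)\s([^\s]+)
  1. (.*?)\s: captures everything before "and"; the whitespace right before "and" is matched but is not part of the group
  2. \s(.*?)\s: captures the word immediately after "and"
  3. ([^\s]+): captures everything that is not whitespace up to the next whitespace (that is, the second word after "and"). This ensures we also capture the punctuation.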
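Since the question asked about the substitute method, here is a hedged sketch of the same pattern used with re.sub and a replacement function. The helper names `expand` and `distribute_with_sub` are illustrative, not part of the original answer. Two points it tries to show: all three groups in this pattern are mandatory, so group() never returns None inside the callback (the pattern in the question makes its second group optional with `?`, which is where the TypeError comes from), and re.sub leaves any text after the last match in place, whereas joining the findall results would drop it.

```python
import re
from string import punctuation

# Pattern from the answer, written as a raw string.
PATTERN = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")


def expand(match):
    # Every group is mandatory here, so none of these can be None.
    before, first, shared = match.group(1), match.group(2), match.group(3)
    return f"{before} {shared.strip(punctuation)} and {first} {shared}"


def distribute_with_sub(sentence):
    return PATTERN.sub(expand, sentence)


text = ('Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused '
        'by WbC-2 of acnes in human face and animals skin.')
print(distribute_with_sub(text))

# Text after the last match is kept by re.sub.
print(distribute_with_sub("red and green apples. They were ripe."))

# An optional group that does not take part in the match yields None.
optional = re.search(r"(a)(b)?", "a")
print(optional.group(2))
```

If an optional group is really needed, one option is to check `match.group(n) is not None` in the callback before indexing into it.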

Regarding python - Concatenating terms using the substitute method through regex, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/67881649/
