python - Concatenating terms using the substitute method via regular expression

Summary of the problem: I have written a generic regex to capture two groups from a sentence. I then need to concatenate the third term of the second group onto the first group. I used the word "and" in the regex as the separator between the two groups. For example:

Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'

Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'

The regex I have tried:

import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin." 
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))

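The regex is able to capture the groups, but the line that calls the substitute method raises an error: TypeError: 'NoneType' object is not subscriptable. Any suggestions or help with the above problem would be much appreciated.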
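The error can be reproduced in isolation. The sketch below uses a tiny pattern and sample string that are illustrative assumptions, not taken from the question. It shows that an optional group which does not participate in a match yields None from group(), so indexing it raises the TypeError:

```python
import re

# Hypothetical minimal pattern: group 2 is optional, like the "(and ...)?" group above.
demo_pattern = re.compile(r"(\w+)(\s+and\s+\w+)?")

match = demo_pattern.match("face")
# Group 2 did not participate in the match, so group(2) is None.
print(match.group(2))

try:
    match.group(2)[2]
except TypeError as exc:
    # Indexing None is what triggers the error reported above.
    print(exc)

# One possible guard: fall back to an empty string when the group is missing.
safe_second = match.group(2) or ""
```

Any match where the optional second group is absent will hit this case inside the sub callback.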

Best answer

Split solution

Although this is not a regex solution, it does work:

from string import punctuation

x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
x = x.split()
for idx, word in enumerate(x):
    if word == "and":
        # strip punctuation or we will get skin. instead of skin
        x[idx] = x[idx + 2].strip(punctuation) + " and"
print(' '.join(x))

The output is:

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

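This solution avoids inserting directly into the list, since that would cause index problems while you iterate. Instead, we replace the first "and" in the list with "synthesis and" and the second "and" with "skin and", then rejoin the split string.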
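The same idea can be sketched as a small function that builds a new list instead of modifying the one being iterated. The function name distribute_shared_word and its conjunction parameter are hypothetical, and the sketch assumes the shared term is always exactly two words after the conjunction:

```python
from string import punctuation


def distribute_shared_word(sentence, conjunction="and"):
    """Copy the second word after each conjunction in front of it.

    Hypothetical helper for illustration; assumes the shared term is
    always exactly two words after the conjunction.
    """
    words = sentence.split()
    result = []
    for idx, word in enumerate(words):
        # Only rewrite when two more words actually follow the conjunction.
        if word == conjunction and idx + 2 < len(words):
            shared = words[idx + 2].strip(punctuation)
            result.append(f"{shared} {word}")
        else:
            result.append(word)
    return " ".join(result)


text = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
        "caused by WbC-2 of acnes in human face and animals skin.")
print(distribute_shared_word(text))
```

The bounds check means a conjunction too close to the end of the sentence is left unchanged rather than raising an IndexError.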

Regex solution

If you insist on a regex solution, I suggest using re.findall with a pattern containing a single and, as this is more general for the problem:

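from string import punctuation
import re

# x must be the original string here (the split solution rebinds x to a list)
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
# raw string so that \s is not treated as an invalid string escape
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
result = ''.join([f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x)])
print(result)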

Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.

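We use strip(punctuation) again because skin. is captured with its period: we do not want to lose the punctuation at the end of the sentence, but we do want to drop it from the copy inserted in the middle of the sentence.

Here is our pattern: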

(.*?)\sand\s(.*?)\s([^\s]+)

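(.*?)\s: captures everything before "and"; the \s matches the space in front of "and" but is not part of the group
\s(.*?)\s: captures the word immediately after "and"; the surrounding whitespace is matched but not captured
([^\s]+): captures everything that is not whitespace up to the next whitespace (i.e. the second word after "and"). This ensures we also capture the punctuation.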
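Since the question originally used sub, a variant with re.sub is also possible. The following is a minimal sketch, not part of the original answer; the names and_pattern and repeat_shared_word are hypothetical. It only matches the text from "and" onward, so everything outside the matches is left untouched, and both groups are mandatory, so group() never returns None inside the callback:

```python
import re
from string import punctuation

text = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
        "caused by WbC-2 of acnes in human face and animals skin.")

# Group 1: the word right after "and"; group 2: the shared word after that.
and_pattern = re.compile(r"\s+and\s+(\S+)\s+(\S+)")


def repeat_shared_word(match):
    """Put the shared word (minus punctuation) in front of "and" as well."""
    first_word, shared_word = match.group(1), match.group(2)
    return f" {shared_word.strip(punctuation)} and {first_word} {shared_word}"


result = and_pattern.sub(repeat_shared_word, text)
print(result)
```

Like the findall version, this assumes the shared term is exactly the second word after "and".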

Regarding python - Concatenating terms using the substitute method via regular expression, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67881649/
