Summary of problem: I have written a generic regex to capture two groups from a sentence. I then need to concatenate the 3rd term of the 2nd group to the 1st group. I have used the word "and" in the regex as the partition that separates the two groups of the sentence. For example:
Input = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
Output = 'Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.'
The regex I have tried:
import re
string_ = "Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin."
regex_pattern = re.compile(r"\b([A-Za-z]*-\d+\s*|[A-Za-z]+\s*)\s+(and\s*[A-Za-z]*-\d+\s*[A-Za-z]*|and\s*[A-Za-z]+\s*[A-Za-z]+)?")
print(regex_pattern.findall(string_))
print(regex_pattern.sub(lambda x: x.group(1) + x.group(2)[2], string_))
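The regex is able to capture the groups, but the line with the substitute method raises an error: TypeError: 'NoneType' object is not subscriptable. Any suggestions or help with the above problem would be appreciated.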
Best answer
Split solution
Although this is not a regex solution, it does work:
from string import punctuation
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
x = x.split()
for idx, word in enumerate(x):
    if word == "and":
        # strip punctuation or we will get skin. instead of skin
        x[idx] = x[idx + 2].strip(punctuation) + " and"
print(' '.join(x))
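The output is:
Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.
This solution avoids inserting directly into the list, because that would cause index problems while you are iterating over it. Instead, we replace the first "and" in the list with "synthesis and" and the second "and" with "skin and", then re-join the split string.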
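The same idea can be wrapped in a small function. This is only a sketch, not part of the original answer: the name share_trailing_term is hypothetical, and the idx + 2 bounds check is an added assumption so that an "and" too close to the end of the sentence is left alone instead of raising IndexError.

```python
from string import punctuation


def share_trailing_term(sentence):
    """Copy the second word after each 'and' in front of that 'and'."""
    words = sentence.split()
    for idx, word in enumerate(words):
        # Assumption: only rewrite when two more words follow the "and".
        if word == "and" and idx + 2 < len(words):
            # Replacing an element in place keeps the list length stable,
            # so the indices stay valid while we iterate.
            words[idx] = words[idx + 2].strip(punctuation) + " and"
    return " ".join(words)


text = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
        "caused by WbC-2 of acnes in human face and animals skin.")
print(share_trailing_term(text))
```

With the bounds check, a short input such as "cats and dogs" is returned unchanged.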
Regex solution
If you insist on a regex solution, I suggest using re.findall with a pattern that contains a single "and", because this is more general for the problem:
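from string import punctuation
import re
# x must be the original string here, not the list produced by x.split() above
x = 'Since, the genetic cells of SAC-1 and RbC-27 synthesis was not caused by WbC-2 of acnes in human face and animals skin.'
# raw string so that \s is not treated as a string escape sequence
pattern = re.compile(r"(.*?)\sand\s(.*?)\s([^\s]+)")
result = ''.join([f"{match[0]} {match[2].strip(punctuation)} and {match[1]} {match[2]}" for match in pattern.findall(x)])
print(result)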
Since, the genetic cells of SAC-1 synthesis and RbC-27 synthesis was not caused by WbC-2 of acnes in human face skin and animals skin.
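We again use strip(punctuation) because "skin." is captured: we do not want to lose the punctuation at the end of the sentence, but we do want to drop it inside the sentence.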
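As a quick illustration of why strip(punctuation) is safe for terms such as RbC-27: str.strip only removes characters from the two ends of a string, so a hyphen inside a word is kept. The sample words below are chosen for illustration only.

```python
from string import punctuation

# str.strip(chars) removes the given characters from both ends only.
print("skin.".strip(punctuation))    # skin
print("RbC-27".strip(punctuation))   # RbC-27 (the inner hyphen is untouched)
print("(face),".strip(punctuation))  # face
```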
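Here is our pattern:
(.*?)\sand\s(.*?)\s([^\s]+)
(.*?)\s: captures everything before "and"; the \s matches the space in front of it
\s(.*?)\s: captures the word immediately after "and"
([^\s]+): captures everything that is not whitespace up to the next whitespace (i.e. the second word after "and"). This ensures that we also capture punctuation.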
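As a possible variant, not from the original answer, re.sub can do the replacement in place. Every group in the pattern below is mandatory, so no group is ever None. In the question's pattern the second group is optional, so x.group(2) is None whenever that group does not take part in a match, and indexing it raises the TypeError. re.sub also leaves unmatched text untouched, so anything after the last match is kept. The names AND_PATTERN and share_term_with_sub are hypothetical.

```python
import re
from string import punctuation

# Whitespace, "and", whitespace, then the next two whitespace-separated words.
AND_PATTERN = re.compile(r"\sand\s(\S+)\s(\S+)")


def share_term_with_sub(sentence):
    """Insert the second word after each 'and' in front of that 'and'."""
    def repl(m):
        first, second = m.group(1), m.group(2)
        # Strip punctuation only from the copy placed before "and".
        return f" {second.strip(punctuation)} and {first} {second}"
    return AND_PATTERN.sub(repl, sentence)


text = ("Since, the genetic cells of SAC-1 and RbC-27 synthesis was not "
        "caused by WbC-2 of acnes in human face and animals skin.")
print(share_term_with_sub(text))
```

Like the findall version, this sketch assumes that the shared term is always the second word after "and".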
Regarding python - concatenating terms using the substitute method via regex, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/67881649/