I'm working on a transpiler and want to replace my language's tokens with Python's tokens. The replacement is done like this:
for rep in reps:
    pattern, translated = rep
    # Replaces every [pattern] with [translated] in [transpiled]
    transpiled = re.sub(pattern, translated, transpiled, flags=re.UNICODE)
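where reps is a list of (regex to be replaced, string to replace it with) ordered pairs, and transpiled is the text to be transpiled.
However, I can't seem to find a way to exclude text between quotes from the replacement process. Note that this is for a language, so it should also work for escaped quotes and single quotes.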
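Best answer
This may depend on how you define your patterns, but in general you can always surround your pattern with lookahead and lookbehind groups to ensure that text between quotes is not matched: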
import re

transpiled = "A foo with \"foo\" and single quoted 'foo'. It even has an escaped \\'foo\\'!"
reps = [("foo", "bar"), ("and", "or")]

print(transpiled)  # before the changes
for rep in reps:
    pattern, translated = rep
    transpiled = re.sub("(?<=[^\"']){}(?=\\\\?[^\"'])".format(pattern),
                        translated, transpiled, flags=re.UNICODE)
    print(transpiled)  # after each change
This will produce:
A foo with "foo" and single quoted 'foo'. It even has an escaped \'foo\'!
A bar with "foo" and single quoted 'foo'. It even has an escaped \'foo\'!
A bar with "foo" or single quoted 'foo'. It even has an escaped \'foo\'!
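Before relying on the lookarounds, it may help to see where they stop protecting text. The sketch below reuses the same lookbehind/lookahead wrapper; the helper name `sub_outside_quotes_naive` and the sample strings are made up for illustration. It suggests two limits: only a token that directly touches a quote character is skipped, and the lookbehind needs at least one preceding character, so a match at the very start of the string is missed.

```python
import re

def sub_outside_quotes_naive(pattern, translated, text):
    # Hypothetical helper: wraps the pattern in the same lookarounds as above.
    wrapped = "(?<=[^\"']){}(?=\\\\?[^\"'])".format(pattern)
    return re.sub(wrapped, translated, text, flags=re.UNICODE)

# A token touching the quotes is left alone...
print(sub_outside_quotes_naive("foo", "bar", "a 'foo' b"))
# a 'foo' b

# ...but a token in the middle of a longer quoted phrase is still replaced.
print(sub_outside_quotes_naive("foo", "bar", "say 'a foo here' now"))
# say 'a bar here' now

# A match at position 0 has no preceding character, so the lookbehind fails.
print(sub_outside_quotes_naive("foo", "bar", "foo first"))
# foo first
```

If whole quoted phrases must be protected, the lookaround wrapper alone is probably not enough.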
UPDATE: If you want to ignore whole quoted swaths of text, not just a quoted word, you'll have to do a bit more work. While you could do it by looking for repeated quotations, the whole lookahead/lookbehind mechanism would get really messy and probably far from optimal. It's easier to separate the quoted from the non-quoted text and do replacements only in the latter, something like:
import re

QUOTED_STRING = re.compile("(\\\\?[\"']).*?\\1")  # a pattern to match strings between quotes

def replace_multiple(source, replacements, flags=0):  # a convenience replacement function
    if not source:  # no need to process empty strings
        return ""
    for r in replacements:
        source = re.sub(r[0], r[1], source, flags=flags)
    return source

def replace_non_quoted(source, replacements, flags=0):
    result = []  # a store for the result pieces
    head = 0  # a search head reference
    for match in QUOTED_STRING.finditer(source):
        # process everything until the current quoted match and add it to the result
        result.append(replace_multiple(source[head:match.start()], replacements, flags))
        result.append(match[0])  # add the quoted match verbatim to the result
        head = match.end()  # move the search head to the end of the quoted match
    if head < len(source):  # if the search head is not at the end of the string
        # process the rest of the string and add it to the result
        result.append(replace_multiple(source[head:], replacements, flags))
    return "".join(result)  # join back the result pieces and return them
You can test it as:
print(replace_non_quoted("A foo with \"foo\" and 'foo', says: 'I have a foo'!", reps))
# A bar with "foo" or 'foo', says: 'I have a foo'!
print(replace_non_quoted("A foo with \"foo\" and foo, says: \\'I have a foo\\'!", reps))
# A bar with "foo" or bar, says: \'I have a foo\'!
print(replace_non_quoted("A foo with '\"foo\" and foo', says - I have a foo!", reps))
# A bar with '"foo" and foo', says - I have a bar!
As an added bonus, this also allows you to define fully qualified regex patterns as your replacements:
print(replace_non_quoted("My foo and \"bar\" are like 'moo' and star!",
                         ((r"(\w+)oo", r"oo\1"), (r"(\w+)ar", r"ra\1"))))
# My oof and "bar" are like 'moo' and rast!
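The question also asks about escaped quotes. If a backslash escapes the next character inside a string literal in the source language (an assumption, since the string syntax is not specified here), the quoted-span pattern can be made escape-aware. The sketch below is one possible variant, not the answer's own code: `ESCAPED_QUOTED` and `replace_outside_strings` are hypothetical names, and it uses `re.split()` with a capturing group so that the quoted pieces stay in the result list.

```python
import re

# Assumption: strings are delimited by ' or " and a backslash escapes the
# next character inside them (as in Python or C string literals).
ESCAPED_QUOTED = re.compile(r'''("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')''')

def replace_outside_strings(source, replacements, flags=0):
    # With one capturing group, re.split() alternates between
    # non-quoted text (even indices) and quoted literals (odd indices).
    parts = ESCAPED_QUOTED.split(source)
    for i in range(0, len(parts), 2):
        for pattern, translated in replacements:
            parts[i] = re.sub(pattern, translated, parts[i], flags=flags)
    return "".join(parts)

reps = [("foo", "bar"), ("and", "or")]
print(replace_outside_strings('foo and "a \\"foo\\" inside" and foo', reps))
# bar or "a \"foo\" inside" or bar
```

An unterminated string literal does not match the pattern, so its contents would be treated as ordinary text; a real transpiler would likely want to report that case as an error instead.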
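However, if your replacements don't involve patterns and only need simple substitution, you can replace re.sub() in the replace_multiple() helper function with the significantly faster native str.replace().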
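To check the speed difference on your own data, a rough `timeit` comparison along these lines may help. The sample text, the repetition counts, and the helper names `with_re_sub` and `with_str_replace` are arbitrary choices for illustration, and the timings will vary by machine and Python version.

```python
import re
import timeit

text = "A foo and a moo, then another foo. " * 1000
reps = [("foo", "bar"), ("and", "or")]

def with_re_sub(source):
    # Regex-based replacement, as in the replace_multiple() helper
    for pattern, translated in reps:
        source = re.sub(pattern, translated, source)
    return source

def with_str_replace(source):
    # Plain substring replacement, no regex engine involved
    for old, new in reps:
        source = source.replace(old, new)
    return source

# Both helpers should agree when the patterns contain no regex metacharacters.
assert with_re_sub(text) == with_str_replace(text)

print("re.sub:     ", timeit.timeit(lambda: with_re_sub(text), number=100))
print("str.replace:", timeit.timeit(lambda: with_str_replace(text), number=100))
```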
Finally, if you don't need complex patterns, you can get rid of regex altogether:
QUOTE_STRINGS = ("'", "\\'", '"', '\\"')  # a list of substrings considered a 'quote'

def replace_multiple(source, replacements):  # a convenience multi-replacement function
    if not source:  # no need to process empty strings
        return ""
    for r in replacements:
        source = source.replace(r[0], r[1])
    return source

def replace_non_quoted(source, replacements):
    result = []  # a store for the result pieces
    head = 0  # a search head reference
    eos = len(source)  # a convenience string length reference
    quote = None  # last quote match literal
    quote_len = 0  # a convenience reference to the current quote substring length
    while True:
        if quote:  # we already have a matching quote stored
            index = source.find(quote, head + quote_len)  # find the closing quote
            if index == -1:  # EOS reached
                break
            result.append(source[head:index + quote_len])  # add the quoted string verbatim
            head = index + quote_len  # move the search head after the quoted match
            quote = None  # blank out the quote literal
        else:  # the current position is not in a quoted substring
            index = eos
            # find the first quoted substring from the current head position
            for entry in QUOTE_STRINGS:  # loop through all quote substrings
                candidate = source.find(entry, head)
                # <= so that a quote sitting exactly at the search head is not skipped
                if head <= candidate < index:
                    index = candidate
                    quote = entry
                    quote_len = len(entry)
            if not quote:  # EOS reached, no quote found
                break
            result.append(replace_multiple(source[head:index], replacements))
            head = index  # move the search head to the start of the quoted match
    if head < eos:  # if the search head is not at the end of the string
        result.append(replace_multiple(source[head:], replacements))
    return "".join(result)  # join back the result pieces and return them
Regarding python - Regex: replace text unless it's between quotes, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49641089/