I have a text file (no punctuation) that is roughly 100MB - 1GB in size. Here are some sample lines:
please check in here
i have a full hd movie
see you again bye bye
press ctrl c to copy text to clipboard
i need your help
...
along with a list of replacement tokens, like this:
check in -> check_in
full hd -> full_hd
bye bye -> bye_bye
ctrl c -> ctrl_c
...
After replacing the tokens in the text file, the output I want looks like this:
please check_in here
i have a full_hd movie
see you again bye_bye
press ctrl_c to copy text to clipboard
i need your help
...
My current approach:
import re

replace_tokens = {'ctrl c': 'ctrl_c', ...}  # a python dictionary
for line in open('text_file'):
    for token in replace_tokens:
        line = re.sub(r'\b{}\b'.format(token), replace_tokens[token], line)
    # Save line to file
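This solution works, but it is very slow when there are many replacement tokens and the text file is large, because every line is scanned once per token. Is there a better solution?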
Best answer
You can at least eliminate the inner loop by combining all tokens into a single compiled pattern:
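import re

tokens = {"check in": "check_in", "full hd": "full_hd",
          "bye bye": "bye_bye", "ctrl c": "ctrl_c"}
# One alternation of all tokens, compiled once, so each line is scanned a single time.
regex = re.compile("|".join([r"\b{}\b".format(t) for t in tokens]))
with open(your_file) as f:  # your_file: path to the input text file
    for line in f:
        # The callback looks up the replacement for whichever token matched.
        line = regex.sub(lambda m: tokens[m.group(0)], line.rstrip())
        print(line)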
Prints:
please check_in here
i have a full_hd movie
see you again bye_bye
press ctrl_c to copy text to clipboard
i need your help
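If the token list may contain regex metacharacters, or phrases that overlap (one token being a prefix of another), the pattern above could mismatch. The following is a minimal sketch of a slightly more defensive variant, not part of the original answer: the helper names `build_pattern`, `apply_tokens` and `replace_in_file`, the UTF-8 encoding, and the separate output path are all assumptions made for illustration.

```python
import re


def build_pattern(tokens):
    """Compile one alternation for all keys of a non-empty tokens dict."""
    # Longest keys first: regex alternation is ordered, so a longer phrase
    # is tried before a shorter phrase that is its prefix.
    keys = sorted(tokens, key=len, reverse=True)
    # re.escape keeps keys containing characters such as "." or "+" literal.
    return re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")


def apply_tokens(text, tokens, pattern=None):
    """Return text with every matched key replaced by its value."""
    if pattern is None:
        pattern = build_pattern(tokens)
    return pattern.sub(lambda m: tokens[m.group(0)], text)


def replace_in_file(src_path, dst_path, tokens):
    """Stream src_path line by line and write the replaced text to dst_path."""
    pattern = build_pattern(tokens)  # compiled once, reused for every line
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(apply_tokens(line, tokens, pattern))
```

Writing each line to the output as soon as it has been processed means only one line is held in memory at a time, which matters for inputs in the 100MB - 1GB range.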
For more on "python - best way to replace a list of tokens in a text file", see the similar question on Stack Overflow: https://stackoverflow.com/questions/62441317/