python - Best way to replace a list of tokens in a text file

I have a text file (no punctuation), roughly 100 MB to 1 GB in size. Here are some sample lines:

please check in here
i have a full hd movie
see you again bye bye
press ctrl c to copy text to clipboard
i need your help
...

and a list of replacement tokens like this:

check in -> check_in
full hd -> full_hd
bye bye -> bye_bye
ctrl c -> ctrl_c
...

The output I want after applying the replacements to the text file is:

please check_in here
i have a full_hd movie
see you again bye_bye
press ctrl_c to copy text to clipboard
i need your help
...

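My current approach

import re

replace_tokens = {'ctrl c': 'ctrl_c', ...} # a python dictionary
for line in open('text_file'):
  for token in replace_tokens:
    # one full regex pass over the line for every token
    line = re.sub(r'\b{}\b'.format(token), replace_tokens[token], line)
  # Save line to file

This solution works, but it is very slow when there are many replacement tokens and the text file is large. Is there a better solution?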

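Best answer

You can at least eliminate the cost of the inner loop by compiling all the tokens into a single regex, so each line is scanned once: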

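import re

tokens = {"check in": "check_in", "full hd": "full_hd",
          "bye bye": "bye_bye", "ctrl c": "ctrl_c"}

# One alternation of all tokens, compiled once and reused for every line
regex = re.compile("|".join([r"\b{}\b".format(t) for t in tokens]))

with open(your_file) as f:  # your_file: path to the input text file
    for line in f:
        # Look up the replacement for whichever token matched
        line = regex.sub(lambda m: tokens[m.group(0)], line.rstrip())
        print(line)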

Prints:

please check_in here
i have a full_hd movie
see you again bye_bye
press ctrl_c to copy text to clipboard
i need your help
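
The single-regex idea can be made a little more defensive. The following is only a sketch under some assumptions, not part of the original answer: the function names `build_pattern`, `replace_line`, and `replace_file`, the file paths, and the UTF-8 encoding are all hypothetical. It escapes each token with `re.escape` in case a token contains regex metacharacters, sorts tokens longest-first so that a longer token wins over a shorter one sharing the same prefix, and writes the result to an output file instead of printing it.

```python
import re

# Sample replacement table taken from the question; a real one would be much larger.
tokens = {
    "check in": "check_in",
    "full hd": "full_hd",
    "bye bye": "bye_bye",
    "ctrl c": "ctrl_c",
}


def build_pattern(mapping):
    """Compile one alternation regex covering every key of the mapping."""
    # Longest keys first: regex alternation takes the first branch that matches,
    # so "full hd" must be tried before a shorter key such as "full".
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape keeps tokens literal even if they contain characters like "." or "+".
    return re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")


def replace_line(line, pattern, mapping):
    """Replace every matched token in a single pass over the line."""
    return pattern.sub(lambda m: mapping[m.group(0)], line)


def replace_file(src_path, dst_path, mapping):
    """Stream src_path line by line and write the replaced text to dst_path."""
    pattern = build_pattern(mapping)
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(replace_line(line, pattern, mapping))
```

A call such as `replace_file("text_file", "text_file.out", tokens)` would process the input one line at a time, so memory use stays small even for a 1 GB file. The `\b` anchors assume every token starts and ends with a word character, which holds for the punctuation-free text described in the question.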

A similar question about the best way to replace a list of tokens in a text file can be found on Stack Overflow: https://stackoverflow.com/questions/62441317/
