I'm trying to find some spelling errors in a very large text file and correct them. Basically, I run this code:
import re

# read the OCR output line by line
ocr = open("text.txt")
text = ocr.readlines()
ocr.close()

clean_text = []
for line in text:
    # e.g. fix a leading "|" that OCR misread for a "1" before a digit
    last = re.sub("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,", "1\\2\t\\3\\4,", line)
    clean_text.append(last)

new_text = open("new_text.txt", "w", newline="\n")
for line in clean_text:
    new_text.write(line)
new_text.close()
In practice, I use the "re.sub" function more than 1,500 times, and "text.txt" has 100,000 lines. Can I split the text into several parts and use a different core for each part?
Best answer
This applies the text-processing function (currently using the re.sub call from the question) to NUM_CORES equally sized chunks of the input text file, then writes the chunks back out (preserving their order from the original input file).
from multiprocessing import Pool, cpu_count
import re

import numpy as np

NUM_CORES = cpu_count()

def process_text(input_textlines):
    clean_text = []
    for line in input_textlines:
        cleaned = re.sub("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,", "1\\2\t\\3\\4,", line)
        clean_text.append(cleaned)
    return "".join(clean_text)

# guard so spawned worker processes don't re-execute the script body
if __name__ == "__main__":
    # read in data and split it into NUM_CORES equally sized chunks
    with open("data/text.txt", "r") as f:
        lines = f.readlines()
    text_chunks = np.array_split(lines, NUM_CORES)

    # process each chunk in parallel; map() preserves input order
    with Pool(NUM_CORES) as pool:
        results = pool.map(process_text, text_chunks)

    # write out results, one chunk at a time
    with open("new_text.txt", "w", newline="\n") as f:
        for text_chunk in results:
            f.write(text_chunk)
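
As a follow-up: the question mentions applying re.sub more than 1,500 times per line. With that many rules, it may also help to compile each pattern once at module level, so every worker process pays the compilation cost a single time rather than on each call. Below is a minimal sketch, assuming the rules live in a hypothetical SUBSTITUTIONS list of (pattern, replacement) pairs; the actual 1,500 rules are not shown in the question.

import re

# hypothetical list of (pattern, replacement) rules; the question's
# actual ~1,500 rules would go here
SUBSTITUTIONS = [
    ("^(\\|)([0-9])(\\s)([A-Z][a-z]+[a-z])\\,", "1\\2\t\\3\\4,"),
]

# compile each pattern once at import time, so workers reuse the
# compiled objects instead of re-parsing the patterns for every line
COMPILED = [(re.compile(pattern), repl) for pattern, repl in SUBSTITUTIONS]

def process_text(input_textlines):
    clean_text = []
    for line in input_textlines:
        # apply every rule in turn to the current line
        for pattern, repl in COMPILED:
            line = pattern.sub(repl, line)
        clean_text.append(line)
    return "".join(clean_text)

Python's re module does cache recently used patterns internally, but the cache is bounded, so with well over a thousand distinct patterns explicit compilation avoids repeated cache evictions.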
A similar question, python - Parallel computing of big text files, can be found on Stack Overflow: https://stackoverflow.com/questions/59185357/