I have a ~40 GB CSV file with 1,800,000 lines.
I want to randomly sample 10,000 lines and print them to a new file.
Right now, my approach is to use sed like this:
(sed -n '$vars' < input.txt) > output.txt
where $vars is a randomly generated list of line-print commands (e.g.: 1p;14p;1700p;...;10203p).
While this works, it takes about 5 minutes per run. That is not a huge amount of time, but I was wondering whether anyone has ideas for making it faster?
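(For reference, one way such a `$vars` list could be produced; the variable names below are made up, and this is only a sketch of the kind of generator the question implies, not the asker's actual code:)

```python
import random

# Pick 10,000 distinct 1-based line numbers out of 1,800,000 and join
# them into a sed print-command list like '1p;14p;1700p;...'
line_count = 1_800_000
selection_count = 10_000
indices = sorted(random.sample(range(1, line_count + 1), selection_count))
sed_script = ';'.join('{}p'.format(i) for i in indices)
```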
Best answer
The big advantage of lines of equal length is that you don't need to search for newlines to know where each line starts. With a file size of ~40 GB and 1.8M lines, the lines are ~20 KB each. If you want to sample 10K lines, there are ~4 MB between consecutive samples. That is almost certainly about three orders of magnitude larger than the block size on your disk. Seeking to the next read position is therefore far more efficient than reading every byte in the file.
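The arithmetic above can be checked directly with the numbers from the question (the 4096-byte block size is an assumption, typical for ext4/NTFS):

```python
file_size = 40 * 10**9         # ~40 GB, from the question
line_count = 1_800_000         # from the question
selection_count = 10_000       # from the question
block_size = 4096              # assumed typical disk block size

line_size = file_size // line_count        # ~22 KB per line
sample_gap = file_size // selection_count  # ~4 MB between samples

# The gap between samples spans roughly a thousand disk blocks,
# i.e. about three orders of magnitude more than one block.
blocks_skipped = sample_gap // block_size
print(line_size, sample_gap, blocks_skipped)
```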
Seeking can also handle files with unequal line lengths (e.g. non-ASCII characters in a UTF-8 encoding), but it requires a small modification to the method. If you have unequal lines, you can seek to an estimated position and then scan to the start of the next line. This is still very efficient, because you are skipping ~4 MB for every ~20 KB you need to read. Your sampling uniformity suffers slightly, since you are selecting byte positions rather than line positions, and you cannot be sure which line number you are reading.
You can implement your solution directly with Python code that generates the line numbers. Here is an example of how to handle a file whose lines all have the same number of bytes (typically an ASCII encoding):
import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000

file_size = getsize(file_name)
with open(file_name) as file:
    # Read the first line to get the length
    file.readline()
    line_size = file.tell()
    # You don't have to seek(0) here: if line #0 is selected,
    # the seek will happen regardless later.

    # Assuming you are 100% sure all lines are equal, this might
    # discard the last line if it doesn't have a trailing newline.
    # If that bothers you, use `round(file_size / line_size)`
    line_count = file_size // line_size

    # This is just a trivial example of how to generate the line numbers.
    # If it doesn't work for you, just use the method you already have.
    # By the way, this will just error out (ValueError) if you try to
    # select more lines than there are in the file, which is ideal
    selection_indices = random.sample(range(line_count), selection_count)
    selection_indices.sort()

    # Now skip to each line before reading it:
    prev_index = 0
    for line_index in selection_indices:
        # Conveniently, the default seek offset is from the start of
        # the file, not from the current position
        if line_index != prev_index + 1:
            file.seek(line_index * line_size)
        print('Line #{}: {}'.format(line_index, file.readline()), end='')
        # Small optimization to avoid seeking between consecutive lines.
        # Might be unnecessary since seek probably already does
        # something like that for you
        prev_index = line_index
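A quick way to sanity-check the seek arithmetic is to run the same logic on a small synthetic file of fixed-width rows (the file contents and 20-byte row format below are made up for the demo):

```python
import os
import random
import tempfile

# Build a small file of equal-length lines as a stand-in for the real
# 40 GB CSV: each row is exactly 20 bytes including the newline.
line_count = 1000
selection_count = 10
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    for i in range(line_count):
        tmp.write('{:08d},0123456789\n'.format(i))
    file_name = tmp.name

with open(file_name) as file:
    file.readline()
    line_size = file.tell()  # 20 bytes per row
    selection_indices = sorted(random.sample(range(line_count),
                                             selection_count))
    sampled = []
    for line_index in selection_indices:
        file.seek(line_index * line_size)
        sampled.append(file.readline().rstrip('\n'))

os.unlink(file_name)

# Every sampled row starts with its own line number, so a seek to
# index * line_size really did land on line `index`.
for index, row in zip(selection_indices, sampled):
    assert row == '{:08d},0123456789'.format(index)
```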
If you are willing to sacrifice a (very) small amount of uniformity in the distribution of line numbers, you can easily apply a similar technique to files with unequal line lengths. You just generate random byte offsets, then skip to the next full line after each offset. The implementation below assumes you know that no line is longer than 40 KB. You would also have to do something like this if your CSV contains non-ASCII Unicode characters encoded in UTF-8, because even if all lines contain the same number of characters, they will contain different numbers of bytes. In that case you must open the file in binary mode, since seeking to a random byte could otherwise trigger a decoding error if that byte happens to land in the middle of a character:
import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000
# An upper bound on the line size in bytes, not chars.
# This serves two purposes:
#   1. It determines the margin to use from the end of the file
#   2. It determines the closest two offsets are allowed to be and
#      still be 100% guaranteed to be in different lines
max_line_bytes = 40000

file_size = getsize(file_name)
# make_offsets is a function that returns `selection_count` monotonically
# increasing unique samples, at least `max_line_bytes` apart from each
# other, in the range [0, file_size - margin). Implementation not provided.
selection_offsets = make_offsets(selection_count, file_size, max_line_bytes)
with open(file_name, 'rb') as file:
    for offset in selection_offsets:
        # Skip to each offset
        file.seek(offset)
        # Read out the rest of the current (partial) line
        file.readline()
        # Print the next full line. You don't know its number.
        # You also have to decode it yourself.
        print(file.readline().decode('utf-8'), end='')
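The answer leaves `make_offsets` unimplemented. One possible sketch, assuming the margin is taken to be `max_line_bytes` itself: sample without replacement from a range shrunk by the total required spacing, sort, then stretch the samples apart, which guarantees the minimum gap by construction.

```python
import random

def make_offsets(count, file_size, min_gap):
    # Return `count` sorted unique byte offsets, each at least
    # `min_gap` apart, all in the range [0, file_size - min_gap).
    #
    # Trick: sample from a range shrunk by the total spacing, sort,
    # then add i * min_gap to the i-th sample. Consecutive results
    # then differ by at least min_gap + 1.
    usable = file_size - min_gap - (count - 1) * min_gap
    if usable < count:
        raise ValueError('file too small for this many samples')
    picks = sorted(random.sample(range(usable), count))
    return [p + i * min_gap for i, p in enumerate(picks)]
```

With the numbers from the answer, `make_offsets(10000, file_size, 40000)` reserves a 40 KB margin at the end of the file so that the line following the last offset can still be read in full.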
All the code here is Python 3.
About python - randomly sampling lines from a file: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48050639/