python - How can I quickly get multiple columns from a line when processing a large file?

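Suppose I have a huge file with 5000 columns and 1,000,000 rows. The columns in a row are separated by \t, and every cell is a string of more or less random length. I want to reach specific columns in each row and evaluate them. The usual way is too slow. I wrote the following code to reach the cells faster: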

def amk(theLine, delimiter, columnList):
    ind = -1
    for col in columnList:
        for _ in range(col):
            ind = theLine.find(delimiter, ind + 1)
        yield theLine[ind + 1: theLine.find(delimiter, ind + 1)]

def columnListProcessor(columnList):
    columnList.sort(reverse=False)
    return [columnList[0]] + [columnList[i] - columnList[i - 1] for i in range(1,len(columnList))]

# Use an arbitrary set of columns here as an example.
# In practice there can be more than 500 columns.
columnList = columnListProcessor([1, 3, 31, 232, 443, 514, 801, 1032, 1500, 2540, 2983, 3500, 4000, 4441, 4982])

with open("hugeFile.txt", "r") as theFile:
    theLine = theFile.readline()
    while theLine:
        for k in amk(theLine, "\t", columnList):
            if condition:
                foo()
        theLine = theFile.readline()

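I can say this is actually quite fast. However, I realized that the function amk could be better. When it yields a result, it calls theLine.find(delimiter, ind + 1) to find the next \t. But it does not save the index of that \t, so the next time it is resumed to yield the next column in the list, it calls theLine.find(delimiter, ind + 1) again to find the same \t. In other words, it looks up that \t twice, which may slow my code down.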

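I tried to create a new index generator that wraps theLine.find(delimiter, ind + 1), but it did not speed things up, though I may have written it badly. I could not work it out and could not fix the code, even though it clearly could run faster.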
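For illustration, the idea of remembering the delimiter position could be sketched roughly as follows. This is only one possible way to write it, not the asker's code: the name `amk_cached` is hypothetical, the delimiter is assumed to be a single character, every requested column is assumed to exist in the line, and `column_deltas` is the difference-encoded list that `columnListProcessor` returns. Whether it is actually faster has to be measured.

```python
def amk_cached(line, delimiter, column_deltas):
    """Yield the selected fields, reusing the delimiter index found for a field's end.

    Assumes a single-character delimiter and that every requested column exists.
    """
    prev = -1    # index of the delimiter just before the current field
    nxt = None   # cached index of the delimiter just after it, once known
    for delta in column_deltas:
        for _ in range(delta):
            if nxt is None:
                nxt = line.find(delimiter, prev + 1)
            # Step to the next field; the first step reuses the cached index.
            prev, nxt = nxt, None
        if nxt is None:
            nxt = line.find(delimiter, prev + 1)
        # The last field of a line has no delimiter after it.
        end = nxt if nxt != -1 else len(line)
        yield line[prev + 1:end]


# Columns 0, 1, 3 and 4 are encoded as the deltas 0, 1, 2, 1.
print(list(amk_cached("a\tbb\tccc\tdddd\teeeee", "\t", [0, 1, 2, 1])))
```

Unlike the original amk, this sketch returns the last field of a line in full instead of cutting off its final character.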

Best answer

If you want 500 of the 5000 columns, splitting the whole line on the delimiter seems more appropriate:

def amk(line, delimiter, column_list):
    split_line = line.split(delimiter)
    for col in column_list:
        yield split_line[col]

column_list = [1, 3, 31, 232, 443, 514, 801, 1032, 1500, 2540, 2983, 3500, 4000, 4441, 4982]

with open("hugeFile.txt", "r") as fobj:
    for line in fobj:
        for k in amk(line, "\t", column_list):
            print(k)

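The string method .split() is implemented in C, so it is very fast. Although your approach with .find() may scan less of the line, you have to call it many times from Python. Many Python-level function calls are slow compared with a single call into a C function (method). Even though .find() itself is also implemented in C, you need to call it many more times than .split(), which is called only once per line.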

In general, you always need to measure the run time. Which approach is faster for your use case is often not obvious.
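One way to take such a measurement is the standard `timeit` module. The sketch below compares the two approaches on a single synthetic line; the function names `by_split` and `by_find`, the field lengths, and the repetition count are all assumptions made for illustration, and the numbers it prints will differ from machine to machine and from real data.

```python
import random
import string
import timeit


def by_split(line, delimiter, column_list):
    """Split the whole line once, then index the wanted columns."""
    fields = line.split(delimiter)
    return [fields[col] for col in column_list]


def by_find(line, delimiter, column_deltas):
    """Walk the line with repeated str.find calls (difference-encoded columns)."""
    result = []
    ind = -1
    for delta in column_deltas:
        for _ in range(delta):
            ind = line.find(delimiter, ind + 1)
        end = line.find(delimiter, ind + 1)
        result.append(line[ind + 1:end if end != -1 else len(line)])
    return result


# Synthetic test line: 5000 tab-separated fields of random length (assumption).
random.seed(0)
line = "\t".join(
    "".join(random.choices(string.ascii_letters, k=random.randint(1, 20)))
    for _ in range(5000)
)
columns = [1, 3, 31, 232, 443, 514, 801, 1032, 1500, 2540, 2983, 3500, 4000, 4441, 4982]
deltas = [columns[0]] + [b - a for a, b in zip(columns, columns[1:])]

# Both approaches must agree before their timings are worth comparing.
assert by_split(line, "\t", columns) == by_find(line, "\t", deltas)

for name, func, cols in (("split", by_split, columns), ("find", by_find, deltas)):
    seconds = timeit.timeit(lambda: func(line, "\t", cols), number=50)
    print(f"{name}: {seconds:.4f} s for 50 lines")
```

Timing a single line in memory leaves out file I/O, so a run over a representative slice of the real file is the more reliable check.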

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/41790533/
