python - Is there a way to speed up date parsing for a large file?

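I'm reading a big CSV file which has about 1B rows. I've run into a problem with parsing the dates: Python is slow at it.

A line in the file looks like this: '20170427,20:52:01.510,ABC,USD/MXN,1,OFFER,19.04274,9000000,9@15@8653948257753368229,0.0\n'

If I only read through the data, it takes about 1 minute.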

import datetime

t0 = datetime.datetime.now()
i = 0
with open(r"QuoteData.txt") as file:
    for line in file:
        i += 1
print(i)
t1 = datetime.datetime.now() - t0
print(t1)

129908976
0:01:09.871744

But if I also try to parse the datetime, it takes 8 minutes.

import datetime

t0 = datetime.datetime.now()
i = 0
with open(r"D:\FxQuotes\ticks.log.20170427.txt") as file:
    for line in file:
        strings = line.split(",")
        datetime.datetime(
            int(strings[0][0:4]),  # %Y
            int(strings[0][4:6]),  # %m
            int(strings[0][6:8]),  # %d
            int(strings[1][0:2]),  # %H
            int(strings[1][3:5]),  # %M
            int(strings[1][6:8]),  # %S
            int(strings[1][9:]) * 1000,  # %f: the file has milliseconds, datetime expects microseconds
        )
        i += 1
print(i)
t1 = datetime.datetime.now() - t0
print(t1)

129908976
0:08:13.687000

The split() takes about 1 minute and the date parsing takes about 6 minutes. Is there anything I can do to improve this?

Best answer

@TemporalWolf made the excellent suggestion of using ciso8601. I had never heard of it, so I thought I'd give it a try.

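First, I benchmarked my laptop with your sample line. I made a CSV file with 10 million rows of that exact line, and reading everything took about 6 seconds. Adding your date-parsing code brought that up to about 48 seconds, which makes sense because you also reported it taking 8x longer. Then I scaled the file down to 1 million rows: I could read it in 0.6 seconds and parse the dates in 4.8 seconds, so everything looked right.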
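The benchmark setup described above can be reproduced roughly as follows. This is a minimal sketch, not the script that produced the timings quoted here: the helper names `make_sample_file`, `time_parse` and `run_benchmark`, the temporary file, and the small default row count are all assumptions made for illustration.

```python
import datetime
import os
import tempfile
import time

# The sample line from the question, repeated to build a test file.
SAMPLE = "20170427,20:52:01.510,ABC,USD/MXN,1,OFFER,19.04274,9000000,9@15@8653948257753368229,0.0\n"


def make_sample_file(path, n_rows):
    """Write n_rows copies of the sample line to path (hypothetical helper)."""
    with open(path, "w") as f:
        for _ in range(n_rows):
            f.write(SAMPLE)


def time_parse(path, parse_dates):
    """Return (row count, elapsed seconds) for one pass over the file."""
    start = time.perf_counter()
    count = 0
    with open(path) as f:
        for line in f:
            if parse_dates:
                d, t = line.split(",", 2)[:2]
                datetime.datetime(
                    int(d[0:4]), int(d[4:6]), int(d[6:8]),
                    int(t[0:2]), int(t[3:5]), int(t[6:8]),
                    int(t[9:]) * 1000,  # 3-digit fraction is milliseconds; datetime wants microseconds
                )
            count += 1
    return count, time.perf_counter() - start


def run_benchmark(n_rows=100_000):
    """Time a read-only pass and a read-plus-parse pass over a temporary file."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        make_sample_file(path, n_rows)
        rows_read, read_secs = time_parse(path, parse_dates=False)
        rows_parsed, parse_secs = time_parse(path, parse_dates=True)
    finally:
        os.remove(path)
    return rows_read, rows_parsed, read_secs, parse_secs


if __name__ == "__main__":
    rows_read, rows_parsed, read_secs, parse_secs = run_benchmark()
    print(f"read only:  {rows_read} rows in {read_secs:.2f}s")
    print(f"with dates: {rows_parsed} rows in {parse_secs:.2f}s")
```

Absolute times will differ from machine to machine. The useful number is the ratio between the two passes, which is what the comparison against the 8x slowdown in the question relies on.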

Then I switched to ciso8601 and, almost like magic, the time for 1 million rows went from 4.8 seconds down to about 1.9 seconds:

import datetime
import ciso8601

t0 = datetime.datetime.now()
i = 0
with open('input.csv') as file:
    for line in file:
        strings = line.split(",")
        d = ciso8601.parse_datetime('%sT%s' % (strings[0], strings[1]))
        i+=1
print(i)
t1 = datetime.datetime.now() - t0
print(t1)

Note that your data is almost in ISO 8601 format already. I just had to stick the date and time together with a "T" in the middle.
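To make the "almost ISO 8601" point concrete, here is a small sketch of the string join applied to the sample line. Because ciso8601 is a third-party package that may not be installed, the check uses the standard library's `datetime.datetime.fromisoformat` as a stand-in. Before Python 3.11 that function requires dashes in the date, so the hypothetical helper `to_iso` inserts them. This only illustrates the format and is not the parser that was benchmarked.

```python
import datetime

line = "20170427,20:52:01.510,ABC,USD/MXN,1,OFFER,19.04274,9000000,9@15@8653948257753368229,0.0\n"


def to_iso(date_str, time_str):
    """Join 'YYYYMMDD' and 'HH:MM:SS.fff' into extended ISO 8601 (hypothetical helper)."""
    return f"{date_str[:4]}-{date_str[4:6]}-{date_str[6:8]}T{time_str}"


# Only the first two fields are needed, so stop splitting after them.
date_str, time_str = line.split(",", 2)[:2]

# The form built for ciso8601: compact date, a "T", then the time.
compact = "%sT%s" % (date_str, time_str)

# Extended form, accepted by the standard library on Python 3.7+.
extended = to_iso(date_str, time_str)
parsed = datetime.datetime.fromisoformat(extended)

print(compact)   # 20170427T20:52:01.510
print(extended)  # 2017-04-27T20:52:01.510
print(parsed)    # 2017-04-27 20:52:01.510000
```

Either string carries the same information. The only work needed per line is one split and one string concatenation before handing the result to the parser.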

Regarding "python - Is there a way to speed up date parsing for a large file?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43726661/
