python - Reading a large number of files in Python

I have a large number of JSON files, each small in size (about 20,000 files, around 100 MB). Reading them for the first time with this code snippet:

from time import perf_counter
from glob import glob

def load_all_jsons_serial():
    t_i = perf_counter()
    json_files = glob("*json")
    for file in json_files:
        with open(file,"r") as f:
            f.read()
    t_f = perf_counter()
    return t_f-t_i
load_all_jsons_serial()

takes about 50 seconds.

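However, if I rerun the code, it finishes in under a second! Could someone:

Explain this observation. Why does the first run take so long while subsequent runs are much faster?
Tell me how to reduce the first-time loading time?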

I'm on a Windows 11 machine and run the code in the notebook extension of VSCode. Thanks.

Best answer

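You can read the files concurrently with aiofiles. Here is a complete example, in which I have 1000 JSON files (200 KB each) in the folders jsonfiles\async\ and jsonfiles\sync\ to prevent any disk-level or OS-level caching. After each run I delete the files and recreate the JSON files.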

from glob import glob
import aiofiles
import asyncio
from time import perf_counter

###
# Synchronous file operation:
###
def load_all_jsons_serial():
    json_files = glob("jsonfiles\\sync\\*.json")
    for file in json_files:
        with open(file,"r") as f:
            f.read()
    return

t_i = perf_counter()
load_all_jsons_serial()
t_f = perf_counter()
print(f"Synchronous: {t_f - t_i}")


###
# Async file operation
###
async def load_async(files: list[str]):
    for file in files:
        async with aiofiles.open(file, "r") as f:
            await f.read()
    return
        
async def main():
    json_files = glob("jsonfiles\\async\\*.json")
    no_of_tasks = 10
    files_per_task = len(json_files)//no_of_tasks + 1
    
    tasks = []
    for i in range(no_of_tasks):
        tasks.append(
            asyncio.create_task(load_async(
                json_files[i*files_per_task : i*files_per_task+files_per_task]))
        )
    await asyncio.gather(*tasks)
    return

t_i = perf_counter()
asyncio.run(main())
t_f = perf_counter()
print(f"Asynchronous: {t_f - t_i}")
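The benchmark above relies on both folders being refilled with fresh files before each run, but the answer does not show that step. A minimal sketch of how it might be done follows; the helper name `make_json_files`, its parameters, and the file naming scheme are hypothetical, and the payload is just random text sized to roughly match the 200 KB mentioned above. Whether freshly written files are really absent from the OS cache depends on the system, so treat this as a setup aid rather than a guarantee of a cold read.

```python
import json
import random
import shutil
import string
from pathlib import Path


def make_json_files(folder, count=1000, approx_bytes=200_000):
    """Recreate `folder` and fill it with `count` JSON files.

    Hypothetical helper: each file holds one random string of
    `approx_bytes` characters, so its size is roughly that many bytes.
    """
    folder = Path(folder)
    # Remove leftovers from the previous run so every file is newly written.
    shutil.rmtree(folder, ignore_errors=True)
    folder.mkdir(parents=True)
    for i in range(count):
        payload = {
            "id": i,
            "data": "".join(random.choices(string.ascii_letters, k=approx_bytes)),
        }
        (folder / f"file_{i}.json").write_text(json.dumps(payload), encoding="utf-8")
    return folder
```

Calling `make_json_files(Path("jsonfiles") / "sync")` and `make_json_files(Path("jsonfiles") / "async")` before each timing run would produce the layout the glob patterns above expect.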

This benchmark isn't exactly scientific, but you can see a significant performance improvement:

Synchronous: 13.353551400010474
Asynchronous: 3.1800755000440404
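aiofiles itself delegates file operations to a thread pool, so a similar concurrent read can be sketched with only the standard library. The version below swaps aiofiles for `concurrent.futures.ThreadPoolExecutor`; it is not part of the original answer, the function names are made up for illustration, the worker count of 10 simply mirrors `no_of_tasks` above, and UTF-8 encoding is an assumption. No timings are claimed for it.

```python
from concurrent.futures import ThreadPoolExecutor
from glob import glob
from time import perf_counter


def read_file(path):
    """Read one file and return its text (assumes UTF-8 content)."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def load_all_jsons_threaded(pattern, max_workers=10):
    """Read every file matching `pattern` using a pool of worker threads."""
    json_files = glob(pattern)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as json_files.
        return list(pool.map(read_file, json_files))


if __name__ == "__main__":
    t_i = perf_counter()
    contents = load_all_jsons_threaded("jsonfiles\\sync\\*.json")
    t_f = perf_counter()
    print(f"Threaded: {len(contents)} files in {t_f - t_i}")
```

Because the threads spend most of their time waiting on disk I/O, this approach may overlap reads in a way comparable to the aiofiles version, but the actual gain depends on the storage device and should be measured on your own machine.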

Regarding python - reading a large number of files in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/74888801/
