python - How can I improve the performance of reading data from HDF5?

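I need to retrieve the unique date values as quickly as possible.

I retrieve them with df = store.df.date.drop_duplicates(). This line takes about 6 seconds. However, if I save the same data to MySQL and index the date column, the query select distinct date from table takes only 80 ms to return the unique date values, which is about 60 times faster than HDF5.

Is there a way to make the function read_unique_date read from HDF5 faster than MySQL, which uses indexes?

My code is as follows: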

import pandas as pd
import numpy as np
from itertools import product
from time import time


def generate_data():
    np.random.seed(202108)

    # date = pd.date_range(start="19900101", end="20210723", freq="D")
    #The above is my original code, you can use the following code to speed up the operation.
    date = pd.date_range(start="20210101", end="20210723", freq="D")
    date = pd.DataFrame(date, columns=["date"])

    # code = pd.DataFrame(range(5000), columns=["code"])
    #The above is my original code, you can use the following code to speed up the operation.
    code = pd.DataFrame(range(50), columns=["code"])

    # generate product of the two columns:
    df = pd.DataFrame(product(date["date"], code["code"]), columns=["date", "code"])
    df['data'] = np.random.random(len(df))
    return df


def save_data(filename, df):
    store = pd.HDFStore(filename)
    store['df'] = df
    store.close()
    return


def read_unique_date(file_name):
    store = pd.HDFStore(file_name)
    start = time()
    df = store.df.date.drop_duplicates()
    store.close()
    stop = time()
    print(stop - start)
    return df


def main():
    path = 'd:\\'
    file = 'large data.h5'
    file_name = path + file
    df = generate_data()
    save_data(file_name, df)
    df1 = read_unique_date(file_name)
    print(df1)
    return df1


if __name__ == '__main__':
    main()

The result is:

0.015624761581420898
0       2021-01-01
50      2021-01-02
100     2021-01-03
150     2021-01-04
200     2021-01-05
           ...    
9950    2021-07-19
10000   2021-07-20
10050   2021-07-21
10100   2021-07-22
10150   2021-07-23
Name: date, Length: 204, dtype: datetime64[ns]

%timeit df1 = read_unique_date(file_name)
16.9 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The result with my original code (the full date range and 5000 codes):

%timeit df1 = read_unique_date(file_name)
4.89 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Best answer

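The short answer is that HDF5 has no indexing at all, apart from dataset keys (dataset names) and contiguous array indexes. The only lookup you can expect to perform directly from the file in sub-linear time is therefore retrieving the Nth value in a dataset. To find the unique values directly from the file, HDF5 has to read the entire file. You could perhaps build something convoluted but workable out of HDF5 groups and references, but then you are just implementing indexing yourself, and I would not recommend going down that path.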
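One very simple form of "implementing indexing yourself" is to precompute the unique dates when the file is written and store them under a second key, so that reading them later does not touch the large dataset. The sketch below is only an illustration under assumptions: the key name unique_dates and the helper names are hypothetical, and the extra key has to be rewritten whenever df changes. Reading and writing the file requires PyTables, as in the question.

```python
import pandas as pd


def unique_dates(frame: pd.DataFrame) -> pd.Series:
    """Return the distinct values of the 'date' column with a fresh 0..n-1 index."""
    return frame["date"].drop_duplicates().reset_index(drop=True)


def save_with_date_index(filename: str, frame: pd.DataFrame) -> None:
    """Write the full frame plus a small precomputed series of unique dates.

    'unique_dates' is a hypothetical key name chosen for this sketch.
    """
    with pd.HDFStore(filename, mode="w") as store:
        store.put("df", frame)
        store.put("unique_dates", unique_dates(frame))


def read_unique_dates(filename: str) -> pd.Series:
    """Read only the small precomputed series, not the large 'df' dataset."""
    with pd.HDFStore(filename, mode="r") as store:
        return store["unique_dates"]
```

With the question's code, save_data would be replaced by save_with_date_index and read_unique_date by read_unique_dates. The cost of finding the distinct values is paid once at write time instead of on every read.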

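Pandas, on the other hand, uses hash tables, trees, and binary search to speed up various lookup operations. You can load your data into a DataFrame in order to run queries against it. But usually you will want to migrate to a proper database at some point. Pandas and HDF5 will only take you so far.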
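As a rough sketch of what querying in memory can look like, the example below builds a small frame shaped like the question's data, moves date into a sorted index, and then asks for the distinct dates and for the rows of one day. The sample sizes and variable names are made up for illustration; in the question's setting the frame would be loaded from the HDF5 file once and reused for many queries.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's data: 4 dates x 3 codes.
dates = pd.date_range("2021-01-01", periods=4, freq="D")
frame = pd.DataFrame({
    "date": dates.repeat(3),
    "code": np.tile(np.arange(3), 4),
    "data": np.arange(12, dtype=float),
})

# A sorted DatetimeIndex lets pandas locate rows for a given date
# without scanning every row.
indexed = frame.set_index("date").sort_index(kind="mergesort")

# Distinct dates, computed on the in-memory index.
unique_dates = indexed.index.unique()

# All rows for a single day.
one_day = indexed.loc[pd.Timestamp("2021-01-02")]
```

The one-off cost here is loading and sorting the data; after that, repeated lookups run against the in-memory index rather than the file.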

This question and answer, "python - How can I improve the performance of reading data from HDF5?", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/68680027/
