Pandas HDFStore : slow on query for non-matching string

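My problem is that when I try to look up a string that is NOT contained in the DataFrame (which is stored in an HDF5 file), the query takes a very long time to complete. For example:

I have a df with 2*10^9 rows, stored in an HDF5 file. It has a string column named "code", which is marked as a data_column (so it is indexed).

When I search for a code that exists in the dataset (store.select('df', 'code=valid_code')), it takes around 10 seconds to get 70K rows.

However, when I search for a code that does NOT exist in the dataset (store.select('df', 'code=not_valid_code')), it takes about 980 seconds to get the result of the query (0 rows).

I create the store like this: store = pd.HDFStore('data.h5', complevel=1, complib='zlib') and the first append looks like this: store.append('df', chunk, data_columns=['code'], expectedrows=2318185498)

Is this behavior normal, or is something wrong?

Thanks!

PS: this question may be related to this other question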

UPDATE:

Following Jeff's suggestion, I replicated his experiment and got the following results on a Mac. This is the table that was generated:

!ptdump -av test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['A'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['A']]
/df/table (Table(50000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": StringCol(itemsize=8, shape=(), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (8192,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 11 attributes:
   [A_dtype := 'string64',
    A_kind := ['A'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := '',
    FIELD_1_NAME := 'A',
    NROWS := 50000000,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']

These are the results:

In [8]: %timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 277 ms per loop

In [9]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 391 ms per loop

In [10]: %timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 533 ms per loop

In [11]: %timeit pd.read_hdf('test_zlib2.h5','df',where='A = "bar"')
1 loops, best of 3: 504 ms per loop

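Since the differences may not be big enough, I tried the same experiment with a bigger dataframe. I also ran this experiment on a different machine, one running Linux.

This is the code (I just multiplied the original dataset by 10):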

import pandas as pd

df = pd.DataFrame({'A' : [ 'foo%05d' % i for i in range(500000) ]})

df = pd.concat([ df ] * 20)

store = pd.HDFStore('test.h5',mode='w')

for i in range(50):
    print("%s" % i)
    store.append('df',df,data_columns=['A'])

This is the table:

!ptdump -av test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['A'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['A']]
/df/table (Table(500000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": StringCol(itemsize=9, shape=(), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (15420,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 11 attributes:
   [A_dtype := 'string72',
    A_kind := ['A'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := '',
    FIELD_1_NAME := 'A',
    NROWS := 500000000,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']

These are the files:

-rw-rw-r-- 1 user user 8.2G Oct  5 14:00 test.h5
-rw-rw-r-- 1 user user 9.9G Oct  5 14:30 test_zlib.h5

These are the results:

In [9]:%timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 1.02 s per loop

In [10]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 980 ms per loop

In [11]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 7.02 s per loop

In [12]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
1 loops, best of 3: 7.27 s per loop

These are my Pandas and Pytables versions:

user@host:~/$ pip show tables
---
Name: tables
Version: 3.1.1
Location: /usr/local/lib/python2.7/dist-packages
Requires: 

user@host:~/$ pip show pandas
---
Name: pandas
Version: 0.14.1
Location: /usr/local/lib/python2.7/dist-packages
Requires: python-dateutil, pytz, numpy

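That said, I am pretty sure the problem is not related to Pandas, since I observed similar behavior when using only Pytables without Pandas.

UPDATE 2:

I have switched to Pytables 3.0.0 and the problem is fixed. This uses the same files that were generated with Pytables 3.1.1.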

In [4]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 205 ms per loop

In [4]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
10 loops, best of 3: 101 ms per loop

Best answer

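I think your issue is a bug that we filed a while ago here with the PyTables guys. Essentially, using a compressed store AND specifying expectedrows AND using an indexed column causes mis-indexing.

The solution is simply NOT to use expectedrows, and instead to ptrepack the file with a specified chunkshape (or AUTO). This is good practice anyhow. Further, I am not sure whether you specify compression up front, but IMHO it is better to do this via ptrepack; see the docs here. There is also a question about this (I can't find it right now): basically, if you are creating the file, don't index up front, but index once you are done appending, if you can.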
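The "index once you are done appending" advice can be sketched with pandas alone: append with index=False, then call create_table_index a single time at the end. This is only an illustration under stated assumptions: the file name, the tiny frame, the three appends, and the optlevel/kind values are made up for the example and are not settings from the benchmark in this answer.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame; every value is 8 characters long, so the string
# column width fixed by the first append fits all later appends.
df = pd.DataFrame({"A": ["foo%05d" % i for i in range(50)]})

# Hypothetical file name inside a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "deferred_index.h5")

with pd.HDFStore(path, mode="w") as store:
    for _ in range(3):
        # index=False: write the rows but do not build the column index yet.
        store.append("df", df, data_columns=["A"], index=False)

    # Build the index on the data column once, after all appends are done.
    store.create_table_index("df", columns=["A"], optlevel=9, kind="full")

    hit = store.select("df", where='A == "foo00002"')   # value that exists
    miss = store.select("df", where='A == "bar"')       # value that does not

print(len(hit), len(miss))
```

Note that expectedrows is deliberately not passed anywhere in this sketch.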

Anyway, create a test store:

In [1]: df = pd.DataFrame({'A' : [ 'foo%05d' % i for i in range(50000) ]})

In [2]: df = pd.concat([ df ] * 20)

Append 50 million rows.

In [4]: store = pd.HDFStore('test.h5',mode='w')

In [6]: for i in range(50):
   ...:     print("%s" % i)
   ...:     store.append('df',df,data_columns=['A'])
   ...:

Here's the table:

In [9]: !ptdump -av test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['A'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['A']]
/df/table (Table(50000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": StringCol(itemsize=8, shape=(), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (8192,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 11 attributes:
   [A_dtype := 'string64',
    A_kind := ['A'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := '',
    FIELD_1_NAME := 'A',
    NROWS := 50000000,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']

Create blosc and zlib versions.

In [12]: !ptrepack --complib blosc --chunkshape auto --propindexes test.h5 test_blosc.h5

In [13]: !ptrepack --complib zlib --chunkshape auto --propindexes test.h5 test_zlib.h5

In [14]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback users 866182540 Oct  4 20:31 test.h5
-rw-rw-r-- 1 jreback users 976674013 Oct  4 20:36 test_blosc.h5
-rw-rw-r-- 1 jreback users 976674013 Oct  4  2014 test_zlib.h5
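The ptrepack calls can also be driven from a Python 3 script instead of the IPython shell escape. The following is a minimal sketch: the helper names ptrepack_command and repack are hypothetical, and only the flags already used in the commands above (--complib, --chunkshape, --propindexes) are passed.

```python
import shutil
import subprocess


def ptrepack_command(src, dst, complib="blosc"):
    """Build the argument list for a ptrepack call like the ones above."""
    return [
        "ptrepack",
        "--complib", complib,
        "--chunkshape", "auto",
        "--propindexes",
        src,
        dst,
    ]


def repack(src, dst, complib="blosc"):
    """Run ptrepack if it is on PATH; return True when it was executed."""
    if shutil.which("ptrepack") is None:
        # PyTables command-line tools are not installed in this environment.
        return False
    # check=True raises CalledProcessError if ptrepack exits with an error.
    subprocess.run(ptrepack_command(src, dst, complib), check=True)
    return True
```

For example, repack('test.h5', 'test_zlib.h5', complib='zlib') would issue the second command shown above, provided ptrepack is installed.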

Perf is quite similar (for the found rows):

In [10]: %timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 337 ms per loop

In [15]: %timeit pd.read_hdf('test_blosc.h5','df',where='A = "foo00002"')
1 loops, best of 3: 345 ms per loop

In [16]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 347 ms per loop

And for the missing rows (though the compressed files do perform better here):

In [11]: %timeit pd.read_hdf('test.h5','df',where='A = "bar"')
10 loops, best of 3: 82.4 ms per loop

In [17]: %timeit pd.read_hdf('test_blosc.h5','df',where='A = "bar"')
10 loops, best of 3: 32.2 ms per loop

In [18]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
10 loops, best of 3: 32.3 ms per loop

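So: try it without the expectedrows specifier, and use ptrepack.

Another possibility, if you expect this column to have a relatively low density of entries (e.g. a smallish number of unique entries), is to select the entire column, store.select_column('df','A').unique() in this case, and use that as a quick lookup mechanism (so you skip the search entirely for values that are not there).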
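A rough sketch of that lookup idea follows. The function names build_lookup and select_if_present are hypothetical, and the choice to return None for an unknown value is an assumption made for the example; the only store methods used are select_column and select.

```python
def build_lookup(store, key, column):
    """Read one data column and return its distinct values as a set."""
    return set(store.select_column(key, column).unique())


def select_if_present(store, key, column, value, known_values):
    """Query the store only when `value` is known to occur in `column`.

    Returns None for a value that is not in `known_values`, so no
    on-disk search is started for it.
    """
    if value not in known_values:
        return None
    # Assumes `value` contains no double quotes.
    return store.select(key, where='%s == "%s"' % (column, value))
```

With an open store, known = build_lookup(store, 'df', 'A') is computed once, and select_if_present(store, 'df', 'A', 'bar', known) then returns None without touching the file. This only pays off when the set of unique values is small enough to hold in memory.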

Regarding Pandas HDFStore : slow on query for non-matching string, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26197622/
