python - Problems merging on-disk tables with millions of rows

Tags: python python-2.7 pandas pytables hdfstore

TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype.

Hi! I currently have two large HDFStores, each containing a single node, and neither node fits in memory. The nodes contain no NaN values. I now want to merge the two nodes using this. I first tested on a small store where all the data fit in one block, and that worked fine. But in the case where the merge has to be done block by block, it gives me the following error: TypeError: Cannot serialize the column [date] because its data contents are [empty] object dtype.

Here is the code I am running:

>>> import pandas as pd
>>> from pandas import HDFStore
>>> print pd.__version__
0.12.0rc1

>>> h5_1 = 'I:/Data/output/test8\\var1.h5'
>>> h5_3 = 'I:/Data/output/test8\\var3.h5'
>>> h5_1temp = h5_1.replace('.h5', 'temp.h5')

>>> A = HDFStore(h5_1)
>>> B = HDFStore(h5_3)
>>> Atemp = HDFStore(h5_1temp)

>>> print A
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1.h5
/var1            frame_table  (shape->12626172)
>>> print B
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var3.h5
/var3            frame_table  (shape->6313086)

>>> nrows_a = A.get_storer('var1').nrows
>>> nrows_b = B.get_storer('var3').nrows
>>> a_chunk_size = 500000
>>> b_chunk_size = 500000
>>> for a in xrange(int(nrows_a / a_chunk_size) + 1):
...     a_start_i = a * a_chunk_size
...     a_stop_i  = min((a + 1) * a_chunk_size, nrows_a)
...     a = A.select('var1', start = a_start_i, stop = a_stop_i)
...     for b in xrange(int(nrows_b / b_chunk_size) + 1):
...         b_start_i = b * b_chunk_size
...         b_stop_i = min((b + 1) * b_chunk_size, nrows_b)
...         b = B.select('var3', start = b_start_i, stop = b_stop_i)
...         Atemp.append('mergev13', pd.merge(a, b , left_index=True, right_index=True,how='inner'))

... 
Traceback (most recent call last):
  File "<interactive input>", line 9, in <module>
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 923, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 3251, in write
    return super(AppendableMultiFrameTable, self).write(obj=obj.reset_index(), data_columns=data_columns, **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2983, in write
    **kwargs)
  File "D:\Python27\lib\site-packages\pandas\io\pytables.py", line 2715, in create_axes
    raise e
TypeError: Cannot serialize the column [date] because
its data contents are [empty] object dtype
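As an aside (an observation, not something stated in the original post): `int(nrows / chunk_size) + 1` yields one chunk too many whenever `nrows` is an exact multiple of the chunk size, and the final `select` with `start == stop` then returns an empty frame. A ceil-division helper avoids the trailing empty chunk; a sketch:

```python
import math

def chunk_bounds(nrows, chunk_size):
    """Yield (start, stop) row ranges covering nrows, with no trailing empty chunk."""
    for i in range(math.ceil(nrows / chunk_size)):
        yield i * chunk_size, min((i + 1) * chunk_size, nrows)

# int(1000000 / 500000) + 1 would give 3 chunks, the last one (1000000, 1000000);
# ceil division gives exactly 2.
print(list(chunk_bounds(1000000, 500000)))  # [(0, 500000), (500000, 1000000)]
```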

One thing I noticed is that the attributes mention pandas_version := '0.10.1', while my pandas version is 0.12.0rc1. Here is some more detailed information about the nodes:

>>> A.select_column('var1','date').unique()
array([2006001, 2006009, 2006017, 2006025, 2006033, 2006041, 2006049,
       2006057, 2006065, 2006073, 2006081, 2006089, 2006097, 2006105,
       2006113, 2006121, 2006129, 2006137, 2006145, 2006153, 2006161,
       2006169, 2006177, 2006185, 2006193, 2006201, 2006209, 2006217,
       2006225, 2006233, 2006241, 2006249, 2006257, 2006265, 2006273,
       2006281, 2006289, 2006297, 2006305, 2006313, 2006321, 2006329,
       2006337, 2006345, 2006353, 2006361], dtype=int64)

>>> B.select_column('var3','date').unique()
array([2006001, 2006017, 2006033, 2006049, 2006065, 2006081, 2006097,
       2006113, 2006129, 2006145, 2006161, 2006177, 2006193, 2006209,
       2006225, 2006241, 2006257, 2006273, 2006289, 2006305, 2006321,
       2006337, 2006353], dtype=int64)

>>> A.get_storer('var1').levels
['x', 'y', 'date']

>>> A.get_storer('var1').attrs
/var1._v_attrs (AttributeSet), 12 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['date', 'y', 'x'],
    index_cols := [(0, 'index')],
    levels := ['x', 'y', 'date'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['x', 'y', 'date', 'var1'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_multiframe',
    values_cols := ['values_block_0', 'date', 'y', 'x']]

>>> A.get_storer('var1').table
/var1/table (Table(12626172,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "date": Int64Col(shape=(), dflt=0, pos=2),
  "y": Int64Col(shape=(), dflt=0, pos=3),
  "x": Int64Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (3276,)
  autoIndex := True
  colindexes := {
    "date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

>>> B.get_storer('var3').levels
['x', 'y', 'date']

>>> B.get_storer('var3').attrs
/var3._v_attrs (AttributeSet), 12 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['date', 'y', 'x'],
    index_cols := [(0, 'index')],
    levels := ['x', 'y', 'date'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['x', 'y', 'date', 'var3'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_multiframe',
    values_cols := ['values_block_0', 'date', 'y', 'x']]

>>> B.get_storer('var3').table
/var3/table (Table(6313086,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "date": Int64Col(shape=(), dflt=0, pos=2),
  "y": Int64Col(shape=(), dflt=0, pos=3),
  "x": Int64Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (3276,)
  autoIndex := True
  colindexes := {
    "date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "y": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "x": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

>>> print Atemp
<class 'pandas.io.pytables.HDFStore'>
File path: I:/Data/output/test8\var1temp.h5
/mergev13            frame_table  (shape->823446)

Since the chunk size is 500000 and the node in Atemp has shape 823446, at least one block was merged. But I can't figure out where the error comes from, and I have no clue how to find out what exactly went wrong. Any help is greatly appreciated.
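The situation can be reproduced conceptually: when two chunks cover disjoint (x, y, date) ranges, an inner merge yields a zero-row frame, and it is the append of that empty frame that triggers the TypeError above. A minimal pandas-only sketch with hypothetical toy data:

```python
import pandas as pd

# Toy frames indexed the way var1/var3 are (x, y, date); values are made up.
a_full = pd.DataFrame({"var1": range(6)},
                      index=pd.MultiIndex.from_product(
                          [[1, 2], [10], [2006001, 2006009, 2006017]],
                          names=["x", "y", "date"]))
b_full = pd.DataFrame({"var3": range(3)},
                      index=pd.MultiIndex.from_product(
                          [[3], [10], [2006001, 2006009, 2006017]],
                          names=["x", "y", "date"]))

# The x ranges (1-2 vs 3) do not intersect, so the inner merge has zero rows.
merged = pd.merge(a_full, b_full, left_index=True, right_index=True, how="inner")
print(len(merged))  # 0
```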

EDIT

By reducing the chunk size on my test stores I can reproduce the same error. Not great, of course, but it now gives me the opportunity to share. Click here for the code + HDFStores.

Best Answer

The merged frame probably has no rows for some chunk pairs. Appending a len-zero frame is an error (though the message should be more informative).

Check the length before appending:

df = pd.merge(a, b, left_index=True, right_index=True, how='inner')

if len(df):
    Atemp.append('mergev46', df)
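Applied to the full double loop, the guard looks like this. The sketch below is runnable with small in-memory frames standing in for the stores (names and data are hypothetical); with real stores, `A.select(..., start=..., stop=...)` replaces the `.iloc` slicing and `Atemp.append('mergev13', df)` replaces collecting pieces:

```python
import pandas as pd

# In-memory stand-ins for the on-disk nodes (hypothetical toy data).
idx = pd.MultiIndex.from_product([[1, 2], [2006001, 2006009]], names=["x", "date"])
var1 = pd.DataFrame({"var1": [1.0, 2.0, 3.0, 4.0]}, index=idx)
var3 = pd.DataFrame({"var3": [10.0, 20.0]}, index=idx[:2])

chunk_size = 2
pieces = []
for a0 in range(0, len(var1), chunk_size):
    a = var1.iloc[a0:a0 + chunk_size]          # stand-in for A.select(...)
    for b0 in range(0, len(var3), chunk_size):
        b = var3.iloc[b0:b0 + chunk_size]      # stand-in for B.select(...)
        df = pd.merge(a, b, left_index=True, right_index=True, how="inner")
        if len(df):                            # the guard: skip zero-row results
            pieces.append(df)                  # stand-in for Atemp.append(...)

merged = pd.concat(pieces)
print(len(merged))  # 2
```

Here the second var1 chunk (x == 2) has no overlap with var3, produces a zero-row merge, and is skipped instead of being appended.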

Results with the dataset you provided:

<class 'pandas.io.pytables.HDFStore'>
File path: var4.h5
/var4            frame_table  (shape->1334)
<class 'pandas.io.pytables.HDFStore'>
File path: var6.h5
/var6            frame_table  (shape->667)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1334 entries, (928, 310, 2006001) to (1000, 238, 2006361)
Data columns (total 1 columns):
var4    1334  non-null values
dtypes: float64(1)
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 667 entries, (928, 310, 2006001) to (1000, 238, 2006353)
Data columns (total 1 columns):
var6    667  non-null values
dtypes: float64(1)
<class 'pandas.io.pytables.HDFStore'>
File path: var4temp.h5
/mergev46            frame_table  (shape->977)

You should close the files when you are done with them (FYI):

Closing remaining open files: var6.h5... done var4.h5... done var4temp.h5... done

Regarding "python - Problems merging on-disk tables with millions of rows", the original question is on Stack Overflow: https://stackoverflow.com/questions/17691912/
