python - Pandas HDFStore: selecting from nested columns

Tags: python pandas hdfstore

I have the following DataFrame, which is stored in an HDFStore object as a frame_table named data:

      shipmentid qty            
catid              1  2  3  4  5
0              0   0  0  0  0  0
1              1   0  0  0  2  0
2              2   2  0  0  0  0
3              3   0  4  0  0  0
0              0   0  0  0  0  0

I want to run store.select('data','shipmentid==2'), but I get an error saying that shipmentid is not a valid reference:

ValueError: The passed where expression: shipmentid==2
            contains an invalid variable reference
            all of the variable refrences must be a reference to
            an axis (e.g. 'index' or 'columns'), or a data_column
            The currently defined references are: columns,index

What is the correct way to write this select statement?

Edit: added sample code

import pandas as pd
from pandas import *
import random

def createFrame():
    data = {
             ('shipmentid',''):{1:1,2:2,3:3},
             ('qty',1):{1:5,2:5,3:5},
             ('qty',2):{1:6,2:6,3:6},
             ('qty',3):{1:7,2:7,3:7}
           }
    frame = pd.DataFrame(data)

    return frame

def createStore():
    # The table format is chosen per put() call; HDFStore() itself takes no format argument
    store = pd.HDFStore('sample.h5')
    return store

frame = createFrame()
print(frame)
print('\n')
print(frame.info())

store = createStore()
store.put('data',frame,format='t')
print('\n')
print(store)

results = store.select('data','shipmentid == 2')

store.close()

Best answer

I'd bet you created your store with something like this:

In [207]:

import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.randn(8,2), columns=['shipmentid', 'qty'])
store = pd.HDFStore('borrar')
store.put('data', data, format='t')

If you then try to run a select, you do get the error you describe:

In [208]:

store.select('data', 'shipmentid>0')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-211-5d0c4082cdcf> in <module>()
----> 1 store.select('data', 'shipmentid>0')

...

ValueError: The passed where expression: shipmentid>0
            contains an invalid variable reference
            all of the variable refrences must be a reference to

Instead, you can create it like this:

In [209]:

data = pd.DataFrame(np.random.randn(8,2), columns=['shipmentid', 'qty'])
data.to_hdf('borrar2', 'data', append=True, mode='w', data_columns=['shipmentid', 'qty'])

In [210]:

pd.read_hdf('borrar2', 'data', where='shipmentid>0')
Out[210]:
   shipmentid       qty
1    0.778225 -1.008529
5    0.264075 -0.651268
7    0.908880  0.153306

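(Honestly, I don't know why one way works and the other doesn't. My guess is that the first way doesn't let you specify data columns. But it's one of those things that can drive you crazy...)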
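
On the guess above: as far as I can tell, the difference is not put versus to_hdf. A where expression can only name the index, the columns axis, or a column that was declared as a data column when the table was written, and HDFStore.put also takes a data_columns argument when format='table'. A minimal sketch, assuming PyTables is installed; the scratch file path and the names selected and expected are only for illustration:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Seeded random data so the run is repeatable
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((8, 2)), columns=["shipmentid", "qty"])

# Hypothetical scratch file, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with pd.HDFStore(path, mode="w") as store:
    # format="table" makes the node queryable; data_columns lists the
    # columns that a where expression is allowed to reference
    store.put("data", data, format="table", data_columns=["shipmentid"])
    selected = store.select("data", where="shipmentid > 0")

# The same filter done in memory, for comparison
expected = data[data["shipmentid"] > 0]
print(selected)
```

Only the columns listed in data_columns become queryable, so in this sketch qty still could not appear in the where expression.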

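Edit: now that the posted code has been updated, it turns out the DataFrame has MultiIndex columns. The equivalent updated code would be something like: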

In [273]:

import pandas as pd
from pandas import *
import random

def createFrame():
    data = {
             ('shipmentid',''):{1:1,2:2,3:3},
             ('qty',1):{1:5,2:5,3:5},
             ('qty',2):{1:6,2:6,3:6},
             ('qty',3):{1:7,2:7,3:7}
           }
    frame = pd.DataFrame(data)

    return frame 

frame = createFrame()
print(frame)
print('\n')
print(frame.info())

frame.to_hdf('sample.h5', 'data', append=True, mode='w', data_columns=['shipmentid'], format='table')
pd.read_hdf('sample.h5','data', 'shipmentid == 2')

But I get an error (I guess you got the same one):

  qty       shipmentid
    1  2  3           
1   5  6  7          1
2   5  6  7          2
3   5  6  7          3


<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 1 to 3
Data columns (total 4 columns):
(qty, 1)          3 non-null int64
(qty, 2)          3 non-null int64
(qty, 3)          3 non-null int64
(shipmentid, )    3 non-null int64
dtypes: int64(4)
memory usage: 120.0 bytes
None
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-273-e10e811fc7c0> in <module>()
     23 print(frame.info())
     24 
---> 25 frame.to_hdf('sample.h5', 'data', append=True, mode='w', data_columns=['shipmentid'], format='table')
     26 pd.read_hdf('sample.h5','data', 'shipmentid == 2')
.....
stack trace
.....
ValueError: cannot use a multi-index on axis [1] with data_columns ['shipmentid']

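I looked around but couldn't come up with a solution for this. My impression from reading the code on GitHub is that the data_columns option cannot be combined with a MultiIndex. The only workaround I can think of is to write to the HDFStore as in your code, then read the full DataFrame back without any condition and do the search in memory. That is: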

new_frame = store.get('data')
print(new_frame[new_frame['shipmentid'] == 2])



<class 'pandas.io.pytables.HDFStore'>
File path: sample.h5
/data            frame_table  (typ->appendable,nrows->3,ncols->4,indexers->[index])
  qty       shipmentid
    1  2  3           
2   5  6  7          2
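
If loading the whole frame is too expensive, one possible alternative is to flatten the MultiIndex columns into plain strings before writing, so that data_columns is allowed again. This is a sketch under assumptions, not something taken from the question: the helper flatten, the qty_1 style names, and the scratch file path are all made up for illustration, and PyTables must be installed.

```python
import os
import tempfile

import pandas as pd

# Same frame as in the question, with two-level (MultiIndex) columns
frame = pd.DataFrame({
    ("shipmentid", ""): {1: 1, 2: 2, 3: 3},
    ("qty", 1): {1: 5, 2: 5, 3: 5},
    ("qty", 2): {1: 6, 2: 6, 3: 6},
    ("qty", 3): {1: 7, 2: 7, 3: 7},
})


def flatten(col):
    """Hypothetical helper: ("qty", 1) -> "qty_1", ("shipmentid", "") -> "shipmentid"."""
    return "_".join(str(part) for part in col if part != "")


# Work on a copy so the original MultiIndex frame is left untouched
flat = frame.copy()
flat.columns = [flatten(col) for col in frame.columns]

# Hypothetical scratch file, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "sample_flat.h5")

# With single-level columns, data_columns is accepted by the table format
flat.to_hdf(path, key="data", mode="w", format="table", data_columns=["shipmentid"])

# The filter now runs inside the HDF5 file rather than in memory
result = pd.read_hdf(path, key="data", where="shipmentid == 2")
print(result)
```

The trade-off is that the nested column structure is lost on disk, so it has to be rebuilt after reading if the rest of the code depends on it.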

Source: the original question, "Pandas HDFStore: selecting from nested columns", on Stack Overflow: https://stackoverflow.com/questions/29497694/
