python - Missing one value when retrieving a Pandas DataFrame from an HDFStore using select with DateTimeIndex terms

Tags: python pandas hdf5 pytables

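I am trying to retrieve stored data from an HDFStore with Pandas, using select and terms. A plain select() with no conditions returns all the data. However, when I filter on the DateTimeIndex, everything is returned except the last row.

I suspect something is off in how the timestamps are stored internally and in their precision, but I can't see why it fails or what I can do about it. Any pointers would help, since I'm new to this.

I wrote a small "unit test" to investigate...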

import os
import tempfile
import uuid
import pandas as pd
import numpy as np
import time
import unittest
import sys


class PandasTestCase(unittest.TestCase):
    def setUp(self):
        print "Pandas version: {0}".format(pd.version.version)
        print "Python version: {0}".format(sys.version)
        self._filename = os.path.join(tempfile.gettempdir(), '{0}.{1}'.format(str(uuid.uuid4()), 'h5'))
        self._store = pd.HDFStore(self._filename)

    def tearDown(self):
        self._store.close()
        if os.path.isfile(self._filename):
            os.remove(self._filename)

    def test_filtering(self):
        t_start = time.time() * 1e+9
        t_end = t_start + 1e+9 # 1 second later, i.e. 10^9 ns
        sample_count = 1000

        timestamps = np.linspace(t_start, t_end, num=sample_count).tolist()
        data = {'channel_a': range(sample_count)}

        time_index = pd.to_datetime(timestamps, utc=True, unit='ns')
        df = pd.DataFrame(data, index=time_index, dtype=long)

        key = 'test'
        self._store.append(key, df)

        retrieved_df = self._store.select(key)
        retrieved_timestamps = np.array(retrieved_df.index.values, dtype=np.uint64).tolist()
        print "Retrieved {0} timestamps, w/o filter.".format(len(retrieved_timestamps))

        self.assertItemsEqual(retrieved_timestamps, timestamps)

        stored_time_index = self._store[key].index

        # Create a filter based on first and last values of index, i.e. from <= index <= to.
        from_filter = pd.Term('index>={0}'.format(pd.to_datetime(stored_time_index[0], utc=True, unit='ns')))
        to_filter = pd.Term('index<={0}'.format(pd.to_datetime(stored_time_index[-1], utc=True, unit='ns')))

        retrieved_df_interval = self._store.select(key, [from_filter, to_filter])
        retrieved_timestamps_interval = np.array(retrieved_df_interval.index.values, dtype=np.uint64).tolist()
        print "Retrieved {0} timestamps, using filter".format(len(retrieved_timestamps_interval))

        self.assertItemsEqual(retrieved_timestamps_interval, timestamps)


if __name__ == '__main__':
    unittest.main()

...which outputs the following:

Pandas version: 0.12.0
Python version: 2.7.3 (default, Apr 10 2013, 06:20:15) 
[GCC 4.6.3]
Retrieved 1000 timestamps, w/o filter.
Retrieved 999 timestamps, using filter
F
======================================================================
FAIL: test_filtering (__main__.PandasTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pandastest.py", line 53, in test_filtering
    self.assertItemsEqual(retrieved_timestamps_interval, timestamps)
AssertionError: Element counts were not equal:
First has 1, Second has 0:  1.377701660170978e+18

----------------------------------------------------------------------
Ran 1 test in 0.039s

FAILED (failures=1)

Process finished with exit code 1

Update: after changing how the terms are created to use the alternative constructor, everything works. Like this:

    # Create a filter based on first and last values of index, i.e. from <= index <= to.
    #from_filter = pd.Term('index>={0}'.format(pd.to_datetime(stored_time_index[0], utc=True, unit='ns')))
    from_filter = pd.Term('index','>=', stored_time_index[0])
    #to_filter = pd.Term('index<={0}'.format(pd.to_datetime(stored_time_index[-1], utc=True, unit='ns')))
    to_filter = pd.Term('index','<=', stored_time_index[-1])

Best answer

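String formatting of a timestamp defaults to 6 decimal places (this is what the format call in your Term does).

Nanoseconds need 9 places, so the last three digits are lost. Use the alternative form of the Term constructor instead: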

Term("index","<=",stamp)

Here is an example:

In [2]: start = Timestamp('20130101 9:00:00')

In [3]: start.value
Out[3]: 1357030800000000000

In [5]: index = pd.to_datetime([ start.value + i for i in list(ran
Out[8]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 09:00:00, ..., 2013-01-01 09:00:00.000000999]
Length: 1000, Freq: None, Timezone: None

In [9]: df = DataFrame(randn(1000,2),index=index)

In [10]: df.to_hdf('test.h5','df',mode='w',fmt='t')

In [12]: pd.read_hdf('test.h5','df')
Out[12]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2013-01-01 09:00:00 to 2013-01-01 09:00:00
Data columns (total 2 columns):
0    1000  non-null values
1    1000  non-null values
dtypes: float64(2)

In [15]: pd.read_hdf('test.h5','df',where=[pd.Term('index','<=',index[-1])])
Out[15]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1000 entries, 2013-01-01 09:00:00 to 2013-01-01 09:00:00
Data columns (total 2 columns):
0    1000  non-null values
1    1000  non-null values
dtypes: float64(2)

In [16]: pd.read_hdf('test.h5','df',where=[pd.Term('index','<=',index[-1].value-1)])
Out[16]: 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 999 entries, 2013-01-01 09:00:00 to 2013-01-01 09:00:00
Data columns (total 2 columns):
0    999  non-null values
1    999  non-null values
dtypes: float64(2)

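Note that in 0.13 (this example uses master) this will be easier: you can include it directly, as in 'index<=index[-1]' (the index on the rhs of the expression is actually the local variable index).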
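
The truncation described above can be modeled with plain integer arithmetic, which may make it clearer why only the last row drops out. This is a rough sketch, not pandas code: integers stand in for nanosecond timestamps, truncate_to_microseconds is a hypothetical helper, and the 1001001 ns spacing is an assumption chosen to resemble the question's one second spread over 1000 samples.

```python
# Hypothetical model of the precision loss; no pandas internals are used.

def truncate_to_microseconds(ns):
    """Drop the last three digits, as a 6-decimal string format would."""
    return ns - ns % 1000

start = 1357030800000000000  # 2013-01-01 09:00:00 in ns, taken from the answer
step = 1001001               # assumed spacing in ns between samples
index = [start + i * step for i in range(1000)]

# Bounds as they would look after a round trip through a 6-decimal string.
lower = truncate_to_microseconds(index[0])
upper = truncate_to_microseconds(index[-1])

# Filtering with truncated bounds versus the exact nanosecond bounds.
selected = [t for t in index if lower <= t <= upper]
exact = [t for t in index if index[0] <= t <= index[-1]]

print(len(selected))  # 999
print(len(exact))     # 1000
```

Truncation can only lower a bound. A lowered lower bound still admits the first row, while a lowered upper bound falls just short of the last row, which matches the 999 rows seen in the question.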

Original question on Stack Overflow: https://stackoverflow.com/questions/18491609/
