python - 子集数据框 Pandas 时间序列

标签 python pandas time-series subset

**根据提供的答案更新了代码** 实现的解决方案不会对原始数据帧进行子集化。



    In [1]: thresh_eval.head()

    Out[1]:
        WDIR    WSPD    GDR     GST     GTIME
    TX_DTTM                     
    2010-01-01 05:50:00     235     10.9    238     13.4    540
    2010-01-02 00:20:00     329     10.6    NaN     NaN     NaN
    2010-01-02 00:30:00     329     10.8    NaN     NaN     NaN
    2010-01-02 00:40:00     329     12.1    NaN     NaN     NaN
    2010-01-02 00:50:00     332     12.2    330     14.8    46

    In [2]: len(thresh_eval)

    Out[2]: 5503

    In [3]: unique(thresh_eval.index.date)

    Out[3]:

    array([datetime.date(2010, 1, 1), datetime.date(2010, 1, 2),
           datetime.date(2010, 1, 3), datetime.date(2010, 1, 4),
           datetime.date(2010, 1, 6), datetime.date(2010, 1, 8),
           datetime.date(2010, 1, 9), datetime.date(2010, 1, 12),
           datetime.date(2010, 1, 16), datetime.date(2010, 1, 17),
           datetime.date(2010, 1, 18), datetime.date(2010, 1, 21),
           datetime.date(2010, 1, 22), datetime.date(2010, 1, 23),
           datetime.date(2010, 1, 24), datetime.date(2010, 1, 25),
           datetime.date(2010, 1, 26), datetime.date(2010, 1, 27),
           datetime.date(2010, 1, 29), datetime.date(2010, 1, 30),
           datetime.date(2010, 1, 31), datetime.date(2010, 2, 1),
           datetime.date(2010, 2, 2), datetime.date(2010, 2, 3),
           datetime.date(2010, 2, 4), datetime.date(2010, 2, 5),
           datetime.date(2010, 2, 6), datetime.date(2010, 2, 7),
           datetime.date(2010, 2, 9), datetime.date(2010, 2, 10),
           datetime.date(2010, 2, 11), datetime.date(2010, 2, 12),
           datetime.date(2010, 2, 13), datetime.date(2010, 2, 14),
           datetime.date(2010, 2, 15), datetime.date(2010, 2, 16),
           datetime.date(2010, 2, 17), datetime.date(2010, 2, 18),
           datetime.date(2010, 2, 22), datetime.date(2010, 2, 25),
           datetime.date(2010, 2, 26), datetime.date(2010, 2, 27),
           datetime.date(2010, 2, 28), datetime.date(2010, 3, 2),
           datetime.date(2010, 3, 3), datetime.date(2010, 3, 12),
           datetime.date(2010, 3, 13), datetime.date(2010, 3, 14),
           datetime.date(2010, 3, 15), datetime.date(2010, 3, 18),
           datetime.date(2010, 3, 21), datetime.date(2010, 3, 22),
           datetime.date(2010, 3, 23), datetime.date(2010, 3, 26),
           datetime.date(2010, 3, 27), datetime.date(2010, 3, 28),
           datetime.date(2010, 3, 29), datetime.date(2010, 3, 30),
           datetime.date(2010, 4, 9), datetime.date(2010, 4, 17),
           datetime.date(2010, 4, 18), datetime.date(2010, 4, 25),
           datetime.date(2010, 4, 26), datetime.date(2010, 4, 27),
           datetime.date(2010, 4, 28), datetime.date(2010, 5, 3),
           datetime.date(2010, 5, 8), datetime.date(2010, 5, 9),
           datetime.date(2010, 5, 17), datetime.date(2010, 5, 24),
           datetime.date(2010, 5, 25), datetime.date(2010, 5, 26),
           datetime.date(2010, 6, 2), datetime.date(2010, 6, 3),
           datetime.date(2010, 6, 6), datetime.date(2010, 6, 7),
           datetime.date(2010, 6, 16), datetime.date(2010, 6, 28),
           datetime.date(2010, 7, 2), datetime.date(2010, 7, 3),
           datetime.date(2010, 7, 10), datetime.date(2010, 7, 16),
           datetime.date(2010, 7, 22), datetime.date(2010, 7, 26),
           datetime.date(2010, 7, 28), datetime.date(2010, 7, 30),
           datetime.date(2010, 8, 1), datetime.date(2010, 8, 7),
           datetime.date(2010, 8, 23), datetime.date(2010, 8, 24),
           datetime.date(2010, 9, 2), datetime.date(2010, 9, 12),
           datetime.date(2010, 9, 27), datetime.date(2010, 9, 29),
           datetime.date(2010, 9, 30), datetime.date(2010, 10, 2),
           datetime.date(2010, 10, 3), datetime.date(2010, 10, 15),
           datetime.date(2010, 10, 16), datetime.date(2010, 10, 25),
           datetime.date(2010, 10, 26), datetime.date(2010, 10, 27),
           datetime.date(2010, 10, 29), datetime.date(2010, 11, 2),
           datetime.date(2010, 11, 3), datetime.date(2010, 11, 4),
           datetime.date(2010, 11, 5), datetime.date(2010, 11, 6),
           datetime.date(2010, 11, 7), datetime.date(2010, 11, 9),
           datetime.date(2010, 11, 12), datetime.date(2010, 11, 16),
           datetime.date(2010, 11, 17), datetime.date(2010, 11, 26),
           datetime.date(2010, 11, 27), datetime.date(2010, 11, 28),
           datetime.date(2010, 11, 29), datetime.date(2010, 11, 30),
           datetime.date(2010, 12, 1), datetime.date(2010, 12, 2),
           datetime.date(2010, 12, 4), datetime.date(2010, 12, 5),
           datetime.date(2010, 12, 6), datetime.date(2010, 12, 7),
           datetime.date(2010, 12, 11), datetime.date(2010, 12, 12),
           datetime.date(2010, 12, 13), datetime.date(2010, 12, 14),
           datetime.date(2010, 12, 16), datetime.date(2010, 12, 17),
           datetime.date(2010, 12, 18), datetime.date(2010, 12, 19),
           datetime.date(2010, 12, 20), datetime.date(2010, 12, 22),
           datetime.date(2010, 12, 23), datetime.date(2010, 12, 24),
           datetime.date(2010, 12, 26), datetime.date(2010, 12, 27),
           datetime.date(2010, 12, 28)], dtype=object)

    In [4]: ais.head()

    Out[4]:
        MMSI    LAT     LON     COURSE_OVER_GROUND  NAV_STATUS  POS_ACCURACY    RATE_OF_TURN    SPEED_OVER_GROUND   HEADING
    TX_DTTM                                     
    2010-01-01 00:00:19     12345678    32.834746   -79.929589  1820    0   0   128     71  NaN
    2010-01-01 00:00:29     12345678    32.834384   -79.929602  1832    0   0   128     71  NaN
    2010-01-01 00:00:40     12345678    32.834058   -79.929619  1836    0   0   128     70  NaN
    2010-01-01 00:00:50     12345678    32.833703   -79.929647  1847    0   0   128     70  NaN
    2010-01-01 00:01:00     12345678    32.833386   -79.929689  1897    0   0   128     69  NaN

    In [5]: unique(ais.index.date)

    Out[5]:

    array([datetime.date(2010, 1, 1), datetime.date(2010, 1, 4),
           datetime.date(2010, 1, 5), datetime.date(2010, 1, 6),
           datetime.date(2010, 1, 7), datetime.date(2010, 1, 8),
           datetime.date(2010, 1, 9), datetime.date(2010, 1, 10),
           datetime.date(2010, 1, 11), datetime.date(2010, 1, 12),
           datetime.date(2010, 1, 13), datetime.date(2010, 1, 14),
           datetime.date(2010, 1, 15), datetime.date(2010, 1, 16),
           datetime.date(2010, 1, 17), datetime.date(2010, 1, 18),
           datetime.date(2010, 1, 19), datetime.date(2010, 1, 20),
           datetime.date(2010, 1, 21), datetime.date(2010, 1, 22),
           datetime.date(2010, 1, 23), datetime.date(2010, 1, 24),
           datetime.date(2010, 1, 25), datetime.date(2010, 1, 26),
           datetime.date(2010, 1, 27), datetime.date(2010, 1, 28),
           datetime.date(2010, 1, 29), datetime.date(2010, 1, 30),
           datetime.date(2010, 1, 31), datetime.date(2010, 2, 1)], dtype=object)

    In [6]: len(ais)

    Out[6]: 2750499

    In [7]: ais[Index(ais.index.date).isin(Index(thresh_eval.index.date))]

    Out[7]:
        MMSI    LAT     LON     COURSE_OVER_GROUND  NAV_STATUS  POS_ACCURACY    RATE_OF_TURN    SPEED_OVER_GROUND   HEADING
    TX_DTTM                                     
    2010-01-01 00:00:19     12345678    32.834746   -79.929589  1820    0   0   128     71  NaN
    2010-01-01 00:00:29     12345678    32.834384   -79.929602  1832    0   0   128     71  NaN
    2010-01-01 00:00:40     12345678    32.834058   -79.929619  1836    0   0   128     70  NaN
    2010-01-01 00:00:50     12345678    32.833703   -79.929647  1847    0   0   128     70  NaN
    2010-01-01 00:01:00     12345678    32.833386   -79.929689  1897    0   0   128     69  NaN
    2010-01-01 00:01:06     12345678    32.833106   -79.929757  1934    0   0   128     69  NaN
    2010-01-01 00:01:16     12345678    32.832830   -79.929850  1978    0   0   128     69  NaN
    2010-01-01 00:01:26     12345678    32.832495   -79.929990  2010    0   0   128     69  NaN

    In [8]: len(ais)

    Out[8]: 2750499

    In [9]: unique(ais.index.date)

    Out[9]:

    array([datetime.date(2010, 1, 1), datetime.date(2010, 1, 4),
           datetime.date(2010, 1, 5), datetime.date(2010, 1, 6),
           datetime.date(2010, 1, 7), datetime.date(2010, 1, 8),
           datetime.date(2010, 1, 9), datetime.date(2010, 1, 10),
           datetime.date(2010, 1, 11), datetime.date(2010, 1, 12),
           datetime.date(2010, 1, 13), datetime.date(2010, 1, 14),
           datetime.date(2010, 1, 15), datetime.date(2010, 1, 16),
           datetime.date(2010, 1, 17), datetime.date(2010, 1, 18),
           datetime.date(2010, 1, 19), datetime.date(2010, 1, 20),
           datetime.date(2010, 1, 21), datetime.date(2010, 1, 22),
           datetime.date(2010, 1, 23), datetime.date(2010, 1, 24),
           datetime.date(2010, 1, 25), datetime.date(2010, 1, 26),
           datetime.date(2010, 1, 27), datetime.date(2010, 1, 28),
           datetime.date(2010, 1, 29), datetime.date(2010, 1, 30),
           datetime.date(2010, 1, 31), datetime.date(2010, 2, 1)], dtype=object)

**原始问题:** 我试图根据数据帧的日期时间索引与另一个数据帧的日期时间索引之间的比较来对数据帧进行子集化。 df1 是用作过滤器的下采样时间序列的数据帧。 df2是要过滤的记录的数据帧,它具有更高的时间分辨率,并且每个日期出现在df1中的多个记录:

In [1]: df1
    Out[1]:
                 WSPD        cd
    date                           
    2010-07-10  11.325645  0.000019
    2010-08-23  12.258462  0.000019
    2010-11-09  10.771429  0.000019
    2010-11-12  10.650000  0.000019
    2010-11-16  11.939535  0.000019
    ...

    In [2]: df2
    Out[2]:
                             ID   Latitude  Longitude  Course  RateOfTurn  
    TimeStamp                                                                  
    2010-06-26 22:36:11  311425000  32.832500 -79.929000       3           0   
    2010-06-26 22:36:21  311425000  32.832845 -79.929037       3           0   
    2010-06-26 22:36:32  311425000  32.833333 -79.929000       3           0   
    2010-06-26 22:36:42  311425000  32.833666 -79.929000       3           0 
    2010-07-10 07:37:21  548723000  32.832333 -79.929000     1.0           0   
    2010-07-10 07:37:31  548723000  32.832666 -79.929000     1.0           0   
    2010-07-10 07:37:40  548723000  32.833000 -79.929000     2.0           0   
    2010-07-10 07:37:51  548723000  32.833333 -79.929000     1.0           0   
    2010-07-10 07:38:04  548723000  32.833666 -79.929000     0.0           0   
    2010-08-23 09:29:48  311425000  32.832590 -79.928985     0.0           0   
    2010-08-23 09:30:00  311425000  32.833053 -79.928970     1.0           0   
    2010-08-23 09:30:10  311425000  32.833443 -79.928957     1.0           0   
    2010-08-23 09:30:18  311425000  32.833746 -79.928944     2.0           0   
    ...

    In [3]: list = []
            for i,item in enumerate(df2.index.date): 
                if item in df1.index.date:
                    list.append(item)

    In [4]: list
    out[4]: [datetime.date(2010, 8, 23),
     datetime.date(2010, 8, 23),
     datetime.date(2010, 8, 23),
     datetime.date(2010, 8, 23),
     datetime.date(2010, 7, 10),
     datetime.date(2010, 7, 10),
     datetime.date(2010, 7, 10),
     datetime.date(2010, 7, 10),
     datetime.date(2010, 7, 10)]

我正在丢失索引之外的内容。我真的很想要 df2 中的记录子集,包括所有数据,其日期时间与 df1 的日频率匹配,例如:


    2010-07-10 07:37:21  548723000  32.832333 -79.929000     1.0           0   
    2010-07-10 07:37:31  548723000  32.832666 -79.929000     1.0           0   
    2010-07-10 07:37:40  548723000  32.833000 -79.929000     2.0           0   
    2010-07-10 07:37:51  548723000  32.833333 -79.929000     1.0           0   
    2010-07-10 07:38:04  548723000  32.833666 -79.929000     0.0           0   
    2010-08-23 09:29:48  311425000  32.832590 -79.928985     0.0           0   
    2010-08-23 09:30:00  311425000  32.833053 -79.928970     1.0           0   
    2010-08-23 09:30:10  311425000  32.833443 -79.928957     1.0           0   
    2010-08-23 09:30:18  311425000  32.833746 -79.928944     2.0           0   

如有任何帮助,我们将不胜感激!

最佳答案

使用isin方法:

In [33]: import datetime

In [34]: import pandas as pd

In [35]: from pandas import DataFrame, Index

In [36]: from numpy.random import randn, unique, array

In [37]: df1 = DataFrame({'lat': randn(48), 'long': randn(48)}, index=pd.date_range('2013-01-02',periods=4
8,freq='H'))

In [38]: df2 = DataFrame({'lat': randn(72), 'long': randn(72)}, index=pd.date_range('2013-01-02',periods=7
2,freq='H'))

In [39]: df1.head()
Out[39]:
                        lat    long
2013-01-02 00:00:00  0.7310  0.3083
2013-01-02 01:00:00  1.8540  0.7355
2013-01-02 02:00:00  0.3097 -0.1834
2013-01-02 03:00:00  0.8455  0.8350
2013-01-02 04:00:00  0.4017  0.0559

[5 rows x 2 columns]

In [40]: df2.head()
Out[40]:
                        lat    long
2013-01-02 00:00:00  1.4248  0.2289
2013-01-02 01:00:00 -0.5055  0.1072
2013-01-02 02:00:00 -1.8265 -1.0651
2013-01-02 03:00:00  0.5888  0.3992
2013-01-02 04:00:00 -1.5210  0.0710

[5 rows x 2 columns]

In [41]: df2[Index(df2.index.date).isin(Index(df1.index.date))]
Out[41]:
                        lat    long
2013-01-02 00:00:00  1.4248  0.2289
2013-01-02 01:00:00 -0.5055  0.1072
2013-01-02 02:00:00 -1.8265 -1.0651
2013-01-02 03:00:00  0.5888  0.3992
2013-01-02 04:00:00 -1.5210  0.0710
2013-01-02 05:00:00  0.8382 -1.5569
2013-01-02 06:00:00 -0.7878  0.9253
2013-01-02 07:00:00 -0.1686 -1.0128
2013-01-02 08:00:00 -0.2481 -0.4247
2013-01-02 09:00:00  0.0794 -0.1947
2013-01-02 10:00:00 -0.5046 -0.1535
2013-01-02 11:00:00  0.0696 -1.5125
2013-01-02 12:00:00  1.1984 -0.1880
2013-01-02 13:00:00  0.8251 -0.2588
2013-01-02 14:00:00  1.5858 -1.2998
2013-01-02 15:00:00  0.2727 -0.3030
2013-01-02 16:00:00  0.9459 -0.8018
2013-01-02 17:00:00 -1.5055 -1.1344
2013-01-02 18:00:00  0.3970  0.7449
2013-01-02 19:00:00 -1.0256  0.2245
2013-01-02 20:00:00  0.8322  0.6473
2013-01-02 21:00:00  0.2759  1.4096
2013-01-02 22:00:00 -0.5167  1.5676
2013-01-02 23:00:00  0.4620  0.4936
2013-01-03 00:00:00  1.4400  0.5696
                        ...     ...

[48 rows x 2 columns]

您可以通过比较来检查结果是否仅包含与日频率重叠的日期索引

In [42]: unique(df2[Index(df2.index.date).isin(Index(df1.index.date))].index.date)
Out[42]: array([datetime.date(2013, 1, 2), datetime.date(2013, 1, 3)], dtype=object)

In [43]: unique(df1.index.date)
Out[43]: array([datetime.date(2013, 1, 2), datetime.date(2013, 1, 3)], dtype=object)

关于python - 子集数据框 Pandas 时间序列,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/23346905/

相关文章:

python - 如何使用 Python 重命名文件

python - Pandas:标记重叠日期,但如果满足条件则排除某些行

python - 如果每个值都相等,则删除 pandas 数据框行

machine-learning - 基于小序列的后续序列预测

python - 有没有办法将时间序列数据重新采样为 x 小时并以 One-Hot 编码格式获得输出?

time-series - 按列从 TimeArray 中提取另一个

python - 优达学城 CS101 : Not sure if this nested dictionary question is wrong or I'm missing something

python - 是否可以在 virtualenv 中安装 supervisord?

Python 编程 3.4.2,零与叉。检查如何查看游戏是否获胜的最佳方法?

python - 如何以特殊形式添加一行