python - What is a quick way to get the correct aggregated output for time-series data with pandas?

Tags: python pandas time-series

I am working with Redfin real-estate data that records monthly house sale prices for each area of the Chicago region over several years. I want to compute the yearly average house-price change for the city as a whole, and also the yearly house-price change for each individual area. Then I want to compare each area's yearly change against the citywide yearly average change and add a new column with a binary value (1, 0) for each area and year: 1 if the area's yearly price change is greater than the citywide yearly average change, otherwise 0.

For example, between February 2012 and February 2013 the yearly house-price change in the Austin area was 5%, while the yearly average for the Chicago area was 7%, so I would put the value 0 into the price_label column.
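The labeling rule described here is just a comparison of two percentage figures; a minimal sketch using the numbers from this example:

```python
# yearly changes from the example above, as fractions
region_change = 0.05    # Austin area, Feb 2012 - Feb 2013
citywide_change = 0.07  # Chicago-area yearly average

# 1 if the area beat the citywide average change, else 0
price_label = int(region_change > citywide_change)
print(price_label)  # 0
```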

Is there an easy way to do this kind of aggregation on time-series data?

I have posted this question several times and tried to solve it myself, but never got the right output. Can anyone point me toward a correct solution? Thanks.

Sample data:

from pandas import Timestamp

dicts = {'Region': {0: 'Chicago, IL metro area',
  1: 'Chicago, IL',
  2: 'Chicago, IL - Albany Park',
  3: 'Chicago, IL - Andersonville'},
 Timestamp('2012-02-01 00:00:00'): {0: 88.4, 1: 95.1, 2: 76.8, 3: 193.4},
 Timestamp('2012-03-01 00:00:00'): {0: 93.3, 1: 103.6, 2: 77.9, 3: 169.2},
 Timestamp('2012-04-01 00:00:00'): {0: 97.6, 1: 120.4, 2: 80.9, 3: 157.3},
 Timestamp('2012-05-01 00:00:00'): {0: 102.0, 1: 130.6, 2: 98.4, 3: 156.8},
 Timestamp('2012-06-01 00:00:00'): {0: 110.7, 1: 150.8, 2: 109.8, 3: 175.4},
 Timestamp('2012-07-01 00:00:00'): {0: 109.3, 1: 133.6, 2: 102.6, 3: 188.8},
 Timestamp('2012-08-01 00:00:00'): {0: 106.9, 1: 140.5, 2: 89.0, 3: 194.8},
 Timestamp('2012-09-01 00:00:00'): {0: 103.4, 1: 137.5, 2: 87.5, 3: 206.9},
 Timestamp('2012-10-01 00:00:00'): {0: 98.5, 1: 121.4, 2: 98.7, 3: 209.2},
 Timestamp('2012-11-01 00:00:00'): {0: 97.8, 1: 125.0, 2: 94.1, 3: 211.5},
 Timestamp('2012-12-01 00:00:00'): {0: 97.1, 1: 120.9, 2: 93.3, 3: 183.8},
 Timestamp('2013-01-01 00:00:00'): {0: 94.4, 1: 110.7, 2: 89.4, 3: 181.4},
 Timestamp('2013-02-01 00:00:00'): {0: 91.1, 1: 104.8, 2: 95.1, 3: 177.2},
 Timestamp('2013-03-01 00:00:00'): {0: 94.7, 1: 123.0, 2: 94.9, 3: 180.6},
 Timestamp('2013-04-01 00:00:00'): {0: 100.9, 1: 126.8, 2: 101.4, 3: 203.0},
 Timestamp('2013-05-01 00:00:00'): {0: 109.3, 1: 156.1, 2: 127.9, 3: 218.0},
 Timestamp('2013-06-01 00:00:00'): {0: 116.8, 1: 165.2, 2: 125.0, 3: 218.0},
 Timestamp('2013-07-01 00:00:00'): {0: 120.8, 1: 168.2, 2: 120.8, 3: 220.3},
 Timestamp('2013-08-01 00:00:00'): {0: 119.8, 1: 164.7, 2: 113.6, 3: 208.3},
 Timestamp('2013-09-01 00:00:00'): {0: 114.2, 1: 158.5, 2: 115.3, 3: 209.7},
 Timestamp('2013-10-01 00:00:00'): {0: 116.0, 1: 156.9, 2: 127.9, 3: 205.4},
 Timestamp('2013-11-01 00:00:00'): {0: 110.0, 1: 135.3, 2: 130.5, 3: 215.0},
 Timestamp('2013-12-01 00:00:00'): {0: 112.6, 1: 146.0, 2: 126.6, 3: 212.5},
 Timestamp('2014-01-01 00:00:00'): {0: 105.2, 1: 127.9, 2: 112.3, 3: 205.7},
 Timestamp('2014-02-01 00:00:00'): {0: 104.2, 1: 126.9, 2: 106.7, 3: 202.9},
 Timestamp('2014-03-01 00:00:00'): {0: 107.1, 1: 138.5, 2: 114.3, 3: 200.0},
 Timestamp('2014-04-01 00:00:00'): {0: 114.8, 1: 155.9, 2: 119.3, 3: 210.9},
 Timestamp('2014-05-01 00:00:00'): {0: 120.1, 1: 179.4, 2: 134.5, 3: 215.4},
 Timestamp('2014-06-01 00:00:00'): {0: 126.4, 1: 186.8, 2: 141.5, 3: 225.5},
 Timestamp('2014-07-01 00:00:00'): {0: 127.2, 1: 187.5, 2: 152.1, 3: 225.5},
 Timestamp('2014-08-01 00:00:00'): {0: 128.8, 1: 186.1, 2: 156.9, 3: 222.1},
 Timestamp('2014-09-01 00:00:00'): {0: 122.2, 1: 183.3, 2: 145.1, 3: 213.2},
 Timestamp('2014-10-01 00:00:00'): {0: 120.8, 1: 161.6, 2: 147.7, 3: 228.8},
 Timestamp('2014-11-01 00:00:00'): {0: 116.7, 1: 151.3, 2: 144.4, 3: 226.3},
 Timestamp('2014-12-01 00:00:00'): {0: 117.2, 1: 154.0, 2: 145.1, 3: 238.8},
 Timestamp('2015-01-01 00:00:00'): {0: 113.4, 1: 145.8, 2: 137.2, 3: 221.6},
 Timestamp('2015-02-01 00:00:00'): {0: 108.7, 1: 139.8, 2: 140.7, 3: 232.0}}

The dictionary above is a sample snippet of the time-series data.

My attempt:

import numpy as np
import pandas as pd

# build the frame from the dict, index it by Region, and parse the dates
df_ = pd.DataFrame(dicts).set_index('Region')
df_.columns = pd.to_datetime(df_.columns)

def ratio(df):
    # equivalent to a month-over-month percent change
    return np.exp(np.log(df).diff()) - 1

keys = ['price', 'change']
pd.concat([df_, df_.groupby('Region')[df_.columns].apply(ratio)],
          axis=1, keys=keys)

But the attempt above does not return the aggregation I expect. What should I do instead? I have tried several approaches and still cannot get the result I want.
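For the yearly-change part on its own, one possible approach from the wide layout is to average each region's prices by year and then take the year-over-year percent change. A sketch with made-up toy values (not the Redfin data; region names `A`/`B` are placeholders):

```python
import pandas as pd

# toy wide frame: Region index, monthly Timestamps as columns (made-up values)
cols = pd.to_datetime(['2012-02-01', '2012-06-01', '2013-03-01', '2013-09-01'])
wide = pd.DataFrame([[100.0, 110.0, 120.0, 130.0],
                     [200.0, 190.0, 210.0, 220.0]],
                    index=pd.Index(['A', 'B'], name='Region'),
                    columns=cols)

# yearly average price per region (rows: years, columns: regions),
# then the year-over-year percent change
yearly = wide.T.groupby(wide.T.index.year).mean()
yoy = yearly.pct_change()
print(yoy.loc[2013].round(3).tolist())  # [0.19, 0.103]
```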

Update

Alternatively, I would like to compare the monthly changes over the years with the yearly average change of each area. Is there a feasible way to make this aggregation happen? Thanks.

Desired output

I want a data frame where each area's yearly price-change percentage is added as a new column, along with a binary column: 1 if the area's yearly price change is greater than the city's average yearly price change, otherwise 0.

expected_output = pd.DataFrame({
    'Year': ['2012', '2013', '2014', '2015'] * 3,
    'Area': ['Chicago, IL metro area'] * 4 + ['Chicago, IL'] * 4
            + ['Chicago, IL - Albany Park'] * 4,
    'yearly_price_change': ['5%', '10%', '7%', '21%', '15%', '12%', '2%', '21%',
                            '10%', '11%', '12%', '6%'],
    'price_label': [0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]})


How can I get the aggregation right, as in the expected data frame above? Any ideas? Thanks.

Best answer

Here is my take:

# prepare the data frame
df = pd.DataFrame(dicts).set_index('Region')
df.columns = pd.to_datetime(df.columns)

df = df.stack().reset_index()
df.columns = ['Region', 'date', 'price']
df.head()

#    Region                  date                   price
#--  ----------------------  -------------------  -------
# 0  Chicago, IL metro area  2012-02-01 00:00:00     88.4
# 1  Chicago, IL metro area  2012-03-01 00:00:00     93.3
# 2  Chicago, IL metro area  2012-04-01 00:00:00     97.6
# 3  Chicago, IL metro area  2012-05-01 00:00:00    102
# 4  Chicago, IL metro area  2012-06-01 00:00:00    110.7

# get the month-over-month price change, as I understand from the question
# (|diff| relative to the current month's price), per region
df['price_change'] = df.groupby('Region')['price'].transform(lambda x: x.diff().abs() / x)
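Note that `x.diff().abs()/x` divides by the current month's price. If the conventional change relative to the previous month is wanted instead, pandas' built-in `pct_change` does that directly; a small self-contained sketch (toy values, not the frame above):

```python
import pandas as pd

toy = pd.DataFrame({
    'Region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'price':  [100.0, 110.0, 99.0, 200.0, 210.0, 220.5],
})
# month-over-month change relative to the *previous* month, per region
toy['mom_change'] = toy.groupby('Region')['price'].pct_change().abs()
print(toy['mom_change'].round(2).tolist())  # [nan, 0.1, 0.1, nan, 0.05, 0.05]
```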

# average the monthly changes within each (Region, year) pair
new_df = df.groupby(['Region', df.date.dt.year])[['price_change']].mean()
new_df

#                                  price_change
#Region                      date
#Chicago, IL                 2012      0.082864
#                            2013      0.074208
#                            2014      0.063483
#                            2015      0.049580
#Chicago, IL - Albany Park   2012      0.074394
#                            2013      0.055192
#                            2014      0.056466
#                            2015      0.041228
#Chicago, IL - Andersonville 2012      0.066074
#                            2013      0.032249
#                            2014      0.030612
#                            2015      0.061222
#Chicago, IL metro area      2012      0.035153
#                            2013      0.040750
#                            2014      0.032648
#                            2015      0.038374

# here we compute the price_label by comparing each year against the
# average across the years for the same region: groupby(level=0)
# gathers all the records of the same region (level 0 of the index).
# if you want the average across the regions for each year instead
# (the citywide comparison), change to groupby(level=1), i.e. gather
# all records of the same year.
new_df['price_label'] = (new_df['price_change'] >
                         new_df.groupby(level=0)['price_change'].transform('mean')).astype(int)

new_df

Output:

+------------------------------+-------+---------------+-------------+
|                              |       | price_change  | price_label |
+------------------------------+-------+---------------+-------------+
| Region                       | date  |               |             |
+------------------------------+-------+---------------+-------------+
| Chicago, IL                  | 2012  | 0.082864      |           1 |
|                              | 2013  | 0.074208      |           1 |
|                              | 2014  | 0.063483      |           0 |
|                              | 2015  | 0.049580      |           0 |
| Chicago, IL - Albany Park    | 2012  | 0.074394      |           1 |
|                              | 2013  | 0.055192      |           0 |
|                              | 2014  | 0.056466      |           0 |
|                              | 2015  | 0.041228      |           0 |
| Chicago, IL - Andersonville  | 2012  | 0.066074      |           1 |
|                              | 2013  | 0.032249      |           0 |
|                              | 2014  | 0.030612      |           0 |
|                              | 2015  | 0.061222      |           1 |
| Chicago, IL metro area       | 2012  | 0.035153      |           0 |
|                              | 2013  | 0.040750      |           1 |
|                              | 2014  | 0.032648      |           0 |
|                              | 2015  | 0.038374      |           1 |
+------------------------------+-------+---------------+-------------+

I may have misunderstood something, but that is the gist :-).
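If the per-year citywide comparison from the question is wanted (each region against the average across regions for the same year), together with the tidy layout of expected_output, a hedged sketch follows. The frame below uses made-up values standing in for new_df, and the Year/Area column names simply follow the expected_output example:

```python
import pandas as pd

# made-up yearly average changes per (Region, year), standing in for new_df
new_df = pd.DataFrame(
    {'price_change': [0.08, 0.07, 0.04, 0.05]},
    index=pd.MultiIndex.from_product(
        [['Chicago, IL', 'Chicago, IL metro area'], [2012, 2013]],
        names=['Region', 'date']))

# compare each region to the average across regions for the same year
yearly_mean = new_df.groupby(level='date')['price_change'].transform('mean')
new_df['price_label'] = (new_df['price_change'] > yearly_mean).astype(int)

# reshape to the tidy layout of expected_output
out = (new_df.reset_index()
             .rename(columns={'Region': 'Area', 'date': 'Year'})
             [['Year', 'Area', 'price_change', 'price_label']])
print(out['price_label'].tolist())  # [1, 1, 0, 0]
```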

About "python - What is a quick way to get the correct aggregated output for time-series data with pandas?": we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55883846/
