python - Group daily data by month in python/pandas, then normalize

Tags: python pandas

I have the following table in a Pandas DataFrame:

    q_string    q_visits    q_date
0   nucleus         1790        2012-10-02 00:00:00
1   neuron          364         2012-10-02 00:00:00
2   current         280         2012-10-02 00:00:00
3   molecular       259         2012-10-02 00:00:00
4   stem            201         2012-10-02 00:00:00

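The table contains daily query volumes taken from server logs. I want to do two things:

  1. I want to group the queries by month and sum each query's volume over the whole month. For example, if "molecular" appears with volume 1000 on 2012-10-02 and with volume 500 on 2012-10-03, the new table should contain a single entry with volume 1500 dated 2012-10-31 (the month-end date stands for the whole month; every date in the transformed table will be the month-end of the month it represents).
  2. I want to add a fifth column holding the month-normalized q_visits, i.e. a term's monthly query volume divided by the total query volume of all terms in that month.

What is the best way to do this?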

Best answer

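If I understand you correctly:

For (1), do this:

First make some fake data by sampling from the values you provided, plus a few random dates and visit counts: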

In [179]: string = Series(np.random.choice(df.string.values, size=100), name='string')

In [180]: visits = Series(poisson(1000, size=100), name='visits')

In [181]: date = Series(np.random.choice([df.date[0], now(), Timestamp('1/1/2001'), Timestamp('11/15/2001'), Timestamp('12/1/01'), Timestamp('5/1/01')], size=100), dtype='datetime64[ns]', name='date')

In [182]: df = DataFrame({'string': string, 'visits': visits, 'date': date})

In [183]: df.head()
Out[183]:
                 date   string  visits
0 2001-11-15 00:00:00  current     997
1 2001-11-15 00:00:00  current     974
2 2012-10-02 00:00:00     stem     982
3 2001-12-01 00:00:00     stem     984
4 2001-01-01 00:00:00  current     989
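
The session above assumes an interactive namespace in which `Series`, `DataFrame`, `Timestamp`, `poisson` and `now` are already imported, and it samples from an existing `df`. A self-contained sketch of the same idea might look like the following. The seed, the `rng` name, and the fixed lists of terms and dates are assumptions made for illustration and are not part of the original session.

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption) so the fake data is repeatable.
rng = np.random.default_rng(0)

# Terms from the question's table; the dates are illustrative picks.
terms = np.array(['nucleus', 'neuron', 'current', 'molecular', 'stem'])
dates = pd.to_datetime(['2012-10-02', '2001-01-01', '2001-11-15',
                        '2001-12-01', '2001-05-01'])

n = 100
df = pd.DataFrame({
    # Pick a random term and a random date for every row.
    'string': terms[rng.integers(0, len(terms), size=n)],
    # Poisson-distributed visit counts centred on 1000.
    'visits': rng.poisson(1000, size=n),
    'date': dates[rng.integers(0, len(dates), size=n)],
})

print(df.head())
```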

In [186]: resamp = df.set_index('date').groupby('string').resample('M', how='sum')

In [187]: resamp.head()
Out[187]:
                    visits
string  date
current 2001-01-31    2996
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31    3016

The NaNs are there because there were no visits with that query string in those months.
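
The `how=` argument to `resample` was deprecated and later removed from pandas, so the call above may not run on a current installation. One possible alternative, sketched here under that assumption, is to roll each date forward to its month-end and then use an ordinary `groupby`. The frame below reuses the question's own example (1000 + 500 for "molecular"); the names `daily`, `month_end` and `monthly` are hypothetical.

```python
import pandas as pd

# Daily rows, using the numbers from the question's example.
daily = pd.DataFrame({
    'string': ['molecular', 'molecular', 'neuron'],
    'visits': [1000, 500, 364],
    'date': pd.to_datetime(['2012-10-02', '2012-10-03', '2012-10-02']),
})

# MonthEnd(0) rolls a date forward to the last day of its month
# and leaves dates that already are a month-end unchanged.
month_end = daily['date'] + pd.offsets.MonthEnd(0)

# Sum visits per (term, month-end) pair.
monthly = (
    daily.assign(date=month_end)
         .groupby(['string', 'date'], as_index=False)['visits']
         .sum()
)

print(monthly)
```

Because only (term, month) pairs that actually occur are grouped, this variant produces no empty months and therefore no NaN rows to drop afterwards.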

For (2), group by date and then divide by the sum:

In [188]: g = resamp.groupby(level='date').apply(lambda x: x / x.sum())

In [189]: g.head()
Out[189]:
                    visits
string  date
current 2001-01-31   0.177
        2001-02-28     NaN
        2001-03-31     NaN
        2001-04-30     NaN
        2001-05-31   0.188
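
The same normalization can also be written with `transform`, which returns the monthly total aligned to every row so that the division is a plain column operation. This is a minimal sketch on a flat frame with made-up numbers; the column name `share` and the variable names are assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical monthly totals for two terms over two months.
monthly = pd.DataFrame({
    'string': ['molecular', 'neuron', 'molecular', 'neuron'],
    'date': pd.to_datetime(['2012-10-31', '2012-10-31',
                            '2012-11-30', '2012-11-30']),
    'visits': [1500, 500, 300, 100],
})

# Total visits of all terms in the same month, repeated on each row.
month_total = monthly.groupby('date')['visits'].transform('sum')

# A term's share of that month's total query volume.
monthly['share'] = monthly['visits'] / month_total

print(monthly)
```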

Just to convince you that (2) does what you want:

In [176]: h = g.sortlevel('date').head()

In [177]: h
Out[177]:
                      visits
string    date
current   2001-01-31   0.077
molecular 2001-01-31   0.228
neuron    2001-01-31   0.073
nucleus   2001-01-31   0.234
stem      2001-01-31   0.388

In [178]: h.sum()
Out[178]:
visits    1
dtype: float64
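
The check above sums only the first five rows, which all belong to one month. A sketch of a check that covers every month at once is shown below. It reuses the five rounded shares printed above, so the comparison uses a loose tolerance; the names `shares` and `per_month` are hypothetical.

```python
import numpy as np
import pandas as pd

# The five normalized values printed above for 2001-01-31.
shares = pd.DataFrame({
    'string': ['current', 'molecular', 'neuron', 'nucleus', 'stem'],
    'date': pd.to_datetime(['2001-01-31'] * 5),
    'share': [0.077, 0.228, 0.073, 0.234, 0.388],
})

# Sum the shares within each month; every total should be about 1.
per_month = shares.groupby('date')['share'].sum()

# The inputs are rounded to three decimals, hence the loose tolerance.
assert np.allclose(per_month, 1.0, atol=1e-3)
print(per_month)
```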

If you want to drop the NaNs from resamp and turn it into a flat DataFrame, do the following:

In [196]: resamp.dropna()
Out[196]:
                      visits
string    date
current   2001-01-31    2996
          2001-05-31    3016
          2001-11-30    5959
          2001-12-31    3998
          2013-09-30    1077
molecular 2001-01-31    3984
          2001-05-31    1911
          2001-11-30    3054
          2001-12-31    1020
          2012-10-31     977
          2013-09-30    1947
neuron    2001-01-31    3961
          2001-05-31    2069
          2001-11-30    5010
          2001-12-31    2065
          2012-10-31    6973
          2013-09-30     994
nucleus   2001-01-31    3060
          2001-05-31    3035
          2001-11-30    2924
          2001-12-31    4144
          2012-10-31    2004
          2013-09-30    7881
stem      2001-01-31    2911
          2001-05-31    5994
          2001-11-30    6072
          2001-12-31    4916
          2012-10-31    1991
          2013-09-30    3977

In [197]: resamp.dropna().reset_index()
Out[197]:
       string                date  visits
0     current 2001-01-31 00:00:00    2996
1     current 2001-05-31 00:00:00    3016
2     current 2001-11-30 00:00:00    5959
3     current 2001-12-31 00:00:00    3998
4     current 2013-09-30 00:00:00    1077
5   molecular 2001-01-31 00:00:00    3984
6   molecular 2001-05-31 00:00:00    1911
7   molecular 2001-11-30 00:00:00    3054
8   molecular 2001-12-31 00:00:00    1020
9   molecular 2012-10-31 00:00:00     977
10  molecular 2013-09-30 00:00:00    1947
11     neuron 2001-01-31 00:00:00    3961
12     neuron 2001-05-31 00:00:00    2069
13     neuron 2001-11-30 00:00:00    5010
14     neuron 2001-12-31 00:00:00    2065
15     neuron 2012-10-31 00:00:00    6973
16     neuron 2013-09-30 00:00:00     994
17    nucleus 2001-01-31 00:00:00    3060
18    nucleus 2001-05-31 00:00:00    3035
19    nucleus 2001-11-30 00:00:00    2924
20    nucleus 2001-12-31 00:00:00    4144
21    nucleus 2012-10-31 00:00:00    2004
22    nucleus 2013-09-30 00:00:00    7881
23       stem 2001-01-31 00:00:00    2911
24       stem 2001-05-31 00:00:00    5994
25       stem 2001-11-30 00:00:00    6072
26       stem 2001-12-31 00:00:00    4916
27       stem 2012-10-31 00:00:00    1991
28       stem 2013-09-30 00:00:00    3977

You can of course do the same for g:

In [198]: g.dropna()
Out[198]:
                      visits
string    date
current   2001-01-31   0.177
          2001-05-31   0.188
          2001-11-30   0.259
          2001-12-31   0.248
          2013-09-30   0.068
molecular 2001-01-31   0.236
          2001-05-31   0.119
          2001-11-30   0.133
          2001-12-31   0.063
          2012-10-31   0.082
          2013-09-30   0.123
neuron    2001-01-31   0.234
          2001-05-31   0.129
          2001-11-30   0.218
          2001-12-31   0.128
          2012-10-31   0.584
          2013-09-30   0.063
nucleus   2001-01-31   0.181
          2001-05-31   0.189
          2001-11-30   0.127
          2001-12-31   0.257
          2012-10-31   0.168
          2013-09-30   0.496
stem      2001-01-31   0.172
          2001-05-31   0.374
          2001-11-30   0.264
          2001-12-31   0.305
          2012-10-31   0.167
          2013-09-30   0.251

In [199]: g.dropna().reset_index()
Out[199]:
       string                date  visits
0     current 2001-01-31 00:00:00   0.177
1     current 2001-05-31 00:00:00   0.188
2     current 2001-11-30 00:00:00   0.259
3     current 2001-12-31 00:00:00   0.248
4     current 2013-09-30 00:00:00   0.068
5   molecular 2001-01-31 00:00:00   0.236
6   molecular 2001-05-31 00:00:00   0.119
7   molecular 2001-11-30 00:00:00   0.133
8   molecular 2001-12-31 00:00:00   0.063
9   molecular 2012-10-31 00:00:00   0.082
10  molecular 2013-09-30 00:00:00   0.123
11     neuron 2001-01-31 00:00:00   0.234
12     neuron 2001-05-31 00:00:00   0.129
13     neuron 2001-11-30 00:00:00   0.218
14     neuron 2001-12-31 00:00:00   0.128
15     neuron 2012-10-31 00:00:00   0.584
16     neuron 2013-09-30 00:00:00   0.063
17    nucleus 2001-01-31 00:00:00   0.181
18    nucleus 2001-05-31 00:00:00   0.189
19    nucleus 2001-11-30 00:00:00   0.127
20    nucleus 2001-12-31 00:00:00   0.257
21    nucleus 2012-10-31 00:00:00   0.168
22    nucleus 2013-09-30 00:00:00   0.496
23       stem 2001-01-31 00:00:00   0.172
24       stem 2001-05-31 00:00:00   0.374
25       stem 2001-11-30 00:00:00   0.264
26       stem 2001-12-31 00:00:00   0.305
27       stem 2012-10-31 00:00:00   0.167
28       stem 2013-09-30 00:00:00   0.251

Finally, if you want the columns in a different order, use reindex:

In [210]: g.dropna().reset_index().reindex(columns=['visits', 'string', 'date'])
Out[210]:
    visits     string                date
0    0.177    current 2001-01-31 00:00:00
1    0.188    current 2001-05-31 00:00:00
2    0.259    current 2001-11-30 00:00:00
3    0.248    current 2001-12-31 00:00:00
4    0.068    current 2013-09-30 00:00:00
5    0.236  molecular 2001-01-31 00:00:00
6    0.119  molecular 2001-05-31 00:00:00
7    0.133  molecular 2001-11-30 00:00:00
8    0.063  molecular 2001-12-31 00:00:00
9    0.082  molecular 2012-10-31 00:00:00
10   0.123  molecular 2013-09-30 00:00:00
11   0.234     neuron 2001-01-31 00:00:00
12   0.129     neuron 2001-05-31 00:00:00
13   0.218     neuron 2001-11-30 00:00:00
14   0.128     neuron 2001-12-31 00:00:00
15   0.584     neuron 2012-10-31 00:00:00
16   0.063     neuron 2013-09-30 00:00:00
17   0.181    nucleus 2001-01-31 00:00:00
18   0.189    nucleus 2001-05-31 00:00:00
19   0.127    nucleus 2001-11-30 00:00:00
20   0.257    nucleus 2001-12-31 00:00:00
21   0.168    nucleus 2012-10-31 00:00:00
22   0.496    nucleus 2013-09-30 00:00:00
23   0.172       stem 2001-01-31 00:00:00
24   0.374       stem 2001-05-31 00:00:00
25   0.264       stem 2001-11-30 00:00:00
26   0.305       stem 2001-12-31 00:00:00
27   0.167       stem 2012-10-31 00:00:00
28   0.251       stem 2013-09-30 00:00:00
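
When all the requested columns already exist, selecting them with a list gives the same ordering as `reindex(columns=...)`. A small sketch using the first two rows of the output above (the name `out` is hypothetical):

```python
import pandas as pd

# First two rows of the normalized table shown above.
out = pd.DataFrame({
    'string': ['current', 'current'],
    'date': pd.to_datetime(['2001-01-31', '2001-05-31']),
    'visits': [0.177, 0.188],
})

# Selecting with a list of names returns the columns in that order.
reordered = out[['visits', 'string', 'date']]
print(reordered)
```

Unlike `reindex`, list selection raises a `KeyError` for a misspelled column name instead of silently creating a column of NaNs.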

Regarding "python - Group daily data by month in python/pandas, then normalize", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18677271/
