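python - Data frame wrangling with dates and periods in Pandas

There are a number of things I would typically do in SQL and Excel that I am trying to do with Pandas. There are a few different wrangling problems here, combined into one question because they all share the same goal.

I have a data frame df in python with three columns: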

   |  EventID  |  PictureID  |  Date
0  |  1        |  A          |  2010-01-01
1  |  2        |  A          |  2010-02-01
2  |  3        |  A          |  2010-02-15
3  |  4        |  B          |  2010-01-01
4  |  5        |  C          |  2010-02-01
5  |  6        |  C          |  2010-02-15

EventID is unique. PictureID is not unique, although each PictureID + Date combination is distinct.

I. First, I would like to add a new column:

df['period'] = the month and year that the event falls into beginning 2010-01.

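II. Second, I would like to "melt" the data into a new data frame that counts the number of events for a given PictureID in a given period. I'll use an example with just two periods.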

   |  PictureID  |  Period  | Count
0  |  A          |  2010-01 | 1
1  |  A          |  2010-02 | 2
2  |  B          |  2010-01 | 1
3  |  C          |  2010-02 | 2

So that I can then stack (?) this new data frame into something that provides period counts for all unique PictureIDs:

   |  PictureID  |  2010-01 | 2010-02
0  |  A          |  1       | 2
1  |  B          |  1       | 0
2  |  C          |  0       | 2

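My sense is that pandas is built to do this sort of thing easily, is that correct?

[Edit: removed a confusing third part.]

Best answer

For the first two parts you can do: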

>>> df['Period'] = df['Date'].map(lambda d: d.strftime('%Y-%m'))
>>> df
   EventID PictureID                Date   Period
0        1         A 2010-01-01 00:00:00  2010-01
1        2         A 2010-02-01 00:00:00  2010-02
2        3         A 2010-02-15 00:00:00  2010-02
3        4         B 2010-01-01 00:00:00  2010-01
4        5         C 2010-02-01 00:00:00  2010-02
5        6         C 2010-02-15 00:00:00  2010-02
>>> grouped = df[['Period', 'PictureID']].groupby('Period')
>>> grouped['PictureID'].value_counts().unstack(0).fillna(0)
Period  2010-01  2010-02
A             1        2
B             1        0
C             0        2
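As a possible alternative on more recent pandas versions, the same two steps can be sketched with the `.dt` accessor and `pd.crosstab`. This assumes `Date` is already a datetime column; the sample frame below is rebuilt from the question only to keep the snippet self-contained, and the names `counts` and `wide` are just illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "EventID": [1, 2, 3, 4, 5, 6],
    "PictureID": ["A", "A", "A", "B", "C", "C"],
    "Date": pd.to_datetime([
        "2010-01-01", "2010-02-01", "2010-02-15",
        "2010-01-01", "2010-02-01", "2010-02-15",
    ]),
})

# Part I: year-month label per event, stored as a string such as "2010-01"
df["Period"] = df["Date"].dt.to_period("M").astype(str)

# Part II, long form: one row per (PictureID, Period) pair that occurs
counts = df.groupby(["PictureID", "Period"]).size().reset_index(name="Count")

# Part II, wide form: one row per PictureID, one column per Period,
# with 0 where a picture has no events in that period
wide = pd.crosstab(df["PictureID"], df["Period"])
print(wide)
```

`pd.crosstab` fills missing combinations with 0 by itself, so no separate `fillna(0)` step is needed here.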

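For the third part, either I haven't understood the question well or you haven't posted the correct numbers in your example. Shouldn't the count for A in the 3rd row be 2? And for C in the 6th row it should be 1, if the period is six months...

Either way, you should do something like this: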

>>> ts = df.set_index('Date')
>>> ts.resample('6M', ...)
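The arguments to `resample` are elided above, so the following is only one guess at what a six-month bucketing could look like, not the original call. It groups by `PictureID` together with a `pd.Grouper` on the date index instead of calling `resample` directly, and it uses the `'6MS'` (six months, anchored at month start) frequency as an assumption; `half_year` is an illustrative name.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "EventID": [1, 2, 3, 4, 5, 6],
    "PictureID": ["A", "A", "A", "B", "C", "C"],
    "Date": pd.to_datetime([
        "2010-01-01", "2010-02-01", "2010-02-15",
        "2010-01-01", "2010-02-01", "2010-02-15",
    ]),
})

ts = df.set_index("Date")

# Count events per PictureID within each six-month bin of the date index
half_year = ts.groupby(["PictureID", pd.Grouper(freq="6MS")]).size()
print(half_year)
```

All six sample events fall within the first half of 2010, so each PictureID ends up with a single bin holding its total count.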

Update: this is a pretty ugly way to do it. I think I have seen a better way, but I can't find the SO question. Still, this also gets the job done...

def for_half_year(row, data):
    date = row['Date']
    pid = row['PictureID']
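    # TODO: do this 6-month check better. For now a new window starts once
    # more than ~180 days have passed since the current window began.
    if '__start' not in data or (date - data['__start']).days > 6*30:
        # Drop all counters so every PictureID starts the new window at 0
        data.clear()
        data['__start'] = date
    # Number of earlier events for this PictureID in the current window
    data[pid] = data.get(pid, -1) + 1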
    return data[pid]

df['PastSix'] = df.apply(for_half_year, args=({},), axis=1)
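A vectorized variant of the same idea might use a time-based rolling window per picture. Note that this is a different technique from the loop above: it looks back a trailing 180 days from each event rather than resetting a shared window, and the 180-day length is an assumption standing in for "six months". The names `ordered` and `in_window` are illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "EventID": [1, 2, 3, 4, 5, 6],
    "PictureID": ["A", "A", "A", "B", "C", "C"],
    "Date": pd.to_datetime([
        "2010-01-01", "2010-02-01", "2010-02-15",
        "2010-01-01", "2010-02-01", "2010-02-15",
    ]),
})

# Sort so that the rolling result lines up row for row with the frame
ordered = df.sort_values(["PictureID", "Date"]).reset_index(drop=True)

# For each event, count events of the same PictureID in the trailing 180 days
# (the window includes the event itself)
in_window = (
    ordered.set_index("Date")
    .groupby("PictureID")["EventID"]
    .rolling("180D")
    .count()
)

# Subtract 1 to exclude the current event, leaving only earlier ones
ordered["PastSix"] = in_window.to_numpy().astype(int) - 1
print(ordered)
```

On the sample data this gives 0, 1, 2 for the three A events, which matches the count of 2 expected for A in the third row.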

For python - Data frame wrangling with dates and periods in Pandas, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18806875/
