python - Resampling a DataFrame into a new DataFrame while performing some additional operations

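I am working with a DataFrame in which each entry (row) has a start time, a duration, and some other attributes. From it I want to create a new DataFrame in which each original entry is split into 15-minute intervals, with all other attributes kept the same. The number of rows in the new DataFrame for each entry of the old one depends on the actual duration of the original entry.

At first I tried pd.resample, but it did not do quite what I expected. I then built a function with itertuples() that works well, but it took about half an hour for a DataFrame of roughly 3,000 rows. Now I want to do the same for 2 million rows, so I am looking for other possibilities.

Suppose I have the following DataFrame: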

import pandas as pd

testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm'], 'id': [1,2,3,4]}
testdf = pd.DataFrame(testdict)
# Assign the column directly so that its dtype becomes datetime64
testdf['start'] = pd.to_datetime(testdf['start'])
print(testdf)

>>>testdf
                 start  duration Attribute_A  id
0  2018-01-05 11:48:00        22         abc   1
1  2018-05-04 09:05:00         8         def   2
2  2018-08-09 07:15:00        35         hij   3
3  2018-09-27 15:00:00         2         klm   4

I would like my result to look like this:

>>>resultdf
                start  duration Attribute_A  id
0 2018-01-05 11:45:00        12         abc   1
1 2018-01-05 12:00:00        10         abc   1
2 2018-05-04 09:00:00         8         def   2
3 2018-08-09 07:15:00        15         hij   3
4 2018-08-09 07:30:00        15         hij   3
5 2018-08-09 07:45:00         5         hij   3
6 2018-09-27 15:00:00         2         klm   4

This is the function I built with itertuples; it produces the desired result (the one shown above):

def min15_divider(df, newdf):
    # Rows are added one at a time with pd.concat (DataFrame.append was removed in pandas 2.0)
    for row in df.itertuples():
        orig_min = row.start.minute
        remains = orig_min % 15 # Check if it is already a multiple of 15
        if remains == 0:
            new_time = row.start.replace(second=0)
            if row.duration < 15: # if it is shorter than 15 min just use that for the duration
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,
                             'duration': row.duration, 'id':row.id}
                newdf = pd.concat([newdf, pd.DataFrame([to_append])], ignore_index=True)
            else: # if not, divide that in 15 min intervals until duration is exceeded
                cumu_dur = 15
                while cumu_dur < row.duration:
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id':row.id}
                    if cumu_dur < 15:
                        to_append['duration'] = cumu_dur
                    else:
                        to_append['duration'] = 15
                    new_time = new_time + pd.Timedelta('15 minutes')
                    cumu_dur = cumu_dur + 15
                    newdf = pd.concat([newdf, pd.DataFrame([to_append])], ignore_index=True)

                else: # add the remainder in the last 15 min interval
                    final_dur = row.duration - (cumu_dur - 15)
                    to_append = {'start': new_time, 'Attribute_A': row.Attribute_A,'duration': final_dur, 'id':row.id}
                    newdf = pd.concat([newdf, pd.DataFrame([to_append])], ignore_index=True)

        else: # When it is not an exact multiple of 15 min
            new_min = orig_min - remains # convert to multiple of 15
            new_time = row.start.replace(minute=new_min)
            new_time = new_time.replace(second=0)
            cumu_dur = 15 - remains # remaining minutes in the initial interval
            while cumu_dur < row.duration: # divide total in 15 min intervals until duration is exceeded
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'id':row.id}
                if cumu_dur < 15:
                    to_append['duration'] = cumu_dur
                else:
                    to_append['duration'] = 15

                new_time = new_time + pd.Timedelta('15 minutes')
                cumu_dur = cumu_dur + 15
                newdf = pd.concat([newdf, pd.DataFrame([to_append])], ignore_index=True)

            else: # when we reach the last interval or the duration fits into the first interval
                if cumu_dur == 15 - remains:
                    final_dur = row.duration # loop never ran: the whole duration fits into the first interval
                else:
                    final_dur = row.duration - (cumu_dur - 15) # remaining duration in last interval
                to_append = {'start': new_time, 'Attribute_A': row.Attribute_A, 'duration': final_dur, 'id':row.id}
                newdf = pd.concat([newdf, pd.DataFrame([to_append])], ignore_index=True)
    return newdf

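Is there another way to do this without itertuples that would save me some time?

Thanks in advance.

P.S. I apologize for anything in my post that may seem a little odd; this is the first time I have asked a question on Stack Overflow myself.

Edit

Many entries can have the same start time, so a .groupby on 'start' could be problematic. However, each entry has a column with a unique value, simply called "id".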

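Best answer

Using pd.resample is a good idea, but since each row only has a start time, you need to build an end row before you can use it.

The code below assumes that every start time in the 'start' column is unique, so groupby can be used in a somewhat unusual way: each group contains exactly one row.
I use groupby because it automatically recombines the DataFrames produced by the custom function passed to apply.
Also note that the 'duration' column is converted to a timedelta in minutes, which makes the later arithmetic easier.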

import pandas as pd

testdict = {'start':['2018-01-05 11:48:00', '2018-05-04 09:05:00', '2018-08-09 07:15:00', '2018-09-27 15:00:00'], 'duration':[22,8,35,2], 'Attribute_A':['abc', 'def', 'hij', 'klm']}
testdf = pd.DataFrame(testdict)
testdf['start'] = pd.to_datetime(testdf['start'])
testdf['duration'] = pd.to_timedelta(testdf['duration'], 'min')  # 'min' replaces the deprecated 'T' alias
print(testdf)

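def calcduration(df, starttime):
    # Adjusts 'duration' in place; df is indexed by the start of each 15-minute interval.
    # Positional indexing on the frame itself avoids chained assignment.
    dur = df.columns.get_loc('duration')
    if len(df) == 1:
        return  # the entry fits into a single interval
    elif len(df) == 2:
        # first interval: what is left of it after the actual start time
        df.iloc[0, dur] = pd.Timedelta(15, 'min') - (starttime - df.index[0])
        df.iloc[1, dur] = df.iloc[1, dur] - df.iloc[0, dur]
    elif len(df) > 2:
        df.iloc[0, dur] = pd.Timedelta(15, 'min') - (starttime - df.index[0])
        df.iloc[1:-1, dur] = pd.Timedelta(15, 'min')  # fully covered intervals
        # last interval: total duration minus everything assigned so far
        df.iloc[-1, dur] = df.iloc[-1, dur] - df.iloc[:-1, dur].sum()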

def expandtime(x):
    frow = x.copy()
    frow['start'] = frow['start'] + frow['duration']
    gdf = pd.concat([x, frow], axis=0)
    gdf = gdf.set_index('start')
    resdf = gdf.resample('15min').nearest()
    calcduration(resdf, x['start'].iloc[0])
    return resdf

findf = testdf.groupby('start', as_index=False).apply(expandtime)
print(findf)

This code produces:

                      duration Attribute_A
  start                                   
0 2018-01-05 11:45:00 00:12:00         abc
  2018-01-05 12:00:00 00:10:00         abc
1 2018-05-04 09:00:00 00:08:00         def
2 2018-08-09 07:15:00 00:15:00         hij
  2018-08-09 07:30:00 00:15:00         hij
  2018-08-09 07:45:00 00:05:00         hij
3 2018-09-27 15:00:00 00:02:00         klm
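
The output above has a two-level index and timedelta durations, whereas the result requested in the question is a flat frame with the duration in whole minutes. The following is a minimal sketch of that conversion. The small findf built here is a hand-made stand-in for the first two rows of the real output, and the name flat is only illustrative.

```python
import pandas as pd

# Hand-built stand-in for the first two rows of the output shown above.
idx = pd.MultiIndex.from_tuples(
    [
        (0, pd.Timestamp("2018-01-05 11:45:00")),
        (0, pd.Timestamp("2018-01-05 12:00:00")),
    ],
    names=[None, "start"],
)
findf = pd.DataFrame(
    {
        "duration": pd.to_timedelta([12, 10], unit="min"),
        "Attribute_A": ["abc", "abc"],
    },
    index=idx,
)

# Move 'start' back into a column and drop the group counter level.
flat = findf.reset_index(level="start").reset_index(drop=True)
# Convert timedelta durations to whole minutes.
flat["duration"] = (flat["duration"] // pd.Timedelta(minutes=1)).astype(int)
print(flat)
```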

Some explanation

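expandtime is the first custom function. It takes a one-row DataFrame (because we assume the 'start' values are unique), builds a second row whose 'start' equals the first row's 'start' plus the duration, and then uses resample to sample it at 15-minute intervals. The values of all other columns are duplicated.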
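
To make the intermediate step concrete, here is a small sketch of what happens to a single entry (the 11:48 row with a 22-minute duration) before any durations are corrected. The variable names row, end_row, two_rows and bins are illustrative only.

```python
import pandas as pd

# One entry: starts at 11:48 and lasts 22 minutes.
row = pd.DataFrame({
    "start": pd.to_datetime(["2018-01-05 11:48:00"]),
    "duration": pd.to_timedelta([22], unit="min"),
    "Attribute_A": ["abc"],
})

# Second row whose 'start' is the end time of the entry (12:10).
end_row = row.copy()
end_row["start"] = end_row["start"] + end_row["duration"]

two_rows = pd.concat([row, end_row]).set_index("start")

# Resampling yields one row per 15-minute interval between start and end;
# nearest() copies the column values from the closest original row.
bins = two_rows.resample("15min").nearest()
print(bins)
```

At this point every row still carries the full 22-minute duration, which is why a second pass over the 'duration' column is needed.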

calcduration does some arithmetic on the 'duration' column to compute the correct duration for each row.
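
For the 2 million rows mentioned in the question, and given that start times may repeat, a variant without groupby or apply may be worth trying. The following is a minimal sketch and not part of the original answer: the function name split_into_bins and its minutes parameter are hypothetical, it assumes 'duration' holds whole minutes as integers, and it drops seconds from 'start' as the itertuples function in the question does. Each row is repeated once per interval it touches, and the duration is the overlap between the entry and that interval.

```python
import numpy as np
import pandas as pd


def split_into_bins(df, minutes=15):
    """Split each row into one row per `minutes`-long interval it overlaps.

    Assumes 'start' is datetime64 and 'duration' holds whole minutes (integers).
    """
    start = df["start"].dt.floor("min")          # drop seconds
    first_bin = start.dt.floor(f"{minutes}min")  # interval containing the start
    # Minutes between the interval boundary and the actual start.
    offset = ((start - first_bin) // pd.Timedelta(minutes=1)).to_numpy()
    end = offset + df["duration"].to_numpy()     # end, in minutes after first_bin

    # Number of intervals each row touches (ceiling division, at least 1).
    n = np.maximum(-(-end // minutes), 1).astype(np.int64)
    row_pos = np.repeat(np.arange(len(df)), n)   # source row of each output row
    # Position of each output row within its source row: 0, 1, 2, ...
    k = np.arange(n.sum()) - np.repeat(np.cumsum(n) - n, n)

    out = df.iloc[row_pos].reset_index(drop=True)
    out["start"] = pd.DatetimeIndex(first_bin.iloc[row_pos]) + pd.to_timedelta(
        k * minutes, unit="min"
    )
    # Overlap of [offset, end) with interval k, i.e. [k*minutes, (k+1)*minutes).
    lo = np.maximum(k * minutes, offset[row_pos])
    hi = np.minimum((k + 1) * minutes, end[row_pos])
    out["duration"] = hi - lo
    return out


testdf = pd.DataFrame({
    "start": pd.to_datetime([
        "2018-01-05 11:48:00", "2018-05-04 09:05:00",
        "2018-08-09 07:15:00", "2018-09-27 15:00:00",
    ]),
    "duration": [22, 8, 35, 2],
    "Attribute_A": ["abc", "def", "hij", "klm"],
    "id": [1, 2, 3, 4],
})
resultdf = split_into_bins(testdf)
print(resultdf)
```

Because nothing is grouped by 'start', duplicate start times need no special handling, and the 'id' column is carried through unchanged.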

The original question, "python - Resampling a DataFrame into a new DataFrame while performing some additional operations", can be found on Stack Overflow: https://stackoverflow.com/questions/56650656/
