python - Split date range rows into years (ungroup) - Python Pandas

Tags: python pandas date dataframe

I have a DataFrame like this:

    Start date  end date        A    B
    01.01.2020  30.06.2020      2    3
    01.01.2020  31.12.2020      3    1
    01.04.2020  30.04.2020      6    2
    01.01.2021  31.12.2021      2    3
    01.07.2020  31.12.2020      8    2
    01.01.2020  31.12.2023      1    2
    .......

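I want to split the rows where end - start > 1 year (see the last row, where end = 2023 and start = 2020) into one row per year, keeping the same value in column A while splitting the value in column B proportionally: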

    Start date  end date        A    B
    01.01.2020  30.06.2020      2    3
    01.01.2020  31.12.2020      3    1
    01.04.2020  30.04.2020      6    2
    01.01.2021  31.12.2021      2    3
    01.07.2020  31.12.2020      8    2
    01.01.2020  31.12.2020      1    2/4
    01.01.2021  31.12.2021      1    2/4
    01.01.2022  31.12.2022      1    2/4
    01.01.2023  31.12.2023      1    2/4
    .......

Any ideas?

Best answer

Here is my solution. See the comments in the code below:

import io

import pandas as pd

# TEST DATA:
text="""     start         end      A      B 
        01.01.2020  30.06.2020      2      3 
        01.01.2020  31.12.2020      3      1 
        01.04.2020  30.04.2020      6      2 
        01.01.2021  31.12.2021      2      3 
        01.07.2020  31.12.2020      8      2
        31.12.2020  20.01.2021     12     12
        31.12.2020  01.01.2021     22     22
        30.12.2020  01.01.2021     32     32
        10.05.2020  28.09.2023     44     44
        27.11.2020  31.12.2023     88     88
        31.12.2020  31.12.2023    100    100
        01.01.2020  31.12.2021    200    200
      """

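# Note: with parse_dates alone, ambiguous strings such as 01.04.2020 are read month-first
# (the output below shows 2020-01-04). Pass dayfirst=True if the data is day.month.year.
df= pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0,1])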
#print("\n----\n df:",df)

#----------------------------------------
# SOLUTION:

def split_years(r):
    """
        Split row 'r' whose "start" and "end" fall in different calendar years.
        The new rows repeat the value of 'A'; 'B' is divided equally by the number of new rows.
        Return: a DataFrame with one row per calendar year.
    """
    t1,t2 = r["start"], r["end"]
    ys= t2.year - t1.year
    kk= 0 if t1.is_year_end else 1
    if ys>0:
        l1=[t1] + [ t1+pd.offsets.YearBegin(i) for i in range(1,ys+1) ]
        l2=[ t1+pd.offsets.YearEnd(i) for i in range(kk,ys+kk) ] + [t2]
        return pd.DataFrame({"start":l1, "end":l2, "A":r.A,"B": r.B/len(l1)})
    print("year difference <= 0!")
    return None


# Create two groups, one for rows where 'start' and 'end' are in the same year, and one for the others:
grps= df.groupby(lambda idx: (df.loc[idx,"start"].year-df.loc[idx,"end"].year)!=0 ).groups 
print("\n---- grps:\n",grps)

# Extract the "one year" rows in a data frame:
df1= df.loc[grps[False]]
#print("\n---- df1:\n",df1)

# Extract the rows to be split:
df2= df.loc[grps[True]]
print("\n---- df2:\n",df2)

# Split the rows and put the resulting data frames into a list:
ldfs=[ split_years(df2.loc[row]) for row in df2.index ]
print("\n---- ldfs:")
for fr in ldfs:
    print(fr,"\n")

# Insert the "one year" data frame to the list, and concatenate them:    
ldfs.insert(0,df1)
df_rslt= pd.concat(ldfs,sort=False)
#print("\n---- df_rslt:\n",df_rslt)

# Housekeeping:
df_rslt= df_rslt.sort_values("start").reset_index(drop=True)
print("\n---- df_rslt:\n",df_rslt)

Output:

---- grps:
 {False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}

---- df2:
         start        end    A    B
5  2020-12-31 2021-01-20   12   12
6  2020-12-31 2021-01-01   22   22
7  2020-12-30 2021-01-01   32   32
8  2020-10-05 2023-09-28   44   44
9  2020-11-27 2023-12-31   88   88
10 2020-12-31 2023-12-31  100  100
11 2020-01-01 2021-12-31  200  200

---- ldfs:
       start        end   A    B
0 2020-12-31 2020-12-31  12  6.0
1 2021-01-01 2021-01-20  12  6.0 

       start        end   A     B
0 2020-12-31 2020-12-31  22  11.0
1 2021-01-01 2021-01-01  22  11.0 

       start        end   A     B
0 2020-12-30 2020-12-31  32  16.0
1 2021-01-01 2021-01-01  32  16.0 

       start        end   A     B
0 2020-10-05 2020-12-31  44  11.0
1 2021-01-01 2021-12-31  44  11.0
2 2022-01-01 2022-12-31  44  11.0
3 2023-01-01 2023-09-28  44  11.0 

       start        end   A     B
0 2020-11-27 2020-12-31  88  22.0
1 2021-01-01 2021-12-31  88  22.0
2 2022-01-01 2022-12-31  88  22.0
3 2023-01-01 2023-12-31  88  22.0 

       start        end    A     B
0 2020-12-31 2020-12-31  100  25.0
1 2021-01-01 2021-12-31  100  25.0
2 2022-01-01 2022-12-31  100  25.0
3 2023-01-01 2023-12-31  100  25.0 

       start        end    A      B
0 2020-01-01 2020-12-31  200  100.0
1 2021-01-01 2021-12-31  200  100.0 


---- df_rslt:
         start        end    A      B
0  2020-01-01 2020-06-30    2    3.0
1  2020-01-01 2020-12-31    3    1.0
2  2020-01-01 2020-12-31  200  100.0
3  2020-01-04 2020-04-30    6    2.0
4  2020-01-07 2020-12-31    8    2.0
5  2020-10-05 2020-12-31   44   11.0
6  2020-11-27 2020-12-31   88   22.0
7  2020-12-30 2020-12-31   32   16.0
8  2020-12-31 2020-12-31   12    6.0
9  2020-12-31 2020-12-31  100   25.0
10 2020-12-31 2020-12-31   22   11.0
11 2021-01-01 2021-12-31  100   25.0
12 2021-01-01 2021-12-31   88   22.0
13 2021-01-01 2021-12-31   44   11.0
14 2021-01-01 2021-01-01   32   16.0
15 2021-01-01 2021-01-01   22   11.0
16 2021-01-01 2021-01-20   12    6.0
17 2021-01-01 2021-12-31    2    3.0
18 2021-01-01 2021-12-31  200  100.0
19 2022-01-01 2022-12-31   88   22.0
20 2022-01-01 2022-12-31  100   25.0
21 2022-01-01 2022-12-31   44   11.0
22 2023-01-01 2023-09-28   44   11.0
23 2023-01-01 2023-12-31   88   22.0
24 2023-01-01 2023-12-31  100   25.0
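
Note that the result shows 2020-01-04, 2020-01-07 and 2020-10-05 for the inputs 01.04.2020, 01.07.2020 and 10.05.2020: these ambiguous strings were read month-first, although the data appears to be day.month.year. A minimal sketch of one way to avoid this, assuming every date really is day.month.year, is to give pandas an explicit format (the variable names `raw` and `parsed` are only for illustration):

```python
import pandas as pd

# Date strings taken from the test data above; day.month.year order is assumed.
raw = pd.Series(["01.04.2020", "01.07.2020", "10.05.2020", "30.06.2020"])

# An explicit format removes the ambiguity: the first field is always the day.
parsed = pd.to_datetime(raw, format="%d.%m.%Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
# ['2020-04-01', '2020-07-01', '2020-05-10', '2020-06-30']
```

In the answer's code, passing dayfirst=True to pd.read_csv should have a similar effect. The rest of the logic does not depend on how the dates were parsed.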

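The answer above divides B equally among the new rows, which matches the 2/4 in the question. If "proportionally" should instead mean in proportion to the number of days each piece covers, a sketch under that assumption could look like this. The function name `split_by_year_weighted` and the example row are hypothetical and not part of the original answer:

```python
import pandas as pd

def split_by_year_weighted(row):
    """Split one row at calendar-year boundaries and share B by days covered."""
    start, end = row["start"], row["end"]
    years = range(start.year, end.year + 1)
    # Clip each calendar year to the [start, end] interval of the row.
    starts = [max(start, pd.Timestamp(year=y, month=1, day=1)) for y in years]
    ends = [min(end, pd.Timestamp(year=y, month=12, day=31)) for y in years]
    # Inclusive day count of every piece, used as the weight for B.
    days = [(e - s).days + 1 for s, e in zip(starts, ends)]
    total = sum(days)
    return pd.DataFrame({
        "start": starts,
        "end": ends,
        "A": row["A"],  # repeated unchanged on every new row
        "B": [row["B"] * d / total for d in days],
    })

# Hypothetical example row: 2020 is a leap year (366 days), 2021 has 365 days.
row = pd.Series({
    "start": pd.Timestamp("2020-01-01"),
    "end": pd.Timestamp("2021-12-31"),
    "A": 200,
    "B": 731,
})
pieces = split_by_year_weighted(row)
print(pieces)
```

For this row the two pieces receive 366 and 365 of the 731 units in B, so the parts still sum to the original value. The function could be applied to each multi-year row in place of split_years.
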
Regarding "python - Split date range rows into years (ungroup) - Python Pandas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58762950/
