Suppose I have the following dataset stored in a Pandas dataframe. Note that the last column, [Status], is the column I want to create:
Department  Employee  Issue Date  Submission Date  Status
A           Joe       18/05/2014  25/06/2014       0
A           Joe       1/06/2014   28/06/2014       1
A           Joe       23/06/2014  30/06/2014       2
A           Mark      1/03/2015   13/03/2015       0
A           Mark      23/04/2015  15/04/2015       0
A           William   15/07/2016  30/07/2016       0
A           William   1/08/2016   23/08/2016       0
A           William   20/08/2016  19/08/2016       1
B           Liz       18/05/2014  7/06/2014        0
B           Liz       1/06/2014   15/06/2014       1
B           Liz       23/06/2014  16/06/2014       0
B           John      1/03/2015   13/03/2015       0
B           John      23/04/2015  15/04/2015       0
B           Alex      15/07/2016  30/07/2016       0
B           Alex      1/08/2016   23/08/2016       0
B           Alex      20/08/2016  19/08/2016       1
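I want to create an additional column, [Status], based on the following conditions:
- For each unique combination of [Department] and [Employee] (for example, there are three rows for Joe in department A), sort the rows by [Issue Date] from oldest to newest.
- If the current row's [Issue Date] is later than ALL previous rows' [Submission Date], set [Status] to 0; otherwise [Status] = the number of previous rows whose [Submission Date] is later than the current [Issue Date].
Example: take employee Joe in department A. When [Issue Date] = '1/06/2014', the previous row's [Submission Date] is later than the [Issue Date], so [Status] = 1 for row 2. Similarly, when [Issue Date] = '23/06/2014', the [Submission Date] of both row 1 and row 2 is later than the [Issue Date], so [Status] = 2 for row 3. This calculation has to be performed for every unique combination of Department and Employee.
- Note: the actual dataset is not sorted as neatly as the example shown.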
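To make the rule above concrete, here is a minimal pure-Python sketch applied to Joe's three rows only. The names `joe_rows` and `status_for_group` are made up for illustration, and the rows are deliberately given out of order to mirror the note about unsorted data. It reads the rule as "count the earlier rows whose Submission Date is later than the current Issue Date", which gives 0 when there are none.

```python
from datetime import date

# Joe's rows from the sample table as (issue_date, submission_date) pairs,
# deliberately left unsorted. `joe_rows` is an illustrative name.
joe_rows = [
    (date(2014, 6, 23), date(2014, 6, 30)),
    (date(2014, 5, 18), date(2014, 6, 25)),
    (date(2014, 6, 1), date(2014, 6, 28)),
]


def status_for_group(rows):
    """Return (issue_date, status) pairs for one Department/Employee group."""
    # Sort the group by issue date, oldest first.
    ordered = sorted(rows, key=lambda row: row[0])
    result = []
    for i, (issue, _) in enumerate(ordered):
        # Count earlier rows whose submission date is later than this issue date.
        pending = sum(1 for _, submitted in ordered[:i] if submitted > issue)
        result.append((issue, pending))
    return result


joe_status = [status for _, status in status_for_group(joe_rows)]
print(joe_status)  # [0, 1, 2]
```

This brute-force version is quadratic within each group, which should be fine for a handful of rows per employee.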
Best answer
This question was posted 6 months ago, but hopefully my answer is still of some help.
First, import the libraries and build the dataframe:
# Import libraries
import numpy as np
import pandas as pd
# Make DataFrame
df = pd.DataFrame({
    'Department': ['A'] * 8 + ['B'] * 8,
    'Employee': (['Joe'] * 3 + ['Mark'] * 2 + ['William'] * 3 +
                 ['Liz'] * 3 + ['John'] * 2 + ['Alex'] * 3),
    'Issue Date': ['18/05/2014', '1/06/2014', '23/06/2014',
                   '1/03/2015', '23/04/2015',
                   '15/07/2016', '1/08/2016', '20/08/2016',
                   '18/05/2014', '1/06/2014', '23/06/2014',
                   '1/03/2015', '23/04/2015',
                   '15/07/2016', '1/08/2016', '20/08/2016'],
    'Submission Date': ['25/06/2014', '28/06/2014', '30/06/2014',
                        '13/03/2015', '15/04/2015',
                        '30/07/2016', '23/08/2016', '19/08/2016',
                        '7/06/2014', '15/06/2014', '16/06/2014',
                        '13/03/2015', '15/04/2015',
                        '30/07/2016', '23/08/2016', '19/08/2016'],
})
Second, convert Issue Date and Submission Date to datetime. The dates are written day first, hence dayfirst=True. Assigning to df['...'] rather than df.loc[:, '...'] replaces the whole column, so it gets a datetime dtype:
# Convert 'Issue Date', 'Submission Date' to datetime
df['Issue Date'] = pd.to_datetime(df['Issue Date'], dayfirst=True)
df['Submission Date'] = pd.to_datetime(df['Submission Date'], dayfirst=True)
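A quick illustration of why dayfirst matters for this data: a string such as '1/06/2014' is ambiguous, and without the flag pandas reads it month first. The variable names below are just for this sketch.

```python
import pandas as pd

ambiguous = '1/06/2014'

# Default parsing treats an ambiguous string as month/day/year.
month_first = pd.to_datetime(ambiguous)

# dayfirst=True treats it as day/month/year, which is what the table uses.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

print(month_first.date())  # 2014-01-06
print(day_first.date())    # 2014-06-01
```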
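Third, sort the values by Department, Employee and Issue Date, then reset the index. The result is assigned back to df so that the sort actually takes effect:
# Sort by 'Department', 'Employee', 'Issue Date', then reset the index
df = df.sort_values(by=['Department',
                        'Employee',
                        'Issue Date']).reset_index(drop=True)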
Fourth, group by Department and Employee, number the rows within each group with a cumulative count, and insert the result into the original df:
# Group by 'Department', 'Employee'; cumulative count rows; insert into original df
df.insert(df.shape[1],
          'grouped count',
          df.groupby(['Department',
                      'Employee']).cumcount())
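For readers unfamiliar with cumcount, here is a tiny sketch on a made-up frame (`toy` is an illustrative name): it numbers the rows inside each group starting from 0, in the order they appear, which is why the sort has to happen first.

```python
import pandas as pd

# Five rows: three for Joe, two for Mark, all in department A.
toy = pd.DataFrame({'Department': ['A'] * 5,
                    'Employee': ['Joe'] * 3 + ['Mark'] * 2})

# Position of each row within its (Department, Employee) group.
toy['grouped count'] = toy.groupby(['Department', 'Employee']).cumcount()

print(toy['grouped count'].tolist())  # [0, 1, 2, 0, 1]
```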
Fifth, create a no_issue dataframe and a no_submission dataframe and merge them on Department and Employee:
# Create df without 'Issue Date'
no_issue = df.drop('Issue Date', axis=1)
# Create df without 'Submission Date'
no_submission = df.drop('Submission Date', axis=1)
# Outer merge no_issue with no_submission on 'Department', 'Employee'
merged = no_issue.merge(no_submission,
                        how='outer',
                        on=['Department',
                            'Employee'])
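This pairs every Submission Date with every Issue Date within each Department, Employee group. The grouped count column of no_issue becomes grouped count_x and that of no_submission becomes grouped count_y.
Joe, for example, has three rows in df and therefore nine rows in merged.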
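To see what the merge produces for a single employee, here is a self-contained sketch that rebuilds Joe's three rows (already parsed and counted) and repeats the same outer merge. The names `joe` and `pairs` are illustrative only.

```python
import pandas as pd

# Joe's rows after the datetime conversion and cumcount steps.
joe = pd.DataFrame({
    'Department': ['A'] * 3,
    'Employee': ['Joe'] * 3,
    'Issue Date': pd.to_datetime(['2014-05-18', '2014-06-01', '2014-06-23']),
    'Submission Date': pd.to_datetime(['2014-06-25', '2014-06-28', '2014-06-30']),
    'grouped count': [0, 1, 2],
})

# Same merge as in the answer: every submission row meets every issue row.
pairs = joe.drop('Issue Date', axis=1).merge(
    joe.drop('Submission Date', axis=1),
    how='outer',
    on=['Department', 'Employee'])

print(len(pairs))  # 9

# Pairs in which the submission belongs to an earlier row than the issue.
earlier = pairs['grouped count_x'] < pairs['grouped count_y']
print(int(earlier.sum()))  # 3
```

Because every row is paired with every other row in its group, the intermediate frame grows quadratically with group size, which may matter for employees with many rows.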
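Sixth, create a dataframe that keeps only the rows where grouped count_x is less than grouped count_y (that is, the Submission Date comes from an earlier row than the Issue Date), then sort by Department, Employee and Issue Date:
# Create merged1 df that keeps only rows where 'grouped count_x' < 'grouped count_y';
# sort by 'Department', 'Employee', 'Issue Date'
merged1 = merged[merged.loc[:, 'grouped count_x'] <
                 merged.loc[:, 'grouped count_y']].sort_values(by=['Department',
                                                                   'Employee',
                                                                   'Issue Date'])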
Seventh, insert a Status column holding a boolean that is True where Issue Date is earlier than Submission Date:
# Insert 'Status' as a boolean when 'Issue Date' < 'Submission Date'
merged1.insert(merged1.shape[1],
               'Status',
               merged1.loc[:, 'Issue Date'] < merged1.loc[:, 'Submission Date'])
Eighth, group by Department, Employee and Issue Date, sum Status (each True counts as 1), and reset the index:
# Group by 'Department', 'Employee', 'Issue Date' and sum 'Status'; reset index
merged1 = merged1.groupby(['Department',
                           'Employee',
                           'Issue Date']).agg({'Status': 'sum'}).reset_index()
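This returns a dataframe with the correct Status for every row of each Department, Employee group except the row with the earliest Issue Date, which has no earlier rows to compare against.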
Ninth, group the original merged dataframe by Department and Employee, find the earliest Issue Date, and reset the index:
# Group merged by 'Department', 'Employee' and find min 'Issue Date'; reset index
merged = merged.groupby(['Department',
                         'Employee']).agg({'Issue Date': 'min'}).reset_index()
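Tenth, concatenate merged1 with merged, fill the missing Status values with 0 (the Status of the earliest Issue Date is always 0), cast Status back to integer, and sort by Department, Employee and Issue Date:
# Concatenate merged1 with merged; fill missing 'Status' with 0 and cast to int;
# sort by 'Department', 'Employee', 'Issue Date'
concatenated = (pd.concat([merged1, merged])
                .fillna({'Status': 0})
                .astype({'Status': int})
                .sort_values(by=['Department',
                                 'Employee',
                                 'Issue Date']))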
Eleventh, inner merge the original df with concatenated on Department, Employee and Issue Date, then drop grouped count:
# Merge concatenated with df; drop grouped count
final = df.merge(concatenated,
                 how='inner',
                 on=['Department',
                     'Employee',
                     'Issue Date']).drop('grouped count',
                                         axis=1)
Voilà! Here is your final dataframe:
# Final df
final
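One way to sanity-check the result is to recompute Status with a more direct method and compare it with the Status column from the question. The sketch below is not the approach used in this answer: it swaps the merge for a per-group NumPy comparison, and the names `check`, `expected`, `later` and `earlier_row` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question, rebuilt so this block runs on its own.
check = pd.DataFrame({
    'Department': ['A'] * 8 + ['B'] * 8,
    'Employee': (['Joe'] * 3 + ['Mark'] * 2 + ['William'] * 3 +
                 ['Liz'] * 3 + ['John'] * 2 + ['Alex'] * 3),
    'Issue Date': pd.to_datetime(
        ['18/05/2014', '1/06/2014', '23/06/2014', '1/03/2015', '23/04/2015',
         '15/07/2016', '1/08/2016', '20/08/2016'] * 2, dayfirst=True),
    'Submission Date': pd.to_datetime(
        ['25/06/2014', '28/06/2014', '30/06/2014', '13/03/2015', '15/04/2015',
         '30/07/2016', '23/08/2016', '19/08/2016',
         '7/06/2014', '15/06/2014', '16/06/2014', '13/03/2015', '15/04/2015',
         '30/07/2016', '23/08/2016', '19/08/2016'], dayfirst=True),
})

# Status column as given in the question's table.
expected = [0, 1, 2, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]

status = pd.Series(0, index=check.index)
for _, group in check.groupby(['Department', 'Employee']):
    group = group.sort_values('Issue Date')
    issue = group['Issue Date'].to_numpy()
    submitted = group['Submission Date'].to_numpy()
    # later[i, j] is True when row j was submitted after row i was issued.
    later = submitted[None, :] > issue[:, None]
    # earlier_row[i, j] is True when row j comes before row i (j < i).
    earlier_row = np.tri(len(group), k=-1, dtype=bool)
    status.loc[group.index] = (later & earlier_row).sum(axis=1)

check['Status'] = status
print(check['Status'].tolist() == expected)  # True
```

If `final` from the steps above is available, `final['Status'].tolist()` can be compared against the same `expected` list.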
Regarding python - Pandas dataframe: Create additional column based on date columns comparison, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42901414/