python - Pandas DataFrame: Create additional column based on date columns comparison

Suppose I have the following dataset stored in a Pandas DataFrame. Note that the last column, [Status], is the column I want to create:

Department  Employee  Issue Date  Submission Date  ***Status***
A           Joe       18/05/2014  25/06/2014       0
A           Joe       1/06/2014   28/06/2014       1
A           Joe       23/06/2014  30/06/2014       2
A           Mark      1/03/2015   13/03/2015       0
A           Mark      23/04/2015  15/04/2015       0
A           William   15/07/2016  30/07/2016       0
A           William   1/08/2016   23/08/2016       0
A           William   20/08/2016  19/08/2016       1
B           Liz       18/05/2014  7/06/2014        0
B           Liz       1/06/2014   15/06/2014       1
B           Liz       23/06/2014  16/06/2014       0
B           John      1/03/2015   13/03/2015       0
B           John      23/04/2015  15/04/2015       0
B           Alex      15/07/2016  30/07/2016       0
B           Alex      1/08/2016   23/08/2016       0
B           Alex      20/08/2016  19/08/2016       1

I want to create an additional column [Status] based on the following conditions:

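For each unique combination of [Department] and [Employee] (for example, there are three rows for Joe in Department A), sort [Issue Date] from oldest to newest.
If the current row's [Issue Date] is later than ALL previous rows' [Submission Date], set [Status] to 0; otherwise [Status] = the number of previous rows for which [Issue Date] < [Submission Date].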

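For example, take employee Joe in Department A. When [Issue Date] = '1/06/2014', the previous row's [Submission Date] is later than the [Issue Date], so [Status] = 1 for row 2. Similarly, when [Issue Date] = '23/06/2014', the [Submission Date] of both row 1 and row 2 is after the [Issue Date], so [Status] = 2 for row 3. This calculation needs to be done for each unique combination of Department and Employee.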
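
To make the rule concrete, here is a minimal sketch of the Joe example only. The names joe and status are just for illustration, and the rows are assumed to be already sorted by Issue Date:

```python
import pandas as pd

# Joe's three rows from the table above, already sorted by Issue Date
joe = pd.DataFrame({
    'Issue Date': pd.to_datetime(['18/05/2014', '1/06/2014', '23/06/2014'],
                                 dayfirst=True),
    'Submission Date': pd.to_datetime(['25/06/2014', '28/06/2014', '30/06/2014'],
                                      dayfirst=True),
})

# For each row, count the earlier rows whose Submission Date is later
# than this row's Issue Date
status = [
    int((joe['Submission Date'].iloc[:i] > joe['Issue Date'].iloc[i]).sum())
    for i in range(len(joe))
]
print(status)
```

The first row has no earlier rows, so its count is 0 by construction, which matches the "earliest Issue Date is always 0" behaviour in the table.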

Note: the actual dataset is not sorted as neatly as the example shown.

Best answer

This question was posted 6 months ago, but hopefully my answer can still be of some help.

First, import the libraries and build the DataFrame:

# import libraries
import numpy as np
import pandas as pd

# Make DataFrame
df = pd.DataFrame({'Department' : ['A']*8 + ['B']*8,
                   'Employee' : ['Joe']*3 +\
                                ['Mark']*2 +\
                                ['William']*3 +\
                                ['Liz']*3 +\
                                ['John']*2 +\
                                ['Alex']*3,
                   'Issue Date' : ['18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016',
                                   '18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016'],
                   'Submission Date' : ['25/06/2014', '28/06/2014', '30/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016',
                                        '7/06/2014', '15/06/2014', '16/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016']})

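Second, convert Issue Date and Submission Date to datetime:

# Convert 'Issue Date' and 'Submission Date' to datetime (the dates are day-first).
# Assign to df[col] rather than df.loc[:, col]: in recent pandas versions the
# .loc form can leave the column as object dtype instead of datetime64.
df['Issue Date'] = pd.to_datetime(df['Issue Date'], dayfirst = True)
df['Submission Date'] = pd.to_datetime(df['Submission Date'], dayfirst = True)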
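
As a quick illustration of why dayfirst matters here, the sketch below parses one ambiguous date from the data both ways. The variable names are only for illustration:

```python
import pandas as pd

ambiguous = '1/06/2014'

# With dayfirst=True the string is read as 1 June 2014
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# By default pandas reads ambiguous dates month-first: 6 January 2014
month_first = pd.to_datetime(ambiguous)

print(day_first.date(), month_first.date())
```

Without dayfirst, rows such as '1/06/2014' would be ordered wrongly relative to rows such as '18/05/2014', and the Status counts would come out wrong.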

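Third, sort the values by Department, Employee and Issue Date, then reset the index:

# Sort by 'Department', 'Employee', 'Issue Date', then reset the index.
# Assign the result back to df: calling sort_values(inplace = True) on the
# temporary returned by reset_index() would leave df itself unsorted.
df = df.sort_values(by = ['Department',
                          'Employee',
                          'Issue Date']).reset_index(drop = True)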

Fourth, group by Department and Employee, take the cumulative row count, and insert it into the original df:

# Group by 'Department', 'Employee'; cumulative count rows; insert into original df
df.insert(df.shape[1],
          'grouped count',
          df.groupby(['Department',
                      'Employee']).cumcount())
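
If cumcount is unfamiliar, this small sketch on a made-up toy frame (not the real data) shows what it returns: a 0-based position of each row within its group, in row order.

```python
import pandas as pd

# Hypothetical toy frame, only to show what cumcount() produces
toy = pd.DataFrame({'Employee': ['Joe', 'Joe', 'Mark', 'Joe', 'Mark']})

# Each row gets its running position inside its own group
toy['grouped count'] = toy.groupby('Employee').cumcount()
print(toy)
```

Because df was sorted by Issue Date first, 'grouped count' there is effectively the chronological rank of each row within its Department and Employee group.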

Fifth, create a no_issue and a no_submission dataframe and merge them on Department and Employee:

# Create df without 'Issue Date'
no_issue = df.drop('Issue Date', axis = 1)

# Create df without 'Submission Date'
no_submission = df.drop('Submission Date', axis = 1)

# Outer merge no_issue with no_submission on 'Department', 'Employee'
merged = no_issue.merge(no_submission,
                        how = 'outer',
                        on = ['Department',
                              'Employee'])

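This pairs every Submission Date with every Issue Date within each Department, Employee group.

For Joe, for example, his three submission dates are combined with his three issue dates, giving nine rows.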
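
To see the pairing on its own, here is a minimal sketch that runs the same self-merge on Joe's rows only. The names joe and pairs are just for illustration:

```python
import pandas as pd

# Joe's rows only, with the same helper column as in the steps above
joe = pd.DataFrame({
    'Department': 'A',
    'Employee': 'Joe',
    'Issue Date': pd.to_datetime(['18/05/2014', '1/06/2014', '23/06/2014'],
                                 dayfirst=True),
    'Submission Date': pd.to_datetime(['25/06/2014', '28/06/2014', '30/06/2014'],
                                      dayfirst=True),
})
joe['grouped count'] = joe.groupby(['Department', 'Employee']).cumcount()

# Self-merge on the group keys: every Submission Date meets every Issue Date
pairs = joe.drop('Issue Date', axis=1).merge(
    joe.drop('Submission Date', axis=1),
    how='outer',
    on=['Department', 'Employee'])

print(len(pairs))
print(pairs)
```

'grouped count_x' belongs to the row the Submission Date came from and 'grouped count_y' to the row the Issue Date came from, which is what the next step relies on.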

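Sixth, create a dataframe that keeps only the rows where grouped count_x is less than grouped count_y (that is, the Submission Date comes from an earlier row than the Issue Date), then sort by Department, Employee and Issue Date:

# Create merged1 df that keeps only rows where 'grouped count_x' < 'grouped count_y';
# sort by 'Department', 'Employee', 'Issue Date'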
merged1 = merged[merged.loc[:, 'grouped count_x'] <
                 merged.loc[:, 'grouped count_y']].sort_values(by = ['Department',
                                                                     'Employee',
                                                                     'Issue Date'])

Seventh, insert a Status column as a boolean that is True where Issue Date is earlier than Submission Date:

# Insert 'Status' as a boolean when 'Issue Date' < 'Submission Date'
merged1.insert(merged1.shape[1],
               'Status',
               merged1.loc[:, 'Issue Date'] < merged1.loc[:, 'Submission Date'])

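Eighth, group by Department, Employee and Issue Date, sum Status, and reset the index:

# Group by 'Department', 'Employee', 'Issue Date' and sum 'Status'; reset index.
# Summing the booleans counts the True values; the string 'sum' avoids the
# warning newer pandas versions raise when np.sum is passed to agg.
merged1 = merged1.groupby(['Department',
                           'Employee',
                           'Issue Date']).agg({'Status' : 'sum'}).reset_index()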

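This returns a dataframe with the correct Status for every row of each Department, Employee group except the row with the earliest Issue Date.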

Ninth, group the original merged dataframe by Department and Employee, find the earliest Issue Date, and reset the index:

# Group merged by 'Department', 'Employee' and find min 'Issue Date'; reset index
merged = merged.groupby(['Department',
                         'Employee']).agg({'Issue Date' : 'min'}).reset_index()

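Tenth, concatenate merged1 with merged, fill na with 0 (because the Status for the earliest Issue Date is always 0), and sort by Department, Employee and Issue Date:

# Concatenate merged1 with merged; fill na with 0; sort by 'Department', 'Employee', 'Issue Date'
concatenated = pd.concat([merged1, merged]).fillna(0).sort_values(by = ['Department',
                                                                        'Employee',
                                                                        'Issue Date'])

# The NaN values made 'Status' a float column, so cast it back to int
concatenated['Status'] = concatenated['Status'].astype(int)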
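
The reason fillna is needed at all is that merged has no Status column, so concat fills it with NaN for those rows. A minimal sketch on two made-up one-row frames (hypothetical names, not the real data):

```python
import pandas as pd

# Hypothetical frames: one with a Status column, one without
with_status = pd.DataFrame({'Employee': ['Joe'], 'Status': [1]})
without_status = pd.DataFrame({'Employee': ['Joe']})

# Rows from the frame lacking 'Status' get NaN, which fillna(0) turns into 0
combined = pd.concat([with_status, without_status]).fillna(0)
print(combined)
print(combined['Status'].dtype)
```

Note that the column ends up as float because of the intermediate NaN, so a cast with astype(int) is needed if integer counts are wanted.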

Eleventh, inner-merge the original df with the concatenated dataframe on Department, Employee and Issue Date, then drop grouped count:

# Merge concatenated with df; drop grouped count
final = df.merge(concatenated,
                 how = 'inner',
                 on = ['Department',
                       'Employee',
                       'Issue Date']).drop('grouped count',
                                           axis = 1)

Voilà! Here is your final dataframe:

# Final df
final
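
As a cross-check, here is a more compact sketch that computes the same column directly from the rule in the question. It is an alternative approach, not the merge-based method above: it loops over the groups and compares dates with NumPy broadcasting. The variable names (status, later, earlier) are my own, and it assumes Issue Dates are unique within each group.

```python
import numpy as np
import pandas as pd

# Same data as the question's table
df = pd.DataFrame({
    'Department': ['A'] * 8 + ['B'] * 8,
    'Employee': ['Joe'] * 3 + ['Mark'] * 2 + ['William'] * 3
                + ['Liz'] * 3 + ['John'] * 2 + ['Alex'] * 3,
    'Issue Date': ['18/05/2014', '1/06/2014', '23/06/2014',
                   '1/03/2015', '23/04/2015',
                   '15/07/2016', '1/08/2016', '20/08/2016'] * 2,
    'Submission Date': ['25/06/2014', '28/06/2014', '30/06/2014',
                        '13/03/2015', '15/04/2015',
                        '30/07/2016', '23/08/2016', '19/08/2016',
                        '7/06/2014', '15/06/2014', '16/06/2014',
                        '13/03/2015', '15/04/2015',
                        '30/07/2016', '23/08/2016', '19/08/2016'],
})

for col in ['Issue Date', 'Submission Date']:
    df[col] = pd.to_datetime(df[col], dayfirst=True)

# Sort so that "previous rows" means "rows with an earlier Issue Date"
df = df.sort_values(['Department', 'Employee', 'Issue Date']).reset_index(drop=True)

status = pd.Series(0, index=df.index)
for _, group in df.groupby(['Department', 'Employee']):
    issue = group['Issue Date'].to_numpy()
    submission = group['Submission Date'].to_numpy()
    n = len(group)
    # later[i, j]: row j's Submission Date is after row i's Issue Date
    later = submission[None, :] > issue[:, None]
    # earlier[i, j]: row j comes before row i within the group
    earlier = np.tril(np.ones((n, n), dtype=bool), k=-1)
    status.loc[group.index] = (later & earlier).sum(axis=1)

df['Status'] = status
print(df)
```

The comparison matrix is quadratic in the group size, the same as the self-merge above, so this is fine for small groups but may need a different approach for very large ones.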

Regarding python - Pandas DataFrame: Create additional column based on date columns comparison, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42901414/
