python - Pandas DataFrame: Create additional column based on date columns comparison

Tags: python python-3.x pandas dataframe

Suppose I have the following dataset stored in a Pandas DataFrame. Note that the last column, [Status], is the column I want to create:

Department  Employee  Issue Date  Submission Date  Status
A           Joe       18/05/2014  25/06/2014       0
A           Joe       1/06/2014   28/06/2014       1
A           Joe       23/06/2014  30/06/2014       2
A           Mark      1/03/2015   13/03/2015       0
A           Mark      23/04/2015  15/04/2015       0
A           William   15/07/2016  30/07/2016       0
A           William   1/08/2016   23/08/2016       0
A           William   20/08/2016  19/08/2016       1
B           Liz       18/05/2014  7/06/2014        0
B           Liz       1/06/2014   15/06/2014       1
B           Liz       23/06/2014  16/06/2014       0
B           John      1/03/2015   13/03/2015       0
B           John      23/04/2015  15/04/2015       0
B           Alex      15/07/2016  30/07/2016       0
B           Alex      1/08/2016   23/08/2016       0
B           Alex      20/08/2016  19/08/2016       1

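I want to create an additional column, [Status], based on the following conditions:

  1. For each unique combination of [Department] and [Employee] (for example, there are three rows for Joe in Department A), sort [Issue Date] from oldest to newest.
  2. If the current row's [Issue Date] is later than the [Submission Date] of ALL previous rows, set [Status] to 0; otherwise, [Status] = the number of previous rows whose [Submission Date] is later than the current row's [Issue Date].

For example, take employee Joe in Department A. When [Issue Date] = '1/06/2014', the previous row's [Submission Date] is later than that [Issue Date], so [Status] = 1 for row 2. Similarly, when [Issue Date] = '23/06/2014', the [Submission Date] of both row 1 and row 2 is later than the [Issue Date], so [Status] = 2 for row 3. This calculation needs to be performed for every unique combination of Department and Employee.

  • Note: the actual dataset is not sorted as neatly as the example shown.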
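The rule can be traced by hand on Joe's three rows. The following is only a minimal sketch of the stated condition, not part of the original question; the names `joe` and `status` are made up for illustration.

```python
import pandas as pd

# Joe's three rows from the example table (dates are written day-first).
joe = pd.DataFrame({
    "Issue Date": ["18/05/2014", "1/06/2014", "23/06/2014"],
    "Submission Date": ["25/06/2014", "28/06/2014", "30/06/2014"],
})
for col in ["Issue Date", "Submission Date"]:
    joe[col] = pd.to_datetime(joe[col], dayfirst=True)

# Condition 1: order the rows from oldest to newest Issue Date.
joe = joe.sort_values("Issue Date").reset_index(drop=True)

# Condition 2: for row i, count the earlier rows (positions 0..i-1) whose
# Submission Date is later than row i's Issue Date.
status = [
    int((joe["Submission Date"].iloc[:i] > joe["Issue Date"].iloc[i]).sum())
    for i in range(len(joe))
]
print(status)
```

The first row has no earlier rows, so its count is 0 by construction.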

Best answer

This question was posted 6 months ago, but I hope my answer is still of some help.

First, import the libraries and build the DataFrame:

# import libraries
import numpy as np
import pandas as pd

# Make DataFrame
df = pd.DataFrame({'Department' : ['A']*8 + ['B']*8,
                   'Employee' : ['Joe']*3 +\
                                ['Mark']*2 +\
                                ['William']*3 +\
                                ['Liz']*3 +\
                                ['John']*2 +\
                                ['Alex']*3,
                   'Issue Date' : ['18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016',
                                   '18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016'],
                   'Submission Date' : ['25/06/2014', '28/06/2014', '30/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016',
                                        '7/06/2014', '15/06/2014', '16/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016']})

df

Second, convert Issue Date and Submission Date to datetime:

# Convert 'Issue Date' and 'Submission Date' to datetime (the dates are day-first).
# Assign to the column directly so the column dtype actually becomes datetime64.
df['Issue Date'] = pd.to_datetime(df['Issue Date'],
                                  dayfirst = True)
df['Submission Date'] = pd.to_datetime(df['Submission Date'],
                                       dayfirst = True)
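The `dayfirst` argument matters because several dates in the table are ambiguous. A small sketch of the difference for one of them (the variable names are illustrative only):

```python
import pandas as pd

ambiguous = "1/06/2014"

# With dayfirst=True the string is read as day/month/year.
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# Without it, pandas reads an ambiguous string as month/day/year.
month_first = pd.to_datetime(ambiguous)

print(day_first.date(), month_first.date())
```

Here `day_first` is 1 June 2014, while `month_first` is 6 January 2014, which would silently change the ordering of the rows.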

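Third, sort the values by Department, Employee, and Issue Date, and reset the index:

# Sort by 'Department', 'Employee', 'Issue Date', then reset the index.
# Assign the result back to df: calling sort_values(inplace = True) on the
# temporary frame returned by reset_index() would leave df itself unsorted.
df = df.sort_values(by = ['Department',
                          'Employee',
                          'Issue Date']).reset_index(drop = True)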
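One detail worth checking at this step: `sort_values(..., inplace=True)` only changes the object it is called on. A minimal sketch with a hypothetical three-row frame `demo`, showing why the sorted result should be assigned back to the name when the call is chained after `reset_index`:

```python
import pandas as pd

demo = pd.DataFrame({"key": [3, 1, 2]})

# reset_index() returns a new frame; the in-place sort is applied to that
# temporary object, which is then discarded.
demo.reset_index(drop=True).sort_values(by="key", inplace=True)
unchanged = demo["key"].tolist()

# Assigning the result back is what actually reorders the rows.
demo = demo.sort_values(by="key").reset_index(drop=True)
sorted_keys = demo["key"].tolist()

print(unchanged, sorted_keys)
```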

Fourth, group by Department and Employee, take the cumulative count of the rows, and insert it into the original df:

# Group by 'Department', 'Employee'; cumulative count rows; insert into original df
df.insert(df.shape[1],
          'grouped count',
          df.groupby(['Department',
                      'Employee']).cumcount())
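For readers unfamiliar with `cumcount`, it numbers the rows inside each group from 0 in their current order, which is why the sort has to happen first. A small sketch on a made-up frame:

```python
import pandas as pd

demo = pd.DataFrame({"Employee": ["Joe", "Joe", "Joe", "Mark", "Mark"]})

# Position of each row within its Employee group, starting at 0.
demo["grouped count"] = demo.groupby("Employee").cumcount()

print(demo)
```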


Fifth, create no_issue and no_submission DataFrames and merge them on Department and Employee:

# Create df without 'Issue Date'
no_issue = df.drop('Issue Date', axis = 1)

# Create df without 'Submission Date'
no_submission = df.drop('Submission Date', axis = 1)

# Outer merge no_issue with no_submission on 'Department', 'Employee'
merged = no_issue.merge(no_submission,
                        how = 'outer',
                        on = ['Department',
                              'Employee'])

Within each Department/Employee group, this repeats every Submission Date once for each Issue Date in the group.

Here is what that looks like for Joe:

merged[merged.loc[:, 'Employee'] == 'Joe']
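To make the duplication concrete, here is a self-contained sketch that repeats the merge on Joe's rows only. The frame `joe` and the result name `merged_joe` are illustrative; ISO date strings are used so that parsing is unambiguous.

```python
import pandas as pd

joe = pd.DataFrame({
    "Department": ["A"] * 3,
    "Employee": ["Joe"] * 3,
    "Issue Date": pd.to_datetime(["2014-05-18", "2014-06-01", "2014-06-23"]),
    "Submission Date": pd.to_datetime(["2014-06-25", "2014-06-28", "2014-06-30"]),
    "grouped count": [0, 1, 2],
})

no_issue = joe.drop("Issue Date", axis=1)
no_submission = joe.drop("Submission Date", axis=1)

# Merging on the group keys pairs every Submission Date with every Issue Date.
merged_joe = no_issue.merge(no_submission,
                            how="outer",
                            on=["Department", "Employee"])

# 3 submission rows x 3 issue rows
print(len(merged_joe))
print(merged_joe.columns.tolist())
```

Because both inputs carry a `grouped count` column, pandas suffixes them: `grouped count_x` comes from `no_issue` (the Submission Date row) and `grouped count_y` from `no_submission` (the Issue Date row).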

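Sixth, create a DataFrame that keeps only the rows where grouped count_x is less than grouped count_y, so that each Issue Date is paired only with Submission Dates from earlier rows, then sort by Department, Employee, and Issue Date: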

# Create merged1 df that keeps only rows where 'grouped count_x' < 'grouped count_y';
# sort by 'Department', 'Employee', 'Issue Date'
merged1 = merged[merged.loc[:, 'grouped count_x'] <
                 merged.loc[:, 'grouped count_y']].sort_values(by = ['Department',
                                                                     'Employee',
                                                                     'Issue Date'])

Seventh, insert a Status column as a boolean that is True where Issue Date is earlier than Submission Date:

# Insert 'Status' as a boolean that is True when 'Issue Date' < 'Submission Date'
merged1.insert(merged1.shape[1],
               'Status',
               merged1.loc[:, 'Issue Date'] < merged1.loc[:, 'Submission Date'])

Eighth, group by Department, Employee, and Issue Date, sum Status, and reset the index:

# Group by 'Department', 'Employee', 'Issue Date' and sum 'Status'; reset index
# (the string 'sum' avoids the warning newer pandas versions raise for np.sum)
merged1 = merged1.groupby(['Department',
                           'Employee',
                           'Issue Date']).agg({'Status' : 'sum'}).reset_index()

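This returns a DataFrame with the correct Status for every row of each Department/Employee group except the row with the earliest Issue Date.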
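The filter, compare, and sum sequence of steps six through eight can be traced on Joe's rows alone. This is a sketch under the assumption that the pairs are built by hand rather than by the merge; `pairs`, `earlier`, and `status` are illustrative names.

```python
import pandas as pd

issue = pd.to_datetime(["2014-05-18", "2014-06-01", "2014-06-23"])
submission = pd.to_datetime(["2014-06-25", "2014-06-28", "2014-06-30"])

# Every (submission row, issue row) pairing, as the merge would produce.
pairs = pd.DataFrame(
    [(s, x, i, y) for x, s in enumerate(submission) for y, i in enumerate(issue)],
    columns=["Submission Date", "grouped count_x", "Issue Date", "grouped count_y"],
)

# Step six: keep only pairs where the submission row comes before the issue row.
earlier = pairs[pairs["grouped count_x"] < pairs["grouped count_y"]].copy()

# Step seven: flag earlier submissions that are later than the issue date.
earlier["Status"] = earlier["Issue Date"] < earlier["Submission Date"]

# Step eight: count the flags per issue date.
status = earlier.groupby("Issue Date")["Status"].sum()
print(status)
```

The earliest Issue Date never appears in `earlier`, because no row precedes it; that is the gap the next two steps fill.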


Ninth, group the original merged DataFrame by Department and Employee, find the minimum Issue Date, and reset the index:

# Group merged by 'Department', 'Employee' and find min 'Issue Date'; reset index
merged = merged.groupby(['Department',
                         'Employee']).agg({'Issue Date' : 'min'}).reset_index()

Tenth, concatenate merged1 with merged, fill NaN with 0 (the Status for each group's earliest Issue Date is always 0), and sort by Department, Employee, and Issue Date:

# Concatenate merged with merged1; fill na with 0; sort by 'Department', 'Employee', 'Issue Date'
concatenated = pd.concat([merged1, merged]).fillna(0).sort_values(by = ['Department',
                                                                        'Employee',
                                                                        'Issue Date'])
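A side effect worth knowing about: rows that arrive without a Status get NaN during the concatenation, so the column becomes floating point even after `fillna(0)`. A small sketch with made-up frames `with_status` and `first_issue`:

```python
import pandas as pd

with_status = pd.DataFrame({
    "Issue Date": pd.to_datetime(["2014-06-01", "2014-06-23"]),
    "Status": [1, 2],
})
first_issue = pd.DataFrame({"Issue Date": pd.to_datetime(["2014-05-18"])})

combined = (pd.concat([with_status, first_issue])
            .fillna(0)
            .sort_values(by="Issue Date")
            .reset_index(drop=True))

# Status is float after the NaN round trip; cast back if integers are wanted.
combined["Status"] = combined["Status"].astype(int)
print(combined)
```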

Eleventh, inner-merge df with the concatenated DataFrame on Department, Employee, and Issue Date, then drop grouped count:

# Merge concatenated with df; drop grouped count
final = df.merge(concatenated,
                 how = 'inner',
                 on = ['Department',
                       'Employee',
                       'Issue Date']).drop('grouped count',
                                           axis = 1)

Voilà! Here is your final DataFrame:

# Final df
final
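As a cross-check on the merge-based pipeline, the rule from the question can also be applied directly, group by group. This is an alternative sketch rather than the answer's method; it assumes the same column names and is quadratic in the size of each group, which is fine for small groups like these.

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["A"] * 8 + ["B"] * 8,
    "Employee": (["Joe"] * 3 + ["Mark"] * 2 + ["William"] * 3
                 + ["Liz"] * 3 + ["John"] * 2 + ["Alex"] * 3),
    "Issue Date": ["18/05/2014", "1/06/2014", "23/06/2014",
                   "1/03/2015", "23/04/2015",
                   "15/07/2016", "1/08/2016", "20/08/2016"] * 2,
    "Submission Date": ["25/06/2014", "28/06/2014", "30/06/2014",
                        "13/03/2015", "15/04/2015",
                        "30/07/2016", "23/08/2016", "19/08/2016",
                        "7/06/2014", "15/06/2014", "16/06/2014",
                        "13/03/2015", "15/04/2015",
                        "30/07/2016", "23/08/2016", "19/08/2016"],
})
for col in ["Issue Date", "Submission Date"]:
    df[col] = pd.to_datetime(df[col], dayfirst=True)

status = pd.Series(0, index=df.index, dtype="int64")
for _, group in df.groupby(["Department", "Employee"]):
    group = group.sort_values("Issue Date")
    issue = group["Issue Date"].to_numpy()
    submitted = group["Submission Date"].to_numpy()
    for pos, row_label in enumerate(group.index):
        # Earlier rows in the group whose Submission Date is after this Issue Date.
        status.loc[row_label] = int((submitted[:pos] > issue[pos]).sum())

df["Status"] = status
print(df)
```

Because the result is written back by row label, the original row order is preserved, and the Status column matches the one shown in the question's table.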


Regarding python - Pandas DataFrame: Create additional column based on date columns comparison, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42901414/
