python - Finding whether there is a holiday between two dates in a large dataset?

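I am working with a dataset of about 26 million rows and 13 columns, including two datetime columns, arr_date and dep_date. I am trying to create a new boolean column that checks whether a US holiday falls between these dates. I am using the apply function over the whole dataframe, but it is far too slow: the code has now been running for more than 48 hours on Google Cloud Platform (24 GB RAM, 4 cores). Is there a faster way to do this?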

The dataset looks like this: Sample data

The code I am using is:

import pandas as pd
import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

df = pd.read_pickle('dataGT70.pkl')
cal = calendar()
def mark_holiday(df):
    df.apply(lambda x: True if (len(cal.holidays(start=x['dep_date'], end=x['arr_date']))>0 and x['num_days']<20) else False, axis=1)
    return df

df = mark_holiday(df)

Best answer

This took about two minutes to run on a sample dataframe of 30 million rows with two columns, start_date and end_date.

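The idea is to get a sorted list of all holidays occurring on or after the minimum start date, and then to use bisect_left from the bisect module to determine the next holiday occurring on or after each start date. That holiday is then compared to the end date. If it is less than or equal to the end date, there must be at least one holiday in the date range between the start and end dates (both inclusive).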
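The lookup described above can be sketched on a tiny scale with only the standard library. The holiday list and the helper name holiday_in_range below are hypothetical, hand-written for illustration; they are not part of the answer's code.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical, hand-written holiday list; it must be sorted ascending.
holidays = [date(2013, 11, 28), date(2013, 12, 25), date(2014, 1, 1)]

def holiday_in_range(start, end):
    """Return True if any holiday falls in [start, end], both inclusive."""
    i = bisect_left(holidays, start)  # index of the first holiday >= start
    # i == len(holidays) means there is no holiday on or after start.
    return i < len(holidays) and holidays[i] <= end

print(holiday_in_range(date(2013, 12, 14), date(2013, 12, 29)))  # True
print(holiday_in_range(date(2013, 12, 26), date(2013, 12, 31)))  # False
print(holiday_in_range(date(2013, 12, 25), date(2013, 12, 25)))  # True
```

Because bisect_left returns the leftmost insertion point, a start date that is itself a holiday is matched, which is what makes the range inclusive at both ends.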

from bisect import bisect_left
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

# Create sample dataframe of 10k rows with an interval of 1-19 days.
np.random.seed(0)
n = 10000  # Sample size, e.g. 10k rows.
years = np.random.randint(2010, 2019, n)
months = np.random.randint(1, 13, n)
days = np.random.randint(1, 29, n)
df = pd.DataFrame({'start_date': [pd.Timestamp(*x) for x in zip(years, months, days)],
                   'interval': np.random.randint(1, 20, n)})
df['end_date'] = df['start_date'] + pd.to_timedelta(df['interval'], unit='D')
df = df.drop('interval', axis=1)

# Get a sorted list of holidays since the first start date.
hols = calendar().holidays(df['start_date'].min())

# Determine if there is a holiday between the start and end dates (both inclusive).
df['holiday_in_range'] = df['end_date'].ge(
    df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]))

>>> df.head(6)
  start_date   end_date  holiday_in_range
0 2015-07-14 2015-07-31             False
1 2010-12-18 2010-12-30              True  # 2010-12-24
2 2013-04-06 2013-04-16             False
3 2013-09-12 2013-09-24             False
4 2017-10-28 2017-10-31             False
5 2013-12-14 2013-12-29              True  # 2013-12-25

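So, for a given start_date timestamp (e.g. 2013-12-14), bisect_left(hols, pd.Timestamp('2013-12-14')) yields 39, and hols[39] is 2013-12-25, the next holiday falling on or after the 2013-12-14 start date. The next holiday is computed as df['start_date'].apply(lambda x: bisect_left(hols, x)).map(lambda x: hols[x]). This holiday is then compared to end_date, so holiday_in_range is True if end_date is greater than or equal to this holiday value; otherwise the holiday must fall after end_date.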
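As a possible variation, the same lookup can be done for all rows at once with np.searchsorted, where side='left' plays the role of bisect_left. This is a sketch of an alternative technique, not the answer's own method; the three-row dataframe is made up for illustration, and the code assumes at least one holiday exists in the queried window.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Made-up sample rows, for illustration only.
df = pd.DataFrame({
    'start_date': pd.to_datetime(['2013-12-14', '2013-04-06', '2013-12-25']),
    'end_date': pd.to_datetime(['2013-12-29', '2013-04-16', '2013-12-25']),
})

# Sorted holidays from the earliest start date to the latest end date.
# Assumption: this window contains at least one holiday.
hols = USFederalHolidayCalendar().holidays(df['start_date'].min(),
                                           df['end_date'].max())
hol_values = hols.values  # NumPy datetime64 array

# side='left' mirrors bisect_left: index of the first holiday >= start_date.
idx = np.searchsorted(hol_values, df['start_date'].values, side='left')

# idx == len(hol_values) means no holiday on or after that start date,
# so clip the index before the lookup and mask those rows out.
has_next = idx < len(hol_values)
next_hol = hol_values[np.minimum(idx, len(hol_values) - 1)]
df['holiday_in_range'] = has_next & (next_hol <= df['end_date'].values)

print(df)
```

This avoids calling a Python function once per row, so it may scale better on tens of millions of rows, although that has not been measured here.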

Regarding "python - Finding whether there is a holiday between two dates in a large dataset?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56926870/
