python - Fastest way to compute a DataFrame column

I'm stuck on a pandas problem and would appreciate some help.

On the one hand, I have a DataFrame like this:

   contributor_id     timestamp     edits    upper_month   lower_month
0      8             2018-01-01       1      2018-04-01    2018-02-01
1      26424341      2018-01-01       11     2018-04-01    2018-02-01
10     26870381      2018-01-01       465    2018-04-01    2018-02-01
22     28109145      2018-03-01       17     2018-06-01    2018-04-01
23     32769624      2018-01-01       84     2018-04-01    2018-02-01
25     32794352      2018-01-01       4      2018-04-01    2018-02-01

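On the other hand, I have a given index of dates (available in another DataFrame):

2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01, 2018-05-01, 2018-06-01, 2018-07-01, 2018-08-01, 2018-09-01, 2018-10-01, 2018-11-01, 2018-12-01.

I need to create a pd.Series that uses the dates above as its index. For each date in the index, the value is computed as follows:

for every row where date >= lower_month and date <= upper_month, I add 1.

In other words, for each date the goal is to count how many rows of the DataFrame above have that date between their lower_month and upper_month values (inclusive).

The expected pd.Series for this sample is: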

2018-01-01    0
2018-02-01    5
2018-03-01    5
2018-04-01    6
2018-05-01    1
2018-06-01    1
2018-07-01    0
2018-08-01    0
2018-09-01    0
2018-10-01    0
2018-11-01    0
2018-12-01    0
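
For reference, the rule can be reproduced with a direct sketch. The snippet below rebuilds only the lower_month and upper_month columns of the sample (the other columns are not needed for the count) and applies one boolean mask per date. The name `baseline` is illustrative; this is meant as a way to check results, not as the fast solution being asked for.

```python
import pandas as pd

# Sample data from the question, reduced to the two columns the rule uses.
df = pd.DataFrame({
    "lower_month": pd.to_datetime([
        "2018-02-01", "2018-02-01", "2018-02-01",
        "2018-04-01", "2018-02-01", "2018-02-01",
    ]),
    "upper_month": pd.to_datetime([
        "2018-04-01", "2018-04-01", "2018-04-01",
        "2018-06-01", "2018-04-01", "2018-04-01",
    ]),
})

# The twelve month-start dates that form the target index.
rng = pd.date_range("2018-01-01", freq="MS", periods=12)

# For each date, count the rows whose [lower_month, upper_month]
# interval contains it (both ends inclusive).
baseline = pd.Series(
    [((df["lower_month"] <= d) & (d <= df["upper_month"])).sum() for d in rng],
    index=rng,
)
print(baseline)  # counts: 0, 5, 5, 6, 1, 1, then zeros
```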

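Is there a fast way to do this that avoids iterating over the first DataFrame many times?

Thanks, everyone.

Best answer

Zip the two columns into a list of (lower_month, upper_month) tuples. Then, for each date in the range, test whether it falls inside each tuple and sum the results with a generator inside a dict comprehension: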

import pandas as pd

rng = pd.date_range('2018-01-01', freq='MS', periods=12)
vals = list(zip(df['lower_month'], df['upper_month']))

s = pd.Series({y: sum(y >= x1 and y <= x2 for x1, x2 in vals) for y in rng})

Edit:

For better performance, build a list and use its count method (thanks, @Stef):

s = pd.Series({y: [y >= x1 and y <= x2 for x1, x2 in vals].count(True) for y in rng})

print (s)
2018-01-01    0
2018-02-01    5
2018-03-01    5
2018-04-01    6
2018-05-01    1
2018-06-01    1
2018-07-01    0
2018-08-01    0
2018-09-01    0
2018-10-01    0
2018-11-01    0
2018-12-01    0
dtype: int64
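
As an alternative that is not part of the original answer, the same membership test can be expressed with NumPy broadcasting, which replaces the Python-level loops with one array comparison. This is a minimal sketch on the question's sample data; the names `lower`, `upper`, `dates`, `inside` and `s_vec` are illustrative. Note that it materializes a boolean matrix of shape (number of dates, number of rows), so memory use grows with both sizes.

```python
import numpy as np
import pandas as pd

# Sample intervals from the question.
df = pd.DataFrame({
    "lower_month": pd.to_datetime(["2018-02-01"] * 3 + ["2018-04-01"] + ["2018-02-01"] * 2),
    "upper_month": pd.to_datetime(["2018-04-01"] * 3 + ["2018-06-01"] + ["2018-04-01"] * 2),
})
rng = pd.date_range("2018-01-01", freq="MS", periods=12)

lower = df["lower_month"].to_numpy()   # shape (n_rows,)
upper = df["upper_month"].to_numpy()   # shape (n_rows,)
dates = rng.to_numpy()[:, None]        # shape (n_dates, 1)

# Broadcasting compares every date with every interval at once,
# giving a boolean matrix of shape (n_dates, n_rows).
inside = (dates >= lower) & (dates <= upper)

# Summing across the row axis counts the intervals containing each date.
s_vec = pd.Series(inside.sum(axis=1), index=rng)
print(s_vec)
```

On the sample data this yields the same counts as the output shown above.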

Performance:

import numpy as np

np.random.seed(123)

def random_dates(start, end, n=10000):

    start_u = start.value//10**9
    end_u = end.value//10**9

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s').floor('d')


d1 = random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-01-01')) + pd.offsets.MonthBegin(0)
d2 = random_dates(pd.to_datetime('2018-01-01'), pd.to_datetime('2020-01-01')) + pd.offsets.MonthBegin(0)

df = pd.DataFrame({'lower_month':d1, 'upper_month':d2})
rng = pd.date_range('2015-01-01', freq='MS', periods=6 * 12)
vals = list(zip(df['lower_month'], df['upper_month']))

In [238]: %timeit pd.Series({y: [y >= x1 and y <= x2 for x1, x2 in vals].count(True) for y in rng})
158 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [239]: %timeit pd.Series({y: sum(y >= x1 and y <= x2 for x1, x2 in vals) for y in rng})
221 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# the first solution (building a DataFrame and grouping) is slow
In [240]: %timeit pd.DataFrame([(y, y >= x1 and y <= x2) for x1, x2 in vals for y in rng], columns=['d','test']).groupby('d')['test'].sum().astype(int)
4.52 s ± 396 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
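
All three timed variants compare every date with every row. A different approach, not taken from the original answer and not timed here, is to sort the two columns once and use binary search: the number of intervals containing a date equals the number of lower bounds at or before it minus the number of upper bounds strictly before it. The sketch below assumes lower_month <= upper_month in every row (true for the benchmark data above); `count_covering` is a hypothetical helper name, and the random data is generated differently from `random_dates`, purely for illustration.

```python
import numpy as np
import pandas as pd

def count_covering(lower, upper, dates):
    """Count, for each date, the intervals [lower_i, upper_i] containing it.

    Assumes lower_i <= upper_i for every interval.
    """
    lower_sorted = np.sort(np.asarray(lower, dtype="datetime64[ns]"))
    upper_sorted = np.sort(np.asarray(upper, dtype="datetime64[ns]"))
    dates = np.asarray(dates, dtype="datetime64[ns]")
    # Number of intervals that have started: lower_i <= date.
    started = np.searchsorted(lower_sorted, dates, side="right")
    # Number of intervals that have already ended: upper_i < date.
    ended = np.searchsorted(upper_sorted, dates, side="left")
    return started - ended

# Illustrative random month-start data shaped like the benchmark:
# lower bounds in 2015-2017, upper bounds in 2018-2019.
gen = np.random.default_rng(0)
lower_pool = pd.date_range("2015-01-01", "2017-12-01", freq="MS").to_numpy()
upper_pool = pd.date_range("2018-01-01", "2019-12-01", freq="MS").to_numpy()
lower = gen.choice(lower_pool, size=1000)
upper = gen.choice(upper_pool, size=1000)

rng = pd.date_range("2015-01-01", freq="MS", periods=6 * 12)
counts = pd.Series(count_covering(lower, upper, rng.to_numpy()), index=rng)
print(counts.head())
```

Because only two sorts and two searches are needed, the work no longer scales with the product of the number of dates and the number of rows. Whether this is faster in practice depends on the data sizes and should be measured.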

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/57073490/
