python - Where can I improve my code to shorten its execution time?

Problem statement from HackerRank:

If the amount spent by a client on a particular day is greater than or equal to 2× the client's median spending for a trailing number of days, they send the client a notification about potential fraud. The bank doesn't send the client any notifications until they have at least that trailing number of prior days' transaction data.

Given the number of trailing days d and a client's total daily expenditures for a period of n days, determine the number of times the client will receive a notification over all n days.

My code solves the problem, but it exceeds the time limit on large test cases. The code itself is actually quite short:

from statistics import median

first_multiple_input = input().rstrip().split()
n = int(first_multiple_input[0])
d = int(first_multiple_input[1])
expenditure = list(map(int, input().rstrip().split()))
count=0
for i in range(len(expenditure)-d):
    if expenditure[d+i] >= 2*median(expenditure[i:d+i]) :
        count+=1
print( count)

Please point out what causes the delay and how to improve it.

A small test case to help understand the code:

9 5                 expenditure[] size n = 9, d = 5
2 3 4 2 3 6 8 4 5   expenditure = [2, 3, 4, 2, 3, 6, 8, 4, 5]

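Best answer

Analysis / idea

Your median(expenditure[i:d+i]) is the culprit, because it sorts the whole unsorted slice of size d every time, which takes O(d log d) time. You can reduce that to O(log d) per step by keeping the current window of trailing elements in sorted order, for example in a SortedList. You get the median from the middle one or two elements, and to update you just add the new element and remove the oldest one.

Implementation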

from sortedcontainers import SortedList

n = 9
d = 5
expenditure = [2, 3, 4, 2, 3, 6, 8, 4, 5]

count = 0
trailing = SortedList(expenditure[:d])
half = d // 2
for i in range(d, n):
    median = (trailing[half] + trailing[~half]) / 2
    if expenditure[i] >= 2 * median:
        count += 1
    trailing.add(expenditure[i])
    trailing.remove(expenditure[i - d])
print(count)

We could leave out the / 2 and the 2 *, but then "median" would be the wrong name, and naming things is hard. We could also write if expenditure[i] >= trailing[half] + trailing[~half], but I find that less clear.

Output

If you add
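
The trailing[half] and trailing[~half] indexing is what lets one expression handle both odd and even d. Here is a minimal sketch of that idea; the helper name middle_pair is only an illustration and not part of the solution above:

```python
def middle_pair(sorted_values):
    """Return the two elements whose average is the median."""
    half = len(sorted_values) // 2
    # ~half equals -half - 1, so it is as far from the end
    # as half is from the start.
    return sorted_values[half], sorted_values[~half]

# Odd length: both indexes hit the same middle element.
print(middle_pair([2, 2, 3, 3, 4]))
# Even length: the indexes hit the two central elements.
print(middle_pair([1, 2, 3, 4]))
```

For an odd-length window the pair is the middle element twice, so averaging changes nothing; for an even-length window it is the two central elements.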

    print(f'{trailing=} {median=} {expenditure[i]=}')

after the median = ... line, you can see what is happening:

trailing=SortedList([2, 2, 3, 3, 4]) median=3.0 expenditure[i]=6
trailing=SortedList([2, 3, 3, 4, 6]) median=3.0 expenditure[i]=8
trailing=SortedList([2, 3, 4, 6, 8]) median=4.0 expenditure[i]=4
trailing=SortedList([2, 3, 4, 6, 8]) median=4.0 expenditure[i]=5
2

Alternative implementation

Using zip instead of indexing:

count = 0
trailing = SortedList(expenditure[:d])
half = d // 2
for today, oldest in zip(expenditure[d:], expenditure):
    median = (trailing[half] + trailing[~half]) / 2
    if today >= 2 * median:
        count += 1
    trailing.add(today)
    trailing.remove(oldest)
print(count)

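Alternative data structure: sorted regular list

I have now found the problem at HackerRank, which doesn't have sortedcontainers. But the following was accepted there.

We can use a regular Python list and keep it sorted ourselves, with the help of sorted and bisect, both of which come with Python: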

from bisect import bisect_left, insort

count = 0
trailing = sorted(expenditure[:d])
half = d // 2
for today, oldest in zip(expenditure[d:], expenditure):
    median = (trailing[half] + trailing[~half]) / 2
    if today >= 2 * median:
        count += 1
    insort(trailing, today)
    del trailing[bisect_left(trailing, oldest)]
print(count)

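Accessing the middle elements takes O(1) time, finding the insertion/deletion index takes O(log d) time, and the actual insertion/deletion takes O(d) time (because all elements to the right of the index have to be shifted). But that O(d) shift is very fast, since it happens at a very low level.

Two more: sorted bytearray and counting sort

The question originally didn't include the reference to HackerRank. Now that I see the values are limited to integers from 0 to 200, we can also use a bytearray: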
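
To check a version like this against the original brute-force approach, it can help to wrap both in functions. The following is only a sketch; the function names count_notifications and count_notifications_slow are my own and not something HackerRank requires:

```python
from bisect import bisect_left, insort
from statistics import median

def count_notifications(expenditure, d):
    """Sliding window kept as a sorted regular list."""
    trailing = sorted(expenditure[:d])
    half = d // 2
    count = 0
    for today, oldest in zip(expenditure[d:], expenditure):
        # Same test as today >= 2 * median, without the division.
        if today >= trailing[half] + trailing[~half]:
            count += 1
        insort(trailing, today)                      # O(log d) search + O(d) shift
        del trailing[bisect_left(trailing, oldest)]  # O(log d) search + O(d) shift
    return count

def count_notifications_slow(expenditure, d):
    """Reference version: re-sorts every window via statistics.median."""
    return sum(
        expenditure[i + d] >= 2 * median(expenditure[i:i + d])
        for i in range(len(expenditure) - d)
    )

sample = [2, 3, 4, 2, 3, 6, 8, 4, 5]
print(count_notifications(sample, 5), count_notifications_slow(sample, 5))
```

Both calls should agree on any input, which makes the slow version a handy oracle for small random tests.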

trailing = bytearray(sorted(expenditure[:d]))
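
The rest of the loop can stay the same, because bisect_left, insort and del work on a bytearray just as they do on a list. A small sketch of one window update, with a hypothetical helper name slide, and assuming every value fits in a byte (0 to 255):

```python
from bisect import bisect_left, insort

def slide(trailing, today, oldest):
    """Move the sorted window one day forward, in place."""
    insort(trailing, today)
    del trailing[bisect_left(trailing, oldest)]

# First window of the small test case, stored as sorted bytes.
trailing = bytearray(sorted([2, 3, 4, 2, 3]))
slide(trailing, today=6, oldest=2)
print(list(trailing))

# Values outside 0..255 are rejected, so this trick only
# applies when the value range is known to be that small.
try:
    insort(trailing, 256)
except ValueError as error:
    print("rejected:", error)
```

After the first update the window holds 2, 3, 3, 4, 6, matching the second line of the debug output shown earlier.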

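As I just saw in the discussion there, for this range of allowed values we could also use a form of counting sort. I think a Fenwick tree would make that particularly fast; I might try it later.

Benchmark

In the comments you mentioned n=200000 and d=10122 as a large case. So I tested with this data: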
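
Here is a rough sketch of what the counting idea could look like, without the Fenwick tree. It is my own illustration rather than the code from that discussion, and the names count_notifications_counting and max_value are assumptions:

```python
def count_notifications_counting(expenditure, d, max_value=200):
    """Keep a histogram of the window instead of a sorted sequence."""
    counts = [0] * (max_value + 1)
    for value in expenditure[:d]:
        counts[value] += 1

    def kth_smallest(k):
        # Walk the histogram until more than k values have been seen
        # (k is 0-based).
        seen = 0
        for value, occurrences in enumerate(counts):
            seen += occurrences
            if seen > k:
                return value

    # Positions of the one or two middle elements in sorted order.
    lower, upper = (d - 1) // 2, d // 2
    count = 0
    for today, oldest in zip(expenditure[d:], expenditure):
        if today >= kth_smallest(lower) + kth_smallest(upper):
            count += 1
        counts[today] += 1
        counts[oldest] -= 1
    return count

print(count_notifications_counting([2, 3, 4, 2, 3, 6, 8, 4, 5], 5))
```

Each day costs at most two walks over the 201 counters, independent of d. A Fenwick tree over the counters would reduce that walk to a logarithmic number of steps.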

import random

n = 200000
d = 10122
expenditure = random.choices(range(201), k=n)

Benchmark of my solutions:

                       at replit.com   on my weak laptop
SortedList + indexing   ~1.8 seconds    ~6.4 seconds
SortedList + zipping    ~1.8 seconds    ~6.4 seconds
sorted regular list     ~0.6 seconds    ~8.8 seconds
sorted bytearray        ~0.3 seconds    ~1.7 seconds

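I'm not sure why the regular list solution is relatively slow on my laptop. I suspect it exceeds my CPU's level 1 cache.

Regarding "python - Where can I improve my code to shorten its execution time?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/69281285/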