python - 根据键修剪有序字典？

根据键“修剪”字典的最快方法是什么？我的理解是，自 Python 3.7 以来，字典现在保留顺序

我有一个字典，其中包含键(日期时间类型):val(浮点类型)。字典按时间顺序排列。

time_series_dict = 
{"2019-02-27 14:00:00": 95,
"2019-02-27 15:00:00": 98,
"2019-02-27 16:25:00: 80,
.............
"2019-03-01 12:15:00": 85
}

我想修剪字典，删除 start_date 和 end_date 之外的所有内容。字典可以有数千个值。有没有比以下更快的方法:

for k in list(time_series_dict.keys()):
    if not start_date <= k <= end_date:
        del time_series_dict[k]

最佳答案

如果您的字典有 1000 个键，并且您要从时间戳有序序列的开头和结尾删除键，请考虑使用 binary search在键的列表副本中查找截止点。 Python 包括 bisect module为此:

from bisect import bisect_left, bisect_right

def trim_time_series_dict(tsd, start_date, end_date):
    ts = list(tsd)
    before = bisect_right(ts, start_date)  # insertion point at > start_date
    after = bisect_left(ts, end_date)      # insertion point is < end_date
    for i in range(before):                # up to == start_date
        del tsd[ts[i]]
    for i in range(after + 1, len(ts)):    # from >= end_date onwards
        del tsd[ts[i]]

我运行了一些time trials看看这是否会对您的典型数据集产生影响；正如预期的那样，当删除的键的数量明显低于输入字典的长度时，它就会得到返回。

计时赛设置(导入、构建测试数据字典以及开始和结束日期、定义测试函数)

>>> import random
>>> from bisect import bisect_left, bisect_right
>>> from datetime import datetime, timedelta
>>> from itertools import islice
>>> from timeit import Timer
>>> def randomised_ordered_timestamps():
...     date = datetime.now().replace(second=0, microsecond=0)
...     while True:
...         date += timedelta(minutes=random.randint(15, 360))
...         yield date.strftime('%Y-%m-%d %H:%M:%S')
...
>>> test_data = {ts: random.randint(50, 500) for ts in islice(randomised_ordered_timestamps(), 10000)}
>>> start_date = next(islice(test_data, 25, None))                 # trim 25 from the start
>>> end_date = next(islice(test_data, len(test_data) - 25, None))  # trim 25 from the end
>>> def iteration(t, start_date, end_date):
...     time_series_dict = t.copy()  # avoid mutating test data
...     for k in list(time_series_dict.keys()):
...         if not start_date <= k <= end_date:
...             del time_series_dict[k]
...
>>> def bisection(t, start_date, end_date):
...     tsd = t.copy()  # avoid mutating test data
...     ts = list(tsd)
...     before = bisect_right(ts, start_date)  # insertion point at > start_date
...     after = bisect_left(ts, end_date)      # insertion point is < end_date
...     for i in range(before):                # up to == start_date
...         del tsd[ts[i]]
...     for i in range(after + 1, len(ts)):    # from >= end_date onwards
...         del tsd[ts[i]]
...

试验结果:

>>> count, total = Timer("t.copy()", "from __main__ import test_data as t").autorange()
>>> baseline = total / count
>>> for test in (iteration, bisection):
...     timer = Timer("test(t, s, e)", "from __main__ import test, test_data as t, start_date as s, end_date as e")
...     count, total = timer.autorange()
...     print(f"{test.__name__:>10}: {((total / count) - baseline) * 1000000:6.2f} microseconds")
...
 iteration: 671.33 microseconds
 bisection:  80.92 microseconds

(测试减去首先进行字典复制的基线成本)。

但是，对于此类操作很可能有更有效的数据结构。我查看了sortedcontainers project因为它包括 SortedDict() type支持直接对键进行二分。不幸的是，虽然它的性能比您的迭代方法更好，但我无法让它在这里比平分键列表的副本更好:

>>> from sortedcontainers import SortedDict
>>> test_data_sorteddict = SortedDict(test_data)
>>> def sorteddict(t, start_date, end_date):
...     tsd = t.copy()
...     # SortedDict supports slicing on the key view
...     keys = tsd.keys()
...     del keys[:tsd.bisect_right(start_date)]
...     del keys[tsd.bisect_left(end_date) + 1:]
...
>>> count, total = Timer("t.copy()", "from __main__ import test_data_sorteddict as t").autorange()
>>> baseline = total / count
>>> timer = Timer("test(t, s, e)", "from __main__ import sorteddict as test, test_data_sorteddict as t, start_date as s, end_date as e")
>>> count, total = timer.autorange()
>>> print(f"sorteddict: {((total / count) - baseline) * 1000000:6.2f} microseconds")
sorteddict: 249.46 microseconds

但是，我可能错误地使用了该项目。从 SortedDict 对象中删除键的时间复杂度为 O(NlogN)，所以我怀疑这就是问题所在。从其他 9950 个键值对创建新的 SortedDict() 对象仍然较慢(超过 2 毫秒，您不想与其他方法进行比较)。

但是，如果您要使用 SortedDict.irange() method您可以简单地忽略值，而不是删除它们，并迭代字典键的子集:

for ts in timeseries(start_date, end_date, inclusive=(False, False)):
    # iterates over all start_date > timestamp > end_date keys, in order.

无需删除任何内容。 irange() 实现在底层使用二分法。

关于python - 根据键修剪有序字典？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/54908136/

python - 根据键修剪有序字典？

上一篇：python - aiohttp 的 Pip 安装失败并出现奇怪的 AttributeError _dep_map

下一篇：python - 如何将字符串列与 Null 合并