Python:对子字符串列表进行排序，而不创建每个子字符串的单独副本

我有一个非常长的字符串，需要创建一个子字符串列表，这些子字符串从字符串中的每个项目开始一直到结尾。例如...

longstring= "some huge string"
substring_list=[longstring[x:] for x in range(len(longstring))]

然后我需要按字母顺序对 substring_list 进行排序。我正在使用 cython 尝试在 C 中运行它，以使速度更快(脚本的后续部分需要)。但是，当我运行代码时，我很快就会耗尽内存(当我尝试创建 substring_list 时)，并且我的脚本被杀死。我认为通过对字符串进行切片，我只是创建了一个指向长字符串切片的链接，并且这不需要太多额外的内存。这不是真的吗？

最佳答案

切片 Python 内置类(例如 list、tuple 以及 str)通常会创建一个新对象。它可以与原始版本共享底层 C 字符数组，但这将是一个实现细节。即使它已实现(我假设它没有实现)，它也可能取决于您使用的 python 版本(python 2.x、python 3.x，...)或哪个 python(CPython、PyPy、IronPython、 ...)你正在使用等等，所以你不应该依赖它。

这也可能是一个缺点，因为大切片中的一小部分会使大切片保持事件状态，因为它是“惰性”评估的，所以在我看来，他们没有以这种方式实现它是对的。

但是你可以自己做，@georg 已经提出了一个解决方案，但考虑到 python-3.x 不再支持 __cmp__ 我将提出另一个有效的(但类似的)解决方案独立于 python 版本:

class SubString(object):
    # slots reduce the memory overhead of the instances
    __slots__ = ('str_', 'startidx')

    def __init__(self, str_, startidx):
        self.str_ = str_
        self.startidx = startidx

    # sorted only requires "<" to be implemented so we only need __lt__
    def __lt__(self, other):
        # temporarly create explicit substrings but only temporaries for the comparison
        return self.str_[self.startidx:] < other.str_[other.startidx:]

    def __repr__(self):
        return self.str_[self.startidx:]

这些应该非常节省内存，但是需要一段时间才能对它们进行排序，但至少我没有内存问题。

我测试了它:

import random
import string

# create a long random string
mystr = ''.join(random.choice(string.ascii_letters) for _ in range(100000))

# sort the instances
sortedsubstr = sorted(SubString(mystr, i) for i in range(len(mystr)))

关于Python:对子字符串列表进行排序，而不创建每个子字符串的单独副本，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/41668848/

Python:对子字符串列表进行排序，而不创建每个子字符串的单独副本

上一篇：python - ElasticSearch 插入性能不佳

下一篇：python - 在 matplotlib 中仅连接一段数字数组