python - 存储一组四个(或更多)值的最佳数据结构是什么？

假设我有以下 variables及其对应的values代表 record .

name = 'abc'
age = 23
weight = 60
height = 174

请注意 value可能不同 types ( string 、 integer 、 float 、对任何其他对象的引用等)。

会有很多records (至少 >100,000)。每一个record将是 unique当所有这四个 variables (实际上它的 values )放在一起。换句话说，不存在record。所有 4 values是一样的。

我试图在 Python 中找到一个有效的数据结构这将允许我(存储和)检索 records基于其中任何一项 variables在 log(n)时间复杂度。

例如:

def retrieve(name=None,age=None,weight=None,height=None) 
    if name is not None and age is None and weight is None and height is None:
        /* get all records with the given name */
    if name is None and age is not None and weight is None and height is None:
        /* get all records with the given age */
    ....
    return records

retrieve的方式应该调用如下:

retrieve(name='abc')

上面应该返回 [{name:'abc', age:23, wight:50, height=175}, {name:'abc', age:28, wight:55, height=170}, etc]

retrieve(age=23)

上面应该返回 [{name:'abc', age:23, wight:50, height=175}, {name:'def', age:23, wight:65, height=180}, etc]

而且，我可能还需要添加一两个 variables到这个记录在未来。例如，sex = 'm' .所以，retrieve功能必须是可扩展的。

简而言之:Python中是否有数据结构？这将允许 storing a record与 n数量columns (姓名、年龄、性别、体重、高度等)和retrieving records基于任何(一个)column在 logarithmic (或理想情况下 constant - O(1) 查找时间)复杂性？

最佳答案

Python 中并没有一个内置的数据结构可以满足您的所有需求，但是可以很容易地组合使用它所必需的数据结构来实现您的目标，并且相当高效。

例如，假设您的输入是名为 employees.csv 的逗号分隔值文件中的以下数据具有如第一行所示定义的字段名称:

name,age,weight,height
Bob Barker,25,175,6ft 2in
Ted Kingston,28,163,5ft 10in
Mary Manson,27,140,5ft 6in
Sue Sommers,27,132,5ft 8in
Alice Toklas,24,124,5ft 6in

以下是工作代码，说明如何读取此数据并将其存储到记录列表中，并自动创建单独的查找表以查找与每个记录的字段中包含的值相关联的记录。

记录是由 namedtuple 创建的类的实例这是一种非常有效的内存，因为每个人都缺少一个 __dict__类实例通常包含的属性。使用它们将可以使用点语法按名称访问每个字段，例如 record.fieldname .

查找表是defaultdict(list)实例，它平均提供类似字典的 O(1) 次查找时间，并且还允许多个值与每个值相关联。因此，查找键是要查找的字段值的值，与之关联的数据将是 Person 的整数索引列表。存储在 employees 中的记录列出该值 - 所以它们都会相对较小。

请注意，该类的代码完全是数据驱动的，因为它不包含任何硬编码的字段名称，而是在读入时全部取自 csv 数据输入文件的第一行。当然，当使用实例，所有retrieve()方法调用必须提供有效的字段名称。

更新

修改为在首次读取数据文件时不为每个字段的每个唯一值创建查找表。现在 retrieve()方法“懒惰地”仅在需要时创建它们(并保存/缓存结果以供将来使用)。还修改为在 Python 2.7+ 中工作，包括 3.x。

from collections import defaultdict, namedtuple
import csv

class DataBase(object):
    def __init__(self, csv_filename, recordname):
        # Read data from csv format file into a list of namedtuples.
        with open(csv_filename, 'r') as inputfile:
            csv_reader = csv.reader(inputfile, delimiter=',')
            self.fields = next(csv_reader)  # Read header row.
            self.Record = namedtuple(recordname, self.fields)
            self.records = [self.Record(*row) for row in csv_reader]
            self.valid_fieldnames = set(self.fields)

        # Create an empty table of lookup tables for each field name that maps
        # each unique field value to a list of record-list indices of the ones
        # that contain it.
        self.lookup_tables = {}

    def retrieve(self, **kwargs):
        """ Fetch a list of records with a field name with the value supplied
            as a keyword arg (or return None if there aren't any).
        """
        if len(kwargs) != 1: raise ValueError(
            'Exactly one fieldname keyword argument required for retrieve function '
            '(%s specified)' % ', '.join([repr(k) for k in kwargs.keys()]))
        field, value = kwargs.popitem()  # Keyword arg's name and value.
        if field not in self.valid_fieldnames:
            raise ValueError('keyword arg "%s" isn\'t a valid field name' % field)
        if field not in self.lookup_tables:  # Need to create a lookup table?
            lookup_table = self.lookup_tables[field] = defaultdict(list)
            for index, record in enumerate(self.records):
                field_value = getattr(record, field)
                lookup_table[field_value].append(index)
        # Return (possibly empty) sequence of matching records.
        return tuple(self.records[index]
                        for index in self.lookup_tables[field].get(value, []))

if __name__ == '__main__':
    empdb = DataBase('employees.csv', 'Person')

    print("retrieve(name='Ted Kingston'): {}".format(empdb.retrieve(name='Ted Kingston')))
    print("retrieve(age='27'): {}".format(empdb.retrieve(age='27')))
    print("retrieve(weight='150'): {}".format(empdb.retrieve(weight='150')))
    try:
        print("retrieve(hight='5ft 6in'):".format(empdb.retrieve(hight='5ft 6in')))
    except ValueError as e:
        print("ValueError('{}') raised as expected".format(e))
    else:
        raise type('NoExceptionError', (Exception,), {})(
            'No exception raised from "retrieve(hight=\'5ft\')" call.')

输出:

retrieve(name='Ted Kingston'): [Person(name='Ted Kingston', age='28', weight='163', height='5ft 10in')]
retrieve(age='27'): [Person(name='Mary Manson', age='27', weight='140', height='5ft 6in'),
                     Person(name='Sue Sommers', age='27', weight='132', height='5ft 8in')]
retrieve(weight='150'): None
retrieve(hight='5ft 6in'): ValueError('keyword arg "hight" is an invalid fieldname')
                           raised as expected

关于python - 存储一组四个(或更多)值的最佳数据结构是什么？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/15418386/

python - 存储一组四个(或更多)值的最佳数据结构是什么？

上一篇：python - 如何让pycharm在ubuntu中始终显示行号

下一篇：python - 有没有办法检查 Python 元组中是否存在某个项目？