search - 最适合前缀匹配搜索的数据结构

我要创建一个客户列表系统(可以有1000万个客户)，每个客户都有一个唯一的ID，唯一的ID由10个字母组成，前3个是大写字母，后7个是数字(例如:LQK0333208、HCK1646129，...)。系统必须以最快的方式执行两种搜索操作(精确匹配搜索和部分匹配搜索):

对于精确匹配搜索，用户输入完整的客户 ID，系统会显示匹配客户的详细信息，如果没有匹配客户，则会显示错误消息。
对于部分匹配搜索，用户输入多个(至少 5 个、最多 8 个)客户 ID 开头字母，系统会显示匹配客户的详细信息，如果没有匹配客户，则显示错误消息。如果匹配的客户数量大于 10 个，则仅显示其中 10 个。

那么什么数据结构适合这个系统呢？目前，我使用AVL树来处理这个问题:

对于精确匹配搜索，我将执行对数搜索(左子树和右子树):O(log(n))。
对于部分匹配搜索，我将对 AVL 树执行有序搜索，并检查每个客户是否具有所需的前缀。这是一个线性搜索:O(n)。

但是我想要部分匹配搜索，系统将在时间复杂度方面执行更好的搜索。那么关于数据结构的任何建议是否适合系统的要求？

编辑1:我已经使用 Trie 树和三元搜索树测试了该程序，但适用于更大的数据集，例如(1000 万客户)。我无法将内存中的数据结构存储在具有更大数据集的内存中。那么有什么建议吗？
编辑2:我已经测试了排序数组数据结构，它适用于1000万用户的数据集。实际上，当我对 Trie 树或三元树一无所知时，这是我的第一个方法。据我了解，首先我们将所有客户存储在一个数组中，然后使用一些排序算法(例如快速排序)对数组进行排序。然后进行二分查找来查找key，就是O(log(n))来进行查找操作，相当不错!但从长远来看，当我们需要向数组中添加额外的数据(不是创建新数据，而是添加到数组中)时，例如仅增加一个客户，因此添加新元素最坏情况下将花费 O(n)情况下，我们需要找到在哪里添加和移动元素。但对于 Trie 或三叉树这样的数据结构，当添加新元素时，可能只需要 O(1)，因为我们只需要遍历树来查找字符串。如果我们不介意空间复杂度，我认为 trie 或三叉树最适合这个项目。

最佳答案

合适的数据结构是 trie 。这是一棵所有前缀的树，其中每个节点(根除外)代表一个字符，从根到叶子的每个可能路径都是与有效 ID 对应的字符序列。

部分匹配意味着存在一条从根开始、以内部节点结束的路径。

如果使用高效的子查找来实现，则在此特定用例中可以通过 10 个步骤找到匹配项。因此，如果我们将 10 视为常数，则无论树有多大(即宽)，都可以在常数时间内完成匹配。这假设通过字符查找子项可以在恒定时间内(平均)完成。

由于在这个特定的用例中字母表是有限的(仅限大写或仅限数字)，一个节点最多可以有 26 个子条目，这些子条目可以存储在该大小的数组中，其中索引映射到相应的特点。这将确保从父节点步进到相关子节点的时间恒定。或者，也可以使用散列系统(而不是具有 26 个槽的数组)。

这是一个 JavaScript 演示实现(使用子对象的普通对象，即“字典”):

class TrieNode {
    constructor(data=null) {
        this.children = {}; // Dictionary, <character, TrieNode>
        this.data = data; // Non-null when this node represents the end of a valid word
    }
    addWord(word, data) {
        let node = this; // the root of the tree
        for (let ch of word) {
            if (!(ch in node.children)) {
                node.children[ch] = new TrieNode(); 
            }
            node = node.children[ch]; // Walk down the tree
        }
        node.data = data;
    }
    *getAllData() { // This method returns an iterator over all data in this subtree
        if (this.data != null) yield this.data;
        // Recursively yield all data in the children's subtrees
        for (const child in this.children) yield* this.children[child].getAllData();
    }
    *find(prefix) { // This method returns an iterator over matches
        let node = this;
        // Find the node where this prefix ends:
        for (let ch of prefix) {
            if (!(ch in node.children)) return; // No matches
            node = node.children[ch];
        }
        // Yield all data in this subtree
        yield* node.getAllData();
    }
}

class Customer {
    constructor(id, name) {
        this.id = id;
        this.name = name;
    }
    toString() {
        return this.name + " (" + this.id + ")";
    }
}

// Demo
// Create some Customer data:
const database = [
    new Customer('LQK0333208', 'Hanna'),
    new Customer('LQK0333311', 'Bert'),
    new Customer('LQK0339999', 'Joline'),
    new Customer('HCK1646129', 'Sarah'),
    new Customer('HCK1646130', 'Pete'),
    new Customer('HCK1700012', 'Cristine')
];

// Build a trie for the database of customers
const trie = new TrieNode(); // The root node of the trie.
for (const customer of database) {
    trie.addWord(customer.id, customer);
}
// Make a few queries
console.log("query: LQK0333");
for (const customer of trie.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of trie.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of trie.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of trie.find("LQK09")) console.log("found: " + customer);

排序数组

另一种方法是将客户记录存储在排序数组中。 JavaScript 没有这样的数据结构，但是 splice 在 JavaScript 中的速度惊人地快，因此您可以通过在已排序的位置插入新条目来维持排序顺序。二分查找可用于定位要查找或插入条目的索引:

class SortedArray {
    constructor(keyField) {
        this.arr = [];
        this.keyField = keyField;
    }
    addObject(obj) {
        const i = this.indexOf(obj[this.keyField]);
        if (this.arr[i]?.[this.keyField] === obj[this.keyField]) throw "Duplicate not added";
        this.arr.splice(i, 0, obj);
    }
    *find(prefix) { // This method returns an iterator over matches
        for (let i = this.indexOf(prefix); i < this.arr.length; i++) {
            const obj = this.arr[i];
            if (!obj[this.keyField].startsWith(prefix)) return;
            yield obj;
        }
    }
    indexOf(key) {
        let low = 0, high = this.arr.length;
        while (low < high) {
            const mid = (low + high) >> 1;
            if (key === this.arr[mid][this.keyField]) return mid;
            if (key > this.arr[mid][this.keyField]) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return low;
    }
}

class Customer {
    constructor(id, name) {
        this.id = id;
        this.name = name;
    }
    toString() {
        return this.name + " (" + this.id + ")";
    }
}

const database = [
    new Customer('LQK0333208', 'Hanna'),
    new Customer('LQK0333311', 'Bert'),
    new Customer('LQK0339999', 'Joline'),
    new Customer('HCK1646129', 'Sarah'),
    new Customer('HCK1646130', 'Pete'),
    new Customer('HCK1700012', 'Cristine')
];

const arr = new SortedArray("id");
for (const customer of database) {
    arr.addObject(customer);
}
console.log("query: LQK0333");
for (const customer of arr.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of arr.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of arr.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of arr.find("LQK09")) console.log("found: " + customer);

关于search - 最适合前缀匹配搜索的数据结构，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/71824351/

search - 最适合前缀匹配搜索的数据结构

排序数组

上一篇：google-apps-script - 将电子表格文件中的工作表复制到 Google 云端硬盘文件夹

下一篇：php - Laravel eloquent 无法正确获取数据