java - Storing all combinations of String arrays and searching them

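I have a large static dataset in memory that stores the following attributes of people:

[sex, age, race, marital status, education, native country, work class, occupation]

Each attribute takes its value from a predefined set of values, and the sets differ in size from attribute to attribute. Here is the dictionary: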

[[Male, Female], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100], [White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black], [Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse], [Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool], [United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands], [Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked], [Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces]]

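I want a structure that holds every possible combination, so that for each combination that occurs in my dataset I can store some statistics (for example, how many times that combination appears in the dataset), but also store some information for combinations that do not occur in the dataset. So every combination has to be represented.

I tried generating all possible combinations into an ArrayList of String[], but that takes several seconds, and then searching for a specific combination with indexOf(x), where x is a String[], does not seem to work.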
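A likely explanation for the indexOf symptom: Java arrays do not override equals, so ArrayList.indexOf compares a String[] by reference and only finds the very same array object, never a different array with equal contents. The sketch below illustrates this with a small made-up list; the class and method names (IndexOfDemo, findByReference, findByContent) are hypothetical and not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class IndexOfDemo {

    // Uses List.indexOf, which calls equals() on the elements.
    // For arrays that is reference equality.
    static int findByReference(List<String[]> rows, String[] target) {
        return rows.indexOf(target);
    }

    // Linear scan that compares array contents element by element.
    static int findByContent(List<String[]> rows, String[] target) {
        for (int i = 0; i < rows.size(); i++) {
            if (Arrays.equals(rows.get(i), target)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"Male", "30"});
        rows.add(new String[] {"Female", "30"});

        // Same contents as the second row, but a different array object.
        String[] probe = {"Female", "30"};

        System.out.println(findByReference(rows, probe)); // prints -1
        System.out.println(findByContent(rows, probe));   // prints 1
    }
}
```

Storing each combination as a List<String> (for example via Arrays.asList) would also make indexOf compare contents, although the search stays linear in the number of combinations.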

import java.util.ArrayList;
import java.util.NoSuchElementException;

public class Grid  {

// Immutable fields
private final int combinationLength;
private final String[][] values;
private final int[] maxIndexes;
// Every generated combination, in generation order (the last attribute varies fastest)
private final ArrayList<String[]> gridValues = new ArrayList<String[]>();
// Mutable fields
private final int[] currentIndexes;
private boolean hasNext;

public Grid(final String[][] array) {
    combinationLength = array.length;
    values = array;
    maxIndexes = new int[combinationLength];
    currentIndexes = new int[combinationLength];

    if (combinationLength == 0) {
        hasNext = false;
        return;
    }

    hasNext = true;


    // Fill in the arrays of max indexes and current indexes.
    for (int i = 0; i < combinationLength; ++i) {
        if (values[i].length == 0) {
            // Set hasNext to false if at least one of the value-arrays is empty.
            // Stop the loop as the behavior of the iterator is already defined in this case:
            // the iterator will just return no combinations.
            hasNext = false;
            return;
        }

        maxIndexes[i] = values[i].length - 1;
        currentIndexes[i] = 0;
    }

    // Eagerly materialize every combination into the list.
    while (hasNext()){
        String[] nextCombination = next();
        gridValues.add(nextCombination);
    }
}


private boolean hasNext() {
    return hasNext;
}


public String[] next() {
    if (!hasNext) {
        throw new NoSuchElementException("No more combinations are available");
    }
    final String[] combination = getCombinationByCurrentIndexes();
    nextIndexesCombination();
    return combination;
}

private String[] getCombinationByCurrentIndexes() {
    final String[] combination = new String[combinationLength];
    for (int i = 0; i < combinationLength; ++i) {
        combination[i] = values[i][currentIndexes[i]];
    }
    return combination;
}

private void nextIndexesCombination() {

    for (int i = combinationLength - 1; i >= 0; --i) {
        if (currentIndexes[i] < maxIndexes[i]) {
            // Increment the current index
            ++currentIndexes[i];
            return;
        } else {
            // Current index at max: 
            // reset it to zero and "carry" to the next index
            currentIndexes[i] = 0;
        }
    }
    // If we are here, then all current indexes are at max, and there are no more combinations
    hasNext = false;
}
}

Can anyone think of a faster, better way to do this?

Thanks a lot!

Best answer

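I am making one assumption here: that the data does not change constantly (looking at it, it does not feel dynamic).

I would use a local file-based HSQLDB database to store the data (I chose it for speed, but feel free to swap it for a full-fledged DB such as MySQL).

The trick to getting every kind of count across the different dimensions lies in the schema. For data mining, a "star schema" is the preferred approach. This schema lets you group by, and count on, any dimension you want. In your case the schema might look like this: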

table people - columns(id (primary key), name, age, sex_id, country_id, highest_education_id, income)
table sex - columns(id (primary key), name)
table country - columns(id (primary key), name)
table education - columns(id (primary key), name)

That way, if you want the number of people from Columbia, the query would look something like:

select count(*) from people where country_id = <columbia country id>

You can also run higher-order queries, such as finding the total income of everyone from Japan:

select country.name, sum(people.income)
from people inner join country on people.country_id = country.id
where country.name = 'Japan'
group by country.name
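For readers who want to see what these two queries compute without setting up a database, here is a minimal plain-Java sketch of the same aggregations over an in-memory list. It is an analogue for illustration, not the database approach recommended in this answer, and the names StarSchemaSketch, Person, countByCountry and incomeByCountry are made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class StarSchemaSketch {

    // Hypothetical fact row: a foreign key into the country dimension plus one measure.
    static final class Person {
        final int countryId;
        final long income;

        Person(int countryId, long income) {
            this.countryId = countryId;
            this.income = income;
        }
    }

    // Analogue of: select count(*) from people where country_id = ?
    static long countByCountry(List<Person> people, int countryId) {
        return people.stream()
                .filter((Person p) -> p.countryId == countryId)
                .count();
    }

    // Analogue of: select country.name, sum(people.income) ... group by country.name
    // Assumes every countryId has an entry in the dimension map
    // (a missing entry would produce a null key, which groupingBy rejects).
    static Map<String, Long> incomeByCountry(List<Person> people,
                                             Map<Integer, String> countries) {
        return people.stream().collect(Collectors.groupingBy(
                (Person p) -> countries.get(p.countryId),
                TreeMap::new,
                Collectors.summingLong((Person p) -> p.income)));
    }
}
```

A database does the same work but adds persistence and indexes, which matters more as the number of rows and dimensions grows.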

It is highly flexible and scalable.
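If the data has to stay in memory, one possible alternative (not part of the answer above) is to avoid materializing the combinations at all. With attribute sets of sizes 2, 100, 5, 7, 16, 41, 8 and 14, the dictionary in the question yields 2 × 100 × 5 × 7 × 16 × 41 × 8 × 14 = 514,304,000 combinations, so each combination can be mapped to a unique number with mixed-radix arithmetic, and statistics need to be stored only for the combinations that actually occur. The sketch below assumes that a missing entry means a count of zero; CombinationIndex and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CombinationIndex {
    private final String[][] values;
    // For each attribute, maps a value to its position in that attribute's value set.
    private final List<Map<String, Integer>> positions = new ArrayList<>();
    private final long size;
    // Sparse statistics: only combinations that were counted have an entry.
    private final Map<Long, Integer> counts = new HashMap<>();

    public CombinationIndex(String[][] values) {
        this.values = values;
        long total = 1;
        for (String[] attribute : values) {
            Map<String, Integer> byValue = new HashMap<>();
            for (int i = 0; i < attribute.length; i++) {
                byValue.put(attribute[i], i);
            }
            positions.add(byValue);
            // Throws ArithmeticException instead of silently overflowing.
            total = Math.multiplyExact(total, (long) attribute.length);
        }
        size = total;
    }

    // Total number of possible combinations.
    public long size() {
        return size;
    }

    // Mixed-radix encoding: the last attribute varies fastest.
    public long encode(String[] combination) {
        if (combination.length != values.length) {
            throw new IllegalArgumentException("Expected " + values.length + " attributes");
        }
        long index = 0;
        for (int i = 0; i < values.length; i++) {
            Integer pos = positions.get(i).get(combination[i]);
            if (pos == null) {
                throw new IllegalArgumentException("Unknown value: " + combination[i]);
            }
            index = index * values[i].length + pos;
        }
        return index;
    }

    // Inverse of encode: rebuilds the combination from its index.
    public String[] decode(long index) {
        String[] combination = new String[values.length];
        for (int i = values.length - 1; i >= 0; i--) {
            int radix = values[i].length;
            combination[i] = values[i][(int) (index % radix)];
            index /= radix;
        }
        return combination;
    }

    public void increment(String[] combination) {
        counts.merge(encode(combination), 1, Integer::sum);
    }

    // Combinations never seen in the dataset report a count of 0.
    public int count(String[] combination) {
        return counts.getOrDefault(encode(combination), 0);
    }
}
```

Because the last attribute varies fastest, the encoded index matches the position a combination would have in the list built by the Grid class from the question, so a lookup becomes a handful of hash-map probes instead of a scan over the whole list. If information must be stored for every combination rather than only the observed ones, a dense array indexed by this number is conceivable, but at roughly 514 million entries its memory footprint would need to be checked first.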

Regarding "java - Storing all combinations of String arrays and searching them", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39062404/
