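I have a large static dataset in memory that stores the following attributes of people:
[sex, age, race, marital status, education, native country, workclass, occupation]
Each attribute takes its value from a predefined set of values, and the sets differ in size from attribute to attribute. Here is the dictionary: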
[[Male, Female], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100], [White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black], [Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse], [Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool], [United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands], [Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked], [Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces]]
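I want a structure that holds all possible combinations, so that for each combination present in my dataset I can store some statistics (for example, the number of times that combination occurs in the dataset), while also storing some information for combinations that do not occur in the dataset. So every combination should be represented.
I tried generating all possible combinations with an ArrayList of String[], but that takes a few seconds, and then searching for a particular combination with indexOf(x), where x is a String[], does not seem to work.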
import java.util.ArrayList;
import java.util.NoSuchElementException;

public class Grid {
    // Immutable fields
    private final int combinationLength;
    private final String[][] values;
    private final int[] maxIndexes;
    private final ArrayList<String[]> GridValues = new ArrayList<String[]>();
    // Mutable fields
    private final int[] currentIndexes;
    private boolean hasNext;

    public Grid(final String[][] array) {
        combinationLength = array.length;
        values = array;
        maxIndexes = new int[combinationLength];
        currentIndexes = new int[combinationLength];
        if (combinationLength == 0) {
            hasNext = false;
            return;
        }
        hasNext = true;
        // Fill in the arrays of max indexes and current indexes.
        for (int i = 0; i < combinationLength; ++i) {
            if (values[i].length == 0) {
                // Set hasNext to false if at least one of the value-arrays is empty.
                // Stop the loop as the behavior of the iterator is already defined in this case:
                // the iterator will just return no combinations.
                hasNext = false;
                return;
            }
            maxIndexes[i] = values[i].length - 1;
            currentIndexes[i] = 0;
        }
        // Eagerly materialize every combination into GridValues.
        while (hasNext()) {
            String[] nextCombination = next();
            GridValues.add(nextCombination);
        }
    }

    private boolean hasNext() {
        return hasNext;
    }

    public String[] next() {
        if (!hasNext) {
            throw new NoSuchElementException("No more combinations are available");
        }
        final String[] combination = getCombinationByCurrentIndexes();
        nextIndexesCombination();
        return combination;
    }

    private String[] getCombinationByCurrentIndexes() {
        final String[] combination = new String[combinationLength];
        for (int i = 0; i < combinationLength; ++i) {
            combination[i] = values[i][currentIndexes[i]];
        }
        return combination;
    }

    private void nextIndexesCombination() {
        for (int i = combinationLength - 1; i >= 0; --i) {
            if (currentIndexes[i] < maxIndexes[i]) {
                // Increment the current index
                ++currentIndexes[i];
                return;
            } else {
                // Current index at max:
                // reset it to zero and "carry" to the next index
                currentIndexes[i] = 0;
            }
        }
        // If we are here, then all current indexes are at max, and there are no more combinations
        hasNext = false;
    }
}
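The indexOf symptom described in the question can be reproduced in isolation. The sketch below is not part of the original post; the class name IndexOfDemo and the two-attribute sample data are made up for illustration. It shows that ArrayList.indexOf relies on equals(), which for arrays compares references rather than contents, whereas a List<String> compares element by element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not from the original post.
final class IndexOfDemo {
    // Looks up the probe in a list of raw String[] combinations.
    static int indexInArrayList(String[] probe) {
        List<String[]> grid = new ArrayList<>();
        grid.add(new String[] {"Male", "30"});
        grid.add(new String[] {"Female", "30"});
        // indexOf calls equals(); arrays inherit Object.equals, i.e. reference identity.
        return grid.indexOf(probe);
    }

    // Looks up the same probe when each combination is stored as a List<String>.
    static int indexInListOfLists(String[] probe) {
        List<List<String>> grid = new ArrayList<>();
        grid.add(Arrays.asList("Male", "30"));
        grid.add(Arrays.asList("Female", "30"));
        // List.equals compares the elements pairwise, so equal contents match.
        return grid.indexOf(Arrays.asList(probe));
    }

    public static void main(String[] args) {
        String[] probe = {"Female", "30"};
        System.out.println(indexInArrayList(probe));   // prints -1
        System.out.println(indexInListOfLists(probe)); // prints 1
    }
}
```

Even with List<String> elements, indexOf remains a linear scan, so it only addresses the correctness half of the problem, not the speed.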
Has anyone come up with a faster, better approach?
Thanks a lot!
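One in-memory alternative, which is not what the accepted answer proposes, is to avoid materializing the combinations at all. With the dictionary above there are 2 × 100 × 5 × 7 × 16 × 41 × 8 × 14 = 514,304,000 of them. The sketch below encodes a combination as a single mixed-radix number and keeps counts only for combinations that were actually seen; every other combination is implicitly represented with a count of 0. The class name CombinationIndex and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not from the original post.
final class CombinationIndex {
    private final List<Map<String, Integer>> positions = new ArrayList<>(); // value -> position, per attribute
    private final int[] sizes;                                              // number of values per attribute
    private final Map<Long, Integer> counts = new HashMap<>();              // sparse statistics

    CombinationIndex(String[][] values) {
        sizes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            sizes[i] = values[i].length;
            Map<String, Integer> pos = new HashMap<>();
            for (int j = 0; j < values[i].length; j++) {
                pos.put(values[i][j], j);
            }
            positions.add(pos);
        }
    }

    // Number of distinct combinations, computed without enumerating them.
    long totalCombinations() {
        long total = 1;
        for (int s : sizes) {
            total *= s;
        }
        return total;
    }

    // Mixed-radix encoding: the last attribute varies fastest,
    // matching the enumeration order of the Grid class above.
    long encode(String[] combination) {
        if (combination.length != sizes.length) {
            throw new IllegalArgumentException("Expected " + sizes.length + " attributes");
        }
        long key = 0;
        for (int i = 0; i < sizes.length; i++) {
            Integer p = positions.get(i).get(combination[i]);
            if (p == null) {
                throw new IllegalArgumentException("Unknown value: " + combination[i]);
            }
            key = key * sizes[i] + p;
        }
        return key;
    }

    // Increments the occurrence count of one combination.
    void record(String[] combination) {
        counts.merge(encode(combination), 1, Integer::sum);
    }

    // Combinations never recorded report 0.
    int count(String[] combination) {
        return counts.getOrDefault(encode(combination), 0);
    }
}
```

Lookups become hash lookups instead of list scans. Whether a sparse map is appropriate depends on how much per-combination information must be kept for combinations that never occur, which the question leaves open.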
Best answer
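I'm making an assumption here: I assume the data does not change constantly (looking at the data, it does not feel dynamic).
I would use a local file-based HSQL DB to store the data (I chose it for speed, but feel free to swap it for a full-fledged DB such as MySQL).
The trick to getting counts of all kinds across different dimensions lies in the schema. For data mining, the "Star Schema" is the preferred approach. This schema lets you group by, and count on, any dimension you like. In your case, the schema might look like this: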
table people - columns(id (primary key), name, age, sex_id, country_id, highest_education_id, income)
table sex - columns(id (primary key), name)
table country - columns(id (primary key), name)
table education - columns(id (primary key), name)
This way, if you want to find the number of people from Columbia, the query would be something like:
select count(*) from people where country_id = <columbia country id>
You can also run higher-order queries, for example finding the total income of everyone from Japan:
select country.name, sum(people.income)
from people inner join country on people.country_id = country.id
where country.name = 'Japan'
group by country.name
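For readers who want to see what these two queries compute without setting up a database, here is a plain-Java sketch using streams over in-memory lists. It is an illustration only and does not use HSQLDB; the classes PeopleQueries, Person and Country and their fields are hypothetical stand-ins for the people and country tables.

```java
import java.util.List;

// Hypothetical in-memory stand-in for the star schema, not HSQLDB code.
final class PeopleQueries {
    static final class Country {
        final int id;
        final String name;
        Country(int id, String name) { this.id = id; this.name = name; }
    }

    static final class Person {
        final int id;
        final int countryId;
        final long income;
        Person(int id, int countryId, long income) {
            this.id = id;
            this.countryId = countryId;
            this.income = income;
        }
    }

    // Mirrors: select count(*) from people where country_id = ?
    static long countByCountry(List<Person> people, int countryId) {
        return people.stream()
                .filter(p -> p.countryId == countryId)
                .count();
    }

    // Mirrors: the sum of people.income joined to country by id, filtered on country.name
    static long totalIncome(List<Person> people, List<Country> countries, String countryName) {
        return countries.stream()
                .filter(c -> c.name.equals(countryName))
                .flatMap(c -> people.stream().filter(p -> p.countryId == c.id))
                .mapToLong(p -> p.income)
                .sum();
    }
}
```

The database version has the advantage that further dimensions only require another join or group-by column rather than new Java code.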
It is highly flexible and scalable.
On the topic of java - storing all combinations of String arrays and searching them, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39062404/