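Background

I've written a program designed to calculate the odds of each outcome associated with a die roll or a combination of die rolls. Specifically, when dealing with a roll like "add two six-sided dice together" (a.k.a. Catan dice), the probabilities are calculated by keeping a mapping between each outcome and the number of possible rolls that produce that outcome. For this example, the data structure is a TreeMap<Integer, BigInteger> and takes the following form: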

{
[2, 1],
[3, 2],
[4, 3],
[5, 4],
[6, 5],
[7, 6],
[8, 5],
[9, 4],
[10, 3],
[11, 2],
[12, 1]
}

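Before anyone asks, using BigInteger is not overkill in this context: the program is designed to handle any roll that might be presented to it, and a roll like 100d6 (the sum of 100 six-sided dice) very quickly produces very large numbers that I don't want to approximate as a double.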
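
The question does not show how the map is populated, so the following is only a sketch of one way it could be done: start from "zero dice" and fold in one die at a time, adding up the number of ways to reach each sum. The names DiceDistribution and sumOfDice are made up for this illustration and are not part of the asker's program.

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

public class DiceDistribution {
    /** For each possible sum of `dice` dice with `sides` faces, counts how many rolls produce it. */
    public static TreeMap<Integer, BigInteger> sumOfDice(int dice, int sides) {
        TreeMap<Integer, BigInteger> result = new TreeMap<>();
        result.put(0, BigInteger.ONE); // zero dice: exactly one way to total 0
        for (int d = 0; d < dice; d++) {
            TreeMap<Integer, BigInteger> next = new TreeMap<>();
            for (Map.Entry<Integer, BigInteger> e : result.entrySet()) {
                for (int face = 1; face <= sides; face++) {
                    // every way of reaching e.getKey() extends to e.getKey() + face
                    next.merge(e.getKey() + face, e.getValue(), BigInteger::add);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // Same counts as the 2d6 table above
        System.out.println(sumOfDice(2, 6));

        // Total number of trials for 100d6 is 6^100, far beyond the range of long
        BigInteger total = BigInteger.ZERO;
        for (BigInteger count : sumOfDice(100, 6).values()) {
            total = total.add(count);
        }
        System.out.println(total.equals(BigInteger.valueOf(6).pow(100)));
    }
}
```

The counts stay exact at every step, which is the reason given above for preferring BigInteger over double.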

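As part of the interface for this program, I decided that I want statistics about these rolls to be queryable, and one such statistic I want the program to find is the median of the roll. My current version of the algorithm keeps a running total of the trials for every outcome up to and including the current one, starting with the lowest outcome, and as soon as that total reaches 50% of all trials it reports the current outcome as the median.

This is how I've written the code.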

//Is filled with values during object construction
TreeMap<Integer, BigInteger> probabilityMap = new TreeMap<>();

//Memoization to at least make sure we only make this calculation once
private Integer memoizedMedian = null;
public int getMedian() {
    if(memoizedMedian == null) {
        BigInteger trials = BigInteger.ZERO;
        BigInteger totalTrials = numOfTrials();
        for(Map.Entry<Integer, BigInteger> entry : probabilityMap.entrySet()) {
            //We're guaranteed to be iterating in ascending key order, because that is how TreeMap works
            trials = trials.add(entry.getValue());
            double percentile = trials.doubleValue() / totalTrials.doubleValue();
            if(percentile >= 0.5) {
                memoizedMedian = entry.getKey();
                break;
            }
        }
        //If we didn't find it, something went wrong with the object initialization
        if(memoizedMedian == null)
            throw new RuntimeException("Probability Map was not properly Initialized");
    }
    return memoizedMedian;
}
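
One thing worth noting about the getMedian() above: BigInteger.doubleValue() returns infinity once the value is too large for a double, so for very large rolls (the total for 400d6, for example, exceeds the double range) the percentile can become NaN and the loop would never find a median. A minimal sketch of the same scan using only exact integer comparison follows; the class name ExactMedian and the sample counts are invented for illustration.

```java
import java.math.BigInteger;
import java.util.Map;
import java.util.TreeMap;

public class ExactMedian {
    /** Returns the smallest outcome whose cumulative count reaches at least half of all trials. */
    public static int median(TreeMap<Integer, BigInteger> probabilityMap) {
        BigInteger total = BigInteger.ZERO;
        for (BigInteger count : probabilityMap.values()) {
            total = total.add(count);
        }

        BigInteger cumulative = BigInteger.ZERO;
        for (Map.Entry<Integer, BigInteger> entry : probabilityMap.entrySet()) {
            cumulative = cumulative.add(entry.getValue());
            // cumulative / total >= 1/2 is the same as 2 * cumulative >= total,
            // which needs no floating point at all
            if (cumulative.shiftLeft(1).compareTo(total) >= 0) {
                return entry.getKey();
            }
        }
        throw new IllegalStateException("Probability map is empty");
    }

    public static void main(String[] args) {
        // Counts far beyond the double range (illustrative values only)
        BigInteger huge = BigInteger.TEN.pow(400);
        TreeMap<Integer, BigInteger> map = new TreeMap<>();
        map.put(1, huge);
        map.put(2, huge.multiply(BigInteger.valueOf(3)));
        System.out.println(median(map)); // prints 2
    }
}
```

This does not change the linear scan itself; it only removes the conversion to double from the comparison.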

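The Problem

As written, this code works: it finds the median of any set of values I pass to it.

But the problem is that it can be slow. Consider this (exaggerated) data set to see why: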

{
[1,1],
[2,1],
[3,1],
[4,1],
[5,1],
[6,1],
[7,1],
[8,1],
[9,1],
[10,11]
}

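Clearly, 10 is the median of this data set, but the algorithm cannot work that out until it has scanned the very last entry in the map, which could take a while to reach for larger, more complex probability maps.

So I'd like to improve the algorithm to handle data sets like this more responsibly, but I'm not sure what approach to take.

What kind of changes should I make to my algorithm to better compute the median of this data set? I'm also open to changes to the underlying data structure, but that should come with proper justification.

Best Answer

I have no experience with the number of possibilities in your 100d6 example, so this may or may not be the most optimized approach, but it front-loads the expensive work to the point where you create the probability map, using a pair of buckets for the large and the small values. It is also order-dependent (values are expected in ascending order), although an order-independent version could be built with a rebalance method that works in both directions. I went ahead and used Integers just so I could get away with basic math operations.

The initial entries will be very volatile and require a lot of rebalancing. The obvious downside is that your creation performance takes a hit, but your median lookup becomes O(1).

The small bucket always contains the median, which can be found via max(smallBucket.keySet()). The large bucket contains everything above that key set and is kept only for rebalancing purposes. Note that this is not the true median if the true median falls between two rolls: the median of 1d2 is 1.5, which you cannot return if you only use integers for the median.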

import java.util.Map;
import java.util.TreeMap;

public class MedianMap {
    //outcomes at or below the median, and outcomes above it (value -> number of trials)
    private final TreeMap<Integer, Integer> smallBucket = new TreeMap<>();
    private final TreeMap<Integer, Integer> largeBucket = new TreeMap<>();

    private int smallBucketSize = 0;
    private int largeBucketSize = 0;
    private int median = 0;

    //values are expected in ascending order; this version is order-dependent
    public void add(int value, int trials) {
        //initial state is smallBucket should have more trials than largeBucket
        largeBucket.merge(value, trials, Integer::sum);
        largeBucketSize += trials;

        if(largeBucketSize > smallBucketSize) {
            rebalance();
        }
    }

    private void rebalance() {
        while(largeBucketSize > smallBucketSize) {
            //get the smallest bucket item to move over (TreeMap keeps its keys sorted)
            Map.Entry<Integer, Integer> smallest = largeBucket.pollFirstEntry();
            int key = smallest.getKey();
            int value = smallest.getValue();

            //move item from large to small bucket (pollFirstEntry already removed it)
            smallBucket.merge(key, value, Integer::sum);

            //update bucket values
            largeBucketSize -= value;
            smallBucketSize += value;

            //and the largest item in the small bucket is the new median
            median = key;

            //repeat as necessary
        }
    }

    public int getMedian() {
        return median;
    }
}
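
The caveat about a median that falls between two rolls can be made concrete. The sketch below is not part of the answer's MedianMap; it is a separate, hypothetical helper (MedianBounds and medianBounds are invented names) that scans a count map once and reports both the lower and the upper median, so a caller can take their midpoint when the two differ.

```java
import java.util.Map;
import java.util.TreeMap;

public class MedianBounds {
    /** Returns {lower, upper} medians; they differ only when the trials split exactly in half. */
    public static int[] medianBounds(TreeMap<Integer, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no trials");
        }

        Integer lower = null;
        long cumulative = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            cumulative += e.getValue();
            // lower median: first outcome covering at least half of the trials
            if (lower == null && 2 * cumulative >= total) {
                lower = e.getKey();
            }
            // upper median: first outcome covering strictly more than half
            if (2 * cumulative > total) {
                return new int[] { lower, e.getKey() };
            }
        }
        throw new IllegalStateException("counts must be positive");
    }

    public static void main(String[] args) {
        TreeMap<Integer, Integer> oneD2 = new TreeMap<>();
        oneD2.put(1, 1);
        oneD2.put(2, 1);
        int[] bounds = medianBounds(oneD2);
        System.out.println((bounds[0] + bounds[1]) / 2.0); // prints 1.5
    }
}
```

For a distribution such as 2d6, where one outcome crosses the halfway point on its own, both bounds are the same value and the integer median is exact.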

Regarding "java - What is a better way to get the median of a probability map?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53196802/
