algorithm - 具有指定结果概率的值的随机样本

假设我们有四个符号 - 'a'、'b'、'c'、'd'。我们还有这些符号出现在函数输出中的四个给定概率 - P1、P2、P3、P4 (其总和等于1)。如何实现一个函数来生成其中三个符号的随机样本，以便生成的符号以指定的概率出现在其中？

示例:'a'、'b'、'c' 和 'd' 具有概率分别为9/30、8/30、7/30和6/30。该函数输出这四个符号中任意三个的各种随机样本:'abc'、'dca'、'bad' 等等。我们多次运行这个函数，计算每个符号在其输出中遇到的次数。最后，为 'a' 存储的计数值除以输出符号总数应收敛于 9/30，对于 'b' 到 8/30，'c' 到 7/30，'d'到6/30。

例如该函数生成 10 个输出:

adc dab bca dab dba cab dcb acd cab abc

其中 30 个符号包含 9 个 'a'、8 个 'b'、7 个 'c' 和 6 个'd'。当然，这是一个理想化的例子，因为只有当样本数量更大时这些值才会收敛 - 但它应该有望传达这个想法。

显然，只有当概率都不大于 1/3 时，这一切才有可能，因为每个样本输出始终包含三个不同的符号。如果函数无法满足提供的值，则可以进入无限循环或表现不稳定。

注意:该函数显然应该使用 RNG，但否则应该是无状态的。每个新的调用都应该独立于之前的任何调用，RNG 状态除外。

编辑:尽管描述提到从 4 个值中选择 3 个，但理想情况下该算法应该能够处理任何样本大小。

最佳答案

您的问题尚未确定。

如果我们为每个允许的三个字母组成的字符串分配一个概率，p(abc)、p(abd)、p(acd) 等 xtc，我们可以生成一系列方程

eqn1: p(abc) + p(abd) + ... others with a "a" ... = p1 ... ... eqn2: p(abd) + p(acd) + ... others with a "d" ... = p4

这比方程有更多的未知数，所以解决它的方法有很多。一旦找到解决方案，无论您选择什么方法(如果您是我，请使用单纯形算法)，使用 @alestanis 描述的轮盘赌方法从每个字符串的概率中进行采样。

from numpy import * # using cvxopt-1.1.5 from cvxopt import matrix, solvers ########################### # Functions to do some parts # function to find all valid outputs def perms(alphabet, length): if length == 0: yield "" return for i in range(len(alphabet)): val1 = alphabet[i] for val2 in perms(alphabet[:i]+alphabet[i+1:], length-1): yield val1 + val2 # roulette sampler def roulette_sampler(values, probs): # Create cumulative prob distro probs_cum = [sum(probs[:i+1]) for i in range(n_strings)] def fun(): r = random.rand() for p,s in zip(probs_cum, values): if r < p: return s # in case of rounding error return values[-1] return fun ############################ # Main Part # create list of all valid strings alphabet = "abcd" string_length = 3 alpha_probs = [string_length*x/30. for x in range(9,5,-1)] # show probs for a,p in zip(alphabet, alpha_probs): print "p("+a+") =",p # all valid outputs for this particular case strings = [perm for perm in perms(alphabet, string_length)] n_strings = len(strings) # constraints from probabilities p(abc) + p(abd) ... = p(a) contains = array([[1. if s.find(a) >= 0 else 0. for a in alphabet] for s in strings]) #both = concatenate((contains,wons), axis=1).T # hacky, but whatever #A = matrix(both) #b = matrix(alpha_probs + [1.]) A = matrix(contains.T) b = matrix(alpha_probs) #also need to constrain to [0,1] wons = array([[1. for s in strings]]) G = matrix(concatenate((eye(n_strings),wons,-eye(n_strings),-wons))) h = matrix(concatenate((ones(n_strings+1),zeros(n_strings+1)))) ## target matricies for approx KL divergence # uniform prob over valid outputs u = 1./len(strings) P = matrix(eye(n_strings)) q = -0.5*u*matrix(ones(n_strings)) # will minimise p^2 - pq for each p val equally # Do convex optimisation sol = solvers.qp(P,q,G,h,A,b) probs = array(sol['x']) # Print ouput for s,p in zip(strings,probs): print "p("+s+") =",p checkprobs = [0. for char in alphabet] for a,i in zip(alphabet, range(len(alphabet))): for s,p in zip(strings,probs): if s.find(a) > -1: checkprobs[i] += p print "p("+a+") =",checkprobs[i] print "total =",sum(probs) # Create the sampling function rndstring = roulette_sampler(strings, probs) ################### # Verify print "sampling..." test_n = 1000 output = [rndstring() for i in xrange(test_n)] # find which one it is sampled_freqs = [] for char in alphabet: n = 0 for val in output: if val.find(char) > -1: n += 1 sampled_freqs += [n] print "plotting histogram..." import matplotlib.pyplot as plt plt.bar(range(0,len(alphabet)),array(sampled_freqs)/float(test_n), width=0.5) plt.show()

编辑:Python 代码

关于algorithm - 具有指定结果概率的值的随机样本，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/13023787/

algorithm - 具有指定结果概率的值的随机样本

上一篇：java - 如何判断整数数组是否是交替数组？

下一篇：performance - 用于存储允许高效比较操作的值范围的数据结构