我无法理解水库采样所涉及的概率。下面是我看到的几乎无处不在的示例代码:
1/*
2 S has items to sample, R will contain the result, K number of items to select
3*/
4ReservoirSample(S[1..n], R[1..k])
5 // fill the reservoir array
6 for i = 1 to k
7 R[i] := S[i]
8
9 // replace elements with gradually decreasing probability
10 for i = k+1 to n
11 j := random(1, i) // important: inclusive range
12 if j <= k
13 R[j] := S[i]
我的理解对吗(?): 假设我们有 k=3 并且输入 = [100, 200, 300, 400, 500] 并且 i 当前处于 500 索引。 500 替换 300 在水库中的概率(大小为 3)= 300 在水库中被选中的概率 * 500 被选中的概率,只有当随机函数返回的索引小于或等于 3 时才有可能5 个选择 = 1/3 * 3/5 = 1/5
最佳答案
你的理解好像不对。 500 人被选中的概率与 300 人被选中的概率无关。
您应该查看 gfg link for the same. “这是如何运作的?”部分,其中说明如下:
Case 1: For last n-k stream items, i.e., for stream[i] where k <= i < n For every such stream item stream[i], we pick a random index from 0 to i and if the picked index is one of the first k indexes, we replace the element at picked index with stream[i]
To simplify the proof, let us first consider the last item. The probability that the last item is in final reservoir = The probability that one of the first k indexes is picked for last item = k/n (the probability of picking one of the k items from a list of size n)
Let us now consider the second last item. The probability that the second last item is in final reservoir[] = [Probability that one of the first k indexes is picked in iteration for stream[n-2]] X [Probability that the index picked in iteration for stream[n-1] is not same as index picked for stream[n-2] ] = [k/(n-1)]*[(n-1)/n] = k/n.
Similarly, we can consider other items for all stream items from stream[n-1] to stream[k] and generalize the proof.
Case 2: For first k stream items, i.e., for stream[i] where 0 <= i < k The first k items are initially copied to reservoir[] and may be removed later in iterations for stream[k] to stream[n]. The probability that an item from stream[0..k-1] is in final array = Probability that the item is not picked when items stream[k], stream[k+1], …. stream[n-1] are considered = [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x … x [(n-1)/n] = k/n
关于algorithm - 水库采样了解概率,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/35326289/