Randomly sample N unique customer IDs while ensuring that more transactions mean a higher chance of being selected

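I have a table called "transaction_history" containing millions of transactions, with the following columns: column 1: customer_id, column 2: transaction date.

In this table, a customer may have x transactions, where x >= 1.

What I want is a random sample of n unique customer IDs (n being the number of prizes, one per winner), while ensuring that the more transactions a given customer has, the higher their chance of being picked as a winner.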

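I have tried the following:

1- A straightforward dplyr::sample_n(transaction_history, size = ...), which produces a sample with duplicate customer_ids.

2- transaction_history %>% dplyr::distinct(customer_id) %>% dplyr::sample_n(size = ...), which does not give frequent customers a higher chance.

3- Sampling within each customer_id group before sampling again, which also defeats the purpose.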

Any help would be much appreciated.

Thanks

Best answer

How about this:

library(dplyr)   # %>%, mutate(), arrange(), sample_n()
library(tibble)  # tibble()

# create some random toy data to use for example:
testdata <- 
  tibble( person_id = sample(1:5, size=20, replace=TRUE) ) %>%
  mutate( transaction_id = row_number() ) %>% 
  arrange( person_id )

The toy data looks like this:

   person_id transaction_id
 1         1              9
 2         1             11
 3         2              4
 4         2              5
 5         2              6
 6         2             10
 7         2             19
 8         3              7
 9         3             17
10         3             18
11         3             20
12         4              1
13         4              2
14         4              3
15         4              8
16         4             12
17         4             13
18         4             14
19         4             16
20         5             15
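Because the toy data is generated randomly, a fresh run will generally not reproduce the table above. To follow along with exactly these rows, the table can be typed in directly. The block below is only a hand transcription of the printed output, not part of the original answer:

```r
library(dplyr)
library(tibble)

# Rebuild the exact toy data printed above (transcribed by hand)
testdata <- tibble(
  person_id      = rep(1:5, times = c(2, 5, 4, 8, 1)),
  transaction_id = c(9, 11, 4, 5, 6, 10, 19, 7, 17, 18, 20,
                     1, 2, 3, 8, 12, 13, 14, 16, 15)
)

# Number of transactions per person
count(testdata, person_id, name = "n_transactions")
```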

Now count the number of transactions per person, and use that count as the weight in the sample_n() function:

testdata %>% 
  # count number of transactions per person:
  group_by(person_id) %>% 
  summarise( n_transactions = n() ) %>%
  ungroup() %>% 
  # select a random 2 people, where chance of being selected is based on number of transactions:
  sample_n( size = 2,
            weight = n_transactions
          )

If you run the code block above several times on the same toy data, you will see that people with more transactions are selected more often.
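One way to check this without rerunning by hand is a small simulation. The sketch below assumes the per-person counts of the toy data shown earlier (2, 5, 4, 8 and 1 transactions); the seed, the number of runs and the names `counts`, `n_runs` and `win_share` are arbitrary choices made for illustration:

```r
library(dplyr)
library(tibble)

set.seed(42)  # arbitrary seed, only so the simulation is repeatable

# Per-person transaction counts taken from the toy data
counts <- tibble(
  person_id      = 1:5,
  n_transactions = c(2, 5, 4, 8, 1)
)

# Repeat the weighted draw of 2 winners many times;
# the result is a 2 x n_runs matrix of person_ids
n_runs <- 2000
winners <- replicate(
  n_runs,
  counts %>%
    sample_n(size = 2, weight = n_transactions) %>%
    pull(person_id)
)

# Every run should contain two distinct people
stopifnot(all(winners[1, ] != winners[2, ]))

# Share of runs in which each person was among the winners
win_share <- sapply(
  counts$person_id,
  function(id) mean(colSums(winners == id) > 0)
)

tibble(
  person_id      = counts$person_id,
  n_transactions = counts$n_transactions,
  win_share      = win_share
)
```

The `win_share` column should rise with `n_transactions`, and it sums to 2 because every run picks exactly two distinct people.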

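The selection probabilities that sample_n() uses for the first draw are computed as follows (see the help page for sample_n()). Because sampling is without replacement, the weights are renormalised over the remaining people for each later draw: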

testdata %>% 
  # count number of transactions per person:
  group_by(person_id) %>% 
  summarise( n_transactions = n() ) %>%
  ungroup() %>% 
  # calculate selection probability:
  mutate( probability_of_being_selected = n_transactions/sum(n_transactions) )

  person_id n_transactions probability_of_being_selected
1         1              2                          0.1 
2         2              5                          0.25
3         3              4                          0.2 
4         4              8                          0.4 
5         5              1                          0.05
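Applied to the table from the question, the same idea might look like the sketch below. The small `transaction_history` built here is a made-up stand-in for the real table, `n_prizes` and `transaction_date` are hypothetical names, and `slice_sample()` is swapped in for `sample_n()`, which the dplyr documentation marks as superseded:

```r
library(dplyr)
library(tibble)

set.seed(1)  # arbitrary seed, for repeatability only

# Made-up stand-in for the real transaction_history table
transaction_history <- tibble(
  customer_id      = sample(1:50, size = 1000, replace = TRUE),
  transaction_date = as.Date("2020-01-01") +
    sample(0:180, size = 1000, replace = TRUE)
)

n_prizes <- 5  # hypothetical number of prizes

winners <- transaction_history %>%
  # collapse to one row per customer, with the transaction count
  count(customer_id, name = "n_transactions") %>%
  # weighted draw without replacement, so no customer wins twice
  slice_sample(n = n_prizes, weight_by = n_transactions)

winners
```

Collapsing to one row per customer before sampling is what guarantees unique winners, and the count column carries the information about how often each customer transacted.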

Source: the Stack Overflow question "Randomly sample N unique customer IDs while ensuring that more transactions mean a higher chance of being selected": https://stackoverflow.com/questions/62663965/
