r - Create a random subsample by ID and by the distribution of a factor in R

Tags: r random merge subset sampling

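I am using R and have the following dataset. It contains sentences taken from books, together with the book ID, the cover colour (Colour), and a sentence ID that belongs to the corresponding book.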

My dataset
    Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here
    .

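I want to draw four random subsamples (each containing 25% of the original data) under the following conditions:
1) The distribution of book colours should be the same as in the original dataset. If 10% of the books are blue, this should also be reflected in the subsamples.
2) The subsamples should not be split by row (that is, by sentence ID) but by "Book ID". This means that if book ID 4 is sampled, all of its sentences (7, 8, 9, 10, 11) should be in that subsample.
3) In addition, each book ID should appear in only one of the 4 subsamples. If I merge all 4 subsamples, I want to end up with the original dataset again.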

What is the best way to split my dataset as described above?

Best answer

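This should work. Books are grouped by colour, and each book in a group is then given a sample number drawn without replacement from a pool that repeats 1:4. The pool's length is the first multiple of 4 that is at least the number of books in the group, which keeps the split even. Finally, the data frame is split by sample number.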
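The pool construction described above can be looked at in isolation. The following is a minimal base-R sketch, not part of the original answer: `make_pool` and its argument `k` are hypothetical names introduced here for illustration.

```r
# Hypothetical helper (not in the original answer): builds the pool of
# sample labels for a colour group that contains n_books books.
make_pool <- function(n_books, k = 4) {
  # Round n_books up to the next multiple of k, then repeat 1:k to that length.
  pool_len <- (n_books + k - 1) %/% k * k
  rep_len(seq_len(k), pool_len)
}

make_pool(5)  # 1 2 3 4 1 2 3 4  (5 books -> pool of length 8)
make_pool(4)  # 1 2 3 4
make_pool(1)  # 1 2 3 4

# Draw one label per book without replacement, as the answer does.
set.seed(1)  # only so that this illustration is repeatable
labels <- sample(make_pool(5), 5, replace = FALSE)
table(factor(labels, levels = 1:4))
```

Because each label occurs in the pool exactly `pool_len / k` times, no sample number can be handed out more than `ceiling(n_books / k)` times within a colour group. The split is therefore bounded rather than perfectly balanced: with 5 books, one label may still go unused.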

library(readr)
library(dplyr)
library(tidyr)

books <- read_delim(
'Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here', 
'|', trim_ws = TRUE)

books %>%
  # sampling is done on a book ID level. We group by book 
  # and nest the sentences, to get only one row per book.
  group_by(`Book ID`) %>% 
  nest(book_data = c(`sentence ID`, Sentences)) %>% 

  # We want to split colours evenly. We therefore draw a sample number from 1:4
  # for each book within a colour group. To ensure an even split, we draw
  # without replacement from a vector that repeats 1:4 until its length is
  # the first multiple of 4 that is >= the number of books in the group.
  group_by(Colour) %>%
  mutate(sample = sample(rep_len(1:4, (n() + 3) %/% 4 * 4), n(), replace = FALSE)) %>% 

  # Unnest the sentences again.
  unnest(book_data) %>% 

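  # Split the data frame into a list of data frames by sample number.
  split(.$sample)

The result is a list with one tibble per sample number (the draw is random, so the assignment differs between runs):
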
$`1`
# A tibble: 4 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         5 Blue              12 Text goes here      1
2         5 Blue              13 Text goes here      1
3         6 Red               14 Text goes here      1
4         6 Red               15 Text goes here      1

$`2`
# A tibble: 2 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         2 Red                4 Text goes here      2
2         2 Red                5 Text goes here      2

$`3`
# A tibble: 1 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         3 Green              6 Text goes here      3

$`4`
# A tibble: 8 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         1 Blue               1 Text goes here      4
2         1 Blue               2 Text goes here      4
3         1 Blue               3 Text goes here      4
4         4 Orange             7 Text goes here      4
5         4 Orange             8 Text goes here      4
6         4 Orange             9 Text goes here      4
7         4 Orange            10 Text goes here      4
8         4 Orange            11 Text goes here      4
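
As a sanity check on conditions 2 and 3 from the question, one could verify that every book lands in exactly one subsample and that recombining the pieces restores all sentences. The sketch below is a base-R reimplementation under stated assumptions, not the answer's dplyr pipeline: the column names `book_id`, `sentence_id`, and `colour` and the helper `draw_labels` are hypothetical, chosen here for brevity.

```r
# Toy data mirroring the question's table (one row per sentence).
books <- data.frame(
  book_id     = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6),
  sentence_id = 1:15,
  colour      = c("Blue", "Blue", "Blue", "Red", "Red", "Green",
                  "Orange", "Orange", "Orange", "Orange", "Orange",
                  "Blue", "Blue", "Red", "Red")
)

# Hypothetical helper: draw n labels without replacement from a pool that
# repeats 1:k up to the next multiple of k.
draw_labels <- function(n, k = 4) {
  pool <- rep_len(seq_len(k), (n + k - 1) %/% k * k)
  sample(pool, n, replace = FALSE)
}

# One row per book, then one label per book within each colour group.
per_book <- unique(books[, c("book_id", "colour")])
per_book$sample <- ave(per_book$book_id, per_book$colour,
                       FUN = function(x) draw_labels(length(x)))

# Attach the book-level label to every sentence and split.
assigned <- merge(books, per_book[, c("book_id", "sample")], by = "book_id")
subsamples <- split(assigned, assigned$sample)

# Conditions 2 and 3: each book's sentences sit in exactly one subsample.
samples_per_book <- tapply(assigned$sample, assigned$book_id,
                           function(s) length(unique(s)))
stopifnot(all(samples_per_book == 1))

# Recombining the pieces gives back every sentence exactly once.
recombined <- do.call(rbind, subsamples)
stopifnot(nrow(recombined) == nrow(books),
          setequal(recombined$sentence_id, books$sentence_id))

# Condition 1: inspect the colour mix of books per subsample.
table(per_book$sample, per_book$colour)
```

With only six books, the colour proportions in each subsample cannot match the original closely. The final table is more informative on a dataset with many books per colour.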

Regarding "r - Create a random subsample by ID and by the distribution of a factor in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62387239/
