r - 通过 ID 和 R 中的某个因子分布创建一个随机子样本

我正在使用 R 并拥有以下数据集，其中包含从书中取出的句子，并包含有关书籍 ID、封面颜色(颜色)以及与相应书籍匹配的句子 ID 的数据。

My dataset
    Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here
    .

我想在以下条件下抽取四个随机子样本(每个包含原始数据的 25%):
1)书籍颜色的分布应与原始数据集中的相同。如果有 10% 的蓝皮书，这也应该反射(reflect)在子样本中
2)子样本应该不按行数进行/拆分 (这是句子ID)但通过“书号” .这意味着如果对图书 ID 4 进行采样，则所有句子 7、8、9、10、11 都应在示例数据集中。
3) 此外，每个图书 ID 应该只在 4 个子样本之一中 - 这意味着如果我决定合并所有 4 个子样本，我想再次以原始数据集结束。

以上述方式拆分我的数据集的最佳解决方案是什么？

最佳答案

这应该有效。书籍按颜色分组，然后从长度为 4 的下一个倍数的池中抽取一个 1:4 的数字，以确保平均分配。然后按样本编号拆分数据帧。

library(readr)
library(dplyr)
library(tidyr)

books <- read_delim(
'Book ID| sentence ID| Colour      | Sentences
    1      | 1          | Blue        | Text goes here
    1      | 2          | Blue        | Text goes here
    1      | 3          | Blue        | Text goes here
    2      | 4          | Red         | Text goes here
    2      | 5          | Red         | Text goes here
    3      | 6          | Green       | Text goes here
    4      | 7          | Orange      | Text goes here
    4      | 8          | Orange      | Text goes here
    4      | 9          | Orange      | Text goes here
    4      | 10         | Orange      | Text goes here
    4      | 11         | Orange      | Text goes here
    5      | 12         | Blue        | Text goes here
    5      | 13         | Blue        | Text goes here
    6      | 14         | Red         | Text goes here
    6      | 15         | Red         | Text goes here', 
'|', trim_ws = TRUE)

books %>%
  # sampling is done on a book ID level. We group by book 
  # and nest the sentences, to get only one row per book.
  group_by(`Book ID`) %>% 
  nest(book_data = c(`sentence ID`, Sentences)) %>% 

  # We want to split colours evenly. We therefore draw a sample number from 1:4
  # for each group of colours. To ensure an even split, we draw from a 
  # vector that is a repeat of 1:4 until it has a lenght, that is the 
  # first multiple of 4, that is >= the number of colours in a group.
  group_by(Colour) %>%
  mutate(sample = sample(rep_len(1:4, (n() + 3) %/% 4 * 4 ), n(), replace = F)) %>% 

  # Unnest the sentences again.
  unnest(book_data) %>% 

  # Split the data frame into lists by the sample number.
  split(.$sample)

$`1`
# A tibble: 4 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         5 Blue              12 Text goes here      1
2         5 Blue              13 Text goes here      1
3         6 Red               14 Text goes here      1
4         6 Red               15 Text goes here      1

$`2`
# A tibble: 2 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         2 Red                4 Text goes here      2
2         2 Red                5 Text goes here      2

$`3`
# A tibble: 1 x 5
# Groups:   Colour [1]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         3 Green              6 Text goes here      3

$`4`
# A tibble: 8 x 5
# Groups:   Colour [2]
  `Book ID` Colour `sentence ID` Sentences      sample
      <dbl> <chr>          <dbl> <chr>           <int>
1         1 Blue               1 Text goes here      4
2         1 Blue               2 Text goes here      4
3         1 Blue               3 Text goes here      4
4         4 Orange             7 Text goes here      4
5         4 Orange             8 Text goes here      4
6         4 Orange             9 Text goes here      4
7         4 Orange            10 Text goes here      4
8         4 Orange            11 Text goes here      4

关于r - 通过 ID 和 R 中的某个因子分布创建一个随机子样本，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/62387239/

r - 通过 ID 和 R 中的某个因子分布创建一个随机子样本

上一篇：python - 如何在 fastapi 中使用刷新 token ？

下一篇：python - 如何在sklearn高斯过程回归使用的优化函数中更改max_iter？