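I'm using R and have the following dataset, which consists of sentences taken from books. For each sentence it records the book ID, the cover colour (Colour), and a sentence ID that ties the sentence to its book.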
My dataset
Book ID| sentence ID| Colour | Sentences
1 | 1 | Blue | Text goes here
1 | 2 | Blue | Text goes here
1 | 3 | Blue | Text goes here
2 | 4 | Red | Text goes here
2 | 5 | Red | Text goes here
3 | 6 | Green | Text goes here
4 | 7 | Orange | Text goes here
4 | 8 | Orange | Text goes here
4 | 9 | Orange | Text goes here
4 | 10 | Orange | Text goes here
4 | 11 | Orange | Text goes here
5 | 12 | Blue | Text goes here
5 | 13 | Blue | Text goes here
6 | 14 | Red | Text goes here
6 | 15 | Red | Text goes here
.
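I would like to draw four random subsamples (each containing 25% of the original data) under the following conditions:
1) The distribution of book colours should be the same as in the original dataset. If 10% of the books are blue, this should also be reflected in each subsample.
2) The subsamples should not be drawn/split by row (i.e. by sentence ID) but by Book ID. This means that if Book ID 4 is sampled, all of its sentences 7, 8, 9, 10, 11 should end up in the same subsample.
3) In addition, each Book ID should appear in only one of the 4 subsamples, so that if I merge all 4 subsamples I end up with the original dataset again.
What is the best way to split my dataset in the manner described above?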
Best answer
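This should work. Books are grouped by colour, and each book is then assigned a number from 1 to 4, drawn without replacement from a pool built by repeating 1:4 up to the first multiple of 4 that is at least the number of books of that colour. This keeps the split roughly even: no sample can receive more than ceiling(n / 4) of the n books of a given colour. The data frame is then split by sample number.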
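To make the pool idea concrete, here is a minimal base R sketch of the label-drawing step on its own. The helper name `assign_samples` and its parameter `k` are hypothetical and not part of the answer; the sketch only illustrates how drawing from a padded pool caps each sample's share.

```r
# Hypothetical helper (an assumption for illustration, not from the answer):
# draw one sample label in 1:k for each of n books of a single colour.
assign_samples <- function(n, k = 4) {
  # Repeat 1:k up to the first multiple of k that is >= n,
  # so every label occurs exactly ceiling(n / k) times in the pool.
  pool <- rep_len(seq_len(k), ((n + k - 1) %/% k) * k)
  # Draw n labels without replacement; indexing via sample.int avoids
  # the special behaviour of sample() on a length-one vector.
  pool[sample.int(length(pool), n)]
}

set.seed(1)  # only to make this sketch reproducible
labels <- assign_samples(10)

# Count how many of the 10 books landed in each sample.
table(factor(labels, levels = 1:4))
```

Because each label occurs only ceiling(n / 4) times in the pool, no sample can exceed that count, but a sample can still receive fewer books (or none) when n is small. Shuffling `rep_len(1:4, n)` instead would give counts that differ by at most one, at the cost of always handing the extra books to the lowest-numbered samples.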
library(readr)
library(dplyr)
library(tidyr)
books <- read_delim(
'Book ID| sentence ID| Colour | Sentences
1 | 1 | Blue | Text goes here
1 | 2 | Blue | Text goes here
1 | 3 | Blue | Text goes here
2 | 4 | Red | Text goes here
2 | 5 | Red | Text goes here
3 | 6 | Green | Text goes here
4 | 7 | Orange | Text goes here
4 | 8 | Orange | Text goes here
4 | 9 | Orange | Text goes here
4 | 10 | Orange | Text goes here
4 | 11 | Orange | Text goes here
5 | 12 | Blue | Text goes here
5 | 13 | Blue | Text goes here
6 | 14 | Red | Text goes here
6 | 15 | Red | Text goes here',
'|', trim_ws = TRUE)
books %>%
# sampling is done on a book ID level. We group by book
# and nest the sentences, to get only one row per book.
group_by(`Book ID`) %>%
nest(book_data = c(`sentence ID`, Sentences)) %>%
# We want to split colours evenly. We therefore draw a sample number from 1:4
# for each book within a colour group. To keep the split even, we draw from a
# vector that repeats 1:4 until its length is the first multiple of 4
# that is >= the number of books in the group.
group_by(Colour) %>%
mutate(sample = sample(rep_len(1:4, (n() + 3) %/% 4 * 4), n(), replace = FALSE)) %>%
# Unnest the sentences again.
unnest(book_data) %>%
# Split the data frame into lists by the sample number.
split(.$sample)
$`1`
# A tibble: 4 x 5
# Groups: Colour [2]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 5 Blue 12 Text goes here 1
2 5 Blue 13 Text goes here 1
3 6 Red 14 Text goes here 1
4 6 Red 15 Text goes here 1
$`2`
# A tibble: 2 x 5
# Groups: Colour [1]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 2 Red 4 Text goes here 2
2 2 Red 5 Text goes here 2
$`3`
# A tibble: 1 x 5
# Groups: Colour [1]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 3 Green 6 Text goes here 3
$`4`
# A tibble: 8 x 5
# Groups: Colour [2]
`Book ID` Colour `sentence ID` Sentences sample
<dbl> <chr> <dbl> <chr> <int>
1 1 Blue 1 Text goes here 4
2 1 Blue 2 Text goes here 4
3 1 Blue 3 Text goes here 4
4 4 Orange 7 Text goes here 4
5 4 Orange 8 Text goes here 4
6 4 Orange 9 Text goes here 4
7 4 Orange 10 Text goes here 4
8 4 Orange 11 Text goes here 4
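One way to check a split like this against the three conditions in the question is sketched below. It is a self-contained base R approximation rather than the answer's dplyr pipeline, and the column names `book_id`, `sentence_id` and `colour` are assumptions chosen for the sketch.

```r
# Toy data mirroring the question (column names are assumptions).
books <- data.frame(
  book_id     = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 6, 6),
  sentence_id = 1:15,
  colour      = c(rep("Blue", 3), rep("Red", 2), "Green", rep("Orange", 5),
                  rep("Blue", 2), rep("Red", 2)),
  stringsAsFactors = FALSE
)

set.seed(123)  # only to make this sketch reproducible

# One row per book, then one sample label per book, drawn within each colour
# from a pool of 1:4 padded to the next multiple of 4.
per_book <- unique(books[, c("book_id", "colour")])
per_book$sample <- ave(
  seq_len(nrow(per_book)), per_book$colour,
  FUN = function(i) {
    pool <- rep_len(1:4, ((length(i) + 3) %/% 4) * 4)
    pool[sample.int(length(pool), length(i))]
  }
)

# Attach the label to every sentence of the book and split by label.
labelled   <- merge(books, per_book, by = c("book_id", "colour"))
subsamples <- split(labelled, labelled$sample)

# Conditions 2 and 3: every book sits in exactly one subsample.
samples_per_book <- tapply(labelled$sample, labelled$book_id,
                           function(s) length(unique(s)))
stopifnot(all(samples_per_book == 1))

# Condition 3: recombining the subsamples recovers every sentence once.
recombined <- do.call(rbind, subsamples)
stopifnot(nrow(recombined) == nrow(books),
          setequal(recombined$sentence_id, books$sentence_id))

# Condition 1: inspect the number of books per colour in each subsample.
table(per_book$colour, per_book$sample)
```

With only one or two books per colour, as in this toy data, the colour proportions cannot match across four subsamples, and some subsamples may be missing entirely. The split is also balanced by number of books, not by number of sentences, so subsamples only approach 25% of the rows when books have similar lengths.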
Regarding r - creating a random subsample by ID and by the distribution of a factor in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62387239/