r - How does the createDataPartition function from the caret package split data?

From the documentation:

For bootstrap samples, simple random sampling is used.

For other data splitting, the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits.

For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups.

For createDataPartition, the number of percentiles is set via the groups argument.

I don't understand why this "balancing" is needed. I think I understand it superficially, but any additional insight would be really helpful.

Best answer

It means that if you have a dataset ds with 10000 rows

set.seed(42)
ds <- data.frame(values = runif(10000))

with 2 "classes" that are unevenly distributed (9000 vs. 1000)

ds$class <- c(rep(1, 9000), rep(2, 1000))
ds$class <- as.factor(ds$class)
table(ds$class)
#    1    2 
# 9000 1000

you can create a sample that tries to maintain the ratio/"balance" of the factor classes.

library(caret)
dpart <- createDataPartition(ds$class, p = 0.1, list = FALSE)
dsDP <- ds[dpart, ]
table(dsDP$class)
#   1   2 
# 900 100
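To see why sampling within the class levels matters, it can help to compare it with plain simple random sampling over repeated draws. The following is a minimal base-R sketch of the idea, not caret's actual implementation: `stratified_idx` is a hypothetical helper written for this illustration, and the 1% sample size and 200 repetitions are arbitrary choices.

```r
# Illustration only: base R, not caret's internal code.
set.seed(42)
ds <- data.frame(values = runif(10000),
                 class  = factor(c(rep(1, 9000), rep(2, 1000))))

# Hypothetical helper: draw a fraction p of the rows separately
# within each level of y, then combine the row indices.
stratified_idx <- function(y, p) {
  rows_by_class <- split(seq_along(y), y)
  unlist(lapply(rows_by_class, function(rows) {
    n_take <- round(length(rows) * p)
    rows[sample.int(length(rows), n_take)]
  }), use.names = FALSE)
}

# Count the minority-class rows in repeated 1% samples drawn each way.
n_rep <- 200
simple_counts <- replicate(
  n_rep, sum(ds$class[sample(nrow(ds), 100)] == "2"))
strat_counts <- replicate(
  n_rep, sum(ds$class[stratified_idx(ds$class, 0.01)] == "2"))

range(simple_counts)  # changes from draw to draw
range(strat_counts)   # the same count in every draw
```

With simple random sampling the number of class 2 rows fluctuates around the expected 10 per sample, so a given split can end up noticeably over- or under-representing the rare class. Sampling within each level fixes the count, so every split has the same class ratio as the full data.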

A similar question about "r - How does the createDataPartition function from the caret package split data?" can be found on Stack Overflow: https://stackoverflow.com/questions/40709722/
