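R: get quantiles and mean in a customised subset of a dataset

Tags: r data.table mean quantile subsampling

I would like to get quantiles within a customised subset. For example, in the following dataset: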

library(data.table)
data = data.table(x=c(rep(1,9),rep(2,9)),y=c(rep(1:6,each=3)),z=1:18)

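For each row i, I would like to know the 50th percentile of z (and, in further calculations, other quantiles such as the 10th and 5th percentiles) over the rows with x = x[i] and y <= y[i].

The expected output is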

c(2,2,2,3.5,3.5,3.5,5,5,5,11,11,11,12.5,12.5,12.5,14,14,14)
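To pin down the definition, here is a minimal row-wise sketch of the calculation. It assumes the subset for row i is "same x, and y no greater than y[i]" (the condition used in the answer's join), and the name `row_median` is only illustrative. It is far too slow for tens of millions of rows, but it works as a reference on small data.

```r
library(data.table)

data <- data.table(x = c(rep(1, 9), rep(2, 9)),
                   y = rep(1:6, each = 3),
                   z = 1:18)

# For each row i, take z over the rows with the same x and y <= y[i],
# then compute the median of that subset.
row_median <- vapply(seq_len(nrow(data)), function(i) {
  in_subset <- data$x == data$x[i] & data$y <= data$y[i]
  median(data$z[in_subset])
}, numeric(1))
```

Replacing `median(...)` with `quantile(..., 0.1, names = FALSE)` or `mean(...)` gives the other statistics under the same subset rule.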

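For each row i, I would also like to know the mean of z over the rows with x = x[i] and y <= y[i].

The expected output is (the same as the first one in this dataset, but it will differ in other datasets):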

c(2,2,2,3.5,3.5,3.5,5,5,5,11,11,11,12.5,12.5,12.5,14,14,14)

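I could write a function for this and loop it over every row with apply. However, the dataset has more than 30,000,000 rows, so that would take days. Is there a faster way to compute this in R with data.table, the tidyverse, or another package?

Best answer

Use a non-equi join in data.table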

data[data, quantile(z, 0.5), on = .(x = x, y <= y), by = .EACHI]$V1
#[1]  2.0  2.0  2.0  3.5  3.5  3.5  5.0  5.0  5.0 11.0 11.0 11.0 12.5 12.5 12.5 14.0 14.0 14.0
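The question also asks for other quantiles and the mean. As a sketch, the same non-equi join can return several statistics at once by putting a list in `j`. The column names `q05`, `q10`, `q50` and `z_mean` are illustrative choices, not part of the original answer.

```r
library(data.table)

data <- data.table(x = c(rep(1, 9), rep(2, 9)),
                   y = rep(1:6, each = 3),
                   z = 1:18)

# One result row per row of the inner `data`; inside j, z refers to the
# matching rows of the outer `data` (same x, y <= the current row's y).
res <- data[data,
            .(q05    = quantile(z, 0.05, names = FALSE),
              q10    = quantile(z, 0.10, names = FALSE),
              q50    = quantile(z, 0.50, names = FALSE),
              z_mean = mean(z)),
            on = .(x = x, y <= y),
            by = .EACHI]
```

`names = FALSE` only drops the "5%"-style names that `quantile()` attaches; the values are unchanged.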

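If we want to create a column

# Note: the column is named z_mean but holds the 0.5 quantile; use mean(z) in place of quantile(z, 0.5) for the mean.
data[data[unique(data[, .(x, y)]), quantile(z, 0.5), on = .(x = x, y <= y), by = .EACHI], z_mean := V1, on = .(x, y)]

-output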

> data
        x     y     z z_mean
    <num> <int> <int>  <num>
 1:     1     1     1    2.0
 2:     1     1     2    2.0
 3:     1     1     3    2.0
 4:     1     2     4    3.5
 5:     1     2     5    3.5
 6:     1     2     6    3.5
 7:     1     3     7    5.0
 8:     1     3     8    5.0
 9:     1     3     9    5.0
10:     2     4    10   11.0
11:     2     4    11   11.0
12:     2     4    12   11.0
13:     2     5    13   12.5
14:     2     5    14   12.5
15:     2     5    15   12.5
16:     2     6    16   14.0
17:     2     6    17   14.0
18:     2     6    18   14.0
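For the mean specifically, a join may not be needed at all, because the subset "same x, y <= y[i]" grows cumulatively with y. The following is a sketch of an alternative technique (not the original answer's method): aggregate once per (x, y), take running totals within x, and join the result back. The names `agg`, `s`, `n` and `cum_mean` are illustrative.

```r
library(data.table)

data <- data.table(x = c(rep(1, 9), rep(2, 9)),
                   y = rep(1:6, each = 3),
                   z = 1:18)

# Per-(x, y) sum and count; keyby sorts by x, then y.
# as.numeric() guards against integer overflow on large data.
agg <- data[, .(s = sum(as.numeric(z)), n = .N), keyby = .(x, y)]

# Running totals within each x give the sum and count over all y' <= y.
agg[, cum_mean := cumsum(s) / cumsum(n), by = x]

# Attach the result to every original row with an equi-join on (x, y).
data[agg, cum_mean := i.cum_mean, on = .(x, y)]
```

This shortcut does not carry over to quantiles, which cannot be built from running sums, so the non-equi join remains the approach for those.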

Source: the corresponding question on Stack Overflow: https://stackoverflow.com/questions/72354336/
