I have the following vector called x:
x <- c(1, 1, 4, 5, 4, 6, 1, 1)
x
#> [1] 1 1 4 5 4 6 1 1
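I would like to count all duplicated values. In this case the numbers 1, 1, 1, 1, 4, 4
are duplicated, which means there are 6 duplicated values in total. Here are some attempts: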
x <- c(1, 1, 4, 5, 4, 6, 1, 1)
# Wrong outputs
sum(duplicated(x))
#> [1] 4
sum(table(x)-1)
#> [1] 4
# Returns the number of distinct values that are duplicated (here 2: the values 1 and 4)
nrow(data.frame(table(x))[data.frame(table(x))$Freq > 1,])
#> [1] 2
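Created on 2022-12-08 with reprex v2.0.2
So I was wondering if anyone knows how to count every duplicated value, rather than the number of distinct values that have duplicates?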
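For context, the first two attempts return 4 because duplicated() flags only the second and later occurrences of a value, so the first occurrence of each repeated value is never counted. A minimal sketch of that behaviour with the same x (the names dup_forward and dup_backward are illustrative only, not from the original post):

```r
x <- c(1, 1, 4, 5, 4, 6, 1, 1)

# duplicated() marks only the second and later occurrences of each value
dup_forward <- duplicated(x)
dup_forward
#> [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE

# scanning from the end marks every occurrence except the last one
dup_backward <- duplicated(x, fromLast = TRUE)
dup_backward
#> [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE

# an element belongs to a repeated value if either scan flags it
sum(dup_forward | dup_backward)
#> [1] 6

# table(x) - 1 likewise drops one occurrence per distinct value
as.vector(table(x))
#> [1] 4 2 1 1
```

Either scan alone misses one occurrence per repeated value (4 of the 6 here), which is why combining the two directions gives the count the question asks for.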
Best answer
Other options:
sum(Filter(\(z) z > 1, table(x)))
# note: setdiff() returns unique values, so this undercounts when two values share the same count
sum(setdiff(table(x), 1L))
sum(x %in% x[duplicated(x)])
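To make the third option easier to follow, here is a rough step-by-step sketch. It also shows a case where the setdiff() variant gives a different answer, because setdiff() de-duplicates its result. The vector y and the name dup_values are made up for illustration and are not part of the original answer:

```r
x <- c(1, 1, 4, 5, 4, 6, 1, 1)

# step 1: the distinct values that occur more than once
dup_values <- unique(x[duplicated(x)])
dup_values
#> [1] 1 4

# step 2: count every element whose value is in that set
sum(x %in% dup_values)
#> [1] 6

# hypothetical second vector in which two values (1 and 2) both occur twice
y <- c(1, 1, 2, 2, 3)

# the counts are 2, 2, 1; setdiff() keeps a single 2, so the sum is too small
sum(setdiff(table(y), 1L))
#> [1] 2

# the %in% approach still counts all four repeated elements
sum(y %in% y[duplicated(y)])
#> [1] 4
```

On the original x the counts of the repeated values (4 and 2) happen to differ, which is why all three options agree there.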
The last one is clearly the fastest, with akrun's a close second:
bench::mark(
  sum(Filter(\(z) z > 1, table(x))),
  sum(setdiff(table(x), 1L)),
  sum(x %in% x[duplicated(x)]),
  sum(table(x)[names(table(x)) %in% x[duplicated(x)]]),
  sum(duplicated(x) | duplicated(x, fromLast = TRUE))
)
# # A tibble: 5 x 13
#   expression                                                min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory              time                gc
#   <bch:expr>                                           <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>              <list>              <list>
# 1 sum(Filter(function(z) z > 1, table(x)))                 58us   67.5us    14335.    5.35KB     6.62  6499     3    453.4ms <int [1]> <Rprofmem [16 x 3]> <bench_tm [6,502]>  <tibble [6,502 x 3]>
# 2 sum(setdiff(table(x), 1L))                             51.6us   60.9us    16046.        0B     6.56  7338     3    457.3ms <int [1]> <Rprofmem [0 x 3]>  <bench_tm [7,341]>  <tibble [7,341 x 3]>
# 3 sum(x %in% x[duplicated(x)])                            2.8us    3.2us   294065.        0B        0 10000     0       34ms <int [1]> <Rprofmem [0 x 3]>  <bench_tm [10,000]> <tibble [10,000 x 3]>
# 4 sum(table(x)[names(table(x)) %in% x[duplicated(x)]])  102.1us  123.4us     7957.        0B     4.26  3737     2    469.6ms <int [1]> <Rprofmem [0 x 3]>  <bench_tm [3,739]>  <tibble [3,739 x 3]>
# 5 sum(duplicated(x) | duplicated(x, fromLast = TRUE))     4.3us    4.9us   194347.        0B     19.4  9999     1     51.4ms <int [1]> <Rprofmem [0 x 3]>  <bench_tm [10,000]> <tibble [10,000 x 3]>
(Disclaimer: benchmarking code on data this small is really futile ... but I was curious.)
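If timings on a more realistic input are of interest, something along these lines could be tried. This is only a sketch under assumptions: the vector big_x, its size, the seed, and the wrapper names count_in, count_both and count_tab are all invented here, and base system.time() stands in for bench::mark() so that no extra package is needed. No timings are shown because they depend on the machine.

```r
set.seed(1)                                # arbitrary seed, for reproducibility only
big_x <- sample(1e5, 1e6, replace = TRUE)  # hypothetical input: 1e6 draws from 1..1e5

# wrappers around three of the expressions benchmarked above
count_in   <- function(v) sum(v %in% v[duplicated(v)])
count_both <- function(v) sum(duplicated(v) | duplicated(v, fromLast = TRUE))
count_tab  <- function(v) sum(Filter(function(z) z > 1, table(v)))

# check that the approaches agree before comparing their speed
stopifnot(
  count_in(big_x) == count_both(big_x),
  count_in(big_x) == count_tab(big_x)
)

# elapsed time of a single run each; repeat or use bench::mark() for stable numbers
system.time(count_in(big_x))
system.time(count_both(big_x))
system.time(count_tab(big_x))
```

The setdiff() variant is left out on purpose, since it can return a different result once several values share the same count.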
Regarding "r - Count all duplicated values in R", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/74733926/