r - R 使用什么算法来搜索数据帧？

native R 中的 which() 方法使用什么算法来搜索数据帧？例如，如果我打电话

df[which(df$parameter == 3)]

R 使用什么算法来搜索这个数据框？二分查找？线性搜索？我在任何地方都找不到这方面的文档。如果它不使用上述 2 中的任何一个，它使用什么算法以及它的时间复杂度是多少？

最佳答案

如果你对这个问题感兴趣，可以看看the datatable vignette on the subject

默认情况下，R使用线性搜索。它扫描整个向量。这是data.table的一大特色。启用 binary search带索引表。在 data.table ，索引数据集的元素将在内存中物理排序，因此在其中搜索值非常快。 data.table中的二级指数(参见 here )将启用二进制搜索，但数据不会在连续的内存插槽中物理重新排序。

也许是 data.table小插图更明确:

It can be seen that with every search we reduce the number of searches by half. This is why binary search based subsets are incredibly fast. Since rows of each column of data.tables have contiguous locations in memory, the operations are performed in a very cache efficient manner (also contributes to speed).

In addition, since we obtain the matching row indices directly without having to create those huge logical vectors (equal to the number of rows in a data.table), it is quite memory efficient as well.

例子

让我们看一个例子。包裹microbenchmark用于比较。

我们将比较:

(线性)搜索 data.frame

过滤器 dplyr

(线性)搜索 data.table

(二进制)搜索 data.table带主键(使用 setkey)

(二进制)搜索 data.table带有辅助键(使用 setindex)

让我们创建所需的每个部分:

library(data.table)
library(dplyr)

df <- data.frame(
  x = rnorm(1e6),
  y = rnorm(1e6)
)
dt <- as.data.table(df) 

dt2 <- copy(dt)
dt3 <- copy(dt)

我们将在 dt2 上设置主键(二进制搜索 + 对象在内存中重新排序)和辅助键 dt3 (二分搜索但在内存中没有重新排序)

setkey(dt2, x)
setindex(dt3, x)

微基准:

m <- microbenchmark::microbenchmark(
  df[df$x<0,],
  df %>% dplyr::filter(x<0),
  dt[x<0],
  dt2[x<0],
  dt3[x<0],
  times = 20L
)

m 

Unit: milliseconds
                        expr      min       lq     mean   median       uq       max neval
              df[df$x < 0, ] 56.56557 57.54392 66.97838 61.39609 75.30391  91.42418    20
 df %>% dplyr::filter(x < 0) 23.24242 24.15183 34.64290 26.02232 34.91405 143.10476    20
                   dt[x < 0] 18.32496 18.96585 21.35255 20.25604 23.02666  33.25656    20
                  dt2[x < 0] 10.85129 10.94804 11.92941 11.21601 11.80469  18.29040    20
                  dt3[x < 0] 18.37789 18.47568 19.51928 18.76135 19.39782  26.90826    20

到目前为止，基本 R 方法是最慢的。在这个例子中，使用二级索引并没有提高太多的性能。但是主键可以!

ggplot2::autoplot(m)

关于r - R 使用什么算法来搜索数据帧？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/61201343/

r - R 使用什么算法来搜索数据帧？

上一篇：python - numpy.astype(np.uint8) 如何转换 float 组？ -1.2997805 变成 255

下一篇：javascript - 数组的 typescript 泛型