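I have noticed that in many cases the body of a loop in Rcpp (or plain C) could "easily" be unswitched, yet even with -O3 optimization, unswitching it by hand gives a significant performance benefit. Take the simple example below, where I see a 12% difference after unswitching.
Is this something specific to Rcpp or R, or are my assumptions about my code or about the compiler's optimizations incorrect?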
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int which_first_equal_A(IntegerVector x, int a, bool not_equal = false) {
  int n = x.length();
  for (int i = 0; i < n; ++i) {
    if (not_equal) {
      if (x[i] != a) {
        return i + 1;
      }
    } else {
      if (x[i] == a) {
        return i + 1;
      }
    }
  }
  return 0;
}

// [[Rcpp::export]]
int which_first_equal_B(IntegerVector x, int a, bool not_equal = false) {
  int n = x.length();
  if (not_equal) {
    for (int i = 0; i < n; ++i) {
      if (x[i] != a) {
        return i + 1;
      }
    }
  } else {
    for (int i = 0; i < n; ++i) {
      if (x[i] == a) {
        return i + 1;
      }
    }
  }
  return 0;
}

/***R
x <- integer(1e9)
bench::mark(which_first_equal_A(x, 1L),
            which_first_equal_B(x, 1L))
*/
/.R/Makevars
PKG_CXXFLAGS = -O3 -funswitch-loops
PKG_LIBS = -O3 -funswitch-loops
Rcpp::sourceCpp('~/rollers.cpp', verbose = TRUE, rebuild = TRUE)
#>
#> Generated extern "C" functions
#> --------------------------------------------------------
#>
#>
#> #include <Rcpp.h>
#> // which_first_equal_A
#> int which_first_equal_A(IntegerVector x, int a, bool not_equal);
#> RcppExport SEXP sourceCpp_1_which_first_equal_A(SEXP xSEXP, SEXP aSEXP, SEXP not_equalSEXP) {
#> BEGIN_RCPP
#> Rcpp::RObject rcpp_result_gen;
#> Rcpp::RNGScope rcpp_rngScope_gen;
#> Rcpp::traits::input_parameter< IntegerVector >::type x(xSEXP);
#> Rcpp::traits::input_parameter< int >::type a(aSEXP);
#> Rcpp::traits::input_parameter< bool >::type not_equal(not_equalSEXP);
#> rcpp_result_gen = Rcpp::wrap(which_first_equal_A(x, a, not_equal));
#> return rcpp_result_gen;
#> END_RCPP
#> }
#> // which_first_equal_B
#> int which_first_equal_B(IntegerVector x, int a, bool not_equal);
#> RcppExport SEXP sourceCpp_1_which_first_equal_B(SEXP xSEXP, SEXP aSEXP, SEXP not_equalSEXP) {
#> BEGIN_RCPP
#> Rcpp::RObject rcpp_result_gen;
#> Rcpp::RNGScope rcpp_rngScope_gen;
#> Rcpp::traits::input_parameter< IntegerVector >::type x(xSEXP);
#> Rcpp::traits::input_parameter< int >::type a(aSEXP);
#> Rcpp::traits::input_parameter< bool >::type not_equal(not_equalSEXP);
#> rcpp_result_gen = Rcpp::wrap(which_first_equal_B(x, a, not_equal));
#> return rcpp_result_gen;
#> END_RCPP
#> }
#>
#> Generated R functions
#> -------------------------------------------------------
#>
#> `.sourceCpp_1_DLLInfo` <- dyn.load('C:/Users/hughp/AppData/Local/Temp/RtmpGYnUKa/sourceCpp-x86_64-w64-mingw32-1.0.5/sourcecpp_538057654375/sourceCpp_2.dll')
#>
#> which_first_equal_A <- Rcpp:::sourceCppFunction(function(x, a, not_equal = FALSE) {}, FALSE, `.sourceCpp_1_DLLInfo`, 'sourceCpp_1_which_first_equal_A')
#> which_first_equal_B <- Rcpp:::sourceCppFunction(function(x, a, not_equal = FALSE) {}, FALSE, `.sourceCpp_1_DLLInfo`, 'sourceCpp_1_which_first_equal_B')
#>
#> rm(`.sourceCpp_1_DLLInfo`)
#>
#> Building shared library
#> --------------------------------------------------------
#>
#> DIR: C:/Users/hughp/AppData/Local/Temp/RtmpGYnUKa/sourceCpp-x86_64-w64-mingw32-1.0.5/sourcecpp_538057654375
#>
#> C:/R/R-40~1.0/bin/x64/R CMD SHLIB --preclean -o "sourceCpp_2.dll" "rollers.cpp"
#> "C:/rtools40/mingw64/bin/"g++ -std=gnu++11 -I"C:/R/R-40~1.0/include" -DNDEBUG -I"C:/R/R-4.0.0/library/Rcpp/include" -I"C:/Users/hughp/Documents" -I"C:/Users/hughp/inst/include" -O3 -funswitch-loops -O2 -Wall -mfpmath=sse -msse2 -mstackrealign -c rollers.cpp -o rollers.o
#> C:/rtools40/mingw64/bin/g++ -std=gnu++11 -shared -s -static-libgcc -o sourceCpp_6.dll tmp.def rollers.o -O3 -funswitch-loops -LC:/R/R-40~1.0/bin/x64 -lR
#>
#> > x <- integer(1e9)
#>
#> > bench::mark(which_first_equal_A(x, 1L),
#> + which_first_equal_B(x, 1L))
#> # A tibble: 2 x 14
#> expression min mean median max `itr/sec` mem_alloc n_gc
#> <chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 which_fir~ 576.724ms 576.724ms 576.724ms 576.724ms 1.73 2.492KB 0
#> 2 which_fir~ 504.358ms 504.358ms 504.358ms 504.358ms 1.98 2.492KB 0
#> # ... with 6 more variables: n_itr <int>, total_time <bch:tm>, result <list>,
#> # memory <list>, time <list>, gc <list>
Created on 2020-11-28 by the reprex package (v0.3.0)
Best answer
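I tested your code on several machines with different compilers, and I can confirm that this does happen with gcc, including versions 4.8.5 and 9.1.0.
It does not happen with Clang (Mac LLVM 10.0.0); there, A and B run at the same speed.
I also found a "fix" for the problem: do not use int as the loop index. Use the native index type R_xlen_t or size_t in this case.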
// [[Rcpp::export]]
int which_first_equal_C(IntegerVector x, const int a, const bool not_equal) {
  R_xlen_t n = x.length();
  for (R_xlen_t i = 0; i < n; ++i) {
    if (not_equal) {
      if (x[i] != a) {
        return i + 1;
      }
    } else {
      if (x[i] == a) {
        return i + 1;
      }
    }
  }
  return 0;
}
x <- integer(1e7)
microbenchmark::microbenchmark(A=which_first_equal_A(x, 1L, F),
B=which_first_equal_B(x, 1L, F),
C=which_first_equal_C(x, 1L, F), times=100)
Unit: milliseconds
expr min lq mean median uq max neval cld
A 5.651485 5.725665 6.110254 5.766312 5.891322 7.938720 100 c
B 4.819980 4.913308 5.315103 4.964663 5.460507 6.866738 100 b
C 4.560159 4.638324 5.029065 4.695320 5.233892 7.138785 100 a
x <- integer(1e9)
microbenchmark::microbenchmark(A=which_first_equal_A(x, 1L, F),
B=which_first_equal_B(x, 1L, F),
C=which_first_equal_C(x, 1L, F), times=5)
Unit: milliseconds
expr min lq mean median uq max neval cld
A 643.2995 643.3866 643.6305 643.4289 643.7391 644.2983 5 c
B 578.3032 579.0064 581.5769 579.5064 582.0040 589.0645 5 b
C 557.4450 557.4875 557.9307 557.7024 558.5055 558.5131 5 a
Honestly, I don't really understand why gcc fails to optimize the loop when the index is an int, but the results speak for themselves.
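To check whether the effect depends on Rcpp at all, the two index types can be compared in a plain C++ translation unit. The following is a minimal sketch, not the code benchmarked above: `first_equal_int` and `first_equal_wide` are hypothetical names, `std::vector<int>` replaces `IntegerVector`, and `std::ptrdiff_t` stands in for `R_xlen_t` (an assumption about how the platform defines that typedef).

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Switched loop with an int index, mirroring which_first_equal_A.
// Returns the 1-based position of the first match, or 0 if there is none.
int first_equal_int(const std::vector<int>& x, int a, bool not_equal) {
  // An int index cannot address more than INT_MAX elements.
  assert(x.size() <= static_cast<std::size_t>(INT_MAX));
  const int* p = x.data();
  const int n = static_cast<int>(x.size());
  for (int i = 0; i < n; ++i) {
    if (not_equal) {
      if (p[i] != a) return i + 1;
    } else {
      if (p[i] == a) return i + 1;
    }
  }
  return 0;
}

// Same switched loop with a pointer-sized signed index,
// mirroring which_first_equal_C (ptrdiff_t is an assumed stand-in for R_xlen_t).
std::ptrdiff_t first_equal_wide(const std::vector<int>& x, int a, bool not_equal) {
  const int* p = x.data();
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    if (not_equal) {
      if (p[i] != a) return i + 1;
    } else {
      if (p[i] == a) return i + 1;
    }
  }
  return 0;
}
```

Compiling this with the same flags as in the question and timing both functions on a large zero-filled vector would indicate whether the gap also appears without R. No timings are claimed for this sketch.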
PS: Using the const keyword where applicable does not hurt either, although it does not affect performance.
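If duplicating the loop by hand, as in `which_first_equal_B`, is undesirable, one alternative that does not rely on the optimizer is to turn the flag into a template parameter. This is a sketch of that general idiom rather than something taken from the answer above; `first_match` and `which_first` are hypothetical names, and the functions operate on a plain `std::vector<int>` instead of an `IntegerVector`. It has not been benchmarked here.

```cpp
#include <cstddef>
#include <vector>

// NotEqual is a compile-time constant, so each instantiation
// only needs one of the two comparisons inside the loop.
template <bool NotEqual>
std::ptrdiff_t first_match(const int* p, std::ptrdiff_t n, int a) {
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    const bool hit = NotEqual ? (p[i] != a) : (p[i] == a);
    if (hit) return i + 1;  // 1-based position, as in the R-facing functions
  }
  return 0;  // no match found
}

// The runtime flag is tested once, outside the loop.
std::ptrdiff_t which_first(const std::vector<int>& x, int a, bool not_equal) {
  const int* p = x.data();
  const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
  return not_equal ? first_match<true>(p, n, a)
                   : first_match<false>(p, n, a);
}
```

The loop body is written only once, while the branch on `not_equal` is resolved before either instantiation starts iterating.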
About "r - Why isn't this loop unswitched?": a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/65048668/