r - 用长度相等或更大的字符串有效地替换固定位置的子字符串

标签 r rcpp

用另一个相等或更大长度的字符串替换固定位置子字符串的有效方法是什么?

例如,下面通过首先找到“abc”的位置然后替换它来替换子字符串“abc”:

sub("abc", "123", "iabc.def", fixed = TRUE)
#[1] "i123.def"

sub("abc", "1234", "iabc.def", fixed = TRUE)
#[1] "i1234.def"

但是,我们知道子字符串“abc”总是在字符位置 2、3 和 4。在这种情况下 ,有没有办法指定这些位置,以便不需要执行字符串匹配而使用字符索引?

我确实尝试使用 substr() 但当替换字符串大于被替换的子字符串时,它并没有像我希望的那样工作:
x <- "iabc.def"
substr(x, 2, 4) <- "123"
#[1] "i123.def"

x <- "iabc.def"
substr(x, 2, 4) <- "1234"
#[1] "i123.def"

非常感谢您抽出宝贵的时间,

托尼·布雷亚尔

附言以上可能是做我想做的最有效的方法,但我想我会问,以防万一有更好的方法:)

===== 时间 =====
#                             test elapsed  relative
# 7 francois.fx_wb(x, replacement)    0.94  1.000000
# 1                           f(x)    1.56  1.659574
# 6    francois.fx(x, replacement)    2.23  2.372340
# 5                      Sobala(x)    3.89  4.138298
# 2                    Hong.Ooi(x)    4.41  4.691489
# 3                        DWin(x)    5.57  5.925532
# 4                      hadley(x)    9.47 10.074468

上面的时间是从下面的代码生成的:
library(rbenchmark)
library(stringr)
library(Rcpp)
library(inline)

f <- function(x, replacement = "1234")  sub("abc", replacement, x, fixed = TRUE)

Hong.Ooi <- function(x, replacement = "1234") paste(substr(x, 1, 1), replacement, substr(x, 5, nchar(x)), sep = "")

DWin <- function(x, replacement =  paste("\\1", "1234", sep = "")) sub("^(.)abc", replacement, x)

Sobala <- function(x, replacement =  paste("\\1", "1234", sep = ""))  sub("^(.).{3}", replacement, x, perl=TRUE)

hadley <- function(x, replacement = "1234") {
  str_sub(x, 2, 4) <- replacement
  return(x)
}

francois.fx <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '

    const char* rep =as<const char*>(rep_) ;
    CharacterVector x(x_) ;
    int nrep = strlen( rep ) ;
    int n = x.size() ; 
    CharacterVector res(n) ;

    char buffer[1024] ;

    for(int i=0; i<n; i++) {
        const char* xi = x[i] ;
        if( strncmp( xi+1, "abc", 3 ) ) {
            res[i] = x[i] ;
        } else{
            buffer[0] = xi[0] ;
            strncpy( buffer + 1, rep, nrep ) ;
            strcpy( buffer + 1 + nrep, xi + 4 ) ;
            res[i] = buffer ;
        }
    }
    return res ;
', plugin = "Rcpp" )

francois.fx_wb <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '

    const char* rep =as<const char*>(rep_) ;
    int nrep = strlen( rep ) ;
    int n=Rf_length(x_) ;
    SEXP res = PROTECT( Rf_allocVector( STRSXP, n ) ) ;

    char buffer[1024] ;

    for(int i=0; i<n; i++) {
        const char* xi = char_get_string_elt(x_, i) ;
        if( strncmp( xi+1, "abc", 3 ) ) {
            set_string_elt( res, i, get_string_elt(x_,i)) ;
        } else{
            buffer[0] = xi[0] ;
            strncpy( buffer + 1, rep, nrep ) ;
            strcpy( buffer + 1 + nrep, xi + 4 ) ;
            char_set_string_elt(res, i, buffer ) ;
        }
    }
    UNPROTECT(1) ;
    return res ;
', plugin = "Rcpp" )


x <- rep("iabc.def", 1e6)
replacement <- "1234"
benchmark(f(x), Hong.Ooi(x), DWin(x), hadley(x), Sobala(x), francois.fx(x, replacement), francois.fx_wb(x, replacement),
          columns = c("test", "elapsed", "relative"),
          order = "relative",
          replications = 10)

最佳答案

这是一种基于 Rcpp 的解决方案。

fx <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '

    const char* rep =as<const char*>(rep_) ;
    CharacterVector x(x_) ;
    int nrep = strlen( rep ) ;
    int n = x.size() ; 
    CharacterVector res(n) ;

    char buffer[1024] ;

    for(int i=0; i<n; i++) {
        const char* xi = x[i] ;
        if( strncmp( xi+1, "abc", 3 ) ) {
            res[i] = x[i] ;
        } else{
            buffer[0] = xi[0] ;
            strncpy( buffer + 1, rep, nrep ) ;
            strcpy( buffer + 1 + nrep, xi + 4 ) ;
            res[i] = buffer ;
        }
    }
    return res ;
', plugin = "Rcpp" )

它对简单的子解决方案没有太大改进,因为对 R 中的字符串的写访问受到写屏障的保护。如果我在写障碍上作弊,我会得到更好的结果,但我并不完全意识到后果,所以我可能应该建议不要这样做:/
fx_wb <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '

    const char* rep =as<const char*>(rep_) ;
    int nrep = strlen( rep ) ;
    int n=Rf_length(x_) ;
    SEXP res = PROTECT( Rf_allocVector( STRSXP, n ) ) ;

    char buffer[1024] ;

    for(int i=0; i<n; i++) {
        const char* xi = char_get_string_elt(x_, i) ;
        if( strncmp( xi+1, "abc", 3 ) ) {
            set_string_elt( res, i, get_string_elt(x_,i)) ;
        } else{
            buffer[0] = xi[0] ;
            strncpy( buffer + 1, rep, nrep ) ;
            strcpy( buffer + 1 + nrep, xi + 4 ) ;
            char_set_string_elt(res, i, buffer ) ;
        }
    }
    UNPROTECT(1) ;
    return res ;
', plugin = "Rcpp" )

写屏障

R Internals manual描述写屏障:

A generational collector needs to efficiently ‘age’ the objects, especially list-like objects (including STRSXPs). This is done by ensuring that the elements of a list are regarded as at least as old as the list when they are assigned. This is handled by the functions SET_VECTOR_ELT and SET_STRING_ELT, which is why they are functions and not macros. Ensuring the integrity of such operations is termed the write barrier and is done by making the SEXP opaque and only providing access via functions (which cannot be used as lvalues in assignments in C).

All code in R extensions is by default behind the write barrier.



Luke Tierney's document描述了原因背后的逻辑:

The generational collector divides allocated nodes into generations based on some notion of age. Younger generations are collected more frequently than older ones. For this to work correctly, any younger nodes that are reachable only from older nodes must be handled properly. This is accomplished by a write barrier that monitors each assignment and takes appropriate action when a reference to a new node is placed in an older one.

关于r - 用长度相等或更大的字符串有效地替换固定位置的子字符串,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/8456010/

相关文章:

r - 是否可以使用 R 命令在 Windows 上安装 pandoc?

r - 将数字格式设置为两位数

r - 如何使用 knit 将环境传递给 Rcpp block ?

rcpp - R session 中止

c++ - 无法在 R 下编译 Rcpp 0.12.6

r - R 包 biomaRt 和此依赖项 RSQLite 出错

r - R 上 Packrat 的问题

使用 readOGR 与 readShapePoly 读取形状文件

r - 用于删除 NA 值的模板化 Rcpp 函数

r - 我可以使用 Rcpp 加速我的 R 代码吗?