r - How do I use ifelse when comparing two columns and changing a third?

Tags: r if-statement dataframe compare

I still find the ifelse construct in R a bit confusing. I have the following data frame:

df <- structure(list(snp = structure(1:11, .Label = c("AL0009", "AL00014", "AL0021", "AL00046", "AL0047", "AS0005", "AS0014", "AS00021", "AS0047", "AS0071", "DR0001" ), class = "factor"), CHROMOSOME = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), COUNT_ALLELE = structure(c(1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 3L, 3L, 1L), .Label = c("A", "C", "G"), class = "factor"),     OTHER_ALLELE = structure(c(3L, 3L, 2L, 1L, 3L, 2L, 2L, 1L,     1L, 1L, 3L), .Label = c("A", "C", "G"), class = "factor"),     `116601888` = c(0L, 0L, 0L, 2L, 2L, 0L, 0L, 0L, 0L, 0L, 2L     ), `116621563` = c(0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L,     1L), `117253533` = c(0L, 0L, 0L, 2L, 2L, 0L, 0L, 0L, 1L,     0L, 2L), `117423827` = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L,     1L, 1L, 2L)), .Names = c("snp", "CHROMOSOME", "COUNT_ALLELE", "OTHER_ALLELE", "11688", "11663", "11533", "13827" ), row.names = c(NA, 11L), class = "data.frame")

#        snp CHROMOSOME COUNT_ALLELE OTHER_ALLELE 11688 11663 11533 13827
# 1   AL0009          1            A            G     0     0     0     1
# 2  AL00014          1            A            G     0     0     0     1
# 3   AL0021          1            A            C     0     0     0     1
# 4  AL00046          1            G            A     2     1     2     1
# 5   AL0047          1            A            G     2     1     2     1
# 6   AS0005          1            A            C     0     0     0     0
# 7   AS0014          1            A            C     0     0     0     0
# 8  AS00021          1            C            A     0     1     0     0
# 9   AS0047          1            G            A     0     0     1     1
# 10  AS0071          1            G            A     0     0     0     1
# 11  DR0001          1            A            G     2     1     2     2

Using the TranslateAllele function, I want to replace the numbers in the columns from column 5 onward with the corresponding two-letter code:
TranslateAllele <- function(COUNT_ALLELE, OTHER_ALLELE, genotype){
  if(genotype==0){
    print(paste(OTHER_ALLELE, OTHER_ALLELE, sep=""))
  } else if(genotype==1){
    print(paste(COUNT_ALLELE, OTHER_ALLELE, sep=""))
  } else if(genotype==2){
    print(paste(COUNT_ALLELE, COUNT_ALLELE, sep=""))
  }
}

The desired output is therefore:
#        snp CHROMOSOME COUNT_ALLELE OTHER_ALLELE 11688 11663 11533 13827
# 1   AL0009          1            A            G    GG    GG    GG    AG
# 2  AL00014          1            A            G    GG    GG    GG    AG
# 3   AL0021          1            A            C    CC    CC    CC    AC
# 4  AL00046          1            G            A    GG    GA    GG    GA
# 5   AL0047          1            A            G    AA    AG    AA    AG
# 6   AS0005          1            A            C    CC    CC    CC    CC
# 7   AS0014          1            A            C    CC    CC    CC    CC
# 8  AS00021          1            C            A    AA    CA    AA    AA
# 9   AS0047          1            G            A    AA    AA    GA    GA
# 10  AS0071          1            G            A    AA    AA    AA    GA
# 11  DR0001          1            A            G    AA    AG    AA    AA

Eventually I need to do this for 1.6M rows x 1M columns, so I cannot simply use a for loop :(

Best answer

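I tend to avoid ifelse. It has some serious drawbacks (for example, it drops attributes, so a Date comes back as a plain number). The following is a compromise between efficiency and simplicity: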

df[, 5:8] <- lapply(df[, 5:8], function(x, a, b) {
  x[x == 0] <- paste0(b, b)[x == 0]
  x[x == 1] <- paste0(a, b)[x == 1]
  x[x == 2] <- paste0(a, a)[x == 2]
  x
}, a = df$COUNT_ALLELE, b = df$OTHER_ALLELE)
#        snp CHROMOSOME COUNT_ALLELE OTHER_ALLELE 11688 11663 11533 13827
# 1   AL0009          1            A            G    GG    GG    GG    AG
# 2  AL00014          1            A            G    GG    GG    GG    AG
# 3   AL0021          1            A            C    CC    CC    CC    AC
# 4  AL00046          1            G            A    GG    GA    GG    GA
# 5   AL0047          1            A            G    AA    AG    AA    AG
# 6   AS0005          1            A            C    CC    CC    CC    CC
# 7   AS0014          1            A            C    CC    CC    CC    CC
# 8  AS00021          1            C            A    AA    CA    AA    AA
# 9   AS0047          1            G            A    AA    AA    GA    GA
# 10  AS0071          1            G            A    AA    AA    AA    GA
# 11  DR0001          1            A            G    AA    AG    AA    AA
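
For comparison with the question's title, the same translation can also be written with nested ifelse() calls. The following is only a sketch on a hypothetical three-row subset of the data; the names geno_cols and translate_ifelse are made up for illustration, and this is not the approach recommended in this answer.

```r
# Hypothetical three-row stand-in for the question's data
# (rows AL0009, AL00046 and AS00021, first two genotype columns only)
df <- data.frame(
  snp = c("AL0009", "AL00046", "AS00021"),
  COUNT_ALLELE = c("A", "G", "C"),
  OTHER_ALLELE = c("G", "A", "A"),
  `11688` = c(0L, 2L, 0L),
  `11663` = c(0L, 1L, 1L),
  check.names = FALSE,       # keep the numeric-looking column names as they are
  stringsAsFactors = FALSE
)

geno_cols <- c("11688", "11663")  # assumed names of the genotype columns

# Nested ifelse: 0 -> other/other, 1 -> count/other, 2 -> count/count,
# anything else -> NA
translate_ifelse <- function(x, a, b) {
  ifelse(x == 0, paste0(b, b),
         ifelse(x == 1, paste0(a, b),
                ifelse(x == 2, paste0(a, a), NA_character_)))
}

df[geno_cols] <- lapply(df[geno_cols], translate_ifelse,
                        a = df$COUNT_ALLELE, b = df$OTHER_ALLELE)
print(df)
```

Each ifelse() call builds full-length vectors for its branches, so on a large data set this is likely to use more memory than the subset assignment shown above.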

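However, your data set has many columns. You should therefore reshape the data.frame to long format (assuming you have enough memory) to avoid the loop. Start again from the original df, with the genotype columns still holding integers: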
library(reshape2)
dfmelt <- melt(df, id.vars = c("snp", "CHROMOSOME", "COUNT_ALLELE", "OTHER_ALLELE"))

# use the columns of dfmelt itself rather than relying on df being recycled
dfmelt$code <- paste0(dfmelt$OTHER_ALLELE, dfmelt$OTHER_ALLELE)
dfmelt[dfmelt$value == 1L,] <- within(dfmelt[dfmelt$value == 1L,], code <- paste0(COUNT_ALLELE, OTHER_ALLELE))
dfmelt[dfmelt$value == 2L,] <- within(dfmelt[dfmelt$value == 2L,], code <- paste0(COUNT_ALLELE, COUNT_ALLELE))

Of course, your data is so large that you would really benefit from using the data.table package:
library(data.table)
setDT(df)
dfmelt <- melt(df, id.vars = c("snp", "CHROMOSOME", "COUNT_ALLELE", "OTHER_ALLELE"))
dfmelt[value == 0L, code := paste0(OTHER_ALLELE, OTHER_ALLELE)]
dfmelt[value == 1L, code := paste0(COUNT_ALLELE, OTHER_ALLELE)]
dfmelt[value == 2L, code := paste0(COUNT_ALLELE, COUNT_ALLELE)]
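
If the installed data.table is version 1.13.0 or later, the three conditional updates can probably be collapsed into a single call to fcase(). This is a sketch on a hypothetical three-row subset of the data, with the melted columns renamed to sample and genotype purely for readability:

```r
library(data.table)

# Hypothetical three-row stand-in for the question's data
dt <- data.table(
  snp = c("AL0009", "AL00046", "AS00021"),
  COUNT_ALLELE = c("A", "G", "C"),
  OTHER_ALLELE = c("G", "A", "A"),
  `11688` = c(0L, 2L, 0L),
  `11663` = c(0L, 1L, 1L)
)

# Long format: one row per snp/sample combination
long <- melt(dt, id.vars = c("snp", "COUNT_ALLELE", "OTHER_ALLELE"),
             variable.name = "sample", value.name = "genotype")

# fcase() takes condition/value pairs; rows matching no condition stay NA
long[, code := fcase(
  genotype == 0L, paste0(OTHER_ALLELE, OTHER_ALLELE),
  genotype == 1L, paste0(COUNT_ALLELE, OTHER_ALLELE),
  genotype == 2L, paste0(COUNT_ALLELE, COUNT_ALLELE)
)]
print(long)
```

The three separate `:=` updates shown above only compute paste0() for the matching rows, whereas this form computes every branch for all rows, so it trades some memory for brevity.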

If you must, you can dcast the long-format data.frame/data.table back to wide format at the end. But there should be no reason to do so.
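
Should the wide layout be needed after all, the round trip might look like the sketch below. It uses a hypothetical three-row subset of the data, and the names long, wide, sample and genotype are chosen for illustration only.

```r
library(data.table)

# Hypothetical three-row stand-in for the question's data
dt <- data.table(
  snp = c("AL0009", "AL00046", "AS00021"),
  COUNT_ALLELE = c("A", "G", "C"),
  OTHER_ALLELE = c("G", "A", "A"),
  `11688` = c(0L, 2L, 0L),
  `11663` = c(0L, 1L, 1L)
)

# Wide -> long, then translate the genotype codes by reference
long <- melt(dt, id.vars = c("snp", "COUNT_ALLELE", "OTHER_ALLELE"),
             variable.name = "sample", value.name = "genotype")
long[genotype == 0L, code := paste0(OTHER_ALLELE, OTHER_ALLELE)]
long[genotype == 1L, code := paste0(COUNT_ALLELE, OTHER_ALLELE)]
long[genotype == 2L, code := paste0(COUNT_ALLELE, COUNT_ALLELE)]

# Long -> wide again, filling the cells with the translated codes
wide <- dcast(long, snp + COUNT_ALLELE + OTHER_ALLELE ~ sample,
              value.var = "code")
print(wide)
```

Note that dcast() returns the rows ordered by the columns on the left-hand side of the formula, so the row order may differ from the input.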

Source: the Stack Overflow question "r - How do I use ifelse when comparing two columns and changing a third?", https://stackoverflow.com/questions/35868728/
