我有一个代表细菌模型基因结构的文件。每行代表一个模型。行是固定长度的二进制字符串，其中存在基因(1 表示存在，0 表示不存在)。我的任务是比较每对模型的基因序列，并获得它们相似程度的分数，并计算相异矩阵。

一个文件中总共有 450 个模型(行)，共有 250 个文件。我有一个有效的代码，但是仅对一个文件完成整个工作大约需要 1.6 小时。

#Sample Data    
Generation: 0
    [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
    [1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
    [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
    [0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
    [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

我的代码的作用:

读取文件
将二进制字符串转换为数据框Gene、Model_1、Model_2、型号_3，...型号_450
运行嵌套的 for 循环来进行成对比较(仅顶部矩阵的一半) – 我取两个相应的列并添加然后计算总和为 2 的位置(即存在两种型号)
将数据写入文件
稍后创建矩阵

比较代码

generationFiles = list.files(pattern = "^Generation.*\\_\\d+.txt$")

start.time = Sys.time()

for(a in 1:length(generationFiles)){

  fname = generationFiles[a]

  geneData = read.table(generationFiles[a], sep = "\n", header = T, stringsAsFactors = F)

  geneCount = str_count(geneData[1,1],"[1|0]")

  geneDF <- data.frame(Gene = paste0("Gene_", c(1:geneCount)), stringsAsFactors = F)

  #convert the string into a data frame
      for(i in 1:nrow(geneData)){

    #remove the square brackets
    dataRow = substring(geneData[i,1], 2, nchar(geneData[i,1]) - 1)

    #removing white spaces
    dataRow = gsub(" ", "", dataRow, fixed = T)

    #splitting the string 
    dataRow = strsplit(dataRow, ",")

    #converting to numeric
    dataRow = as.numeric(unlist(dataRow))

    colName = paste("M_",i,sep = "")
    geneDF <- cbind(geneDF, dataRow)
    colnames(geneDF)[colnames(geneDF) == 'dataRow'] <- colName

    dataRow <- NULL
  }

  summaryDF <- data.frame(Model1 = character(), Model2 = character(), Common = integer(),
                          Uncommon = integer(), Absent = integer(), stringsAsFactors = F)

  modelNames = paste0("M_",c(1:450))

  secondaryLevel = modelNames

  fileName = paste0("D://BellosData//GC_3//Summary//",substr(fname, 1, nchar(fname) - 4),"_Summary.txt")

  for(x in 1:449){

    secondaryLevel = secondaryLevel[-1]

    for(y in 1:length(secondaryLevel)){

      result = geneDF[modelNames[x]] + geneDF[secondaryLevel[y]]

      summaryDF <- rbind(summaryDF, data.frame(Model1 = modelNames[x],
                                               Model2 = secondaryLevel[y],
                                               Common = sum(result == 2),
                                               Uncommon = sum(result == 1),
                                               Absent = sum(result == 0)))

    }


  }

  write.table(summaryDF, fileName, sep = ",", quote = F, row.names = F)
  geneDF <- NULL
  summaryDF <- NULL
  geneData <-NULL

}

转换为矩阵

maxNum = max(summaryDF$Common)
  normalizeData = summaryDF[,c(1:3)]
  normalizeData[c('Common')] <- lapply(normalizeData[c('Common')], function(x) 1 - x/maxNum)

  normalizeData[1:2] <- lapply(normalizeData[1:2], factor, levels=unique(unlist(normalizeData[1:2]))) 

  distMatrixN = xtabs(Common~Model1+Model2, data=normalizeData)

  distMatrixN = distMatrixN + t(distMatrixN)

有没有办法让进程运行得更快？有没有更有效的方法进行比较？

最佳答案

这段代码应该更快。 R 中的嵌套循环速度慢得像噩梦。像每次一行进行 rbind-ing 这样的操作也是 R 编程中最糟糕和最慢的想法之一。

生成 450 行，每行 20 个 0、1 元素。

M = do.call(rbind, replicate(450, sample(0:1, 20, replace = T), simplify = F))

生成组合(450, 2)个行对的列表

L = split(v<-t(utils::combn(450, 2)), seq(nrow(v))); rm(v)

应用您想要的任何比较函数。在本例中，为每个行组合的同一位置上 1 的数量。如果您想计算不同的指标，只需编写另一个函数(x)，其中 M[x[1],] 是第一行，M[x[2],] > 是第二行。

O = lapply(L, function(x) sum(M[x[1],]&M[x[2],]))

在相当慢的 2.6 Ghz Sandy Bridge 上，代码大约需要 4 秒

获取包含结果的干净数据框，三列:第 1 行、第 2 行、两行之间的指标

data.frame(row1 = sapply(L, `[`, 1),
           row2 = sapply(L, `[`, 2),
           similarity_metric = do.call(rbind, O))

说实话，我没有彻底梳理您的代码来准确复制您正在做的事情。如果这不是您想要的(或者无法修改以实现您想要的)，请发表评论。

关于r - 成对二进制比较 - 优化 R 中的代码，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/49929889/

r - 成对二进制比较 - 优化 R 中的代码

比较代码

上一篇：c - 对 execvp 系统调用的调用如何工作？

下一篇：express - 如何在 Swagger 中定义包含复杂对象数组的示例请求正文？