r - Merging on a column with multiple values

Tags: r merge apply bioinformatics

I have a data frame, cluster, and one of its columns, cluster$Genes, looks like this:

ENSG00000134684
ENSG00000188846, ENSG00000181163, ENSG00000114391
ENSG00000134684, ENSG00000175390
ENSG00000134684
ENSG00000134684, ENSG00000175390
...

The number of elements in each row of the column is arbitrary. I also have another data frame, expression, which looks like this:

ENSGID           a       b
ENSG00000134684  1       3
ENSG00000175390  2       0
ENSG00000000419  131.23  108.73
ENSG00000000457  7.11    8.68
ENSG00000000460  15.70   6.59
ENSG00000000938  0       0
ENSG00000000971  0.03    0.07
ENSG00000001036  59.22   58.3
...

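...and it has about 20,000 rows. What I want to do is:
  • For all elements in each row of cluster$Genes, find the corresponding a and b values in expression
  • Compute the minimum, maximum and mean of a and of b (separately) for each row of cluster$Genes
  • Create six new columns in the cluster data frame and fill them with the (min.a, max.a, mean.a, min.b, max.b, mean.b) values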

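I have tried to find a way to do this, but it is not going well. While searching Google for help, I figured I would probably need some kind of apply, and I ended up with some code. I think it is mostly gibberish and completely non-functional, and I am rather stuck. This is what I have:
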
    exp.lookup = function(genes) {
      genes.split = strsplit(genes, ', ')
      exp.hct = list()
      exp.hke = list()
      for ( gene in genes.split ) {
        exp.hct = c(exp.hct, merge(gene, means$hct, all.x=TRUE))
        exp.hke = c(exp.hke, merge(gene, means$hke, all.x=TRUE))
        return(c(exp.hct, exp.hke))
      }
    }
    
    apply(cluster['Genes'], 1, FUN=exp.lookup)
    

Does anyone have a better idea that might actually work?

Best answer

Recreating the initial data:

    library(data.table)
    
    cluster<- as.data.table(list(Genes = c("ENSG00000134684",
                                           "ENSG00000188846, ENSG00000181163, ENSG00000114391", 
                                           "ENSG00000134684, ENSG00000175390", 
                                           "ENSG00000134684", 
                                           "ENSG00000134684, ENSG00000175390")))
    
    expression<- as.data.table(list(ENSGID = c("ENSG00000134684", "ENSG00000175390",
                                               "ENSG00000000419", "ENSG00000000457",
                                               "ENSG00000000460", "ENSG00000000938",
                                               "ENSG00000000971", "ENSG00000001036"),
                                    a = c(1,2,131.23,7.11,15.70, 0, 0.03, 59.22),
                                    b = c(3,0,108.73,8.68,6.59,0,0.07,58.3)))
    setkey(cluster, Genes)
    setkey(expression, ENSGID)
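
Why the keys matter can be seen from a single row. The following is a minimal sketch, assuming a cut-down copy of expression (first three rows only): once the table is keyed on ENSGID, a character vector of gene IDs in `i` is matched against the key, and IDs that are not in the table come back as rows with NA in a and b.

```r
library(data.table)

# Cut-down stand-in for the "expression" table (first three rows only)
expression <- data.table(
  ENSGID = c("ENSG00000134684", "ENSG00000175390", "ENSG00000000419"),
  a = c(1, 2, 131.23),
  b = c(3, 0, 108.73)
)
setkey(expression, ENSGID)

# One cell of cluster$Genes, split into individual gene IDs
ids <- strsplit("ENSG00000134684, ENSG00000175390", ", ", fixed = TRUE)[[1]]

# Keyed lookup: the character vector in i is joined against the key column
hits <- expression[ids]
print(hits)

# An ID that is absent from the table yields a row with NA in a and b
missing <- expression["ENSG00000188846"]
print(missing)
```

Rows like `missing` are the reason the summary step needs `na.rm = TRUE`.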
    

Solution:

    library(data.table)
    
    result<- function() {
      colnames<- c("min.a", "max.a", "mean.a", "min.b", "max.b", "mean.b")
      # 1. "(colnames)" is parenthesized to ensure that it is evaluated as a
      # variable holding a character vector of new column names, rather than
      # being taken as a literal column name
      # 2. ":=" adds new columns to the existing data.table by reference
      # 3. "count(Genes)" calls count() on the "Genes" column; because we are
      # grouping with "by = Genes", count() is called once per distinct value
      # of Genes, and each value is a single comma-separated string.
      cluster[,(colnames):=count(Genes), by = Genes]
    }
    
    # takes one value of the Genes column
    count<- function(charvector) {
      ENSGIDc<- strsplit(charvector, ", ")
      # 4. subset the rows of the "expression" data.table by the split "Genes"
      # string, "ENSGIDc" (a join on the ENSGID key)...
      # 5. ... and then calculate the mins, maxes and means of the columns
      expression[ENSGIDc, .(min(a, na.rm = TRUE), max(a, na.rm = TRUE),
                            mean(a, na.rm = TRUE), min(b, na.rm = TRUE), 
                            max(b, na.rm = TRUE), mean(b, na.rm = TRUE))]
      # 6. at this point we return the resulting 1-row, 6-column data.table
      # to the calling function, where it is added to the "cluster" data.table
    }
    
    # min() and max() warn (and return Inf/-Inf) for rows whose genes are all
    # missing from "expression", hence suppressWarnings()
    suppressWarnings(result())
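
As an alternative to the by-reference approach above, the same result can be sketched by reshaping to long format first: one row per (cluster row, gene ID), a left join onto expression, then a grouped summary. This is only a sketch on a cut-down copy of the data; `row_id`, `safe()`, `long`, `stats` and `summary_cols` are hypothetical names introduced here, and unlike the code above it reports NA (not Inf or NaN) for rows with no matching genes.

```r
library(data.table)

# Cut-down copies of the two tables from the question
cluster <- data.table(Genes = c(
  "ENSG00000134684",
  "ENSG00000188846, ENSG00000181163, ENSG00000114391",
  "ENSG00000134684, ENSG00000175390"
))
expression <- data.table(
  ENSGID = c("ENSG00000134684", "ENSG00000175390", "ENSG00000000419"),
  a = c(1, 2, 131.23),
  b = c(3, 0, 108.73)
)

# Hypothetical helper: apply f to the non-NA values, or return NA if none exist
safe <- function(f, x) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) NA_real_ else f(x)
}

# 1. One row per (cluster row, gene ID); row_id identifies the original row
cluster[, row_id := .I]
long <- cluster[, .(ENSGID = trimws(unlist(strsplit(Genes, ",", fixed = TRUE)))),
                by = row_id]

# 2. Left join: keep every gene ID, with NA where expression has no entry
long <- merge(long, expression, by = "ENSGID", all.x = TRUE)

# 3. Summarise a and b separately for each original row
stats <- long[, .(min.a = safe(min, a), max.a = safe(max, a), mean.a = safe(mean, a),
                  min.b = safe(min, b), max.b = safe(max, b), mean.b = safe(mean, b)),
              by = row_id]

# 4. Attach the six summary columns to the cluster rows
summary_cols <- merge(cluster, stats, by = "row_id")
print(summary_cols)
```

Because the summary is keyed on `row_id` rather than on the Genes string, duplicate Genes values are handled row by row, and no warnings need to be suppressed.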
    

Regarding r - merging on a column with multiple values, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29649973/
