r - How to get the largest set of rows that share a common group of at least 4 columns?

Tags: r matrix subset intersection

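I have a matrix with gene names as rows and sample numbers as columns. Each row is a logical vector indicating the samples in which that gene was detected. A gene must be present in at least 4 of the 8 samples to get to this point (that is, to still be in the matrix), so every gene in this matrix is meant to be present in 4 or more samples.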

       Sample1  Sample2  Sample3  Sample4 Sample5 Sample6  Sample7  Sample8 
gene1  TRUE     FALSE    TRUE     TRUE    TRUE    FALSE    FALSE    FALSE
gene2  FALSE    TRUE     FALSE    TRUE    FALSE   TRUE     TRUE     FALSE
gene3  TRUE     TRUE     FALSE    TRUE    FALSE   TRUE     TRUE     FALSE
gene4  FALSE    FALSE    TRUE     FALSE   TRUE    FALSE    FALSE    TRUE
gene5  TRUE     TRUE     TRUE     TRUE    TRUE    FALSE    TRUE     TRUE
gene6  FALSE    FALSE    TRUE     FALSE   FALSE   TRUE     TRUE     TRUE
gene7  TRUE     TRUE     FALSE    FALSE   TRUE    TRUE     FALSE    FALSE
gene8  TRUE     TRUE     TRUE     TRUE    FALSE   FALSE    FALSE    FALSE
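The "at least 4 of 8 samples" filter described above can be sketched with `rowSums()`, which counts the `TRUE` values in each row of a logical matrix. This is only an illustration on a hypothetical toy matrix; the names `detected`, `min_samples` and `kept` are assumptions and not part of the original data.

```r
# Hypothetical toy matrix (not the data above): rows are genes, columns are samples.
detected <- matrix(as.logical(c(
  1, 1, 1, 1, 0, 0, 0, 0,    # geneA: detected in 4 samples
  1, 0, 0, 0, 0, 0, 0, 1,    # geneB: detected in 2 samples
  1, 1, 1, 1, 1, 1, 0, 0)),  # geneC: detected in 6 samples
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("Sample", 1:8)))

min_samples <- 4

# rowSums() on a logical matrix counts the TRUE entries per gene;
# drop = FALSE keeps the result a matrix even if a single gene survives.
kept <- detected[rowSums(detected) >= min_samples, , drop = FALSE]
rownames(kept)
# [1] "geneA" "geneC"
```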

Equivalently, for each gene I have the list of samples that express it, for example:

> gene1
[1] "Sample1"  "Sample3"  "Sample4"  "Sample5"
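As a minimal sketch of how the two representations relate, the per-gene sample lists can be derived from the logical matrix by indexing the column names with each row. The matrix below copies only the first two rows of the table so that the snippet runs on its own; the name `samples_by_gene` is an assumption made for illustration.

```r
# First two rows of the table above (1 = TRUE, 0 = FALSE), filled row by row.
mat <- matrix(as.logical(c(
  1, 0, 1, 1, 1, 0, 0, 0,
  0, 1, 0, 1, 0, 1, 1, 0)),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("gene1", "gene2"), paste0("Sample", 1:8)))

# One character vector of sample names per gene.
# lapply() is used instead of apply() so the result is always a list,
# even when every gene happens to have the same number of samples.
samples_by_gene <- setNames(
  lapply(seq_len(nrow(mat)), function(i) colnames(mat)[mat[i, ]]),
  rownames(mat))

samples_by_gene$gene1
# [1] "Sample1" "Sample3" "Sample4" "Sample5"
```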

How do I get the largest group of genes (rows) that all belong to a common group of 4 samples (columns)?

Edit: this question arose from trying to recreate the following:

Outlier analysis is based on the assumption that samples (cells) of the same type also have a set of commonly-expressed genes.

The outlier algorithm iteratively trims the low-expressing genes in an expression file until 95% of the genes that remain are expressed above the Limit of Detection (LoD) value that you set for half of the samples.

The assumption is that the set of samples contains less than 50% outliers. This means that subsequent calculations will only include the half of the samples that have the highest expression for the trimmed gene list.

The trimmed gene list represents genes that are present above the LoD in at least half the samples or the most evenly expressed genes—though they might not be the highest or lowest in their expression value.

For the 50% of the samples that remain, a distribution is calculated that represents their combined expression values for the gene list defined above. For this distribution, the median represents the 50th percentile expression value for the set of data.
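One possible reading of the quoted procedure, as a rough sketch rather than the actual implementation behind that description. Several details are assumptions: genes are trimmed one at a time in order of increasing mean expression, "expressed" means strictly above the LoD, and samples are ranked by their mean expression over the trimmed genes. The helper names `trim_genes` and `keep_top_half` and the toy matrix are hypothetical.

```r
# Hypothetical helper: drop the lowest-expressing genes until at least
# frac_genes of the remaining genes are above the LoD in at least
# frac_samples of the samples.
trim_genes <- function(expr, lod, frac_genes = 0.95, frac_samples = 0.5) {
  # Assumed trimming order: lowest mean expression first.
  expr <- expr[order(rowMeans(expr)), , drop = FALSE]
  repeat {
    # Fraction of samples in which each gene is above the LoD.
    passes <- rowMeans(expr > lod) >= frac_samples
    if (nrow(expr) <= 1 || mean(passes) >= frac_genes) break
    expr <- expr[-1, , drop = FALSE]  # remove the current lowest gene
  }
  expr
}

# Hypothetical helper: keep the half of the samples with the highest
# mean expression over the trimmed gene list.
keep_top_half <- function(expr) {
  n_keep <- ceiling(ncol(expr) / 2)
  top <- order(colMeans(expr), decreasing = TRUE)[seq_len(n_keep)]
  expr[, sort(top), drop = FALSE]
}

# Toy expression matrix (made-up values): 4 genes by 4 samples.
expr <- matrix(c(0, 0, 0, 2,
                 5, 6, 0, 0,
                 8, 9, 7, 0,
                 9, 9, 9, 3),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", LETTERS[1:4]), paste0("S", 1:4)))

trimmed <- trim_genes(expr, lod = 1)
kept <- keep_top_half(trimmed)
rownames(trimmed)  # "geneB" "geneC" "geneD"
colnames(kept)     # "S1" "S2"
```

On this toy input only geneA is trimmed, because it is above the LoD in a single sample, and S1 and S2 are the samples retained for later calculations.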

Best answer

I guess you want to find the genes that co-occur in any group of 4 samples. You could try something like this:

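n = 4
# All combinations of n sample (column) indices
combs = combn(seq_along(colnames(mat)), n, simplify = FALSE)
# For each combination, find the genes detected in all n samples,
# then keep only the combinations shared by more than one gene
Filter(function(x) length(x) > 1, 
       setNames(lapply(combs, function(i) names(which(rowSums(mat[, i]) == n))), 
                lapply(combs, function(x) paste0(colnames(mat)[x], collapse = "; "))))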
#$`Sample1; Sample2; Sample3; Sample4`
#[1] "gene5" "gene8"
#
#$`Sample1; Sample2; Sample4; Sample7`
#[1] "gene3" "gene5"
#
#$`Sample1; Sample3; Sample4; Sample5`
#[1] "gene1" "gene5"
#
#$`Sample2; Sample4; Sample6; Sample7`
#[1] "gene2" "gene3"

where "mat" is:

mat = structure(c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, 
FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, 
FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, 
TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, 
TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, 
FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, 
FALSE, TRUE, TRUE, TRUE, FALSE, FALSE), .Dim = c(8L, 8L), .Dimnames = list(
    c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene7", 
    "gene8"), c("Sample1", "Sample2", "Sample3", "Sample4", "Sample5", 
    "Sample6", "Sample7", "Sample8")))
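To go from "all combinations shared by more than one gene" to the largest gene set, one option is to count the genes per combination and keep the combinations that reach the maximum. The following is a sketch along the lines of the code above, not a definitive solution; it rebuilds `mat` compactly from the question's table so that it runs on its own, and the names `shared`, `sizes` and `largest` are assumptions. Checking only groups of exactly 4 samples is enough for "at least 4", since any gene set that shares more than 4 samples also shares every 4-sample subset of them.

```r
# Same data as the question's table (1 = TRUE, 0 = FALSE), filled row by row.
mat <- matrix(as.logical(c(
  1, 0, 1, 1, 1, 0, 0, 0,
  0, 1, 0, 1, 0, 1, 1, 0,
  1, 1, 0, 1, 0, 1, 1, 0,
  0, 0, 1, 0, 1, 0, 0, 1,
  1, 1, 1, 1, 1, 0, 1, 1,
  0, 0, 1, 0, 0, 1, 1, 1,
  1, 1, 0, 0, 1, 1, 0, 0,
  1, 1, 1, 1, 0, 0, 0, 0)),
  nrow = 8, byrow = TRUE,
  dimnames = list(paste0("gene", 1:8), paste0("Sample", 1:8)))

n <- 4
combs <- combn(ncol(mat), n, simplify = FALSE)

# Genes detected in all n samples of each combination.
shared <- lapply(combs, function(i) {
  rownames(mat)[rowSums(mat[, i, drop = FALSE]) == n]
})
names(shared) <- vapply(combs,
                        function(i) paste(colnames(mat)[i], collapse = "; "),
                        character(1))

# Keep every combination whose gene set has the maximum size (ties included).
sizes <- lengths(shared)
largest <- shared[sizes == max(sizes)]

max(sizes)       # 2
names(largest)   # the four sample groups listed in the output above
```

For this data the maximum is 2 genes, reached by four different sample groups, so there is no single largest set. With many samples the number of combinations grows quickly (`choose(ncol(mat), n)`), so this brute-force approach is only practical for small matrices.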

Regarding "r - How to get the largest set of rows that share a common group of at least 4 columns?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28008543/
