Remove rows within a group if the coordinates of one subgroup fall inside another subgroup in R

Tags: r dataframe dplyr group-by tidyverse

I have a data frame such as:

Groups NAMES start end 
G1     A    1     50
G1     A    25    45
G1     B    20    51
G1     A    51    49
G2     A    200   400
G2     B    1     1600
G2     A    2000  3000
G2     B    4000  5000

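The idea is, within each Groups, to look at the NAMES and find the A rows whose start and end coordinates fall within the coordinates of a B row. Any B row that contains such an A row should be removed.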

For instance, in this example:

Groups NAMES start end 
G1     A    1     50    <- A is outside any B coordinates
G1     A    25    45    <- A is **inside** the B coordinates `20-51`, so I remove that B row.
G1     B    20    51  
G1     A    51    49    <- A is outside any B coordinates
G2     A    200   400   <- A is **inside** the B coordinates `1-1600`, so I remove that B row.
G2     B    1     1600
G2     A    2000  3000  <- A is outside any B coordinates
G2     B    4000  5000  <- this B does not have any A inside it, so it is kept in the output.

I should then get this output:

Groups NAMES start end 
G1     A    1     50
G1     A    25    45
G1     A    51    49
G2     A    200   400
G2     A    2000  3000
G2     B    4000  5000

Does anyone have an idea?

Here is the data frame in dput format, if that helps:

   structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L), .Label = c("G1", "G2"), class = "factor"), NAMES = structure(c(1L, 
1L, 2L, 1L, 1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"), 
    start = c(1L, 25L, 20L, 51L, 200L, 1L, 2000L, 4000L), end = c(50L, 
    45L, 51L, 49L, 400L, 1600L, 3000L, 5000L)), class = "data.frame", row.names = c(NA, 
-8L))

Best answer

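Here is one possible approach. We split df by NAMES and join the two parts to each other by Groups, so that every A row is compared with every B row in the same group. Only B rows can be dropped, so we only need to keep track of row numbers for those rows.

Then we can group by rowid and flag each B row according to whether it contains any A row. Finally, we filter down to the B rows to keep and bind them back to the A rows.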
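The containment test that drives this comparison can be tried on its own first. The sketch below is only an illustration: `is_inside` is a hypothetical helper name, not something from dplyr or from the answer's code, and it assumes that "inside" means both endpoints of A lie within B's endpoints, inclusive.

```r
# Hypothetical helper (not part of the original answer): TRUE when the
# interval [a_start, a_end] lies entirely within [b_start, b_end].
# Vectorised, so it also works on whole columns inside mutate().
is_inside <- function(a_start, a_end, b_start, b_end) {
  a_start >= b_start & a_end <= b_end
}

inside_25_45 <- is_inside(25, 45, 20, 51)  # A 25-45 against B 20-51
inside_1_50  <- is_inside(1, 50, 20, 51)   # A 1-50 against B 20-51
inside_51_49 <- is_inside(51, 49, 20, 51)  # reversed A row (start > end)
```

Note that the reversed row (start 51, end 49) also passes this test against B `20-51`. That does not change the result for the sample data, because that B row is already removed on account of A `25-45`, but it may matter for data where start can exceed end.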

library(tidyverse)
df <- structure(list(Groups = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("G1", "G2"), class = "factor"), NAMES = structure(c(1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L), .Label = c("A", "B"), class = "factor"), start = c(1L, 25L, 20L, 51L, 200L, 1L, 2000L, 4000L), end = c(50L, 45L, 51L, 49L, 400L, 1600L, 3000L, 5000L)), class = "data.frame", row.names = c(NA, -8L))

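# Split the data by NAMES; only B rows can be dropped, so only they get a row id
A <- filter(df, NAMES == "A")
B <- df %>%
  filter(NAMES == "B") %>%
  rowid_to_column()

# Pair every A with every B in the same group, then flag each B row
# according to whether it contains at least one A row
comparison <- inner_join(A, B, by = "Groups") %>%
  mutate(A_in_B = start.x >= start.y & end.x <= end.y) %>%
  group_by(rowid) %>%
  summarise(keep_B = !any(A_in_B))

# Keep the B rows that contain no A row, then add the A rows back
B %>%
  inner_join(comparison, by = "rowid") %>%
  filter(keep_B) %>%
  select(-rowid, -keep_B) %>%
  bind_rows(A) %>%
  arrange(Groups, NAMES)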
#>   Groups NAMES start  end
#> 1     G1     A     1   50
#> 2     G1     A    25   45
#> 3     G1     A    51   49
#> 4     G2     A   200  400
#> 5     G2     A  2000 3000
#> 6     G2     B  4000 5000

Created on 2021-07-27 by the reprex package (v1.0.0)
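One edge case worth checking: because the code above uses inner joins, a B row in a group that has no A rows at all would, as far as I can tell, never appear in `comparison` and would therefore be dropped. The sample data has no such group, so the output shown is unaffected. The sketch below is one possible variant, not the original answer's method: it finds the B rows to drop and removes them with `anti_join`, so unmatched B rows survive. The `G3` row and the names `B_to_drop` and `result` are assumptions added for illustration.

```r
library(dplyr)

# Sample data from the question plus one hypothetical extra row:
# group G3 has a B row but no A rows.
df <- data.frame(
  Groups = c("G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3"),
  NAMES  = c("A", "A", "B", "A", "A", "B", "A", "B", "B"),
  start  = c(1, 25, 20, 51, 200, 1, 2000, 4000, 10),
  end    = c(50, 45, 51, 49, 400, 1600, 3000, 5000, 20)
)

A <- filter(df, NAMES == "A")
B <- filter(df, NAMES == "B")

# B rows that contain at least one A row from the same group.
# The A columns get the ".a" suffix; the B columns keep their names.
# dplyr 1.1.0 or later may warn about a many-to-many join here,
# which is expected for this kind of all-pairs comparison.
B_to_drop <- inner_join(B, A, by = "Groups", suffix = c("", ".a")) %>%
  filter(start.a >= start, end.a <= end) %>%
  distinct(Groups, NAMES, start, end)

# Remove only those B rows; everything else keeps its original order.
result <- anti_join(df, B_to_drop, by = c("Groups", "NAMES", "start", "end"))
```

Because the rows are matched by value rather than by row id, identical duplicate B rows would all be removed together, which seems consistent with the question but is an assumption.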

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/68547438/
