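r - Identify non-overlapping values between factors in an R data frame

Tags: r overlap iris-dataset

I would like to identify all non-overlapping values between groups (factors) in a data frame. Let's use iris to illustrate. The iris data set contains measurements for three plant species (setosa, versicolor, and virginica). All three species overlap in their measurements of sepal length and width. In the measurements of petal length and width, setosa does not overlap with either versicolor or virginica.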


tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Sepal.Width, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Width, INDEX = iris$Species, FUN = range)

# or

library(ggplot2)
ggplot(iris, aes(Species, Sepal.Length)) + geom_point()
ggplot(iris, aes(Species, Sepal.Width)) + geom_point()
ggplot(iris, aes(Species, Petal.Length)) + geom_point()
ggplot(iris, aes(Species, Petal.Width)) + geom_point()
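The four `tapply()` calls above can be collapsed into a single pass over every numeric column, although the resulting ranges still have to be compared by eye. A minimal base R sketch, in which `range_by_group` is a hypothetical helper name that does not appear in the original post:

```r
# Hypothetical helper: one row per group, holding the min and max of `values`.
range_by_group <- function(values, groups) {
  out <- t(sapply(split(values, groups), range))
  colnames(out) <- c("min", "max")
  out
}

# Apply the helper to every numeric column of iris.
numeric_cols <- vapply(iris, is.numeric, logical(1))
ranges <- lapply(iris[numeric_cols], range_by_group, groups = iris$Species)

# Inspect one variable at a time, e.g. petal length.
ranges$Petal.Length
```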

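But doing this by hand is impractical for large data sets, so I would like to write a function that identifies non-overlapping values between factors in a data frame such as iris. The output could be a list of matrices containing TRUE or FALSE (non-overlapping and overlapping, respectively), with one matrix per variable in the data set. For example, the output for iris would be a list of 4 matrices: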

            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

            setosa   versicolor   virginica
setosa      NA       FALSE        FALSE   
versicolor  FALSE    NA           FALSE   
virginica   FALSE    FALSE        NA   

            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA   

            setosa   versicolor   virginica
setosa      NA       TRUE         TRUE   
versicolor  TRUE     NA           FALSE   
virginica   TRUE     FALSE        NA   



Here is one possible solution using the tidyverse:


library(dplyr)  # attaches the %>% pipe used below

# build custom function
my_fun <- function(x){
    # build a tibble from the input data (column with the metric) and the Species vector from iris
    myDf <- dplyr::tibble(Species = as.character(iris$Species), Vals = as.numeric(x)) %>%
        # find min and max value per species
        dplyr::group_by(Species) %>%
        dplyr::summarise(mini = min(Vals), maxi = max(Vals))

    ret <- myDf %>%
        # cross join the summary with itself to get every pair of species
        # (dplyr >= 1.1.0 deprecates by = character() in favour of dplyr::cross_join())
        dplyr::full_join(myDf, by = character(), suffix = c("_1", "_2")) %>%
        # convert operation to row wise
        dplyr::rowwise() %>%
        # same species gives NA; otherwise negate the overlap test so that overlapping ranges give FALSE
        dplyr::mutate(res = ifelse(Species_1 == Species_2, NA, !(dplyr::between(mini_1, mini_2, maxi_2) | dplyr::between(maxi_1, mini_2, maxi_2) | dplyr::between(mini_2, mini_1, maxi_1) | dplyr::between(maxi_2, mini_1, maxi_1) ))) %>%
        # drop the row-wise grouping before reshaping
        dplyr::ungroup() %>%
        # make the tibble wide to get the wanted layout
        tidyr::pivot_wider(id_cols = Species_1, names_from = Species_2, values_from = res) %>%
        # a plain data frame is needed to be able to set row names
        as.data.frame()

    # set row names from column
    row.names(ret) <- ret$Species_1
    # remove column used to name rows
    ret$Species_1 <- NULL

    ret
}

purrr::map(iris[, 1:4], ~my_fun(.x))

           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA

           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA

           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
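The four `between()` checks in `my_fun` reduce to a simpler condition: two closed ranges are disjoint exactly when one ends strictly before the other begins. The sketch below expresses that idea in base R. `non_overlap` and `non_overlap_all` are hypothetical helper names that do not appear in the original answer, and the code assumes that every factor level has at least one non-missing value.

```r
# Hypothetical helper: TRUE where the value ranges of two groups do not overlap.
non_overlap <- function(values, groups) {
  lo <- tapply(values, groups, min)  # per-group minimum
  hi <- tapply(values, groups, max)  # per-group maximum
  # group i lies entirely below group j, or entirely above it
  res <- outer(hi, lo, "<") | outer(lo, hi, ">")
  diag(res) <- NA  # a group is not compared with itself
  res
}

# Hypothetical wrapper: run the check on every numeric column of a data frame.
non_overlap_all <- function(df, group_col) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  lapply(df[numeric_cols], non_overlap, groups = df[[group_col]])
}

result <- non_overlap_all(iris, "Species")
result$Petal.Length
```

Because the comparisons are strict (`<` and `>`), ranges that merely touch at an endpoint count as overlapping, which matches the inclusive behaviour of `dplyr::between()`. Unlike `my_fun`, this version takes the grouping column as an argument instead of hard-coding `iris$Species`, and it returns logical matrices rather than data frames.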

A similar question about identifying non-overlapping values between factors in an R data frame can be found on Stack Overflow: https://stackoverflow.com/questions/75442568/


