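I would like to identify all non-overlapping values among groups (factors) in a data frame. Let's use iris to illustrate. The iris dataset contains measurements for three plant species (setosa, versicolor, and virginica). All three species overlap in their sepal length and sepal width measurements. In the petal length and petal width measurements, setosa overlaps with neither versicolor nor virginica.
What I want is easy to check by hand with various functions, for example with per-group ranges or scatter plots: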
tapply(iris$Sepal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Sepal.Width, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Length, INDEX = iris$Species, FUN = range)
tapply(iris$Petal.Width, INDEX = iris$Species, FUN = range)
# or
library(ggplot2)
ggplot(iris, aes(Species, Sepal.Length)) + geom_point()
ggplot(iris, aes(Species, Sepal.Width)) + geom_point()
ggplot(iris, aes(Species, Petal.Length)) + geom_point()
ggplot(iris, aes(Species, Petal.Width)) + geom_point()
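The per-species ranges for all four measurements can also be collected in a single call. This is only a minimal sketch using base R's aggregate(); the name rng is illustrative and not part of the original question:

```r
# One row per species; each measurement becomes a two-column matrix (min, max)
rng <- aggregate(. ~ Species, data = iris, FUN = range)
print(rng)
```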
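But doing this by hand is impractical for large datasets, so I would like to write a function that identifies non-overlapping values among the factors of a data frame such as iris. The output could be a list of matrices holding TRUE or FALSE (non-overlapping and overlapping, respectively), one matrix per variable in the dataset. For example, the output for iris would be a list of 4 matrices: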
$1.Sepal.Length
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA
$2.Sepal.Width
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA
$3.Petal.Length
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
$4.Petal.Width
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
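I am open to suggestions for different outputs, as long as they identify all non-overlapping values.
Best answer
Here is one possible solution using the tidyverse: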
library(dplyr)
# build custom function
my_fun <- function(x){
  # build tibble from input data (column with metric) and Species vector from iris
  myDf <- dplyr::tibble(Species = as.character(iris$Species), Vals = as.numeric(x)) %>%
    # find min and max value per species
    dplyr::group_by(Species) %>%
    dplyr::summarise(mini = min(Vals), maxi = max(Vals))
  ret <- myDf %>%
    # build full (cross) join from data so every pair of species gets one row
    # note: by = character() is deprecated since dplyr 1.1.0 in favour of dplyr::cross_join()
    dplyr::full_join(myDf, by = character(), suffix = c("_1", "_2")) %>%
    # convert operation to row wise
    dplyr::rowwise() %>%
    # if species are the same generate NA else check if between - I do negate here as if they are overlapping you want it to be FALSE
    dplyr::mutate(res = ifelse(Species_1 == Species_2, NA,
                               !(dplyr::between(mini_1, mini_2, maxi_2) |
                                   dplyr::between(maxi_1, mini_2, maxi_2) |
                                   dplyr::between(mini_2, mini_1, maxi_1) |
                                   dplyr::between(maxi_2, mini_1, maxi_1)))) %>%
    # make tibble wide to get the wanted layout (id_cols named explicitly; the min/max columns are dropped)
    tidyr::pivot_wider(id_cols = Species_1, names_from = Species_2, values_from = res) %>%
    # need it to be able to set row names
    as.data.frame()
  # set row names from column
  row.names(ret) <- ret$Species_1
  # remove column used to name rows
  ret$Species_1 <- NULL
  return(ret)
}
purrr::map(iris[, 1:4], ~my_fun(.x))
$Sepal.Length
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA
$Sepal.Width
           setosa versicolor virginica
setosa         NA      FALSE     FALSE
versicolor  FALSE         NA     FALSE
virginica   FALSE      FALSE        NA
$Petal.Length
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
$Petal.Width
           setosa versicolor virginica
setosa         NA       TRUE      TRUE
versicolor   TRUE         NA     FALSE
virginica    TRUE      FALSE        NA
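The same range comparison can also be sketched in base R without a join. Two closed ranges are disjoint exactly when one ends before the other begins, so a pair of outer() comparisons on the per-group minima and maxima covers every pair of groups at once. The helper name non_overlap and its arguments x and g are hypothetical (they do not appear in the original answer), and ranges that merely touch at an endpoint are treated as overlapping, as with between() in the solution above:

```r
# Hypothetical helper: TRUE where the value ranges of two groups do not overlap
non_overlap <- function(x, g) {
  g  <- as.factor(g)
  lo <- tapply(x, g, min)   # per-group minimum
  hi <- tapply(x, g, max)   # per-group maximum
  # group i ends before group j starts, or group i starts after group j ends
  res <- outer(hi, lo, "<") | outer(lo, hi, ">")
  diag(res) <- NA           # a group is not compared with itself
  res
}

# Apply to every numeric column, grouping by Species
res <- lapply(iris[sapply(iris, is.numeric)], non_overlap, g = iris$Species)
print(res$Petal.Length)
```

Because the grouping vector is passed as an argument rather than read from iris inside the function, the sketch should carry over to other data frames as long as g has one entry per row.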
Regarding r - identifying non-overlapping values among factors in an R data frame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/75442568/