Can someone explain why I get a different number of outliers from the plain boxplot command than from ggplot2's geom_boxplot?
Here is an example:
x <- c(280.9, 135.9, 321.4, 333.7, 0.2, 71.3, 33.0, 102.6, 126.8, 194.8, 35.5,
107.3, 45.1, 107.2, 55.2, 28.1, 36.9, 24.3, 68.7, 163.5, 0.8, 31.8, 121.4,
84.7, 34.3, 25.2, 101.4, 203.2, 194.1, 27.9, 42.5, 47.0, 85.1, 90.4, 103.8,
45.1, 94.0, 36.0, 60.9, 97.1, 42.5, 96.4, 58.4, 174.0, 173.2, 164.1, 92.1,
41.9, 130.2, 94.7, 121.5, 261.4, 46.7, 16.3, 50.7, 112.9, 112.2, 242.5, 140.6,
112.6, 31.2, 36.7, 97.4, 140.5, 123.5, 42.9, 59.4, 94.5, 37.4, 232.2, 114.6,
60.7, 27.8, 115.5, 111.9, 60.1)
library(ggplot2)
data <- data.frame(x)
boxplot(data$x)
ggplot(data, aes(y=x)) + geom_boxplot()
With the boxplot command I get a plot with 4 outliers, while ggplot2 gives me a plot with 5 outliers.
Best answer
ggplot and boxplot use slightly different methods to compute the statistics. From ?geom_boxplot we can see:
The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). This differs slightly from the method used by the boxplot() function, and may be apparent with small samples. See boxplot.stats() for more information on how hinge positions are calculated for boxplot().
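To make the quoted difference concrete, here is a minimal base R sketch on the data from the question. It assumes that boxplot() takes its hinges from fivenum() (via boxplot.stats()) and that geom_boxplot() uses quantile() with R's default type 7. The variable names hinges, quarts, fence_base and fence_gg are illustrative only.

```r
# Same data as in the question
x <- c(280.9, 135.9, 321.4, 333.7, 0.2, 71.3, 33.0, 102.6, 126.8, 194.8, 35.5,
       107.3, 45.1, 107.2, 55.2, 28.1, 36.9, 24.3, 68.7, 163.5, 0.8, 31.8, 121.4,
       84.7, 34.3, 25.2, 101.4, 203.2, 194.1, 27.9, 42.5, 47.0, 85.1, 90.4, 103.8,
       45.1, 94.0, 36.0, 60.9, 97.1, 42.5, 96.4, 58.4, 174.0, 173.2, 164.1, 92.1,
       41.9, 130.2, 94.7, 121.5, 261.4, 46.7, 16.3, 50.7, 112.9, 112.2, 242.5, 140.6,
       112.6, 31.2, 36.7, 97.4, 140.5, 123.5, 42.9, 59.4, 94.5, 37.4, 232.2, 114.6,
       60.7, 27.8, 115.5, 111.9, 60.1)

# Hinges as computed for boxplot(): elements 2 and 4 of fivenum()
hinges <- fivenum(x)[c(2, 4)]

# Quartiles as described in ?geom_boxplot (assumed: default quantile type 7)
quarts <- quantile(x, c(0.25, 0.75), names = FALSE)

# Upper fence = upper hinge + 1.5 * (upper hinge - lower hinge)
fence_base <- hinges[2] + 1.5 * diff(hinges)
fence_gg   <- quarts[2] + 1.5 * diff(quarts)

hinges
quarts
c(base = fence_base, ggplot2 = fence_gg)

# Number of points strictly above each upper fence
sum(x > fence_base)
sum(x > fence_gg)
```

On this data the upper hinge is 122.5 from fivenum() but 122 from quantile(), so the upper fences are 242.5 and 241.25. The observation 242.5 is not strictly above the first fence but is above the second, which accounts for the fifth outlier in the ggplot2 plot.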
If you want the same results, you can make ggplot use boxplot.stats:
# Function to use boxplot.stats to set the box-and-whisker locations
f.bxp = function(x) {
  bxp = boxplot.stats(x)[["stats"]]
  names(bxp) = c("ymin","lower", "middle","upper","ymax")
  bxp
}
# Function to use boxplot.stats for the outliers
f.out = function(x) {
  data.frame(y=boxplot.stats(x)[["out"]])
}
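For orientation, boxplot.stats() returns a list whose stats element holds the five box-and-whisker values and whose out element holds the outliers, which is what the two helpers above rely on. A small sketch on toy values (invented for illustration, not taken from the question):

```r
# Toy values, invented for illustration only
v <- c(1, 2, 3, 4, 100)

s <- boxplot.stats(v)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
s$stats

# Points lying more than 1.5 * IQR beyond the hinges
s$out
```

Here the upper whisker stops at 4, the largest value inside the fence, and 100 is reported in out.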
To use these functions in ggplot:
ggplot(data, aes(0, y=x)) +
  stat_summary(fun.data=f.bxp, geom="boxplot") +
  stat_summary(fun.data=f.out, geom="point")
If you want to replicate the statistics that ggplot uses natively, they are described in ?geom_boxplot as follows:
ymin = lower whisker = smallest observation greater than or equal to lower hinge - 1.5 * IQR
lower = lower hinge, 25% quantile
notchlower = lower edge of notch = median - 1.58 * IQR / sqrt(n)
middle = median, 50% quantile
notchupper = upper edge of notch = median + 1.58 * IQR / sqrt(n)
upper = upper hinge, 75% quantile
ymax = upper whisker = largest observation less than or equal to upper hinge + 1.5 * IQR
From this we can compute:
y = sort(x)
iqr = quantile(y,0.75) - quantile(y,0.25)
ymin = y[which(y >= quantile(y,0.25) - 1.5*iqr)][1]
ymax = tail(y[which(y <= quantile(y,0.75) + 1.5*iqr)],1)
lower = quantile(y,0.25)
upper = quantile(y,0.75)
middle = quantile(y,0.5)
# Overlay the computed statistics on the native boxplot as dashed red lines
ggplot(data, aes(y=x)) +
  geom_boxplot() +
  geom_hline(yintercept=c(ymin, lower, middle, upper, ymax), color='red', linetype='dashed')
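The same whisker definitions also fix which points geom_boxplot() draws as outliers: anything beyond the quantile-based fences. A minimal sketch of that rule in base R follows. quantile_outliers is a hypothetical helper name and the toy vector is invented for illustration; neither comes from ggplot2.

```r
# Hypothetical helper (not part of ggplot2): returns the points outside
# the fences built from the 25% and 75% quantiles, as in ?geom_boxplot.
quantile_outliers <- function(v, coef = 1.5) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)  # default type 7 quartiles
  iqr <- q[2] - q[1]
  v[v < q[1] - coef * iqr | v > q[2] + coef * iqr]
}

# Toy data, invented for illustration only
v <- c(1:9, 15)

quantile_outliers(v)   # quantile-based rule
boxplot.stats(v)$out   # hinge-based rule used by boxplot()
```

On this toy vector the quantile rule flags 15, while boxplot.stats(v)$out is empty: the fivenum() hinges (3 and 8) give an upper fence of 15.5, whereas the quartiles (3.25 and 7.75) give 14.5.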
We can also use ggplot_build to extract these statistics directly from the ggplot object:
p <- ggplot(data, aes(y=x)) + geom_boxplot()
ggplot_build(p)$data[[1]][1:5]
# ymin lower middle upper ymax
# 1 0.2 42.5 93.05 122 232.2
Regarding "r - different number of outliers with ggplot2", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53794922/