r - Exclude outliers from the regression line fitted through a scatterplot without removing them from the plot

Tags: r ggplot2

I have the following data, on which I run the ggplot code below:

data <- structure(list(country_mean_rep = structure(c(73.6995708154506, 
93.5501285347044, 85.1529051987768, 91.1017369727047, 79.5562130177515, 
84.6751054852321, 89.8, 86.8826405867971, 94.2247191011236, 70.2321428571429, 
88.4107142857143), label = "label", format.stata = "%9.2f"), 
    country_mean_crime = c(0.0944206008583691, 0.0565552699228792, 
    0.0336391437308868, 0.205955334987593, 0.130177514792899, 
    0.282700421940928, 0.220512820512821, 0.415647921760391, 
    0.387640449438202, 0.200892857142857, 0.292207792207792), 
    country_name = structure(c(1L, 2L, 3L, 4L, 5L, 7L, 11L, 12L, 
    14L, 16L, 20L), .Label = c("Albania", "Armenia", "Azerbaijan", 
    "Belarus", "Bosnia and Herzegovina", "Brazil", "Bulgaria", 
    "Cambodia", "Chile", "CostaRica", "Croatia", "Czech", "Ecuador", 
    "Estonia", "FYROM", "Georgia", "Germany", "Greece", "Guyana", 
    "Hungary", "Ireland", "Kazakhstan", "Kenya", "Kyrgyzstan", 
    "Latvia", "Lithuania", "Malawi", "Mali", "Moldova", "Philippines", 
    "Poland", "Portugal", "Romania", "Russia", "Senegal", "Serbia&Montenegro", 
    "Slovakia", "Slovenia", "South Africa", "South Korea", "Spain", 
    "SriLanka", "Tajikistan", "Turkey", "Ukraine", "Uzbekistan", 
    "Vietnam"), class = "factor")), row.names = c(NA, -11L), class = c("data.table", 
"data.frame"))

# I would like to plot it with the following code:

library(ggplot2)

ggplot(data, aes(x=country_mean_rep, y=country_mean_crime)) + 
  geom_point() + 
  geom_smooth(aes(colour="linear", fill="linear"), 
              method="lm", 
              formula=y ~ x) + 
  geom_smooth(aes(colour="quadratic", fill="quadratic"), 
              method="lm", 
              formula=y ~ x + I(x^2)) + 
  geom_smooth(aes(colour="cubic", fill="cubic"), 
              method="lm", 
              formula=y ~ x + I(x^2) + I(x^3)) + 
  labs(colour="Functional Form", fill="Functional Form") +
  geom_text(aes(label=country_name), nudge_y=0.02) +
  theme_bw()


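Now suppose the Czech Republic is an outlier, and I want to exclude it from the fits (especially the linear fit). Note that I know there is nothing wrong with the Czech Republic in this example; I need to know how to do this so that I can handle the actual outliers in my real data.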

Is there a way to exclude it from the fit only, while keeping the point in the plot?

Best answer

One approach is to give the layers different data:

ggplot(subset(data, country_name != 'Czech'), aes(x=country_mean_rep, y=country_mean_crime)) + 
  geom_smooth(aes(colour="linear", fill="linear"), 
              method="lm", 
              formula=y ~ x) + 
  geom_smooth(aes(colour="quadratic", fill="quadratic"), 
              method="lm", 
              formula=y ~ x + I(x^2)) + 
  geom_smooth(aes(colour="cubic", fill="cubic"), 
              method="lm", 
              formula=y ~ x + I(x^2) + I(x^3)) + 
  labs(colour="Functional Form", fill="Functional Form") +
  geom_point(data = data, inherit.aes = FALSE, aes(x = country_mean_rep, y = country_mean_crime)) +
  geom_text(data = data, aes(label=country_name, x = country_mean_rep, y = country_mean_crime), inherit.aes = FALSE, nudge_y=0.02) +
  theme_bw()

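In this case, the three lm fits use the subsetted data, while the calls to geom_point and geom_text are given the full data and do not inherit the original aesthetics.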
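The same idea can also be written the other way round: keep the full data in ggplot() and hand only the smoothing layer a reduced data set. The sketch below is an illustration rather than part of the original answer. It relies on the documented behaviour that a layer's data argument may be a function, which ggplot2 calls with the plot's data. The toy data frame and the helper name drop_outlier are made up for this example.

```r
library(ggplot2)

# Toy data (made-up values, for illustration only); "D" plays the outlier.
toy <- data.frame(
  x    = c(1, 2, 3, 4, 5, 6),
  y    = c(1.1, 2.0, 2.9, 9.0, 5.1, 6.0),
  name = c("A", "B", "C", "D", "E", "F")
)

# Hypothetical helper: ggplot2 calls it with the plot's data and
# the layer then uses whatever data frame it returns.
drop_outlier <- function(d) subset(d, name != "D")

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +                                  # all points, outlier included
  geom_text(aes(label = name), nudge_y = 0.3) +   # all labels
  geom_smooth(data = drop_outlier,                # fit without the outlier
              method = "lm",
              formula = y ~ x) +
  theme_bw()

print(p)
```

With this layout, geom_point and geom_text need neither data = nor inherit.aes = FALSE, and only the layers that should ignore the outlier carry the data argument.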

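For this topic (r - Exclude outliers from the regression line fitted through a scatterplot without removing them from the plot), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/68396603/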
