Replace NAs with values from another table based on grouping (not a one-to-one lookup table)

Tags: r dplyr

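My goal is to replace values in one table with values from a lookup table. There is a catch: the lookup table is not a one-to-one lookup table like the one discussed in "Replace na's with value from another df". Instead, the lookup is done on a grouping of several columns. So if the lookup table returns multiple entries for a group, all of those entries need to be filled into the original table.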
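To make "not one-to-one" concrete, here is a minimal sketch on hypothetical toy tables (`facts`, `lookup`, and their columns are invented for illustration and are not the data from this question). A single row with a missing year expands into one row per matching lookup entry:

```r
library(dplyr)

# Hypothetical toy tables, not the data from this question
facts  <- data.frame(Region = c("G1", "G2"), Year = c(NA, 2015))
lookup <- data.frame(Region = c("G1", "G1", "G2"), Year = c(2013, 2014, 2015))

# Take the rows whose Year is missing, drop the empty column,
# and join them to every lookup row that shares the same key
filled <- facts %>%
  filter(is.na(Year)) %>%
  select(-Year) %>%
  inner_join(lookup, by = "Region")

filled
```

Here the one G1 row with a missing `Year` comes back as two rows, one for each G1 entry in `lookup`.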

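I was able to get this working, but I need help with two things:

a) My code is really messy. Every time I have to do something similar, I end up spending a lot of time figuring out what I did before I can reuse it. So I'd appreciate anything cleaner and simpler.

b) It is very slow. I have multiple ifelse statements. When I run this on my actual data, which has 36M records, it takes a long time.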

Here is my dummy source data:

dput(DFile)
structure(list(Region_SL = c("G1", "G1", "G1", "G1", "G2", "G2", 
"G3", "G3", "G3", "G3", "G4", "G4", "G4", "G4", "G5", "G5"), 
    Country_SV = c("United States", "United States", "United States", 
    "United States", "United States", "United States", "United States", 
    "United States", "United States", "United States", "United States", 
    "United States", "United States", "United States", "UK", 
    "UK"), Product_BU = c("Laptop", "Laptop", "Laptop", "Laptop", 
    "Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Laptop", 
    "Laptop", "Laptop", "Laptop", "Laptop", "Power Cord", "Laptop"
    ), Prob_model3 = c(0, 79647405.9878251, 282615405.328728, 
    NA, NA, 363419594.065383, 0, 72870592.8458704, 260045174.088548, 
    369512727.253779, 0, 79906001.2878251, 285128278.558728, 
    405490639.873629, 234, NA), DoS.FY = c(2014, 2013, 2012, 
    NA, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 
    2015, 2016, NA), Insured = c("Covered", "Covered", "Covered", 
    NA, NA, "Not Covered", "Not Covered", "Not Covered", "Not Covered", 
    "Not Covered", "Not Covered", "Not Covered", "Not Covered", 
    "Not Covered", "Covered", NA)), .Names = c("Region_SL", "Country_SV", 
"Product_BU", "Prob_model3", "DoS.FY", "Insured"), row.names = c(NA, 
16L), class = "data.frame")

Here is my grouped lookup table:

dput(Master_Joined)
structure(list(Region_SL = c("G1", "G1", "G1", "G1", "G2", "G3", 
"G4", "G5", "G5", "G5"), Country_SV = c("United States", "United States", 
"United States", "United States", "United States", "United States", 
"United States", "UK", "UK", "UK"), Product_BU = c("Laptop", 
"Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Laptop", "Power Cord", 
"Laptop", "Laptop"), DoS.FY = c(2014, 2013, 2012, 2015, 2015, 
2015, 2015, 2016, 2017, 2017), Insured = c("Covered", "Covered", 
"Covered", "Uncovered", "Not Covered", "Not Covered", "Not Covered", 
"Covered", "Uncovered", "Covered")), .Names = c("Region_SL", 
"Country_SV", "Product_BU", "DoS.FY", "Insured"), row.names = c(NA, 
10L), class = "data.frame")

It is "grouped" in the sense that every entry is unique.

Finally, here is my code:

#Which fields are missing?
Missing<-DFile[is.na(DFile$Prob_model3),]

Column_name<-colnames(DFile)[4]
colnames(DFile)[4]<-"temp_prob"

#Replace Prob_model3
DFile<-DFile %>%
  group_by(Region_SL, Country_SV, Product_BU) %>%
  dplyr::mutate(Average_Value = mean(temp_prob,na.rm = TRUE)) %>%
  rowwise() %>%
  dplyr::mutate(Col_name1 = ifelse(is.na(temp_prob),Average_Value,temp_prob)) %>%
  dplyr::select(Region_SL:Product_BU,DoS.FY,Insured,Col_name1)

colnames(DFile)[6]<-Column_name

  Missing$DoS.FY<-NULL

  Missing_FYear<-Missing %>% 
    inner_join(Master_Joined,by = c("Region_SL", "Country_SV", "Product_BU")) %>%
    group_by(Region_SL, Country_SV, Product_BU, DoS.FY, Insured.y) %>%
    dplyr::distinct() %>%
    left_join(Missing)

  Missing_FYear$Prob_model3<-NULL

  DFile <-DFile %>% 
    left_join(Missing_FYear,by = c("Region_SL", "Country_SV", "Product_BU", "Insured")) %>%
    dplyr::rowwise() %>%
    mutate(DoS.FY=ifelse((is.na(`DoS.FY.y`)|is.na(`DoS.FY.x`)),sum(`DoS.FY.y`,`DoS.FY.x`,na.rm=TRUE),`DoS.FY.x`), Insured_Combined = ifelse(is.na(Insured),Insured.y,Insured)) %>%
    dplyr::select(Region_SL:Product_BU,Prob_model3,DoS.FY, Insured_Combined)  

  colnames(DFile)[6]<-"Insured"
  #Check again
  Missing<-DFile[is.na(DFile$Prob_model3),] 

  if (nrow(Missing) > 1)
  { #you have NaNs, replace them with 0
    DFile[is.nan(DFile$Prob_model3),"Prob_model3"] <- 0
   }
  Missing<-DFile[is.na(DFile$Prob_model3),] 

Expected output: DFile after running the code above.

I'd sincerely appreciate your help. I've been struggling with this problem for about a week.

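Best answer

One idea is to find the Region_SL values whose rows contain NA. Once we have them, we use plyr's rbind.fill to rbind the matching rows of Master_Joined onto DFile, which gives new_df. We then filter out any row that has an NA in any column except the last one (column 6, Prob_model3). Finally, within each Region_SL group, we replace the NA values of Prob_model3 with the group mean. (Equivalently, you could store the group mean in a helper column such as Prob_model4 and merge the two columns with coalesce.)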

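library(dplyr)

# Row indices of DFile that contain at least one NA
ind <- unique(as.integer(which(is.na(DFile), arr.ind = TRUE)[, 1]))

# Stack the lookup rows for the affected regions on top of DFile;
# rbind.fill pads the Prob_model3 column that the lookup table lacks with NA
new_df <- plyr::rbind.fill(Master_Joined[Master_Joined$Region_SL %in% DFile$Region_SL[ind], ], DFile)

new_df %>% 
  arrange(Region_SL, Prob_model3) %>% 
  filter(complete.cases(.[-6])) %>%   # ignore column 6 (Prob_model3) when checking for NA
  group_by(Region_SL) %>% 
  mutate(Prob_model3 = replace(Prob_model3, is.na(Prob_model3), mean(Prob_model3, na.rm = TRUE))) %>%
  ungroup()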

# A tibble: 21 × 6
#   Region_SL    Country_SV Product_BU DoS.FY     Insured Prob_model3
#       <chr>         <chr>      <chr>  <dbl>       <chr>       <dbl>
#1         G1 United States     Laptop   2014     Covered           0
#2         G1 United States     Laptop   2013     Covered    79647406
#3         G1 United States     Laptop   2012     Covered   282615405
#4         G1 United States     Laptop   2014     Covered   120754270
#5         G1 United States     Laptop   2013     Covered   120754270
#6         G1 United States     Laptop   2012     Covered   120754270
#7         G1 United States     Laptop   2015   Uncovered   120754270
#8         G2 United States     Laptop   2015 Not Covered   363419594
#9         G2 United States     Laptop   2015 Not Covered   363419594
#10        G3 United States     Laptop   2015 Not Covered           0
# ... with 11 more rows
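
The coalesce formulation of the same fill step might look like the sketch below. It runs on hypothetical toy data (`toy` and its values are made up for illustration), and `Prob_model4` is only a temporary helper column:

```r
library(dplyr)

# Hypothetical toy data standing in for the combined table
toy <- data.frame(
  Region_SL   = c("G1", "G1", "G1", "G2", "G2"),
  Prob_model3 = c(0, 90, NA, 40, NA)
)

imputed <- toy %>%
  group_by(Region_SL) %>%
  mutate(
    Prob_model4 = mean(Prob_model3, na.rm = TRUE),    # mean of each Region_SL group
    Prob_model3 = coalesce(Prob_model3, Prob_model4)  # keep existing values, fill NA from the mean
  ) %>%
  ungroup() %>%
  select(-Prob_model4)

imputed
```

Both `mean()` and `coalesce()` are vectorised within each group, so no rowwise() or ifelse() is involved. If you would rather not depend on plyr, dplyr::bind_rows() also fills columns missing from one of the tables with NA and should work in place of rbind.fill here, though I have not benchmarked that on data of your size.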

A similar question about replacing NAs with values from another table based on grouping (not a one-to-one lookup table) can be found on Stack Overflow: https://stackoverflow.com/questions/41758065/
