Is this a bug in dplyr's summarise?

Tags: r dplyr tidyverse

library(tidyverse)  # provides read_csv, the dplyr verbs, crossing and invoke_map_df

stats <- read_csv('stats.csv')

## Warning: Installed Rcpp (0.12.12) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.

I'm fairly sure I saw the same behavior before updating Rcpp.

Group aggregation using filter and invoke_map
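test <- function(impl, size) {
  stats %>%
    filter(message.size==size & implementation==impl) %>%
    select(ts.in, ts.out) %>%
    # Assumption: begin/end are taken as the earliest ts.in and latest ts.out.
    summarise(begin=min(ts.in),
              end=max(ts.out),
              process.time=end - begin,
              message.rate=size * 10000/as.double(process.time)/1024/1024)
}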

invoke_map_df(test, crossing(impl=c('Camel', 'Spark'), size=c(1024, 1024*5, 1024*10)) %>% transpose())

## # A tibble: 6 x 4
##                 begin                 end process.time message.rate
##                <dttm>              <dttm>       <time>        <dbl>
## 1 2017-07-17 04:27:52 2017-07-17 04:28:13      21 secs    0.4650298
## 2 2017-07-17 04:30:25 2017-07-17 04:32:02      97 secs   30.2029639
## 3 2017-07-17 04:32:58 2017-07-17 04:36:17     199 secs   29.4440955
## 4 2017-07-17 04:18:31 2017-07-17 04:18:54      23 secs    0.4245924
## 5 2017-07-17 04:19:47 2017-07-17 04:21:29     102 secs   28.7224265
## 6 2017-07-17 04:23:10 2017-07-17 04:26:28     198 secs   29.5928030

Using group_by and summarise
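stats %>%
  group_by(implementation, message.size) %>%
  # Assumption: the first four definitions are reconstructed; total.size is a guess.
  summarise(begin=min(ts.in),
            end=max(ts.out),
            duration=end - begin,
            total.size=sum(message.size),
            message.rate=total.size/as.numeric(duration)/1024/1024) %>%
  ungroup() %>%
  select(begin, end, duration, message.rate)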

## # A tibble: 6 x 4
##                 begin                 end       duration message.rate
##                <dttm>              <dttm>         <time>        <dbl>
## 1 2017-07-17 04:27:52 2017-07-17 04:28:13 21.000000 secs    0.4650298
## 2 2017-07-17 04:30:25 2017-07-17 04:32:02  1.616667 secs   30.2029639
## 3 2017-07-17 04:32:58 2017-07-17 04:36:17  3.316667 secs   29.4440955
## 4 2017-07-17 04:18:31 2017-07-17 04:18:54 23.000000 secs    0.4245924
## 5 2017-07-17 04:19:47 2017-07-17 04:21:29  1.700000 secs   28.7224265
## 6 2017-07-17 04:23:10 2017-07-17 04:26:28  3.300000 secs   29.5928030

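For some reason the duration column (process.time in the first version) is computed incorrectly, yet surprisingly message.rate, which depends on it, is correct! Am I doing something wrong here?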
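One thing that may be worth checking (an assumption, not confirmed against dplyr's internals): in base R, subtracting two POSIXct values picks the difftime unit automatically, so a 21-second span is stored in secs while a 97-second span is stored in mins. The sketch below uses two begin/end pairs copied from the output above; the variable names and the UTC time zone are made up for illustration.

```r
# Two begin/end pairs taken from rows 1 and 2 of the output above.
# The time zone is an assumption; it does not affect the differences.
begin_short <- as.POSIXct("2017-07-17 04:27:52", tz = "UTC")
end_short   <- as.POSIXct("2017-07-17 04:28:13", tz = "UTC")
begin_long  <- as.POSIXct("2017-07-17 04:30:25", tz = "UTC")
end_long    <- as.POSIXct("2017-07-17 04:32:02", tz = "UTC")

# Plain subtraction lets R choose the unit per value.
short <- end_short - begin_short
long  <- end_long - begin_long

print(units(short))       # unit chosen for the 21-second span
print(units(long))        # unit chosen for the 97-second span
print(as.numeric(long))   # numeric value expressed in the chosen unit

# Asking for seconds explicitly gives comparable values.
long_secs <- difftime(end_long, begin_long, units = "secs")
print(as.numeric(long_secs))
print(as.numeric(long, units = "secs"))
```

If this is what is happening, writing the duration as difftime(end, begin, units = "secs") inside summarise would keep every group in the same unit.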

Using group_by and do
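stats %>%
  group_by(implementation, message.size) %>%
  # Assumption: the first four definitions are reconstructed; total.size is a guess.
  do(summarise(., begin=min(ts.in),
            end=max(ts.out),
            duration=end - begin,
            total.size=sum(message.size),
            message.rate=total.size/as.numeric(duration)/1024/1024)) %>%
  ungroup() %>%
  select(begin, end, duration, message.rate)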

## # A tibble: 6 x 4
##                 begin                 end duration message.rate
##                <dttm>              <dttm>   <time>        <dbl>
## 1 2017-07-17 04:27:52 2017-07-17 04:28:13  21 secs    0.4650298
## 2 2017-07-17 04:30:25 2017-07-17 04:32:02  97 secs   30.2029639
## 3 2017-07-17 04:32:58 2017-07-17 04:36:17 199 secs   29.4440955
## 4 2017-07-17 04:18:31 2017-07-17 04:18:54  23 secs    0.4245924
## 5 2017-07-17 04:19:47 2017-07-17 04:21:29 102 secs   28.7224265
## 6 2017-07-17 04:23:10 2017-07-17 04:26:28 198 secs   29.5928030

The behavior of do matches the filter and invoke_map combination.

  • R Markdown version: http://rpubs.com/yeyan/292004
  • Git repo (includes the data): https://github.com/yeyan/dplyr_summarise
    # Warning: Installed Rcpp (0.12.12) different from Rcpp used to build dplyr (0.12.11)
    # You may need to run install.packages several times, restarting the R session in between

