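r - Calculate the mean of the last 3 months (3M) of values in a data frame and append it to the data frame, repeated 18 times, without using a loop in R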

Tags: r dataframe data-wrangling

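I am trying to calculate the mean of the last 3 months (3M) of values and append it to the bottom of the data frame, then use the result to calculate the next 3M mean (that is, 2 months of data plus the newly appended mean), and repeat this 18 times.

I am looking for an efficient way to do this so that it takes less time. I first tried a double loop, but then found a way to do it with a single loop and lapply().

Still, I would like to know whether there is a better way that avoids the loop altogether.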

library(dplyr)
library(forecast)
library(readxl)
library(data.table)
library(clock)
library(lubridate)
library(tsibble)

df <- read_excel("C:/X/X/X- X/X/Book7.xlsx",sheet = "Loop")

freq = 18

colnames(df)[1]="Dates"
Dates <- df$Dates

Working <- df[,-1] 

#--------------------------------------- Creation of Functions ---------------------------------------#

Moving_Average_3M <- function(Working)
{
  last_3_row <- tail(Working,3)
  
  # Convert the `last_3_row` object to a two-dimensional object as tail() function returns a vector
  last_3_row_df <- data.frame(last_3_row)
  
  # Calculate the mean of the last three rows
  mean_last_3 <- data.frame(colMeans(last_3_row_df,na.rm = TRUE))
  
  return(mean_last_3)
}

Rename_Col_and_bind <- function(Working,Output)
{
  colnames(Output) <- colnames(Working)
  
  Working <- rbind(Working,Output)
  
  return(Working)
}

#--------------------------------------- End of Creation of Functions ---------------------------------------#

#------------------------------------------ Loops for Execution ---------------------------------------------#

for(i in 1:freq)
{
  Output <- data.frame(lapply(Working,Moving_Average_3M))
  
  Working <- Rename_Col_and_bind(Working,Output)
  
}

View(Output)

The data frame I am working with is as follows:

structure(list(Dates = c("2019-01-01", "2019-02-01", "2019-03-01", 
"2019-04-01", "2019-05-01", "2019-06-01", "2019-07-01", "2019-08-01", 
"2019-09-01", "2019-10-01", "2019-11-01", "2019-12-01", "2020-01-01", 
"2020-02-01", "2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", 
"2020-07-01", "2020-08-01", "2020-09-01", "2020-10-01", "2020-11-01", 
"2020-12-01", "2021-01-01", "2021-02-01", "2021-03-01", "2021-04-01", 
"2021-05-01", "2021-06-01", "2021-07-01", "2021-08-01", "2021-09-01", 
"2021-10-01", "2021-11-01", "2021-12-01", "2022-01-01", "2022-02-01", 
"2022-03-01", "2022-04-01", "2022-05-01", "2022-06-01", "2022-07-01", 
"2022-08-01", "2022-09-01", "2022-10-01"), `XYZ|851` = c(0, 0, 
0, 0, 0, 0, 0, 0, 0, 206, 1814, 2324, 772, 1116, 1636, 1906, 
957, 829, 911, 786, 938, 1313, 2384, 1554, 1777, 1635, 1534, 
1015, 827, 982, 685, 767, 511, 239, 5400, 1301, 426, 261, 201, 
33, 27, 28, 46, 11, 55, 47), `XYZ|574` = c(0, 0, 0, 0, 0, 0, 
0, 0, 74, 179, 464, 880, 324, 184, 90, 170, 140, 96, 78, 83, 
83, 121, 245, 9000, 332, 123, 117, 138, 20, 42, 70, 70, 42, 103, 
490, 7500, 488, 245, 142, 95, 63, 343, 57, 113, 100, 105)), row.names = c(NA, 
-46L), class = c("tbl_df", "tbl", "data.frame"))

As described above, a minimal output after two iterations is shown below. This is the loop used to get the two iterations:

for(i in 1:2)
{
  Output <- data.frame(lapply(Working,Moving_Average_3M))
  
  Working <- Rename_Col_and_bind(Working,Output)
  
}

The resulting Working data frame is as follows:

 structure(list(`XYZ|851` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 206, 
1814, 2324, 772, 1116, 1636, 1906, 957, 829, 911, 786, 938, 1313, 
2384, 1554, 1777, 1635, 1534, 1015, 827, 982, 685, 767, 511, 
239, 5400, 1301, 426, 261, 201, 33, 27, 28, 46, 11, 55, 47, 37.6666666666667, 
46.5555555555556), `XYZ|574` = c(0, 0, 0, 0, 0, 0, 0, 0, 74, 
179, 464, 880, 324, 184, 90, 170, 140, 96, 78, 83, 83, 121, 245, 
9000, 332, 123, 117, 138, 20, 42, 70, 70, 42, 103, 490, 7500, 
488, 245, 142, 95, 63, 343, 57, 113, 100, 105, 106, 103.666666666667
)), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", 
"10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", 
"21", "22", "23", "24", "25", "26", "27", "28", "29", "30", "31", 
"32", "33", "34", "35", "36", "37", "38", "39", "40", "41", "42", 
"43", "44", "45", "46", "last_3_row", "last_3_row1"), class = c("tbl_df", 
"tbl", "data.frame"))
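
As a quick sanity check, the two appended values in each column can be reproduced by hand. This is only a minimal sketch: the variable names are made up for illustration, and the inputs are the last three observed values (2022-08 to 2022-10) copied from the data above.

```r
# Last three observed values of each column (Aug, Sep, Oct 2022)
last3_851 <- c(11, 55, 47)
last3_574 <- c(113, 100, 105)

# Iteration 1: mean of the last three observed values
first_851 <- mean(last3_851)   # 113 / 3
first_574 <- mean(last3_574)   # 318 / 3

# Iteration 2: the two most recent observed values plus the value from iteration 1
second_851 <- mean(c(last3_851[-1], first_851))
second_574 <- mean(c(last3_574[-1], first_574))

round(c(first_851, second_851), 5)
round(c(first_574, second_574), 5)
```

The results should agree with the last two entries of each column in the output above.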

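To explain this further, I have added an Excel screenshot for clarity:

The cells in blue are the output, identical to what you see in the Working data frame, and the formula used is highlighted in yellow.

[Image: Excel screenshot of the calculation]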

Best answer

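Some thoughts up front:

  • This is a cumulative average, so a simple vectorized calculation will not work.
  • It is not a rolling operation.
  • It is a reduction (à la Reduce or purrr::reduce), because the calculation of one value depends on the row (and the calculation) before it; it is closer to a recursive approach, though we will not use recursion explicitly here.
  • Side note: iteratively adding rows to an object with rbind works conceptually, but it is very inefficient and scales poorly. Instead, I preallocate the space once (filled with NA) and fill the rows with the new values, rather than calling rbind on every iteration.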
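
# Note: Working is assumed here to still contain the Dates column in position 1
# (i.e. the full data frame), which is why column 1 is excluded with -1 below.

# Preallocate the 18 extra rows once, filled with NA
Working2 <- rbind(Working, Working[1:18,][NA,])

# Fill each new row with the per-column mean of the three rows above it
for (i in (nrow(Working)+1):nrow(Working2)) 
  Working2[i,-1] <- lapply(Working2[i - 1:3,-1], mean)
as.data.frame(Working2)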
#         Dates    XYZ|851   XYZ|574
# 1  2019-01-01    0.00000    0.0000
# 2  2019-02-01    0.00000    0.0000
# 3  2019-03-01    0.00000    0.0000
# 4  2019-04-01    0.00000    0.0000
# 5  2019-05-01    0.00000    0.0000
# 6  2019-06-01    0.00000    0.0000
# 7  2019-07-01    0.00000    0.0000
# 8  2019-08-01    0.00000    0.0000
# 9  2019-09-01    0.00000   74.0000
# 10 2019-10-01  206.00000  179.0000
# 11 2019-11-01 1814.00000  464.0000
# 12 2019-12-01 2324.00000  880.0000
# 13 2020-01-01  772.00000  324.0000
# 14 2020-02-01 1116.00000  184.0000
# 15 2020-03-01 1636.00000   90.0000
# 16 2020-04-01 1906.00000  170.0000
# 17 2020-05-01  957.00000  140.0000
# 18 2020-06-01  829.00000   96.0000
# 19 2020-07-01  911.00000   78.0000
# 20 2020-08-01  786.00000   83.0000
# 21 2020-09-01  938.00000   83.0000
# 22 2020-10-01 1313.00000  121.0000
# 23 2020-11-01 2384.00000  245.0000
# 24 2020-12-01 1554.00000 9000.0000
# 25 2021-01-01 1777.00000  332.0000
# 26 2021-02-01 1635.00000  123.0000
# 27 2021-03-01 1534.00000  117.0000
# 28 2021-04-01 1015.00000  138.0000
# 29 2021-05-01  827.00000   20.0000
# 30 2021-06-01  982.00000   42.0000
# 31 2021-07-01  685.00000   70.0000
# 32 2021-08-01  767.00000   70.0000
# 33 2021-09-01  511.00000   42.0000
# 34 2021-10-01  239.00000  103.0000
# 35 2021-11-01 5400.00000  490.0000
# 36 2021-12-01 1301.00000 7500.0000
# 37 2022-01-01  426.00000  488.0000
# 38 2022-02-01  261.00000  245.0000
# 39 2022-03-01  201.00000  142.0000
# 40 2022-04-01   33.00000   95.0000
# 41 2022-05-01   27.00000   63.0000
# 42 2022-06-01   28.00000  343.0000
# 43 2022-07-01   46.00000   57.0000
# 44 2022-08-01   11.00000  113.0000
# 45 2022-09-01   55.00000  100.0000
# 46 2022-10-01   47.00000  105.0000
# 47       <NA>   37.66667  106.0000
# 48       <NA>   46.55556  103.6667
# 49       <NA>   43.74074  104.8889
# 50       <NA>   42.65432  104.8519
# 51       <NA>   44.31687  104.4691
# 52       <NA>   43.57064  104.7366
# 53       <NA>   43.51395  104.6859
# 54       <NA>   43.80049  104.6305
# 55       <NA>   43.62836  104.6843
# 56       <NA>   43.64760  104.6669
# 57       <NA>   43.69215  104.6606
# 58       <NA>   43.65604  104.6706
# 59       <NA>   43.66526  104.6660
# 60       <NA>   43.67115  104.6658
# 61       <NA>   43.66415  104.6675
# 62       <NA>   43.66685  104.6664
# 63       <NA>   43.66738  104.6666
# 64       <NA>   43.66613  104.6668

You can then fill in the dates as needed.
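
One possible way to fill in the dates is sketched below. It is an illustration rather than part of the original solution: `toy` is a small made-up stand-in for Working2, and the sketch assumes that Dates is a character column of month starts in "YYYY-MM-DD" form and that the NA dates only occur in the appended rows at the bottom.

```r
# Toy stand-in for Working2: 3 observed months plus 2 appended rows
toy <- data.frame(
  Dates = c("2022-08-01", "2022-09-01", "2022-10-01", NA, NA),
  value = c(11, 55, 47, 37.66667, 46.55556),
  stringsAsFactors = FALSE
)

n_obs <- sum(!is.na(toy$Dates))   # rows that already have a date
n_new <- nrow(toy) - n_obs        # appended rows still missing a date

# Monthly sequence starting at the last known date; drop that first element
last_date <- as.Date(toy$Dates[n_obs])
new_dates <- seq(last_date, by = "month", length.out = n_new + 1)[-1]

# Write the new dates back in the same character format as the original column
toy$Dates[(n_obs + 1):nrow(toy)] <- format(new_dates)
toy
```

The same steps should carry over to Working2 by replacing `toy` with `Working2`, as long as the assumptions above hold.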

(I use as.data.frame(Working2) only to show all the decimals, since the tibble print method often hides some of the precision.)
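
Since the calculation is described above as a reduction, here is a minimal sketch of how it could be written with base R's Reduce for a single numeric vector. The helper name `extend_ma3` and its arguments are hypothetical and not part of the solution above. Reduce still iterates internally, so this is a restyling of the loop, not a speed-up.

```r
# Hypothetical helper: append n_ahead values to x, each one the mean of the
# three values before it (observed or previously appended).
extend_ma3 <- function(x, n_ahead = 18, k = 3) {
  stopifnot(length(x) >= k)
  # The state carried through Reduce is the current window of k values.
  # Each step drops the oldest value and appends the mean of the window.
  windows <- Reduce(
    function(window, i) c(window[-1], mean(window)),
    seq_len(n_ahead),
    accumulate = TRUE
    , tail(x, k)  # init: the last k observed values
  )
  # windows[[1]] is the initial state; the last element of every later
  # window is the newly computed value.
  new_values <- vapply(windows[-1], function(w) w[k], numeric(1))
  c(x, new_values)
}

# Usage with the last three observed values of XYZ|851
res <- extend_ma3(c(11, 55, 47), n_ahead = 2)
res
```

To apply it to every value column of a data frame, something like `lapply(df[-1], extend_ma3)` could be used, assuming the first column holds the dates.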

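Regarding "r - Calculate the mean of the last 3 months (3M) of values in a data frame and append it to the data frame, repeated 18 times, without using a loop in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/77188481/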
