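I have a very long data frame that contains meteorological data from a mast. It holds observations (data$value) of different parameters (wind speed, wind direction, air temperature, etc., in data$param) taken at the same time at different heights (data$z).

I am trying to slice this data efficiently by $time and then apply functions to all of the data collected in each slice. Usually a function is applied to a single $param at a time (i.e. I apply different functions to wind speed than I do to air temperature).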


My current approach is to use a data.frame and ddply.


# find good data ----
df <- data[((data$param == "wind speed") &
              !is.na(data$value)), ]

I then run my function on df using ddply():
library(plyr)
df.tav <- ddply(df,
                .(time),
                function(x) {
                  data.frame(V1 = sum(x$value) + sum(x$z),
                             V2 = sum(x$value) / sum(x$z))
                })

Usually V1 and V2 are calls to other functions; these are just examples. I do need to run multiple functions on the same data, though.


I have on the order of hundreds of towers to process, each with a year of data and 10-12 heights, so I'm looking for something faster.

data <-  structure(list(time = structure(c(1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 
1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 
1262305200), class = c("POSIXct", "POSIXt"), tzone = ""), z = c(0, 
0, 0, 100, 100, 100, 120, 120, 120, 140, 140, 140, 160, 160, 
160, 180, 180, 180, 200, 200, 200, 40, 40, 40, 50, 50, 50, 60, 
60, 60, 80, 80, 80, 0, 0, 0, 100, 100, 100, 120), param = c("temperature", 
"humidity", "barometric pressure", "wind direction", "turbulence", 
"wind speed", "wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"wind direction", "turbulence", "wind speed", "wind direction", 
"turbulence", "wind speed", "wind direction", "turbulence", "wind speed", 
"temperature", "barometric pressure", "humidity", "wind direction", 
"wind speed", "turbulence", "wind direction"), value = c(-2.5, 
41, 816.9, 248.4, 0.11, 4.63, 249.8, 0.28, 4.37, 255.5, 0.32, 
4.35, 252.4, 0.77, 5.08, 248.4, 0.65, 3.88, 313, 0.94, 6.35, 
250.9, 0.1, 4.75, 253.3, 0.11, 4.68, 255.8, 0.1, 4.78, 254.9, 
0.11, 4.7, -3.3, 816.9, 42, 253.2, 2.18, 0.27, 229.5)), .Names = c("time", 
"z", "param", "value"), row.names = c(NA, 40L), class = "data.frame")


Use data.table:

library(data.table)
dt = data.table(data)

setkey(dt, param)  # sort by param to look it up fast

dt[J('wind speed')][!is.na(value),
                    list(sum(value) + sum(z), sum(value)/sum(z)),
                    by = time]
#                  time      V1         V2
#1: 2009-12-31 18:10:00 1177.57 0.04209735
#2: 2009-12-31 18:20:00  102.18 0.02180000
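
Since the question mentions running several functions over the same slice, here is a minimal sketch of how the same kind of grouped call might name its outputs and call user-defined helpers. The helper names f_sum and f_ratio and the small table toy are hypothetical, made up for illustration; they are not part of the original data or code.

```r
library(data.table)

# Hypothetical helpers standing in for the real per-slice functions
f_sum   <- function(value, z) sum(value) + sum(z)
f_ratio <- function(value, z) sum(value) / sum(z)

# Made-up toy data: two time stamps, three observations
toy <- data.table(time  = c(1, 1, 2),
                  z     = c(100, 120, 100),
                  value = c(4, 6, 2))

# One pass over each time slice, returning several named results per slice
res <- toy[!is.na(value),
           list(V1 = f_sum(value, z), V2 = f_ratio(value, z)),
           by = time]
print(res)
```

Naming the list elements in j replaces the default V1, V2 column names, and each helper only sees the rows of the current time slice.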

# make dt smaller because I'm lazy
dt = dt[param %in% c('wind direction', 'wind speed')]

# now let's start - create another data.table
# that will have param and corresponding function
fns = data.table(p = c('wind direction', 'wind speed'),
                 fn = c(quote(sum(value) + sum(z)), quote(sum(value) / sum(z))),
                 key = 'p')
#                 p     fn
# 1: wind direction <call>    # the fn column contains unevaluated calls
# 2:     wind speed <call>    # i.e. this is getting fancy!

# now we can evaluate different functions for different params,
# sliced by param and time
dt[!is.na(value), {param; eval(fns[J(param)]$fn[[1]], .SD)},
   by = list(param, time)]
#            param                time           V1
#1: wind direction 2009-12-31 18:10:00 3.712400e+03
#2: wind direction 2009-12-31 18:20:00 7.027000e+02
#3:     wind speed 2009-12-31 18:10:00 4.209735e-02
#4:     wind speed 2009-12-31 18:20:00 2.180000e-02

P.S. I think the fact that I have to use param in some way before eval for eval to work is a bug.

UPDATE: As of version 1.8.11 this bug has been fixed and the following works:
dt[!is.na(value), eval(fns[J(param)]$fn[[1]], .SD), by = list(param, time)]
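
If a column of quoted calls feels too fancy, a plain named list of functions might be an alternative way to pick a function per parameter. This is a different technique from the eval approach above, shown only as a sketch: fn_list and the toy table are hypothetical, and the functions simply mirror the two example expressions.

```r
library(data.table)

# Hypothetical lookup: one ordinary R function per parameter name
fn_list <- list(
  "wind direction" = function(value, z) sum(value) + sum(z),
  "wind speed"     = function(value, z) sum(value) / sum(z)
)

# Made-up toy data with two parameters at a single time stamp
toy <- data.table(param = c("wind speed", "wind speed", "wind direction"),
                  time  = c(1, 1, 1),
                  z     = c(100, 100, 50),
                  value = c(4, 6, 250))

# Inside j the grouping column param has length 1, so it can index the list
res <- toy[!is.na(value),
           list(V1 = fn_list[[param[1]]](value, z)),
           by = list(param, time)]
print(res)
```

The trade-off is that every function in the list has to accept the same arguments, whereas quoted calls can refer to any column of the slice.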

