Rolling functions
Fast rolling functions to calculate aggregates on a sliding window. For a user-defined rolling function see frollapply. For a "time-aware" (irregularly spaced time series) rolling function see frolladapt.
Usage
frollmean(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"),
na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA)
frollsum(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"),
na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA)
frollmax(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"),
na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA)
frollmin(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"),
na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA)
frollprod(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"),
na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA)
frollmedian(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"),
na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA)
frollvar(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"),
na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA)
frollsd(x, n, fill=NA, algo=c("fast","exact"), align=c("right","left","center"),
na.rm=FALSE, has.nf=NA, adaptive=FALSE, partial=FALSE, give.names=FALSE, hasNA)
Arguments
- x
Integer, numeric or logical vector, coerced to numeric, on which sliding window calculates an aggregate function. It supports vectorized input, then it needs to be a data.table, data.frame or a list, in which case a rolling function is applied to each column/vector.
- n
Integer, non-negative, non-NA, rolling window size. This is the total number of included values in aggregate function. In case of an adaptive rolling function, the window size has to be provided as a vector for each individual value of x. It supports vectorized input, then it needs to be a vector, or in case of an adaptive rolling a list of vectors.
- fill
Numeric; value to pad by for an incomplete window iteration. Defaults to NA. When partial=TRUE this argument is ignored.
- algo
Character, default "fast". When set to "exact", a slower (in some cases more accurate) algorithm is used. It will use multiple cores where available. See Details for more information.
- align
Character, specifying the "alignment" of the rolling window, defaulting to "right". "right" covers preceding rows (the window ends on the current value); "left" covers following rows (the window starts on the current value); "center" is halfway in between (the window is centered on the current value, biased towards "left" when n is even).
- na.rm
Logical, default FALSE. Should missing values be removed when calculating aggregate function on a window?
- has.nf
Logical. If it is known whether x contains non-finite values (NA, NaN, Inf, -Inf), then setting this to TRUE or FALSE may speed up computation. Defaults to NA. See has.nf argument section below for details.
- adaptive
Logical, default FALSE. Should the rolling function be calculated adaptively? See Adaptive rolling functions section below for details.
- partial
Logical, default FALSE. Should the rolling window size(s) provided in n be computed also for leading incomplete running window? See partial argument section below for details.
- give.names
Logical, default FALSE. When TRUE, names are automatically generated corresponding to names of x and names of n. If answer is an atomic vector, then the argument is ignored, see examples.
- hasNA
Logical. Deprecated, use has.nf argument instead.
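To make the n and align semantics concrete, the following base R sketch recomputes a rolling mean from scratch for each alignment. The helper roll_mean_ref is hypothetical and is not part of data.table. The window placement for "center" with an even n (one more following value than preceding values) is an assumption based on the "biased towards left" wording above.

```r
# Hypothetical reference implementation in base R; not data.table's code.
roll_mean_ref <- function(x, n, align = c("right", "left", "center"), fill = NA_real_) {
  align <- match.arg(align)
  len <- length(x)
  # how far the window extends past the current index
  ahead <- switch(align, right = 0L, left = n - 1L, center = n %/% 2L)
  ans <- rep(fill, len)
  for (i in seq_len(len)) {
    hi <- i + ahead
    lo <- hi - n + 1L
    # only complete windows are computed, incomplete ones keep `fill`
    if (lo >= 1L && hi <= len) ans[i] <- mean(x[lo:hi])
  }
  ans
}

x <- 1:6 / 2
right_ans  <- roll_mean_ref(x, 3, "right")   # NA NA 1.0 1.5 2.0 2.5
left_ans   <- roll_mean_ref(x, 3, "left")    # 1.0 1.5 2.0 2.5 NA NA
center_ans <- roll_mean_ref(x, 3, "center")  # NA 1.0 1.5 2.0 2.5 NA
```

In each case the result has the same length as the input, and positions without a complete window hold the fill value.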
Details
froll* functions accept vector, list, data.frame or data.table. Functions operate on a single vector; when passing a non-atomic input, then the function is applied column-by-column, not to the complete set of columns at once.
Argument n allows multiple values to apply rolling function on multiple window sizes. If adaptive=TRUE, then n can be a list to specify multiple window sizes for adaptive rolling computation. See Adaptive rolling functions section below for details.
When multiple columns or multiple window widths are provided, then they are run in parallel. The exception is for algo="exact" or adaptive=TRUE, which runs in parallel even for single column and single window width. By default, data.table uses only half of available CPUs, see setDTthreads for details on how to tune CPU usage.
Setting options(datatable.verbose=TRUE) will display various information about how the rolling function was processed. The information is not printed in real time but only at the end of the processing.
Value
For a non vectorized input (x is not a list, and n specifies a single rolling window) a vector is returned, for convenience. Thus, rolling functions can be used conveniently within data.table syntax. For a vectorized input a list is returned.
Note
Be aware that rolling functions operate on the physical order of input. If the intent is to roll values in a vector by a logical window, for example an hour, or a day, then one has to ensure that there are no gaps in the input, or use an adaptive rolling function to handle gaps, for which we provide helper function frolladapt to generate adaptive window size.
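A minimal base R sketch of the issue: with an irregular time index, a fixed n covers a varying time span, so a window width per observation is needed instead. The helper span_widths is hypothetical and only illustrates the idea. It is not the frolladapt implementation and makes no claim about its signature or its handling of leading observations.

```r
# Hypothetical helper: for each observation, count how many observations
# (including itself) fall within the last `span` time units.
# Assumes `t` is sorted in increasing order.
span_widths <- function(t, span) {
  vapply(seq_along(t), function(i) sum(t > t[i] - span & t <= t[i]), integer(1))
}

day <- c(1, 2, 3, 6, 7)        # days 4 and 5 are missing
val <- c(10, 20, 30, 40, 50)

w <- span_widths(day, 3)       # widths for a 3-day window: 1 2 3 1 2

# rolling 3-day mean computed on the physical rows selected by `w`
means <- vapply(seq_along(val),
                function(i) mean(val[(i - w[i] + 1L):i]),
                numeric(1))
means                          # 10 15 20 40 45
```

A fixed window of n=3 rows at day 6 would instead average the values observed on days 2, 3 and 6, which spans five days rather than three.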
has.nf argument
has.nf can be used to speed up processing in cases when it is known if x contains (or not) non-finite values (NA, NaN, Inf, -Inf).
- Default has.nf=NA uses faster implementation that does not support non-finite values, but when non-finite values are detected it will re-run non-finite aware implementation.
- has.nf=TRUE uses non-finite aware implementation straightaway.
- has.nf=FALSE uses faster implementation that does not support non-finite values. Then depending on the rolling function it will either:
  - (mean, sum, prod, var, sd) detect non-finite, re-run non-finite aware.
  - (max, min, median) does not detect non-finites and may silently produce an incorrect answer.
In general has.nf=FALSE && any(!is.finite(x)) should be considered undefined behavior. Therefore has.nf=FALSE should be used with care.
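As a small illustration of the condition above, base R's is.finite() is FALSE for all four kinds of non-finite values, so the check is cheap to run before deciding whether has.nf=FALSE is safe. The variable names are illustrative only.

```r
x <- c(1, NA, NaN, Inf, -Inf, 2)

finite <- is.finite(x)
finite
#> [1]  TRUE FALSE FALSE FALSE FALSE  TRUE

# TRUE here means has.nf=FALSE must not be used for this input
contains_nf <- any(!finite)
contains_nf
#> [1] TRUE
```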
Implementation
Most of the rolling functions have 4 different implementations. The first factor that decides which implementation is used is the adaptive argument (either TRUE or FALSE); see the Adaptive rolling functions section below for details. Then for each of those two cases there are usually two implementations, depending on the algo argument.
- algo="fast" uses "online", single pass, algorithm.
  - max and min rolling function will not do only a single pass but, on average, they will compute length(x)/n nested loops. The larger the window, the greater the advantage over the exact algorithm, which computes length(x) nested loops. Note that exact uses multiple CPUs so for a small window sizes and many CPUs it may actually be faster than fast. However, in such cases the elapsed timings will likely be far below a single second.
  - median will use a novel algorithm described by Jukka Suomela in his paper Median Filtering is Equivalent to Sorting (2014). See references section for the link. Implementation here is extended to support arbitrary length of input and an even window size. Despite extensive validation of results this function should be considered experimental. When missing values are detected it will fall back to slower algo="exact" implementation.
  - var and sd will use numerically stable Welford's online algorithm.
  - Not all functions have fast implementation available. As of now, adaptive max, min, median, var and sd do not have fast adaptive implementation, therefore it will automatically fall back to exact adaptive implementation. Similarly, non-adaptive fast implementations of median, var and sd will fall back to exact implementations if they detect any non-finite values in the input. datatable.verbose option can be used to check that.
- algo="exact" will make the rolling functions use a more computationally-intensive algorithm. For each observation in the input vector it will compute a function on a rolling window from scratch (complexity \(O(n^2)\)).
  - Depending on the function, this algorithm may suffer less from floating point rounding error (the same consideration applies to base mean).
  - In case of mean, it will additionally make an extra pass to perform floating point error correction. Error corrections might not be truly exact on some platforms (like Windows) when using multiple threads.
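The difference between the two algo settings can be sketched in base R for the rolling variance. roll_var_exact recomputes every window from scratch, while roll_var_online makes a single pass using a sliding-window form of Welford's update. Both functions are hypothetical illustrations, not data.table's C implementation, and they assume a finite input with 2 <= n <= length(x).

```r
# "exact" style: recompute each window from scratch
roll_var_exact <- function(x, n) {
  stopifnot(n >= 2L, n <= length(x))
  ans <- rep(NA_real_, length(x))
  for (i in n:length(x)) ans[i] <- var(x[(i - n + 1L):i])
  ans
}

# "online" style: keep a running mean `m` and sum of squared deviations `m2`
roll_var_online <- function(x, n) {
  stopifnot(n >= 2L, n <= length(x))
  ans <- rep(NA_real_, length(x))
  m <- 0
  m2 <- 0
  # Welford accumulation over the first window
  for (i in seq_len(n)) {
    d <- x[i] - m
    m <- m + d / i
    m2 <- m2 + d * (x[i] - m)
  }
  ans[n] <- m2 / (n - 1)
  # slide the window: replace the oldest value by the newest one
  if (length(x) > n) for (i in (n + 1L):length(x)) {
    old <- x[i - n]
    new <- x[i]
    d <- new - old
    m_prev <- m
    m <- m + d / n
    m2 <- m2 + d * (new - m + old - m_prev)
    ans[i] <- m2 / (n - 1)
  }
  ans
}

x <- 1:6 / 2
v_exact  <- roll_var_exact(x, 3)    # NA NA 0.25 0.25 0.25 0.25
v_online <- roll_var_online(x, 3)
all.equal(v_exact, v_online)
```

The online version does a constant amount of work per observation, but it carries state from one window to the next, so rounding error can accumulate. Recomputing each window avoids that at the cost of more work.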
Adaptive rolling functions
Adaptive rolling functions are a special case where each observation has its own corresponding rolling window width. Therefore, values passed to n argument must be series corresponding to observations in x. If multiple windows are meant to be computed, then a list of integer vectors is expected; each list element must be an integer vector of window size corresponding to observations in x; see Examples. Due to the logic or implementation of adaptive rolling functions, the following restrictions apply:
- align does not support "center".
- if a list of vectors is passed to x, then all vectors within it must have equal length due to the fact that length of adaptive window widths must match the length of vectors in x.
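The idea of one window width per observation can be sketched in base R as follows. roll_sum_adaptive_ref is a hypothetical, right-aligned reference implementation, not data.table's code. It assumes that positions where the requested width exceeds the number of available observations receive the fill value, and it does not handle widths below 1.

```r
# Hypothetical reference implementation (base R), right-aligned only.
roll_sum_adaptive_ref <- function(x, n, fill = NA_real_) {
  stopifnot(length(n) == length(x))   # one window width per observation
  ans <- rep(fill, length(x))
  for (i in seq_along(x)) {
    # compute only when the window fits into the observations seen so far
    if (n[i] >= 1L && n[i] <= i) ans[i] <- sum(x[(i - n[i] + 1L):i])
  }
  ans
}

x <- c(1, 2, 3, 4, 5)
n <- c(1L, 2L, 1L, 3L, 2L)
single <- roll_sum_adaptive_ref(x, n)
single
#> [1] 1 3 3 9 9

# several sets of widths: a list of vectors, each as long as x
multi <- lapply(list(n, rep(2L, 5)), roll_sum_adaptive_ref, x = x)
```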
partial argument
partial=TRUE is used to calculate rolling moments only within the input itself. That is, at the boundaries (say, observation 2 for n=4 and align="right"), we don't consider observations before the first as "missing", but instead shrink the window to be size n=2.
In practice, this is the same as an adaptive window, and could be accomplished, albeit less concisely, with a well-chosen n and adaptive=TRUE.
In fact, we implement partial=TRUE using the same algorithms as adaptive=TRUE. Therefore partial=TRUE inherits the limitations of adaptive rolling functions, see above. Adaptive functions use more complex algorithms; if performance is important, partial=TRUE should be avoided in favour of computing only missing observations separately after the rolling function; see examples.
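For align="right", the shrinking described above amounts to capping the window width at the number of observations seen so far. A base R sketch of that arithmetic, with illustrative variable names:

```r
x <- 1:6 / 2
n <- 4L

# effective window width per observation: 1, 2, 3, then n
width <- pmin(seq_along(x), n)
width
#> [1] 1 2 3 4 4 4

# observation 2 therefore aggregates only x[1:2]
second <- mean(x[1:2])
second
#> [1] 0.75
```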
zoo package users notice
Users coming from zoo, the most popular package for rolling functions, should be aware of the following differences in the data.table implementation:
- rolling function will always return result of the same length as input.
- fill defaults to NA.
- fill accepts only constant values. It does not support na.locf or other functions.
- align defaults to "right".
- na.rm is respected, and other functions are not needed when input contains NA.
- integers and logical are always coerced to numeric.
- when adaptive=FALSE (default), then n must be a numeric vector. List is not accepted.
- when adaptive=TRUE, then n must be vector of length equal to nrow(x), or list of such vectors.
Examples
# single vector and single window
frollmean(1:6, 3)
#> [1] NA NA 2 3 4 5
d = as.data.table(list(1:6/2, 3:8/4))
# rollmean of single vector and single window
frollmean(d[, V1], 3)
#> [1] NA NA 1.0 1.5 2.0 2.5
# multiple columns at once
frollmean(d, 3)
#> [[1]]
#> [1] NA NA 1.0 1.5 2.0 2.5
#>
#> [[2]]
#> [1] NA NA 1.00 1.25 1.50 1.75
#>
# multiple windows at once
frollmean(d[, .(V1)], c(3, 4))
#> [[1]]
#> [1] NA NA 1.0 1.5 2.0 2.5
#>
#> [[2]]
#> [1] NA NA NA 1.25 1.75 2.25
#>
# multiple columns and multiple windows at once
frollmean(d, c(3, 4))
#> [[1]]
#> [1] NA NA 1.0 1.5 2.0 2.5
#>
#> [[2]]
#> [1] NA NA NA 1.25 1.75 2.25
#>
#> [[3]]
#> [1] NA NA 1.00 1.25 1.50 1.75
#>
#> [[4]]
#> [1] NA NA NA 1.125 1.375 1.625
#>
## three calls above will use multiple cores when available
# other functions
frollsum(d, 3:4)
#> [[1]]
#> [1] NA NA 3.0 4.5 6.0 7.5
#>
#> [[2]]
#> [1] NA NA NA 5 7 9
#>
#> [[3]]
#> [1] NA NA 3.00 3.75 4.50 5.25
#>
#> [[4]]
#> [1] NA NA NA 4.5 5.5 6.5
#>
frollmax(d, 3:4)
#> [[1]]
#> [1] NA NA 1.5 2.0 2.5 3.0
#>
#> [[2]]
#> [1] NA NA NA 2.0 2.5 3.0
#>
#> [[3]]
#> [1] NA NA 1.25 1.50 1.75 2.00
#>
#> [[4]]
#> [1] NA NA NA 1.50 1.75 2.00
#>
frollmin(d, 3:4)
#> [[1]]
#> [1] NA NA 0.5 1.0 1.5 2.0
#>
#> [[2]]
#> [1] NA NA NA 0.5 1.0 1.5
#>
#> [[3]]
#> [1] NA NA 0.75 1.00 1.25 1.50
#>
#> [[4]]
#> [1] NA NA NA 0.75 1.00 1.25
#>
frollprod(d, 3:4)
#> [[1]]
#> [1] NA NA 0.75 3.00 7.50 15.00
#>
#> [[2]]
#> [1] NA NA NA 1.5 7.5 22.5
#>
#> [[3]]
#> [1] NA NA 0.93750 1.87500 3.28125 5.25000
#>
#> [[4]]
#> [1] NA NA NA 1.40625 3.28125 6.56250
#>
frollmedian(d, 3:4)
#> [[1]]
#> [1] NA NA 1.0 1.5 2.0 2.5
#>
#> [[2]]
#> [1] NA NA NA 1.25 1.75 2.25
#>
#> [[3]]
#> [1] NA NA 1.00 1.25 1.50 1.75
#>
#> [[4]]
#> [1] NA NA NA 1.125 1.375 1.625
#>
frollvar(d, 3:4)
#> [[1]]
#> [1] NA NA 0.25 0.25 0.25 0.25
#>
#> [[2]]
#> [1] NA NA NA 0.4166667 0.4166667 0.4166667
#>
#> [[3]]
#> [1] NA NA 0.0625 0.0625 0.0625 0.0625
#>
#> [[4]]
#> [1] NA NA NA 0.1041667 0.1041667 0.1041667
#>
frollsd(d, 3:4)
#> [[1]]
#> [1] NA NA 0.5 0.5 0.5 0.5
#>
#> [[2]]
#> [1] NA NA NA 0.6454972 0.6454972 0.6454972
#>
#> [[3]]
#> [1] NA NA 0.25 0.25 0.25 0.25
#>
#> [[4]]
#> [1] NA NA NA 0.3227486 0.3227486 0.3227486
#>
# partial=TRUE
x = 1:6/2
n = 3
ans1 = frollmean(x, n, partial=TRUE)
# same using adaptive=TRUE
an = function(n, len) c(seq.int(n), rep.int(n, len-n))
ans2 = frollmean(x, an(n, length(x)), adaptive=TRUE)
all.equal(ans1, ans2)
#> [1] TRUE
# speed up by using partial only for incomplete observations
ans3 = frollmean(x, n)
ans3[seq.int(n-1L)] = frollmean(x[seq.int(n-1L)], n, partial=TRUE)
all.equal(ans1, ans3)
#> [1] TRUE
# give.names
frollsum(list(x=1:5, y=5:1), c(tiny=2, big=4), give.names=TRUE)
#> $x_tiny
#> [1] NA 3 5 7 9
#>
#> $x_big
#> [1] NA NA NA 10 14
#>
#> $y_tiny
#> [1] NA 9 7 5 3
#>
#> $y_big
#> [1] NA NA NA 14 10
#>
# has.nf=FALSE should be used with care
frollmax(c(1,2,NA,4,5), 2)
#> [1] NA 2 NA NA 5
frollmax(c(1,2,NA,4,5), 2, has.nf=FALSE)
#> [1] NA 2 2 4 5
# use datatable.verbose=TRUE for extra insight
.op = options(datatable.verbose = TRUE)
frollsd(c(1:5,NA,7:8), 4)
#> frollfunR: allocating memory for results 1x1
#> frollfunR: computing 1 column(s) and 1 window(s) sequentially as there is only single rolling computation
#> frollfunR: 1:
#> frollsdFast: calling sqrt(frollvarFast(...))
#> frollvarFast: running for input length 8, window 4, hasnf 0, narm 0
#> frollvarFast: non-finite values are present in input, re-running with extra care for NFs
#> frollvarFast: non-finite values are present in input, redirecting to frollvarExact using has.nf=TRUE
#> frollvarExact: running in parallel for input length 8, window 4, hasnf 1, narm 0
#> frollvarExact: non-finite values are present in input, na.rm=FALSE and algo='exact' propagates NFs properply, no need to re-run
#> frollfun: processing fun 7 algo 0 took 0.000s
#> frollfunR: processing of 1 column(s) and 1 window(s) took 0.000s
#> [1] NA NA NA 1.290994 1.290994 NA NA NA
options(.op)
# performance vs exactness
set.seed(108)
x = sample(c(rnorm(1e3, 1e6, 5e5), 5e9, 5e-9))
n = 15
ma = function(x, n, na.rm=FALSE) {
  ans = rep(NA_real_, nx<-length(x))
  for (i in n:nx) ans[i] = mean(x[(i-n+1):i], na.rm=na.rm)
  ans
}
fastma = function(x, n, na.rm) {
  if (!missing(na.rm)) stop("NAs are unsupported, wrongly propagated by cumsum")
  cs = cumsum(x)
  scs = shift(cs, n)
  scs[n] = 0
  as.double((cs-scs)/n)
}
system.time(ans1<-ma(x, n))
#> user system elapsed
#> 0.006 0.000 0.006
system.time(ans2<-fastma(x, n))
#> user system elapsed
#> 0.001 0.000 0.001
system.time(ans3<-frollmean(x, n))
#> user system elapsed
#> 0 0 0
system.time(ans4<-frollmean(x, n, algo="exact"))
#> user system elapsed
#> 0.001 0.000 0.000
system.time(ans5<-frollapply(x, n, mean))
#> user system elapsed
#> 0.009 0.000 0.009
anserr = list(
  fastma = ans2-ans1,
  froll_fast = ans3-ans1,
  froll_exact = ans4-ans1,
  frollapply = ans5-ans1
)
errs = sapply(lapply(anserr, abs), sum, na.rm=TRUE)
sapply(errs, format, scientific=FALSE) # roundoff
#> fastma froll_fast froll_exact frollapply
#> "0.00001287466" "0.00000001833541" "0" "0"