Enhanced data.frame
data.table.Rddata.table inherits from data.frame. It offers fast and memory efficient: file reader and writer, aggregations, updates, equi, non-equi, rolling, range and interval joins, in a short and flexible syntax, for faster development.
It is inspired by A[B] syntax in R where A is a matrix and B is a 2-column matrix. Since a data.table is a data.frame, it is compatible with R functions and packages that accept only data.frames.
Type vignette(package="data.table") to get started. The Introduction to data.table vignette introduces data.table's x[i, j, by] syntax and is a good place to start. If you have read the vignettes and the help page below, please read the data.table support guide.
Please check the homepage for up to the minute live NEWS.
Tip: one of the quickest ways to learn the features is to type example(data.table) and study the output at the prompt.
Usage
data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL, stringsAsFactors=FALSE)
# S3 method for class 'data.table'
`[`(x, i, j, by, keyby, with = TRUE,
nomatch = NA,
mult = "all",
roll = FALSE,
rollends = if (roll=="nearest") c(TRUE,TRUE)
else if (roll>=0) c(FALSE,TRUE)
else c(TRUE,FALSE),
which = FALSE,
.SDcols,
verbose = getOption("datatable.verbose"), # default: FALSE
allow.cartesian = getOption("datatable.allow.cartesian"), # default: FALSE
drop = NULL, on = NULL, env = NULL,
showProgress = getOption("datatable.showProgress", interactive()))Arguments
- ...
Just as
...indata.frame. Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.- keep.rownames
If
...is amatrixordata.frame,TRUEwill retain the rownames of that object in a column namedrn.- check.names
Just as
check.namesindata.frame.- key
Character vector of one or more column names which is passed to
setkey.- stringsAsFactors
Logical (default is
FALSE). Convert allcharactercolumns tofactors?- x
A
data.table.- i
Integer, logical or character vector, single column numeric
matrix, expression of column names,list,data.frameordata.table.integerandlogicalvectors work the same way they do in[.data.frameexcept logicalNAs are treated as FALSE.expressionis evaluated within the frame of thedata.table(i.e. it sees column names as if they are variables) and can evaluate to any of the other types.character,listanddata.frameinput toiis converted into adata.tableinternally usingas.data.table.If
iis adata.table, the columns inito be matched againstxcan be specified using one of these ways:onargument (see below). It allows for bothequi-and the newly implementednon-equijoins.If not,
xmust be keyed. Key can be set usingsetkey. Ifiis also keyed, then first key column ofiis matched against first key column ofx, second against second, etc..If
iis not keyed, then first column ofiis matched against first key column ofx, second column ofiagainst second key column ofx, etc...This is summarised in code as
min(length(key(x)), if (haskey(i)) length(key(i)) else ncol(i)).
Using
on=is recommended (even during keyed joins) as it helps understand the code better and also allows for non-equi joins.When the binary operator
==alone is used, an equi join is performed. In SQL terms,x[i]then performs a right join by default.iprefixed with!signals a not-join or not-select.Support for non-equi join was recently implemented, which allows for other binary operators
>=, >, <= and <.See
vignette("datatable-keys-fast-subset")andvignette("datatable-secondary-indices-and-auto-indexing").Advanced: When
iis a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope.- j
When
with=TRUE(default),jis evaluated within the frame of the data.table; i.e., it sees column names as if they are variables. This allows to not just select columns inj, but alsocomputeon them e.g.,x[, a]andx[, sum(a)]returnsx$aandsum(x$a)as a vector respectively.x[, .(a, b)]andx[, .(sa=sum(a), sb=sum(b))]returns a two column data.table each, the first simply selecting columnsa, band the second computing their sums.As long as
jreturns alist, each element of the list becomes a column in the resultingdata.table. When the output ofjis not alist, the output is returned as-is (e.g.x[ , a]returns the column vectora), unlessbyis used, in which case it is implicitly wrapped inlistfor convenience (e.g.x[ , sum(a), by=b]will create a column namedV1with valuesum(a)for each group).The expression
.()is a shorthand alias tolist(); they both mean the same. (An exception is made for the use of.()within a call tobquote, where.()is left unchanged.)When
jis a vector of column names or positions to select (as indata.frame), there is no need to usewith=FALSE. Note thatwith=FALSEis still necessary when using a logical vector with lengthncol(x)to include/exclude columns. Note: if a logical vector with lengthk < ncol(x)is passed, it will be filled to lengthncol(x)withFALSE, which is different fromdata.frame, where the vector is recycled.Advanced:
jalso allows the use of special read-only symbols:.SD,.N,.I,.GRP,.BY. Seespecial-symbolsand the Examples below for more.Advanced: When
iis adata.table, the columns ofican be referred to injby using the prefixi., e.g.,X[Y, .(val, i.val)]. Herevalrefers toX's column andi.valY's.Advanced: Columns of
xcan now be referred to using the prefixx.and is particularly useful during joining to refer tox's join columns as they are otherwise masked byi's. For example,X[Y, .(x.a-i.a, b), on="a"].- by
Column names are seen as if they are variables (as in
jwhenwith=TRUE). Thedata.tableis then grouped by thebyandjis evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups.byaccepts:A single unquoted column name: e.g.,
DT[, .(sa=sum(a)), by=x]a
list()of expressions of column names: e.g.,DT[, .(sa=sum(a)), by=.(x=x>0, y)]a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g.,
DT[, sum(a), by="x,y,z"]a character vector of column names: e.g.,
DT[, sum(a), by=c("x", "y")]or of the form
startcol:endcol: e.g.,DT[, sum(a), by=x:z]
Advanced: When
iis alist(ordata.frameordata.table),DT[i, j, by=.EACHI]evaluatesjfor the groups inDTthat each row inijoins to. That is, you can join (ini) and aggregate (inj) simultaneously. We call this grouping by each i. See this StackOverflow answer for a more detailed explanation until we roll out vignettes.Advanced: In the
X[Y, j]form of grouping, thejexpression sees variables inXfirst, thenY. We call this join inherited scope. If the variable is not inXorYthen the calling frame is searched, its calling frame, and so on in the usual way up to and including the global environment.- keyby
Same as
by, but with an additionalsetkey()run on thebycolumns of the result, for convenience. It is common practice to usekeyby=routinely when you wish the result to be sorted. May also beTRUEorFALSEwhenbyis provided as an alternative way to accomplish the same operation.- with
By default
with=TRUEandjis evaluated within the frame ofx; column names can be used as variables. In the case of overlapping variable names insidexand in parent scope, you can use the double dot prefix..colsto explicitly refer to thecolsvariable in parent scope and not fromx.When
jis a character vector of column names, a numeric vector of column positions to select, or of the formstartcol:endcol, the value returned is always adata.table.New code should rarely use this argument, which was originally needed for similarity to data.frame. For example, to select columns from a character vector
cols, in data.frame we dox[, cols], which has several equivalents in data.table:x[, .SD, .SDcols=cols],x[, ..cols],x[, cols, env = list(cols = I(cols))], orx[, cols, with=FALSE].- nomatch
When a row in
ihas no match tox,nomatch=NA(default) meansNAis returned.NULL(or0for backward compatibility) means no rows will be returned for that row ofi.- mult
When
iis alist(ordata.frameordata.table) and multiple rows inxmatch to the row ini,multcontrols which are returned:"all"(default),"first"or"last".- roll
When
iis adata.tableand its row matches to all but the lastxjoin column, and its value in the lastijoin column falls in a gap (including after the last observation inxfor that group), then:+Inf(orTRUE) rolls the prevailing value inxforward. It is also known as last observation carried forward (LOCF).-Infrolls backwards instead; i.e., next observation carried backward (NOCB).finite positive or negative number limits how far values are carried forward or backward.
"nearest" rolls the nearest value instead.
Rolling joins apply to the last join column, generally a date but can be any variable. It is particularly fast using a modified binary search.
A common idiom is to select a contemporaneous regular time series (
dts) across a set of identifiers (ids):DT[CJ(ids,dts),roll=TRUE]whereDThas a 2-column key (id,date) andCJstands for cross join.- rollends
A logical vector length 2 (a single logical is recycled) indicating whether values falling before the first value or after the last value for a group should be rolled as well.
If
rollends[2]=TRUE, it will roll the last value forward.TRUEby default for LOCF andFALSEfor NOCB rolls.If
rollends[1]=TRUE, it will roll the first value backward.TRUEby default for NOCB andFALSEfor LOCF rolls.
When
rollis a finite number, that limit is also applied when rolling the ends.- which
TRUEreturns the row numbers ofxthatimatches to. IfNA, returns the row numbers ofithat have no match inx. By defaultFALSEand the rows inxthat match are returned.- .SDcols
Specifies the columns of
xto be included in the special symbol.SDwhich stands forSubset of data.table. May be character column names, numeric positions, logical, a function name such asis.numeric, or a function call such aspatterns()..SDcolsis particularly useful for speed when applying a function through a subset of (possible very many) columns by group; e.g.,DT[, lapply(.SD, sum), by="x,y", .SDcols=301:350].For convenient interactive use, the form
startcol:endcolis also allowed (as inby), e.g.,DT[, lapply(.SD, sum), by=x:y, .SDcols=a:f].Inversion (column dropping instead of keeping) can be accomplished be prepending the argument with
!or-(there's no difference between these), e.g..SDcols = !c('x', 'y').Finally, you can filter columns to include in
.SDbased on their names according to regular expressions via.SDcols=patterns(regex1, regex2, ...). The included columns will be the intersection of the columns identified by each pattern; pattern unions can easily be specified with|in a regex. You can filter columns onvaluesby passing a function, e.g..SDcols=is.numeric. You can also invert a pattern as usual with.SDcols=!patterns(...)or.SDcols=!is.numeric.- verbose
TRUEturns on status and information messages to the console. Turn this on by default usingoptions(datatable.verbose=TRUE). The quantity and types of verbosity may be expanded in future.- allow.cartesian
FALSEprevents joins that would result in more thannrow(x)+nrow(i)rows. This is usually caused by duplicate values ini's join columns, each of which join to the same group inxover and over again: a misspecified join. Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve indata.table: where every row inijoins to every row inx(anrow(x)*nrow(i)row result). 'cartesian' is just meant in a 'large multiplicative' sense, so FALSE does not always prevent a traditional cartesian join.- drop
Never used by
data.table. Do not use. It needs to be here becausedata.tableinherits fromdata.frame. Seevignette("datatable-faq").- on
Indicate which columns in
xshould be joined with which columns inialong with the type of binary operator to join with (see non-equi joins below on this). When specified, this overrides the keys set onxandi. When.NATURALkeyword provided then natural join is made (join on common columns). There are multiple ways of specifying theonargument:As an unnamed character vector, e.g.,
X[Y, on=c("a", "b")], used when columnsaandbare common to bothXandY.Foreign key joins: As a named character vector when the join columns have different names in
XandY. For example,X[Y, on=c(x1="y1", x2="y2")]joinsXandYby matching columnsx1andx2inXwith columnsy1andy2inY, respectively.From v1.9.8, you can also express foreign key joins using the binary operator
==, e.g.X[Y, on=c("x1==y1", "x2==y2")].NB: shorthand like
X[Y, on=c("a", V2="b")]is also possible if, e.g., column"a"is common between the two tables.For convenience during interactive scenarios, it is also possible to use
.()syntax asX[Y, on=.(a, b)].From v1.9.8, (non-equi) joins using binary operators
>=, >, <=, <are also possible, e.g.,X[Y, on=c("x>=a", "y<=b")], or for interactive use asX[Y, on=.(x>=a, y<=b)].
Note that providing
onis required forX[Y]joins whenXis unkeyed. See examples as well asvignette("datatable-secondary-indices-and-auto-indexing").- env
List or an environment, passed to
substitute2for substitution of parameters ini,jandby(orkeyby). Useverboseto preview constructed expressions. For more details seevignette("datatable-programming").- showProgress
TRUEshows progress indicator with estimated time to completion for lengthy "by" operations.
Details
data.table builds on base R functionality to reduce 2 types of time:
programming time (easier to write, read, debug and maintain), and
compute time (fast and memory efficient).
The general form of data.table syntax is:
DT[ i, j, by ] # + extra arguments
| | |
| | -------> grouped by what?
| -------> what to do?
---> on which rows?
The way to read this out loud is: "Take DT, subset rows by i, then compute j grouped by by. Here are some basic usage examples expanding on this definition. See the vignette (and examples) for working examples.
X[, a] # return col 'a' from X as vector. If not found, search in parent frame.
X[, .(a)] # same as above, but return as a data.table.
X[, sum(a)] # return sum(a) as a vector (with same scoping rules as above)
X[, .(sum(a)), by=c] # get sum(a) grouped by 'c'.
X[, sum(a), by=c] # same as above, .() can be omitted in j and by on single expression for convenience
X[, sum(a), by=c:f] # get sum(a) grouped by all columns in between 'c' and 'f' (both inclusive)
X[, sum(a), keyby=b] # get sum(a) grouped by 'b', and sort that result by the grouping column 'b'
X[, sum(a), by=b, keyby=TRUE] # same order as above, but using sorting flag
X[, sum(a), by=b][order(b)] # same order as above, but by chaining compound expressions
X[c>1, sum(a), by=c] # get rows where c>1 is TRUE, and on those rows, get sum(a) grouped by 'c'
X[Y, .(a, b), on="c"] # get rows where Y$c == X$c, and select columns 'X$a' and 'X$b' for those rows
X[Y, .(a, i.a), on="c"] # get rows where Y$c == X$c, and then select 'X$a' and 'Y$a' (=i.a)
X[Y, sum(a*i.a), on="c", by=.EACHI] # for *each* 'Y$c', get sum(a*i.a) on matching rows in 'X$c'
X[, plot(a, b), by=c] # j accepts any expression, generates plot for each group and returns no data
# see ?assign to add/update/delete columns by reference using the same consistent interfaceA data.table query may be invoked on a data.frame using functional form DT(...), see examples. The class of the input is retained.
A data.table is a list of vectors, just like a data.frame. However :
it never has or uses rownames. Rownames based indexing can be done by setting a key of one or more columns or done ad-hoc using the
onargument (now preferred).it has enhanced functionality in
[.data.tablefor fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.
See the see also section for the several other methods that are available for operating on data.tables efficiently.
References
https://r-datatable.com (data.table homepage)
https://en.wikipedia.org/wiki/Binary_search
Note
If keep.rownames or check.names are supplied they must be written in full because R does not allow partial argument names after .... For example, data.table(DF, keep=TRUE) will create a
column called keep containing TRUE and this is correct behaviour; data.table(DF, keep.rownames=TRUE) was intended.
POSIXlt is not supported as a column type because it uses 40 bytes to store a single datetime. They are implicitly converted to POSIXct type with warning. You may also be interested in IDateTime instead; it has methods to convert to and from POSIXlt.
See also
special-symbols, data.frame, [.data.frame, as.data.table, setkey, setorder, setDT, setDF, J, SJ, CJ, merge.data.table, tables, test.data.table, IDateTime, unique.data.table, copy, :=, setalloccol, truelength, rbindlist, setNumericRounding, datatable-optimize, fsetdiff, funion, fintersect, fsetequal, anyDuplicated, uniqueN, rowid, rleid, na.omit, frank, rowwiseDT
Examples
if (FALSE) { # \dontrun{
example(data.table) # to run these examples yourself
} # }
DF = data.frame(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
DF
#> x y v
#> 1 b 1 1
#> 2 b 3 2
#> 3 b 6 3
#> 4 a 1 4
#> 5 a 3 5
#> 6 a 6 6
#> 7 c 1 7
#> 8 c 3 8
#> 9 c 6 9
DT
#> x y v
#> <char> <num> <int>
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: a 1 4
#> 5: a 3 5
#> 6: a 6 6
#> 7: c 1 7
#> 8: c 3 8
#> 9: c 6 9
identical(dim(DT), dim(DF)) # TRUE
#> [1] TRUE
identical(DF$a, DT$a) # TRUE
#> [1] TRUE
is.list(DF) # TRUE
#> [1] TRUE
is.list(DT) # TRUE
#> [1] TRUE
is.data.frame(DT) # TRUE
#> [1] TRUE
tables()
#> NAME NROW NCOL MB COLS KEY
#> 1: DT 9 3 0 x,y,v [NULL]
#> Total: 0MB using type_size
# basic row subset operations
DT[2] # 2nd row
#> x y v
#> <char> <num> <int>
#> 1: b 3 2
DT[3:2] # 3rd and 2nd row
#> x y v
#> <char> <num> <int>
#> 1: b 6 3
#> 2: b 3 2
DT[order(x)] # no need for order(DT$x)
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
#> 4: b 1 1
#> 5: b 3 2
#> 6: b 6 3
#> 7: c 1 7
#> 8: c 3 8
#> 9: c 6 9
DT[order(x), ] # same as above. The ',' is optional
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
#> 4: b 1 1
#> 5: b 3 2
#> 6: b 6 3
#> 7: c 1 7
#> 8: c 3 8
#> 9: c 6 9
DT[y>2] # all rows where DT$y > 2
#> x y v
#> <char> <num> <int>
#> 1: b 3 2
#> 2: b 6 3
#> 3: a 3 5
#> 4: a 6 6
#> 5: c 3 8
#> 6: c 6 9
DT[y>2 & v>5] # compound logical expressions
#> x y v
#> <char> <num> <int>
#> 1: a 6 6
#> 2: c 3 8
#> 3: c 6 9
DT[!2:4] # all rows other than 2:4
#> x y v
#> <char> <num> <int>
#> 1: b 1 1
#> 2: a 3 5
#> 3: a 6 6
#> 4: c 1 7
#> 5: c 3 8
#> 6: c 6 9
DT[-(2:4)] # same
#> x y v
#> <char> <num> <int>
#> 1: b 1 1
#> 2: a 3 5
#> 3: a 6 6
#> 4: c 1 7
#> 5: c 3 8
#> 6: c 6 9
# select|compute columns data.table way
DT[, v] # v column (as vector)
#> [1] 1 2 3 4 5 6 7 8 9
DT[, list(v)] # v column (as data.table)
#> v
#> <int>
#> 1: 1
#> 2: 2
#> 3: 3
#> 4: 4
#> 5: 5
#> 6: 6
#> 7: 7
#> 8: 8
#> 9: 9
DT[, .(v)] # same as above, .() is a shorthand alias to list()
#> v
#> <int>
#> 1: 1
#> 2: 2
#> 3: 3
#> 4: 4
#> 5: 5
#> 6: 6
#> 7: 7
#> 8: 8
#> 9: 9
DT[, sum(v)] # sum of column v, returned as vector
#> [1] 45
DT[, .(sum(v))] # same, but return data.table (column autonamed V1)
#> V1
#> <int>
#> 1: 45
DT[, .(sv=sum(v))] # same, but column named "sv"
#> sv
#> <int>
#> 1: 45
DT[, .(v, v*2)] # return two column data.table, v and v*2
#> v V2
#> <int> <num>
#> 1: 1 2
#> 2: 2 4
#> 3: 3 6
#> 4: 4 8
#> 5: 5 10
#> 6: 6 12
#> 7: 7 14
#> 8: 8 16
#> 9: 9 18
# subset rows and select|compute data.table way
DT[2:3, sum(v)] # sum(v) over rows 2 and 3, return vector
#> [1] 5
DT[2:3, .(sum(v))] # same, but return data.table with column V1
#> V1
#> <int>
#> 1: 5
DT[2:3, .(sv=sum(v))] # same, but return data.table with column sv
#> sv
#> <int>
#> 1: 5
DT[2:5, cat(v, "\n")] # just for j's side effect
#> 2 3 4 5
#> NULL
# select columns the data.frame way
DT[, 2] # 2nd column, returns a data.table always
#> y
#> <num>
#> 1: 1
#> 2: 3
#> 3: 6
#> 4: 1
#> 5: 3
#> 6: 6
#> 7: 1
#> 8: 3
#> 9: 6
colNum = 2
DT[, ..colNum] # same, .. prefix conveys one-level-up in calling scope
#> y
#> <num>
#> 1: 1
#> 2: 3
#> 3: 6
#> 4: 1
#> 5: 3
#> 6: 6
#> 7: 1
#> 8: 3
#> 9: 6
DT[["v"]] # same as DT[, v] but faster if called in a loop
#> [1] 1 2 3 4 5 6 7 8 9
# grouping operations - j and by
DT[, sum(v), by=x] # ad hoc by, order of groups preserved in result
#> x V1
#> <char> <int>
#> 1: b 6
#> 2: a 15
#> 3: c 24
DT[, sum(v), keyby=x] # same, but order the result on by cols
#> Key: <x>
#> x V1
#> <char> <int>
#> 1: a 15
#> 2: b 6
#> 3: c 24
DT[, sum(v), by=x, keyby=TRUE] # same, but using sorting flag
#> Key: <x>
#> x V1
#> <char> <int>
#> 1: a 15
#> 2: b 6
#> 3: c 24
DT[, sum(v), by=x][order(x)] # same but by chaining expressions together
#> x V1
#> <char> <int>
#> 1: a 15
#> 2: b 6
#> 3: c 24
# fast ad hoc row subsets (subsets as joins)
DT["a", on="x"] # same as x == "a" but uses binary search (fast)
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
# NB: requires DT to be keyed!
DT["a", on=.(x)] # same, for convenience, no need to quote every column
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
# NB: works regardless of whether or not DT is keyed!
DT[.("a"), on="x"] # same
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
DT[x=="a"] # same, single "==" internally optimised to use binary search (fast)
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
DT[x!="b" | y!=3] # not yet optimized, currently vector scan subset
#> x y v
#> <char> <num> <int>
#> 1: b 1 1
#> 2: b 6 3
#> 3: a 1 4
#> 4: a 3 5
#> 5: a 6 6
#> 6: c 1 7
#> 7: c 3 8
#> 8: c 6 9
DT[.("b", 3), on=c("x", "y")] # join on columns x,y of DT; uses binary search (fast)
#> x y v
#> <char> <num> <int>
#> 1: b 3 2
DT[.("b", 3), on=.(x, y)] # same, but using on=.()
#> x y v
#> <char> <num> <int>
#> 1: b 3 2
DT[.("b", 1:2), on=c("x", "y")] # no match returns NA
#> x y v
#> <char> <int> <int>
#> 1: b 1 1
#> 2: b 2 NA
DT[.("b", 1:2), on=.(x, y), nomatch=NULL] # no match row is not returned
#> x y v
#> <char> <int> <int>
#> 1: b 1 1
DT[.("b", 1:2), on=c("x", "y"), roll=Inf] # locf, nomatch row gets rolled by previous row
#> x y v
#> <char> <int> <int>
#> 1: b 1 1
#> 2: b 2 1
DT[.("b", 1:2), on=.(x, y), roll=-Inf] # nocb, nomatch row gets rolled by next row
#> x y v
#> <char> <int> <int>
#> 1: b 1 1
#> 2: b 2 2
DT["b", sum(v*y), on="x"] # on rows where DT$x=="b", calculate sum(v*y)
#> [1] 25
# all together now
DT[x!="a", sum(v), by=x] # get sum(v) by "x" for each i != "a"
#> x V1
#> <char> <int>
#> 1: b 6
#> 2: c 24
DT[!"a", sum(v), by=.EACHI, on="x"] # same, but using subsets-as-joins
#> x V1
#> <char> <int>
#> 1: b 6
#> 2: c 24
DT[c("b","c"), sum(v), by=.EACHI, on="x"] # same
#> x V1
#> <char> <int>
#> 1: b 6
#> 2: c 24
DT[c("b","c"), sum(v), by=.EACHI, on=.(x)] # same, using on=.()
#> x V1
#> <char> <int>
#> 1: b 6
#> 2: c 24
# joins as subsets
X = data.table(x=c("c","b"), v=8:7, foo=c(4,2))
X
#> x v foo
#> <char> <int> <num>
#> 1: c 8 4
#> 2: b 7 2
DT[X, on="x"] # right join
#> x y v i.v foo
#> <char> <num> <int> <int> <num>
#> 1: c 1 7 8 4
#> 2: c 3 8 8 4
#> 3: c 6 9 8 4
#> 4: b 1 1 7 2
#> 5: b 3 2 7 2
#> 6: b 6 3 7 2
X[DT, on="x"] # left join
#> x v foo y i.v
#> <char> <int> <num> <num> <int>
#> 1: b 7 2 1 1
#> 2: b 7 2 3 2
#> 3: b 7 2 6 3
#> 4: a NA NA 1 4
#> 5: a NA NA 3 5
#> 6: a NA NA 6 6
#> 7: c 8 4 1 7
#> 8: c 8 4 3 8
#> 9: c 8 4 6 9
DT[X, on="x", nomatch=NULL] # inner join
#> x y v i.v foo
#> <char> <num> <int> <int> <num>
#> 1: c 1 7 8 4
#> 2: c 3 8 8 4
#> 3: c 6 9 8 4
#> 4: b 1 1 7 2
#> 5: b 3 2 7 2
#> 6: b 6 3 7 2
DT[!X, on="x"] # not join
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
DT[X, on=c(y="v")] # join using column "y" of DT with column "v" of X
#> x y v i.x foo
#> <char> <int> <int> <char> <num>
#> 1: <NA> 8 NA c 4
#> 2: <NA> 7 NA b 2
DT[X, on="y==v"] # same as above (v1.9.8+)
#> x y v i.x foo
#> <char> <int> <int> <char> <num>
#> 1: <NA> 8 NA c 4
#> 2: <NA> 7 NA b 2
DT[X, on=.(y<=foo)] # NEW non-equi join (v1.9.8+)
#> x y v i.x i.v
#> <char> <num> <int> <char> <int>
#> 1: b 4 1 c 8
#> 2: b 4 2 c 8
#> 3: a 4 4 c 8
#> 4: a 4 5 c 8
#> 5: c 4 7 c 8
#> 6: c 4 8 c 8
#> 7: b 2 1 b 7
#> 8: a 2 4 b 7
#> 9: c 2 7 b 7
DT[X, on="y<=foo"] # same as above
#> x y v i.x i.v
#> <char> <num> <int> <char> <int>
#> 1: b 4 1 c 8
#> 2: b 4 2 c 8
#> 3: a 4 4 c 8
#> 4: a 4 5 c 8
#> 5: c 4 7 c 8
#> 6: c 4 8 c 8
#> 7: b 2 1 b 7
#> 8: a 2 4 b 7
#> 9: c 2 7 b 7
DT[X, on=c("y<=foo")] # same as above
#> x y v i.x i.v
#> <char> <num> <int> <char> <int>
#> 1: b 4 1 c 8
#> 2: b 4 2 c 8
#> 3: a 4 4 c 8
#> 4: a 4 5 c 8
#> 5: c 4 7 c 8
#> 6: c 4 8 c 8
#> 7: b 2 1 b 7
#> 8: a 2 4 b 7
#> 9: c 2 7 b 7
DT[X, on=.(y>=foo)] # NEW non-equi join (v1.9.8+)
#> x y v i.x i.v
#> <char> <num> <int> <char> <int>
#> 1: b 4 3 c 8
#> 2: a 4 6 c 8
#> 3: c 4 9 c 8
#> 4: b 2 2 b 7
#> 5: b 2 3 b 7
#> 6: a 2 5 b 7
#> 7: a 2 6 b 7
#> 8: c 2 8 b 7
#> 9: c 2 9 b 7
DT[X, on=.(x, y<=foo)] # NEW non-equi join (v1.9.8+)
#> x y v i.v
#> <char> <num> <int> <int>
#> 1: c 4 7 8
#> 2: c 4 8 8
#> 3: b 2 1 7
DT[X, .(x,y,x.y,v), on=.(x, y>=foo)] # Select x's join columns as well
#> x y x.y v
#> <char> <num> <num> <int>
#> 1: c 4 6 9
#> 2: b 2 3 2
#> 3: b 2 6 3
DT[X, on="x", mult="first"] # first row of each group
#> x y v i.v foo
#> <char> <num> <int> <int> <num>
#> 1: c 1 7 8 4
#> 2: b 1 1 7 2
DT[X, on="x", mult="last"] # last row of each group
#> x y v i.v foo
#> <char> <num> <int> <int> <num>
#> 1: c 6 9 8 4
#> 2: b 6 3 7 2
DT[X, sum(v), by=.EACHI, on="x"] # join and eval j for each row in i
#> x V1
#> <char> <int>
#> 1: c 24
#> 2: b 6
DT[X, sum(v)*foo, by=.EACHI, on="x"] # join inherited scope
#> x V1
#> <char> <num>
#> 1: c 96
#> 2: b 12
DT[X, sum(v)*i.v, by=.EACHI, on="x"] # 'i,v' refers to X's v column
#> x V1
#> <char> <int>
#> 1: c 192
#> 2: b 42
DT[X, on=.(x, v>=v), sum(y)*foo, by=.EACHI] # NEW non-equi join with by=.EACHI (v1.9.8+)
#> x v V1
#> <char> <int> <num>
#> 1: c 8 36
#> 2: b 7 NA
# setting keys
kDT = copy(DT) # (deep) copy DT to kDT to work with it.
setkey(kDT,x) # set a 1-column key. No quotes, for convenience.
setkeyv(kDT,"x") # same (v in setkeyv stands for vector)
v="x"
setkeyv(kDT,v) # same
haskey(kDT) # TRUE
#> [1] TRUE
key(kDT) # "x"
#> [1] "x"
# fast *keyed* subsets
kDT["a"] # subset-as-join on *key* column 'x'
#> Key: <x>
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
kDT["a", on="x"] # same, being explicit using 'on=' (preferred)
#> Key: <x>
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
# all together
kDT[!"a", sum(v), by=.EACHI] # get sum(v) for each i != "a"
#> Key: <x>
#> x V1
#> <char> <int>
#> 1: b 6
#> 2: c 24
# multi-column key
setkey(kDT,x,y) # 2-column key
setkeyv(kDT,c("x","y")) # same
# fast *keyed* subsets on multi-column key
kDT["a"] # join to 1st column of key
#> Key: <x, y>
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
kDT["a", on="x"] # on= is optional, but is preferred
#> Key: <x, y>
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
kDT[.("a")] # same, .() is an alias for list()
#> Key: <x, y>
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
kDT[list("a")] # same
#> Key: <x, y>
#> x y v
#> <char> <num> <int>
#> 1: a 1 4
#> 2: a 3 5
#> 3: a 6 6
kDT[.("a", 3)] # join to 2 columns
#> Key: <x, y>
#> x y v
#> <char> <num> <int>
#> 1: a 3 5
kDT[.("a", 3:6)] # join 4 rows (2 missing)
#> x y v
#> <char> <int> <int>
#> 1: a 3 5
#> 2: a 4 NA
#> 3: a 5 NA
#> 4: a 6 6
kDT[.("a", 3:6), nomatch=NULL] # remove missing
#> Key: <x, y>
#> x y v
#> <char> <int> <int>
#> 1: a 3 5
#> 2: a 6 6
kDT[.("a", 3:6), roll=TRUE] # locf rolling join
#> x y v
#> <char> <int> <int>
#> 1: a 3 5
#> 2: a 4 5
#> 3: a 5 5
#> 4: a 6 6
kDT[.("a", 3:6), roll=Inf] # same as above
#> x y v
#> <char> <int> <int>
#> 1: a 3 5
#> 2: a 4 5
#> 3: a 5 5
#> 4: a 6 6
kDT[.("a", 3:6), roll=-Inf] # nocb rolling join
#> x y v
#> <char> <int> <int>
#> 1: a 3 5
#> 2: a 4 6
#> 3: a 5 6
#> 4: a 6 6
kDT[!.("a")] # not join
#> Key: <x, y>
#> x y v
#> <char> <num> <int>
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: c 1 7
#> 5: c 3 8
#> 6: c 6 9
kDT[!"a"] # same
#> Key: <x, y>
#> x y v
#> <char> <num> <int>
#> 1: b 1 1
#> 2: b 3 2
#> 3: b 6 3
#> 4: c 1 7
#> 5: c 3 8
#> 6: c 6 9
# more on special symbols, see also ?"special-symbols"
DT[.N] # last row
#> x y v
#> <char> <num> <int>
#> 1: c 6 9
DT[, .N] # total number of rows in DT
#> [1] 9
DT[, .N, by=x] # number of rows in each group
#> x N
#> <char> <int>
#> 1: b 3
#> 2: a 3
#> 3: c 3
DT[, .SD, .SDcols=x:y] # select columns 'x' through 'y'
#> Index: <x>
#> x y
#> <char> <num>
#> 1: b 1
#> 2: b 3
#> 3: b 6
#> 4: a 1
#> 5: a 3
#> 6: a 6
#> 7: c 1
#> 8: c 3
#> 9: c 6
DT[ , .SD, .SDcols = !x:y] # drop columns 'x' through 'y'
#> v
#> <int>
#> 1: 1
#> 2: 2
#> 3: 3
#> 4: 4
#> 5: 5
#> 6: 6
#> 7: 7
#> 8: 8
#> 9: 9
DT[ , .SD, .SDcols = patterns('^[xv]')] # select columns matching '^x' or '^v'
#> Index: <x>
#> x v
#> <char> <int>
#> 1: b 1
#> 2: b 2
#> 3: b 3
#> 4: a 4
#> 5: a 5
#> 6: a 6
#> 7: c 7
#> 8: c 8
#> 9: c 9
DT[, .SD[1]] # first row of all columns
#> x y v
#> <char> <num> <int>
#> 1: b 1 1
DT[, .SD[1], by=x] # first row of 'y' and 'v' for each group in 'x'
#> x y v
#> <char> <num> <int>
#> 1: b 1 1
#> 2: a 1 4
#> 3: c 1 7
DT[, c(.N, lapply(.SD, sum)), by=x] # get rows *and* sum columns 'v' and 'y' by group
#> x N y v
#> <char> <int> <num> <int>
#> 1: b 3 10 6
#> 2: a 3 10 15
#> 3: c 3 10 24
DT[, .I[1], by=x] # row number in DT corresponding to each group
#> x V1
#> <char> <int>
#> 1: b 1
#> 2: a 4
#> 3: c 7
DT[, grp := .GRP, by=x] # add a group counter column
#> Index: <x>
#> x y v grp
#> <char> <num> <int> <int>
#> 1: b 1 1 1
#> 2: b 3 2 1
#> 3: b 6 3 1
#> 4: a 1 4 2
#> 5: a 3 5 2
#> 6: a 6 6 2
#> 7: c 1 7 3
#> 8: c 3 8 3
#> 9: c 6 9 3
DT[ , dput(.BY), by=.(x,y)] # .BY is a list of singletons for each group
#> list(x = "b", y = 1)
#> list(x = "b", y = 3)
#> list(x = "b", y = 6)
#> list(x = "a", y = 1)
#> list(x = "a", y = 3)
#> list(x = "a", y = 6)
#> list(x = "c", y = 1)
#> list(x = "c", y = 3)
#> list(x = "c", y = 6)
#> x y x y
#> <char> <num> <char> <num>
#> 1: b 1 b 1
#> 2: b 3 b 3
#> 3: b 6 b 6
#> 4: a 1 a 1
#> 5: a 3 a 3
#> 6: a 6 a 6
#> 7: c 1 c 1
#> 8: c 3 c 3
#> 9: c 6 c 6
X[, DT[.BY, y, on="x"], by=x] # join within each group
#> x V1
#> <char> <num>
#> 1: c 1
#> 2: c 3
#> 3: c 6
#> 4: b 1
#> 5: b 3
#> 6: b 6
DT[, {
# write each group to a different file
fwrite(.SD, file.path(tempdir(), paste0('x=', .BY$x, '.csv')))
}, by=x]
#> Empty data.table (0 rows and 1 cols): x
dir(tempdir())
#> [1] "bslib-6fa9f2659af53b4d3fd43916167b0c41"
#> [2] "downlit"
#> [3] "file6397dd3fbce"
#> [4] "x=a.csv"
#> [5] "x=b.csv"
#> [6] "x=c.csv"
# add/update/delete by reference (see ?assign)
print(DT[, z:=42L]) # add new column by reference
#> Index: <x>
#> x y v grp z
#> <char> <num> <int> <int> <int>
#> 1: b 1 1 1 42
#> 2: b 3 2 1 42
#> 3: b 6 3 1 42
#> 4: a 1 4 2 42
#> 5: a 3 5 2 42
#> 6: a 6 6 2 42
#> 7: c 1 7 3 42
#> 8: c 3 8 3 42
#> 9: c 6 9 3 42
print(DT[, z:=NULL]) # remove column by reference
#> Index: <x>
#> x y v grp
#> <char> <num> <int> <int>
#> 1: b 1 1 1
#> 2: b 3 2 1
#> 3: b 6 3 1
#> 4: a 1 4 2
#> 5: a 3 5 2
#> 6: a 6 6 2
#> 7: c 1 7 3
#> 8: c 3 8 3
#> 9: c 6 9 3
print(DT["a", v:=42L, on="x"]) # subassign to existing v column by reference
#> Index: <x>
#> x y v grp
#> <char> <num> <int> <int>
#> 1: b 1 1 1
#> 2: b 3 2 1
#> 3: b 6 3 1
#> 4: a 1 42 2
#> 5: a 3 42 2
#> 6: a 6 42 2
#> 7: c 1 7 3
#> 8: c 3 8 3
#> 9: c 6 9 3
print(DT["b", v2:=84L, on="x"]) # subassign to new column by reference (NA padded)
#> Index: <x>
#> x y v grp v2
#> <char> <num> <int> <int> <int>
#> 1: b 1 1 1 84
#> 2: b 3 2 1 84
#> 3: b 6 3 1 84
#> 4: a 1 42 2 NA
#> 5: a 3 42 2 NA
#> 6: a 6 42 2 NA
#> 7: c 1 7 3 NA
#> 8: c 3 8 3 NA
#> 9: c 6 9 3 NA
DT[, m:=mean(v), by=x][] # add new column by reference by group
#> Index: <x>
#> x y v grp v2 m
#> <char> <num> <int> <int> <int> <num>
#> 1: b 1 1 1 84 2
#> 2: b 3 2 1 84 2
#> 3: b 6 3 1 84 2
#> 4: a 1 42 2 NA 42
#> 5: a 3 42 2 NA 42
#> 6: a 6 42 2 NA 42
#> 7: c 1 7 3 NA 8
#> 8: c 3 8 3 NA 8
#> 9: c 6 9 3 NA 8
# NB: postfix [] is shortcut to print()
# advanced usage
DT = data.table(x=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6), a=1:9, b=9:1)
DT[, sum(v), by=.(y%%2)] # expressions in by
#> y V1
#> <num> <num>
#> 1: 1 9
#> 2: 0 4
DT[, sum(v), by=.(bool = y%%2)] # same, using a named list to change by column name
#> bool V1
#> <num> <num>
#> 1: 1 9
#> 2: 0 4
DT[, .SD[2], by=x] # get 2nd row of each group
#> x v y a b
#> <char> <num> <num> <int> <int>
#> 1: b 1 3 2 8
#> 2: a 2 3 5 5
#> 3: c 2 3 8 2
DT[, tail(.SD,2), by=x] # last 2 rows of each group
#> x v y a b
#> <char> <num> <num> <int> <int>
#> 1: b 1 3 2 8
#> 2: b 1 6 3 7
#> 3: a 2 3 5 5
#> 4: a 1 6 6 4
#> 5: c 2 3 8 2
#> 6: c 2 6 9 1
DT[, lapply(.SD, sum), by=x] # sum of all (other) columns for each group
#> x v y a b
#> <char> <num> <num> <int> <int>
#> 1: b 3 10 6 24
#> 2: a 5 10 15 15
#> 3: c 5 10 24 6
DT[, .SD[which.min(v)], by=x] # nested query by group
#> x v y a b
#> <char> <num> <num> <int> <int>
#> 1: b 1 1 1 9
#> 2: a 1 6 6 4
#> 3: c 1 1 7 3
DT[, list(MySum=sum(v),
MyMin=min(v),
MyMax=max(v)),
by=.(x, y%%2)] # by 2 expressions
#> x y MySum MyMin MyMax
#> <char> <num> <num> <num> <num>
#> 1: b 1 2 1 1
#> 2: b 0 1 1 1
#> 3: a 1 4 2 2
#> 4: a 0 1 1 1
#> 5: c 1 3 1 2
#> 6: c 0 2 2 2
# programmatic query with env=
DT[, .(funvar = fun(var)), by=grp_var,
env = list(fun="sum", var="a", funvar="sum_a_by_x", grp_var="x")]
#> x sum_a_by_x
#> <char> <int>
#> 1: b 6
#> 2: a 15
#> 3: c 24
DT[, .(a = .(a), b = .(b)), by=x] # list columns
#> x a b
#> <char> <list> <list>
#> 1: b 1,2,3 9,8,7
#> 2: a 4,5,6 6,5,4
#> 3: c 7,8,9 3,2,1
DT[, .(seq = min(a):max(b)), by=x] # j is not limited to just aggregations
#> x seq
#> <char> <int>
#> 1: b 1
#> 2: b 2
#> 3: b 3
#> 4: b 4
#> 5: b 5
#> 6: b 6
#> 7: b 7
#> 8: b 8
#> 9: b 9
#> 10: a 4
#> 11: a 5
#> 12: a 6
#> 13: c 7
#> 14: c 6
#> 15: c 5
#> 16: c 4
#> 17: c 3
DT[, sum(v), by=x][V1<20] # compound query
#> x V1
#> <char> <num>
#> 1: b 3
#> 2: a 5
#> 3: c 5
DT[, sum(v), by=x][order(-V1)] # ordering results
#> x V1
#> <char> <num>
#> 1: a 5
#> 2: c 5
#> 3: b 3
DT[, c(.N, lapply(.SD,sum)), by=x] # get number of observations and sum per group
#> x N v y a b
#> <char> <int> <num> <num> <int> <int>
#> 1: b 3 3 10 6 24
#> 2: a 3 5 10 15 15
#> 3: c 3 5 10 24 6
DT[, {tmp <- mean(y);
.(a = a-tmp, b = b-tmp)
}, by=x] # anonymous lambda in 'j', j accepts any valid
#> x a b
#> <char> <num> <num>
#> 1: b -2.3333333 5.6666667
#> 2: b -1.3333333 4.6666667
#> 3: b -0.3333333 3.6666667
#> 4: a 0.6666667 2.6666667
#> 5: a 1.6666667 1.6666667
#> 6: a 2.6666667 0.6666667
#> 7: c 3.6666667 -0.3333333
#> 8: c 4.6666667 -1.3333333
#> 9: c 5.6666667 -2.3333333
# expression. TO REMEMBER: every element of
# the list becomes a column in result.
pdf("new.pdf")
DT[, plot(a,b), by=x] # can also plot in 'j'
#> Empty data.table (0 rows and 1 cols): x
dev.off()
#> pdf
#> 2
# using rleid, get max(y) and min of all cols in .SDcols for each consecutive run of 'v'
DT[, c(.(y=max(y)), lapply(.SD, min)), by=rleid(v), .SDcols=v:b]
#> rleid y v y a b
#> <int> <num> <num> <num> <int> <int>
#> 1: 1 6 1 1 1 7
#> 2: 2 3 2 1 4 5
#> 3: 3 6 1 1 6 3
#> 4: 4 6 2 3 8 1
# Support guide and links:
# https://github.com/Rdatatable/data.table/wiki/Support
if (FALSE) { # \dontrun{
if (interactive()) {
vignette(package="data.table") # 9 vignettes
test.data.table() # 6,000 tests
# keep up to date with latest stable version on CRAN
update.packages()
# get the latest devel version that has passed all tests
update_dev_pkg()
# read more at:
# https://github.com/Rdatatable/data.table/wiki/Installation
}
} # }