Keys and fast binary search based subset

2026-01-30

Translations of this document are available in: en | es | fr

This vignette is aimed at those who are already familiar with data.table syntax, its general form, how to subset rows in i, select and compute on columns, add/modify/delete columns by reference in j and group by using by. If you’re not familiar with these concepts, please read the vignettes vignette("datatable-intro", package="data.table") and vignette("datatable-reference-semantics", package="data.table") first.


Data

We will use the same flights data as in the vignette("datatable-intro", package="data.table") vignette.

flights <- fread("flights14.csv")
head(flights)
year month day dep_delay arr_delay carrier origin dest air_time distance hour
2014 1 1 14 13 AA JFK LAX 359 2475 9
2014 1 1 -3 13 AA JFK LAX 363 2475 11
2014 1 1 2 9 AA JFK LAX 351 2475 19
2014 1 1 -8 -26 AA LGA PBI 157 1035 7
2014 1 1 2 1 AA JFK LAX 350 2475 13
2014 1 1 4 0 AA EWR LAX 339 2454 18
dim(flights)
# [1] 253316     11

Introduction

In this vignette, we will

1. Keys

a) What is a key?

In the vignette("datatable-intro", package="data.table") vignette, we saw how to subset rows in i using logical expressions, row numbers and using order(). In this section, we will look at another way of subsetting incredibly fast - using keys.

But first, let’s start by looking at data.frames. All data.frames have a row names attribute. Consider the data.frame DF below.

set.seed(1L)
DF = data.frame(ID1 = sample(letters[1:2], 10, TRUE),
                ID2 = sample(1:3, 10, TRUE),
                val = sample(10),
                stringsAsFactors = FALSE,
                row.names = sample(LETTERS[1:10]))
DF
ID1 ID2 val
I a 1 10
D a 3 9
G a 1 4
A a 1 7
B a 1 1
E b 1 8
C b 2 3
J b 1 2
F b 1 5
H a 2 6
rownames(DF)
#  [1] "I" "D" "G" "A" "B" "E" "C" "J" "F" "H"

We can subset a particular row using its row name as shown below:

DF["C", ]
ID1 ID2 val
C b 2 3

i.e., row names are more or less an index to rows of a data.frame. However,

  1. Each row is limited to exactly one row name.

    But, a person (for example) has at least two names - a first and a second name. It is useful to organise a telephone directory by surname then first name.

  2. And row names should be unique.

    rownames(DF) = sample(LETTERS[1:5], 10, TRUE)
    # Warning: non-unique values when setting 'row.names': 'C', 'D'
    # Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
    

Now let’s convert it to a data.table.

DT = as.data.table(DF)
DT
ID1 ID2 val
a 1 10
a 3 9
a 1 4
a 1 7
a 1 1
b 1 8
b 2 3
b 1 2
b 1 5
a 2 6
rownames(DT)
#  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

Instead, in data.tables we set and use keys. Think of a key as supercharged rownames.

Keys and their properties

  1. We can set keys on multiple columns and the column can be of different typesinteger, numeric, character, factor, integer64 etc. list and complex types are not supported yet.

  2. Uniqueness is not enforced, i.e., duplicate key values are allowed. Since rows are sorted by key, any duplicates in the key columns will appear consecutively.

  3. Setting a key does two things:

    a. physically reorders the rows of the data.table by the column(s) provided by reference, always in increasing order.

    b. marks those columns as key columns by setting an attribute called sorted to the data.table.

    Since the rows are reordered, a data.table can have at most one key because it can not be sorted in more than one way.

For the rest of the vignette, we will work with flights data set.

b) Set, get and use keys on a data.table

– How can we set the column origin as key in the data.table flights?

setkey(flights, origin)
head(flights)
year month day dep_delay arr_delay carrier origin dest air_time distance hour
2014 1 1 4 0 AA EWR LAX 339 2454 18
2014 1 1 -5 -17 AA EWR MIA 161 1085 16
2014 1 1 191 185 AA EWR DFW 214 1372 16
2014 1 1 -1 -2 AA EWR DFW 214 1372 14
2014 1 1 -3 -10 AA EWR MIA 154 1085 6
2014 1 1 4 -17 AA EWR DFW 215 1372 9
## alternatively we can provide character vectors to the function 'setkeyv()'
# setkeyv(flights, "origin") # useful to program with

set* and :=:

In data.table, the := operator and all the set* (e.g., setkey, setorder, setnames etc.) functions are the only ones which modify the input object by reference.

Once you key a data.table by certain columns, you can subset by querying those key columns using the .() notation in i. Recall that .() is an alias to list().

– Use the key column origin to subset all rows where the origin airport matches “JFK”

flights[.("JFK")]
year month day dep_delay arr_delay carrier origin dest air_time distance hour
2014 1 1 14 13 AA JFK LAX 359 2475 9
2014 1 1 -3 13 AA JFK LAX 363 2475 11
2014 1 1 2 9 AA JFK LAX 351 2475 19
2014 1 1 2 1 AA JFK LAX 350 2475 13
2014 1 1 -2 -18 AA JFK LAX 338 2475 21
2014 10 31 -4 -21 UA JFK SFO 337 2586 17
2014 10 31 -2 -37 UA JFK SFO 344 2586 18
2014 10 31 0 -33 UA JFK LAX 320 2475 17
2014 10 31 -6 -38 UA JFK SFO 343 2586 9
2014 10 31 -6 -38 UA JFK LAX 323 2475 11
## alternatively
# flights[J("JFK")] (or)
# flights[list("JFK")]

– How can we get the column(s) a data.table is keyed by?

Using the function key().

key(flights)
# [1] "origin"

c) Keys and multiple columns

To refresh, keys are like supercharged row names. We can set key on multiple columns and they can be of multiple types.

– How can I set keys on both origin and dest columns?

setkey(flights, origin, dest)
head(flights)
year month day dep_delay arr_delay carrier origin dest air_time distance hour
2014 1 2 -2 -25 EV EWR ALB 30 143 7
2014 1 3 88 79 EV EWR ALB 29 143 23
2014 1 4 220 211 EV EWR ALB 32 143 15
2014 1 4 35 19 EV EWR ALB 32 143 7
2014 1 5 47 42 EV EWR ALB 26 143 8
2014 1 5 66 62 EV EWR ALB 31 143 23
## or alternatively
# setkeyv(flights, c("origin", "dest")) # provide a character vector of column names

key(flights)
# [1] "origin" "dest"  

– Subset all rows using key columns where first key column origin matches “JFK” and second key column dest matches “MIA”

flights[.("JFK", "MIA")]
year month day dep_delay arr_delay carrier origin dest air_time distance hour
2014 1 1 -1 -17 AA JFK MIA 161 1089 15
2014 1 1 7 -8 AA JFK MIA 166 1089 9
2014 1 1 2 -1 AA JFK MIA 164 1089 12
2014 1 1 6 3 AA JFK MIA 157 1089 5
2014 1 1 6 -12 AA JFK MIA 154 1089 17
2014 10 31 -1 -22 AA JFK MIA 148 1089 16
2014 10 31 -3 -20 AA JFK MIA 146 1089 8
2014 10 31 2 -17 AA JFK MIA 150 1089 6
2014 10 31 -3 -12 AA JFK MIA 150 1089 5
2014 10 31 29 4 AA JFK MIA 146 1089 19

How does the subset work here?

– Subset all rows where just the first key column origin matches “JFK”

key(flights)
# [1] "origin" "dest"  
flights[.("JFK")] ## or in this case simply flights["JFK"], for convenience
year month day dep_delay arr_delay carrier origin dest air_time distance hour
2014 1 1 10 4 B6 JFK ABQ 280 1826 20
2014 1 2 134 161 B6 JFK ABQ 252 1826 22
2014 1 7 6 6 B6 JFK ABQ 269 1826 20
2014 1 8 15 -15 B6 JFK ABQ 259 1826 20
2014 1 9 45 32 B6 JFK ABQ 267 1826 20
2014 10 31 0 -18 DL JFK TPA 142 1005 8
2014 10 31 1 -8 B6 JFK TPA 149 1005 19
2014 10 31 -2 -22 B6 JFK TPA 145 1005 14
2014 10 31 -8 -5 B6 JFK TPA 149 1005 9
2014 10 31 -4 -18 B6 JFK TPA 145 1005 8

– Subset all rows where just the second key column dest matches “MIA”

flights[.(unique(origin), "MIA")]
year month day dep_delay arr_delay carrier origin dest air_time distance hour
2014 1 1 -5 -17 AA EWR MIA 161 1085 16
2014 1 1 -3 -10 AA EWR MIA 154 1085 6
2014 1 1 -5 -8 AA EWR MIA 157 1085 11
2014 1 1 43 42 UA EWR MIA 155 1085 15
2014 1 1 60 49 UA EWR MIA 162 1085 21
2014 10 31 -11 -8 AA LGA MIA 157 1096 13
2014 10 31 -5 -11 AA LGA MIA 150 1096 9
2014 10 31 -2 10 AA LGA MIA 156 1096 6
2014 10 31 -2 -16 AA LGA MIA 156 1096 19
2014 10 31 1 -11 US LGA MIA 164 1096 15

What’s happening here?

2. Combining keys with j and by

All we have seen so far is the same concept – obtaining row indices in i, but just using a different method – using keys. It shouldn’t be surprising that we can do exactly the same things in j and by as seen from the previous vignettes. We will highlight this with a few examples.

a) Select in j

– Return arr_delay column as a data.table corresponding to origin = "LGA" and dest = "TPA".

key(flights)
# [1] "origin" "dest"  
flights[.("LGA", "TPA"), .(arr_delay)]
arr_delay
1
14
-17
-4
-12
39
-24
-12
21
-11

b) Chaining

– On the result obtained above, use chaining to order the column in decreasing order.

flights[.("LGA", "TPA"), .(arr_delay)][order(-arr_delay)]
arr_delay
486
380
351
318
300
-40
-43
-46
-48
-49

c) Compute or do in j

– Find the maximum arrival delay corresponding to origin = "LGA" and dest = "TPA".

flights[.("LGA", "TPA"), max(arr_delay)]
# [1] 486

d) sub-assign by reference using := in j

We have seen this example already in the vignette("datatable-reference-semantics", package="data.table") vignette. Let’s take a look at all the hours available in the flights data.table:

# get all 'hours' in flights
flights[, sort(unique(hour))]
#  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

We see that there are totally 25 unique values in the data. Both 0 and 24 hours seem to be present. Let’s go ahead and replace 24 with 0, but this time using key.

setkey(flights, hour)
key(flights)
# [1] "hour"
flights[.(24), hour := 0L]
key(flights)
# NULL

Now, there shouldn’t be any 24 in the hour column.

flights[, sort(unique(hour))]
#  [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

e) Aggregation using by

Let’s set the key back to origin, dest first.

setkey(flights, origin, dest)
key(flights)
# [1] "origin" "dest"  

– Get the maximum departure delay for each month corresponding to origin = "JFK". Order the result by month

ans <- flights["JFK", max(dep_delay), keyby = month]
head(ans)
month V1
1 881
2 1014
3 920
4 1241
5 853
6 798
key(ans)
# [1] "month"

3. Additional arguments - mult and nomatch

a) The mult argument

We can choose, for each query, if “all” the matching rows should be returned, or just the “first” or “last” using the mult argument. The default value is “all” - what we’ve seen so far.

– Subset only the first matching row from all rows where origin matches “JFK” and dest matches “MIA”

flights[.("JFK", "MIA"), mult = "first"]
year month day dep_delay arr_delay carrier origin dest air_time distance hour
2014 1 1 6 3 AA JFK MIA 157 1089 5

– Subset only the last matching row of all the rows where origin matches “LGA”, “JFK”, “EWR” and dest matches “XNA”

flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last"]
year month day dep_delay arr_delay carrier origin dest air_time distance hour
2014 5 23 163 148 MQ LGA XNA 158 1147 18
JFK XNA
2014 2 3 231 268 EV EWR XNA 184 1131 12

b) The nomatch argument

We can choose if queries that do not match should return NA or be skipped altogether using the nomatch argument.

– From the previous example, Subset all rows only if there’s a match

flights[.(c("LGA", "JFK", "EWR"), "XNA"), mult = "last", nomatch = NULL]
year month day dep_delay arr_delay carrier origin dest air_time distance hour
2014 5 23 163 148 MQ LGA XNA 158 1147 18
2014 2 3 231 268 EV EWR XNA 184 1131 12

4. binary search vs vector scans

We have seen so far how we can set and use keys to subset. But what’s the advantage? For example, instead of doing:

# key by origin,dest columns
flights[.("JFK", "MIA")]

we could have done:

flights[origin == "JFK" & dest == "MIA"]

One advantage very likely is shorter syntax. But even more than that, binary search based subsets are incredibly fast.

As the time goes data.table gets new optimization and currently the latter call is automatically optimized to use binary search.
To use slow vector scan key needs to be removed.

setkey(flights, NULL)
flights[origin == "JFK" & dest == "MIA"]

a) Performance of binary search approach

To illustrate, let’s create a sample data.table with 20 million rows and three columns and key it by columns x and y.

set.seed(2L)
N = 2e7L
DT = data.table(x = sample(letters, N, TRUE),
                y = sample(1000L, N, TRUE),
                val = runif(N))
print(object.size(DT), units = "MiB")
# 381.5 MiB

DT is ~380MiB. It is not really huge, but this will do to illustrate the point.

From what we have seen in the Introduction to data.table section, we can subset those rows where columns x = "g" and y = 877 as follows:

key(DT)
# NULL
## (1) Usual way of subsetting - vector scan approach
t1 <- system.time(ans1 <- DT[x == "g" & y == 877L])
t1
#    user  system elapsed 
#   0.379   0.121   0.500 
head(ans1)
x y val
g 877 0.571
g 877 0.749
g 877 0.036
g 877 0.281
g 877 0.837
g 877 0.439
dim(ans1)
# [1] 762   3

Now let’s try to subset by using keys.

setkeyv(DT, c("x", "y"))
key(DT)
# [1] "x" "y"
## (2) Subsetting using keys
t2 <- system.time(ans2 <- DT[.("g", 877L)])
t2
#    user  system elapsed 
#   0.001   0.000   0.001 
head(ans2)
x y val
g 877 0.571
g 877 0.749
g 877 0.036
g 877 0.281
g 877 0.837
g 877 0.439
dim(ans2)
# [1] 762   3
identical(ans1$val, ans2$val)
# [1] TRUE

b) Why does keying a data.table result in blazing fast subsets?

To understand that, let’s first look at what vector scan approach (method 1) does.

Vector scan approach

This is what we call a vector scan approach. And this is quite inefficient, especially on larger tables and when one needs repeated subsetting, because it has to scan through all the rows each time.

Now let us look at binary search approach (method 2). Recall from Properties of key - setting keys reorders the data.table by key columns. Since the data is sorted, we don’t have to scan through the entire length of the column! We can instead use binary search to search a value in O(log n) as opposed to O(n) in case of vector scan approach, where n is the number of rows in the data.table.

Binary search approach

Here’s a very simple illustration. Let’s consider the (sorted) numbers shown below:

1, 5, 10, 19, 22, 23, 30

Suppose we’d like to find the matching position of the value 1, using binary search, this is how we would proceed - because we know that the data is sorted.

A vector scan approach on the other hand would have to scan through all the values (here, 7).

It can be seen that with every search we reduce the number of searches by half. This is why binary search based subsets are incredibly fast. Since rows of each column of data.tables have contiguous locations in memory, the operations are performed in a very cache efficient manner (also contributes to speed).

In addition, since we obtain the matching row indices directly without having to create those huge logical vectors (equal to the number of rows in a data.table), it is quite memory efficient as well.

Summary

In this vignette, we have learnt another method to subset rows in i by keying a data.table. Setting keys allows us to perform blazing fast subsets by using binary search. In particular, we have seen how to

Key based subsets are incredibly fast and are particularly useful when the task involves repeated subsetting. But it may not be always desirable to set key and physically reorder the data.table. In the next next vignette (vignette("datatable-secondary-indices-and-auto-indexing", package="data.table")), we will address this using a new feature – secondary indexes.