Fast overlap joins
foverlaps.RdA fast binary-search based overlap join of two data.tables.
This is very much inspired by findOverlaps function from the Bioconductor
package IRanges (see link below under See Also).
Usually, x is a very large data.table with small interval ranges, and
y is much smaller keyed data.table with relatively larger
interval spans. For a usage in genomics, see the examples section.
NOTE: This is still under development, meaning it is stable, but some features are yet to be implemented. Also, some arguments and/or the function name itself could be changed.
Arguments
- x, y
data.tables.yneeds to be keyed, but not necessarilyx. See examples.- by.x, by.y
A vector of column names (or numbers) to compute the overlap joins. The last two columns in both
by.xandby.yshould each correspond to thestartandendinterval columns inxandyrespectively. We should always havestart <= end. Ifxis keyed,by.xis equal tokey(x), elsekey(y).by.ydefaults tokey(y).- maxgap
Non-negative integer, i.e.,
maxgap >= 0. Default is 0 (no gap). For intervals[a,b]and[c,d], wherea<=bandc<=d, whenc > bord < a, the two intervals don't overlap. If the gap between these two intervals is<= maxgap, these two intervals are considered as overlapping. Note: This is not yet implemented.- minoverlap
Positive integer, i.e.,
minoverlap > 0. Default is 1. For intervals[a,b]and[c,d], wherea<=bandc<=d, whenc<=bandd>=a, the two intervals overlap. If the length of overlap between these two intervals is>= minoverlap, then these two intervals are considered to be overlapping. Note: This is not yet implemented.- type
Default value is
any. Allowed values areany,within,start,endandequal.The types shown here are identical in functionality to the function
findOverlapsin the bioconductor packageIRanges. Let[a,b]and[c,d]be intervals inxandywitha<=bandc<=d. Fortype="start", the intervals overlap iffa == c. Fortype="end", the intervals overlap iffb == d. Fortype="within", the intervals overlap iffa>=c and b<=d. Fortype="equal", the intervals overlap iffa==c and b==d. Fortype="any", as long asc<=b and d>=a, they overlap. In addition to these requirements, they also have to satisfy theminoverlapargument as explained above.NB:
maxgapargument, when > 0, is to be interpreted according to the type of the overlap. This will be updated oncemaxgapis implemented.- mult
When multiple rows in
ymatch to the row inx,mult=.controls which values are returned -"all"(default),"first"or"last".- nomatch
When a row (with interval say,
[a,b]) inxhas no match iny,nomatch=NA(default) meansNAis returned fory's non-by.ycolumns for that row ofx.nomatch=NULL(or0for backward compatibility) means no rows will be returned for that row ofx.- which
When
TRUE, ifmult="all"returns a two columndata.tablewith the first column corresponding tox's row number and the second corresponding toy's. Whennomatch=NA, no matches returnNAfory, and ifnomatch=NULL, those rows where no match is found will be skipped; ifmult="first" or "last", a vector of length equal to the number of rows inxis returned, with no-match entries filled withNAor0corresponding to thenomatchargument. Default isFALSE, which returns a join with the rows iny.- verbose
TRUEturns on status and information messages to the console. Turn this on by default usingoptions(datatable.verbose=TRUE). The quantity and types of verbosity may be expanded in future.
Details
Very briefly, foverlaps() collapses the two-column interval in y
to one-column of unique values to generate a lookup table, and
then performs the join depending on the type of overlap, using the
already available binary search feature of data.table. The time
(and space) required to generate the lookup is therefore proportional
to the number of unique values present in the interval columns of y
when combined together.
Overlap joins takes advantage of the fact that y is sorted to speed-up
finding overlaps. Therefore y has to be keyed (see ?setkey)
prior to running foverlaps(). A key on x is not necessary,
although it might speed things further. The columns in by.x
argument should correspond to the columns specified in by.y. The last
two columns should be the interval columns in both by.x and
by.y. The first interval column in by.x should always be <= the
second interval column in by.x, and likewise for by.y. The
storage.mode of the interval columns must be either double
or integer. It therefore works with bit64::integer64 type as well.
The lookup generation step could be quite time consuming if the number
of unique values in y are too large (ex: in the order of tens of millions).
There might be improvements possible by constructing lookup using RLE, which is
a pending feature request. However most scenarios will not have too many unique
values for y.
Value
A new data.table by joining over the interval columns (along with other
additional identifier columns) specified in by.x and by.y.
NB: When which=TRUE: a) mult="first" or "last" returns a
vector of matching row numbers in y, and b) when
mult="all" returns a data.table with two columns with the first
containing row numbers of x and the second column with corresponding
row numbers of y.
nomatch=NA|NULL also influences whether non-matching rows are returned
or not, as explained above.
Examples
require(data.table)
## simple example:
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
#> xid yid
#> <int> <int>
#> 1: 1 NA
#> 2: 2 2
#> 3: 2 3
#> 4: 3 2
#> 5: 4 NA
foverlaps(x, y, type="any") ## return overlap join
#> start end val1 i.start i.end val2
#> <num> <num> <int> <num> <num> <int>
#> 1: NA NA NA 5 8 7
#> 2: 20 35 2 31 50 8
#> 3: 30 45 3 31 50 8
#> 4: 20 35 2 22 25 9
#> 5: NA NA NA 16 18 10
foverlaps(x, y, type="any", mult="first") ## returns only first match
#> start end val1 i.start i.end val2
#> <num> <num> <int> <num> <num> <int>
#> 1: NA NA NA 5 8 7
#> 2: 20 35 2 31 50 8
#> 3: 20 35 2 22 25 9
#> 4: NA NA NA 16 18 10
foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'
#> start end val1 i.start i.end val2
#> <num> <num> <int> <num> <num> <int>
#> 1: NA NA NA 5 8 7
#> 2: NA NA NA 31 50 8
#> 3: 20 35 2 22 25 9
#> 4: NA NA NA 16 18 10
## with extra identifiers (ex: in genomics)
x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, type="any", which=TRUE)
#> xid yid
#> <int> <int>
#> 1: 1 NA
#> 2: 2 2
#> 3: 3 3
#> 4: 4 3
#> 5: 5 3
foverlaps(x, y, type="any")
#> chr start end geneid i.start i.end
#> <char> <num> <num> <char> <num> <num>
#> 1: Chr1 NA NA <NA> 5 11
#> 2: Chr1 15 18 b 10 20
#> 3: Chr2 1 55 c 1 4
#> 4: Chr2 1 55 c 25 52
#> 5: Chr2 1 55 c 50 60
foverlaps(x, y, type="any", nomatch=NULL)
#> chr start end geneid i.start i.end
#> <char> <num> <num> <char> <num> <num>
#> 1: Chr1 15 18 b 10 20
#> 2: Chr2 1 55 c 1 4
#> 3: Chr2 1 55 c 25 52
#> 4: Chr2 1 55 c 50 60
foverlaps(x, y, type="within", which=TRUE)
#> xid yid
#> <int> <int>
#> 1: 1 NA
#> 2: 2 NA
#> 3: 3 3
#> 4: 4 3
#> 5: 5 NA
foverlaps(x, y, type="within")
#> chr start end geneid i.start i.end
#> <char> <num> <num> <char> <num> <num>
#> 1: Chr1 NA NA <NA> 5 11
#> 2: Chr1 NA NA <NA> 10 20
#> 3: Chr2 1 55 c 1 4
#> 4: Chr2 1 55 c 25 52
#> 5: Chr2 NA NA <NA> 50 60
foverlaps(x, y, type="start")
#> chr start end geneid i.start i.end
#> <char> <num> <num> <char> <num> <num>
#> 1: Chr1 NA NA <NA> 5 11
#> 2: Chr1 NA NA <NA> 10 20
#> 3: Chr2 1 55 c 1 4
#> 4: Chr2 NA NA <NA> 25 52
#> 5: Chr2 NA NA <NA> 50 60
## x and y have different column names - specify by.x
x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, by.x=c("seq", "start", "end"),
type="any", which=TRUE)
#> xid yid
#> <int> <int>
#> 1: 1 NA
#> 2: 2 2
#> 3: 3 3
#> 4: 4 3
#> 5: 5 3