Determine Duplicate Rows

duplicated returns a logical vector indicating which rows of a data.table are duplicates of a row with a smaller subscript.

unique returns a data.table with duplicated rows removed, considering only the columns specified in the by argument. When by is not supplied, rows that are duplicated across all columns are removed.

anyDuplicated returns the index i of the first duplicated entry if there is one, and 0 otherwise.

uniqueN is equivalent to length(unique(x)) when x is an atomic vector, and nrow(unique(x)) when x is a data.frame or data.table. The number of unique rows is computed directly, without materialising the intermediate unique data.table, and is therefore faster and more memory efficient.
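
The equivalence above can be checked with a minimal sketch. The vector x and the table DT below are made-up illustrations, not part of the package's own examples.

```r
library(data.table)

# Atomic vector: uniqueN(x) should match length(unique(x))
x <- c(3L, 1L, 3L, 2L, 1L)
n_vec <- uniqueN(x)
stopifnot(n_vec == length(unique(x)))

# data.table: uniqueN(DT) should match nrow(unique(DT))
DT <- data.table(a = c(1L, 1L, 2L), b = c("x", "x", "y"))
n_tab <- uniqueN(DT)
stopifnot(n_tab == nrow(unique(DT)))
```

Both forms give the same count; the difference is that uniqueN does not build the de-duplicated object first.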

Usage

# S3 method for class 'data.table'
duplicated(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...)

# S3 method for class 'data.table'
unique(x, incomparables=FALSE, fromLast=FALSE,
by=seq_along(x), cols=NULL, ...)

# S3 method for class 'data.table'
anyDuplicated(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...)

uniqueN(x, by=if (is.list(x)) seq_along(x) else NULL, na.rm=FALSE)

Arguments

x: A data.table. uniqueN accepts atomic vectors and data.frames as well.
...: Not used at this time.
incomparables: Not used. Here for S3 method consistency.
fromLast: Logical indicating if duplication should be considered from the reverse side. For duplicated, this means the last (or rightmost) of identical elements will correspond to duplicated = FALSE. For unique, this means the last (or rightmost) of identical elements will be kept. See examples.
by: Character or integer vector indicating which combinations of columns from x to use for uniqueness checks. By default, all columns are used, for consistency with the data.frame methods. In versions before 1.9.8, the default was key(x).
cols: Columns (in addition to by) from x to include in the resulting data.table.
na.rm: Logical (default is FALSE). Should missing values (including NaN) be removed?
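
As an illustration of the by and na.rm arguments described above, here is a small sketch on made-up data (DT and v are hypothetical). It assumes that by accepts column names and column positions interchangeably, as the argument description states.

```r
library(data.table)

DT <- data.table(A = c(1, 1, 2), B = c(1, 2, 1), C = c("a", "b", "a"))

# Column names and column positions select the same columns
by_name <- duplicated(DT, by = c("B", "C"))
by_pos  <- duplicated(DT, by = c(2L, 3L))
stopifnot(identical(by_name, by_pos))

# na.rm = TRUE drops missing values (NA and NaN) before counting
v <- c(1.5, NA, NaN, 1.5)
n_all   <- uniqueN(v)                # counts 1.5, NA and NaN separately
n_no_na <- uniqueN(v, na.rm = TRUE)  # counts only 1.5
```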

Details

Because data.tables are usually sorted by key, tests for duplication are especially quick when only the keyed columns are considered. Unlike unique.data.frame, paste is not used to ensure equality of floating point data. Equality is instead checked directly, which is quite fast. data.table provides setNumericRounding to handle cases where limitations in floating point representation are undesirable.

v1.9.4 introduced an anyDuplicated method for data.tables, which is similar to the base method in functionality. It also implemented the logical argument fromLast for all three functions, with default value FALSE.
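
The effect of setNumericRounding on uniqueness checks can be sketched as follows. The two-row table is a made-up example whose values differ only in the last bits of their floating point representation. The sketch assumes that rounding off the last 2 bytes is enough to make them compare equal.

```r
library(data.table)

DT <- data.table(a = c(0.1 + 0.2, 0.3))  # values differ only in the last bits
old <- getNumericRounding()              # remember the current setting

setNumericRounding(0L)                   # compare doubles exactly
n_exact <- uniqueN(DT)

setNumericRounding(2L)                   # round off the last 2 bytes before comparing
n_rounded <- uniqueN(DT)

setNumericRounding(old)                  # restore the previous setting
```

With exact comparison the two values count as distinct. With rounding enabled they are expected to collapse into a single unique value.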

Note: When cols is specified, the resulting table will have columns c(by, cols), in that order.

Value

duplicated returns a logical vector of length nrow(x) indicating which rows are duplicates.

unique returns a data.table with duplicated rows removed.

anyDuplicated returns an integer value with the index of the first duplicate. If none exists, 0L is returned.

uniqueN returns the number of unique elements in the vector, data.frame or data.table.
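
The four return values relate to each other in a simple way, which the following sketch checks on a small made-up table. all.equal is used for the table comparison to avoid depending on attribute-level identity.

```r
library(data.table)

DT <- data.table(A = c(1, 1, 2, 2, 3), B = c(1, 2, 1, 1, 2))

dup   <- duplicated(DT, by = "B")     # logical vector, one entry per row
first <- anyDuplicated(DT, by = "B")  # index of the first duplicated row, or 0L
kept  <- unique(DT, by = "B")         # rows that are not flagged as duplicates

stopifnot(length(dup) == nrow(DT))
stopifnot(first == which(dup)[1])
stopifnot(isTRUE(all.equal(kept, DT[!dup])))
stopifnot(uniqueN(DT, by = "B") == nrow(kept))
```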

Examples

DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3),
                  C = rep(1:2, 6), key = c("A", "B"))
duplicated(DT)
#>  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
unique(DT)
#> Key: <A, B>
#>         A     B     C
#>     <int> <int> <int>
#>  1:     1     1     1
#>  2:     1     1     2
#>  3:     1     2     2
#>  4:     2     2     1
#>  5:     2     2     2
#>  6:     2     3     1
#>  7:     2     3     2
#>  8:     3     3     1
#>  9:     3     4     2
#> 10:     3     4     1

duplicated(DT, by="B")
#>  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
unique(DT, by="B")
#> Key: <A, B>
#>        A     B     C
#>    <int> <int> <int>
#> 1:     1     1     1
#> 2:     1     2     2
#> 3:     2     3     1
#> 4:     3     4     2

duplicated(DT, by=c("A", "C"))
#>  [1] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
unique(DT, by=c("A", "C"))
#> Key: <A, B>
#>        A     B     C
#>    <int> <int> <int>
#> 1:     1     1     1
#> 2:     1     1     2
#> 3:     2     2     1
#> 4:     2     2     2
#> 5:     3     3     1
#> 6:     3     4     2

DT = data.table(a=c(2L,1L,2L), b=c(1L,2L,1L))   # no key
unique(DT)                   # rows 1 and 2 (row 3 is a duplicate of row 1)
#>        a     b
#>    <int> <int>
#> 1:     2     1
#> 2:     1     2

DT = data.table(a=c(3.142, 4.2, 4.2, 3.142, 1.223, 1.223), b=rep(1,6))
unique(DT)                   # rows 1,2 and 5
#>        a     b
#>    <num> <num>
#> 1: 3.142     1
#> 2: 4.200     1
#> 3: 1.223     1

DT = data.table(a=tan(pi*(1/4 + 1:10)), b=rep(1,10))   # example from ?all.equal
length(unique(DT$a))         # 10 strictly unique floating point values
#> [1] 10
all.equal(DT$a,rep(1,10))    # TRUE, all within tolerance of 1.0
#> [1] TRUE
DT[,which.min(a)]            # row 10, the strictly smallest floating point value
#> [1] 10
identical(unique(DT),DT[1])  # FALSE, with the default setNumericRounding(0) all 10 rows are kept
#> [1] FALSE
identical(unique(DT),DT[10]) # FALSE
#> [1] FALSE

# fromLast = TRUE vs. FALSE
DT <- data.table(A = c(1, 1, 2, 2, 3), B = c(1, 2, 1, 1, 2), C = c("a", "b", "a", "b", "a"))

duplicated(DT, by="B", fromLast=FALSE) # rows 3,4,5 are duplicates
#> [1] FALSE FALSE  TRUE  TRUE  TRUE
unique(DT, by="B", fromLast=FALSE) # equivalent: DT[!duplicated(DT, by="B", fromLast=FALSE)]
#>        A     B      C
#>    <num> <num> <char>
#> 1:     1     1      a
#> 2:     1     2      b

duplicated(DT, by="B", fromLast=TRUE) # rows 1,2,3 are duplicates
#> [1]  TRUE  TRUE  TRUE FALSE FALSE
unique(DT, by="B", fromLast=TRUE) # equivalent: DT[!duplicated(DT, by="B", fromLast=TRUE)]
#>        A     B      C
#>    <num> <num> <char>
#> 1:     2     1      b
#> 2:     3     2      a

# anyDuplicated
anyDuplicated(DT, by=c("A", "B"))    # 4L
#> [1] 4
any(duplicated(DT, by=c("A", "B")))  # TRUE
#> [1] TRUE

# uniqueN, unique rows on key columns (this DT has no key, so key(DT) is NULL)
uniqueN(DT, by = key(DT))
#> [1] 5
# uniqueN, unique rows on all columns
uniqueN(DT)
#> [1] 5
# uniqueN while grouped by "A"
DT[, .(uN=uniqueN(.SD)), by=A]
#>        A    uN
#>    <num> <int>
#> 1:     1     2
#> 2:     2     2
#> 3:     3     1

# uniqueN's na.rm=TRUE
x = sample(c(NA, NaN, runif(3)), 10, TRUE)
uniqueN(x, na.rm = FALSE) # at most 5 (depends on the random sample), default
#> [1] 4
uniqueN(x, na.rm=TRUE) # at most 3
#> [1] 3
