
I'm having a little trouble understanding the pass-by-reference properties of data.table. Some operations seem to 'break' the reference, and I'd like to understand exactly what's happening.

When I create a data.table from another data.table (via <-) and then update the new table with :=, the original table is also altered. This is expected, as per:

?data.table::copy

Here's an example:

library(data.table)

DT <- data.table(a=c(1,2), b=c(11,12))

print(DT)

#      a  b

# [1,] 1 11

# [2,] 2 12

newDT <- DT        # reference, not copy

newDT[1, a := 100] # modify new DT

print(DT)          # DT is modified too.

#        a  b

# [1,] 100 11

# [2,]   2 12

However, if I insert a non-:= based modification between the <- assignment and the := lines above, DT is now no longer modified:

DT = data.table(a=c(1,2), b=c(11,12))

newDT <- DT        

newDT$b[2] <- 200  # new operation

newDT[1, a := 100]

print(DT)

#      a  b

# [1,] 1 11

# [2,] 2 12

So it seems that the newDT$b[2] <- 200 line somehow 'breaks' the reference. I'd guess that this invokes a copy somehow, but I would like to understand fully how R is treating these operations, to ensure I don't introduce potential bugs in my code.

I'd very much appreciate it if someone could explain this to me.

1 Answer


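In R, sub-assignment with <- (or = or ->), such as newDT$b[2] <- 200, makes a copy of the whole object before modifying it. Plain assignment such as newDT <- DT does not copy anything; it only binds a second name to the same table. The copy happens at the first sub-assignment, and from that point on newDT and DT are separate objects.

For large data tables, sub-assignment with these operators is best avoided, because copying the whole object can consume a large amount of working memory.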

For example:

library(data.table)

DT <- data.table(a=c(1,2), b=c(11,12))

print(DT)

   a  b

1: 1 11

2: 2 12

newDT <- DT      # second name for the same table; nothing is copied yet

Modifying newDT with a sub-assignment (this is where the copy is made):

newDT$a[1] <- 200

print(newDT)

     a  b

1: 200 11

2:   2 12

Original data table remains the same:

print(DT)

   a  b

1: 1 11

2: 2 12
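
One way to check that the sub-assignment really left newDT pointing at a separate object is to compare memory addresses before and after it. The following is a minimal sketch, assuming data.table's address() helper is available in your installed version; the variable names same_before and same_after are only illustrative.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
newDT <- DT                # plain assignment: a second name, no copy

# Both names should point at the same object at this stage.
same_before <- identical(address(DT), address(newDT))

newDT$b[2] <- 200          # sub-assignment with <- : newDT becomes a copy

# After the sub-assignment the two names refer to different objects.
same_after <- identical(address(DT), address(newDT))

print(c(before = same_before, after = same_after))
print(DT$b)                # the original column is untouched
print(newDT$b)             # only the copy carries the new value
```

If same_after is FALSE, any later := on newDT acts on the copy, which is why DT stops changing in the question's second example.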

By contrast, the data.table operators := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a sub-assignment using <- or an explicit copy(DT)), then it is the copy that gets modified by reference. data.table is designed for large data (and also offers faster programming for small data), so it deliberately avoids copying large objects.

For example:

library(data.table)

DT <- data.table(a=c(1,2), b=c(11,12))

print(DT)

   a  b

1: 1 11

2: 2 12

newDT <- DT          # second name for the same table; no copy

newDT[1, a := 100]   # modify newDT by reference

print(newDT)

     a  b

1: 100 11

2:   2 12

print(DT)            # the original data table is modified too

     a  b

1: 100 11

2:   2 12
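
If the goal is a table that can be changed by reference without touching the original, an explicit copy() up front is the usual route. The sketch below contrasts a plain alias with an explicit copy and uses both := and set(); the names alias and indep are hypothetical and chosen only for this illustration.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))

alias <- DT          # same object under a second name
indep <- copy(DT)    # explicit copy, independent of DT

# := on the alias changes the shared object, so DT changes as well.
alias[1, a := 100]

# set() also assigns by reference, but here it only touches the copy.
set(indep, i = 2L, j = "b", value = 200)

print(DT)            # shows a = 100 in row 1, b unchanged
print(indep)         # shows a unchanged, b = 200 in row 2
```

Making the copy explicit keeps the cost visible at one place in the code, rather than leaving it to happen implicitly at the first sub-assignment.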

 

...