
I have very large tables (30 million rows) that I would like to load as a data frame in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I know the types of the columns ahead of time, the table does not contain any column headers or row names, and it does not have any pathological characters that I have to worry about.

I know that reading in a table as a list using scan() can be quite fast, e.g.:

datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))

But some of my attempts to convert this into a data frame appear to slow it down by a factor of 6:

df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))

Is there a better way of doing this? Or quite possibly a completely different approach to the problem?

1 Answer


You can use the fread() function from the data.table package in R to import large tables very quickly.

For example, to measure the time fread() takes to read a table of 1 million rows, first generate a test file:

library(data.table)

# Build a 1-million-row table with integer, numeric and character columns,
# plus a few NA, empty-string and Inf values, then write it to CSV.
n = 1e6
DT = data.table(
  a = sample(1:1000, n, replace = TRUE),
  b = sample(1:1000, n, replace = TRUE),
  c = rnorm(n),
  d = sample(c("foo", "bar", "baz", "qux", "quux"), n, replace = TRUE),
  e = rnorm(n),
  f = sample(1:1000, n, replace = TRUE)
)
DT[2, b := NA_integer_]
DT[4, c := NA_real_]
DT[3, d := NA_character_]
DT[5, d := ""]
DT[2, e := +Inf]
DT[3, e := -Inf]
write.table(DT, "test.csv", sep = ",", row.names = FALSE, quote = FALSE)
cat("File size (MB):", round(file.info("test.csv")$size / 1024^2), "\n")

Then read it back with fread() and time it:

require(data.table)
system.time(DT <- fread("test.csv"))

Output:

system.time(DT <- fread("test.csv"))
   user  system elapsed
   0.07    0.05    0.09
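Applied to the tab-separated, header-less file from the question, the call could look something like the sketch below. The file name and column names are taken from the scan() call in the question; the colClasses values are an assumption based on that template (one character column, three numeric).

library(data.table)
# Sketch only: 'myfile' and the column names come from the question;
# colClasses is assumed from the scan() template (url is character, the rest numeric).
df <- fread("myfile",
            sep = "\t",
            header = FALSE,
            colClasses = c("character", "numeric", "numeric", "numeric"),
            col.names = c("url", "popularity", "mintime", "maxtime"),
            data.table = FALSE)   # return a plain data.frame instead of a data.table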

Comparison of fread() with other reading functions:

##  user  system  elapsed   Method
##  2.59    0.08     2.70   read.csv (first time)
##  2.61    0.09     2.72   read.csv (second time)
##  1.08    0.06     1.14   Optimized read.table
##  0.13    0.03     0.08   fread
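The "Optimized read.table" row refers to tuning read.table with hints such as colClasses, nrows, quote and comment.char. The exact call behind that timing is not shown, so the following is only a sketch along those lines for the test.csv generated above.

# A guess at an "optimized" read.table call for test.csv; the column classes
# match the six columns created in the example above.
df <- read.table("test.csv",
                 header = TRUE,
                 sep = ",",
                 quote = "",
                 comment.char = "",
                 nrows = 1e6,
                 colClasses = c("integer", "integer", "numeric",
                                "character", "numeric", "integer"),
                 stringsAsFactors = FALSE)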
