
I have very large tables (30 million rows) that I would like to load as a data frame in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I know the types of the columns ahead of time, the table does not contain any column headers or row names, and it does not have any pathological characters that I have to worry about.

I know that reading in a table as a list using scan() can be quite fast, e.g.:

datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))

But some of my attempts to convert this into a data frame appear to slow it down by a factor of 6:

df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))

Is there a better way of doing this? Or quite possibly a completely different approach to the problem?

1 Answer


You can use the fread() function from the data.table package in R to import large tables very quickly.

For example, to measure the time fread() takes to read a table of 1 million rows, first generate a test file:

library(data.table)

# Build a 1-million-row table with integer, numeric and character columns,
# plus a few NA, empty-string and Inf values, then write it to CSV.
n = 1e6
DT = data.table(
  a = sample(1:1000, n, replace = TRUE),
  b = sample(1:1000, n, replace = TRUE),
  c = rnorm(n),
  d = sample(c("foo", "bar", "baz", "qux", "quux"), n, replace = TRUE),
  e = rnorm(n),
  f = sample(1:1000, n, replace = TRUE)
)
DT[2, b := NA_integer_]
DT[4, c := NA_real_]
DT[3, d := NA_character_]
DT[5, d := ""]
DT[2, e := +Inf]
DT[3, e := -Inf]
write.table(DT, "test.csv", sep = ",", row.names = FALSE, quote = FALSE)
cat("File size (MB):", round(file.info("test.csv")$size / 1024^2), "\n")

Then read it back with fread() and time it:

require(data.table)
system.time(DT <- fread("test.csv"))

Output:

system.time(DT <- fread("test.csv"))
   user  system elapsed
   0.07    0.05    0.09
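Applied to the tab-separated, header-less file from the question, the call could look something like the sketch below. The file name and column names are taken from the scan() call in the question; the colClasses values are an assumption based on that template (one character column, three numeric).

library(data.table)
# Sketch only: 'myfile' and the column names come from the question;
# colClasses is assumed from the scan() template (url is character, the rest numeric).
df <- fread("myfile",
            sep = "\t",
            header = FALSE,
            colClasses = c("character", "numeric", "numeric", "numeric"),
            col.names = c("url", "popularity", "mintime", "maxtime"),
            data.table = FALSE)   # return a plain data.frame instead of a data.table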

Comparison of fread() with other reading functions:

##  user  system  elapsed   Method
##  2.59    0.08     2.70   read.csv (first time)
##  2.61    0.09     2.72   read.csv (second time)
##  1.08    0.06     1.14   Optimized read.table
##  0.13    0.03     0.08   fread
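The "Optimized read.table" row refers to tuning read.table with hints such as colClasses, nrows, quote and comment.char. The exact call behind that timing is not shown, so the following is only a sketch along those lines for the test.csv generated above.

# A guess at an "optimized" read.table call for test.csv; the column classes
# match the six columns created in the example above.
df <- read.table("test.csv",
                 header = TRUE,
                 sep = ",",
                 quote = "",
                 comment.char = "",
                 nrows = 1e6,
                 colClasses = c("integer", "integer", "numeric",
                                "character", "numeric", "integer"),
                 stringsAsFactors = FALSE)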
