R works with matrices much faster than it does with data.frames. Here's a benchmark using my own data from the package ldsr. In that package I use data.table, but here I will convert the data to data.frame and fit the models with base R only.
# Package for benchmarking
library(microbenchmark)

# Get data from the ldsr package
Qa <- as.data.frame(ldsr::P1annual) # Response
pc <- as.data.frame(ldsr::P1pc)     # Inputs

# Make a data.frame for lm()
dt <- merge(Qa, pc, by = 'year')
dt$year <- NULL

# Make input matrix and response vector
y <- dt[, 1]
x <- as.matrix(dt[, -1])

# Functions to fit the full linear model from a data.frame and from a matrix
lmDT <- function(dt) lm(Qa ~ ., data = dt)
lmMat <- function(x, y) lm(y ~ x)
microbenchmark(lmDT(dt), times = 500, unit = 'ms')
## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval
##  lmDT(dt) 1.140536 1.300695 1.919813 1.550645 1.807281 19.49418   500
microbenchmark(lmMat(x, y), times = 500, unit = 'ms')
## Unit: milliseconds
##         expr      min      lq     mean   median        uq      max neval
##  lmMat(x, y) 0.517094 0.57186 0.721037 0.615164 0.7141885 10.18904   500
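Before reading too much into the timings, it helps to confirm that the two calls fit the same model. The sketch below does that with synthetic data standing in for the ldsr series; the sample size, the column names `PC1` to `PC4`, and the coefficients are all made up for illustration and are not taken from the package.

```r
# Synthetic stand-in for the ldsr data (hypothetical values, not the real series)
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 4), nrow = n,
            dimnames = list(NULL, paste0("PC", 1:4)))
y <- drop(x %*% c(1, -0.5, 0.25, 0)) + rnorm(n)
dt <- data.frame(Qa = y, x)

fit_df  <- lm(Qa ~ ., data = dt)  # data.frame interface
fit_mat <- lm(y ~ x)              # matrix interface

# Coefficient names differ ("PC1" vs "xPC1"), so compare the values only
same_fit <- isTRUE(all.equal(unname(coef(fit_df)), unname(coef(fit_mat))))
print(same_fit)
```

If `same_fit` is `TRUE`, the two interfaces produce the same full-model estimates, so for this step the benchmark compares two routes to the same result.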
The data.frame code takes about 2.5 times as long as the vector and matrix code (median 1.55 ms versus 0.62 ms). Now if we do stepwise variable selection instead of using all variables, the difference is even bigger.
microbenchmark(step(lmDT(dt), trace = 0), times = 500, unit = 'ms')
## Unit: milliseconds
##                       expr      min       lq     mean   median       uq
##  step(lmDT(dt), trace = 0) 28.74455 30.87916 36.57506 33.58721 41.44933
##       max neval
##  184.1026   500
microbenchmark(step(lmMat(x, y), trace = 0), times = 500, unit = 'ms')
## Unit: milliseconds
##                          expr      min       lq     mean   median       uq
##  step(lmMat(x, y), trace = 0) 2.026346 2.196694 2.644744 2.342843 2.652335
##       max neval
##  13.35719   500
The data.frame code is now about 14 times slower (median 33.6 ms versus 2.3 ms).
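One thing worth checking when interpreting the stepwise numbers is how many terms each model exposes to `step()`. In a formula like `y ~ x` with a matrix `x`, the whole matrix counts as a single term, whereas `Qa ~ .` expands to one term per column. The sketch below illustrates this with synthetic data (again a hypothetical stand-in, not the ldsr series), so part of the gap may come from `step()` having fewer candidate models to evaluate in the matrix case.

```r
# Synthetic stand-in for the ldsr data (hypothetical values, not the real series)
set.seed(1)
n <- 100
x <- matrix(rnorm(n * 4), nrow = n,
            dimnames = list(NULL, paste0("PC", 1:4)))
y <- drop(x %*% c(1, -0.5, 0.25, 0)) + rnorm(n)
dt <- data.frame(Qa = y, x)

# Term labels are the units that step() can add or drop
terms_df  <- attr(terms(lm(Qa ~ ., data = dt)), "term.labels")
terms_mat <- attr(terms(lm(y ~ x)), "term.labels")

print(terms_df)   # one label per predictor column
print(terms_mat)  # a single label for the whole matrix
```

With a single term, backward selection can only keep or drop the matrix as a whole, so the two `step()` benchmarks are not necessarily doing the same amount of work.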