Matrix vs data.frame in lm()

R works with matrices much faster than it does with data.frames. Here’s a benchmark using my own data from the package ldsr. In that package I use data.table, but here I will convert the data to data.frame and use only base R.

# Package for benchmarking
library(microbenchmark)
# Get data from the ldsr package
Qa <- as.data.frame(ldsr::P1annual) # Response
pc <- as.data.frame(ldsr::P1pc)     # Inputs

# Make a data.frame for lm()
dt <- merge(Qa, pc, by = 'year')
dt$year <- NULL
# Make input matrix and response vector
y <- dt[, 1]
x <- as.matrix(dt[, -1])
# Functions to fit a linear regression with a data.frame and with a matrix;
# these fits are passed to step() later for backward stepwise selection
lmDT <- function(dt) lm(Qa ~ ., data = dt)
lmMat <- function(x, y) lm(y ~ x)

microbenchmark(lmDT(dt), times = 500, unit = 'ms')

## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval
##  lmDT(dt) 1.140536 1.300695 1.919813 1.550645 1.807281 19.49418   500

microbenchmark(lmMat(x, y), times = 500, unit = 'ms')

## Unit: milliseconds
##         expr      min      lq     mean   median        uq      max neval
##  lmMat(x, y) 0.517094 0.57186 0.721037 0.615164 0.7141885 10.18904   500

By median, the data.frame code takes about 2.5 times as long as the vector and matrix code (1.55 ms vs 0.62 ms). Now if we do stepwise variable selection instead of using all variables, the difference is even bigger.
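Before comparing timings, it helps to confirm that both calls fit the same model. Below is a minimal sketch on simulated data; the names `n`, `p`, `x_sim`, `y_sim` and `df_sim` are made up for illustration and are not part of ldsr. Only the coefficient names should differ, because lm() prefixes the matrix columns with the name of the matrix.

```r
# Simulated stand-in for the ldsr data (hypothetical, for illustration only)
set.seed(1)
n <- 100
p <- 5
x_sim <- matrix(rnorm(n * p), n, p,
                dimnames = list(NULL, paste0("PC", 1:p)))
y_sim <- drop(x_sim %*% seq_len(p)) + rnorm(n)
df_sim <- data.frame(Qa = y_sim, x_sim)

# Same model, two interfaces
fit_df  <- lm(Qa ~ ., data = df_sim)
fit_mat <- lm(y_sim ~ x_sim)

# Compare the estimates, ignoring the differing coefficient names
same_coef <- all.equal(unname(coef(fit_df)), unname(coef(fit_mat)))
same_coef
```

If `same_coef` is `TRUE`, any timing difference comes from how lm() processes its inputs, not from the fit itself.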

microbenchmark(step(lmDT(dt), trace = 0), times = 500, unit = 'ms')

## Unit: milliseconds
##                       expr      min       lq     mean   median       uq
##  step(lmDT(dt), trace = 0) 28.74455 30.87916 36.57506 33.58721 41.44933
##       max neval
##  184.1026   500

microbenchmark(step(lmMat(x, y), trace = 0), times = 500, unit = 'ms')

## Unit: milliseconds
##                          expr      min       lq     mean   median       uq
##  step(lmMat(x, y), trace = 0) 2.026346 2.196694 2.644744 2.342843 2.652335
##       max neval
##  13.35719   500

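By median, the data.frame code is now about 14 times slower (33.6 ms vs 2.3 ms). Part of this gap is because the two calls do different work: in y ~ x the whole matrix x is a single model term, so step() can only keep or drop all columns together, whereas with the data.frame formula it considers each variable separately.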
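One thing to keep in mind when reading the step() timings is how many terms each model exposes to the selection procedure. The sketch below uses simulated data (`x_sim`, `y_sim` and `df_sim` are hypothetical names, not from ldsr) to list the term labels of each fit.

```r
# Simulated stand-in data (hypothetical, for illustration only)
set.seed(1)
x_sim <- matrix(rnorm(60), 20, 3,
                dimnames = list(NULL, c("PC1", "PC2", "PC3")))
y_sim <- rnorm(20)
df_sim <- data.frame(Qa = y_sim, x_sim)

# Term labels are the units that step() adds or drops
terms_mat <- attr(terms(lm(y_sim ~ x_sim)), "term.labels")
terms_df  <- attr(terms(lm(Qa ~ ., data = df_sim)), "term.labels")

terms_mat  # one term: the whole matrix
terms_df   # one term per column
```

With a matrix on the right-hand side the formula has a single term, so step() has far fewer candidate models to evaluate than it does with one term per data.frame column.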