using broom to gather model results

Han Oostdijk

2020/06/30

Date last run: 06Jul2020

Introduction

Today I was experimenting with some models with a factor variable. I wanted to collect the results of the various models (combinations of the factor variable with other variables). For this I remembered the package broom.
This entry documents the code that I used on the iris data set.

Load the packages that are needed

HOQCutil::silent_library(c('stringr','broom','tibble','purrr','knitr'))

Function to construct vector with formulas

In the function construct_formulas we create a character vector with all the formulas with intercept where the dependent variable is the first column of a given data.frame and the independent variables are combinations of the other columns. We will use this function on the data set iris that contains the variable Species as a factor variable.

construct_formulas <- function(df){
  x = do.call(expand.grid,purrr::map(names(df)[-1],~c(' ',.)))
  x = do.call(paste,x)
  x[1]= '1'
  x = stringr::str_squish(x)
  paste("Sepal.Length ~ ", stringr::str_replace_all(x, ' ', ' + '))
}

Function to calculate model for one formula

In the function broom_glm we use the formula to calculate the model and use the function in the third argument to collect data from the calculation results. The default function for the third argument is broom::glance but also broom::tidy and broom::augment can be used. The model formula is inserted as first column of the result.

broom_glm <- function (formchar,df, broomfun=broom::glance) {
  mod = glm(formchar,data = df)
  df1 = broomfun(mod)
  cbind(tibble::tibble(formula=formchar),df1)
}

Use the functions

In this section the functions are used to:

Only the first 6 rows of the result tables are shown.

models = construct_formulas(iris)

df1 = purrr::map_dfr(models,~broom_glm(.,iris)) 
knitr::kable(head(df1))
formula null.deviance df.null logLik AIC BIC deviance df.residual
Sepal.Length ~ 1 102.1683 149 -184.03977 372.0795 378.1008 102.16833 149
Sepal.Length ~ Sepal.Width 102.1683 149 -182.99584 371.9917 381.0236 100.75610 148
Sepal.Length ~ Petal.Length 102.1683 149 -77.02021 160.0404 169.0723 24.52503 148
Sepal.Length ~ Sepal.Width + Petal.Length 102.1683 149 -46.51275 101.0255 113.0680 16.32876 147
Sepal.Length ~ Petal.Width 102.1683 149 -101.11073 208.2215 217.2534 33.81489 148
Sepal.Length ~ Sepal.Width + Petal.Width 102.1683 149 -91.91036 191.8207 203.8633 29.91110 147
df2 = purrr::map_dfr(models,~broom_glm(.,iris,broom::tidy)) 
knitr::kable(head(df2))
formula term estimate std.error statistic p.value
Sepal.Length ~ 1 (Intercept) 5.8433333 0.0676113 86.425375 0.0000000
Sepal.Length ~ Sepal.Width (Intercept) 6.5262226 0.4788963 13.627631 0.0000000
Sepal.Length ~ Sepal.Width Sepal.Width -0.2233611 0.1550809 -1.440287 0.1518983
Sepal.Length ~ Petal.Length (Intercept) 4.3066034 0.0783890 54.938900 0.0000000
Sepal.Length ~ Petal.Length Petal.Length 0.4089223 0.0188913 21.646019 0.0000000
Sepal.Length ~ Sepal.Width + Petal.Length (Intercept) 2.2491402 0.2479696 9.070224 0.0000000
df3 = purrr::map_dfr(models,~broom_glm(.,iris,broom::augment)) 
knitr::kable(head(df3))
formula Sepal.Length .fitted .se.fit .resid .hat .sigma .cooksd .std.resid Sepal.Width Petal.Length Petal.Width Species
Sepal.Length ~ 1 5.1 5.843333 0.0676113 -0.7433333 0.0066667 0.8285941 0.0054445 -0.9006812 NA NA NA NA
Sepal.Length ~ 1 4.9 5.843333 0.0676113 -0.9433333 0.0066667 0.8272083 0.0087684 -1.1430169 NA NA NA NA
Sepal.Length ~ 1 4.7 5.843333 0.0676113 -1.1433333 0.0066667 0.8254906 0.0128806 -1.3853527 NA NA NA NA
Sepal.Length ~ 1 4.6 5.843333 0.0676113 -1.2433333 0.0066667 0.8245067 0.0152322 -1.5065205 NA NA NA NA
Sepal.Length ~ 1 5.0 5.843333 0.0676113 -0.8433333 0.0066667 0.8279425 0.0070079 -1.0218490 NA NA NA NA
Sepal.Length ~ 1 5.4 5.843333 0.0676113 -0.4433333 0.0066667 0.8300540 0.0019366 -0.5371776 NA NA NA NA

The best models here based on AIC resp. BIC are

formula null.deviance df.null logLik AIC BIC deviance df.residual
Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width + Species 102.1683 149 -32.55801 79.11602 100.1905 13.55649 144
formula null.deviance df.null logLik AIC BIC deviance df.residual
Sepal.Length ~ Sepal.Width + Petal.Length + Species 102.1683 149 -34.78746 81.57492 99.63873 13.96551 145

Session Info

This document was produced on 06Jul2020 with the following R environment:

  #> R version 4.0.2 (2020-06-22)
  #> Platform: x86_64-w64-mingw32/x64 (64-bit)
  #> Running under: Windows 10 x64 (build 18363)
  #> 
  #> Matrix products: default
  #> 
  #> locale:
  #> [1] LC_COLLATE=English_United States.1252 
  #> [2] LC_CTYPE=English_United States.1252   
  #> [3] LC_MONETARY=English_United States.1252
  #> [4] LC_NUMERIC=C                          
  #> [5] LC_TIME=English_United States.1252    
  #> 
  #> attached base packages:
  #> [1] stats     graphics  grDevices utils     datasets  methods   base     
  #> 
  #> other attached packages:
  #> [1] knitr_1.29    purrr_0.3.4   tibble_3.0.1  broom_0.5.6   stringr_1.4.0
  #> 
  #> loaded via a namespace (and not attached):
  #>  [1] Rcpp_1.0.4.6     magrittr_1.5     tidyselect_1.1.0 HOQCutil_0.1.22 
  #>  [5] lattice_0.20-41  R6_2.4.1         rlang_0.4.6      highr_0.8       
  #>  [9] dplyr_1.0.0      tools_4.0.2      grid_4.0.2       nlme_3.1-148    
  #> [13] xfun_0.15        htmltools_0.4.0  ellipsis_0.3.0   digest_0.6.25   
  #> [17] lifecycle_0.2.0  crayon_1.3.4     tidyr_1.1.0      vctrs_0.3.1     
  #> [21] glue_1.4.1       evaluate_0.14    rmarkdown_2.3    stringi_1.4.6   
  #> [25] compiler_4.0.2   pillar_1.4.3     backports_1.1.6  generics_0.0.2  
  #> [29] pkgconfig_2.0.3