A note on AIC scores for quasi-families in #rstats

A summary note on recent set of #rstats discoveries in estimating AIC scores to better understand a quasipoisson family in GLMS relative to treating data as poisson.

Conceptual GLM workflow rules/guidelines

  1. Data are best untransformed. Fit better model to data.
  2. Select your data structure to match purpose with statistical model.
  3. Use logic and understanding of data not AIC scores to select best model.

(1) Typically, the power and flexibility of GLMs in R (even with base R) get most of the work done for the ecological data we work with within the research team. We prefer to leave data untransformed and simple when possible and use the family or offset arguments within GLMs to address data issues.

(2) Data structure is a new concept to us. We have come to appreciate that there are both individual and population-level queries associated with many of the datasets we have collected.  For our purposes, data structure is defined as the level that the dplyr::group_by to tally or count frequencies is applied. If the ecological purpose of the experiment was defined as the population response to a treatment for instance, the population becomes the sample unit – not the individual organism – and summarised as such. It is critical to match the structure of data wrangled to the purpose of the experiment to be able to fit appropriate models. Higher-order data structures can reduce the likelihood of nested, oversampled, or pseudoreplicated model fitting.

(3) Know thy data and experiment. It is easy to get lost in model fitting and dive deep into unduly complex models. There are tools before model fitting that can prime you for better, more elegant model fits.

Workflow

  1. Wrangle then data viz.
  2. Library(fitdistrplus) to explore distributions.
  3. Select data structure.
  4. Fit models.

Now, specific to topic of AIC scores for quasi-family field studies.

We recently selected quasipoisson for the family to model frequency and count data (for individual-data structures). This addressed overdispersion issues within the data. AIC scores are best used for understanding prediction not description, and logic and exploration of distributions, CDF plots, and examination of the deviance (i.e. not be more than double the degrees of freedom) framed the data and model contexts. To contrast poisson to quasipoisson for prediction, i.e. would the animals respond differently to the treatments/factors within the experiment, we used the following #rstats solutions.

————

#Functions####

#deviance calc

dfun <- function(object) {

with(object,sum((weights * residuals^2)[weights > 0])/df.residual)

}

#reuses AIC from poisson family estimation

x.quasipoisson <- function(…) {

res <- quasipoisson(…)

res$aic <- poisson(…)$aic

res

}

#AIC package that provided most intuitive solution set####

require(MuMIn)

m <- update(m,family=”x.quasipoisson”, na.action=na.fail)

m1 <- dredge(m,rank=”QAIC”, chat=dfun(m))

m1

#repeat as needed to contrast different models

————

Outcomes

This #rstats opportunity generated a lot of positive discussion on data structures, how we use AIC scores, and how to estimate fit for at least this quasi-family model set in as few lines of code as possible.

Resources

  1. An R vignette by Ben Bolker of quasi solutions.
  2. An Ecology article on quasi-possion versus nb.regression for overdispersed count data.
  3. A StatsExchange discussion on AIC scores.

Fundamentals

Same data, different structure, lead to different models. Quasipoisson a reasonable solution for overdispersed count and frequency animal ecology data. AIC scores are a bit of work, but not extensive code, to extract. AIC scores provide a useful insight into predictive capacities if the purpose is individual-level prediction of count/frequency to treatments.