Hacking the principles of #openscience #workshops

In a previous post, I discussed the key elements that really stood out for me in recent workshops associated with open science, data science, and ecology. Summer workshop season is upon us, and here are some principles to consider that can be used to hack a workshop. These hacks can be applied a priori as an instructor or in situ as a participant or instructor by engaging with the context from a pragmatic, problem-solving perspective.

Principles

1. Embrace open pedagogy.
2. Use current best practices from traditional teaching contexts.
3. Be learner centered.
4. Speak less, do more.
5. Solve authentic challenges.

Hacks (for each principle)

1. Prepare learning outcomes for every lesson.

2. Identify solve-a-problem opportunities in advance and be open to ones that emerge organically during the workshop.

3. Use no slide decks. This challenges the instructor to more directly engage with the students and participants in the workshop and leaves space for students to shape content and narrative to some extent. Decks lock all of us in. This is appropriate for some contexts such as conference presentations, but workshops can be more fluid and open.

4. Plan pauses. Prepare your lessons with gaps for contributions.  Prepare a list of questions to offer up for every lesson and provide time for discussion of solutions.

5. Use real evidence/data to answer a compelling question (scale can be limited, approach beta as long as an answer is provided, and the challenge can emerge if teaching is open and space provided for the workshop participants to ideate).

A final hack that is a more general teaching principle: consider keeping all teaching materials within a single ecosystem that then references outwards only as needed. For me, this has become all content prepared in RStudio, knitted to html, then pushed to GitHub gh-pages for sharing as a webpage (or site). Then participants can engage with all content, including code, data, and ideas, in one place.

 

Overdispersion tests in #rstats

A brief note on overdispersion

Assumptions

The Poisson distribution assumes variance is equal to the mean.

Quasi-poisson model assumes variance is a linear function of mean.

Negative binomial model assumes variance is a quadratic function of the mean.
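To make these assumptions concrete, here is a minimal sketch with simulated counts. The means, sample sizes, and the size value are arbitrary assumptions for illustration, not values from any dataset. It compares the variance-to-mean ratio for Poisson draws (should sit near 1) with negative binomial draws (should grow with the mean, because variance = mean + mean^2 / size). A quasi-poisson model would instead expect a constant ratio other than 1.

```r
# Simulated counts only; all parameter values are illustrative assumptions.
set.seed(123)
mu <- c(2, 5, 10, 20)

# Poisson: variance should track the mean (ratio near 1)
pois_ratio <- sapply(mu, function(m) {
  y <- rpois(5000, lambda = m)
  var(y) / mean(y)
})

# Negative binomial: variance = mu + mu^2 / size, so the ratio grows with the mean
size <- 2
nb_ratio <- sapply(mu, function(m) {
  y <- rnbinom(5000, mu = m, size = size)
  var(y) / mean(y)
})

# Side-by-side view, with the theoretical negative binomial ratio for reference
round(data.frame(mean = mu, poisson = pois_ratio, negbin = nb_ratio,
                 negbin_expected = 1 + mu / size), 2)
```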

rstats implementation

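#to test, fit a poisson GLM first, then apply the function to that fitted model

library(AER)

dispersiontest(object, trafo = NULL, alternative = c("greater", "two.sided", "less"))

#trafo = NULL (default) tests the dispersion parameter directly (null: dispersion = 1)

#trafo = 1 tests a linear variance function (quasipoisson-like), trafo = 2 tests a quadratic one (negative binomial-like), and a function can also be passed to trafo

#interpretation when trafo is specified (variance = mean + c * f(mean))

c = 0 equidispersion

c > 0 is overdispersed

c < 0 is underdispersed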
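A minimal sketch of the test in use, assuming simulated data (the variables x and y and all parameter values are made up for illustration). The counts are drawn from a negative binomial, so the poisson GLM should be flagged as overdispersed. The AER calls are wrapped in a check so the block still runs if the package is not installed.

```r
# Simulated example data (not from the original post)
set.seed(42)
x <- runif(200)
y <- rnbinom(200, mu = exp(1 + x), size = 1.5)   # overdispersed counts

# Fit the poisson GLM first, then hand the fitted model to the test
m_pois <- glm(y ~ x, family = poisson)

# Rough check by hand: Pearson chi-square over residual df (near 1 if equidispersed)
pearson_disp <- sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)
pearson_disp

if (requireNamespace("AER", quietly = TRUE)) {
  # default: null of dispersion = 1 against the alternative dispersion > 1
  print(AER::dispersiontest(m_pois))
  # trafo = 1: variance = mean + c * mean (quasipoisson-like), null of c = 0
  print(AER::dispersiontest(m_pois, trafo = 1))
  # trafo = 2: variance = mean + c * mean^2 (negative binomial-like), null of c = 0
  print(AER::dispersiontest(m_pois, trafo = 2))
}
```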

Resources

  1. Function description from vignette for AER package.
  2. Excellent StatsExchange description of interpretation.

A note on AIC scores for quasi-families in #rstats

A summary note on a recent set of #rstats discoveries in estimating AIC scores to better understand a quasipoisson family in GLMs relative to treating data as poisson.

Conceptual GLM workflow rules/guidelines

  1. Data are best untransformed. Fit better model to data.
  2. Select your data structure to match purpose with statistical model.
  3. Use logic and understanding of data not AIC scores to select best model.

(1) Typically, the power and flexibility of GLMs in R (even with base R) get most of the work done for the ecological data we work with within the research team. We prefer to leave data untransformed and simple when possible and use the family or offset arguments within GLMs to address data issues.
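As one illustration of leaving data untransformed, here is a minimal sketch of the offset argument, assuming simulated visitation counts collected with unequal sampling effort. The variable names (visits, hours, treatment) and all values are hypothetical. Rather than dividing counts by effort before modelling, the raw counts are kept and effort enters as an offset on the log scale.

```r
# Simulated data; variable names and values are hypothetical.
set.seed(7)
d <- data.frame(
  treatment = gl(2, 40, labels = c("open", "shrub")),
  hours     = runif(80, min = 1, max = 8)        # sampling effort per plot
)
rate <- ifelse(d$treatment == "shrub", 3, 1.5)    # true visits per hour
d$visits <- rpois(80, lambda = rate * d$hours)

# Counts stay untransformed; effort enters as an offset on the log scale
m_offset <- glm(visits ~ treatment + offset(log(hours)),
                family = poisson, data = d)

# Exponentiated coefficients are on the rate scale (visits per hour)
exp(coef(m_offset))
```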

(2) Data structure is a new concept to us. We have come to appreciate that there are both individual and population-level queries associated with many of the datasets we have collected.  For our purposes, data structure is defined as the level that the dplyr::group_by to tally or count frequencies is applied. If the ecological purpose of the experiment was defined as the population response to a treatment for instance, the population becomes the sample unit – not the individual organism – and summarised as such. It is critical to match the structure of data wrangled to the purpose of the experiment to be able to fit appropriate models. Higher-order data structures can reduce the likelihood of nested, oversampled, or pseudoreplicated model fitting.
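A minimal sketch of the same data at two structures, assuming simulated observations with hypothetical column names. Base R aggregate() is used here as a stand-in for the dplyr::group_by plus summarise step described above, so the block runs without extra packages.

```r
# Simulated observations; column names are hypothetical.
set.seed(11)
obs <- data.frame(
  plot       = rep(paste0("p", 1:6), each = 10),
  treatment  = rep(c("control", "removal"), each = 30),
  individual = rep(1:10, times = 6),
  visits     = rpois(60, lambda = 4)
)

# Individual-level structure: one row per organism (60 sample units)
nrow(obs)

# Population-level structure: tally per plot, so the plot is the sample unit
# (dplyr equivalent: group_by(obs, treatment, plot) then summarise(visits = sum(visits)))
pop <- aggregate(visits ~ treatment + plot, data = obs, FUN = sum)
nrow(pop)
```

The same counts now support two different models: one with 60 individuals nested in plots, and one with 6 plots as independent units.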

(3) Know thy data and experiment. It is easy to get lost in model fitting and dive deep into unduly complex models. There are tools before model fitting that can prime you for better, more elegant model fits.

Workflow

  1. Wrangle then data viz.
  2. library(fitdistrplus) to explore distributions.
  3. Select data structure.
  4. Fit models.
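The distribution-exploration step of this workflow can be sketched as follows, assuming simulated overdispersed counts (the counts vector and its parameters are made up). The fitdistrplus calls are wrapped in a check so the block still runs without the package; this is one possible way to use it, not necessarily how it was used for the original data.

```r
# Simulated, overdispersed counts (illustrative values only)
set.seed(99)
counts <- rnbinom(150, mu = 4, size = 1.2)

# Quick look before any package: mean versus variance
c(mean = mean(counts), variance = var(counts))

if (requireNamespace("fitdistrplus", quietly = TRUE)) {
  library(fitdistrplus)
  descdist(counts, discrete = TRUE)          # Cullen and Frey graph
  fit_pois <- fitdist(counts, "pois")
  fit_nb   <- fitdist(counts, "nbinom")
  # lower AIC indicates the better-supported candidate distribution
  print(c(poisson = fit_pois$aic, negbin = fit_nb$aic))
  # empirical versus fitted CDFs for both candidates
  cdfcomp(list(fit_pois, fit_nb), legendtext = c("poisson", "negbin"))
}
```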

Now, specific to the topic of AIC scores for quasi-family field studies.

We recently selected quasipoisson for the family to model frequency and count data (for individual-data structures). This addressed overdispersion issues within the data. AIC scores are best used for understanding prediction, not description. Logic, exploration of distributions, CDF plots, and examination of the deviance (i.e. the residual deviance should not be more than double the residual degrees of freedom) framed the data and model contexts. Quasi-families do not have a true likelihood, so R reports their AIC as NA, and a workaround is needed. To contrast poisson to quasipoisson for prediction, i.e. would the animals respond differently to the treatments/factors within the experiment, we used the following #rstats solutions.

————

#Functions####

#dispersion (c-hat) calc: sum of squared Pearson residuals over residual degrees of freedom

dfun <- function(object) {
  with(object, sum((weights * residuals^2)[weights > 0]) / df.residual)
}

#quasipoisson family that reuses the AIC function from the poisson family

x.quasipoisson <- function(...) {
  res <- quasipoisson(...)
  res$aic <- poisson(...)$aic
  res
}

#AIC package that provided most intuitive solution set####

library(MuMIn)

#m is a previously fitted GLM; dredge requires na.action = na.fail

m <- update(m, family = "x.quasipoisson", na.action = na.fail)

m1 <- dredge(m, rank = "QAIC", chat = dfun(m))

m1

#repeat as needed to contrast different models

————
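For readers who want to see what a QAIC score is made of, here is a minimal base R sketch on simulated data (the data frame, treatment labels, and parameter values are hypothetical). It assumes the basic form QAIC = -2 * logLik / c-hat + 2 * K, using the poisson log-likelihood and c-hat from the most complex model. Conventions differ on whether the estimated c-hat counts as an extra parameter, so absolute values may not match a given package, although the ranking of models is unaffected by a constant added to every score.

```r
# Simulated data; names and values are hypothetical.
set.seed(1)
d <- data.frame(treatment = gl(2, 50, labels = c("control", "shade")))
d$count <- rnbinom(100, mu = ifelse(d$treatment == "shade", 6, 3), size = 1)

m_null <- glm(count ~ 1, family = poisson, data = d)
m_full <- glm(count ~ treatment, family = poisson, data = d)

# c-hat from the most complex model: Pearson chi-square / residual df
chat <- sum(residuals(m_full, type = "pearson")^2) / df.residual(m_full)

# QAIC = -2 * logLik / c-hat + 2 * K (K = number of estimated coefficients here)
qaic <- function(model, chat) {
  ll <- logLik(model)
  -2 * as.numeric(ll) / chat + 2 * attr(ll, "df")
}

qaic_scores <- c(null = qaic(m_null, chat), full = qaic(m_full, chat))
qaic_scores
```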

Outcomes

This #rstats opportunity generated a lot of positive discussion on data structures, how we use AIC scores, and how to estimate fit for at least this quasi-family model set in as few lines of code as possible.

Resources

  1. An R vignette by Ben Bolker of quasi solutions.
  2. An Ecology article on quasi-Poisson versus negative binomial regression for overdispersed count data.
  3. A StatsExchange discussion on AIC scores.

Fundamentals

The same data with a different structure leads to different models. Quasipoisson is a reasonable solution for overdispersed count and frequency animal ecology data. AIC scores take a bit of work, but not extensive code, to extract. AIC scores provide useful insight into predictive capacity if the purpose is individual-level prediction of count/frequency responses to treatments.

 

Elements of a successful #openscience #rstats workshop

What makes an open science workshop effective or successful*?

Over the last 15 years, I have had the good fortune to participate in workshops as a student and sometimes as an instructor. Consistently, there were beneficial discovery experiences, and at times, some of the processes highlighted have been transformative. Last year, I had the good fortune to participate in Software Carpentry at UCSB and Software Carpentry at YorkU, and in the past, attend (in part) workshops such as Open Science for Synthesis. Several of us are now deciding what to attend as students in 2017. I have been wondering about the potential efficacy of the workshop model and why it seems that they are so relatively effective. I propose that the answer is expectations.  Here is a set of brief lists of observations from workshops that lead me to this conclusion.

*Note: I define a workshop as effective or successful when it provides me with something practical that I did not have before the workshop. Practical outcomes can include tools, ideas, workflows, insights, or novel viewpoints from discussion. Anything that helps me do better open science. Efficacy for me is relative to learning by myself (i.e. through reading, watching webinars, or struggling with code or data), asking for help from others, taking an online course (that I always give up on), or attending a scientific conference.

Delivery elements of an open science training workshop

  1. Lectures
  2. Tutorials
  3. Demonstrations
  4. Q & A sessions
  5. Hands-on exercises
  6. Webinars or group-viewing recorded vignettes.

Summary expectations from this list: a workshop will offer me content in more than one way unlike a more traditional course offering. I can ask questions right there on the spot about content and get an answer.

Content elements of an open science training workshop

  1. Data and code
  2. Slide decks
  3. Advanced discussion
  4. Experts that can address basic and advanced queries
  5. A curated list of additional resources
  6. Opinions from the experts on the ‘best’ way to do something
  7. A list of problems or questions that need to be addressed or solved both routinely and in specific contexts when doing science
  8. A toolkit in some form associated with the specific focus of the workshop.

Summary of expectations from this list: the best, most useful content is curated. It is contemporary, and it would be a challenge for me to find this out on my own.

Pedagogical elements of an open science training workshop

  1. Organized to reflect authentic challenges
  2. Uses problem-based learning
  3. Content is very contemporary
  4. Very light on lecture and heavy on practical application
  5. Reasonably small groups
  6. Will include team science and networks to learn and solve problems
  7. Short duration, high intensity
  8. Will use an open science tool for discussion and collective note taking
  9. Will be organized by major concepts such as data & meta-data, workflows, code, data repositories OR will be organized around a central problem or theme, and we will work together through the steps to solve a problem
  10. There will be a specific, quantifiable outcome for the participants (i.e. we will learn how to do or use a specific set of tools for future work).

Summary of expectations from this list: the training and learning experience will emulate a scientific working group that has convened to solve a problem. In this case, the problem is how we can all get better at doing a certain set of scientific activities, versus whether a group can aggregate and summarize a global alpine dataset, for instance. These collaborative problem-solving models need not be exclusive.

Higher-order expectations that summarize all these open science workshop elements

  1. Experts, curated content, and contemporary tools.
  2. Everyone is focussed exclusively on the workshop, i.e. we all try to put our lives on hold to teach and learn together rapidly for a short time.
  3. Experiences are authentic and focus on problem solving.
  4. I will have to work trying things, but the slope of the learning curve/climb will be mediated by the workshop process.
  5. There will be some, but not too much, lecturing to give me the big picture highlights of why I need to know/use a specific concept or tool.

 

 

 

A set of #rstats #AdventureTime themed #openscience slide decks

Purpose

I recently completed a set of data science for biostatistics training exercises for graduate students. I extensively used R for Data Science and Efficient R programming to develop a set of Adventure Time R-statistics slide decks. Whilst I recognize that they are very minimal in terms of text, I hope that the general visual flow can provide a sense of the big picture philosophy that R data science and R statistics offer contemporary scientists.

Slide decks

  1. WhyR? How tidy data, open science, and R align to promote open science practices.
  2. Become a data wrangleR. An introduction to the philosophy, tips, and associated use of dplyr.
  3. Contemporary data viz in R. Philosophy of grammar of graphics, ggplot2, and some simple rules for effective data viz.
  4. Exploratory data analysis and models in R. An explanation of the difference between EDA and model fitting in R. Then, a short preview highlighting modelR.
  5. Efficient statistics in R. A visual summary of the ‘Efficient R Programming’ book ideas including chunk your work, efficient planning, and efficient coding suggestions in R.

Here are the knitted RMarkdown html notes from the course too: https://cjlortie.github.io/r.stats/. All the materials can be downloaded from the associated GitHub repo.

I hope this collection of goodies can be helpful to others.
