Elements of a successful #openscience #rstats workshop

What makes an open science workshop effective or successful*?

Over the last 15 years, I have had the good fortune to participate in workshops as a student and sometimes as an instructor. Consistently, there were beneficial discovery experiences, and at times, some of the processes highlighted have been transformative. Last year, I participated in Software Carpentry at UCSB and Software Carpentry at YorkU, and in the past, I attended (in part) workshops such as Open Science for Synthesis. Several of us are now deciding what to attend as students in 2017. I have been wondering about the efficacy of the workshop model and why workshops seem to be so effective relative to other ways of learning. I propose that the answer is expectations. Here is a set of brief lists of observations from workshops that led me to this conclusion.

*Note: I define a workshop as effective or successful when it provides me with something practical that I did not have before the workshop. Practical outcomes can include tools, ideas, workflows, insights, or novel viewpoints from discussion: anything that helps me do better open science. Efficacy for me is relative to learning by myself (i.e. through reading, watching webinars, or struggling with code or data), asking for help from others, taking an online course (that I always give up on), or attending a scientific conference.

Delivery elements of an open science training workshop

  1. Lectures
  2. Tutorials
  3. Demonstrations
  4. Q & A sessions
  5. Hands-on exercises
  6. Webinars or group-viewing recorded vignettes.

Summary of expectations from this list: a workshop will offer me content in more than one way, unlike a more traditional course offering. I can ask questions about content on the spot and get an answer.

Content elements of an open science training workshop

  1. Data and code
  2. Slide decks
  3. Advanced discussion
  4. Experts that can address basic and advanced queries
  5. A curated list of additional resources
  6. Opinions from the experts on the ‘best’ way to do something
  7. A list of problems or questions that need to be addressed or solved both routinely and in specific contexts when doing science
  8. A toolkit in some form associated with the specific focus of the workshop.

Summary of expectations from this list: the best, most useful content is curated. It is contemporary, and it would be a challenge for me to find this on my own.

Pedagogical elements of an open science training workshop

  1. Organized to reflect authentic challenges
  2. Uses problem-based learning
  3. Content is very contemporary
  4. Very light on lecture and heavy on practical application
  5. Reasonably small groups
  6. Will include team science and networks to learn and solve problems
  7. Short duration, high intensity
  8. Will use an open science tool for discussion and collective note taking
  9. Will be organized by major concepts such as data & meta-data, workflows, code, data repositories OR will be organized around a central problem or theme, and we will work together through the steps to solve a problem
  10. There will be a specific, quantifiable outcome for the participants (i.e. we will learn how to do or use a specific set of tools for future work).

Summary of expectations from this list: the training and learning experience will emulate a scientific working group that has convened to solve a problem. In this case, the problem is how we can all get better at a certain set of scientific activities, rather than, for instance, whether a group can aggregate and summarize a global alpine dataset. These collaborative problem-solving models need not be mutually exclusive.

Higher-order expectations that summarize all these open science workshop elements

  1. Experts, curated content, and contemporary tools.
  2. Everyone is focussed exclusively on the workshop, i.e. we all try to put our lives on hold to teach and learn together rapidly for a short time.
  3. Experiences are authentic and focus on problem solving.
  4. I will have to work trying things, but the slope of the learning curve/climb will be mediated by the workshop process.
  5. There will be some, but not too much, lecturing to give me the big picture highlights of why I need to know/use a specific concept or tool.

 

 

 

Review journals or journals with synthesis format contributions in EEB

Colleagues and I were checking through current journal listings for journals that either explicitly focus on synthesis, such as systematic reviews, or include a section that is frequently well represented with synthesis contributions. Most journals in ecology, evolution, and environmental science that publish standard primary research articles also offer the opportunity for synthesis papers, but these can appear less frequently, or the journals can be less likely to accept different forms of synthesis (i.e. systematic reviews in particular versus meta-analyses).

List

Diverse synthesis contributions very frequent
Conservation Letters (Letters)
Perspectives in Science
Perspectives in Plant Ecology, Evolution and Systematics
Diversity & Distributions
Ecology Letters
TREE
Oikos
Biological Reviews
Annual Review of Ecology, Evolution, and Systematics
Letters to Nature
Frontiers in Ecology and the Environment
PLOS ONE (many systematic reviews)
Environmental Evidence
Biology Letters
Quarterly Review of Biology

Frequent synthesis contributions with some diversity in formats
Global Ecology and Biogeography
Annals of Botany
New Phytologist
Ecography
Ecological Applications
Functional Ecology
Proceedings of the Royal Society B
Ecology and Evolution

 

 

Rules of thumb for better #openscience and transparent #collaboration

Rules-of-thumb for reuse of data and plots
1. If you use unpublished data from someone else, even if they are done with it, invite them to be a co-author.
2. If you use a published dataset, at the minimum contact authors, and depending on the purpose of the reuse, consider inviting them to become a co-author. Check licensing.
3. If you use plots initiated by another but in a significantly different way/for a novel purpose, invite them to be a co-author (within a reasonable timeframe).
4. If you reuse the experimental plots for the exact same purpose, offer the person that set them up ‘right of first refusal’ as first author (within a fair period of time such as 1-2 years, see rule 6).
5. If adding the same data to an experiment, first authorship can shift to more recent researchers that do significant work because the purpose shifts from short- to long-term ecology. Prof Turkington (my PhD mentor) used this model for his Kluane plots. He surveyed for many years and always invited primary researchers to be co-authors but not first. They often declined after a few years.
6. Set a reasonable authorship embargo to give researchers who have graduated or changed professional focus a generous chance to be first authors on papers. This can vary from 8 months to a year or more depending on how critical it is to share the research publicly. Development pressures, climate change, and extinctions wait for no one, sadly.

Rules-of-thumb for collaborative writing
1. Write first draft.
2. Share this draft with all potential first authors so that they can see what they would be joining.
3. Offer co-authorship to everyone that appropriately contributed at this juncture and populate the authorship list as firmly as possible.
4. Potential co-authors are free to decline authorship, but err on the side of generosity with invitations.
5. Do revisions in serial, not parallel. The story and flow get unduly challenging for everyone when track changes are layered.

A set of #rstats #AdventureTime themed #openscience slide decks

Purpose

I recently completed a set of data science for biostatistics training exercises for graduate students. I extensively used R for Data Science and Efficient R Programming to develop a set of Adventure Time R-statistics slide decks. Whilst I recognize that they are very minimal in terms of text, I hope that the general visual flow can provide a sense of the big picture philosophy that R data science and R statistics offer contemporary scientists.

Slide decks

  1. WhyR? How tidy data, open science, and R align to promote open science practices.
  2. Become a data wrangleR. An introduction to the philosophy, tips, and associated use of dplyr.
  3. Contemporary data viz in R. Philosophy of grammar of graphics, ggplot2, and some simple rules for effective data viz.
  4. Exploratory data analysis and models in R. An explanation of the difference between EDA and model fitting in R. Then, a short preview highlighting modelR.
  5. Efficient statistics in R. A visual summary of the ‘Efficient R Programming’ book ideas including chunking your work, efficient planning, and efficient coding suggestions in R.

Here are the knitted R Markdown HTML notes from the course too: https://cjlortie.github.io/r.stats/. All the materials can be downloaded from the associated GitHub repo.

I hope this collection of goodies can be helpful to others.


 

A review of ‘R for Data Science’ book @hadleywickham #rstats #openscience

Data science

Data science is a critical component of many domains of research, including the domain in which I primarily function: ecology. However, in teaching biostatistics within the university context, we have typically focussed on the statistics and less on the science of data (i.e. handling, understanding, and manipulating data). This is unfortunate, but the teaching landscape is now rapidly evolving to include offerings of numerous institutional Master’s of Data Science degrees.


It has taken me an embarrassingly long time to appreciate the differences between data science and statistics. My teaching has embraced open science and shared many of the skills that students need to be scientifically-literate citizens. However, data-literate citizens are important too if we want the next generation to make informed, evidence-based decisions about health, the economy, and the health of our ecosystems. Critical thinking tools for data are non-trivial concepts and statistics are absolutely needed. However, the science of data, big or little, is critical in appreciating the decisions, steps, and workflows needed to prepare, share, analyze, collaborate, and evaluate quantitative and qualitative data. I have been on a reading binge to this effect to both appreciate the value of data science thinking and improve the skill set that I can share with students and some collaborators. Last week, I completed my latest adventure – ‘R for Data Science’ by Garrett Grolemund & Hadley Wickham.

[Image: cover of ‘R for Data Science’]

Review

The book was written in R markdown, compiled using bookdown, and it is free online. Appropriately, it thus embodies both open science and data science in how it is written. Bookdown is a package for R that knits a set of R markdown files together into a book. This is important because it is open, you can clone the book from GitHub, it is written using one of the most powerful open science/data science tools, i.e. R (language and environment), and in reading online and seeing the code, you also appreciate the trickle effects of ‘open data science’ thinking to writing, collaboration, and even publishing. This is all incredible, and it is a peek into a very different future of scholarly communication. The book is nearly complete. I read what was available because I teach soon. It confirmed and advanced my understanding and skill set for data science immensely. Here is a brief summary, without spoilers, of some of the dimensions I used to conclude that this book is fantastic.

Language & clarity
In reading R statistics, statistics, or data science books, one expects/hopes that, like literate coding, the prose will be accessible, pleasant, and appropriately pitched. This book was ideal in this respect. It was more formal than conversational but not too technical. The structure facilitated comprehension and reading because it was clear and logical. The visuals added a dimension of attractive clarity to the writing that was not just code, prose, R, or data viz. Many of the visuals were excellent heuristics. Some were a reminder to the reader of the big picture in data science whilst others highlighted a particular workflow/approach.

Example of big picture visual.

[Figure: data-science-explore]

Example of mechanistic heuristic.

[Figure: join-many-to-many]

These were extremely useful. I could have even used more here and there, but in digging into the examples, I recognize that they were likely not always needed (and too much can be a bad thing too if poorly executed). The clarity was very high in almost every chapter of the book. I struggled with some of the more complex chapters (for me) such as relational data or some elements of the model building, but the flow kept me rolling through these even if some of the details eluded me.

[Figure: join-venn]

The expectation that data science or statistics books should be read only once is a challenging notion. Many of the chapters in this book certainly satisfy that criterion, but it depends on the purpose. Some of the more challenging chapters that you identify can be re-read for better comprehension, and one could also follow along and experiment in RStudio. Sometimes, it is nonetheless good to get the message from alternate sources described or explained a little differently. In my R reading bonanza, some of the R-statistics books will not be revisited. My feeling for R for Data Science is that the clean style and direct writing do not obscure the message, and re-reads would likely be beneficial when needed. The message in many chapters is also unique, and even a brief revisit would highlight some of the handling elements and assumptions associated with best practices for data science.

Philosophy
Welcome to the tidyverse. Enough said to all that follow and read up within the R community. This universe is logical and feels natural. The forthcoming ggvis will help further align the grammar and semantics that parallel the code and flow with pipes versus ‘+’ of ggplot2. Tibbles are a pleasant surprise. The wrangle readings satisfy. Tidiness is next to high-orderedness. Subscribing to the philosophy of readable code, consistent data structures, and logical workflows will promote better open science and reproducibility. This is never really explicitly stated, or if it was, I missed it. I suspect that this is a good thing. We can approach open science, open data, and more transparency in science from top-down or bottom-up efforts. By not repeatedly banging that drum per se but directly providing and describing the tools to handle data cleanly and consistently, this book provides a solid bottom-up pillar for the open science movement. Tidy data and readable code are shareable AND useable. Finally and aligned with this tools-first approach, the value of models and epistemology of hypotheses are stated later in the book (Chapter 19). This worked for me in reading this book but likely not in teaching to students. I like the hypothesis/model philosophy of ‘knowing data’ developed here. It was big data in origins, balanced, and emphasized bias and non-independence in exploring and testing models. What you can learn from a model also depends on how it is applied. This was well described. Split. Build. Think. Test. Know.
Your own personal variation would likely fit within a similar framework even with little data. I did wonder a bit how I can adapt some of the model fitting ideas to more of the little data common in some of the ecological inquiries (solutions: (i) pilot field experiments can provide the training data, and (ii) resampling/bootstrapping using modelr to populate larger datasets for more independent EDA). The reminder to avoid repetition is repeated. Not ironically.
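As a rough illustration of the split, build, and test steps, and of the bootstrapping idea for little data, here is a minimal sketch in base R. It is not taken from the book. The `pilot` data frame is simulated and hypothetical, standing in for a small pilot field experiment, and base `sample()` is used here in place of the resampling helpers that modelr provides.

```r
# Hypothetical small dataset standing in for a pilot field experiment
set.seed(42)
pilot <- data.frame(x = runif(30, 0, 10))
pilot$y <- 2 + 0.5 * pilot$x + rnorm(30, sd = 1)

# Split: hold out a third of the rows for testing
test_rows <- sample(nrow(pilot), size = 10)
train <- pilot[-test_rows, ]
test  <- pilot[test_rows, ]

# Build: fit a simple linear model on the training rows only
fit <- lm(y ~ x, data = train)

# Test: root mean squared error on the held-out rows
rmse <- sqrt(mean((test$y - predict(fit, newdata = test))^2))

# Bootstrap: resample training rows with replacement and refit each time
boot_slopes <- replicate(500, {
  idx <- sample(nrow(train), replace = TRUE)
  coef(lm(y ~ x, data = train[idx, ]))[["x"]]
})

# Percentile interval for the slope across bootstrap refits
slope_ci <- quantile(boot_slopes, c(0.025, 0.975))
```

With only 20 training rows, the width of `slope_ci` gives a quick sense of how much a fitted relationship could move around under resampling, which is one way to think before testing when the data are small.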

Skills
Many books do not need to adapt. Most R statistics books likely do. Packages are often a gamechanger. Grammar changes. Base R is a must know of course, but streamlining and specifics often live in the libraries the community develops. This book is available for sale on Amazon, and I assume it will adapt but more slowly than the bookdown version. The frame-rate of change in no way precludes reading the book now or revisiting at some later point in time. Model building chapters, the basics of wrangling, functions, and iterations are solid reading that provide a skill set needed right now. The data viz and perhaps data transformation chapters are most likely to change soon. Read now and capture those skills but expect change. There are also some nice examples of intermediate to advanced tricks in plotting that reading now will provide. Certainly, this is the case in the iteration and model chapters too: good intermediate skill building blocks for advanced coding in data science. This skill set is pretty darn awesome (PDA), and the strings chapter was also very rich in new skills and a launchpad to text mining with other packages (it inspired me to try it right after finishing the book). Skills abound.

Bottom line (of code) review for readers

high.returns <- c("basic.R.users", "intermediate.R.users")

tidy.data.science <- philosophy of consistent structures %>% visualize with models %>% share

Implication
There are many tools for open science (data management plans, slideshare, data repositories, GitHub, preprints, sharing meta-data, social media, blogs, and data publications). However, effective data science in R can also be a powerful ally if you include the final communication steps (Chapters 23-25).

[Figure: datascience-openscience]

 
