Elements of a successful #openscience #rstats workshop

What makes an open science workshop effective or successful*?

Over the last 15 years, I have had the good fortune to participate in workshops as a student and sometimes as an instructor. Consistently, there were beneficial discovery experiences, and at times, some of the processes highlighted have been transformative. Last year, I had the good fortune to participate in Software Carpentry at UCSB and Software Carpentry at YorkU, and in the past, attend (in part) workshops such as Open Science for Synthesis. Several of us are now deciding what to attend as students in 2017. I have been wondering about the potential efficacy of the workshop model and why it seems that they are so relatively effective. I propose that the answer is expectations.  Here is a set of brief lists of observations from workshops that lead me to this conclusion.

*Note: I define a workshop as effective or successful when it provides me with something practical that I did not have before the workshop.  Practical outcomes can include tools, ideas, workflows, insights, or novel viewpoints from discussion. Anything that helps me do better open science. Efficacy for me is relative to learning by myself (i.e. through reading, watching webinars, or stuggling with code or data), asking for help from others, taking an online course (that I always give up on), or attending a scientific conference.

Delivery elements of an open science training workshop

  1. Lectures
  2. Tutorials
  3. Demonstrations
  4. Q & A sessions
  5. Hands-on exercises
  6. Webinars or group-viewing recorded vignettes.

Summary expectations from this list: a workshop will offer me content in more than one way unlike a more traditional course offering. I can ask questions right there on the spot about content and get an answer.

Content elements of an open science training workshop

  1. Data and code
  2. Slide decks
  3. Advanced discussion
  4. Experts that can address basic and advanced queries
  5. A curated list of additional resources
  6. Opinions from the experts on the ‘best’ way to do something
  7. A list of problems or questions that need to addressed or solved both routinely and in specific contexts when doing science
  8. A toolkit in some form associated with the specific focus of the workshop.

Summary of expectations from this list: the best, most useful content is curated. It is contemporary, and it would be a challenge for me to find out this on my own.

Pedagogical elements of an open science training workshop

  1. Organized to reflect authentic challenges
  2. Uses problem-based learning
  3. Content is very contemporary
  4. Very light on lecture and heavy on practical application
  5. Reasonably small groups
  6. Will include team science and networks to learn and solve problems
  7. Short duration, high intensity
  8. Will use an open science tool for discussion and collective note taking
  9. Will be organized by major concepts such as data & meta-data, workflows, code, data repositories OR will be organized around a central problem or theme, and we will work together through the steps to solve a problem
  10. There will be a specific, quantifiable outcome for the participants (i.e. we will learn how to do or use a specific set of tools for future work).

Summary of expectations from this list: the training and learning experience will emulate a scientific working group that has convened to solve a problem. In this case, how can we all get better at doing a certain set of scientific activities versus can a group aggregate and summarize a global alpine dataset for instance. These collaborative solving-models need not be exclusive.

Higher-order expectations that summarize all these open science workshop elements

  1. Experts, curated content, and contemporary tools.
  2. Everyone is focussed exclusively on the workshop, i.e. we all try to put our lives on hold to teach and learn together rapidly for a short time.
  3. Experiences are authentic and focus on problem solving.
  4. I will have to work trying things, but the slope of the learning curve/climb will be mediated by the workshop process.
  5. There will be some, but not too much, lecturing to give me the big picture highlights of why I need to know/use a specific concept or tool.

 

 

 

A set of #rstats #AdventureTime themed #openscience slide decks

Purpose

I recently completed a set of data science for biostatistics training exercises for graduate students. I extensively used R for Data Science and Efficient R programming to develop a set of Adventure Time R-statistics slide decks. Whilst I recognize that they are very minimal in terms of text, I hope that the general visual flow can provide a sense of the big picture philosophy that R data science and R statistics offer contemporary scientists.

Slide decks

  1. WhyR? How tidy data, open science, and R align to promote open science practices.
  2. Become a data wrangleR. An introduction to the philosophy, tips, and associated use of dplyr.
  3. Contemporary data viz in R. Philosophy of grammar of graphics, ggplot2, and some simple rules for effective data viz.
  4. Exploratory data analysis and models in R. An explanation of the difference between EDA and model fitting in R. Then, a short preview of how to highlighting modelR.
  5. Efficient statistics in R. A visual summary of the ‘Efficient R Programming’ book ideas including chunk your work, efficient planning, efficient planning, and efficient coding suggestions in R.

Here is the knitted RMarkdown html notes from the course too https://cjlortie.github.io/r.stats/, and all the materials can be downloaded from the associated GitHub repo.

I hope this collection of goodies can be helpful to others.

adventures

 

A review of ‘R for Data Science’ book @hadleywickham #rstats #openscience

Data science

Data science is a critical component of many domains of research including the domain I primarily function – ecology. However, in teaching biostatistics within the university context, we have typically focussed on the statistics and less on the science of data (i.e. handling, understanding, and manipulating data). This is unfortunate, but the teaching landscape is now rapidly evolving to include offerings of numerous institutional Master’s of Data Science degrees.

vizstars

It has taken me an embarrassingly long time to appreciate the differences between data science and statistics. My teaching has embraced open science and shared many of the skills that students need to be scientifically-literate citizens. However, data-literate citizens are important too if we want the next generation to make informed, evidence-based decisions about health, the economy, and the health of our ecosystems. Critical thinking tools for data are non-trivial concepts and statistics are absolutely needed. However, the science of data, big or little, is critical in appreciating the decisions, steps, and workflows needed to prepare, share, analyze, collaborate, and evaluate quantitative and qualitative data. I have been on a reading binge to this effect to both appreciate the value of data science thinking and improve the skill set that I can share with students and some collaborators. Last week, I completed my latest adventure – ‘R for Data Science’ by Garrett Grolemund & Hadley Wickham.

cover

Review

The book was written in R markdown, compiled using bookdown, and it is free online. Appropriately, it thus embodies both open science and data science in how it is written. Bookdown is a package for R that knits a set of R markdown files together into a book. This is important because it is open, you can clone the book from GitHub, it is written using one of the most powerful open science/data science tools, i.e. R (language and environment), and in reading online and seeing the code, you also appreciate the trickle effects of ‘open data science’ thinking to writing, collaboration, and even publishing. This is all incredible, and it is a peek into a very different future of scholarly communication. The book is nearly complete. I read what was available because I teach soon. It confirmed and advanced my understanding and skill set for data science immensely. Here is a brief summary, without spoilers, of some of the dimensions I used to conclude that this book is fantastic.

Language & clarity
In reading R statistics, statistics, or data science books, one expects/hopes that like literate coding, the prose will be accessible, pleasant, and appropriately pitched. This book was ideal in this respect. It was more formal than conversational but not too technical. The structure facilitated comprehension and reading because it was clear and logical. The visuals added a dimension of attractive clarity to the writing that were not just code, prose, R, or data viz. Many of the visuals were excellent heuristics. Some were a reminder to the reader of the big picture in data science whilst others highlighted a particular workflow/approach.

Example of big picture visual.

data-science-explore

Example of mechanistic heuristic.

join-many-to-many

These were extremely useful. I could have even used more here and there, but in digging into the examples, I recognize that they were likely not always needed (and too much can be a bad thing too if poorly executed). The clarity was very high in almost every chapter of the book. I struggled with some of the more complex chapters (for me) such as relational data or some elements of the model building, but the flow keep me rolling through these even if some of the details eluded me.

join-venn

The expectation that data science or statistics books should be only read once is a challenging notion. Many of the chapters in this book certainly satisfy that criterion, but it depends on the purpose. Some of the more challenging chapters that you identify can be re-read for better comprehension and one could also follow along/experiment with in R studio. Sometimes, it is nonetheless good to get the message from alternate sources described or explained a little differently. In my reading R bonanza, some of the R-statistics books will not be revisited. My feeling for R for Data Science is that the clean style and direct writing do not conflate the message and re-reads would likely be beneficial when needed. The message in many chapters is also unique, and even a brief revisit would highlight some of the handling elements and assumptions associated with best practices for data science.

Philosophy
Welcome to the tidyverse. Enough said to all that follow and read up within the R community. This universe is logical and feels natural. The forthcoming ggvis will help further align the grammar and semantics that parallel the code and flow with pipes versus ‘+’ of ggplot2. Tibbles are a pleasant surprise. The wrangle readings satisfy. Tidiness is next to high-orderedness. Subscribing to the philosophy of readable code, consistent data structures, and logical workflows will promote better open science and reproducibility. This is never really explicitly stated, or if it was, I missed it. I suspect that this is a good thing. We can approach open science, open data, and more transparency in science from top-down or bottom-up efforts. By not repeatedly banging that drum per se but directly providing and describing the tools to handle data cleanly and consistently, this book provides a solid bottom-up pillar for the open science movement. Tidy data and readable code are shareable AND useable. Finally and aligned with this tools-first approach, the value of models and epistemology of hypotheses are stated later in the book (Chapter 19). This worked for me in reading this book but likely not in teaching to students. I like the hypothesis/model philosophy of ‘knowing data’ developed here. It was big data in origins, balanced, and emphasized bias and non-independence in exploring and testing models. What you can learn from a model also depends on how it is applied. This was well described. Split. Build. Think. Test. Know.
Your own personal variation would likely fit within a similar framework even with little data. I did wonder a bit how I can adapt some of the model fitting ideas to more of the little data common in some the ecological inquiries (solutions: (i) pilot field experiments can provide the training data, and (ii) resampling/bootstrapping using modelr to populate larger datasets for more independent EDA) . The reminder to avoid repetition is repeated. Not ironically.

Skills
Many books do not need to adapt. Most R statistics books likely do. Packages are often a gamechanger. Grammar changes. Base R is a must know of course, but streamlining and specifics often live in the libraries the community develops. This book is available for sale on amazon, and I assume it will adapt but more slowly than the bookdown version. The frame-rate of change in no way precludes reading the book now or revisiting at some later point in time. Model building chapters, the basics of wrangling, functions, and iterations are solid reading that provide a skill set needed right now. The data viz and perhaps data transformation chapters are most likely to change soon. Read now and capture those skills but expect change. There are also some nice examples of intermediate to advanced tricks in plotting that reading now will provide. Certainly,  this the case in the iteration and model chapters too – good intermediate skill building blocks for advanced coding data science. This skill set is pretty darn awesome (PDA), and the strings chapter was also very rich in news skills and a launchpad to text mining with other packages (inspired me to try it right after completion of reading book). Skills abound.

Bottom line (of code) review for readers

high.returns <- c(“basic.R.users”, “intermediate.R.users”)

tidy.data.science <- philosophy of consistent structures %>% visualize with models %>% share

Implication
There are many tools for open science (data management plans, slideshare, data repositories, GitHub, preprints, sharing meta-data, social media, blogs, and data publications) . However, effective date science in R can also be a powerful ally if you include the final steps of communicate (Chapters 23-25).

datascience-openscience

 

Posted in R

#rstudio #github missing command lines for mac setup @rstudio @github @swcarpentry

Preamble
Every few months, I try to do a clean install on my machine. I know that OS X Sierra is due out in September, but I elected to do a wipe and clean install now for the remainder of summer.

1940weozuj7fqjpg

Wipe, reinstall OSX from usb, brief minor hack/tweaks, then just a few apps including base-r and rstudio. I prefer to connect to github without desktop app and use rstudio directly.

Limitation, I forgot two little things that consumed forever to get rstudio and github to connect. So, if you are a mac user too, here is a synopsis.

tumblr_lr04fa04Ke1qg0z57o1_500

Most steps well articulated online
#open terminal/shell.
git config –global user.name “your_username”
git config –global user.email “your_email@example.com”

#missing 1 for macs: tell osx keychain to store password
git config –global credential.helper osxkeychain

#generate SSH RSA key via command line
ssh-keygen -t rsa -C “your_email@example.com”

#alternatively, you can do via rstudio tools/global options/enable version control
#then create RSA key, save, copy, and paste over to your github account online.

#check authentication works
ssh -T git@github.com

#missing 2 for macsdo a command line push to get password into osxkeychain
#I tried clone/new repo, make changes, commit, then push, and failed because no password to push changes via version control to github was stored and rstudio does not talk to keychain #frustrating
#so make/clone a repo, generate a change, and then do push from command line

git push -u origin gh-pages

#or

git push -u origin master

#depending on branch name

#I hope this note-to-self provides you with the missing lines you need to get your next level too!

unnamed

 

A common sense review of @swcarpentry workshop by @RemiDaigle @juliesquid @ben_d_best @ecodatasci from @brenucsb @nceas

Rationale
This Fall, I am teaching graduate-level biostatistics. I have not had the good fortune of teaching many graduate-level offerings, and I am really excited to do so. A team of top-notch big data scientists are hosted at NCEAS. They have recently formed a really exciting collaborative-learning collective entitled ecodatascience. I was also aware of the mission of software carpentry but had not reviewed the materials. The ecodatascience collective recently hosted a carpentry workshop, and I attended. I am a parent and use common sense media as a tool to decide on appropriate content. As a tribute to that tool and the efforts of the ecodatascience instructors, here is a brief common sense review.

comp

ecodatascience software carpentry workshop
spring 2016

rating

 

 

WHAT YOU NEED TO KNOW

sw carpentry review

You need to know that the materials, approach, and teaching provided through software carpentry are a perfect example of contemporary, pragmatic, practice-what-you-teach instruction. Basic coding skills, common tools, workflows, and the culture of open science were clearly communicated throughout the two days of instruction and discussion, and this is a clear 5/5 rating. Contemporary ecology should be collaborative, transparent, and reproducible. It is not always easy to embody this. The use of GitHub and RStudio facilitated a very clear signal of collaboration and documented workflows.

All instructors were positive role models, and both men and women participated in direct instruction and facilitation on both days.  This is also a perfect rating. Contemporary ecology is not about fixed scientific products nor an elite, limited-diversity set of participants within the scientific process. This workshop was a refreshing look at how teaching and collaboration have changed. There were also no slide decks. Instructors worked directly from RStudio, GitHub Desktop app, the web, and gh-pages pushed to the browser. It worked perfectly. I think this would be an ideal approach to teaching biostatistics.

Statistics are not the same as data wrangling or coding. However, data science (wrangling & manipulation, workflows, meta-data, open data, & collaborative analysis tools) should be clearly explained and differentiated from statistical analyses in every statistics course and at least primer level instruction provided in data science. I have witnessed significant confusion from established, senior scientists on the difference between data science/management and statistics, and it is thus critical that we communicate to students the importance and relationship between both now if we want to promote data literacy within society.

There was no sex, drinking, or violence during the course :). Language was an appropriate mix of technical and colloquial so I gave it a positive rating, i.e. I view 1 star as positive as you want some colloquial but not too much in teaching precise data science or statistics. Finally, I rated consumerism at 3/5, and I view this an excellent rating. The instructors did not overstate the value of these open science tools – but they could have and I wanted them to! It would be fantastic to encourage everyone to adopt these tools, but I recognize the challenges to making them work in all contexts including teaching at the undergraduate or even graduate level in some scientific domains.

Bottom line for me – no slide decks for biostats course, I will use GitHub and push content out, and I will share repo with students. We will spend one third of the course on data science and how this connects to statistics, one third on connecting data to basic analyses and documented workflows, and the final component will include several advanced statistical analyses that the graduate students identify as critical to their respective thesis research projects.

I would strongly recommend that you attend a workshop model similar to the work of software carpentry and the ecodatascience collective. I think the best learning happens in these contexts. The more closely that advanced, smaller courses emulate the workshop model, the more likely that students will engage in active research similarly. I am also keen to start one of these collectives within my department, but I suspect that it is better lead by more junior scientists.

Net rating of workshop is 5 stars.
Age at 14+ (kind of a joke), but it is a proxy for competency needed. This workshop model is best pitched to those that can follow and read instructions well and are comfortable with a little drift in being lead through steps without a simplified slide deck.