A set of #rstats #AdventureTime themed #openscience slide decks


I recently completed a set of data-science-for-biostatistics training exercises for graduate students. I used R for Data Science and Efficient R Programming extensively to develop a set of Adventure Time R-statistics slide decks. Whilst I recognize that they are very minimal in terms of text, I hope that the general visual flow can provide a sense of the big-picture philosophy that R data science and R statistics offer contemporary scientists.

Slide decks

  1. WhyR? How tidy data, open science, and R align to promote open science practices.
  2. Become a data wrangleR. An introduction to the philosophy, tips, and associated use of dplyr (see the short sketch after this list).
  3. Contemporary data viz in R. The philosophy of the grammar of graphics, ggplot2, and some simple rules for effective data viz.
  4. Exploratory data analysis and models in R. An explanation of the difference between EDA and model fitting in R, followed by a short preview highlighting modelr.
  5. Efficient statistics in R. A visual summary of ideas from the ‘Efficient R Programming’ book, including chunking your work, efficient planning, and efficient coding suggestions in R.
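
For a flavour of what decks 2 and 3 cover, here is a minimal sketch of wrangling with dplyr pipes and then handing off to ggplot2’s ‘+’ syntax. It uses the built-in mtcars data and is an illustration of the approach, not an example taken from the decks.

library(dplyr)
library(ggplot2)

#summarise mean mpg per cylinder class with pipes, then build the plot in ggplot2 layers
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "cylinders", y = "mean mpg")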

Here are the knitted R Markdown HTML notes from the course too, https://cjlortie.github.io/r.stats/, and all the materials can be downloaded from the associated GitHub repo.

I hope this collection of goodies can be helpful to others.



A review of ‘R for Data Science’ book @hadleywickham #rstats #openscience

Data science

Data science is a critical component of many domains of research, including the domain in which I primarily function – ecology. However, in teaching biostatistics within the university context, we have typically focussed on the statistics and less on the science of data (i.e. handling, understanding, and manipulating data). This is unfortunate, but the teaching landscape is now rapidly evolving to include numerous institutional Master of Data Science degrees.


It has taken me an embarrassingly long time to appreciate the differences between data science and statistics. My teaching has embraced open science and shared many of the skills that students need to be scientifically literate citizens. However, data-literate citizens are important too if we want the next generation to make informed, evidence-based decisions about health, the economy, and the health of our ecosystems. Critical thinking tools for data are non-trivial concepts, and statistics are absolutely needed. However, the science of data, big or little, is critical in appreciating the decisions, steps, and workflows needed to prepare, share, analyze, collaborate on, and evaluate quantitative and qualitative data. I have been on a reading binge to this effect, to both appreciate the value of data science thinking and improve the skill set that I can share with students and some collaborators. Last week, I completed my latest adventure – ‘R for Data Science’ by Garrett Grolemund & Hadley Wickham.



The book was written in R markdown, compiled using bookdown, and it is free online. Appropriately, it thus embodies both open science and data science in how it is written. Bookdown is a package for R that knits a set of R markdown files together into a book. This is important because the book is open, you can clone it from GitHub, and it is written using one of the most powerful open science/data science tools, i.e. R (the language and environment); in reading online and seeing the code, you also appreciate the trickle-down effects of ‘open data science’ thinking on writing, collaboration, and even publishing. This is all incredible, and it is a peek into a very different future of scholarly communication. The book is nearly complete. I read what was available because I teach soon. It confirmed and advanced my understanding and skill set for data science immensely. Here is a brief summary, without spoilers, of some of the dimensions I used to conclude that this book is fantastic.
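
The bookdown mechanics themselves are pleasingly minimal. A hedged sketch, assuming a directory that holds index.Rmd plus one .Rmd file per chapter (the file layout here is illustrative, not the book’s actual sources):

#install once, then knit the whole directory of R markdown files into a book
install.packages("bookdown")
bookdown::render_book("index.Rmd")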

Language & clarity
In reading R-statistics, statistics, or data science books, one expects/hopes that, like literate coding, the prose will be accessible, pleasant, and appropriately pitched. This book was ideal in this respect. It was more formal than conversational but not too technical. The structure facilitated comprehension and reading because it was clear and logical. The visuals added a dimension of attractive clarity to the writing that was not just code, prose, R, or data viz. Many of the visuals were excellent heuristics. Some reminded the reader of the big picture in data science whilst others highlighted a particular workflow/approach.

Example of a big-picture visual.


Example of a mechanistic heuristic.


These were extremely useful. I could even have used more here and there, but in digging into the examples, I recognize that they were likely not always needed (and too much can be a bad thing too if poorly executed). The clarity was very high in almost every chapter of the book. I struggled with some of the more complex chapters (for me), such as relational data or some elements of the model building, but the flow kept me rolling through these even if some of the details eluded me.


The expectation that data science or statistics books need only be read once is a challenging notion. Many of the chapters in this book certainly satisfy that criterion, but it depends on the purpose. Some of the more challenging chapters can be re-read for better comprehension, and one can also follow along/experiment in RStudio. Sometimes it is nonetheless good to get the message from alternate sources, described or explained a little differently. In my R reading bonanza, some of the R-statistics books will not be revisited. My feeling for R for Data Science is that the clean style and direct writing do not muddy the message, and re-reads would likely be beneficial when needed. The message in many chapters is also unique, and even a brief revisit would highlight some of the handling elements and assumptions associated with best practices for data science.

Welcome to the tidyverse. Enough said for all who follow and read up within the R community. This universe is logical and feels natural. The forthcoming ggvis will help further align the grammar and semantics so that the code flows with pipes rather than the ‘+’ of ggplot2. Tibbles are a pleasant surprise. The wrangle readings satisfy. Tidiness is next to high-orderedness.

Subscribing to the philosophy of readable code, consistent data structures, and logical workflows will promote better open science and reproducibility. This is never really explicitly stated, or if it was, I missed it. I suspect that this is a good thing. We can approach open science, open data, and more transparency in science from top-down or bottom-up efforts. By not repeatedly banging that drum per se but directly providing and describing the tools to handle data cleanly and consistently, this book provides a solid bottom-up pillar for the open science movement. Tidy data and readable code are shareable AND useable.

Finally, and aligned with this tools-first approach, the value of models and the epistemology of hypotheses are stated later in the book (Chapter 19). This worked for me in reading the book but would likely not in teaching students. I like the hypothesis/model philosophy of ‘knowing data’ developed here. It is big data in its origins, balanced, and it emphasizes bias and non-independence in exploring and testing models. What you can learn from a model also depends on how it is applied. This was well described. Split. Build. Think. Test. Know.
Your own personal variation would likely fit within a similar framework even with little data. I did wonder a bit about how I can adapt some of the model-fitting ideas to the little data common in some ecological inquiries (solutions: (i) pilot field experiments can provide the training data, and (ii) resampling/bootstrapping using modelr to populate larger datasets for more independent EDA; see the sketch below). The reminder to avoid repetition is repeated. Not ironically.
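
A minimal sketch of solution (ii), using modelr’s bootstrap on the built-in mtcars data as a stand-in for a small field dataset (the model and sample size are placeholders, not a recommendation):

library(modelr)
library(purrr)

#resample a small dataset 100 times to explore the stability of a simple model
boots <- bootstrap(mtcars, n = 100)
models <- map(boots$strap, ~ lm(mpg ~ wt, data = .))
slopes <- map_dbl(models, ~ coef(.)[["wt"]])
quantile(slopes, c(0.025, 0.975)) #a rough interval for the slope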

Many books do not need to adapt. Most R-statistics books likely do. Packages are often a game-changer. Grammar changes. Base R is a must-know of course, but streamlining and specifics often live in the libraries the community develops. This book is available for sale on Amazon, and I assume it will adapt, but more slowly than the bookdown version. The frame rate of change in no way precludes reading the book now or revisiting it at some later point in time. The model-building chapters, the basics of wrangling, functions, and iteration are solid reading that provide a skill set needed right now. The data viz and perhaps data transformation chapters are the most likely to change soon. Read now and capture those skills, but expect change. There are also some nice examples of intermediate to advanced tricks in plotting that reading now will provide. Certainly, this is the case in the iteration and model chapters too – good intermediate building blocks for advanced data science coding. This skill set is pretty darn awesome (PDA), and the strings chapter was also very rich in new skills and a launchpad to text mining with other packages (it inspired me to try it right after completing the book). Skills abound.
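
As a taste of why the strings chapter felt like a launchpad, here is a tiny illustration; stringr is the package the chapter builds on, and the note strings below are hypothetical.

library(stringr)

#hypothetical field-note strings; extract the numeric temperatures from the text
notes <- c("site A: shrub temp 34.2", "site B: shrub temp 31.7")
as.numeric(str_extract(notes, "\\d+\\.\\d+"))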

Bottom line (of code) review for readers

high.returns <- c("basic.R.users", "intermediate.R.users")

tidy.data.science <- philosophy of consistent structures %>% visualize with models %>% share

There are many tools for open science (data management plans, SlideShare, data repositories, GitHub, preprints, sharing metadata, social media, blogs, and data publications). However, effective data science in R can also be a powerful ally if you include the final step of communicating (Chapters 23-25).




The importance of #upgoESA experiment by @DrHolly #ESA2016

The ‘Up-Goer Five Challenge: Using Common Language to Communicate Your Science to the Public‘ session was an experiment.  It was a brilliant success. Enjoyable and profound because of the direct and indirect discoveries in how we communicate and share. Semantics are important. Scientific language conveys complexity. Complexity can become a barrier. Simpler language tends to highlight emotions. Using simpler words can change meaning but make the narrative more powerful.  The main direct discovery was that we function, as scientists and communicators, on a continuum from jargon to overly simple, and we need to find the sweet spot in using complexity appropriately in sharing our findings with others (and one another).


However, I propose the ‘experiment’ need not have been successful for us to learn. Experiments are about discovery. We learn as much from error as from success in science. Trials are useful. The most exciting element of the Up-Goer Five model for talks was the fact that Dr. Holly Menninger proposed the session, it got approved, and many people participated (in speaking, attending, and the discussion). We need to try things out. We need to experiment with scientific communication just like we experiment with research systems and test hypotheses and predictions. There is a field of research in communication studies, and I am not proposing we must also become experts in that too. However, ESA meetings are a safe place for ecologists. At the minimum, we can try some new things in how we communicate with one another and explore efficacy and potential for different audiences. There is likely no one best way for every context. Importantly, we can practice taking risks. Each of us needs to decide what we are comfortable with. An oral session, poster, or ignite talk, for instance, each comes with different risks and challenges. The upgoESA model provided an alternative opportunity that came with new risks. However, we benefitted from the experiment and made some discoveries. Consequently, I propose we continue to look outward like Dr. Holly Menninger did and continue to bring new opportunities to future ESA meetings that explore how we communicate. PechaKucha, slide karaoke, video abstracts, streaming, micro-writing groups, hackathons, datashareathons, and more meetups are all viable experiments too. The session ‘Ecology on the Runway: An Eco-Fashion Show and Other Non-Traditional Public Engagement Approaches‘ was also an experiment with risks, entertainment, and a different set of messages.

We need to continue to hack the conference model and treat it like our own collective experiment to become better communicators. Plus, experiments are fun.



Sharing strategies for #ESA2016 #openscience #scicomm

Meetings are an excellent opportunity not only to communicate your science but also to secure feedback. I propose the more you give, the more you get.


There are at least the following five open-science products associated with any contribution (presentation or poster) to share with your colleagues and a much wider online audience prior to the meeting.

Open-science products to share for a meeting

  1. The slide deck or poster can be published on SlideShare.
  2. Your data-science workflow, code, and EDA can be published as an R Markdown document on GitHub (see the short sketch after this list).
  3. The primary or derived summary data (if you are not ready to go public yet) can be published on figshare (and/or included in GitHub repo).
  4. Most journals accept submissions that have been pre-printed. Consider sharing your draft paper on PeerJ or bioRxiv. Not at that stage? Do a blog post instead.
  5. Record a video abstract to share the main finding of your talk and post to YouTube or Vimeo. This could attract a larger audience to the conference presentation and does an incredibly useful service in communicating science to the public and others that do not attend the conference.
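
For item 2, a one-line sketch of the render step before pushing to GitHub; ‘eda.Rmd’ is a hypothetical file name, not one of my repo files.

#knit your EDA notebook to HTML so it can be shared from the repo
rmarkdown::render("eda.Rmd", output_format = "html_document")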


I enjoy the process of science way too much, and I am easily distracted. How did I do in preparing for my ESA talk this year on microenvironmental change under desert shrubs? I scored a total of 4 out of 5. Feel free to click on the bolded text below to see materials and provide feedback. Each is absolutely a work-in-progress, like the experiment itself (we need at least one more year of data). However, I am hoping it is a good time to share ideas now and see if we can do better next year in the field.

Deck on SlideShare

Code on GitHub

Data on GitHub

Video abstract (went a bit crazy here and did two). Field and in-office versions. Very high cheese factor in both (hard to be natural on camera).

Science is a process. Share your steps.





#rstudio #github missing command lines for mac setup @rstudio @github @swcarpentry

Every few months, I try to do a clean install on my machine. I know that macOS Sierra is due out in September, but I elected to do a wipe and clean install now for the remainder of summer.


Wipe, reinstall OS X from USB, brief minor hacks/tweaks, then just a few apps, including base R and RStudio. I prefer to connect to GitHub without the desktop app and use RStudio directly.

Limitation: I forgot two little things, and it took forever to get RStudio and GitHub to connect. So, if you are a Mac user too, here is a synopsis.


Most steps are well articulated online.
#open terminal/shell
git config --global user.name "your_username"
git config --global user.email "your_email@example.com"

#missing 1 for macs: tell the OS X keychain to store your password
git config --global credential.helper osxkeychain

#generate an SSH RSA key via the command line
ssh-keygen -t rsa -C "your_email@example.com"

#alternatively, you can do this via RStudio: Tools > Global Options > enable version control,
#then create the RSA key, save, copy, and paste it over to your GitHub account online.

#check that authentication works
ssh -T git@github.com

#missing 2 for macs: do a command-line push to get the password into the OS X keychain
#I tried clone/new repo, make changes, commit, then push, and failed because no password to
#push changes via version control to GitHub was stored, and RStudio does not talk to the keychain #frustrating
#so make/clone a repo, generate a change, and then do the push from the command line

git push -u origin gh-pages

git push -u origin master

#depending on branch name
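
#one extra check that was not in my original notes, but useful if you want to
#confirm the helper took before pushing: read the stored value back
git config --global credential.helper
#should print: osxkeychain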

#I hope this note-to-self provides you with the missing lines you need to get to your next level too!