Prepping data for #rstats #tidyverse and a priori planning

messy data can be your friend (or frenemy)

Much, if not most, data cleanup, tidying, wrangling, and joining can be done directly in R. There are many advantages to this approach (i.e. read in data in whatever format, from excel to json to zip, and then do your tidying), including transparency, a record of what you did, reproducibility (if you ever have to do it again for another experiment or someone else does), and reproducibility again if your data get updated and you must rinse and repeat! Additionally, an all-R workflow forces you to think about your data structure, the class of each vector (or what each variable means/represents), and missing data, and it facilitates better QA/QC. Finally, once read in, your data assets are ready to go for the #tidyverse for joining, mutating to add derived variables, or reshaping. That said, I propose that good thinking precedes good rstats. For me, this ripples backwards from projects where I did not think ahead sufficiently. I certainly got the job done with R, and the tidyverse in particular, but it took time that could have been better spent on the statistical models later in the workflow. Consequently, here are some tips I have been thinking about this season of data collection and experimental design. They are pre-R for R, and they reflect how I know I like to subsequently join and tidy.


  1. Keep related data in separate csv files.
    For instance, I have a site with 30 long-term shrubs for which I measure morphology and growth, interactions/associations with the plant and animal community, and microclimate. I keep each set of data in a separate csv (no formatting, keeps it simple, reads in well): shrub_morphology.csv, associations.csv, and microclimate.csv. This matches how I collect the data in the field and represents the levels of thinking and sampling. At times I sample asynchronously, so I open each file only as needed. You can have a single excel file with multiple sheets instead and read those in using R, but I find that things tend to get lost, and it is hard for me to check sheets in parallel versus just opening up each file side-by-side and doing some thinking. Plus, this format reminds me to write a meta-data file for each data stream.
  2. Always have a key vector in each data file that represents the unique sampling instance.
    I like the #tidyverse dplyr join family to put together my different data files. Here is an explanation of the workflow. So, in the shrub example, the 30 individual shrubs structure all observations for how much they grow, what other plants and animals associate with them, and what the microclimate looks like under their canopy, so I have a vector entitled shrub_ID that demarcates each instance in space that I sample. I often also have a fourth, descriptive data file for field sampling that uses the same unique ID as the key, wherein I add lat, long, qualitative observations, disturbances, or other supporting data.
  3. Ensure each vector/column is a single class.
    You can resolve this issue later, but I prefer to keep each vector a single class, i.e. all numeric, all character, or all date and time.
  4. Double-code confusing vectors for simplicity and error checking.
    I double-code date and time vectors to simpler vectors just to be safe. I try to use readr functions like read_csv that make minimal and, more often than not, correct assumptions about the class of each vector. However, to be safe with vectors that I have struggled with in the past, and fixed in R using tidytools or others, I now just set up secondary columns that match my experimental design and sampling. If I visit my site with 30 shrubs three times in a growing season, I have a date vector that captures this rich and accurate sampling process, i.e. August 14, 2019 as a row, but I also prefer a census column that has 1, 2, or 3 for each row. This helps me recall how often I sampled when I reinspect the data and also provides a means for quick tallies and other tools. Sometimes, if I know it is long-term data over many years, I also add a simple year column that lists 2017, 2018, and 2019. Yes, I can reverse engineer this in R, but I like the structure: it is a backbone or scaffold for my dataframe that supports my thinking about statistics that match the design.
  5. Keep track of total unique observation instances.
    I like tidy data. In each dataframe, I like a vector that provides a total tally of the length of the data as a representation of unique observations. You can wrangle it in later, and this vector does not replace the unique ID key vector at all. In planning an experiment, I do some math. One site, 30 shrubs, 3 census events per season/year, and a total of 3 years. So, 1 x 30 x 3 x 3 should be 270 unique observations or rows. I hardcode that into the data to ensure that I did not miss or forget to collect data. It is also fulfilling to have them all checked off. A double-check using tibble::rowid_to_column should confirm that count. Further to tip #2, you can have a variable or set of variables to join different dataframes, so this becomes fundamentally useful in my join if I measured shrub growth and climate three times each year for three years (i.e. I now have a single observation_ID vector I generated that should match my hardcoded collection_ID data column, and I can ensure it lines up with the census column too, per year). A tiny bit of redundancy just makes it so much easier to check for missing data later.
  6. Leave blanks blank.
    This ensures your data code true and false zeros correctly (for me, a true zero means I observed a zero, i.e. no plants under the shrub at all, versus missing data), and it also sticks to tip #3. My quick a priori rule, which I annotate in the meta-data for each file, is that missing altogether is coded as blank (i.e. no entry in that row/instance, but I still have the unique_ID and row there as a placeholder) and an observed zero in the field or during an experiment is coded as 0. Do not type ad hoc missing-data codes such as ‘N/A’ or ‘missing’ into a numeric column in the csv, because any text the reader does not recognize as missing flips the entire vector to character; read_csv treats blanks (and the literal string NA) as missing by default, so blanks are the simplest option. Leaving blanks blank also means I can impute missing values later if needed.
  7. Never delete data.
    Further to tip #1, and other ideas described in R for Data Science, once I plan my experiment and decide on my data collecting and structural rules a priori, the data are sacred and stay intact. I have had many colleagues delete data that they did not ‘need’ once they derived their site-level climate estimates (then live to regret it) or delete rows because they were blank (not omitting them in an #rstats workflow, but opening up the data file and deleting them). Sensu tip #5, I like the tally that matches the designed plan for the experiment and prefer to preserve the data structure.
  8. Avoid automatic factor assignments.
    Further to simple data formats like tip #1 and tip #4, I prefer to read in data and keep assumptions minimal until I am ready to build my models. Many packages and statistical tools do not need the vector to be factor class, and I prefer to make assignments directly to ensure the statistics match the planned design for the purpose of each variable. Sometimes, variables can be both. The growth of the shrub in my example is a response to the growing season and climate in some models but a predictor in other models, such as the effect of the shrub canopy on the other plants and animals. The r-package forcats can help you out when you need to enter into these decisions and challenges with levels within factors.
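
Tips #2, #5, and #6 can be sketched together in a few lines. This is a minimal sketch, not my exact field workflow: the file names and the shrub_ID, year, and census keys come from the example above, while the measurement columns (height_cm, understory_plants, temp_C) and the simulated values are assumptions made up for illustration. It writes two toy csv files to a temporary folder, reads them back with explicit column classes, joins them on the keys, and checks the planned tally of 270 rows.

```r
library(readr)
library(dplyr)
library(tibble)

set.seed(42)

# Planned design from the text: 1 site x 30 shrubs x 3 censuses x 3 years
expected_n <- 1 * 30 * 3 * 3
design <- expand.grid(census = 1:3, year = 2017:2019, shrub_ID = 1:30)

# Toy stand-ins for two of the csv files (measurement columns are hypothetical)
morphology <- design
morphology$height_cm <- round(runif(nrow(design), 50, 200))
morphology$understory_plants <- 0L     # observed zeros
morphology$understory_plants[5] <- NA  # one missing observation, left blank in the csv

microclimate <- design
microclimate$temp_C <- round(rnorm(nrow(design), 25, 3), 1)

data_dir <- tempdir()
write_csv(morphology, file.path(data_dir, "shrub_morphology.csv"), na = "")
write_csv(microclimate, file.path(data_dir, "microclimate.csv"), na = "")

# Read back with one explicit class per column; blank cells arrive as NA
shrubs <- read_csv(
  file.path(data_dir, "shrub_morphology.csv"),
  col_types = cols(
    shrub_ID = col_integer(),
    year = col_integer(),
    census = col_integer(),
    height_cm = col_double(),
    understory_plants = col_integer()
  )
)
climate <- read_csv(
  file.path(data_dir, "microclimate.csv"),
  col_types = cols(
    shrub_ID = col_integer(),
    year = col_integer(),
    census = col_integer(),
    temp_C = col_double()
  )
)

# Join on the key vectors, then generate a running observation ID
keys <- c("shrub_ID", "year", "census")
joined <- shrubs %>%
  left_join(climate, by = keys) %>%
  rowid_to_column(var = "observation_ID")

# The tally should match the design, and blanks stay distinct from true zeros
stopifnot(nrow(joined) == expected_n)
n_missing <- sum(is.na(joined$understory_plants))
n_true_zero <- sum(joined$understory_plants == 0, na.rm = TRUE)
```

If the stopifnot call fails after a real join, that is usually the prompt to look for a duplicated or missing key in one of the files before going any further.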

Putting the different pieces together in science and data science is important. The construction of each project element, including the design of the experiment, evidence and data collection, and the #rstats workflow for data wrangling, viz, and statistical models, suggests that a little thinking beforehand and planning (like visual lego instruction guides) ensures that all these different pieces fit together in the process of project building and writing. Design them so that they connect easily.

Sometimes you can get away without instructions and that is fun, but jamming pieces together that do not really fit and trying to pry them apart later is never really fun.

#rstats adventures in the land of @rstudio shiny (apps)

Colleagues and I had some sweet telemetry data, we did some simple models (& some relatively more complex ones too), we drew maps, and we wrote a paper. However, I thought it would be great to also provide stakeholders with the capacity to engage with the models, data, and maps. I published the data with a DOI, published the code at zenodo (& online at GitHub), and submitted the paper to a journal. We elected not to pre-print because this particular field of animal ecology is not an easy place. My goal was to rapidly spin up some interactive capacity via two apps.

The Map app is simple, but it was really surprising once rendered. Interactivity revealed a very different and much clearer finding. This was a fascinating adventure!
The Model app, which explores the distribution of the data and the resource selection function application for this species, confirmed what we concluded in the paper.

The Shiny app development flow is straightforward, and I like the logic!!
1. Use RStudio
2. Set up a shiny app account (free for up to 25hrs total use per month)
3. Set up a single r script with three elements
(i) ui
(ii) server
(iii) generate app (typically single line)
4. Click run app in RStudio to see it.
5. Test and play.
6. Publish (click publish button).

There is a bit more to it but not much more.

A user interface makes it an app (haha), the server serves up the rstats or your work, and the final line generates the app using the shiny package. I could have an interactive html page published on GitHub and use plotly and leaflet etc, but I wanted to have the sliders and select input features more like a web app, because it is one.
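
The three elements of the single script can be sketched as follows. This is a minimal, generic example rather than either of my apps: the input name n, the output name hist, and the random data are placeholders chosen for illustration.

```r
library(shiny)

# (i) ui: one slider input and one plot output
ui <- fluidPage(
  sliderInput("n", "Number of points", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# (ii) server: the rstats that responds to the input
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random draws")
  })
}

# (iii) generate the app (typically a single line)
app <- shinyApp(ui = ui, server = server)

# Launch only in an interactive session so the script can also be sourced safely
if (interactive()) runApp(app)
```

In an app.R file meant for publishing, the script would normally end with the bare shinyApp(ui = ui, server = server) call so that the app object is the last value in the file.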

Main challenge to adventure was leaflet and reactive data
The primary challenge, adventure time style, was the reactive data calls and leaflet. If you have to produce an interactive map that can be updated with user input, you change your workflow a tiny bit.
a. The select input becomes an input$var that is, in essence, the name of a vector you can use in your rstats code. So, this is as intuitive to me as it is in a conventional shiny app.
b. To take advantage of user input to render an updated map, I struggled a bit. You still use the input, but you want to filter your data to replot the map. Novel elements include introducing a reactive function call to rewrite your dataframe in the server chunk; then, for leaflet, you first draw the map with renderLeaflet but then use an observe function to update the map with the reactive, i.e. user-defined, subset of the data. Simple in concept now that I get it, but it was still a bit tweaky to call specific elements from reactive data for mapping.
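
Here is a minimal sketch of that pattern. The data frame pts, its columns, and the input name animal are made up for illustration, and the use of leafletProxy inside observe is the common leaflet idiom for updating an existing map; I am assuming it here rather than reproducing the code from my app.

```r
library(shiny)
library(leaflet)

# Hypothetical telemetry-style points for two animals
set.seed(1)
pts <- data.frame(
  animal = rep(c("A", "B"), each = 20),
  lng = runif(40, -119.8, -119.6),
  lat = runif(40, 35.0, 35.2)
)

ui <- fluidPage(
  selectInput("animal", "Animal", choices = unique(pts$animal)),
  leafletOutput("map")
)

server <- function(input, output, session) {
  # Reactive call: rewrites the dataframe whenever the user changes the input
  selected <- reactive({
    pts[pts$animal == input$animal, ]
  })

  # Base map is rendered once, framed on the full extent of the data
  output$map <- renderLeaflet({
    leaflet() %>%
      addTiles() %>%
      fitBounds(min(pts$lng), min(pts$lat), max(pts$lng), max(pts$lat))
  })

  # Observer: redraws only the markers using the user-defined subset
  observe({
    leafletProxy("map", data = selected()) %>%
      clearMarkers() %>%
      addCircleMarkers(lng = ~lng, lat = ~lat, radius = 4)
  })
}

app <- shinyApp(ui, server)
if (interactive()) runApp(app)
```

The tweaky part is remembering that selected is a function: the subset is selected(), and a column from it is selected()$lat, not selected$lat.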

Apps from your work can illuminate patterns for others and for you.

Apps can provide a mechanism to interact with your models and see the best fits or outcomes in a more parallel, extemporary capacity.

Apps are a gratifying means to make statistics and data more accessible.

Short-cut/parsimony coding: If you wrap your data script or wrangling into the renderPlot call, your data becomes reactive (without the formal reactive function).
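
A small sketch of this short-cut, using the built-in mtcars data and a hypothetical input named cyl as stand-ins. The filtering step sits inside renderPlot, so it reruns whenever the input changes and no separate reactive call is needed.

```r
library(shiny)

ui <- fluidPage(
  selectInput("cyl", "Cylinders", choices = sort(unique(mtcars$cyl))),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    # Wrangling lives inside the render call, so it depends on input$cyl
    d <- mtcars[mtcars$cyl == as.numeric(input$cyl), ]
    plot(d$wt, d$mpg, xlab = "Weight", ylab = "Miles per gallon")
  })
}

app <- shinyApp(ui, server)
if (interactive()) runApp(app)
```

The trade-off is that the subset is only available inside this one plot. If a table or a map needs the same subset, a formal reactive avoids repeating the wrangling.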

The position of scripts is important – check this – numerous options where to read in data and this has consequences.

Also, consider modularizing your code.

Check out the conditionalPanel function for customization across tabPanels. Tips in general here for shiny.
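
As a rough sketch of how conditionalPanel can tailor the sidebar to the active tab: the tab names, the tabsetPanel id (tabs), and the inputs below are all placeholders for illustration. The condition is a JavaScript expression, which is why it reads input.tabs rather than input$tabs.

```r
library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # Shown only while the "Map" tab is active
      conditionalPanel(
        condition = "input.tabs == 'Map'",
        selectInput("animal", "Animal", choices = c("A", "B"))
      ),
      # Shown only while the "Model" tab is active
      conditionalPanel(
        condition = "input.tabs == 'Model'",
        sliderInput("bins", "Histogram bins", min = 5, max = 50, value = 20)
      )
    ),
    mainPanel(
      tabsetPanel(
        id = "tabs",
        tabPanel("Map", textOutput("map_note")),
        tabPanel("Model", plotOutput("hist"))
      )
    )
  )
)

server <- function(input, output) {
  output$map_note <- renderText(paste("Selected animal:", input$animal))
  output$hist <- renderPlot(
    hist(faithful$waiting, breaks = input$bins, main = "Waiting times")
  )
}

app <- shinyApp(ui, server)
if (interactive()) runApp(app)
```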

A checklist for choosing between #rstats packages

The paradox of choice can at times be a challenge. There are well over 10,000 packages on CRAN now (likely 16,000), and there have been suggestions on how to find what you need but not necessarily on how to choose between alternatives. Here is a brief checklist that I used to contrast two similar packages for doing meta-analyses in R (summarized in a preprint qualitatively).

regularly maintained
recently updated
package maintainer on GitHub
manual available
vignettes available
used/published in similar projects
aligned workflow
semantics intuitive
contemporary grammar
functions that get the job done
arguments to support needs
visualization options (if needed)
dependencies reported/reasonable
connects to other packages
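
A few of these items (recently updated, maintainer, vignettes, dependencies) can be pulled from the metadata of installed packages. The sketch below is only an assumption about how one might tabulate them; describe_pkg is a helper invented for illustration, and stats and utils are used purely because they are always installed. Swap in the two packages you are contrasting once they are installed.

```r
# Summarize a few checklist items from an installed package's DESCRIPTION file
describe_pkg <- function(pkg) {
  d <- utils::packageDescription(pkg)
  field <- function(name) {
    value <- d[[name]]
    if (is.null(value)) NA_character_ else value
  }
  count_listed <- function(x) {
    if (is.na(x)) 0L else length(strsplit(x, ",")[[1]])
  }
  data.frame(
    package = pkg,
    version = field("Version"),
    published = field("Date/Publication"),  # NA for base packages
    maintainer = field("Maintainer"),
    url = field("URL"),                     # often where a GitHub link lives
    n_imports = count_listed(field("Imports")),
    n_vignettes = nrow(utils::vignette(package = pkg)$results)
  )
}

candidates <- c("stats", "utils")  # placeholders; replace with your alternatives
comparison <- do.call(rbind, lapply(candidates, describe_pkg))
print(comparison)
```

The rest of the checklist (aligned workflow, intuitive semantics, contemporary grammar) still needs a human read of the manual and vignettes.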


It is great to have different choices, and it is important to explore different alternatives because each can lead to potentially different adventures.

A vision statement describing goals for Ecology @ESAEcology #openscience

Many aspects of the journal Ecology are exceptional. It is a society journal and that is important. The strength of research, depth of reporting, and scope of primary ecological research that informs and shapes fundamental theory has been profound. None of these benefits need to change. Nonetheless, research that supports the scientific process and engenders discovery can always evolve and must be fluent. So must the process of scientific communication, including publications through journals. With collaborators and support from NCEAS and a large publishing company, I have participated in meta-science research examining needs and trends in the process of peer review for ecologists and evolutionary biologists, e.g. ‘Behind the shroud: a survey of editors in ecology and evolution’ published in Frontiers in Ecology and the Environment, or biases in peer review, such as ‘Systematic Variation in Reviewer Practice According to Country and Gender in the Field of Ecology and Evolution’ published in PLOS ONE. In total, we have published 50 peer-reviewed publications describing a path forward for ecology and evolution, in particular with respect to inclusivity, open science, and journal policy. Ideally, we have identified at least three salient elements for journals relevant to authors, referees, and editors, and four pillars for a future for scholarly publishing more broadly. The three elements for Ecology specifically would be speed, recognition, and more full and reproducible reporting. The four pillars include an ecosystem of products, open access, open or better peer review, and recognition for participation in the process.


Goals to consider

  1. Rapid peer review with no more than 4 weeks total for first decision.
  2. A 50% editor-driven rejection rate of initial submissions.
  3. Two referees per submission if in agreement (little to no evidence more individuals are required).
  4. Double the 2017 impact factor to ~10 within 2 years and return to a top 10 ranking among the 160 journals listed in the field of ecology.
  5. Further diversify the contributions to address exploration, confirmation, replication, consolidation, & synthesis.
  6. Innovate content offering to encompass more elements of the scientific process including design, schemas, workflows, ideation tools, data models, ontologies, and challenges.
  7. Allow authors to report failure and bias in process and decision making for empirical contributions.
  8. Provide additional novel material from every publication as free content even when behind paywall.
  9. Develop a collaborative reward system for the editorial board that capitalizes on existing expertise and produces novel scientific content such as editorials, commentaries, and the reviews as outwardly facing products. Include and invite referees to participate in these ‘meta’ papers because reviews are a form of critical and valuable synthesis.
  10. Promote a vision of scientific synthesis in every publication in the Discussion section of reports. Request an effect size measure for reports to provide an anchor for future reuse (i.e. use the criteria proposed in ‘Will your paper be used in a meta‐analysis? Make the reach of your research broader and longer lasting’).
  11. Revise the data policy to require data deposition – at least in some form such as derived data – openly prior to final acceptance but not necessarily for initial submission.
  12. Request access to code and data for review process.
  13. Explore incentives for referees – this is a critical issue for many journals. Associate reviews with Publons or ORCID.
  14. Emulate the PeerJ model for badges and profiles for editors, authors, and
  15. Remove barriers for inclusivity of authors through double-blind review.
  16. Develop an affirmative action and equity statement for existing publications and submissions to promote diversity through elective declaration statements and policy changes.
  17. All editors must complete awareness training for implicit bias. Editors can also be considered for certification awarded by the ESA based on merit of reviewing such as volume, quality of reviews, and service. Recognition and social capital are important incentives.
  18. Develop an internship program for junior scientists to participate in the review and editorial process.
  19. Explore reproducibility through experimental design and workflow registration with the submission process.
  20. Remove cover letters as a requirement for submission.


I value our community and the social good that our collective research, publications, and scientific outcomes provide for society.  However, I am also confident that we can do more.  Journals and the peer review process can function to illuminate the scientific process and peer review including addressing issues associated with reproducibility in science and inclusivity.  Know better, do better.  It is time for scientific journals to evolve, and the journal Ecology can be a flagship for change that benefits humanity at large by informing evidence-based decision making and ecological literacy.