Prepping data for #rstats #tidyverse and a priori planning

messy data can be your friend (or frenemy)

Much, if not most, data cleanup, tidying, wrangling, and joining can be done directly in R. There are many advantages to this approach (i.e. read in data in whatever format, from excel to json to zip, and then do your tidying), including transparency, a record of what you did, reproducibility if you or someone else ever has to do it again for another experiment, and reproducibility again if your data get updated and you must rinse and repeat! Additionally, an all-R workflow forces you to think about your data structure, the class of each vector (or what each variable means/represents), and missing data, and it facilitates better QA/QC. Finally, once read in, your data assets are ready to go for the #tidyverse for joining, mutating in derived variables, or reshaping.

That said, I propose that good thinking precedes good rstats. For me, this ripples backwards from projects where I did not think ahead sufficiently. I certainly got the job done with R, and the tidyverse in particular, but it took time that could have been better spent on the statistical models later in the workflow. Consequently, here are some tips I have been thinking about this season of data collection and experimental design. They are pre-R for R, and they reflect how I know I like to subsequently join and tidy.

Tips

  1. Keep related data in separate csv files.
    For instance, I have a site with 30 long-term shrubs for which I measure morphology and growth, interactions and associations with the plant and animal community, and microclimate. I keep each set of data in a separate csv (no formatting, keeps it simple, reads in well): shrub_morphology.csv, associations.csv, and microclimate.csv. This matches how I collect the data in the field and represents the levels of thinking and sampling. At times I sample asynchronously, so I open each file only as needed. You can have a single excel file with multiple sheets instead and read those in using R, but I find that things tend to get lost, and it is harder for me to check sheets in parallel than to open each file side-by-side and do some thinking. Plus, this format reminds me to write a meta-data file for each data stream.
  2. Always have a key vector in each data file that represents the unique sampling instance.
    I like the #tidyverse dplyr join family (dplyr::left_join and relatives) for putting together my different data files. Here is an explanation of the workflow. In the shrub example, the 30 individual shrubs structure all observations for how much they grow, what other plants and animals associate with them, and what the microclimate looks like under their canopy, so I have a vector entitled shrub_ID that demarcates each instance in space that I sample. I often also have a fourth, descriptive data file for field sampling that uses the same unique ID as the key, wherein I add lat, long, qualitative observations, disturbances, or other supporting data.
  3. Ensure each vector/column is a single class.
    You can resolve this issue later, but I prefer to keep each vector a single class, i.e. all numeric, all character, or all date and time.
  4. Double-code confusing vectors for simplicity and error checking.
    I double-code date and time vectors to simpler vectors just to be safe. I try to use readr functions like read_csv that make minimal and, more often than not, correct assumptions about the class of each vector. However, to be safe with vectors that I have struggled with in the past and fixed in R using tidyverse tools or others, I now just set up secondary columns that match my experimental design and sampling. If I visit my site with 30 shrubs three times in a growing season, I have a date vector that captures this rich and accurate sampling process, i.e. August 14, 2019 as a row, but I also prefer a census column that has 1, 2, or 3 for each row. This helps me recall how often I sampled when I reinspect the data and also provides a means for quick tallies and other tools. Sometimes, if I know it is long-term data over many years, I also add a simple year column that lists 2017, 2018, and 2019. Yes, I can reverse engineer this in R, but I like the structure: a backbone or scaffold for my dataframe that supports my thinking about statistics to match design.
  5. Keep track of total unique observation instances.
    I like tidy data. In each dataframe, I like a vector that provides a total tally of the length of the data as a representation of unique observations. You can wrangle this in later, and this vector does not replace the unique ID key vector at all. In planning an experiment, I do some math. One site, 30 shrubs, 3 census events per season/year, and a total of 3 years. So, 1 x 30 x 3 x 3 should be 270 unique observations or rows. I hardcode that into the data to ensure that I did not miss or forget to collect data. It is also fulfilling to have them all checked off. A double-check using tibble::rowid_to_column should confirm that count. Further to tip #2, you can have a variable or set of variables to join different dataframes, so this becomes fundamentally useful in my join if I measured shrub growth and climate three times each year for three years (i.e. I now have a single observation_ID vector I generated that should match my hardcoded collection_ID data column, and I can ensure it lines up with the census column, etc., per year). A tiny bit of redundancy just makes it so much easier to check for missing data later.
  6. Leave blanks blank.
    This ensures your data code true and false zeros correctly (for me, a true zero means I observed a zero, i.e. no plants under the shrub at all, versus missing data) and also sticks to tip #3. My quick a priori rule, which I annotate in the meta-data for each file, is that missing altogether is coded as blank (i.e. no entry in that row/instance, but I still have the unique_ID and row there as a placeholder) and an observed zero in the field or during an experiment is coded as 0. Do not type ad hoc text codes for missing values into a numeric column in the csv: any string the reader does not recognize as a missing-value code (e.g. 'n/a' or 'none') flips the entire vector to character. By default read_csv treats both a blank and the literal NA as missing, but blanks are the simplest convention and other functions sort them out well too. Leaving blanks blank also means I can impute missing values later if needed.
  7. Never delete data.
    Further to tip #1, and other ideas described in R for Data Science, once I plan my experiment and decide on my data collecting and structural rules a priori, the data are sacred and stay intact. I have had many colleagues delete data that they did not 'need' once they derived their site-level climate estimates (then lived to regret it) or delete rows because they were blank (not omitted within an #rstats workflow, but by opening up the data file and deleting them). Sensu tip #5, I like the tally that matches the designed plan for the experiment and prefer to preserve the data structure.
  8. Avoid automatic factor assignments.
    Further to simple data formats like tip #1 and tip #4, I prefer to read in data and keep assumptions minimal until I am ready to build my models. Many packages and statistical tools do not need the vector to be factor class, and I prefer to make assignments directly to ensure statistics match the planned design for the purpose of each variable. Sometimes, variables can be both. The growth of the shrub in my example is a response to the growing season and climate in some models but a predictor in other models, such as the effect of the shrub canopy on the other plants and animals. The R package forcats can help you out when you need to enter into these decisions and challenges with levels within factors.
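
Tips 2, 5, 6, and 8 can be sketched in a few lines of R. This is a minimal sketch under stated assumptions, not my actual field data: the two inlined csv strings are hypothetical stand-ins for shrub_morphology.csv and microclimate.csv, cut down to 2 shrubs and 2 census events so it runs without files on disk. The column names height_cm, n_annuals, and soil_temp_c and the helper planned_rows are made up for illustration.

```r
library(readr)
library(dplyr)
library(tibble)

# Hypothetical stand-in for shrub_morphology.csv.
# Note the blank n_annuals for shrub 2, census 1 (missing) versus the
# observed 0 for shrub 1, census 2 (a true zero).
morphology_csv <- I("shrub_ID,census,year,height_cm,n_annuals
1,1,2019,110,4
1,2,2019,112,0
2,1,2019,95,
2,2,2019,97,3
")

# Hypothetical stand-in for microclimate.csv, sharing the same key columns.
microclimate_csv <- I("shrub_ID,census,year,soil_temp_c
1,1,2019,31.5
1,2,2019,29.0
2,1,2019,33.1
2,2,2019,30.2
")

# cols() lets readr guess the class of each column and keeps it quiet.
morphology <- read_csv(morphology_csv, col_types = cols())
microclimate <- read_csv(microclimate_csv, col_types = cols())

# Tips 3 and 6: the blank is read as NA, so the column stays numeric and
# missing data remain distinct from observed zeros.
stopifnot(is.numeric(morphology$n_annuals))
n_missing <- sum(is.na(morphology$n_annuals))
n_true_zero <- sum(morphology$n_annuals == 0, na.rm = TRUE)

# Tip 5: sites x shrubs x census events x years (hypothetical helper).
planned_rows <- function(sites, shrubs, census, years) {
  sites * shrubs * census * years
}
full_design_rows <- planned_rows(1, 30, 3, 3)  # the 270 rows from tip 5
toy_design_rows <- planned_rows(1, 2, 2, 1)    # the toy data above

# Generate a running observation_ID and confirm the tally matches the plan.
morphology <- rowid_to_column(morphology, var = "observation_ID")
stopifnot(nrow(morphology) == toy_design_rows)

# Tip 2: join the data streams on the key vectors.
key <- c("shrub_ID", "census", "year")
shrubs <- left_join(morphology, microclimate, by = key)

# The join should neither add nor drop sampling instances,
# and every morphology row should find a microclimate match.
stopifnot(nrow(shrubs) == toy_design_rows)
unmatched <- anti_join(morphology, microclimate, by = key)
stopifnot(nrow(unmatched) == 0)

# Tip 8: assign factors explicitly, only when a model needs them.
shrubs <- mutate(shrubs, census_f = factor(census))
```

If the stopifnot checks pass silently, the tally, the keys, and the join all line up with the planned design. With real files, the same read_csv calls would take the file paths instead of the inlined strings.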

Putting the different pieces together in science and data science is important. The construction of each project element, including the design of the experiment, evidence and data collection, and the #rstats workflow for data wrangling, viz, and statistical models, suggests that a little thinking and planning beforehand (like visual lego instruction guides) ensures that all these different pieces fit together in the process of project building and writing. Design them so that they connect easily.

Sometimes you can get away without instructions and that is fun, but jamming pieces together that do not really fit and trying to pry them apart later is never really fun.

#meeting challenge: avoid this term for two weeks

Meetings are universally reviled. Even those who are obligated to call or host them likely do not relish the notion.

Challenge. Do not use the term meeting for two weeks.

To facilitate this process, here is a set of synonyms or alt-terms to consider. Most importantly, using an alternative term that is more specific and coupled to the functional purpose of calling a meeting will ensure that outcomes are aligned with expectations. Hopefully, it will also reduce the implicit biases associated with calling, attending, and participating in meetings.

| synonym | implication | novel_function | relatedness |
| --- | --- | --- | --- |
| chat | implies brevity and light-hearted discussion | rapid update | 5 |
| conference | suggests equality | present novel findings | 5 |
| huddle | sports | about to make a play as team | 5 |
| walk | zen and outside | meet but move | 5 |
| workshop | implies skills and tool development | focus on technical problem solving | 5 |
| rally | change, social good, positive cheer | promote successes and recognize contributions | 5 |
| brainstorm | focus on ideation | sole purpose is to ideate | 4 |
| lightning session | brief reporting | updates provided rapidly in succession | 4 |
| Q&A | implies end of talk and open forum | seek answers to pressing questions | 4 |
| symposium | formal presentation and exploration of theory | propose big theories | 4 |
| congregation | part of the flock and family | connect as team | 3 |
| conversation | two-way dialogue implied | talking, not reporting | 3 |
| get-together | sociality encouraged | promotes individual discovery | 3 |
| lunch | sharing food together is primal | eating provides a focus & changes time perception | 3 |
| parley | pirates but making dispute resolution fun | conflict resolution without stating it | 3 |
| pop-up | crafty and creative ideas to sell | generate marketable ideas | 3 |
| shark tank | implies entertainment but critical analyses | permission to be critical | 3 |
| show-and-tell | implies bringing in an item or idea you care about | presentations of work but focus on tangible | 3 |
| showdown | wild west | informal chance to sound off on ideas or projects | 3 |
| snack | rapid consumption of a treat | short and sweet exchange | 3 |

Metadata
Synonyms are very loosely defined here as terms with similar meanings. Consider using any term that is aligned with the purpose of calling people together virtually or in person. Imagine lecture, lab, viewing, or any term that means bringing people together to do something. Meetings are no different but have really come to mean mostly negative opportunities to convene.

Implication is the first thing that pops into your mind when you see or hear the term.

Novel function is the sole and express purpose of the time allocated and need to bring people together. If you cannot think of a function for calling your team together, then you likely do not need to.

Relatedness is scored from 1 to 5, with 5 being closely aligned with the most likely purposes or functions of meetings and 1 being unrelated. Most terms here roughly approximate the traditional needs for a meeting, but feel free to get radical. After all, office parties are really meetings that have the capacity to be fun or at least provide libation.



Ecological network flavors: many-to-many, few-to-many, and few-to-many spatially

Recent conference attendance inspired me to do a quick typology of networks that were presented in various talks. All were done in R using a few different packages.
All were interested in diversity patterns.
None were food webs.

Networks

many-to-many: many plant species and many pollinators for instance

few-to-many: mapping the associated set of pollinators to one flowering species

few-to-many spatially: replicated mapping of diversity for one taxon to a single species of another, either nested or spatially contrasted.
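
The three flavors differ mainly in how the same interaction records are sliced. The talks used R packages that are not named here, so this is only a base R sketch of the data shapes, assuming a made-up table of plant-pollinator visits (the species, pollinators, and sites are all hypothetical).

```r
# Hypothetical visits: one row per observed plant-pollinator interaction.
visits <- data.frame(
  plant = c("plant_A", "plant_A", "plant_A", "plant_B", "plant_B", "plant_C"),
  pollinator = c("bee_1", "fly_1", "bee_2", "bee_1", "bee_2", "fly_1"),
  site = c("north", "north", "south", "north", "south", "south"),
  stringsAsFactors = FALSE
)

# many-to-many: every plant against every pollinator as an incidence matrix.
many_to_many <- table(plant = visits$plant, pollinator = visits$pollinator)

# few-to-many: the set of pollinators associated with one focal plant species.
focal <- visits[visits$plant == "plant_A", ]
few_to_many <- unique(focal$pollinator)

# few-to-many spatially: the same focal mapping, replicated and contrasted by site.
few_by_site <- split(focal$pollinator, focal$site)
richness_by_site <- lengths(few_by_site)
```

The many-to-many matrix is the usual starting point for network packages, while the two few-to-many objects are just filtered views of the same edge list.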

Network analyses are amazing. I need to learn more!

Can you also map interactions onto other interactions?

Fix-it Facilitation: additional resources

This was a super fun process: exploring how empirical contributions can reshape and embrace theory by addressing gaps through better designs and clear interpretations of findings.

Fix-it Felix: advances in testing plant facilitation as a restoration tool in Applied Vegetation Science.

The original contribution was longer with a more complete set of resources. Here is the full citation list that framed and supported the story and discussion.

Literature cited

Badano, E.I., Bustamante, R.O., Villarroel, E., Marquet, P.A. & Cavieres, L.A. 2015. Facilitation by nurse plants regulates community invasibility in harsh environments. Journal of Vegetation Science: 756-767.

Badano, E.I., Samour-Nieva, O.R., Flores, J., Flores-Flores, J.L., Flores-Cano, J.A. & Rodas-Ortíz, J.P. 2016. Facilitation by nurse plants contributes to vegetation recovery in human-disturbed desert ecosystems. Journal of Plant Ecology 9: 485-497.

Barney, J.N. 2016. Invasive plant management must be driven by a holistic understanding of invader impacts. Applied Vegetation Science 19: 183-184.

Bertness, M.D. & Callaway, R. 1994. Positive interactions in communities. Trends in Ecology and Evolution 9: 191-193.

Bronstein, J.L. 2009. The evolution of facilitation and mutualism. Journal of Ecology 97: 1160-1170.

Bruno, J.F., Stachowicz, J.J. & Bertness, M.D. 2003. Inclusion of facilitation into ecological theory. Trends in Ecology and Evolution 18: 119-125.

Bulleri, F., Bruno, J.F., Silliman, B.R. & Stachowicz, J.J. 2016. Facilitation and the niche: implications for coexistence, range shifts and ecosystem functioning. Functional Ecology 30: 70-78.

Callaway, R.M. 1998. Are positive interactions species-specific? Oikos 82: 202-207.

Chamberlain, S.A., Bronstein, J.L. & Rudgers, J.A. 2014. How context dependent are species interactions? Ecology Letters 17: 881-890.

Filazzola, A. & Lortie, C.J. 2014. A systematic review and conceptual framework for the mechanistic pathways of nurse plants. Global Ecology and Biogeography 23: 1335-1345.

Gomez-Aparicio, L., Zamora, R., Gomez, J.M., Hodar, J.A., Castro, J. & Baraza, E. 2004. Applying plant facilitation to forest restoration: a meta-analysis of the use of shrubs as nurse plants. Ecological Applications 14: 1128-1138.

Holmgren, M. & Scheffer, M. 2010. Strong facilitation in mild environments: the stress gradient hypothesis revisited. Journal of Ecology 98: 1269-1275.

James, J.J., Rinella, M.J. & Svejcar, T. 2012. Grass Seedling Demography and Sagebrush Steppe Restoration. Rangeland Ecology & Management 65: 409-417.

Lortie, C.J., Filazzola, A., Welham, C. & Turkington, R. 2016. A cost–benefit model for plant–plant interactions: a density-series tool to detect facilitation. Plant Ecology: 1-15.

Macek, P., Schöb, C., Núñez-Ávila, M., Hernández Gentina, I.R., Pugnaire, F.I. & Armesto, J.J. 2017. Shrub facilitation drives tree establishment in a semiarid fog-dependent ecosystem. Applied Vegetation Science.

Malanson, G.P. & Resler, L.M. 2015. Neighborhood functions alter unbalanced facilitation on a stress gradient. Journal of Theoretical Biology 365: 76-83.

McIntire, E. & Fajardo, A. 2011. Facilitation within species: a possible origin of group-selected superorganisms. American Naturalist 178: 88-97.

McIntire, E.J.B. & Fajardo, A. 2014. Facilitation as a ubiquitous driver of biodiversity. New Phytologist 201: 403-416.

Michalet, R., Brooker, R.W., Cavieres, L.A., Kikvidze, Z., Lortie, C.J., Pugnaire, F.I., Valiente‐Banuet, A. & Callaway, R.M. 2006. Do biotic interactions shape both sides of the humped‐back model of species richness in plant communities? Ecology Letters 9: 767-773.

Michalet, R., Le Bagousse-Pinguet, Y., Maalouf, J.-P. & Lortie, C.J. 2014. Two alternatives to the stress-gradient hypothesis at the edge of life: the collapse of facilitation and the switch from facilitation to competition. Journal of Vegetation Science 25: 609-613.

Noumi, Z., Chaieb, M., Michalet, R. & Touzard, B. 2015. Limitations to the use of facilitation as a restoration tool in arid grazed savanna: a case study. Applied Vegetation Science 18: 391-401.

O’Brien, M.J., Pugnaire, F.I., Armas, C., Rodríguez-Echeverría, S. & Schöb, C. 2017. The shift from plant–plant facilitation to competition under severe water deficit is spatially explicit. Ecology and Evolution 7: 2441-2448.

Pescador, D.S., Chacón-Labella, J., de la Cruz, M. & Escudero, A. 2014. Maintaining distances with the engineer: patterns of coexistence in plant communities beyond the patch-bare dichotomy. New Phytologist 204: 140-148.

Rydgren, K., Hagen, D., Rosef, L., Pedersen, B. & Aradottir, A.L. 2017. Designing seed mixtures for restoration on alpine soils: who should your neighbours be? Applied Vegetation Science.

Sheley, R.L. & James, J.J. 2014. Simultaneous intraspecific facilitation and interspecific competition between native and annual grasses. Journal of Arid Environments 104: 80-87.

Silliman, B.R., Schrack, E., He, Q., Cope, R., Santoni, A., van der Heide, T., Jacobi, R., Jacobi, M. & van de Koppel, J. 2015. Facilitation shifts paradigms and can amplify coastal restoration efforts. Proceedings of the National Academy of Sciences 112: 14295-14300.

Stachowicz, J.J. 2001. Mutualism, facilitation, and the structure of ecological communities. Bioscience 51: 235-246.

von Gillhaussen, P., Rascher, U., Jablonowski, N.D., Plückers, C., Beierkuhnlein, C. & Temperton, V.M. 2014. Priority Effects of Time of Arrival of Plant Functional Groups Override Sowing Interval or Density Effects: A Grassland Experiment. PLoS ONE 9: e86906.

Went, F.W. 1942. The dependence of certain annual plants on shrubs in southern California deserts. Bulletin of the Torrey Botanical Club 69: 100-114.

Xiao, S. & Michalet, R. 2013. Do indirect interactions always contribute to net indirect facilitation? Ecological Modelling 268: 1-8.

Why read a book review when you can read the book (for free via #oa #openscience)?

Reviews, recommendations, and ratings are an important component of contemporary online consumption. Rotten Tomatoes, Metacritic, and Amazon.com reviews and recommendations increasingly shape decisions. Science and technical books are no exception. Increasingly, I have checked reviews for a technical book on a purchasing site even before I downloaded the free book. Too much information, not too little, shapes many of the competing learning opportunities (#rstats for instance). I used to check the book reviews section in journals and enjoyed reading them (even if I never read the book). My reading habits have changed now, and I rarely read sections from journals and focus only on target papers. This is unfortunate. I recognize that reviews are important for many science and technical products (not just for books but packages, tools, and approaches). Here is my brief listicle for why reviews are important for science books and tools.

| benefit | description |
| --- | --- |
| curation | Reviews that are themselves reviewed and published in journals engender trust and give the critique some weight. |
| developments and rate of change | A book review typically frames the topic and offering of a book/tool in the progress of the science. |
| deeper dive into topic | The review usually speaks to a specific audience and helps one decide on fit with needs. |
| highlights | The strengths and limitations of the offering are described and can point out pitfalls. |
| insights and implications | Sometimes the implications and meaning of a book or tool are not described directly. Reviews can provide them. |
| independent comment | Critics are infamous. In science, the opportunity to offer praise is uncommon, and reviews can provide balance. |
| fits offering into specific scientific subdiscipline | Technical books can get lost because of the silo effect in the sciences. Reviews can connect disciplines. |

Here is an estimate of the frequency of publication of book reviews in some of the journals I read regularly.

| journal | total.reviews | recent |
| --- | --- | --- |
| American Naturalist | 12967 | 9 |
| Conservation Biology | 1327 | 74 |
| Journal of Applied Ecology | 270 | 28 |
| Journal of Ecology | 182 | 0 |
| Methods in Ecology & Evolution | 81 | 19 |
| Oikos | 211 | 22 |

Details of journal data scrape here: https://cjlortie.github.io/book.reviews/

A good novel tells us the truth about its hero; but a bad novel tells us the truth about its author.

–Gilbert K. Chesterton