name: title
class: Left, middle

# During analysis · Data/coding practices and tools

### [Act II] Recommendations and practices for open and reproducible research

.large[Reproducible Research Practices (RRP'23) · April 2023]

.right[Carlos Granell · Sergi Trilles]

.right[Universitat Jaume I]

---
class: inverse, bottom, middle

## Karl Popper's *Conjectures and Refutations*

.large[The ideas we can most trust are those that have been the most tried and tested]

???

[Karl Popper](https://en.wikipedia.org/wiki/Karl_Popper)

---
name: rec11
class: inverse, center, middle

# .blue.bg-white[\#11]

# Open data != reproducible data

---
class: left

### Open data != reproducible

.huge[“Openness and Open Science (data sharing, code sharing, open access, etc.) are enablers of reproducibility, but do not necessarily guarantee it” .small[<a name=cite-chiarelli2021></a>[[CLJ21](https://doi.org/10.5281/zenodo.5521077)]]]

.huge[By default:]

- .large[Open != good (of high academic quality)]

--

- .large[Reproducible != good (of high academic quality)]

--

- .large[Open != reproducible]

???

Source: [Becoming a better scientist with open and reproducible research (2)](https://lgatto.github.io/open-and-rr-2/)

---
name: rec12
class: inverse, center, middle

# .blue.bg-white[\#12]

# Data (required) for reproducibility

---
class: left

### Data for reproducibility

.huge[Are datasets valuable contributions by themselves? .small[<a name=cite-noy2023></a>[[NG23](https://dl.acm.org/doi/10.1145/3528574)]]]

--

.huge[Go to next [Recommendation #13](#rec13)]

---
class: left

### Data for reproducibility

.huge[Are datasets meaningless outside of the accompanying article?
.small[[[NG23](https://dl.acm.org/doi/10.1145/3528574)]]]

--

- .large[Reproducibility means .gray.bg-blue[access] to datasets to validate the research]
- .large[Access does not necessarily imply that datasets are open, citable, or discoverable by themselves]
- .large[**Suggestion**: Deposit the reproducibility package (data, code, docs) on Zenodo or a similar repository rather than as supplementary material tied to the paper]
- .large[Hint: [connecting GitHub with Zenodo](https://genr.eu/wp/cite/)]

---
class: left

### Data for reproducibility

.huge[Example: [Using mobile devices as scientific measurement instruments: Reliable android task scheduling](https://doi.org/10.1016/j.pmcj.2022.101550)]
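
.large[A minimal sketch of one way to prepare a reproducibility package (data, code, docs) before depositing it in a repository such as Zenodo: record a checksum for every file so that others can verify they obtained exactly what was deposited. The function names and the folder layout are hypothetical, and this is not part of any Zenodo tooling.]

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir: Path) -> dict[str, str]:
    """Map each file's relative path to its checksum, ordered by path."""
    entries = {
        p.relative_to(package_dir).as_posix(): sha256_of(p)
        for p in package_dir.rglob("*")
        if p.is_file()
    }
    return dict(sorted(entries.items()))


if __name__ == "__main__":
    # Hypothetical package layout, created in a temporary folder for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for folder in ("data", "code", "docs"):
            (root / folder).mkdir()
        (root / "data" / "measurements.csv").write_text("id,value\n1,0.5\n")
        (root / "code" / "analysis.py").write_text("print('analysis')\n")
        (root / "docs" / "README.md").write_text("# How to reproduce\n")
        for name, checksum in build_manifest(root).items():
            print(f"{checksum}  {name}")
```

.large[The manifest can be shipped alongside the package; re-running `build_manifest` on a downloaded copy and comparing the two mappings shows whether any file changed.]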
???

https://jojozhuang.github.io/tutorial/mermaid-cheat-sheet/

---
name: rec13
class: inverse, center, middle

# .blue.bg-white[\#13]

# Be (data) FAIR, my friend

---
class: left

### Be (data) FAIR, my friend

.pull-left[
<img src="images/fair-principles.png" width="75%" style="display: block; margin: auto;" />
]

.pull-right[
.large[[Findable, Accessible, Interoperable, Reusable](https://www.go-fair.org/fair-principles/) for scientific data management and stewardship]

- .large[Emphasis on identifiers, metadata, standards, licenses, permanence .small[<a name=cite-wilkinson2016></a>[[Wil+16](http://dx.doi.org/10.1038/sdata.2016.18)]]]
- .large[Analysis of [metadata standards](https://github.com/leipzig/metadata-in-rcr) for reproducible research .small[<a name=cite-leipzig2021></a>[[Lei+21](https://www.sciencedirect.com/science/article/pii/S2666389921001707)]]]
- .large[Ten simple rules for getting and giving credit for data .small[<a name=cite-wood-charlson2022></a>[[Woo+22](https://doi.org/10.1371/journal.pcbi.1010476)]]]
]

???

Standards that are featured within .small[[[Lei+21](https://www.sciencedirect.com/science/article/pii/S2666389921001707)]] can be found at https://github.com/leipzig/metadata-in-rcr.

---
class: left

### Be (data) FAIR, my friend

.huge[Datasets *are* valuable contributions by themselves .small[[[NG23](https://dl.acm.org/doi/10.1145/3528574)]]]

--

- .large[Datasets should be findable, accessible, interoperable, and reusable]

--

- .large[Datasets as .gray.bg-blue[first-class citizens] in scientific discourse .small[[[NG23](https://dl.acm.org/doi/10.1145/3528574)]]...]
--

- .large[...so others can reuse, cite, evaluate, and create value based on them to advance knowledge (.gray.bg-blue[replicability?])]

---
class: left

### Be (data) FAIR, my friend

.pull-left[
.huge[[Coalition for Publishing Data in the Earth and Space Sciences (COPDESS)](https://copdess.org/enabling-fair-data-project/commitment-statement-in-the-earth-space-and-environmental-sciences/)]:

> .large[all journals in geosciences require authors to make all data that supports the conclusions in their papers available in publicly accessible repositories that follow the FAIR principles]
]

.pull-right[
.huge[**[What UJI recommends...](http://www.uji.es/upo/rest/contenido/630998650/raw?idioma=es)**]

> *research staff should disseminate, in open access, the research data (datasets) associated with their scientific output, provided there are no legitimate reasons of confidentiality, intellectual property and/or security. Research data must be FAIR (findable, accessible, interoperable and reusable) and be accompanied by a standard license that explicitly states the conditions of use and favours scientific reproducibility*
]

---
name: rec14
class: inverse, center, middle

# .blue.bg-white[\#14]

# Use open source software whenever possible

---
class: left

### Open source software

.pull-left[
.huge[Instead of ]

- .large[ArcGIS]
- .large[Google Maps/Places]
- .large[Stata, SPSS, Excel...]
]

--

.pull-right[
.huge[Pick]

- .large[QGIS]
- .large[OpenStreetMap]
- .large[Python, R, ...]
]

--

.huge[Mixed approach to make the .blue[implicit] (analytical workflow) .gray.bg-blue[explicit]]

- .large[[ArcGIS Python Notebooks](https://developers.arcgis.com/python/guide/using-the-jupyter-notebook-environment/)]

???
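
A minimal sketch of what an explicit, scripted workflow can look like in place of a click-based one. Every step that would otherwise be a menu action (load, filter, summarise) is a named function that others can read and rerun. The station names and temperature values are hypothetical, made up only for illustration.

```python
import csv
import io
import statistics

# Hypothetical input; in practice this would be read from a data file.
RAW_CSV = """station,temperature_c
A,21.5
B,19.0
C,
D,23.5
"""


def load_rows(text: str) -> list[dict[str, str]]:
    """Step 1: parse the CSV text into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_missing(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Step 2: discard rows without a temperature reading."""
    return [row for row in rows if row["temperature_c"].strip()]


def mean_temperature(rows: list[dict[str, str]]) -> float:
    """Step 3: summarise the remaining readings."""
    return statistics.mean(float(row["temperature_c"]) for row in rows)


if __name__ == "__main__":
    rows = load_rows(RAW_CSV)
    kept = drop_missing(rows)
    print(f"rows read: {len(rows)}, rows kept: {len(kept)}")
    print(f"mean temperature: {mean_temperature(kept):.2f} C")
```

The script itself is the record of what was done and how, which is the part a click-based session leaves implicit.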
Making the implicit explicit means spelling out the analytical workflow in code instead of performing a click-based analysis

---
name: rec15
class: inverse, center, middle

# .blue.bg-white[\#15]

# Learn/use scripting languages

---
class: left

### Scripting languages

.huge[Play with data, measure & explore, distrust your intuition]

.huge[R/Python scripts describe every step of an analysis]

- .large[Description-based (vs. click-based) analysis: the code documents what is done]

.huge[Others can understand (Remember: [_'Show me', not 'trust me'_](slides11_01.html#stark2018))]

- .large[.gray.bg-blue[What] has been done & .gray.bg-blue[How] it has been done]
- .large[See section *During analysis: best coding practices* .small[<a name=cite-alston2021></a>[[AR21](https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1002/bes2.1801)]]]

---
name: rec16
class: inverse, center, middle

# .blue.bg-white[\#16]

# Learn/use computational notebook formats

---
class:left

### Computational notebook formats

.huge[A computational notebook is a virtual notebook .gray.bg-blue[environment] used for .gray.bg-blue[literate programming], which consists of cells of .gray.bg-blue[documentation], executable .gray.bg-blue[code], and .gray.bg-blue[results] as code output ([Wikipedia](https://en.wikipedia.org/wiki/Notebook_interface))]

- .large[Jupyter, R Markdown, Quarto, MATLAB, ...]

--

.huge[Notebooks as first-class citizens]

- .large[EarthCube annual meetings include a call for notebooks ([CFN 22](https://www.earthcube.org/post/call-for-notebooks-cfn-22)) as peer-reviewed submissions; see [Notebooks 2022 proceedings](https://github.com/earthcube2022)]
- .large[Master's/PhD theses as computational notebooks?]

???
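
A minimal sketch of how documentation, code, and results sit together in one notebook file. It builds a tiny Jupyter-style notebook as plain JSON using only the standard library. The field names follow the Jupyter nbformat version 4 layout as I understand it; the helper names are hypothetical, and a real project would more likely create and validate notebooks with the `nbformat` package or simply in the Jupyter interface.

```python
import json


def markdown_cell(text: str) -> dict:
    """A documentation cell: free text written in Markdown."""
    return {
        "cell_type": "markdown",
        "metadata": {},
        "source": text.splitlines(keepends=True),
    }


def code_cell(source: str, stdout: str = "") -> dict:
    """A code cell, optionally carrying the text it printed when executed."""
    outputs = []
    if stdout:
        outputs.append({
            "output_type": "stream",
            "name": "stdout",
            "text": stdout.splitlines(keepends=True),
        })
    return {
        "cell_type": "code",
        "execution_count": 1 if stdout else None,
        "metadata": {},
        "outputs": outputs,
        "source": source.splitlines(keepends=True),
    }


def build_notebook(cells: list[dict]) -> dict:
    """Wrap the cells in the top-level notebook structure."""
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


if __name__ == "__main__":
    notebook = build_notebook([
        markdown_cell("# Mean of three readings\nWe average the values below."),
        code_cell("print(sum([1, 2, 3]) / 3)", stdout="2.0\n"),
    ])
    # Saving this text with an .ipynb extension gives a file Jupyter can open.
    print(json.dumps(notebook, indent=1))
```

Because results are stored next to the code that produced them, a notebook can show stale output if cells are edited without re-running, which is one reason to re-execute notebooks from top to bottom before sharing.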
Also in my Spatial Data Science class

---
class:left

### Computational notebook formats

.huge[A Large-Scale Study About Quality and Reproducibility of Jupyter Notebooks .small[<a name=cite-pimentel2019></a>[[Pim+19](https://doi.org/10.1109/MSR.2019.00077)]]]

.large[Studied 1.4 million notebooks (GitHub)]

- .large[Only 24% ran without exceptions]
- .large[Only 4% produced the same results]

.large[Provided *best practices for the reproducibility of notebooks*]

---
name: rec17
class: inverse, center, middle

# .blue.bg-white[\#17]

# Preserve computational environment

---
class:left

### Preserve computational environment

.huge[Dependency management packages]

- .large[document/manage specific versions of all packages and dependencies used in a project]
- .large[`renv` for R or `venv` for Python. See [post on `renv` usage](https://www.r-bloggers.com/2023/03/r-renv-how-to-manage-dependencies-in-r-projects-easily/)]
    - author: `renv::activate()` + `renv::snapshot()` + `renv::snapshot()` ...
    - others: `renv::restore()`
- .large[complement them with best coding practices (setting a seed, etc.)]

---
name: rec18
class: inverse, center, middle

# .blue.bg-white[\#18]

# Learn/use containerisation tools

---
class: left

### Containerisation tools

.pull-left[
.large[Beyond dependency management packages...]

.large[[Docker](https://www.docker.com/) and related tools .small[<a name=cite-nust2020-docker></a>[[Nus+20](https://doi.org/10.1371/journal.pcbi.1008316)]]]

.large[[Binder](https://mybinder.org/) deploys a cloud-based Docker container based on a Git repo]

.large[[`repo2docker`](https://github.com/jupyterhub/repo2docker) fetches a Git repository and builds a container image based on the configuration files found in the repository]
]

.pull-right[
<img src="https://journals.plos.org/ploscompbiol/article/figure/image?size=large&id=10.1371/journal.pcbi.1008316.g002" width="90%" style="display: block; margin: auto;" />
]

???
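
A minimal sketch of what recording a computational environment amounts to, written in Python with the standard library only. It lists the interpreter version, the platform, and the version of every installed package. This is a simplified stand-in for what tools such as `renv::snapshot()` or a container image capture, not a replacement for them; the function name and the output layout are hypothetical.

```python
import json
import platform
from importlib import metadata


def snapshot_environment() -> dict:
    """Collect interpreter, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items(), key=lambda item: item[0].lower())),
    }


if __name__ == "__main__":
    # Writing this JSON next to the analysis code documents the environment
    # the results were produced in.
    print(json.dumps(snapshot_environment(), indent=2))
```

Such a record tells others which versions were used; restoring them still needs a dependency manager or a container, which is where the tools on the slides come in.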
Creating a snapshot of the computational environment

---
name: rec19
class: inverse, center, middle

# .blue.bg-white[\#19]

# Be (software) FAIR, my friend

---
class: left

### Be (software) FAIR, my friend

.huge[Document your source code .small[<a name=cite-barker2022></a><a name=cite-hasselbring2020></a>[[Bar+22](http://dx.doi.org/10.1038/s41597-022-01710-x); [Has+20](https://doi.org/10.1109/MC.2020.2998235)]]]

- .large[**Findable**: Deposit source code in a repository that provides DOIs and metadata]
- .large[**Accessible**: Provide the opportunity to download the source code]
- .large[**Interoperable**: Use open source programming languages and software]
- .large[**Reusable**: Release the software under a clear and open usage license]

--

.huge[Cite software properly .small[<a name=cite-smith2016></a>[[SKN16](https://doi.org/10.7717/peerj-cs.86)]] - [#23](slides23_01.html#rec23)]

--

.huge[**[What UJI recommends...](http://www.uji.es/upo/rest/contenido/630998650/raw?idioma=es)**]

.large[nothing yet]

---
name: rec20
class: inverse, center, middle

# .blue.bg-white[\#20]

# Make use of *Make*

---
class: left

### Make use of *Make*

.huge[[GNU Make](https://www.gnu.org/software/make/) is 40+ years old but still relevant today]

- .large[coordinates and automates command-line processes, such as a series of independent scripts]
- .large[provides a single entry point to your analysis]

.huge[Readings]

- .large[[Reproducibility with Make](https://the-turing-way.netlify.app/reproducible-research/make.html)]
- .large[[Snakemake](https://snakemake.readthedocs.io/en/stable/): a Python-based alternative to Make]

---
name: summary
class: inverse, center, middle

# Summary

---

- .large[[Open data != reproducible data](#rec11)]
- .large[[Data (required) for reproducibility](#rec12)]
- .large[[Be (data) FAIR, my friend](#rec13)]
- .large[[Use open source software whenever possible](#rec14)]
- .large[[Learn/use scripting languages](#rec15)]
- .large[[Learn/use computational notebook formats](#rec16)]
- .large[[Preserve computational environment](#rec17)]
- .large[[Learn/use containerisation tools](#rec18)]
- .large[[Be (software) FAIR, my friend](#rec19)]
- .large[[Make use of *Make*](#rec20)]

---

# References

.tiny[
<a name=bib-smith2016></a>[Smith, AM, DS Katz, et al.](#cite-smith2016) (2016). "Software citation principles". In: _PeerJ Computer Science_ 2, p. e86. URL: [https://doi.org/10.7717/peerj-cs.86](https://doi.org/10.7717/peerj-cs.86).

<a name=bib-wilkinson2016></a>[Wilkinson, Mark D., Michel Dumontier, et al.](#cite-wilkinson2016) (2016). "The FAIR Guiding Principles for scientific data management and stewardship". In: _Scientific Data_ 3.1. URL: [http://dx.doi.org/10.1038/sdata.2016.18](http://dx.doi.org/10.1038/sdata.2016.18).

<a name=bib-pimentel2019></a>[Pimentel, João Felipe, Leonardo Murta, et al.](#cite-pimentel2019) (2019). "A Large-Scale Study About Quality and Reproducibility of Jupyter Notebooks". In: _2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)_, pp. 507-517. URL: [https://doi.org/10.1109/MSR.2019.00077](https://doi.org/10.1109/MSR.2019.00077).

<a name=bib-hasselbring2020></a>[Hasselbring, Wilhelm, Leslie Carr, et al.](#cite-hasselbring2020) (2020). "Open Source Research Software". In: _Computer_ 53.8, pp. 84-88. URL: [https://doi.org/10.1109/MC.2020.2998235](https://doi.org/10.1109/MC.2020.2998235).

<a name=bib-nust2020-docker></a>[Nüst, Daniel, Vanessa Sochat, et al.](#cite-nust2020-docker) (2020). "Ten simple rules for writing Dockerfiles for reproducible data science". In: _PLOS Computational Biology_ 16.11, pp. 1-24. URL: [https://doi.org/10.1371/journal.pcbi.1008316](https://doi.org/10.1371/journal.pcbi.1008316).

<a name=bib-alston2021></a>[Alston, Jesse M. and Jessica A. Rick](#cite-alston2021) (2021). "A Beginner's Guide to Conducting Reproducible Research". In: _The Bulletin of the Ecological Society of America_ 102.2, p. e01801. eprint: https://esajournals.onlinelibrary.wiley.com/doi/pdf/10.1002/bes2.1801.
URL: [https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1002/bes2.1801](https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1002/bes2.1801).

<a name=bib-chiarelli2021></a>[Chiarelli, Andrea, Lucia Loffreda, et al.](#cite-chiarelli2021) (2021). _The Art of Publishing Reproducible Research Outputs: Supporting emerging practices through cultural and technological innovation_. URL: [https://doi.org/10.5281/zenodo.5521077](https://doi.org/10.5281/zenodo.5521077).

<a name=bib-leipzig2021></a>[Leipzig, Jeremy, Daniel Nüst, et al.](#cite-leipzig2021) (2021). "The role of metadata in reproducible computational research". In: _Patterns_ 2.9, p. 100322. ISSN: 2666-3899. URL: [https://www.sciencedirect.com/science/article/pii/S2666389921001707](https://www.sciencedirect.com/science/article/pii/S2666389921001707).

<a name=bib-barker2022></a>[Barker, Michelle, Neil P. Chue Hong, et al.](#cite-barker2022) (2022). "Introducing the FAIR Principles for research software". In: _Scientific Data_ 9.1. URL: [http://dx.doi.org/10.1038/s41597-022-01710-x](http://dx.doi.org/10.1038/s41597-022-01710-x).

<a name=bib-wood-charlson2022></a>[Wood-Charlson, Elisha M., Zachary Crockett, et al.](#cite-wood-charlson2022) (2022). "Ten simple rules for getting and giving credit for data". In: _PLOS Computational Biology_ 18.9, pp. 1-11. URL: [https://doi.org/10.1371/journal.pcbi.1010476](https://doi.org/10.1371/journal.pcbi.1010476).

<a name=bib-noy2023></a>[Noy, Natasha and Carole Goble](#cite-noy2023) (2023). "Are We Cobblers without Shoes?". In: _Communications of the ACM_ 66.1, pp. 36-38. ISSN: 0001-0782. URL: [https://dl.acm.org/doi/10.1145/3528574](https://dl.acm.org/doi/10.1145/3528574).
]