Before analysis

Basics to get started

Carlos Granell

GEOTEC, Universitat Jaume I

Apr 25, 2024

Focus is the art of knowing what to ignore.

The fastest way to raise your level of performance: Cut your number of commitments in half.

[#1] ONE project = ONE folder

What’s a project?

  • an experiment in a PhD project

  • a master’s thesis project

  • ideas for future research

  • regular meeting notes/minutes

  • teaching materials

  • a review paper

  • a conference presentation

  • workshop/seminar materials

  • a book

  • a PhD thesis manuscript

[#2] Choose your best way to organise a project/folder

Project/folder structure

Make sure it is consistent, informative, and works for you, then stick to it

  • README.md

  • LICENSE

  • CODE_OF_CONDUCT

  • CONTRIBUTING

  • data: data, data-raw

  • code: scripts, analysis

  • results: reports, figs

  • documentation: notes, docs
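A layout like the one above can be created once and reused for every new project. The following is a minimal sketch, assuming the folder and file names listed on this slide; the function name `scaffold_project` and the exact set of folders are illustrative choices, not a fixed standard.

```python
from pathlib import Path

# Assumed layout, taken from the folder names listed above; adjust to taste.
PROJECT_DIRS = ["data-raw", "data", "scripts", "analysis",
                "reports", "figs", "notes", "docs"]
PROJECT_FILES = ["README.md", "LICENSE", "CODE_OF_CONDUCT", "CONTRIBUTING"]


def scaffold_project(root):
    """Create the project skeleton under `root` without overwriting existing files."""
    root = Path(root)
    for name in PROJECT_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in PROJECT_FILES:
        path = root / name
        if not path.exists():  # never clobber a file that already has content
            path.touch()
    return root
```

Calling `scaffold_project("my-project")` a second time is harmless: existing folders and files are left untouched.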

[#3] Choose a consistent file naming convention

File naming convention

  • File names are machine-readable, human-readable, and play well with default ordering

  • Script file names begin with numbers/letters to indicate the sequence in the analysis: 01_download_data.R

  • Data file names begin with dates (YYYYMMDD) as prefix: 20200115_survey.csv
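A naming convention is easier to keep when a small helper builds and checks the names. The sketch below assumes a two-digit step prefix for scripts and a YYYYMMDD prefix for data files, as in the two examples above; the regular expressions and function names are hypothetical and should be adapted to your own convention.

```python
import re
from datetime import date, datetime

# Assumed patterns: "NN_" step prefix for scripts, "YYYYMMDD_" prefix for data.
SCRIPT_NAME = re.compile(r"^\d{2}_[a-z0-9]+(?:[_-][a-z0-9]+)*\.[A-Za-z0-9]+$")
DATA_NAME = re.compile(r"^\d{8}_[a-z0-9]+(?:[_-][a-z0-9]+)*\.[a-z0-9]+$")


def data_file_name(day, label, ext="csv"):
    """Build a data file name from a date and a free-text label."""
    slug = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
    return f"{day:%Y%m%d}_{slug}.{ext}"


def is_valid_script_name(name):
    """Check the numbered-prefix pattern, e.g. 01_download_data.R."""
    return SCRIPT_NAME.match(name) is not None


def is_valid_data_name(name):
    """Check the pattern and that the prefix is a real calendar date."""
    if DATA_NAME.match(name) is None:
        return False
    try:
        datetime.strptime(name[:8], "%Y%m%d")
    except ValueError:
        return False
    return True
```

Because the prefixes are zero-padded, a plain alphabetical sort of the file names matches the order of the analysis steps and the chronology of the data.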

[#4] Never ever touch raw data

Remember Newton’s letter to Flamsteed

Raw data

  • Store raw data permanently (data-raw folder)

  • Use scripts to process/clean raw datasets

  • Store processed data in a separate folder (data or data-clean folder)

  • Document the process (simple steps, diagrams, content/structure of datasets, provenance) in a plain text README file (See Recommendation #6)
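The workflow above can be sketched as a script that only ever reads from the raw folder and writes to a separate one. This is a minimal illustration, assuming CSV input; the cleaning steps (trimming whitespace, dropping blank rows) and the names `clean_csv` and `protect_raw` are hypothetical placeholders for your own processing.

```python
import csv
import os
import stat
from pathlib import Path


def protect_raw(raw_path):
    """Mark a raw file read-only so that accidental edits fail."""
    os.chmod(raw_path, stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)


def clean_csv(raw_path, clean_dir):
    """Read a raw CSV, trim cells, drop blank rows, and write the result to clean_dir.

    The raw file is opened for reading only and is never modified.
    """
    raw_path = Path(raw_path)
    clean_dir = Path(clean_dir)
    clean_dir.mkdir(parents=True, exist_ok=True)
    out_path = clean_dir / raw_path.name

    with raw_path.open(newline="", encoding="utf-8") as src, \
            out_path.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            cells = [cell.strip() for cell in row]
            if any(cells):  # skip rows that are entirely empty
                writer.writerow(cells)
    return out_path
```

Keeping the cleaning logic in a script means the processed data can always be regenerated from the untouched raw files.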

[#5] Use open data formats

Open formats

  • Use open, text-based formats whenever possible

  • Independent of specific software tools or vendors

  • Alternatively, provide the data in an open format alongside the proprietary one

    • Microsoft Excel (.xls) + Comma-separated values (.csv)

    • ESRI Shapefile (.shp) + GeoPackage (.gpkg)
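One way to keep this rule in check is to scan a project for proprietary files that lack an open-format copy. The sketch below assumes the open copy sits next to the original with the same base name; the extension mapping follows the two pairs above, with `.xlsx` added as an assumption, and the function name is hypothetical.

```python
from pathlib import Path

# Assumed pairing of proprietary extensions with an open counterpart.
OPEN_COUNTERPART = {".xls": ".csv", ".xlsx": ".csv", ".shp": ".gpkg"}


def missing_open_copies(folder):
    """List proprietary files that have no same-named open-format sibling."""
    missing = []
    for path in sorted(Path(folder).rglob("*")):
        open_ext = OPEN_COUNTERPART.get(path.suffix.lower())
        if open_ext and not path.with_suffix(open_ext).exists():
            missing.append(path)
    return missing
```

The function only reports gaps; producing the open copy itself is left to the tool that created the original file.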

Example

Dutch national centre of expertise and repository for research data (DANS) - Preferred vs non-preferred formats

[#6] Document, document, and document

README file(s)

  • Include a README file in the root folder to describe the project, basic orientation to use your code, data, etc.

Tips

Suggestions for writing a good README and GitHub’s README

If your project is on GitHub, README files written in Markdown are rendered automatically

README file(s)

  • Include (if required) README files in each subfolder to describe metadata/complex content

  • Keep track of ideas, discussions and decisions about the project (in the notes folder)

  • Plain text files can be easily version controlled (See Recommendations #9 and #10)

[#7] Add a (data) license

Concept

A license is a contract between the authors and users (Jolly, Fletcher, and Bourne 2012)

Without a license, copyright is automatically attached to your work

If you plan to make your work (data/databases/documents) public, always specify a license via a LICENSE file (LICENSE.md or LICENSE.txt)

Data licenses: Creative Commons

CC license elements: BY, SA, NC, ND

  • BY: Creators/authors must be credited

  • SA: Derivatives or redistributions must carry an identical license

  • NC: Only non-commercial use is allowed

  • ND: No derivatives are allowed

Data licenses: Creative Commons

Ex: Article (or data/datasets) to be published

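| As user/viewer, can you...                | CC BY 4.0 | CC BY-NC-ND                                      |
|-------------------------------------------|-----------|--------------------------------------------------|
| Read, print and download it?              | YES       | YES                                              |
| Redistribute or republish it?             | YES       | YES                                              |
| Translate it?                             | YES       | YES (private use only and not for distribution)  |
| Download it for text and data mining?     | YES       | YES                                              |
| Reuse portions in other works?            | YES       | YES                                              |
| Sell or reuse it for commercial purposes? | YES       | NO                                               |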

Data licenses: CC Zero

Creators/researchers/educators put their works into the global public domain for the benefit of society

Data licenses: Open Data Commons

ODC Public Domain Dedication and License (PDDL): Public Domain for data/databases (≅CC0)

ODC-By: Attribution for data/databases (≅CC-BY)

ODC Open Database License (ODbL): Attribution Share-Alike for data/databases (≅CC-BY-SA)

Data licenses: What UJI recommends

  • Final projects (TFG, TFM): CC BY-SA

  • Doctoral theses: CC BY-SA or CC BY-NC-SA

  • Teaching materials: CC BY-NC-SA

[#8] Add a (software) license

Software licenses

Always add a license to the software you plan to make public

Permissive = attribution (recommended for academic work)

Copyleft = share-alike (derivative works must keep the same license as the original)

Software licenses: What UJI recommends

See Recommendation 19

[#9] Learn/use version control systems

Turn your local project folder into a version control repository
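Turning a folder into a repository takes a single `git init` command. The sketch below wraps that step in Python so it can be combined with other project setup code; the ignore rules and the function name `init_repository` are hypothetical, and the Git step is skipped when Git is not installed.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical ignore rules; tailor them to your tools and data policy.
DEFAULT_IGNORES = (".DS_Store", ".Rhistory", "__pycache__/", "*.tmp")


def init_repository(project_dir, ignores=DEFAULT_IGNORES):
    """Write a .gitignore and, if Git is installed, run `git init` in project_dir.

    Returns True when a repository was initialised, False when Git is missing.
    """
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    gitignore = project_dir / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text("\n".join(ignores) + "\n", encoding="utf-8")
    if shutil.which("git") is None:
        return False  # Git not available; only the .gitignore was written
    subprocess.run(["git", "init"], cwd=project_dir,
                   check=True, capture_output=True)
    return True
```

After initialisation, `git add` and `git commit` record the first snapshot of the project.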


[#10] Learn/use online (Git) repository hosting services

Hosting services make it easier for individuals and teams to use Git for version control and collaboration

References

Bryan, Jennifer. 2018. “Excuse Me, Do You Have a Moment to Talk about Version Control?” The American Statistician 72 (1): 20–27. https://doi.org/10.1080/00031305.2017.1399928.
Bryan, Jenny, and Jim Hester. 2020. Happy Git and GitHub for the useR. https://happygitwithr.com/.
Jolly, M, AC Fletcher, and PE Bourne. 2012. “Ten Simple Rules to Protect Your Intellectual Property.” PLoS Computational Biology 8: e1002766. https://doi.org/10.1371/journal.pcbi.1002766.
Morin, A, J Urban, and P Sliz. 2012. “A Quick Guide to Software Licensing for the Scientist-Programmer.” PLoS Computational Biology 8 (7): e1002598. https://doi.org/10.1371/journal.pcbi.1002598.
Perez-Riverol, Y, L Gatto, R Wang, T Sachsenberg, J Uszkoreit, F da Veiga Leprevost, C Fufezan, et al. 2016. “Ten Simple Rules for Taking Advantage of Git and GitHub.” PLoS Computational Biology 12 (7). https://doi.org/10.1371/journal.pcbi.1004947.
The Turing Way Community. 2022. “The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research.” Zenodo. https://doi.org/10.5281/zenodo.3233853.