This repository provides an example software project designed as a companion for the "Building better research software" lesson. The software project is packaged in the spacewalks.zip archive and serves as the starter bundle for learners to download at the beginning of the course.
The software project contains a Python script that reads NASA data on human space walks (extravehicular activities, or EVAs) exported in JSON format, analyses the data, plots a few graphs and saves the data in CSV format. The example project is intentionally constructed to illustrate some common mistakes in research software development.
Throughout the lesson, course attendees learn and apply better research software practices — including elements of FAIR — as they work to improve the software project.
Different branches of this repository reflect the state of the software project at the start of each episode in the "Building better research software" lesson:
- Episode "1. Course introduction" makes no changes to the software.
- Episode "2. Better start with a software project" starts from the spacewalks.zip archive.
- Branch 03-reproducible-dev-environment matches the code at the start of episode "3. Reproducible software environments"
- Branch 04-code-readability matches the code at the start of episode "4. Code readability"
- Branch 05-code-structure matches the code at the start of episode "5. Code structure"
- Branch 06-code-correctness matches the code at the start of episode "6. Code correctness"
- Branch 07-software-documentation matches the code at the start of episode "7. Software documentation"
- Branch 08-open-collaboration matches the code at the start of episode "8. Open collaboration"
Finally, the improved version of the software project, as it stands at the end of the lesson, can be found in the "final" branch.
A number of things can be improved in the starter software project. The practices we cover for building better research software fall into the following three areas:
- Steps you can take in your own computing environment to improve the software
  - The project is missing a requirements.txt (or equivalent) file to document dependencies
  - Instructions for running the code are buried within the code rather than provided explicitly
  - Version control is not used; instead, versioning is handled by embedding versions in filenames (e.g. my code_v2.py) or directories (e.g. astronaut-data-analyses-old)
  - The project's folder structure could be improved by grouping data files, analysis results, plots and tests into clearly named directories
- Steps you can take to improve the source code and organisation of the project
  - Avoid non-descriptive file names like data.json or my_code_v2.py; use names that reflect the content or purpose of the file
  - Do not use blank spaces or special characters in file names, as they can cause errors or be misinterpreted when running scripts from the command line
  - Group all import statements at the top of my_code_v2.py for clarity and readability, rather than scattering them throughout the code
  - Avoid hard-coding assumptions, such as fixing the loop count to 374 data entries in my_code_v2.py; this makes the code fragile and non-reusable with other data files
  - Use descriptive variable names instead of single letters like w to make the code easier to understand and maintain
  - Relying on commenting and uncommenting code to control execution or select analyses is impractical and error-prone; consider using functions or configuration files instead
  - Ensure scripts can be run multiple times without failure; for example, my_code_v2.py fails on a second run because it tries to save a result to an existing file that cannot be overwritten, and the code does not check for this
  - Add comments, documentation, and explanations in the code to make it easier for others (and your future self) to follow
  - Refactor the code to use functions rather than keeping everything in one large, monolithic block
  - Remove unused variables like fieldnames, which was intended for saving data to a CSV file but is not used and clutters the code, making it less readable and understandable
  - Avoid spaces in column names within data files, as they can cause errors when reading or processing data
  - When reading data, specify the encoding explicitly rather than guessing it, to prevent errors (it should match the encoding that was used when saving the original data)
  - Add tests to verify that the code and its outputs are correct; without tests, subtle bugs can easily go unnoticed
- Steps you can take to make the software easier for others to use
  - Missing a README (or similar documentation) to explain the purpose of the project, its dependencies, and how to use it on a high level
  - There are no installation or usage instructions to guide users in more depth
  - Missing a LICENSE file to specify how the code can be reused; without it, reuse is not permitted by default
  - The code should be shared through collaborative software development platforms rather than distributing it via email or messaging apps
This list is not exhaustive; there may be other things that can be improved.
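As a rough illustration of several of the source code points above (an explicit encoding, no hard-coded entry count, descriptive names, functions instead of one monolithic block, and output that can be rewritten on every run), a minimal Python sketch might look like the following. The function names and the assumption that the JSON file holds a list of flat records are illustrative only; they are not taken from the starter project or from the final version of the code.

```python
import csv
import json


def read_json_records(input_path, encoding="utf-8"):
    """Read a JSON file that is assumed to hold a list of flat records."""
    # The encoding is stated explicitly rather than left to the platform default.
    with open(input_path, "r", encoding=encoding) as input_file:
        return json.load(input_file)


def write_records_to_csv(records, output_path, encoding="utf-8"):
    """Write every record to a CSV file, replacing any earlier output."""
    # Column names are collected from the data itself, in first-seen order,
    # instead of being typed in by hand.
    column_names = []
    for record in records:
        for key in record:
            if key not in column_names:
                column_names.append(key)

    # Mode "w" truncates an existing file, so a second run does not fail.
    with open(output_path, "w", encoding=encoding, newline="") as output_file:
        writer = csv.DictWriter(output_file, fieldnames=column_names)
        writer.writeheader()
        # Loop over however many records there are; no fixed count.
        for record in records:
            writer.writerow(record)
    return len(records)
```

With these two hypothetical helpers, converting a file would come down to passing the result of read_json_records to write_records_to_csv, and the same calls can be repeated without removing the previous output first.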
At the end of the lesson, we finish with an improved version of the software project that people should strive to achieve when writing their research software.
The authors of this fabricated example software project are:
Copyright of the code in this project remains with the authors or their organisations, and the code is licensed under the MIT licence. For the data used in this project, see the data section.
The data used in this project is open data provided by NASA and was obtained as follows.
Data source: https://data.nasa.gov/Raw-Data/Extra-vehicular-Activity-EVA-US-and-Russia/9kcy-zwvn/about_data.
Either export the data from the above page using the Export button, or download it in JSON format from the command line:
curl https://data.nasa.gov/resource/eva.json --output eva-data.json
Note: the original data has been modified for the purposes of this tutorial by inserting a semicolon separator after each name in the crew field.
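Because each name in the crew field is followed by a semicolon, the field can be split into individual names. The sketch below shows one possible way to do this in Python; the function name split_crew and the sample value are hypothetical and are only meant to illustrate the idea.

```python
def split_crew(crew_field):
    """Split a semicolon-terminated crew string into a list of names."""
    # Splitting on ";" leaves an empty string after the final separator,
    # so whitespace is stripped and empty pieces are dropped.
    names = [name.strip() for name in crew_field.split(";")]
    return [name for name in names if name]


# Hypothetical sample value, not copied from the dataset.
example_crew = "Astronaut One;Astronaut Two;"
print(split_crew(example_crew))
```

Dropping the empty pieces means the trailing semicolon does not produce a spurious blank name.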
The idea for this software has been borrowed from the "Astronaut analysis" workshop material by Helmholtz Federated IT Services (HIFIS).
This work has been supported by the UK's Software Sustainability Institute via the EPSRC, BBSRC, ESRC, NERC, AHRC, STFC and MRC grant EP/S021779/1 and UK Reproducibility Network (UKRN).