
MOOC: Reproducible Research II: Practices and tools for managing computations and data

In this MOOC, we will show you how to improve your practices for managing and processing large amounts of data and running complex computations, all while controlling your software environment.

Enrollment: Apr 02, 2024 to Sep 04, 2024

Course: May 16, 2024 to Sep 12, 2024

More Info and Registration

General information

  • Course dates: May 16, 2024 to Sep 12, 2024
  • Registration period: Apr 02, 2024 to Sep 04, 2024
  • Location: Online
  • Effort: 35 hours

Description

We will show you how to improve your practices for managing large datasets and complex computations in controlled software environments:

  • you will learn how to use formats such as JSON, FITS, and HDF5; platforms such as Zenodo and Software Heritage; and tools such as git-annex, Docker, Singularity, Guix, Make, and Snakemake;
  • we will show you how to integrate them into a real-life use case: a sunspot detection study. You will see for yourself that our methods and tools allow you to work in a reliable and reproducible way.
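Of the formats listed above, JSON is the simplest to experiment with before the course starts, since Python supports it out of the box. A minimal round-trip sketch (the file name and record fields are illustrative, not taken from the course material):

```python
import json

# A small record of the kind a sunspot-detection analysis might produce
# (illustrative fields only).
record = {"image": "sun_2024-05-16.fits", "sunspots": 3, "threshold": 0.8}

with open("result.json", "w") as f:
    json.dump(record, f, indent=2)   # write human-readable JSON

with open("result.json") as f:
    loaded = json.load(f)            # read it back

assert loaded == record              # the round trip preserves the data
print(loaded["sunspots"])            # prints 3
```

Binary formats such as FITS and HDF5, covered in the course, serve the same role for large numerical arrays where text-based JSON becomes impractical.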

The strength of this new MOOC lies in a general and systematic presentation of the major concepts and of how they translate into practical solutions through numerous hands-on sessions with state-of-the-art open-source tools.

Learning outcomes

At the end of this course, you will be able to:

Manage research data:

  • understand the challenges posed by large volumes of data
  • deposit data in well-known archives such as Software Heritage and Zenodo
  • integrate data into version control (git-annex)
  • use structured binary data formats (FITS, HDF5)

Use tools and techniques for controlling the software environment:

  • understand how software packages are built and managed
  • deploy software environments as containers (e.g., Docker)
  • manage software environments using a functional package manager (e.g., Guix)
  • work in controlled software environments on a daily basis

Automate long or complex computations using workflows:

  • understand the challenges of scaling up: long-running and distributed computations
  • choose a workflow tool adapted to your needs
  • automate a data analysis using Make and Snakemake
  • control the software environments of a workflow
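The key idea behind tools like Make is simple: rerun a step only when its output is missing or older than its inputs. The sketch below illustrates that staleness rule in plain Python (the function names and example files are our own, not part of the course):

```python
from pathlib import Path

def needs_rebuild(target: Path, sources: list) -> bool:
    """True if target is missing or older than any source --
    the staleness rule Make applies to every recipe."""
    if not target.exists():
        return True
    t = target.stat().st_mtime
    return any(s.stat().st_mtime > t for s in sources)

def build(target: Path, sources: list, recipe) -> bool:
    """Run recipe only when the target is out of date; report whether it ran."""
    if needs_rebuild(target, sources):
        recipe(target, sources)
        return True
    return False

# Example: concatenate two input files into one output, Make-style.
a, b, out = Path("a.txt"), Path("b.txt"), Path("out.txt")
a.write_text("sunspots\n")
b.write_text("detected\n")

concat = lambda t, ss: t.write_text("".join(s.read_text() for s in ss))
ran_first = build(out, [a, b], concat)   # target missing: recipe runs
ran_second = build(out, [a, b], concat)  # target up to date: recipe skipped
```

Make and Snakemake add to this rule a declarative syntax for chaining many such steps into a full analysis pipeline.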

Prerequisites

This course is for everyone who relies on a computer to perform data analysis. You should have some experience with running commands in a terminal, and have a basic knowledge of git and scientific Python.
