Software engineering for data science

  • Challenges
    • Workflows / pipelines consist of many different tools
    • Dozens of individual methods
    • Software dependencies
    • Complex dependency trees and configuration requirements
    • HPC resource management
    • Scalability and performance optimization
    • Reproducing results with older data / integrating with newer data

Challenges / Solution


When faced with a challenge,
look for a way, not for a way out.
— David L. Weatherford





A custom DSL (domain-specific language) for building reproducible computational workflows and pipelines

Nextflow - Features

Nextflow is built around the concept that
Linux is the lingua franca of data science


it allows:

  • Fast prototyping
    • write a computational pipeline by simply putting together many different tasks
    • reuse your existing scripts and tools
    • extend with a powerful DSL
    • “no” need to learn a new language or API
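
A minimal sketch of what reusing an existing command-line tool looks like as a Nextflow process (the tool, file names and glob pattern here are illustrative, not from the slides):

```nextflow
// wrap an existing tool in a process: the script block is
// the same command you would run on the shell
process FASTQC {
    input:
    path reads

    output:
    path "*_fastqc.zip"

    script:
    """
    fastqc ${reads}
    """
}
```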

Nextflow - Features

  • Reproducibility and isolation of dependencies
    • supports Docker and Singularity container technologies
    • supports Conda environments: easy and automatic software installation
    • integrates with the GitHub code-sharing platform

process {
  conda 'bwa samtools multiqc'
  ...
}
process {
  container 'quay.io/biocontainers/bedtools:2.30.0--h7d7f7ad_2'
  ...
}
docker.enabled = true
$ nextflow run nextflow-io/hello -r v1.1

Nextflow - Features

it is portable:

  • abstraction layer between the pipeline's logic and the execution layer
  • can be executed on multiple platforms without changing the code
  • SGE, LSF, SLURM, PBS and HTCondor batch schedulers
  • Kubernetes, Amazon AWS and Google Cloud platforms
process {
  executor = 'sge'
  queue = 'long.q'
  cpus = 2
}

Nextflow - Features

it has unified parallelism:

  • based on the dataflow programming model
  • greatly simplifies writing complex distributed pipelines
  • parallelisation is implicitly defined by the processes' input and output declarations
  • the resulting applications are inherently parallel and can scale up or scale out
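
As a sketch of this implicit parallelism (channel pattern and process name are assumptions, with a FASTQC process assumed to be defined elsewhere): one task is spawned per value emitted by the input channel, with no explicit parallel code.

```nextflow
workflow {
    // one FASTQC task per matching file; Nextflow schedules
    // the tasks in parallel automatically
    reads = Channel.fromPath('data/*.fastq')
    FASTQC(reads)
}
```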

it comes with continuous checkpoints:

  • all the intermediate results are automatically tracked
  • can resume from the last successfully executed step
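
Resuming uses the -resume flag on the command line (main.nf is a placeholder script name):

$ nextflow run main.nf -resume

Only the steps whose inputs changed are re-executed; cached results are reused for the rest.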

Nextflow - Basics

Processes

  • A Nextflow pipeline is made by joining together different processes
  • A process can be written in any scripting language (Bash, Perl, Ruby, Python, etc.)
  • Processes are executed independently and are isolated from each other, i.e. they do not share a common (writable) state
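
A minimal sketch of the language choice (process names are illustrative): the script block is Bash by default, and a shebang line switches it to any other interpreter.

```nextflow
process sayHelloBash {
    output:
    stdout

    script:
    """
    echo 'Hello from Bash'
    """
}

process sayHelloPython {
    output:
    stdout

    script:
    """
    #!/usr/bin/env python
    print('Hello from Python')
    """
}
```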

Nextflow - Basics

Channels

  • Processes communicate via asynchronous FIFO queues, called channels in Nextflow.
  • Any process can define one or more channels as input and output.
  • The interaction between these processes, and ultimately the pipeline execution flow itself, is implicitly defined by these input and output declarations.
  • Parallelisation is implicitly defined by the processes input and output declarations
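
The classic "hello" example from the Nextflow documentation shows this wiring: the output channel of splitLetters feeds the input channel of convertToUpper, which defines both the execution order and the parallelism (one convertToUpper task per chunk).

```nextflow
process splitLetters {
    output:
    path 'chunk_*'

    script:
    """
    printf 'Hello world!' | split -b 6 - chunk_
    """
}

process convertToUpper {
    input:
    path chunk

    output:
    stdout

    script:
    """
    cat ${chunk} | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    splitLetters | flatten | convertToUpper | view
}
```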

Nextflow - Basics

Execution abstraction

  • A process defines what command or script has to be executed

  • The executor determines how it is run on the target system

  • By default processes are executed on the local computer (good for testing)

  • real-world computational pipelines run on HPC or cloud platforms

  • Nextflow provides an abstraction between the pipeline and the underlying execution system.

Nextflow - Basics

Execution abstraction

  • Write a pipeline once and seamlessly run it on your computer, a grid platform, or the cloud, without modifying it, by simply defining the target execution platform in the configuration file.

Nextflow supports:

  • SGE, LSF, SLURM, PBS, Torque, HTCondor
  • Amazon Web Services (AWS), Google Cloud Platform (GCP), Kubernetes
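
One common way to select the target platform from the configuration file is profiles (the profile and queue names here are illustrative):

```nextflow
// nextflow.config
profiles {
    standard {
        process.executor = 'local'
    }
    cluster {
        process.executor = 'slurm'
        process.queue    = 'long'
    }
}
```

The same pipeline then runs unchanged with nextflow run main.nf -profile cluster.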

Nextflow - Basics

Scripting language

  • Nextflow is designed to have a minimal learning curve
  • In most cases, standard scripting skills are enough to develop Nextflow workflows.
  • It also provides a powerful scripting DSL.
  • Nextflow scripting is an extension of Groovy, which is a super-set of Java
x = Math.random()
if( x < 0.5 ) {
    println "You lost."
}
else {
    println "You won!"
}
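
A few more Groovy constructs that commonly appear in Nextflow scripts (the variable names are illustrative):

```groovy
// lists, maps and closures work as in plain Groovy
samples = ['alpha', 'beta', 'gamma']
samples.each { name -> println "Processing ${name}" }

config = [genome: 'hg38', threads: 4]
println config.genome   // prints: hg38
```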

Nextflow - Basics

Configuration options

  • Pipeline configuration is defined in nextflow.config

e.g. executors, containers/envs, environment variables, pipeline parameters, etc.

params {
  resultDir = "/data/results"
  genome = "/data/hg38/genome.fasta"
}
process {
  withName:BWA {
    cpus = 8
    container = "quay.io/biocontainers/bwa:0.7.17--h5bf99c6_8"
  }
}
singularity {
  enabled = true
}
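
Values in the params block are defaults; they can be overridden on the command line with a double-dash option (the script name and value shown are illustrative):

$ nextflow run main.nf --genome /data/hg19/genome.fasta

Inside the pipeline the value is accessed as params.genome.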

Nextflow - Example


Nextflow - Links