This is a guide to using the bioflows package to run standard pipelines for analysing NGS datasets. Currently, we have implemented some standard workflows, along with tutorials that use this package. To start using bioflows, use the quick start to go to any of the implemented tutorials.
A primary objective of the Computational Biology Core at Brown's Centre for Computational Biology of Human Disease is to enable reproducibility in the computational analysis of NGS data. Critical to this objective is providing a simple tool for creating and running bioinformatics workflows, as well as consistent software environments across multiple platforms. To this end, we:
- Developed bioflows, a workflow tool that ensures consistency in analysis steps and stages, with interoperability across multiple job submission systems
- Use the Conda package management system to manage software tools
- Use a container-based approach with Docker for cross-platform interoperability of the analysis environment
bioflows is a user-friendly Python implementation of a workflow manager. The user is not expected to have any programming knowledge and only needs to provide a control file in YAML format, chosen for its human readability. The goal is to provide users with a simple and straightforward interface for processing NGS datasets with many samples using standard bioinformatics pipelines, e.g. RNA-seq, GATK variant calling, etc. The tool was developed to alleviate some of the primary issues with scaling up pipelines, such as file naming and the management of data, output, and logs.
How it works
bioflows uses two main Python packages: 1) luigi, developed at Spotify, for managing dependencies among tasks, and 2) the SAGA Python API for launching jobs across different types of systems. All the necessary tools are provided from the CBC's Anaconda channel.
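To illustrate what managing dependencies among tasks means in practice, here is a minimal sketch using only the Python standard library. It does not use luigi's API and is not how bioflows is implemented; the task names and the `run_pipeline` helper are hypothetical. It only shows the general idea: tasks run in an order that respects their dependencies, and tasks that are already complete are skipped.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
# These names are illustrative and are not bioflows identifiers.
PIPELINE = {
    "fastqc": set(),
    "align": {"fastqc"},
    "sort_bam": {"align"},
    "count_reads": {"sort_bam"},
}


def run_pipeline(dependencies, already_done=()):
    """Return the tasks that were run, in an order that respects dependencies.

    Tasks listed in `already_done` are skipped, mimicking how a workflow
    manager avoids re-running steps that have already completed.
    """
    done = set(already_done)
    executed = []
    for task in TopologicalSorter(dependencies).static_order():
        if task in done:
            continue
        # A real workflow manager would launch the job here.
        executed.append(task)
        done.add(task)
    return executed


print(run_pipeline(PIPELINE, already_done=["fastqc", "align"]))
```

Resuming from the remaining tasks rather than restarting the whole pipeline is what makes this kind of dependency tracking useful when processing many samples.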
Currently bioflows provides the following features:
Simple management of
A module to easily download data from NCBI's SRA and, optionally, continue processing the data directly through the pipelines. A few key elements of this module are:
- Only download and convert the data to usable fastqs
- Data from multiple runs are concatenated automatically
- Metadata associated with the SRA data is also provided as a separate table
Conda packages for all dependencies are pre-built and provided along with the software
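The automatic concatenation of data from multiple SRA runs, listed among the features above, can be pictured with a short sketch. This is not the bioflows implementation; the function name `concatenate_runs`, the `(sample, path)` input format, and the output file naming are all assumptions made for illustration.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def concatenate_runs(run_files, out_dir):
    """Concatenate per-run FASTQ files into one file per sample.

    `run_files` is an iterable of (sample_name, path_to_fastq) pairs.
    Returns a dict mapping each sample name to its merged file path.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the run files by the sample they belong to.
    by_sample = defaultdict(list)
    for sample, path in run_files:
        by_sample[sample].append(Path(path))

    merged = {}
    for sample, paths in by_sample.items():
        target = out_dir / f"{sample}.fastq"
        with open(target, "wb") as out:
            for path in sorted(paths):  # sorted for a reproducible order
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        merged[sample] = target
    return merged
```

Copying bytes rather than parsing records keeps the sketch simple; it assumes every input file is a complete, well-formed FASTQ file that ends with a newline.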
Conda Package Management
Conda is a system-agnostic software package management system, based on the Anaconda Python distribution, that ensures a piece of software and all its dependencies are bundled together. Conda packages can be downloaded from various publicly available repositories called channels; one such channel for bioinformatics tools is bioconda. To ensure reproducibility, we have established a publicly accessible channel, compbiocore, for all programs that are included with wrappers in the bioflows tool. In this channel, we also provide conda packages of all software used, including the bioflows package itself. To download specific packages or the bioflows tool, run the following command in your conda environment:
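The exact command is not reproduced in this text. As a hedged sketch, assuming the standard `conda install -c <channel> <package>` form and taking the channel name `compbiocore` and package name `bioflows` from the description above, the command could be assembled like this (the helper `conda_install_command` is hypothetical):

```python
import shlex


def conda_install_command(package="bioflows", channel="compbiocore"):
    """Build the argument list for installing a package from a conda channel."""
    # Assumption: the package is published under this name in this channel.
    return ["conda", "install", "-c", channel, package]


# Print the command as it would be typed in a shell.
print(shlex.join(conda_install_command()))
# prints: conda install -c compbiocore bioflows
```

To run the command from Python rather than only print it, the returned list can be passed to `subprocess.run(..., check=True)` inside an activated conda environment.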
bioflows and its dependencies are available as conda packages for Linux only. To use bioflows on other operating systems, you will need to use the Docker container approach.
More detailed instructions on how to install Anaconda and use conda environments can be found in the Anaconda documentation for: