bioflows
YAML control file Specifications
Project information section
-
bioproject
: This will be an unique identifier for your project. This is adopted from the NCBI SRA format structure, so if you use an SRA dataset you can employ these ids -
experiment
: An identifier for your experiment such as RNA-seq, ChIP-seq etc -
sample_manifest
: This section contains two sections-
fastq_file
: The full path to the sample to fastq map file. This file is in a three column comma separated format with each line formatted as:sample_id, path_to_fastq_file_for_read1, path_to_fastq_file_for_read2
The
sample_id
is unique and if you are using single end data you just need to specify one column as shown below:sample_id, path_to_fastq_file
-
metadata
: This is all the metadata associated with a givensample_id
if available such as gender, extraction date etc. This should also be a CSV format file. Currently, not necessary as this information is not yet used
-
Global run parameters
-
run_parms
: This section specifies the global parameters for the current analysis-
conda_command
: This is the command used to activate your conda environment -
work_dir
: The working directory for analysis usually created on/gpfs/scratch
-
log_dir
: The subdirectory for all the log files -
paired_end
: Whether data consists of paired end reads or single end reads (True/False)
Warning
This is an experimental feature and may not work as intended
-
ssh_user
: The user name if workflow is run from a local machine -
saga_scheduler
: The scheduler being used, for CCV the value used here is slurm. !!! note Currently only tested withslurm
scheduler. Will add test to others soon -
gtf_file
: The full path to the gtf file for gene annotations, needed if you are planning to run rna-seq analysis
-
Workflow parameters
-
workflow_sequence
: This section specifies the sequence of tools to be used and the options passed to tools as well as the job parameters if using a scheduler such as slurm-
fastqc
:If you want to use the default parameters use default else you can use any of the options provided by the program. See the example for GSNAP below on how to do that. See the documentation for the options for fastqc. -
gsnap
: Here we give an example of two sections as we need to pass the index information to the aligner-
options
: Specify program options here. In this example we specify the following-
-d: The genome index for GSNAP
-
-s: and the splicesites file location for GSNAP.
The format is exactly that as to what you would specify on the command line for the program
-d Ensembl_mus_GRCm38.p5_rel89 -s Mus_musculus.GRCm38.89.splicesites.iit
See the documentation for the GSNAP program for more options
-
-
job_param
s: This section specifies parameters for job submission such as memory, number of cores etc- ncpus: 16
- mem: 40000
- time: 60
-
-
qualimap_rnaseq
: Run the qualimap module for RNAseq with the default settings
-
The final YAML control file should look as below to run a test example. Only modify the parts that are highlighted below to fill in your own values.
bioproject: Project_test_localhost
experiment: rnaseq_pilot
sample_manifest:
fastq_file: /users/:bluetext:`username/sample_manifest_min.csv`
metadata:
run_parms:
conda_command: source activate /gpfs/runtime/opt/conda/envs/cbc_conda_test
work_dir: /gpfs/scratch/'user/test_workflow'
log_dir: logs
paired_end: False
local_targets: False
db: sqlite
db_loc: ":memory:"
saga_host: localhost
ssh_user: 'ccv username'
saga_scheduler: slurm
gtf_file: /gpfs/data/cbc/cbcollab/ref_tools/Ensembl_hg_GRCh37_rel87/Homo_sapiens.GRCh37.87.gtf
workflow_sequence:
fastqc: default
gsnap:
options:
-d: Ensembl_Homo_sapiens_GRCh37
-s: Ensembl_Homo_sapiens.GRCh37.87.splicesites.iit
job_params:
ncpus: 16
mem: 40000
time: 60
qualimap_rnaseq: default
htseq-count: default
How to run
Copy the above into a text file and save it in /users/username
as test_run.yaml
Copy the manifest below into a text file and save it in /users/username
as sample_manifest_min.csv
samp_1299,/gpfs/scratch/aragaven/rnaseq_test/PE_hg/Cb2_1.gz,/gpfs/scratch/aragaven/rnaseq_test/PE_hg/Cb2_2.gz
samp_1214,/gpfs/scratch/aragaven/rnaseq_test/PE_hg/Cb_1.gz,/gpfs/scratch/aragaven/rnaseq_test/PE_hg/Cb_2.gz
Now in your screen session run the following commands to setup your environment if you have not done so previously during the setup or you have started a new screen session
source activate bflows
bioflows-rnaseq test_run.yaml
In this case I have created a small test dataset with 10000 reads from a test human RNAseq data, so it should run within the hour and you should see that the alignments are completed.