~vejnar/LabxPipe

Genomics pipelines

#LabxPipe


  • Integrated with LabxDB: all required annotations (labels, strand, paired etc) are retrieved from LabxDB. This is optional.
  • Based on existing robust technologies. No new language.
    • LabxPipe pipelines are defined in JSON text files.
    • LabxPipe is written in Python. Shared conventions, such as input and output filenames, ensure compatibility between tasks.
  • Simple and complex pipelines.
    • By default, pipelines are linear (one step after the other).
    • Branching is achieved by defining a previous step as input (using the step_input parameter), allowing users to create any dependency between tasks.
  • Parallelized using robust asynchronous threads from the Python standard library.
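
The default linear chaining and the step_input override can be sketched as follows. This is only an illustration, not LabxPipe's implementation: it assumes step_input holds the name of an earlier step, and resolve_inputs and the "plotting" step are hypothetical names.

```python
def resolve_inputs(steps):
    """Map each step name to the name of the step that provides its input."""
    inputs = {}
    previous = None
    for step in steps:
        name = step["step_name"]
        # An explicit step_input (assumed to be a step name) overrides
        # the default behaviour of chaining to the preceding step.
        source = step.get("step_input", previous)
        if source is not None and source not in inputs:
            raise ValueError(f"{name}: unknown step_input {source!r}")
        inputs[name] = source
        previous = name
    return inputs


steps = [
    {"step_name": "preparing"},
    {"step_name": "aligning"},
    {"step_name": "counting"},
    # Branch: read the output of "preparing" instead of "counting".
    {"step_name": "plotting", "step_input": "preparing"},
]
resolved = resolve_inputs(steps)
print(resolved)
```

Here the first three steps form a linear chain, while "plotting" branches off "preparing".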

#Commands

LabxPipe provides a single lxpipe command with multiple sub-commands. Running a pipeline typically involves sub-commands such as run, report and extract:

[Figure: LabxPipe]

The output of multiple pipelines executed using lxpipe run can be combined to merge gene counts or create profiles and trackhubs with sub-commands such as merge-count and trackhub:

[Figure: LabxPipe]

See examples to understand how each sub-command works.

#Examples

See JSON files in config/pipelines of this repository.

| Pipeline JSON file | Description |
| --- | --- |
| mrna_seq.json | mRNA-seq. |
| mrna_seq_profiling_bam.json | mRNA-seq. Genomic coverage profiles using GeneAbacus. BAM and SAM outputs. |
| mrna_seq_no_db.json | mRNA-seq. No LabxDB. |
| mrna_seq_with_plotting.json | mRNA-seq. Plotting non-mapped reads. Demonstrates step_input. |
| mrna_seq_cufflinks.json | mRNA-seq. Replaces GeneAbacus by Cufflinks. |
| chip_seq.json | ChIP-seq. Bowtie2 and Samtools to uniquify reads. |
| chip_seq_user_function.json | ChIP-seq. Bowtie2 and Samtools to uniquify reads. Genomic coverage profiles using GeneAbacus. Peak-calling using MACS3 employing a user-defined step/function. |

The following demonstrates how to apply the mrna_seq.json pipeline. It requires:

  • LabxDB
  • FASTQ files for the samples named AGR000850 and AGR000912
    /plus/data/seq/by_run/AGR000850
    ├── 23_009_R1.fastq.zst
    └── 23_009_R2.fastq.zst
    /plus/data/seq/by_run/AGR000912
    ├── 65_009_R1.fastq.zst
    └── 65_009_R2.fastq.zst
    

Note: mrna_seq_no_db.json demonstrates how to use LabxPipe without LabxDB: it only requires FASTQ files (in path_seq_run directory, see above).

Requirements:

  • LabxDB. Alternatively, mrna_seq_no_db.json doesn't require LabxDB.
  • ReadKnead to trim reads.
  • STAR and a genome index in the directory defined by path_star_index.
  • GeneAbacus to count reads and generate genomic profiles for tracks.
  1. Start pipeline:

    lxpipe run --pipeline mrna_seq.json \
               --worker 2 \
               --processor 16
    

    Output is written in path_output directory.

  2. Create report:

    lxpipe report --pipeline mrna_seq.json
    

    The report file mrna_seq.xlsx should be created in the same directory as mrna_seq.json.

  3. Extract output file(s) to use them directly, for instance to load them in IGV. For example:

    • To extract alignment files (here the zstd-compressed SAM output of the aligning step) and rename them using the sample label:
      lxpipe extract --pipeline mrna_seq.json \
                     --files aligning,accepted_hits.sam.zst \
                     --label
      
    • To extract BigWig profile files and rename them using the sample label and reference, keeping the original filename as a suffix:
      lxpipe extract --pipeline mrna_seq.json \
                     --files profiling,genome_plus.bw \
                     --label \
                     --reference \
                     --suffix
      
      Use -d/--dry_run to test the extract command before applying it.
  4. Merge gene/mRNA counts generated by GeneAbacus in counting directory:

    lxpipe merge-count --pipeline mrna_seq.json \
                       --step counting
    
  5. Create a trackhub. Requirements:

    • ChromosomeMappings file (to map chromosome names from Ensembl/NCBI to UCSC)
    • Tabulated file (with chromosome name and length)

    Execute in a separate directory:

    lxpipe trackhub --runs AGR000850,AGR000912 \
                    --species_ucsc danRer11 \
                    --path_genome /plus/scratch/sai/annots/danrer_genome_all_ensembl_grcz11_ucsc_chroms_chrom_length.tab \
                    --path_mapping /plus/scratch/sai/annots/ChromosomeMappings/GRCz11_ensembl2UCSC.txt \
                    --input_sam \
                    --bam_names accepted_hits.sam.zst \
                    --make_config \
                    --make_trackhub \
                    --make_bigwig \
                    --processor 16
    

    The directory is then ready to be served by a web server for display in the UCSC Genome Browser.

#Configuration

Parameters can be defined globally. See the config directory of this repository for examples.

#Writing pipelines

Parameters are defined first globally (see above), then per pipeline, then per replicate/run, and then per step/function. The latest, most specific definition takes precedence. For example, path_seq_run defined in /etc/hts/labxpipe.json is used by default, but if path_seq_run is also defined in the pipeline file, the pipeline value is used instead.
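
This precedence order can be illustrated with the standard library's ChainMap. It is a sketch of the lookup rule only, not of how LabxPipe merges parameters internally; all values below are made up for illustration.

```python
from collections import ChainMap

# Hypothetical values at each level, from least to most specific.
global_params = {"path_seq_run": "/data/seq/by_run", "logging_level": "info"}
pipeline_params = {"path_seq_run": "/plus/data/seq/by_run", "name": "mrna_seq"}
run_params = {"paired": True}
step_params = {"paired": False}

# ChainMap searches its mappings left to right, so the most specific
# level (step) comes first and the global level comes last.
params = ChainMap(step_params, run_params, pipeline_params, global_params)

print(params["path_seq_run"])   # /plus/data/seq/by_run (pipeline overrides global)
print(params["paired"])         # False (step overrides run)
print(params["logging_level"])  # info (only defined globally)
```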

Main parameters

| Parameter | Type |
| --- | --- |
| name | string |
| path_output | string |
| path_seq_run | string |
| path_local_steps | string |
| path_annots | string |
| path_bowtie2_index | string |
| path_bwa-mem2_index | string |
| path_minimap2_index | string |
| path_star_index | string |
| fastq_exts | []strings |
| adaptors | {} |
| logging_level | string |
| run_refs | []strings |
| replicate_refs | []strings |
| ref_info_source | []strings |
| ref_infos | {} |
| analysis | [{}, {}, ...] |

Parameters for all steps

| Parameter | Type |
| --- | --- |
| step_name | string |
| step_function | string |
| step_desc | string |
| force | boolean |
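
To show how these parameters fit together, the sketch below builds a small pipeline definition as a Python dict and serializes it to JSON. It assumes that each entry of analysis is one step carrying step_name and step_function; the name, paths and logging level are placeholders. The files in config/pipelines remain the authoritative examples.

```python
import json

# Hypothetical pipeline definition; values are placeholders only.
pipeline = {
    "name": "example_mrna_seq",
    "path_output": "/tmp/example_output",
    "run_refs": ["AGR000850", "AGR000912"],
    "logging_level": "info",
    # Assumed layout: one dict per step, executed in order.
    "analysis": [
        {"step_name": "preparing", "step_function": "readknead", "force": False},
        {"step_name": "aligning", "step_function": "star"},
        {"step_name": "counting", "step_function": "geneabacus"},
    ],
}

# Pipelines are plain JSON text files, so the dict can be written as-is.
text = json.dumps(pipeline, indent=4)
print(text)
```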

Step-specific parameters

| Step | Synonym | Parameter | Type |
| --- | --- | --- | --- |
| readknead | preparing | options | []strings |
| | | ops_r1 | [{}, {}, ...] |
| | | ops_r2 | [{}, {}, ...] |
| | | plot_fastq_in | boolean |
| | | plot_fastq | boolean |
| | | fastq_out | boolean |
| | | zip_fastq_out | string |
| bowtie2 | genomic_aligning | options | []strings |
| | | index | string |
| | | output | string |
| | | output_unfiltered | string |
| | | compress_sam | boolean |
| | | compress_sam_cmd | string |
| | | create_bam◆ | boolean |
| | | index_bam◆ | boolean |
| bwa-mem2 | | options | []strings |
| | | index | string |
| | | output | string |
| | | compress_output | boolean |
| | | compress_output_cmd | string |
| | | create_bam◆ | boolean |
| | | index_bam◆ | boolean |
| minimap2 | | options | []strings |
| | | index | string |
| | | output | string |
| | | compress_output | boolean |
| | | compress_output_cmd | string |
| | | create_bam◆ | boolean |
| | | index_bam◆ | boolean |
| star | aligning | options | []strings |
| | | index | string |
| | | output_type | []strings |
| | | compress_sam | boolean |
| | | compress_sam_cmd | string |
| | | compress_unmapped | boolean |
| | | compress_unmapped_cmd | string |
| cufflinks | | options | []strings |
| | | inputs | [{}, {}, ...] |
| | | features | [{}, {}, ...] |
| geneabacus | counting | options | []strings |
| | | inputs | [{}, {}, ...] |
| | | path_annots | string |
| | | features | [{}, {}, ...] |
| samtools_sort | | options | []strings |
| | | sort_by_name_bam | boolean |
| samtools_uniquify | | options | []strings |
| | | sort_by_name_bam | boolean |
| | | index_bam | boolean |
| cleaning | | steps | [{}, {}, ...] |

◆ indicates exclusive options. For example, either create_bam or index_bam can be used, but not both.
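
The exclusivity rule can be checked before a step runs. A minimal sketch, assuming a step's parameters are available as a dict; check_exclusive is a hypothetical helper, and whether LabxPipe itself rejects such a configuration is not stated here.

```python
# Options marked with ◆ in the table above.
EXCLUSIVE = ("create_bam", "index_bam")


def check_exclusive(step_params, exclusive=EXCLUSIVE):
    """Return the enabled exclusive option (or None); fail if several are set."""
    enabled = [name for name in exclusive if step_params.get(name)]
    if len(enabled) > 1:
        raise ValueError("Exclusive options used together: " + ", ".join(enabled))
    return enabled[0] if enabled else None


print(check_exclusive({"output": "accepted_hits.sam", "create_bam": True}))
```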

Sample-specific parameters. Automatically populated if using LabxDB or sourced from ref_infos. These parameters can be changed manually in any step (for example setting paired to false will ignore second reads in that step).

| Parameter | Type |
| --- | --- |
| label_short | string |
| paired | boolean |
| directional | boolean |
| r1_strand | string |
| quality_scores | string |

#User-defined step

In addition to the provided steps/functions, e.g. bowtie2, star or geneabacus, users can define their own steps, usable in LabxPipe pipelines. LabxPipe will import user-defined steps that are:

  • Written in Python

  • One step per file with the .py extension located in the directory defined by path_local_steps

  • Each step defined in individual file requires:

    1. A functions variable listing the step name(s)
    2. A function named run with the 3 parameters path_in, path_out and params

    For example:

    functions = ['macs3']
    def run(path_in, path_out, params):
        ...
    

Example of a user-defined function providing peak-calling using MACS3 is available in config/user_steps/macs3.py in this repository.

Example of a pipeline using the MACS3 step is available in config/pipelines/chip_seq_user_function.json in this repository.
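
The convention above (one .py file per step, a functions list and a run function) can be explored with a small loader. This is a sketch of how such files could be discovered with importlib, not LabxPipe's own import code; load_user_steps is a hypothetical helper, and the throwaway step file and its return value exist only for the demonstration.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_user_steps(path_local_steps):
    """Return {step name: run function} for every .py file in the directory."""
    steps = {}
    for path in sorted(Path(path_local_steps).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Each file lists its step name(s) in `functions` and exposes `run`.
        for name in module.functions:
            steps[name] = module.run
    return steps


# Demonstration with a throwaway step file standing in for a real one.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "macs3.py").write_text(
        "functions = ['macs3']\n"
        "def run(path_in, path_out, params):\n"
        "    return (path_in, path_out, params)\n"
    )
    loaded_steps = load_user_steps(tmp)
    result = loaded_steps["macs3"]("in_dir", "out_dir", {"paired": False})

print(sorted(loaded_steps))
```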

#Demultiplexing sequencing reads: lxpipe demultiplex

  • Demultiplex reads based on barcode sequences from the Second barcode field in LabxDB

  • Demultiplexing is performed using ReadKnead. The most important element is the ReadKnead pipeline, which is selected using the Adapter 3' field in LabxDB.

  • Example for simple demultiplexing. The first nucleotides at the 5' end of read 1 are used as barcodes (the Adapter 3' field is set to sRNA 1.5 in LabxDB for these samples) with the following pipeline:

    {
        "sRNA 1.5": {
            "R1": [
                {
                    "name": "demultiplex",
                    "end": 5,
                    "max_mismatch": 1
                }
            ],
            "R2": null
        }
    }
    

    The barcode sequences are added by LabxPipe using the Second barcode field in LabxDB.
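
    The idea behind this kind of demultiplexing (match the first nucleotides of read 1 against known barcodes, tolerating max_mismatch differences) can be sketched in a few lines. This is a conceptual illustration and not ReadKnead's algorithm: the barcodes and sample names are made up, and removing the barcode from the read is an assumption.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(read, barcodes, max_mismatch=1):
    """Return (sample, read without barcode), or (None, read) if no unique match."""
    matches = []
    for sample, barcode in barcodes.items():
        prefix = read[:len(barcode)]  # barcode expected at the 5' end
        if len(prefix) == len(barcode) and hamming(prefix, barcode) <= max_mismatch:
            matches.append((sample, read[len(barcode):]))
    # Ambiguous reads (several barcodes within tolerance) stay unassigned.
    if len(matches) == 1:
        return matches[0]
    return None, read


barcodes = {"sample_a": "ACGT", "sample_b": "TTAG"}  # hypothetical barcodes
print(demultiplex("ACGTGGGCCC", barcodes))  # exact match
print(demultiplex("ACCTGGGCCC", barcodes))  # one mismatch, still assigned
print(demultiplex("GGGGGGGCCC", barcodes))  # no barcode within tolerance
```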

  • Example for iCLIP demultiplexing. In Vejnar et al., iCLIP is demultiplexed (the Adapter 3' field is set to TruSeq-DMS+A Index in LabxDB for these samples) using the following pipeline:

    {
        "TruSeq-DMS+A Index": {
            "R1": [
                {
                    "name": "clip",
                    "end": 5,
                    "length": 4,
                    "add_clipped": true
                },
                {
                    "name": "trim",
                    "end": 3,
                    "algo": "bktrim",
                    "min_sequence": 5,
                    "keep": ["trim_exact", "trim_align"]
                },
                {
                    "name": "length",
                    "min_length": 6
                },
                {
                    "name": "demultiplex",
                    "end": 3,
                    "max_mismatch": 1,
                    "length_ligand": 2
                },
                {
                    "name": "length",
                    "min_length": 15
                }
            ],
            "R2": null
        }
    }
    

    Pipeline is stored in demux_truseq_dms_a.json. The barcode sequences are added by LabxPipe using the Second barcode field in LabxDB. (NB: published demultiplexed data were generated using "algo": "align" with a minimum score of 80 instead of "algo": "bktrim")

    The pipeline was then tested by running:

    lxpipe demultiplex --bulk HHYLKADXX \
                       --path_demux_ops demux_truseq_dms_a.json \
                       --path_seq_prepared prepared \
                       --demux_nozip \
                       --processor 1 \
                       --demux_verbose_level 20 \
                       --no_readonly
    

    This output is very verbose: for every read, the output of every step of the demultiplexing pipeline is reported. To get consistent output, --processor must be set to 1. Output is written in the local directory prepared.

    Finally, once the pipeline is validated, run it without the test options (data is written in the path_seq_prepared directory):

    lxpipe demultiplex --bulk HHYLKADXX \
                       --path_demux_ops demux_truseq_dms_a.json \
                       --processor 10
    

#License

LabxPipe is distributed under the Mozilla Public License Version 2.0 (see /LICENSE).

Copyright © 2013-2023 Charles E. Vejnar