Pipeline Templates

You can always run your own pipeline scripts through the container, but the container also includes a set of predefined pipeline scripts that can be run as is or extended to suit your needs. Each pipeline script accepts a -h argument that prints its usage. The available pipelines are:

  • preprocess-phix

  • presto-abseq

  • presto-clontech

  • presto-clontech-umi

  • changeo-10x

  • changeo-igblast

  • tigger-genotype

  • shazam-threshold

  • changeo-clone

All template pipeline scripts can be found in /usr/local/bin.
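
As a quick way to explore the scripts, the loop below prints the help command for each pipeline listed above. It is a sketch that assumes Docker and the immcantation/suite:4.3.0 image tag used in the examples in this document; PIPELINES and help_cmd are illustrative names, not part of the container. Nothing is executed by the script itself.

```shell
# Image tag copied from the examples in this document.
IMAGE=immcantation/suite:4.3.0

# Pipeline scripts shipped in /usr/local/bin inside the container.
PIPELINES="preprocess-phix presto-abseq presto-clontech presto-clontech-umi
changeo-10x changeo-igblast tigger-genotype shazam-threshold changeo-clone"

# Print the command that shows the help text for one pipeline script.
help_cmd() {
    echo "docker run $IMAGE $1 -h"
}

# Print one command per pipeline (word splitting of $PIPELINES is intended).
for name in $PIPELINES; do
    help_cmd "$name"
done
```

Piping the output of this script to sh would run the printed commands one after another.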

PhiX cleaning pipeline

Removes reads from a sequence file that align against the PhiX174 reference genome.

Usage: preprocess-phix [OPTIONS]
-s

FASTQ sequence file.

-r

Directory containing phiX174 reference db. Defaults to /usr/local/share/phix.

-n

Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.

-o

Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.

-p

Number of subprocesses for multiprocessing tools. Defaults to the available cores.

-h

This message.

Example: preprocess-phix

# Arguments
DATA_DIR=~/project
READS=/data/raw/sample.fastq
OUT_DIR=/data/presto/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    preprocess-phix -s $READS -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    preprocess-phix -s $READS -o $OUT_DIR -p $NPROC

Note

The PhiX cleaning pipeline converts the sequence headers to the pRESTO format. If the nophix output file is then used as input to the presto-abseq pipeline script, you must pass -x presto to presto-abseq to tell the script that the input headers are in pRESTO format rather than Illumina format.
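
The hand-off described in this note can be sketched as follows. The script only prints the two-step command sequence. The nophix file paths and the /data/phix output directories are hypothetical placeholders (check the preprocess-phix output directory for the real names), and phix_cmd and abseq_cmd are illustrative helper names.

```shell
IMAGE=immcantation/suite:4.3.0
DATA_DIR=~/project

# Print a preprocess-phix command. $1 = input FASTQ, $2 = output directory.
phix_cmd() {
    echo "docker run -v $DATA_DIR:/data:z $IMAGE preprocess-phix -s $1 -o $2"
}

# Print a presto-abseq command. $1 = read 1, $2 = read 2, $3 = header format (-x).
abseq_cmd() {
    echo "docker run -v $DATA_DIR:/data:z $IMAGE presto-abseq -1 $1 -2 $2 -x $3"
}

# Step 1: remove PhiX reads from each mate file.
phix_cmd /data/raw/sample_R1.fastq /data/phix/sample_R1
phix_cmd /data/raw/sample_R2.fastq /data/phix/sample_R2

# Step 2: the cleaned files carry pRESTO-style headers, so pass -x presto.
# HYPOTHETICAL file names: replace with the actual nophix output files.
NOPHIX_R1=/data/phix/sample_R1/sample_R1_nophix.fastq
NOPHIX_R2=/data/phix/sample_R2/sample_R2_nophix.fastq
abseq_cmd "$NOPHIX_R1" "$NOPHIX_R2" presto
```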

NEBNext / AbSeq immune sequencing kit preprocessing pipeline

A start-to-finish pRESTO processing script for NEBNext / AbSeq immune sequencing data. An example for human BCR processing is shown below. Primer sequences are available from the Immcantation repository under protocols/AbSeq or inside the container under /usr/local/share/protocols/AbSeq. Mouse primers are not supplied. TCR V gene references can be specified with the flag -r /usr/local/share/igblast/fasta/imgt_human_tr_v.fasta.

Usage: presto-abseq [OPTIONS]
-1

Read 1 FASTQ sequence file (sequence beginning with the C-region or J-segment).

-2

Read 2 FASTQ sequence file (sequence beginning with the leader or V-segment).

-j

Read 1 FASTA primer sequences. Defaults to /usr/local/share/protocols/AbSeq/AbSeq_R1_Human_IG_Primers.fasta.

-v

Read 2 FASTA primer or template switch sequences. Defaults to /usr/local/share/protocols/AbSeq/AbSeq_R2_TS.fasta.

-c

C-region FASTA sequences for the C-region internal to the primer. If unspecified, internal C-region alignment is not performed.

-r

V-segment reference file. Defaults to /usr/local/share/igblast/fasta/imgt_human_ig_v.fasta.

-y

YAML file providing description fields for report generation.

-n

Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the read 1 filename.

-o

Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.

-x

The mate-pair coordinate format of the raw data. Defaults to illumina.

-p

Number of subprocesses for multiprocessing tools. Defaults to the available cores.

-h

This message.

One of the requirements for generating the report at the end of the pRESTO pipeline is a YAML file containing information about the data and processing. Valid fields are shown in the example sample.yaml below, although no fields are strictly required:

sample.yaml

title: "pRESTO Report: CD27+ B cells from subject HD1"
author: "Your Name"
version: "0.5.4"
description: "Memory B cells (CD27+)."
sample: "HD1"
run: "ABC123"
date: "Today"
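
One way to generate this file from a script is a shell heredoc, sketched below. It writes to a temporary directory so it can be run anywhere; in practice the file would go under the mounted project directory so that the container sees it as /data/sample.yaml. Filling the date field with the current date assumes the field accepts free text, as the "Today" value above suggests. YAML_DIR and REPORT_DATE are illustrative names.

```shell
# Scratch location for the sketch; use a path under the project directory in practice.
YAML_DIR=$(mktemp -d)
YAML="$YAML_DIR/sample.yaml"

# Assumption: the date field is free text, so an ISO date is acceptable.
REPORT_DATE=$(date +%Y-%m-%d)

# Write the description fields shown in the sample.yaml example.
cat > "$YAML" <<EOF
title: "pRESTO Report: CD27+ B cells from subject HD1"
author: "Your Name"
version: "0.5.4"
description: "Memory B cells (CD27+)."
sample: "HD1"
run: "ABC123"
date: "$REPORT_DATE"
EOF

# Show the result.
cat "$YAML"
```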

Example: presto-abseq

# Arguments
DATA_DIR=~/project
READS_R1=/data/raw/sample_R1.fastq
READS_R2=/data/raw/sample_R2.fastq
YAML=/data/sample.yaml
SAMPLE_NAME=sample
OUT_DIR=/data/presto/sample
NPROC=4

# Docker command
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    presto-abseq -1 $READS_R1 -2 $READS_R2 -y $YAML \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    presto-abseq -1 $READS_R1 -2 $READS_R2 -y $YAML \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

Takara Bio / Clontech SMARTer v1 immune sequencing kit preprocessing pipeline

A start-to-finish pRESTO processing script for Takara Bio / Clontech SMARTer v1 immune sequencing kit data. C-regions are assigned using the universal C-region primer sequences, which are available from the Immcantation repository under protocols/Universal or inside the container under /usr/local/share/protocols/Universal.

Usage: presto-clontech [OPTIONS]
-1

Read 1 FASTQ sequence file. Sequence beginning with the C-region.

-2

Read 2 FASTQ sequence file. Sequence beginning with the leader.

-j

C-region reference sequences (reverse complemented). Defaults to /usr/local/share/protocols/Universal/Mouse_IG_CRegion_RC.fasta.

-r

V-segment reference file. Defaults to /usr/local/share/igblast/fasta/imgt_mouse_ig_v.fasta.

-y

YAML file providing description fields for report generation.

-n

Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the read 1 filename.

-o

Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.

-x

The mate-pair coordinate format of the raw data. Defaults to illumina.

-p

Number of subprocesses for multiprocessing tools. Defaults to the available cores.

-h

This message.

Example: presto-clontech

# Arguments
DATA_DIR=~/project
READS_R1=/data/raw/sample_R1.fastq
READS_R2=/data/raw/sample_R2.fastq
CREGION=/usr/local/share/protocols/Universal/Human_IG_CRegion_RC.fasta
VREF=/usr/local/share/igblast/fasta/imgt_human_ig_v.fasta
SAMPLE_NAME=sample
OUT_DIR=/data/presto/sample
NPROC=4

# Docker command
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    presto-clontech -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    presto-clontech -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

Takara Bio / Clontech SMARTer v2 (UMI) immune sequencing kit preprocessing pipeline

A start-to-finish pRESTO processing script for Takara Bio / Clontech SMARTer v2 immune sequencing kit data that includes UMIs. C-regions are assigned using the universal C-region primer sequences, which are available from the Immcantation repository under protocols/Universal or inside the container under /usr/local/share/protocols/Universal.

Usage: presto-clontech-umi [OPTIONS]
-1

Read 1 FASTQ sequence file. Sequence beginning with the C-region.

-2

Read 2 FASTQ sequence file. Sequence beginning with the leader.

-j

C-region reference sequences (reverse complemented). Defaults to /usr/local/share/protocols/Universal/Human_IG_CRegion_RC.fasta.

-r

V-segment reference file. Defaults to /usr/local/share/igblast/fasta/imgt_human_ig_v.fasta.

-n

Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the read 1 filename.

-o

Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.

-x

The mate-pair coordinate format of the raw data. Defaults to illumina.

-p

Number of subprocesses for multiprocessing tools. Defaults to the available cores.

-h

This message.

Example: presto-clontech-umi

# Arguments
DATA_DIR=~/project
READS_R1=/data/raw/sample_R1.fastq
READS_R2=/data/raw/sample_R2.fastq
CREGION=/usr/local/share/protocols/Universal/Human_IG_CRegion_RC.fasta
VREF=/usr/local/share/igblast/fasta/imgt_human_ig_v.fasta
SAMPLE_NAME=sample
OUT_DIR=/data/presto/sample
NPROC=4

# Docker command
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    presto-clontech-umi -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    presto-clontech-umi -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

10x Genomics V(D)J annotation pipeline

Assigns new annotations to, and infers clonal relationships for, 10x Genomics single-cell V(D)J data output by Cell Ranger.

Usage: changeo-10x [OPTIONS]
-s

FASTA or FASTQ sequence file.

-a

10x Genomics cellranger-vdj contig annotation CSV file. Must correspond to the FASTA/FASTQ input file (all, filtered or consensus).

-r

Directory containing IMGT-gapped reference germlines. Defaults to /usr/local/share/germlines/imgt/[species name]/vdj.

-g

Species name. One of human, mouse, rabbit, rat, or rhesus_monkey. Defaults to human.

-t

Receptor type. One of ig or tr. Defaults to ig.

-x

Distance threshold for clonal assignment. Specify “auto” for automatic detection. If unspecified, clonal assignment is not performed.

-m

Distance model for clonal assignment. Defaults to the nucleotide Hamming distance model (ham).

-b

IgBLAST IGDATA directory, which contains the IgBLAST database, optional_file and auxillary_data directories. Defaults to /usr/local/share/igblast.

-n

Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the sequence filename.

-o

Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.

-f

Output format. One of changeo or airr. Defaults to airr.

-p

Number of subprocesses for multiprocessing tools. Defaults to the available cores.

-i

Specify to allow partial alignments.

-z

Specify to disable cleaning and compression of temporary files.

-h

This message.

Example: changeo-10x

# Arguments
DATA_DIR=~/project
READS=/data/raw/sample_filtered_contig.fasta
ANNOTATIONS=/data/raw/sample_filtered_contig_annotations.csv
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
DIST=auto
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    changeo-10x -s $READS -a $ANNOTATIONS -x $DIST -n $SAMPLE_NAME \
    -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    changeo-10x -s $READS -a $ANNOTATIONS -x $DIST -n $SAMPLE_NAME \
    -o $OUT_DIR -p $NPROC
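
The -g and -t options switch the species and receptor type. As a sketch, the helper below prints a changeo-10x command for any combination, here mouse TCR data; clonal assignment is skipped because -x is omitted. The function name changeo_10x_cmd is illustrative, and the input paths simply follow the naming pattern of the example above.

```shell
IMAGE=immcantation/suite:4.3.0
DATA_DIR=~/project

# Print a changeo-10x command.
# $1 = species (-g), $2 = receptor type (-t), $3 = sample name (-n)
changeo_10x_cmd() {
    species=$1
    receptor=$2
    sample=$3
    echo "docker run -v $DATA_DIR:/data:z $IMAGE changeo-10x" \
        "-s /data/raw/${sample}_filtered_contig.fasta" \
        "-a /data/raw/${sample}_filtered_contig_annotations.csv" \
        "-g $species -t $receptor -n $sample -o /data/changeo/$sample"
}

# Mouse T cell receptor data; no -x, so clones are not assigned.
changeo_10x_cmd mouse tr sample
```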

IgBLAST annotation pipeline

Performs V(D)J alignment using IgBLAST and post-processes the output into the Change-O data standard.

Usage: changeo-igblast [OPTIONS]
-s

FASTA or FASTQ sequence file.

-r

Directory containing IMGT-gapped reference germlines. Defaults to /usr/local/share/germlines/imgt/[species name]/vdj.

-g

Species name. One of human, mouse, rabbit, rat, or rhesus_monkey. Defaults to human.

-t

Receptor type. One of ig or tr. Defaults to ig.

-b

IgBLAST IGDATA directory, which contains the IgBLAST database, optional_file and auxillary_data directories. Defaults to /usr/local/share/igblast.

-n

Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the sequence filename.

-o

Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.

-f

Output format. One of airr or changeo. Defaults to airr.

-p

Number of subprocesses for multiprocessing tools. Defaults to the available cores.

-k

Specify to filter the output to only productive/functional sequences.

-i

Specify to allow partial alignments.

-z

Specify to disable cleaning and compression of temporary files.

-h

This message.

Example: changeo-igblast

# Arguments
DATA_DIR=~/project
READS=/data/presto/sample/sample-final_collapse-unique_atleast-2.fastq
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    changeo-igblast -s $READS -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    changeo-igblast -s $READS -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC
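
When several samples have been preprocessed, the same command can be generated per sample. The sketch below derives the sample name from a pRESTO output path shaped like the one in the example above and prints one changeo-igblast command per file, adding -k to keep only productive sequences. SUFFIX, sample_from_path, and the HD1/HD2 sample names are illustrative assumptions.

```shell
IMAGE=immcantation/suite:4.3.0
DATA_DIR=~/project
# File name suffix taken from the example input above.
SUFFIX=-final_collapse-unique_atleast-2.fastq

# Strip the directory and the suffix to recover the sample name.
sample_from_path() {
    base=$(basename "$1")
    echo "${base%"$SUFFIX"}"
}

# Print one changeo-igblast command per input file (paths as seen inside the container).
for reads in /data/presto/HD1/HD1$SUFFIX /data/presto/HD2/HD2$SUFFIX; do
    name=$(sample_from_path "$reads")
    echo "docker run -v $DATA_DIR:/data:z $IMAGE changeo-igblast" \
        "-s $reads -n $name -o /data/changeo/$name -k"
done
```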

Genotyping pipeline

Infers V segment genotypes using TIgGER.

Usage: tigger-genotype [options]
-d DB, --db=DB

Change-O formatted TSV (TAB) file.

-r REF, --ref=REF

FASTA file containing IMGT-gapped V segment reference germlines. Defaults to /usr/local/share/germlines/imgt/human/vdj/imgt_human_IGHV.fasta.

-v VFIELD, --vfield=VFIELD

Name of the output field containing genotyped V assignments. Defaults to V_CALL_GENOTYPED.

-x MINSEQ, --minseq=MINSEQ

Minimum number of sequences in the mutation/coordinate range. Samples with insufficient sequences will be excluded. Defaults to 50.

-y MINGERM, --mingerm=MINGERM

Minimum number of sequences required to analyze a germline allele. Defaults to 200.

-n NAME, --name=NAME

Sample name or run identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.

-o OUTDIR, --outdir=OUTDIR

Output directory. Will be created if it does not exist. Defaults to the current working directory.

-f FORMAT, --format=FORMAT

File format. One of ‘airr’ (default) or ‘changeo’.

-p NPROC, --nproc=NPROC

Number of subprocesses for multiprocessing tools. Defaults to the available processing units.

-h, --help

Show this help message and exit

Example: tigger-genotype

# Arguments
DATA_DIR=~/project
DB=/data/changeo/sample/sample_db-pass.tab
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    tigger-genotype -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    tigger-genotype -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

Clonal threshold inference pipeline

Performs automated detection of the clonal assignment threshold.

Usage: shazam-threshold [options]
-d DB, --db=DB

Tabulated data file, in Change-O (TAB) or AIRR format (TSV).

-m METHOD, --method=METHOD

Threshold inference method to use. One of gmm, density, or none. If none, the distance-to-nearest distribution is plotted without threshold detection. Defaults to density.

-n NAME, --name=NAME

Sample name or run identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.

-o OUTDIR, --outdir=OUTDIR

Output directory. Will be created if it does not exist. Defaults to the current working directory.

-f FORMAT, --format=FORMAT

File format. One of ‘airr’ (default) or ‘changeo’.

-p NPROC, --nproc=NPROC

Number of subprocesses for multiprocessing tools. Defaults to the available processing units.

--model=MODEL

Mixture model to use with the gmm method. One of gamma-gamma, gamma-norm, norm-norm or norm-gamma. Defaults to gamma-gamma.

--subsample=SUBSAMPLE

Number of distances to downsample the data to before threshold calculation. By default, subsampling is not performed.

--repeats=REPEATS

Number of times to recalculate. Defaults to 1.

-h, --help

Show this help message and exit

Example: shazam-threshold

# Arguments
DATA_DIR=~/project
DB=/data/changeo/sample/sample_genotyped.tab
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    shazam-threshold -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    shazam-threshold -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC
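
To use the gmm method instead of the default density method, the options above can be combined as sketched here. The script checks the method name against the values listed in the usage and prints the resulting command. threshold_cmd is an illustrative helper, and the gamma-norm model and subsample size of 10000 are arbitrary example values, not recommendations.

```shell
IMAGE=immcantation/suite:4.3.0
DATA_DIR=~/project
DB=/data/changeo/sample/sample_genotyped.tab

# Print a shazam-threshold command for the given method.
# $1 = method; any further arguments are appended unchanged.
threshold_cmd() {
    method=$1
    shift
    case $method in
        gmm|density|none) ;;
        *) echo "unknown method: $method" >&2; return 1 ;;
    esac
    echo "docker run -v $DATA_DIR:/data:z $IMAGE shazam-threshold" \
        "-d $DB -m $method -n sample -o /data/changeo/sample $*"
}

# Mixture-model fit (gamma-norm) on a subsample of the distances.
threshold_cmd gmm --model=gamma-norm --subsample=10000
```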

Clonal assignment pipeline

Assigns Ig sequences into clonally related lineages and builds full germline sequences.

Usage: changeo-clone [OPTIONS]
-d

Change-O formatted TSV (TAB) file.

-x

Distance threshold for clonal assignment.

-m

Distance model for clonal assignment. Defaults to the nucleotide Hamming distance model (ham).

-r

Directory containing IMGT-gapped reference germlines. Defaults to /usr/local/share/germlines/imgt/human/vdj.

-n

Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.

-o

Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.

-f

Output format. One of airr (default) or changeo.

-p

Number of subprocesses for multiprocessing tools. Defaults to the available cores.

-a

Specify to clone the full data set. By default, the data will be filtered to only productive/functional sequences.

-z

Specify to disable cleaning and compression of temporary files.

-h

This message.

Example: changeo-clone

# Arguments
DATA_DIR=~/project
DB=/data/changeo/sample/sample_genotyped.tab
DIST=0.15
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    changeo-clone -d $DB -x $DIST -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    changeo-clone -d $DB -x $DIST -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC
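
Taken together, the last four examples form one workflow: alignment, genotyping, threshold inference, then clonal assignment. The sketch below prints that sequence for a single sample. The intermediate file names are copied from the examples above and may differ from what a given release or output format actually writes, so check the output directory between steps. The step function is an illustrative helper, and the 0.15 threshold is the example value rather than one derived from the data.

```shell
IMAGE=immcantation/suite:4.3.0
DATA_DIR=~/project
SAMPLE=sample
OUT_DIR=/data/changeo/$SAMPLE

# Print one pipeline step as a docker command; arguments are the script and its options.
step() {
    echo "docker run -v $DATA_DIR:/data:z $IMAGE $*"
}

# 1. V(D)J alignment of the pRESTO output.
step changeo-igblast \
    -s "/data/presto/$SAMPLE/${SAMPLE}-final_collapse-unique_atleast-2.fastq" \
    -n "$SAMPLE" -o "$OUT_DIR"

# 2. Genotype inference on the alignment table.
step tigger-genotype -d "$OUT_DIR/${SAMPLE}_db-pass.tab" -n "$SAMPLE" -o "$OUT_DIR"

# 3. Threshold inference on the genotyped table.
step shazam-threshold -d "$OUT_DIR/${SAMPLE}_genotyped.tab" -n "$SAMPLE" -o "$OUT_DIR"

# 4. Clonal assignment with the chosen threshold (example value).
DIST=0.15
step changeo-clone -d "$OUT_DIR/${SAMPLE}_genotyped.tab" -x "$DIST" \
    -n "$SAMPLE" -o "$OUT_DIR"
```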