Example pipelines

You can always run your own pipeline scripts through the container, but the container also includes a set of predefined pipeline scripts that can be run as is or extended to your needs. Each pipeline script has a -h argument which will explain its use. The available pipelines are:

preprocess-phix
presto-abseq
presto-clontech
presto-clontech-umi
changeo-10x
changeo-igblast
tigger-genotype
shazam-threshold
changeo-clone

All example pipeline scripts can be found in /usr/local/bin.

PhiX cleaning pipeline

Removes reads from a sequence file that align against the PhiX174 reference genome.

Usage: preprocess-phix [OPTIONS]

-s: FASTQ sequence file.
-r: Directory containing phiX174 reference db. Defaults to /usr/local/share/phix.
-n: Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.
-o: Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
-p: Number of subprocesses for multiprocessing tools. Defaults to the available cores.
-h: This message.

Example: preprocess-phix

# Arguments
DATA_DIR=~/project
READS=/data/raw/sample.fastq
OUT_DIR=/data/presto/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:devel \
    preprocess-phix -s $READS -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-devel.sif \
    preprocess-phix -s $READS -o $OUT_DIR -p $NPROC

Note

The PhiX cleaning pipeline will convert the sequence headers to the pRESTO format. Thus, if the nophix output file is provided as input to the presto-abseq pipeline script you must pass the argument -x presto to presto-abseq, which will tell the script that the input headers are in pRESTO format (rather than the Illumina format).

NEBNext / AbSeq immune sequencing kit preprocessing pipeline

A start to finish pRESTO processing script for NEBNext / AbSeq immune sequencing data. An example for human BCR processing is shown below. Primer sequences are available from the Immcantation repository under protocols/AbSeq or inside the container under /usr/local/share/protocols/AbSeq. Mouse primers are not supplied. TCR V gene references can be specified with the flag -r /usr/local/share/igblast/fasta/imgt_human_tr_v.fasta.

Usage: presto-abseq [OPTIONS]

-1: Read 1 FASTQ sequence file. Sequence beginning with the C-region or J-segment).
-2: Read 2 FASTQ sequence file. Sequence beginning with the leader or V-segment).
-j: Read 1 FASTA primer sequences. Defaults to /usr/local/share/protocols/AbSeq/AbSeq_R1_Human_IG_Primers.fasta.
-v: Read 2 FASTA primer or template switch sequences. Defaults to /usr/local/share/protocols/AbSeq/AbSeq_R2_TS.fasta.
-c: C-region FASTA sequences for the C-region internal to the primer. If unspecified internal C-region alignment is not performed.
-r: V-segment reference file. Defaults to /usr/local/share/igblast/fasta/imgt_human_ig_v.fasta.
-y: YAML file providing description fields for report generation.
-n: Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the read 1 filename.
-o: Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
-x: The mate-pair coordinate format of the raw data. Defaults to illumina.
-p: Number of subprocesses for multiprocessing tools. Defaults to the available cores.
-h: This message.

One of the requirements for generating the report at the end of the pRESTO pipeline is a YAML file containing information about the data and processing. Valid fields are shown in the example sample.yaml below, although no fields are strictly required:

sample.yaml

title: "pRESTO Report: CD27+ B cells from subject HD1"
author: "Your Name"
version: "0.5.4"
description: "Memory B cells (CD27+)."
sample: "HD1"
run: "ABC123"
date: "Today"

Example: presto-abseq

# Arguments
DATA_DIR=~/project
READS_R1=/data/raw/sample_R1.fastq
READS_R2=/data/raw/sample_R2.fastq
YAML=/data/sample.yaml
SAMPLE_NAME=sample
OUT_DIR=/data/presto/sample
NPROC=4

# Docker command
docker run -v $DATA_DIR:/data:z immcantation/suite:devel \
    presto-abseq -1 $READS_R1 -2 $READS_R2 -y $YAML \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-devel.sif \
    presto-abseq -1 $READS_R1 -2 $READS_R2 -y $YAML \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

Takara Bio / Clontech SMARTer v1 immune sequencing kit preprocessing pipeline

A start to finish pRESTO processing script for Takara Bio / Clontech SMARTer v1 immune sequencing kit data. C-regions are assigned using the universal C-region primer sequences are available from the Immcantation repository under protocols/Universal or inside the container under /usr/local/share/protocols/Universal.

Usage: presto-clontech [OPTIONS]

-1: Read 1 FASTQ sequence file. Sequence beginning with the C-region.
-2: Read 2 FASTQ sequence file. Sequence beginning with the leader.
-j: C-region reference sequences (reverse complemented). Defaults to /usr/local/share/protocols/Universal/Mouse_IG_CRegion_RC.fasta.
-r: V-segment reference file. Defaults to /usr/local/share/igblast/fasta/imgt_mouse_ig_v.fasta.
-y: YAML file providing description fields for report generation.
-n: Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the read 1 filename.
-o: Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
-x: The mate-pair coordinate format of the raw data. Defaults to illumina.
-p: Number of subprocesses for multiprocessing tools. Defaults to the available cores.
-h: This message.

Example: presto-clontech

# Arguments
DATA_DIR=~/project
READS_R1=/data/raw/sample_R1.fastq
READS_R2=/data/raw/sample_R2.fastq
CREGION=/usr/local/share/protocols/Universal/Human_IG_CRegion_RC.fasta
VREF=/usr/local/share/igblast/fasta/imgt_human_ig_v.fasta
SAMPLE_NAME=sample
OUT_DIR=/data/presto/sample
NPROC=4

# Docker command
docker run -v $DATA_DIR:/data:z immcantation/suite:devel \
    presto-clontech -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-devel.sif \
    presto-clontech -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

Takara Bio / Clontech SMARTer v2 (UMI) immune sequencing kit preprocessing pipeline

A start to finish pRESTO processing script for Takara Bio / Clontech SMARTer v2 immune sequencing kit data that includes UMIs. C-regions are assigned using the universal C-region primer sequences are available from the Immcantation repository under protocols/Universal or inside the container under /usr/local/share/protocols/Universal.

Usage: presto-clontech-umi [OPTIONS]

-1: Read 1 FASTQ sequence file. Sequence beginning with the C-region.
-2: Read 2 FASTQ sequence file. Sequence beginning with the leader.
-j: C-region reference sequences (reverse complemented). Defaults to /usr/local/share/protocols/Universal/Human_IG_CRegion_RC.fasta.
-r: V-segment reference file. Defaults to /usr/local/share/igblast/fasta/imgt_human_ig_v.fasta.
-n: Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the read 1 filename.
-o: Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
-x: The mate-pair coordinate format of the raw data. Defaults to illumina.
-p: Number of subprocesses for multiprocessing tools. Defaults to the available cores.
-a: Specify to run multiple alignment of barcode groups prior to consensus. This step is skipped by default.
-h: This message.

Example: presto-clontech-umi

# Arguments
DATA_DIR=~/project
READS_R1=/data/raw/sample_R1.fastq
READS_R2=/data/raw/sample_R2.fastq
CREGION=/usr/local/share/protocols/Universal/Human_IG_CRegion_RC.fasta
VREF=/usr/local/share/igblast/fasta/imgt_human_ig_v.fasta
SAMPLE_NAME=sample
OUT_DIR=/data/presto/sample
NPROC=4

# Docker command
docker run -v $DATA_DIR:/data:z immcantation/suite:devel \
    presto-clontech-umi -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-devel.sif \
    presto-clontech-umi -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

10x Genomics V(D)J annotation pipeline

Assigns new annotations and infers clonal relationships to 10x Genomics single cell V(D)J data output by Cell Ranger.

Usage: changeo-10x [OPTIONS]

-s: FASTA or FASTQ sequence file.
-a: 10x Genomics cellranger-vdj contig annotation CSV file. Must corresponding with the FASTA/FASTQ input file (all, filtered or consensus).
-r: Directory containing IMGT-gapped reference germlines. Defaults to /usr/local/share/germlines/imgt/[species name]/vdj.
-g: Species name. One of human, mouse, rabbit, rat, or rhesus_monkey. Defaults to human.
-t: Receptor type. One of ig or tr. Defaults to ig.
-x: Distance threshold for clonal assignment. Specify “auto” for automatic detection. If unspecified, clonal assignment is not performed.
-m: Distance model for clonal assignment. Defaults to the nucleotide Hamming distance model (ham).
-e: Method to use for determining the optimal threshold. One of ‘gmm’ or ‘density’. Defaults to ‘density’.
-d: Curve fitting model. Applies only when method (-e) is ‘gmm’. One of ‘norm-norm’, ‘norm-gamma’, ‘gamma-norm’ and ‘gamma-gamma’. Defaults to ‘gamma-gamma’.
-u: Method to use for threshold selection. Applies only when method (-e) is ‘gmm’. One of ‘optimal’, ‘intersect’ and ‘user’. Defaults to ‘user’.
-b: IgBLAST IGDATA directory, which contains the IgBLAST database, optional_file and auxillary_data directories. Defaults to /usr/local/share/igblast.
-n: Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the sequence filename.
-o: Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
-f: Output format. One of changeo or airr. Defaults to airr.
-p: Number of subprocesses for multiprocessing tools. Defaults to the available cores.
-i: Specify to allow partial alignments.
-z: Specify to disable cleaning and compression of temporary files.
-h: This message.

Example: changeo-10x

# Arguments
DATA_DIR=~/project
READS=/data/raw/sample_filtered_contig.fasta
ANNOTATIONS=/data/raw/sample_filtered_contig_annotations.csv
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
DIST=auto
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:devel \
    changeo-10x -s $READS -a $ANNOTATIONS -x $DIST -n $SAMPLE_NAME \
    -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-devel.sif \
    changeo-10x -s $READS -a $ANNOTATIONS -x $DIST -n $SAMPLE_NAME \
    -o $OUT_DIR -p $NPROC

IgBLAST annotation pipeline

Performs V(D)J alignment using IgBLAST and post-processes the output into the Change-O data standard.

Usage: changeo-igblast [OPTIONS]

-s: FASTA or FASTQ sequence file.
-r: Directory containing IMGT-gapped reference germlines. Defaults to /usr/local/share/germlines/imgt/[species name]/vdj.
-g: Species name. One of human, mouse, rabbit, rat, or rhesus_monkey. Defaults to human.
-t: Receptor type. One of ig or tr. Defaults to ig.
-b: IgBLAST IGDATA directory, which contains the IgBLAST database, optional_file and auxillary_data directories. Defaults to /usr/local/share/igblast.
-n: Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the sequence filename.
-o: Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
-f: Output format. One of airr (default) or changeo. Defaults to airr.
-p: Number of subprocesses for multiprocessing tools. Defaults to the available cores.
-k: Specify to filter the output to only productive/functional sequences.
-i: Specify to allow partial alignments.
-z: Specify to disable cleaning and compression of temporary files.
-h: This message.

Example: changeo-igblast

# Arguments
DATA_DIR=~/project
READS=/data/presto/sample/sample-final_collapse-unique_atleast-2.fastq
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:devel \
    changeo-igblast -s $READS -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-devel.sif \
    changeo-igblast -s $READS -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

Genotyping pipeline

Infers V segment genotypes using TIgGER.

Usage: tigger-genotype [options]

-d DB, --db=DB: Change-O formatted TSV (TAB) file.
-r REF, --ref=REF: FASTA file containing IMGT-gapped V segment reference germlines. Defaults to /usr/local/share/germlines/imgt/human/vdj/imgt_human_IGHV.fasta.
-v VFIELD, --vfield=VFIELD: Name of the output field containing genotyped V assignments. Defaults to V_CALL_GENOTYPED.
-x MINSEQ, --minseq=MINSEQ: Minimum number of sequences in the mutation/coordinate range. Samples with insufficient sequences will be excluded. Defaults to 50.
-y MINGERM, --mingerm=MINGERM: Minimum number of sequences required to analyze a germline allele. Defaults to 200.
-n NAME, --name=NAME: Sample name or run identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.
-u FIND-UNMUTATED, --find-unmutated=FIND-UNMUTATED: Whether to use ‘-r’ to find which samples are unmutated. Defaults to TRUE.
-o OUTDIR, --outdir=OUTDIR: Output directory. Will be created if it does not exist. Defaults to the current working directory.
-f FORMAT, --format=FORMAT: File format. One of ‘airr’ (default) or ‘changeo’.
-p NPROC, --nproc=NPROC: Number of subprocesses for multiprocessing tools. Defaults to the available processing units.
-h, --help: Show this help message and exit

Example: tigger-genotype

# Arguments
DATA_DIR=~/project
DB=/data/changeo/sample/sample_db-pass.tab
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:devel \
    tigger-genotype -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-devel.sif \
    tigger-genotype -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

TIgGER infers the subject-specific genotyped V gene calls and saves the corrected calls in a new column, v_call_genotyped. TIgGER also generates a *_genotype.fasta file, which contains the subject-specific germline IGHV genes. In future analyses, if v_call_genotyped column is used to replace v_call, please remember to use this *_genotype.fasta file generated previously by TIgGER as the subject-specific IGHV gene germline. An example of this application can be found in the Clonal assignment pipeline section.

Clonal threshold inference pipeline

Performs automated detection of the clonal assignment threshold.

Usage: shazam-threshold [options]

-d DB, --db=DB: Tabulated data file, in Change-O (TAB) or AIRR format (TSV).
-m METHOD, --method=METHOD: Threshold inferrence to use. One of gmm, density, or none. If none, the distance-to-nearest distribution is plotted without threshold detection. Defaults to density.
-n NAME, --name=NAME: Sample name or run identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.
-o OUTDIR, --outdir=OUTDIR: Output directory. Will be created if it does not exist. Defaults to the current working directory.
-f FORMAT, --format=FORMAT: File format. One of ‘airr’ (default) or ‘changeo’.
-p NPROC, --nproc=NPROC: Number of subprocesses for multiprocessing tools. Defaults to the available processing units.
--model=MODEL: Model to use for the gmm model. One of gamma-gamma, gamma-norm, norm-norm or norm-gamma. Defaults to gamma-gamma.
--cutoff=CUTOFF: Method to use for threshold selection. One of optimal, intersect or user. Defaults to optimal.
--spc=SPC: Specificity required for threshold selection. Applies only when method=’gmm’ and cutoff=’user’. Defaults to 0.995.
--subsample=SUBSAMPLE: Number of distances to downsample the data to before threshold calculation. By default, subsampling is not performed.
--repeats=REPEATS: Number of times to recalculate. Defaults to 1.
-h, --help: Show this help message and exit

Example: shazam-threshold

# Arguments
DATA_DIR=~/project
DB=/data/changeo/sample/sample_genotyped.tab
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:devel \
    shazam-threshold -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-devel.sif \
    shazam-threshold -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

Clonal assignment pipeline

Assigns Ig sequences into clonally related lineages and builds full germline sequences.

If the TIgGER, or another package, was applied previously to the data set for identifying a subject-specific genotype, including potentially novel V, D and/or J genes, a new directory $NEW_REF with the personalized germline database should be created. For example, if TIgGER was run to identify a subject-specific IGHV genotype, the directory would contain: 1) *_genotype.fasta file generated previously by TIgGER, which contains the subject-specific germline IGHV genes 2) imgt_human_IGHD.fasta and imgt_human _IGHJ.fasta, which contain the IMGT IGHD and IGHJ genes and can both be copied from the original germline database: /usr/local/share/germlines/imgt/human/vdj/. When changeo-clone is called, this new personalized germline database should be passed with parameter -r (see example below). And please remember to update v_call column with subject-specific IGHV call (for TIgGER this is found in v_call_genotyped column).

# update v_call
db %>%
   dplyr::mutate(v_call = v_call_genotyped) %>%
   select(-v_call_genotyped)

Usage: changeo-clone [OPTIONS]

-d: Change-O formatted TSV (TAB) file.
-x: Distance threshold for clonal assignment.
-m: Distance model for clonal assignment. Defaults to the nucleotide Hamming distance model (ham).
-r: Directory containing IMGT-gapped reference germlines. Defaults to /usr/local/share/germlines/imgt/human/vdj.
-n: Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.
-o: Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
-f: Output format. One of airr (default) or changeo.
-p: Number of subprocesses for multiprocessing tools. Defaults to the available cores.
-a: Specify to clone the full data set. By default the data will be filtering to only productive/functional sequences.
-z: Specify to disable cleaning and compression of temporary files.
-h: This message.

Example: changeo-clone

# Arguments
DATA_DIR=~/project
DB=/data/changeo/sample/sample_genotyped.tab
DIST=0.15
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:devel \
    changeo-clone -d $DB -x $DIST -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-devel.sif \
    changeo-clone -d $DB -x $DIST -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

Example: changeo-clone with personalized germline database

# Arguments
DATA_DIR=~/project
NEW_REF=/data/personalized_germlines
DB=/data/changeo/sample/sample_genotyped.tab
DIST=0.15
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:devel changeo-clone -r $NEW_REF -d $DB -x $DIST -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-devel.sif changeo-clone -r $NEW_REF -d $DB -x $DIST -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC