Pipeline Templates
You can always run your own pipeline scripts through the container, but the
container also includes a set of predefined pipeline scripts that can be run as
is or extended to your needs. Each pipeline script has a -h argument which
will explain its use. The available pipelines are:
preprocess-phix
presto-abseq
presto-clontech
presto-clontech-umi
changeo-10x
changeo-igblast
tigger-genotype
shazam-threshold
changeo-clone
All template pipeline scripts can be found in /usr/local/bin.
PhiX cleaning pipeline
Removes reads from a sequence file that align against the PhiX174 reference genome.
- Usage: preprocess-phix [OPTIONS]
- -s
FASTQ sequence file.
- -r
Directory containing phiX174 reference db. Defaults to /usr/local/share/phix.
- -n
Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.
- -o
Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
- -p
Number of subprocesses for multiprocessing tools. Defaults to the available cores.
- -h
This message.
Example: preprocess-phix
# Arguments
DATA_DIR=~/project
READS=/data/raw/sample.fastq
OUT_DIR=/data/presto/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    preprocess-phix -s $READS -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    preprocess-phix -s $READS -o $OUT_DIR -p $NPROC
Note
The PhiX cleaning pipeline will convert the sequence headers to
the pRESTO format. Thus, if the nophix output file is provided as
input to the presto-abseq pipeline script, you must pass the argument
-x presto to presto-abseq, which will tell the
script that the input headers are in pRESTO format (rather than the
Illumina format).
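As a minimal sketch of the note above, the following prints a presto-abseq command for PhiX-cleaned reads. The NOPHIX_R1 and NOPHIX_R2 paths, the /data/phix directory, and the abseq_cmd helper are assumptions made for illustration only; substitute the actual nophix files written by preprocess-phix. Since preprocess-phix takes a single FASTQ file via -s, paired reads would presumably be cleaned one file at a time.

```shell
# Sketch: print the presto-abseq command for PhiX-cleaned input.
# NOPHIX_R1 and NOPHIX_R2 are hypothetical paths; use the nophix files
# that preprocess-phix actually wrote for each read.
DATA_DIR=~/project
NOPHIX_R1=/data/phix/sample_R1_nophix.fastq
NOPHIX_R2=/data/phix/sample_R2_nophix.fastq
SAMPLE_NAME=sample
OUT_DIR=/data/presto/sample
NPROC=4

abseq_cmd() {
    # -x presto declares that the input headers are already in pRESTO
    # format rather than the default Illumina format.
    echo docker run -v "$DATA_DIR:/data:z" immcantation/suite:4.3.0 \
        presto-abseq -1 "$NOPHIX_R1" -2 "$NOPHIX_R2" -x presto \
        -n "$SAMPLE_NAME" -o "$OUT_DIR" -p "$NPROC"
}

# Print the command for review before running it.
abseq_cmd
```

The helper only echoes the command, so it can be inspected without Docker installed; remove the echo to execute it.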
NEBNext / AbSeq immune sequencing kit preprocessing pipeline
A start to finish pRESTO processing script for NEBNext / AbSeq immune sequencing data.
An example for human BCR processing is shown below. Primer sequences are available from the
Immcantation repository under protocols/AbSeq or inside the container under
/usr/local/share/protocols/AbSeq. Mouse primers are not supplied.
TCR V gene references can be specified with the flag
-r /usr/local/share/igblast/fasta/imgt_human_tr_v.fasta.
- Usage: presto-abseq [OPTIONS]
- -1
Read 1 FASTQ sequence file (sequence beginning with the C-region or J-segment).
- -2
Read 2 FASTQ sequence file (sequence beginning with the leader or V-segment).
- -j
Read 1 FASTA primer sequences. Defaults to /usr/local/share/protocols/AbSeq/AbSeq_R1_Human_IG_Primers.fasta.
- -v
Read 2 FASTA primer or template switch sequences. Defaults to /usr/local/share/protocols/AbSeq/AbSeq_R2_TS.fasta.
- -c
C-region FASTA sequences for the C-region internal to the primer. If unspecified, internal C-region alignment is not performed.
- -r
V-segment reference file. Defaults to /usr/local/share/igblast/fasta/imgt_human_ig_v.fasta.
- -y
YAML file providing description fields for report generation.
- -n
Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the read 1 filename.
- -o
Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
- -x
The mate-pair coordinate format of the raw data. Defaults to illumina.
- -p
Number of subprocesses for multiprocessing tools. Defaults to the available cores.
- -h
This message.
One of the requirements for generating the report at the end of the pRESTO pipeline is a YAML
file containing information about the data and processing. Valid fields are shown in the example
sample.yaml below, although no fields are strictly required:
sample.yaml
title: "pRESTO Report: CD27+ B cells from subject HD1"
author: "Your Name"
version: "0.5.4"
description: "Memory B cells (CD27+)."
sample: "HD1"
run: "ABC123"
date: "Today"
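The description file can also be generated per sample from a script. The following is a minimal sketch that writes the fields shown above with a shell here-document; the SAMPLE, RUN, and YAML_FILE variable names and the ISO date format are assumptions for illustration, not requirements of the pipeline.

```shell
# Sketch: write a report description file to pass to presto-abseq via -y.
# SAMPLE, RUN and YAML_FILE are illustrative names; edit the values to
# describe your own data.
set -eu

SAMPLE="HD1"
RUN="ABC123"
YAML_FILE="${YAML_FILE:-sample.yaml}"

# The date field is filled with today's date in YYYY-MM-DD form, which is
# an assumption; any descriptive string can be used.
cat > "$YAML_FILE" <<EOF
title: "pRESTO Report: CD27+ B cells from subject ${SAMPLE}"
author: "Your Name"
version: "0.5.4"
description: "Memory B cells (CD27+)."
sample: "${SAMPLE}"
run: "${RUN}"
date: "$(date +%Y-%m-%d)"
EOF

echo "Wrote $YAML_FILE"
```

Write the file somewhere under the directory mounted into the container (for example $DATA_DIR) so that the path given to -y resolves inside it.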
Example: presto-abseq
# Arguments
DATA_DIR=~/project
READS_R1=/data/raw/sample_R1.fastq
READS_R2=/data/raw/sample_R2.fastq
YAML=/data/sample.yaml
SAMPLE_NAME=sample
OUT_DIR=/data/presto/sample
NPROC=4

# Docker command
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    presto-abseq -1 $READS_R1 -2 $READS_R2 -y $YAML \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    presto-abseq -1 $READS_R1 -2 $READS_R2 -y $YAML \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC
Takara Bio / Clontech SMARTer v1 immune sequencing kit preprocessing pipeline
A start to finish pRESTO processing script for Takara Bio / Clontech SMARTer v1 immune
sequencing kit data. C-regions are assigned using the universal C-region primer sequences, which
are available from the Immcantation repository under protocols/Universal or inside the
container under /usr/local/share/protocols/Universal.
- Usage: presto-clontech [OPTIONS]
- -1
Read 1 FASTQ sequence file. Sequence beginning with the C-region.
- -2
Read 2 FASTQ sequence file. Sequence beginning with the leader.
- -j
C-region reference sequences (reverse complemented). Defaults to /usr/local/share/protocols/Universal/Mouse_IG_CRegion_RC.fasta.
- -r
V-segment reference file. Defaults to /usr/local/share/igblast/fasta/imgt_mouse_ig_v.fasta.
- -y
YAML file providing description fields for report generation.
- -n
Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the read 1 filename.
- -o
Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
- -x
The mate-pair coordinate format of the raw data. Defaults to illumina.
- -p
Number of subprocesses for multiprocessing tools. Defaults to the available cores.
- -h
This message.
Example: presto-clontech
# Arguments
DATA_DIR=~/project
READS_R1=/data/raw/sample_R1.fastq
READS_R2=/data/raw/sample_R2.fastq
CREGION=/usr/local/share/protocols/Universal/Human_IG_CRegion_RC.fasta
VREF=/usr/local/share/igblast/fasta/imgt_human_ig_v.fasta
SAMPLE_NAME=sample
OUT_DIR=/data/presto/sample
NPROC=4

# Docker command
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    presto-clontech -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    presto-clontech -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC
Takara Bio / Clontech SMARTer v2 (UMI) immune sequencing kit preprocessing pipeline
A start to finish pRESTO processing script for Takara Bio / Clontech SMARTer v2 immune
sequencing kit data that includes UMIs. C-regions are assigned using the universal C-region
primer sequences, which are available from the Immcantation repository under
protocols/Universal or inside the container under /usr/local/share/protocols/Universal.
- Usage: presto-clontech-umi [OPTIONS]
- -1
Read 1 FASTQ sequence file. Sequence beginning with the C-region.
- -2
Read 2 FASTQ sequence file. Sequence beginning with the leader.
- -j
C-region reference sequences (reverse complemented). Defaults to /usr/local/share/protocols/Universal/Human_IG_CRegion_RC.fasta.
- -r
V-segment reference file. Defaults to /usr/local/share/igblast/fasta/imgt_human_ig_v.fasta.
- -n
Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the read 1 filename.
- -o
Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
- -x
The mate-pair coordinate format of the raw data. Defaults to illumina.
- -p
Number of subprocesses for multiprocessing tools. Defaults to the available cores.
- -h
This message.
Example: presto-clontech-umi
# Arguments
DATA_DIR=~/project
READS_R1=/data/raw/sample_R1.fastq
READS_R2=/data/raw/sample_R2.fastq
CREGION=/usr/local/share/protocols/Universal/Human_IG_CRegion_RC.fasta
VREF=/usr/local/share/igblast/fasta/imgt_human_ig_v.fasta
SAMPLE_NAME=sample
OUT_DIR=/data/presto/sample
NPROC=4

# Docker command
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    presto-clontech-umi -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    presto-clontech-umi -1 $READS_R1 -2 $READS_R2 -j $CREGION -r $VREF \
    -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC
10x Genomics V(D)J annotation pipeline
Assigns new annotations to, and infers clonal relationships for, 10x Genomics single-cell V(D)J data output by Cell Ranger.
- Usage: changeo-10x [OPTIONS]
- -s
FASTA or FASTQ sequence file.
- -a
10x Genomics cellranger-vdj contig annotation CSV file. Must correspond to the FASTA/FASTQ input file (all, filtered or consensus).
- -r
Directory containing IMGT-gapped reference germlines. Defaults to /usr/local/share/germlines/imgt/[species name]/vdj.
- -g
Species name. One of human, mouse, rabbit, rat, or rhesus_monkey. Defaults to human.
- -t
Receptor type. One of ig or tr. Defaults to ig.
- -x
Distance threshold for clonal assignment. Specify “auto” for automatic detection. If unspecified, clonal assignment is not performed.
- -m
Distance model for clonal assignment. Defaults to the nucleotide Hamming distance model (ham).
- -b
IgBLAST IGDATA directory, which contains the IgBLAST database, optional_file and auxillary_data directories. Defaults to /usr/local/share/igblast.
- -n
Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the sequence filename.
- -o
Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
- -f
Output format. One of changeo or airr. Defaults to airr.
- -p
Number of subprocesses for multiprocessing tools. Defaults to the available cores.
- -i
Specify to allow partial alignments.
- -z
Specify to disable cleaning and compression of temporary files.
- -h
This message.
Example: changeo-10x
# Arguments
DATA_DIR=~/project
READS=/data/raw/sample_filtered_contig.fasta
ANNOTATIONS=/data/raw/sample_filtered_contig_annotations.csv
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
DIST=auto
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    changeo-10x -s $READS -a $ANNOTATIONS -x $DIST -n $SAMPLE_NAME \
    -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    changeo-10x -s $READS -a $ANNOTATIONS -x $DIST -n $SAMPLE_NAME \
    -o $OUT_DIR -p $NPROC
IgBLAST annotation pipeline
Performs V(D)J alignment using IgBLAST and post-processes the output into the Change-O data standard.
- Usage: changeo-igblast [OPTIONS]
- -s
FASTA or FASTQ sequence file.
- -r
Directory containing IMGT-gapped reference germlines. Defaults to /usr/local/share/germlines/imgt/[species name]/vdj.
- -g
Species name. One of human, mouse, rabbit, rat, or rhesus_monkey. Defaults to human.
- -t
Receptor type. One of ig or tr. Defaults to ig.
- -b
IgBLAST IGDATA directory, which contains the IgBLAST database, optional_file and auxillary_data directories. Defaults to /usr/local/share/igblast.
- -n
Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the sequence filename.
- -o
Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
- -f
Output format. One of airr or changeo. Defaults to airr.
- -p
Number of subprocesses for multiprocessing tools. Defaults to the available cores.
- -k
Specify to filter the output to only productive/functional sequences.
- -i
Specify to allow partial alignments.
- -z
Specify to disable cleaning and compression of temporary files.
- -h
This message.
Example: changeo-igblast
# Arguments
DATA_DIR=~/project
READS=/data/presto/sample/sample-final_collapse-unique_atleast-2.fastq
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    changeo-igblast -s $READS -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    changeo-igblast -s $READS -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC
Genotyping pipeline
Infers V segment genotypes using TIgGER.
- Usage: tigger-genotype [options]
- -d DB, --db=DB
Change-O formatted TSV (TAB) file.
- -r REF, --ref=REF
FASTA file containing IMGT-gapped V segment reference germlines. Defaults to /usr/local/share/germlines/imgt/human/vdj/imgt_human_IGHV.fasta.
- -v VFIELD, --vfield=VFIELD
Name of the output field containing genotyped V assignments. Defaults to V_CALL_GENOTYPED.
- -x MINSEQ, --minseq=MINSEQ
Minimum number of sequences in the mutation/coordinate range. Samples with insufficient sequences will be excluded. Defaults to 50.
- -y MINGERM, --mingerm=MINGERM
Minimum number of sequences required to analyze a germline allele. Defaults to 200.
- -n NAME, --name=NAME
Sample name or run identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.
- -o OUTDIR, --outdir=OUTDIR
Output directory. Will be created if it does not exist. Defaults to the current working directory.
- -f FORMAT, --format=FORMAT
File format. One of ‘airr’ (default) or ‘changeo’.
- -p NPROC, --nproc=NPROC
Number of subprocesses for multiprocessing tools. Defaults to the available processing units.
- -h, --help
Show this help message and exit
Example: tigger-genotype
# Arguments
DATA_DIR=~/project
DB=/data/changeo/sample/sample_db-pass.tab
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    tigger-genotype -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    tigger-genotype -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC
Clonal threshold inference pipeline
Performs automated detection of the clonal assignment threshold.
- Usage: shazam-threshold [options]
- -d DB, --db=DB
Tabulated data file, in Change-O (TAB) or AIRR format (TSV).
- -m METHOD, --method=METHOD
Threshold inference method to use. One of gmm, density, or none. If none, the distance-to-nearest distribution is plotted without threshold detection. Defaults to density.
- -n NAME, --name=NAME
Sample name or run identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.
- -o OUTDIR, --outdir=OUTDIR
Output directory. Will be created if it does not exist. Defaults to the current working directory.
- -f FORMAT, --format=FORMAT
File format. One of ‘airr’ (default) or ‘changeo’.
- -p NPROC, --nproc=NPROC
Number of subprocesses for multiprocessing tools. Defaults to the available processing units.
- --model=MODEL
Model to use for the gmm method. One of gamma-gamma, gamma-norm, norm-norm or norm-gamma. Defaults to gamma-gamma.
- --subsample=SUBSAMPLE
Number of distances to downsample the data to before threshold calculation. By default, subsampling is not performed.
- --repeats=REPEATS
Number of times to recalculate. Defaults to 1.
- -h, --help
Show this help message and exit
Example: shazam-threshold
# Arguments
DATA_DIR=~/project
DB=/data/changeo/sample/sample_genotyped.tab
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    shazam-threshold -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    shazam-threshold -d $DB -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC
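The example above relies on the default density method. As a rough sketch of how the gmm options listed in the usage might be combined, the following prints a command that selects the method with -m and passes --model and --subsample. The threshold_cmd helper and the subsample size of 10000 are illustrative assumptions, not recommended settings.

```shell
# Sketch: print a shazam-threshold command that uses the gmm method.
DATA_DIR=~/project
DB=/data/changeo/sample/sample_genotyped.tab
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

threshold_cmd() {
    # $1: gmm model (gamma-gamma, gamma-norm, norm-norm or norm-gamma)
    # $2: number of distances to subsample to (illustrative value)
    echo docker run -v "$DATA_DIR:/data:z" immcantation/suite:4.3.0 \
        shazam-threshold -d "$DB" -m gmm --model="$1" --subsample="$2" \
        -n "$SAMPLE_NAME" -o "$OUT_DIR" -p "$NPROC"
}

# Print the command for review before running it.
threshold_cmd gamma-norm 10000
```

Because the helper only echoes, the resulting command line can be checked against the output of shazam-threshold -h before it is run.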
Clonal assignment pipeline
Assigns Ig sequences into clonally related lineages and builds full germline sequences.
- Usage: changeo-clone [OPTIONS]
- -d
Change-O formatted TSV (TAB) file.
- -x
Distance threshold for clonal assignment.
- -m
Distance model for clonal assignment. Defaults to the nucleotide Hamming distance model (ham).
- -r
Directory containing IMGT-gapped reference germlines. Defaults to /usr/local/share/germlines/imgt/human/vdj.
- -n
Sample identifier which will be used as the output file prefix. Defaults to a truncated version of the input filename.
- -o
Output directory. Will be created if it does not exist. Defaults to a directory matching the sample identifier in the current working directory.
- -f
Output format. One of airr (default) or changeo.
- -p
Number of subprocesses for multiprocessing tools. Defaults to the available cores.
- -a
Specify to clone the full data set. By default, the data will be filtered to only productive/functional sequences.
- -z
Specify to disable cleaning and compression of temporary files.
- -h
This message.
Example: changeo-clone
# Arguments
DATA_DIR=~/project
DB=/data/changeo/sample/sample_genotyped.tab
DIST=0.15
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4

# Run pipeline in docker image
docker run -v $DATA_DIR:/data:z immcantation/suite:4.3.0 \
    changeo-clone -d $DB -x $DIST -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC

# Singularity command
singularity exec -B $DATA_DIR:/data immcantation_suite-4.3.0.sif \
    changeo-clone -d $DB -x $DIST -n $SAMPLE_NAME -o $OUT_DIR -p $NPROC
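The annotation, genotyping, threshold, and clonal assignment templates are typically run one after another. The following is a minimal sketch of that chain using the Docker invocation from the examples above. The in_container helper and the DRY_RUN switch are assumptions introduced for illustration, and the intermediate file names are copied from the examples on this page; the names actually written may differ (for instance with the output format chosen via -f), so check the output directory after each step.

```shell
# Sketch: chain changeo-igblast, tigger-genotype, shazam-threshold and
# changeo-clone. With DRY_RUN=1 (the default here) each command is
# printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"
DATA_DIR=~/project
IMAGE=immcantation/suite:4.3.0
READS=/data/presto/sample/sample-final_collapse-unique_atleast-2.fastq
SAMPLE_NAME=sample
OUT_DIR=/data/changeo/sample
NPROC=4
DIST=0.15   # placeholder; pick the value from the shazam-threshold results

# Run (or just print) one template script inside the container.
in_container() {
    if [ "$DRY_RUN" = "1" ]; then
        echo docker run -v "$DATA_DIR:/data:z" "$IMAGE" "$@"
    else
        docker run -v "$DATA_DIR:/data:z" "$IMAGE" "$@"
    fi
}

# 1. V(D)J annotation with IgBLAST.
in_container changeo-igblast -s "$READS" \
    -n "$SAMPLE_NAME" -o "$OUT_DIR" -p "$NPROC"

# 2. Genotype inference (input name taken from the tigger-genotype example).
in_container tigger-genotype -d "$OUT_DIR/${SAMPLE_NAME}_db-pass.tab" \
    -n "$SAMPLE_NAME" -o "$OUT_DIR" -p "$NPROC"

# 3. Threshold inference (input name taken from the shazam-threshold example).
in_container shazam-threshold -d "$OUT_DIR/${SAMPLE_NAME}_genotyped.tab" \
    -n "$SAMPLE_NAME" -o "$OUT_DIR" -p "$NPROC"

# 4. Clonal assignment using the chosen distance threshold.
in_container changeo-clone -d "$OUT_DIR/${SAMPLE_NAME}_genotyped.tab" \
    -x "$DIST" -n "$SAMPLE_NAME" -o "$OUT_DIR" -p "$NPROC"
```

When running for real (DRY_RUN=0), stop after step 3 to inspect the threshold results and set DIST before running step 4, rather than relying on the placeholder value.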