10x Genomics V(D)J Sequence Analysis Tutorial¶
This tutorial is a basic walkthrough for defining B cell clonal families and building B cell lineage trees using 10x Genomics BCR sequencing data. It is intended for users without prior experience with Immcantation. If you are familiar with Immcantation, then this page may be more useful.
Knowledge of basic command line usage is assumed. Please check out the individual documentation sites for the functions detailed in this tutorial before using them on your own data. For simplicity, this tutorial will use the Immcantation Docker image which contains all necessary software. It is also possible to install the packages being used separately (see pRESTO, Change-O, and Alakazam).
Please contact us if you have any questions.
First, download and unzip the example data. It represents the Ig V(D)J sequences from CD19+ B cells isolated from PBMCs of a healthy human donor, and is based on data provided by 10x Genomics under a Creative Commons Attribute license, and processed with their Cell Ranger pipeline.
Second, install Docker (if you don’t have it already) and download the Immcantation Docker image. For some operating systems, it may be necessary to use super-user privileges (sudo), and/or to have Docker Desktop running before entering the following commands.
In a terminal, enter:
# download the current Immcantation Docker image (may take a few minutes) docker pull immcantation/suite:4.1.0
Within the terminal, move to the directory where you’ve placed the example data using the command
Load the current directory into the Docker image:
# Linux/Mac OS X docker run -it --workdir /data -v $(pwd):/data:z immcantation/suite:4.1.0 bash # Windows docker run -it --workdir /data -v %cd%:/data:z immcantation/suite:4.1.0 bash
After running the previous command, you’ll now be in the mounted /data folder inside the container. To check that everything is properly configured, enter the following commands:
BuildTrees.py --version # should return BuildTrees.py: 1.0.0 2020.05.01 ls # should show filtered_contig_annotations.csv and filtered_contig.fasta, possibly others
If the first command doesn’t return the expected output, you probably aren’t inside the right (or any) Docker container.
If the second doesn’t return the expected output, you may not be running the Docker image from the correct directory.
Exit the image by typing
exit then try again by navigating to the proper directory and rerunning the command above
to enter the Docker image again.
Assign V, D, and J genes and define clonal groups¶
Most of the processing for 10x Genomics data can be handled by the
changeo-10x script supplied in the Docker container.
This script will automatically:
Define clonal groups (see the next section)
Gather and compress intermediate files
To run this script on the example dataset, enter the following command in the Docker container (the
\ just indicates a new line for visual clarity):
# Run 10x Genomics processing script changeo-10x -s filtered_contig.fasta -a filtered_contig_annotations.csv -o . \ -g human -t ig -x 0.1
-o option refers to the output directory of the processing. The
refer to the sequence and sequence annotation file outputs from Cell Ranger respectively. The
-g option indicates
species and the
-t option indicates the type of receptor. The
-x option specifies junction distance threshold
used for assigning sequences into clonal clusters.
This script will create the following files (in addition to
It will also create a /logs directory containing:
For a full listing of script options, see the 10x Genomics V(D)J annotation pipeline. It is also important to note that this pipeline uses the standard IMGT reference database of human alleles. To infer novel alleles and subject-specific genotypes, which would result in more accurate assignments, see TIgGER.
Define clonal groups manually¶
Clonal groups are B cells that descend from a common naive B cell ancestor. To group sequences into inferred clonal groups, we cluster BCR sequences that have the same heavy chain V and J genes and same junction length. We next cluster sequences with similar junction regions, using either a defined sequence distance cutoff, or an adaptive threshold (SCOPer). When available, we can also split clonal groups that have differing light chain V and J genes.
In the previous section, we used a predefined clonal clustering threshold of
-x option in the
This is not appropriate for all datasets. The current best practice is to find the
appropriate threshold for a given dataset, which can be done automatically in the
changeo-10x script by specifying
-x auto to assign clones doesn’t always work
(e.g. if there weren’t enough clones to generate a bimodal distance to nearest plot). If this command fails,
there are other options for manually defining clones from the file
changeo-10x is run successfully above, this file will be in
Otherwise it will be in the current working directory.
The first is by inspecting a plot of sequence distances.
This is supplied in the file
filtered_contig_threshold-plot.pdf. You can then define clones manually
using the chosen threshold (e.g.
# define heavy chain clones DefineClones.py -d filtered_contig_heavy_productive-T.tsv --act set --model ham \ --norm len --dist 0.09 --outname filtered_contig_heavy
If the sequence distance plot is not bimodal, it may be more appropriate to instead use SCOPer
to assign clones using an adaptive threshold. In order to be able to directly copy/paste the commands provided in this tutorial,
be sure to rename the output file
filtered_contig_heavy_clone-pass.tsv (to match the output of
Once we have defined clonal groups using heavy chains, we can split these groups based on whether or not they have differing light chain V and J genes:
# split heavy chain clones with different light chains light_cluster.py -d filtered_contig_heavy_clone-pass.tsv -e filtered_contig_light_productive-T.tsv \ -o filtered_contig_heavy_clone-light.tsv
We can also reconstruct the heavy chain germline V and J genes (using the output file from the previous command):
# reconstruct heavy chain germline V and J sequences CreateGermlines.py -d filtered_contig_heavy_clone-light.tsv -g dmask --cloned \ -r /usr/local/share/germlines/imgt/human/vdj/imgt_human_IGHV.fasta \ /usr/local/share/germlines/imgt/human/vdj/imgt_human_IGHD.fasta \ /usr/local/share/germlines/imgt/human/vdj/imgt_human_IGHJ.fasta \ --outname filtered_contig_heavy
This results in the file
filtered_contig_heavy_germ-pass.tsv which contains heavy chain sequence
information derived from
filtered_contig_heavy_clone-light.tsv with an additional column
specifying the clonal group of the sequence.
Build lineage trees¶
Lineage trees represent the series of shared and unshared mutations leading from clone’s germline sequence to the observed sequence data. There are multiple ways of building and visualizing these trees. Currently the simplest way within Immcantation is to use Alakazam, which is built around building maximum parsimony trees using PHYLIP. Alternatively, you can use IgPhyML, which builds maximum likelihood trees with B cell specific models. Here we use IgPhyML.
To run IgPhyML from within the Docker container, use the
BuildTrees.py -d filtered_contig_heavy_germ-pass.tsv --minseq 3 --clean all \ --igphyml --collapse --nproc 2 --asr 0.9
This will remove clones with fewer than 3 unique sequences (
--minseq 3), run IgPhyML (
--igphyml) parallelized across 2 cores
--nproc 2) and collapse identical sequences (
--collapse). It will also reconstruct the maximum likelihood intermediate sequences for
each node (
--asr 0.9). The number following
--asr controls the amount of reported model uncertainty (range from 0-1, 0.9 recommended).
--clean all deletes all intermediate files from this operation. This is a computationally intensive task and may take a few minutes.
The following commands in this section are meant to be entered into an
R session. Open
R within the Docker container
using the command
R. Once inside the
R session, load the appropriate libraries and read in the data:
library(alakazam) library(ape) library(dplyr) # read in the data db <- readIgphyml("filtered_contig_heavy_germ-pass_igphyml-pass.tab", format="phylo", branches="mutations")
Once built, we can visualize these trees using the R package
ape. Here, we only visualize the largest tree using the default parameters.
However, there are many ways to make more lineage tree plots, as detailed in Alakazam’s
Enter into the
R session and save the largest tree as a png image:
png("graph.png",width=8,height=6,unit="in",res=300) plot(db$trees[],show.node.label=TRUE) add.scale.bar(length=5) dev.off()
The internal nodes of this tree represent inferred intermediate sequences, while the edge lengths represent the expected number of heavy chain mutations between the nodes (see scale bar to left). If you prefer more graph-based trees, these are also detailed in Alakazam’s lineage vignette.
The reconstructed intermediate sequences for each node shown in the tree are available in the file
filtered_contig_heavy_germ-pass_igphyml-pass_hlp_asr.fasta. Each possible codon has a certain probability of occuring at each site in the sequence.
The number following
BuildTrees specifies the probability interval desired for each site. For instance,
--asr 0.8 and the relative probability of codon
ATG is 0.5 and
ATA is 0.4, IgPhyML would return
R is the IUPAC ambiguous nucleotide for A and G. These characters represent ambiguity in the reconstruction, and are particularly common in the CDR3 region:
Merge Cell Ranger annotations¶
As detailed in the Change-O reference, it is also possible to directly merge Change-O data tables with annotation information from the Cell Ranger pipeline.