10x Genomics V(D)J Sequence Analysis with Immcantation Tutorial

Overview

This tutorial is a basic walkthrough for defining B cell clonal families and building B cell lineage trees using 10x Genomics BCR sequencing data.

Knowledge of basic command line usage is assumed. Please check out the individual documentation sites for the functions detailed in this tutorial before using them on your own data.

  • For simplicity, this tutorial will use the Immcantation Lab Docker image which contains Jupyter notebook and all necessary software to run this code.

  • You can install the current Docker image with docker pull immcantation/lab:devel

  • For some operating systems, it may be necessary to use super-user privileges (sudo), and/or to have Docker Desktop running before entering the previous command.

  • It is also possible to install the packages being used separately (see pRESTO, Change-O, and Alakazam).

The primary steps of processing 10x single cell BCR data include:

  • Assigning V, D, and J genes using IgBLAST

  • Filtering records

  • Finding thresholds for cloning (automatically or manually)

  • Defining clonal groups

  • Running CreateGermlines

  • Calculating SHM frequency in the heavy chain V genes

You may also reference this page for a template pipeline script to process 10x data with Immcantation’s changeo-10x template script.

If you have any questions about this tutorial or our software, you can email us at immcantation@googlegroups.com.

Getting started

The example data is already in the container (/home/magus/data/10x_data_2subj/). If you want to, you can download it from example data and unzip it.

  • It represents the Ig V(D)J sequences from CD19+ B cells isolated from PBMCs of two healthy human donors provided by 10x Genomics under a Creative Commons Attribute license, and processed with their Cell Ranger pipeline.

  • You will find two files called filtered_contig.fasta and filtered_contig_annotations.csv in the unzipped directory (10x_data_2subj unless you rename it). They are the direct Cell Ranger output files for donor 1. We are going to use the Ig V(D)J sequences from donor 1 to show how to process V(D)J data using Immcantation.

  • You will also find a file called sc5p_v2_hs_B_1k_multi_5gex_b_Multiplex_vdj_b_all_contig_igblast_db-pass.tsv. This file contains processed Ig V(D)J sequences for donor 2, with assigned V, D, and J genes using AssignGenes.py and MakeDb.py. In the section of “Find the distance threshold for cloning automatically”, those Ig V(D)J sequences together with the sequences from donor 1, were used to calculate the nearest neighbor distances across subjects.

[1]:
%%bash
ls ../data/10x_data_2subj
filtered_contig.fasta
filtered_contig_annotations.csv
sc5p_v2_hs_B_1k_multi_5gex_b_Multiplex_vdj_b_all_contig_igblast_db-pass.tsv

Use the command versions report to list the versions of the of the Immcantation tools installed in the container:

[2]:
%%bash
versions report
immcantation: devel
date: 2022.12.15

presto: 0.7.1
changeo: 1.3.0
alakazam: 1.2.1.999
shazam: 1.1.2.999
tigger: 1.0.1.999
scoper: 1.2.1.999
dowser: 1.1.1
enchantr: 0.0.6
prestor: 0.0.7
rabhit: 0.1.5
rdi: 1.0.0
igphyml: 1.1.5
seurat: 4.3.0

airr-py: 1.4.1
airr-r: 1.4.1
blast: 2.13.0
cd-hit: 4.8.1
igblast: 1.20.0
muscle: 3.8.425
phylip: 3.697
vsearch: 2.13.6

Assign V, D, and J genes using IgBLAST

To process 10x V(D)J data, a combination of AssignGenes.py and MakeDb.py can be used to generate a TSV file compliant with the AIRR Community Rearrangement schema that incorporates annotation information provided by the Cell Ranger pipeline. The files of filtered_contig.fasta and filtered_contig_annotations.csv, generated by cellranger vdj, can be found in the outs directory.

Generate AIRR Rearrangement data from the 10x V(D)J FASTA files using the steps below (the \ just indicates a new line for visual clarity):

[3]:
%%bash
# assign V, D, and J genes using IgBLAST
AssignGenes.py igblast -s ../data/10x_data_2subj/filtered_contig.fasta -b /usr/local/share/igblast \
   --organism human --loci ig --format blast --outdir results
   START> AssignGenes
 COMMAND> igblast
 VERSION> 1.20.0
    FILE> filtered_contig.fasta
ORGANISM> human
    LOCI> ig
   NPROC> 4

PROGRESS> 06:54:54 |Done                     | 10.4 min

  PASS> 33338
OUTPUT> filtered_contig_igblast.fmt7
   END> AssignGenes

[4]:
%%bash
# convert IgBLAST output to AIRR format
MakeDb.py igblast -i results/filtered_contig_igblast.fmt7 -s ../data/10x_data_2subj/filtered_contig.fasta \
   -r /usr/local/share/germlines/imgt/human/vdj/imgt_human_*.fasta \
   --10x ../data/10x_data_2subj/filtered_contig_annotations.csv --extended
         START> MakeDB
       COMMAND> igblast
  ALIGNER_FILE> filtered_contig_igblast.fmt7
      SEQ_FILE> filtered_contig.fasta
       ASIS_ID> False
    ASIS_CALLS> False
      VALIDATE> strict
      EXTENDED> True
INFER_JUNCTION> False

PROGRESS> 06:54:55 |Done                | 0.0 min

PROGRESS> 06:55:30 |####################| 100% (33,338) 0.6 min

OUTPUT> filtered_contig_igblast_db-pass.tsv
  PASS> 21956
  FAIL> 11382
   END> MakeDb

[5]:
%%bash
ls results
alltrees.pdf
changeo
dowser_tutorial
filtered_contig_igblast.fmt7
filtered_contig_igblast_db-pass.tsv
filtered_contig_igblast_db-pass_H.tsv
filtered_contig_igblast_db-pass_H_productive-F.tsv
filtered_contig_igblast_db-pass_H_productive-T.tsv
filtered_contig_igblast_db-pass_L.tsv
filtered_contig_igblast_db-pass_L_productive-F.tsv
filtered_contig_igblast_db-pass_L_productive-T.tsv
igblast
input_sample1-n200.fasta
tigger

After running these commands, you should now have filtered_contig_igblast_db-pass.tsv and filtered_contig_igblast.fmt7 in your results directory.

  • For a full listing of what the flags mean, see the commandline usage for AssignGenes.py igblast and MakeDb.py igblast. You can also read our “Using IgBLAST” which contains both commands.

  • The --10x filtered_contig_annotations.csv specifies the path of the contig annotations file generated by cellranger vdj, which can be found in the outs directory.

Please note that:

  • all_contig.fasta can be exchanged for filtered_contig.fasta, and all_contig_annotations.csv can be exchanged for filtered_contig_annotations.csv.

  • The resulting tab-delimited table overwrites the V, D and J gene assignments generated by Cell Ranger and uses those generated by IgBLAST or IMGT/HighV-QUEST instead.

  • To process mouse data and/or TCR data, alter the --organism and --loci arguments to AssignGenes.py accordingly (e.g., --organism mouse, --loci tcr) and use the appropriate V(D)J IMGT reference database (e.g., **IMGT_Mouse_TR*.fasta**)

Filter records

The remaining commands in this tutorial R. The next two lines of code are required to be able to use R magic and run R code in this Jupyter notebook.

[6]:
# R for Python
import rpy2.rinterface
%load_ext rpy2.ipython
[7]:
%%R
# load libraries
library(alakazam)
library(data.table)
library(dowser)
library(dplyr)
library(ggplot2)
library(scoper)
library(shazam)
R[write to console]: Loading required package: ggplot2

R[write to console]: As of v1.0.0 the AIRR Rearrangement schema is now the default file format.
A description of the standard is available at https://docs.airr-community.org.
The legacy Change-O format is supported through arguments to each function
that allow the input column names to be explicitly defined.

R[write to console]: data.table 1.14.6 using 2 threads (see ?getDTthreads).  Latest news: r-datatable.com

R[write to console]:
Attaching package: ‘dowser’


R[write to console]: The following object is masked from ‘package:alakazam’:

    ExampleDbChangeo


R[write to console]:
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:data.table’:

    between, first, last


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


R[write to console]: As of v1.0.0 the AIRR Rearrangement schema is now the default file format.
A description of the standard is available at https://docs.airr-community.org.
The legacy Change-O format is supported through arguments to each function
that allow the input column names to be explicitly defined.

R[write to console]:
Attaching package: ‘scoper’


R[write to console]: The following object is masked from ‘package:alakazam’:

    ExampleDb


R[write to console]: As of v1.0.0 the AIRR Rearrangement schema is now the default file format.
A description of the standard is available at https://docs.airr-community.org.
The legacy Change-O format is supported through arguments to each function
that allow the input column names to be explicitly defined.

Removing non-productive sequences

You may wish to subset your data to only productive sequences:

[8]:
%%R
# read in the data
data <- readChangeoDb('results/filtered_contig_igblast_db-pass.tsv')
data <- data %>% filter(productive)

cat(paste("There are", nrow(data), "rows in the data.\n"))
data %>% slice_sample(n = 5) # random examples
There are 20164 rows in the data.
# A tibble: 5 × 57
  sequenc…¹ seque…² rev_c…³ produ…⁴ v_call d_call j_call seque…⁵ germl…⁶ junct…⁷
  <chr>     <chr>   <lgl>   <lgl>   <chr>  <chr>  <chr>  <chr>   <chr>   <chr>
1 CTAACTTT… GAAGAA… FALSE   TRUE    IGLV4… <NA>   IGLJ2… CAGCTT… CAGCTT… TGTCAG…
2 CTAGCCTT… TGGGGG… FALSE   TRUE    IGLV2… <NA>   IGLJ1… CAGTCT… CAGTCT… TGCAGC…
3 CGCGGTAG… GGAGGT… FALSE   TRUE    IGHV3… IGHD3… IGHJ6… GAGGTG… GAGGTG… TGTGCG…
4 ACCAGTAA… TGGGGA… FALSE   TRUE    IGKV3… <NA>   IGKJ2… GAAATT… GAAATT… TGTCAC…
5 CAAGAAAA… TGGGGA… FALSE   TRUE    IGKV3… <NA>   IGKJ4… GAAATT… GAAATT… TGTCAG…
# … with 47 more variables: junction_aa <chr>, v_cigar <chr>, d_cigar <chr>,
#   j_cigar <chr>, stop_codon <lgl>, vj_in_frame <lgl>, locus <chr>,
#   c_call <chr>, junction_length <dbl>, np1_length <dbl>, np2_length <dbl>,
#   v_sequence_start <dbl>, v_sequence_end <dbl>, v_germline_start <dbl>,
#   v_germline_end <dbl>, d_sequence_start <dbl>, d_sequence_end <dbl>,
#   d_germline_start <dbl>, d_germline_end <dbl>, j_sequence_start <dbl>,
#   j_sequence_end <dbl>, j_germline_start <dbl>, j_germline_end <dbl>, …
# ℹ Use `colnames()` to see all variable names

Cells with multiple heavy chains

If your single cell data contains cells with multiple heavy chains, you need to handle it before calling clones (B cells that descend from a common naive B cell ancestor).

A simple solution is just to remove cells with multiple heavy chains from the single cell data:

[9]:
%%R
# remove cells with multiple heavy chain
multi_heavy <- table(filter(data, locus == "IGH")$cell_id)
multi_heavy_cells <- names(multi_heavy)[multi_heavy > 1]

data <- filter(data, !cell_id %in% multi_heavy_cells)
cat(paste("There are", nrow(data), "rows in the data after filtering out cells with multiple heavy chains.\n"))
There are 18393 rows in the data after filtering out cells with multiple heavy chains.

Remove cells without heavy chains

Since most of the following analyses are based on heavy chains, we remove cells with only light chains:

[10]:
%%R
# split cells by heavy and light chains
heavy_cells <- filter(data, locus == "IGH")$cell_id
light_cells <- filter(data, locus == "IGK" | locus == "IGL")$cell_id
no_heavy_cells <- light_cells[which(!light_cells %in% heavy_cells)]

data <- filter(data, !cell_id %in% no_heavy_cells)
cat(paste("There are", nrow(data), "rows in the data after filtering out cells without heavy chains."))
There are 18166 rows in the data after filtering out cells without heavy chains.

Cloning thresholds

Find the distance threshold for cloning manually

Hierarchical clustering is a widely used distance-based method for identify clonally related sequences. An implementation of the hierarchical clustering approach is provided via the hierachicalClones function in the SCOPer R package.

It is important to determine an appropriate threshold for trimming the hierarchical clustering into B cell clones before using this method. The ideal threshold for separating clonal groups is the value that separates the two modes of the nearest-neighbor distance distribution. The nearest-neighbor distance distribution can be generated by using the distToNearest function in the shazam R package.

[11]:
%%R
dist_nearest <- distToNearest(filter(data, locus == "IGH"), nproc = 1)

# generate Hamming distance histogram
p1 <- ggplot(subset(dist_nearest, !is.na(dist_nearest)),
             aes(x = dist_nearest)) +
        theme_bw() +
        xlab("Hamming distance") + ylab("Count") +
        scale_x_continuous(breaks = seq(0, 1, 0.1)) +
        geom_histogram(color = "white", binwidth = 0.02) +
        theme(axis.title = element_text(size = 18))
plot(p1)
../_images/tutorials_10x_tutorial_22_0.png

The resulting distribution is often bimodal, with the first mode representing sequences with clonal relatives in the dataset and the second mode representing singletons. We can inspect the plot of nearest-neighbor distance distribution generated above to manually select a threshold to separates the two modes of the nearest-neighbor distance distribution.

For further details regarding inferring an appropriate threshold for the hierarchical clustering method, see the Distance to Nearest Neighbor vignette in the shazam package.

Find the distance threshold for cloning automatically

The threshold itself can be also found using the automatic findThreshold function. There are different ways to find the threshold and details can also be found in the Distance to Nearest Neighbor vignette in the shazam package.

A robust way that we recommend is to use the nearest-neighbor distance of inter (between) clones as the background and select the threshold based on the specificity of this background distribution.

[12]:
%%R
# find threshold for cloning automatically
threshold_output <- findThreshold(dist_nearest$dist_nearest,
                       method = "gmm", model = "gamma-norm",
                       cutoff = "user", spc = 0.995)
threshold <- threshold_output@threshold
threshold
[1] 0.1305861
[13]:
%%R
plot(threshold_output, binwidth = 0.02, silent = T) +
  theme(axis.title = element_text(size = 18))
../_images/tutorials_10x_tutorial_26_0.png

The nearest-neighbor distance distribution is not always bimodal. In this case, if the data have multiple subjects, we can calculate the nearest neighbor distances across subjects to initialize the Gaussian fit parameters of the nearest-neighbor distance of inter (between) clones distribution.

The nearest neighbor distances across subjects can be calculated by specifying the parameter cross in the function distToNearest. And then when we call function findThreshold, Gaussian fit parameters can be initialized by setting parameter cross = dist_crossSubj$cross_dist_nearest.

In the example below, BCR data from donor 2 together with the BCR data from donor 1, was used to calculate the nearest neighbor distances across subjects:

[14]:
%%R
# read in BCR data from donor 2
data_10x_example <- readChangeoDb('../data/10x_data_2subj/sc5p_v2_hs_B_1k_multi_5gex_b_Multiplex_vdj_b_all_contig_igblast_db-pass.tsv')
data_10x_example$subject <- "Subj2"

cat(paste("There are", nrow(data_10x_example), "rows in the donor 2 data.\n"))

# label the donor 1 data
data$subject <- "Subj1"

# combine donors 1 and 2
data_both_subjects <- rbind(data, data_10x_example)
data_both_subjects %>% slice_sample(n = 5) # random examples
There are 7060 rows in the donor 2 data.
# A tibble: 5 × 58
  sequenc…¹ seque…² rev_c…³ produ…⁴ v_call d_call j_call seque…⁵ germl…⁶ junct…⁷
  <chr>     <chr>   <lgl>   <lgl>   <chr>  <chr>  <chr>  <chr>   <chr>   <chr>
1 CCTTCCCT… GGCAGG… FALSE   TRUE    IGKV1… <NA>   IGKJ3… GACATC… GACATC… TGTCAA…
2 GTTCATTA… GAGCTA… FALSE   TRUE    IGKV4… <NA>   IGKJ3… GACATC… GACATC… TGTCAG…
3 ACTATCTA… GGGGGT… FALSE   TRUE    IGKV3… <NA>   IGKJ4… GAAATT… GAAATT… TGTCAG…
4 GGAATAAC… GGGAGC… FALSE   TRUE    IGHV1… IGHD2… IGHJ4… CAGGTC… CAGGTC… TGTGCG…
5 ATCACGAG… AGCTCT… FALSE   TRUE    IGHV3… IGHD3… IGHJ4… GAGGTG… GAGGTG… TGTGCG…
# … with 48 more variables: junction_aa <chr>, v_cigar <chr>, d_cigar <chr>,
#   j_cigar <chr>, stop_codon <lgl>, vj_in_frame <lgl>, locus <chr>,
#   c_call <chr>, junction_length <dbl>, np1_length <dbl>, np2_length <dbl>,
#   v_sequence_start <dbl>, v_sequence_end <dbl>, v_germline_start <dbl>,
#   v_germline_end <dbl>, d_sequence_start <dbl>, d_sequence_end <dbl>,
#   d_germline_start <dbl>, d_germline_end <dbl>, j_sequence_start <dbl>,
#   j_sequence_end <dbl>, j_germline_start <dbl>, j_germline_end <dbl>, …
# ℹ Use `colnames()` to see all variable names
[15]:
%%R
# calculate cross subjects distribution of distance to nearest
dist_crossSubj <- distToNearest(filter(data_both_subjects, locus == "IGH"),
                                nproc = 1, cross = "subject")

# find threshold for cloning automatically and initialize the Gaussian fit parameters of the nearest-neighbor
# distance of inter (between) clones using cross subjects distribution of distance to nearest
threshold_output <- findThreshold(dist_nearest$dist_nearest,
                                  method = "gmm", model = "gamma-norm",
                                  cross = dist_crossSubj$cross_dist_nearest,
                                  cutoff = "user", spc = 0.995)
threshold <- threshold_output@threshold
threshold
[1] 0.1305812
[16]:
%%R
plot(threshold_output, binwidth = 0.02,
     cross = dist_crossSubj$cross_dist_nearest, silent = T) +
  theme(axis.title = element_text(size = 18))
../_images/tutorials_10x_tutorial_30_0.png

In the plot above, the top plot is the nearest-neighbor distance distribution within Subj1, and the bottom plot is the nearest neighbor distances across Subj1 and Subj2.

Define clonal groups

After we decide the threshold for calling clones, the hierachicalClones function in SCOPer package can be used to call clones using single cell mode:

[17]:
%%R
# call clones using hierarchicalClones
results <- hierarchicalClones(data, cell_id = 'cell_id',
                              threshold = threshold, only_heavy = FALSE,
                              split_light = TRUE, summarize_clones = FALSE)
R[write to console]: Running defineClonesScoper in single cell mode

HierarchicalClones clusters B receptor sequences based on junction region sequence similarity within partitions that share the same V gene, J gene, and junction length, thus allowing for ambiguous V or J gene annotations. By setting it up the cell_id parameter, HierarchicalClones will run in single-cell mode with paired-chain sequences. With only_heavy = FALSE and split_light = TRUE, grouping should be done by using IGH plus IGK/IGL sequences and inferred clones should be split by the light/short chain (IGK and IGL) following heavy/long chain clustering.

After calling clones, a clonal abundance distribution can be displayed:

[18]:
%%R
# calculate and plot the rank-abundance curve
plot(estimateAbundance(results), colors = "steelblue", silent = T) +
  theme(axis.title = element_text(size = 18))
../_images/tutorials_10x_tutorial_35_0.png

Create Germlines

Before B cell lineage trees can be built, it is necessary to construct the unmutated germline sequence for each B cell clone. Typically the IGH D segment is masked because the junction region of heavy chains often cannot be reliably reconstructed. Note that occasionally errors are thrown for some clones - this is typical and usually results in those clones being excluded.

In the example below, we read in the IMGT germline references from our Docker container. If you’re using a local installation, you can download the most up-to-date reference genome by cloning the Immcantation repository and running the script:

git clone https://bitbucket.org/kleinstein/immcantation.git
./immcantation/scripts/fetch_imgtdb.sh # will create directories where it is run

And passing "human/vdj/" to the readIMGT function.

[19]:
%%R
# run createGermlines using IMGT files in Docker container.
references <- readIMGT(dir = "/usr/local/share/germlines/imgt/human/vdj")

h <- createGermlines(filter(results, locus == "IGH"), references)
k <- createGermlines(filter(results, locus == "IGK"), references)
l <- createGermlines(filter(results, locus == "IGL"), references)
[1] "Read in 1152 from 17 fasta files"

Calculate SHM frequency in the V gene

Basic mutational load calculations can be performed by the function observedMutations in the SHazaM R package:

[20]:
%%R
# calculate SHM frequency in the V gene
data_h <- observedMutations(h,
                            sequenceColumn = "sequence_alignment",
                            germlineColumn = "germline_alignment_d_mask",
                            regionDefinition = IMGT_V,
                            frequency = TRUE,
                            combine = TRUE,
                            nproc = 1)

The plot below shows the distribution of median mutation frequency of clones:

[21]:
%%R
# calculate the median mutation frequency of a clone
mut_freq_clone <- data_h %>%
                  group_by(clone_id) %>%
                  summarise(median_mut_freq = mean(mu_freq))

ggplot(mut_freq_clone, aes(median_mut_freq)) +
    geom_histogram(, binwidth = 0.005) + theme_bw() +
    theme(axis.title = element_text(size = 18))
../_images/tutorials_10x_tutorial_41_0.png

Format clones for tree building in Dowser

Before we can build lineage trees, our data must be formatted into a tibble of airrClone objects.

  • If we want to visualize information in a column on the tree, we must specify that column in the traits column.

  • To specify multiple traits, input a vector into this column.

  • If we want our clones tibble to have a column showing the subject each clone was obtained from, we can specify the subject column in the columns option.

  • By default, identical sequences will be collapsed unless they differ by the values in the traits column.

  • minseq will remove all clones with fewer than the specified number of sequences.

[22]:
%%R
clones <- formatClones(data_h,
                       traits = "c_call", columns = "subject",
                       minseq = 3)
print(clones)
# A tibble: 15 × 5
   clone_id data       locus  seqs subject
   <chr>    <list>     <chr> <int> <chr>
 1 6580     <airrClon> IGH       5 Subj1
 2 2388     <airrClon> IGH       4 Subj1
 3 146      <airrClon> IGH       3 Subj1
 4 1901     <airrClon> IGH       3 Subj1
 5 2166     <airrClon> IGH       3 Subj1
 6 2601     <airrClon> IGH       3 Subj1
 7 3857     <airrClon> IGH       3 Subj1
 8 4500     <airrClon> IGH       3 Subj1
 9 4881     <airrClon> IGH       3 Subj1
10 4978     <airrClon> IGH       3 Subj1
11 5870     <airrClon> IGH       3 Subj1
12 6381     <airrClon> IGH       3 Subj1
13 7342     <airrClon> IGH       3 Subj1
14 7877     <airrClon> IGH       3 Subj1
15 8179     <airrClon> IGH       3 Subj1

Build trees using maximum likelihood

Dowser offers multiple ways to build lineage trees, including maximum parsimony, maximum likelihood, and B cell specific models such as IgPhyML.

Here, we build trees using the likelihood maximization functions from the phangorn package in R. The tree objects themselves are saved as R phylo objects in the trees column.

[23]:
%%R
trees <- getTrees(clones, build = "pml", nproc = 1)
print(trees)
# A tibble: 15 × 6
   clone_id data       locus  seqs subject trees
   <chr>    <list>     <chr> <int> <chr>   <list>
 1 6580     <airrClon> IGH       5 Subj1   <phylo>
 2 2388     <airrClon> IGH       4 Subj1   <phylo>
 3 146      <airrClon> IGH       3 Subj1   <phylo>
 4 1901     <airrClon> IGH       3 Subj1   <phylo>
 5 2166     <airrClon> IGH       3 Subj1   <phylo>
 6 2601     <airrClon> IGH       3 Subj1   <phylo>
 7 3857     <airrClon> IGH       3 Subj1   <phylo>
 8 4500     <airrClon> IGH       3 Subj1   <phylo>
 9 4881     <airrClon> IGH       3 Subj1   <phylo>
10 4978     <airrClon> IGH       3 Subj1   <phylo>
11 5870     <airrClon> IGH       3 Subj1   <phylo>
12 6381     <airrClon> IGH       3 Subj1   <phylo>
13 7342     <airrClon> IGH       3 Subj1   <phylo>
14 7877     <airrClon> IGH       3 Subj1   <phylo>
15 8179     <airrClon> IGH       3 Subj1   <phylo>

Build trees using IgPhyML

To build lineage trees using the B cell specific models in IgPhyML, you must specify the location of the compiled IgPhyML executable in your system.

  • Here, we use its location in the Docker image.

  • The format of the output is the same regardless of the method used to build the tree.

  • With IgPhyML, however, a column of parameters is also returned, which gives the parameter values of the HLP19 model.

[24]:
%%R
trees <- getTrees(clones,
                  build = "igphyml", nproc = 1, exec = "/usr/local/share/igphyml/src/igphyml")
print(trees)
# A tibble: 15 × 7
   clone_id data       locus  seqs subject parameters        trees
   <chr>    <list>     <chr> <int> <chr>   <named list>      <named list>
 1 6580     <airrClon> IGH       5 Subj1   <named list [13]> <phylo>
 2 2388     <airrClon> IGH       4 Subj1   <named list [13]> <phylo>
 3 146      <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
 4 1901     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
 5 2166     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
 6 2601     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
 7 3857     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
 8 4500     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
 9 4881     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
10 4978     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
11 5870     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
12 6381     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
13 7342     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
14 7877     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>
15 8179     <airrClon> IGH       3 Subj1   <named list [13]> <phylo>

Visualize trees with constant region on tips

Regardless of how you build trees, they are visualized in the same manner with the plotTrees function. This will return a list of ggplot objects in the same order as the input object. Here, we color the tips by the c_call value because we specified that column in the formatClones step.

[25]:
%%R
p <- plotTrees(trees, tips = "c_call", tipsize = 4)

# plot first tree
p[[1]]

# save all trees to a pdf
treesToPDF(p, file = "results/alltrees.pdf")
png
  2

Reconstruct intermediate sequences

Sequences of intermediate nodes are automatically reconstructed during the tree build process. To retrieve them, first plot the node numbers for each node. The function collapseNodes can help clean up the tree plots.

Then, use the getNodeSeq function to retrieve the sequence at the desired node.

[26]:
%%R
trees <- collapseNodes(trees)

# plot trees with node ID numbers
p <- plotTrees(trees, tips = "c_call", tipsize = 4,
               node_nums = TRUE, labelsize = 7)
p[[1]]
../_images/tutorials_10x_tutorial_51_0.png
[27]:
%%R
# get the sequence at node 8
first_clone_id <- trees[["clone_id"]][1]
sequence <- getNodeSeq(trees, clone = first_clone_id, node = 8)
print(sequence)
                                                                                                                                                                                                                                                                                                                                                                                                     IGH
"CAGGTGCAGCTGGTGCAATCTGGGTCT...GAGTTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGACTTCTGGATACACCTTC............ASTGACTATGGTGTGAACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACGCCTAC......ACCGGGAACCCAACGTATGCCCAGGGCTTCACA...GGACGGTTTGTCTTCTCCTTGGACACCTCTGTCCGCACGGCATATCTGCAGATCAGCAGCCTGAAGGCTGAGGACACTGCCGTGTATTACTGTGCGATTATCCATGATRGTAGTACYTGGAGTCCTTTTGACTACTGGGGCCAGGGAGCCCTGGTCACCGTCTCCTCAGNN"

Analyzing B cell migration, differentiation, and evolution over time

In addition to the functions for building and visualizing trees, Dowser also implements new techniques for analyzing B cell migration and differentiation, as well as for detecting new B cell evolution over time. These are more advanced topics detailed on the Dowser site.

If you have data from different tissues, B cell subtypes, and/or isotypes and want to use lineage trees to study the pattern of those traits along lineage trees, check out the discrete trait vignette.

If you have data from multiple timepoints from the same subject and want to determine if B cell lineages are evolving over the sampled interval, check out the measurable evolution vignette.

For more advanced tree visualization, check out the plotting trees vignette.