1. Field of the Invention
The present disclosure relates generally to quantitative high-throughput sequencing of adaptive immune receptor encoding DNA (e.g., DNA encoding T cell receptors (TCR) and immunoglobulins (IG)) in multiplexed nucleic acid amplification reactions. In particular, the compositions and methods described herein overcome undesirable distortions in the quantification of adaptive immune receptor encoding sequences that can result from biased over-utilization and/or under-utilization of specific oligonucleotide primers in multiplexed DNA amplification.
2. Description of the Related Art
The adaptive immune system employs several strategies to generate a repertoire of T- and B-cell antigen receptors, i.e., adaptive immune receptors, with sufficient diversity to recognize the universe of potential pathogens. The ability of T cells to recognize the universe of antigens associated with various cancers or infectious organisms is conferred by the T cell antigen receptor (TCR), which is a heterodimer of an α (alpha) chain from the TCRA locus and a β (beta) chain from the TCRB locus, or a heterodimer of a γ (gamma) chain from the TCRG locus and a δ (delta) chain from the TCRD locus. The proteins that make up these chains are encoded by DNA, which in lymphoid cells employs a unique rearrangement mechanism for generating the tremendous diversity of the TCR. This multi-subunit immune recognition receptor associates with the CD3 complex and binds to peptides presented by either the major histocompatibility complex (MHC) class I or MHC class II proteins on the surface of antigen-presenting cells (APCs). Binding of TCR to the antigenic peptide on the APC is the central event in T cell activation, which occurs at an immunological synapse at the point of contact between the T cell and the APC.
Each TCR peptide contains variable complementarity determining regions (CDRs), as well as framework regions (FRs) and a constant region. The sequence diversity of αβ T cells is largely determined by the amino acid sequence of the third complementarity-determining region (CDR3) loops of the α and β chain variable domains, which diversity is a result of recombination between variable (Vβ), diversity (Dβ), and joining (Jβ) gene segments in the β chain locus, and between analogous Vα and Jα gene segments in the α chain locus, respectively. The existence of multiple such gene segments in the TCR α and β chain loci allows for a large number of distinct CDR3 sequences to be encoded. CDR3 sequence diversity is further increased by independent addition and deletion of nucleotides at the Vβ-Dβ, Dβ-Jβ, and Vα-Jα junctions during the process of TCR gene rearrangement. In this respect, immunocompetence is derived from the diversity of TCRs.
The γδ TCR heterodimer is distinct from the αβ TCR in that it encodes a receptor that interacts closely with the innate immune system, and recognizes antigen in a non-HLA-dependent manner. TCRγδ is expressed early in development, and has specialized anatomical distribution, unique pathogen and small-molecule specificities, and a broad spectrum of innate and adaptive cellular interactions. A biased pattern of TCRγ V and J segment expression is established early in ontogeny. Consequently, the diverse TCRγ repertoire in adult tissues is the result of extensive peripheral expansion following stimulation by environmental exposure to pathogens and toxic molecules.
Immunoglobulins (Igs or IG), also referred to herein as B cell receptors (BCR), are proteins expressed by B cells consisting of four polypeptide chains, two heavy chains (H chains) from the IGH locus and two light chains (L chains) from either the IGK or the IGL locus, forming an H2L2 structure. H and L chains each contain three complementarity determining regions (CDR) involved in antigen recognition, as well as framework regions and a constant domain, analogous to TCR. The H chains of Igs are initially expressed as membrane-bound isoforms using either the IGM or IGD constant region exons, but after antigen recognition the constant region can class-switch to several additional isotypes, including IGG, IGE and IGA. As with TCR, the diversity of naïve Igs within an individual is mainly determined by the hypervariable complementarity determining regions (CDR). Similar to TCRB, the CDR3 domain of H chains is created by the combinatorial joining of the VH, DH, and JH gene segments. Hypervariable domain sequence diversity is further increased by independent addition and deletion of nucleotides at the VH-DH, DH-JH, and VH-JH junctions during the process of Ig gene rearrangement. Distinct from TCR, Ig sequence diversity is further augmented by somatic hypermutation (SHM) throughout the rearranged IG gene after a naïve B cell initially recognizes an antigen. The process of SHM is not restricted to CDR3, and therefore can introduce changes to the germline sequence in framework regions, CDR1 and CDR2, as well as in the somatically rearranged CDR3.
As the adaptive immune system functions in part by clonal expansion of cells expressing unique TCRs or BCRs, accurately measuring the changes in total abundance of each T cell or B cell clone is important to understanding the dynamics of an adaptive immune response. For instance, a healthy human has a few million unique rearranged TCRβ chains, each carried in hundreds to thousands of clonal T-cells, out of the roughly trillion T cells in a healthy individual. Utilizing advances in high-throughput sequencing, a new field of molecular immunology has recently emerged to profile the vast TCR and BCR repertoires. Compositions and methods for the sequencing of rearranged adaptive immune receptor gene sequences and for adaptive immune receptor clonotype determination are described in U.S. application Ser. No. 13/217,126; U.S. application Ser. No. 12/794,507; PCT/US2011/026373; and PCT/US2011/049012, all herein incorporated by reference.
To date, several different strategies have been employed to sequence nucleic acids encoding adaptive immune receptors quantitatively at high throughput, and these strategies may be distinguished, for example, by the approach that is used to amplify the CDR3-encoding regions, and by the choice of sequencing genomic DNA (gDNA) or messenger RNA (mRNA).
Sequencing mRNA is a potentially easier method than sequencing gDNA, because mRNA splicing events remove the intron between J and C segments. This allows for the amplification of adaptive immune receptors (e.g., TCRs or Igs) having different V regions and J regions using a common 3′ PCR primer in the C region. For each TCRβ, for example, the thirteen J segments are all less than 60 base pairs (bp) long. Therefore, splicing events bring identical polynucleotide sequences encoding TCRβ constant regions (regardless of which V and J sequences are used) within less than 100 bp of the rearranged VDJ junction. The spliced mRNA can then be reverse transcribed into complementary DNA (cDNA) using poly-dT primers complementary to the poly-A tail of the mRNA, random small primers (usually hexamers or nonamers) or C-segment-specific oligonucleotides. This should produce an unbiased library of TCR cDNA (because all cDNAs are primed with the same oligonucleotide, whether poly-dT, random hexamer, or C segment-specific oligo) that may then be sequenced to obtain information on the V and J segment used in each rearrangement, as well as the specific sequence of the CDR3. Such sequencing could use single, long reads spanning CDR3 (“long read”) technology, or could instead involve shotgun assembly of the longer sequences using fragmented libraries and higher throughput shorter sequence reads.
Efforts to quantify the number of cells in a sample that express a particular rearranged TCR (or Ig) based on mRNA sequencing are difficult to interpret, however, because each cell potentially expresses different quantities of TCR mRNA. For example, T cells activated in vitro have 10-100 times as much mRNA per cell as quiescent T cells. To date, there is very limited information on the relative amount of TCR mRNA in T cells of different functional states, and therefore quantitation of mRNA in bulk does not necessarily accurately measure the number of cells carrying each clonal TCR rearrangement.
Most T cells, on the other hand, have one productively rearranged TCRα and one productively rearranged TCRβ gene (or two rearranged TCRγ and TCRδ), and most B cells have one productively rearranged Ig heavy-chain gene and one productively rearranged Ig light-chain gene (either IGK or IGL), so quantification in a sample of genomic DNA encoding TCRs or BCRs should directly correlate with, respectively, the number of T or B cells in the sample. Genomic sequencing of polynucleotides encoding any one or more of the adaptive immune receptor chains desirably entails amplifying with equal efficiency all of the many possible rearranged CDR3 sequences that are present in a sample containing DNA from lymphoid cells of a subject, followed by quantitative sequencing, such that a quantitative measure of the relative abundance of each rearranged CDR3 clonotype can be obtained.
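The contrast between mRNA-based and gDNA-based counting can be illustrated with a minimal sketch. The clone names, cell counts, and per-cell transcript numbers below are hypothetical. Only the 10-100 fold range for activated cells comes from the text above, and the model makes the simplifying assumption of exactly one rearranged template per cell at the locus being counted.

```python
def fractions(counts):
    """Normalize a mapping of clone -> molecule count to clone -> fraction."""
    total = sum(counts.values())
    return {clone: n / total for clone, n in counts.items()}

# Hypothetical sample: two clones with the same number of cells.
cells = {"clone_A": 1000, "clone_B": 1000}

# Assumed TCR transcripts per cell; clone_B is treated as activated
# (50-fold more mRNA, a value chosen inside the 10-100 fold range).
mrna_per_cell = {"clone_A": 10, "clone_B": 500}

# gDNA: one rearranged template per cell (simplifying assumption).
gdna_templates = {clone: n * 1 for clone, n in cells.items()}
# mRNA: template count scales with per-cell expression.
mrna_templates = {clone: n * mrna_per_cell[clone] for clone, n in cells.items()}

gdna_fractions = fractions(gdna_templates)
mrna_fractions = fractions(mrna_templates)

print("gDNA:", gdna_fractions)
print("mRNA:", mrna_fractions)
```

Under these assumptions the gDNA fractions equal the cell fractions (one half each), whereas the mRNA fractions over-represent the activated clone by the assumed 50-fold expression difference.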
Difficulties are encountered with such approaches, however, in that equal amplification and sequencing efficiencies may not be achieved readily for each rearranged clone using multiplex PCR. For example, at TCRB each clone employs one of 54 possible germline V region-encoding genes and one of 13 possible J region-encoding genes. The DNA sequence of the V and J segments is necessarily diverse, in order to generate a diverse adaptive immune repertoire. This sequence diversity makes it impossible to design a single, universal primer sequence that will anneal to all V segments (or J segments) with equal affinity, and yields complex DNA samples in which accurate determination of the multiple distinct sequences contained therein is hindered by technical limitations on the ability to quantify a plurality of molecular species simultaneously using multiplexed amplification and high throughput sequencing.
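To give a sense of scale, the number of V-J segment pairings at TCRB follows directly from the counts stated above. The sketch below is illustrative only: the segment labels are placeholders rather than real gene names, and the count ignores D segments and junctional nucleotide additions and deletions.

```python
from itertools import product

# Placeholder labels, not real gene names; the counts 54 and 13 come from the text.
v_segments = [f"V{i}" for i in range(1, 55)]
j_segments = [f"J{i}" for i in range(1, 14)]

# Every V segment can in principle pair with every J segment.
vj_pairs = list(product(v_segments, j_segments))

print(len(vj_pairs))  # 54 * 13 = 702
```

A multiplexed primer set therefore has to cover on the order of 702 distinct V-J pairings with comparable efficiency, before junctional diversity is even considered.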
One or more factors can give rise to artifacts that skew the correlation between sequencing data outputs and the number of copies of an input clonotype, compromising the ability to obtain reliable quantitative data from sequencing strategies that are based on multiplexed amplification of a highly diverse collection of TCRβ gene templates. These artifacts often result from unequal use of diverse primers during the multiplexed amplification step. Such biased utilization of one or more oligonucleotide primers in a multiplexed reaction that uses diverse amplification templates may arise as a function of differential annealing kinetics due to one or more of differences in the nucleotide base composition of templates and/or oligonucleotide primers, differences in template and/or primer length, the particular polymerase that is used, the amplification reaction temperatures (e.g., annealing, elongation and/or denaturation temperatures), and/or other factors (e.g., Kanagawa, 2003 J. Biosci. Bioeng. 96:317; Day et al., 1996 Hum. Mol. Genet. 5:2039; Ogino et al., 2002 J. Mol. Diagnost. 4:185; Barnard et al., 1998 Biotechniques 25:684; Aird et al., 2011 Genome Biol. 12:R18).
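How a small difference in primer utilization can compound over many cycles may be sketched with an idealized exponential model, in which each cycle multiplies the copy number of a template by (1 + efficiency). This is a common textbook simplification, not a method taken from the present disclosure; the template names, the efficiency values, and the cycle number are all hypothetical.

```python
def amplified_copies(initial_copies, efficiency, cycles):
    """Idealized PCR: each cycle multiplies the copy number by (1 + efficiency)."""
    return initial_copies * (1.0 + efficiency) ** cycles

def observed_fractions(initial_copies, efficiencies, cycles):
    """Fraction of the final product contributed by each template."""
    final = {
        name: amplified_copies(initial_copies[name], efficiencies[name], cycles)
        for name in initial_copies
    }
    total = sum(final.values())
    return {name: count / total for name, count in final.items()}

# Hypothetical input: two templates present at equal copy number.
initial = {"template_1": 100, "template_2": 100}
# Assumed per-cycle efficiencies differing by 0.10 because of primer bias.
efficiency = {"template_1": 0.95, "template_2": 0.85}

biased = observed_fractions(initial, efficiency, cycles=25)
bias_ratio = biased["template_1"] / biased["template_2"]

print(biased)
print(bias_ratio)
```

With these assumed values, two templates that start at equal abundance differ by roughly 3.7-fold after 25 cycles, which illustrates why sequencing output need not track input copy number when primers are used unequally.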
Clearly, there remains a need for improved compositions and methods that will permit accurate quantification of adaptive immune receptor-encoding DNA sequence diversity in complex samples, in a manner that avoids skewed results such as misleading over- or underrepresentation of individual sequences due to biases in the amplification of specific templates by an oligonucleotide primer set used for multiplexed amplification of a complex template DNA population. The presently described embodiments address this need and provide other related advantages.