While most of the human genome is made up of conserved sequences shared by essentially the entire human population, a small but significant fraction of the genome is highly variable. These sequence differences are not evenly spread across the genome. Rather, certain genomic regions (“loci”) contain many more sequence variations (“polymorphisms”) than others. The identity of the specific nucleotide sequence at a particular locus (i.e., the allele present at that locus) can have significant biological implications. For example, the allele an individual carries at a particular locus can influence whether that individual is susceptible to a disease or whether a therapeutic agent is likely to be efficacious. In addition, knowledge of the identity of the alleles at a highly polymorphic locus can be used to track the ethnic and/or geographic origins of a biological sample, which can be invaluable to anthropologists and can be used forensically to link an individual with a biological sample. Given the increasing availability of next-generation sequencing technology, the prospect of using next-generation sequencing data for allele identification is attractive. Unfortunately, accurately and efficiently identifying the alleles present at highly polymorphic loci using sequencing data is challenging, particularly when the sequencing data are generated using high-throughput genome-wide sequencing methods.
One set of highly polymorphic loci for which there is a need for highly accurate allele prediction processes is the set of loci that encode Human Leukocyte Antigen (HLA) proteins. HLA proteins present antigen peptides to lymphocytes in order to mediate key immunological events, including self-antigen tolerance and immune responses to pathogens or tumors. Class I HLAs are ubiquitously expressed by all nucleated cells and present cytosolic antigens to cytotoxic T cells. Class II HLAs are primarily expressed by immune cells and present extracellular antigens to helper T cells.
Humans have six major HLA proteins: three class I proteins (HLA-A, HLA-B and HLA-C) and three class II proteins (HLA-DQ, HLA-DR and HLA-DP). Each class I protein is encoded by a single HLA locus (e.g., the HLA-A locus, the HLA-B locus and the HLA-C locus). The class II proteins, on the other hand, are heterodimers made up of an α chain and a β chain, each of which is encoded by its own HLA locus (e.g., the HLA-DQA1 locus, the HLA-DQB1 locus, the HLA-DRA locus, the HLA-DRB1 locus, the HLA-DRB3 locus, the HLA-DRB4 locus, the HLA-DRB5 locus, the HLA-DPA1 locus and the HLA-DPB1 locus). In humans, each of the major HLA loci (both class I and class II) is present on chromosome 6. Being diploid organisms, humans carry two copies of chromosome 6, and therefore carry two copies of each HLA locus.
HLA loci are highly polymorphic. Polymorphisms in the HLA loci often result in differences in the amino acid sequences of HLA proteins. This HLA diversity allows a wide range of different antigens to be presented to immune cells within a population. However, these variations in HLA sequence also result in histoincompatibility of organs and tissues between individuals, greatly complicating surgical transplantation procedures. If the HLA proteins expressed by a transplanted organ or tissue are recognized as foreign by the transplant recipient's immune system, the likely result is organ rejection. Similarly, a transplantation that includes the transfer of immune cells that recognize as foreign the HLA proteins expressed by cells in the transplant recipient can result in graft-versus-host disease. The risk of graft-versus-host disease and organ or tissue rejection can be minimized if the alleles present at the HLA loci of a prospective donor and recipient encode matching HLA proteins, to the greatest extent possible. In order to determine whether there is a match, it is necessary to determine what HLA alleles are present at HLA loci in the donor and recipient, a process known as HLA typing. An individual's HLA type at an HLA locus is made up of the two HLA alleles (or the two copies of a single HLA allele if homozygous) present at the individual's two copies of the HLA locus.
HLA types are also increasingly recognized to play a significant role in numerous diseases. For instance, there are strong associations between certain HLA types and autoimmune disorders, including lupus, inflammatory bowel diseases, multiple sclerosis, arthritis and type I diabetes (e.g., Graham et al., Eur. J. Hum. Genet. 15:823-830 (2007); Fu et al., J. Autoimmun. 37:104-112 (2011); Cassinotti et al., Am. J. Gastroenterol. 104:195-217 (2009); Luckey et al., J. Autoimmun. 37:122-128 (2011); Lemire, M., BMC Proc. 7:S33 (2009); Noble et al., Curr. Diab. Rep. 11:533-542 (2011), each of which is hereby incorporated by reference in its entirety). As one example, class II HLA DQA1*02:01 (DQ2) and DRB1*03:01 (DR3) are frequently present in systemic lupus erythematosus patients and are significantly associated with disease susceptibility (Graham et al., Eur. J. Hum. Genet. 15:823-830 (2007)). The presence of other class II HLA proteins also correlates with either resistance or susceptibility to breast and cervical cancers (e.g., Chaudhuri et al., Proc. Natl. Acad. Sci. USA 97:11451-11454 (2000); Garcia-Corona et al., Arch. Dermatol. 140:1227-1231 (2004), each of which is hereby incorporated by reference in its entirety).
The involvement of HLA molecules in disease pathogenesis and in therapy highlights the need for accurate and efficient methods of HLA typing. In the past, HLA types have been resolved at low resolution by distinguishing “two-digit” antigen groups that approximate serologic specificities in peptide binding. However, for many applications, two-digit HLA typing is insufficient. For example, a single amino acid difference between two HLA proteins of the same two-digit type can result in altered T-cell recognition specificity and tissue rejection (e.g., Archbold et al., Trends Immunol. 29:220-226 (2008); Tynan et al., Nat. Immunol. 6:1114-1122 (2005); Fleischhauer et al., N. Engl. J. Med. 323:1818-1822 (1990), each of which is hereby incorporated by reference in its entirety). Consequently, high-resolution HLA typing at the amino acid sequence level (known as “four-digit” typing) can be critical. For example, resolving HLA types at high resolution substantially improves the clinical outcome in unrelated cord blood transplantation and in cancer vaccination trials (Nagorson et al., Cancer Immunol. Immunother. 57:1903-1910 (2008); Liao et al., Bone Marrow Transplant. 40:201-208 (2007), each of which is hereby incorporated by reference in its entirety).
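The distinction between two-digit and four-digit resolution can be illustrated with a minimal sketch. It assumes allele names written in the colon-delimited form used elsewhere in this description (e.g., DRB1*03:01), optionally prefixed with "HLA-". The names parse_allele, two_digit and four_digit are hypothetical helpers introduced only for illustration and are not part of any typing method described here; the allele names in the usage example are illustrative inputs.

```python
import re
from typing import NamedTuple, Optional

# Assumed name format: optional "HLA-" prefix, locus, "*", first field,
# and optionally ":" plus a second field. Any further fields are ignored.
_ALLELE_RE = re.compile(r"^(?:HLA-)?([A-Z0-9]+)\*(\d+)(?::(\d+))?")


class HlaAllele(NamedTuple):
    locus: str               # e.g. "A" or "DRB1"
    group: str               # first field ("two-digit" antigen group)
    protein: Optional[str]   # second field (distinguishes protein sequences)


def parse_allele(name: str) -> HlaAllele:
    """Split an allele name into locus, first field and second field."""
    match = _ALLELE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized allele name: {name!r}")
    return HlaAllele(match.group(1), match.group(2), match.group(3))


def two_digit(allele: HlaAllele) -> str:
    """Low-resolution designation: locus plus first field."""
    return f"{allele.locus}*{allele.group}"


def four_digit(allele: HlaAllele) -> str:
    """High-resolution designation: locus plus first and second fields."""
    if allele.protein is None:
        raise ValueError("allele name carries no second field")
    return f"{allele.locus}*{allele.group}:{allele.protein}"


if __name__ == "__main__":
    first = parse_allele("HLA-A*02:01")
    second = parse_allele("HLA-A*02:05")
    # Same two-digit group, different four-digit designation.
    print(two_digit(first) == two_digit(second))
    print(four_digit(first) == four_digit(second))
```

Under these assumptions, two alleles that compare equal at two-digit resolution can still differ at four-digit resolution, which is the case in which two-digit typing fails to reveal a difference in the encoded protein.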
The highly polymorphic nature of HLA loci renders accurate, high-resolution typing a considerable challenge, particularly at high throughput. More than 7527 four-digit HLA alleles are present at the major class I and class II HLA loci in the human population. Existing HLA typing methodologies capable of resolving HLA types at four-digit resolution, such as group-specific PCR by sequence-specific priming (SSP) and sequence-based typing (SBT), have low throughput. Other proposed typing strategies specifically target the HLA loci via PCR amplification, followed by deep sequencing. Such methods require long reads and high coverage (depth) in order to produce accurate assignment of four-digit HLA alleles. Due to cost and efficiency considerations, genome-wide sequencing, such as transcriptome or whole exome/genome sequencing, generally produces much shorter reads (<100 bases) and lower coverage. These read length and coverage limitations reduce the accuracy of current methodologies that attempt to use genome-wide sequencing processes for HLA typing. Specifically, the four-digit HLA type identification accuracy of current methods using short-read sequencing has been reported to be between 32% and 84% (e.g., Boegel et al., Genome Med. 4:102 (2013); Kim and Pourmand, PLoS One 8:e67885 (2013)).
In light of the foregoing, there is a need for new methods of accurately and efficiently identifying the alleles present at a locus using diverse sequencing data, including data with short read lengths and low sequence coverage.