The level of gene expression in a biological sample can vary greatly. For example, gene expression levels have been described as falling into three broad categories: 1) ‘high expressers,’ comprising 5-10 genes that account for ~20% of cellular mRNAs; 2) ‘intermediate expressers,’ comprising 50-200 genes that account for 40-60% of cellular mRNAs; and 3) ‘moderate expressers,’ comprising 10,000-20,000 genes that account for the remainder of the cellular mRNA fraction. One challenge in molecular biology and molecular genetics is to capture this highly dynamic gene expression profile efficiently and effectively in order to distinguish different cell types and phenotypes in the sample.
In recent years, next generation sequencing (NGS) has provided a high-throughput method for assessing gene expression profiles. During library preparation for NGS, a sample containing heterogeneous cDNA species is amplified by PCR to obtain an adequate amount of sample and to attach NGS-compatible adapters. The sequencing process then counts the reads for each gene in the PCR-amplified library to infer the gene expression level. However, because different genes are expressed over a wide range of levels, PCR amplification can skew the native gene expression profile. For example, a gene with 1 molecule of cDNA would require about 40 cycles of PCR to reach the same amount as a gene with 1000 molecules of cDNA after 30 cycles, since each cycle at most doubles the product and 2^10 is approximately 1000. In a heterogeneous cDNA sample, PCR is usually performed with excess cycles to adequately amplify low expressers; in those scenarios, the native gene expression profile is usually skewed by the dominating PCR products of high expressers. One method to correct for such bias in PCR products is Molecular Indexing; however, high expressers such as ribosomal protein mRNAs, mitochondrial mRNAs, or housekeeping genes often dominate the sequencing run while contributing little to the experimental interpretation, which makes Molecular Index counting costly in terms of sequencing.
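The cycle arithmetic in the PCR example can be checked with a short calculation. The following is a minimal sketch, assuming idealized exponential amplification in which every template is copied in every cycle (an `efficiency` of 1.0, i.e. perfect doubling) and no plateau is reached; the function names and the `efficiency` parameter are illustrative assumptions and are not part of the method described here.

```python
import math


def copies_after_pcr(initial_molecules, cycles, efficiency=1.0):
    """Idealized PCR yield: each cycle multiplies the product by (1 + efficiency).

    efficiency=1.0 is the assumed best case of perfect doubling per cycle.
    """
    return initial_molecules * (1.0 + efficiency) ** cycles


def extra_cycles_to_match(low_molecules, high_molecules, efficiency=1.0):
    """Additional cycles a rare template needs to catch up with an abundant one."""
    return math.log(high_molecules / low_molecules) / math.log(1.0 + efficiency)


# The example from the text: 1 molecule for 40 cycles vs. 1000 molecules for 30 cycles.
rare = copies_after_pcr(1, 40)
abundant = copies_after_pcr(1000, 30)
ratio = rare / abundant

# Cycles needed to close a 1000-fold difference in starting amount.
gap_cycles = extra_cycles_to_match(1, 1000)

print(f"rare/abundant product ratio: {ratio:.3f}")
print(f"extra cycles for a 1000-fold gap: {gap_cycles:.2f}")
```

Under these assumptions the two products come out nearly equal, because ten extra doublings give a factor of 1024. With a lower per-cycle efficiency the number of extra cycles grows, which is one reason excess cycling is used for low expressers in practice.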