The identification of the entire set of essential genetic instructions encoded in the bacterial genome is crucial for a complete understanding of the regulatory networks that run and program cells. The essential genome of any organism not only contains protein-coding sequences, but also essential structural elements, non-coding RNAs and regulatory sequences. The essential genome is not necessarily a static measure: it is dynamically associated with the environmental context in which an organism is placed. For instance, the essential genomic regions of an organism may differ depending upon the availability of certain foods or nutrients in the local environment. It may also depend on such factors as temperature or the presence of toxic compounds.
Traditional techniques have relied upon mutagenizing areas of the genome to abrogate the function of the adjacent genomic region, thus permitting the identification of locations that convey lethal phenotypes to an organism of interest. By mapping these lethality-inducing locations, a picture of an organism's essential genome emerges. Systematic techniques to attain such a map have been limited to the creation of in-frame deletion libraries in which each of the open reading frames (ORFs) of a genome is disrupted so as to render the corresponding translated region null. However, on a genome-wide level, such a method would require the laborious mutation and analysis of thousands of individual genes in a given organism. Moreover, such an approach fails to identify essential regulatory sequences, transcription factor binding sites, structural genome maintenance and replication features, and non-coding regions of a genome.
In this respect, transposon mutagenesis has proven to be a valuable, high-throughput tool to create mutagenic libraries. Catalyzed by transposase enzymes, transposable elements may be randomly incorporated into a host genome to create large insertional mutations. Due to their size, transposons functionally interfere with the respective transposed region, allowing for a forward genetic approach of identifying particular genes associated with a given phenotype. For essential genomic regions under a particular selective condition, such transposon insertions render an organism unviable and are therefore not recovered.
For genomic studies in bacteria, plasmids containing a transposable element and transposase can be individually transformed into bacteria and selectively cultured. Integration locations can then be identified through the amplification and sequencing of transposon junctions using specifically tailored primers and cross-referencing the data with the underlying bacterial genomic sequence. Regions corresponding to low levels of integration tolerance likely serve an essential function within the genome under a particular selective condition. However, this approach has been limited in significant aspects.
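The idea that regions with low integration tolerance are candidate essential regions can be illustrated with a minimal sketch. It assumes a linear genome with 1-based coordinates and a list of already-mapped insertion positions; the function name `insertion_free_gaps` and the `min_gap` threshold are hypothetical and chosen for illustration only. The sketch does not correct for transposase sequence bias or for low library complexity, both of which would affect a real analysis.

```python
from typing import List, Tuple


def insertion_free_gaps(
    positions: List[int], genome_length: int, min_gap: int
) -> List[Tuple[int, int]]:
    """Return (start, end) intervals, 1-based and inclusive, that contain
    no mapped insertion and span at least `min_gap` bases.

    Assumption: the genome is treated as linear, so a gap that would wrap
    around the origin of a circular chromosome is reported as two pieces.
    """
    gaps = []
    previous = 0  # sentinel position just before the first base
    # Append a sentinel just past the last base so the trailing gap is checked.
    for pos in sorted(set(positions)) + [genome_length + 1]:
        start, end = previous + 1, pos - 1
        if end - start + 1 >= min_gap:
            gaps.append((start, end))
        previous = pos
    return gaps


if __name__ == "__main__":
    # Toy data: four insertions in a 100 bp genome.
    print(insertion_free_gaps([5, 10, 50, 55], genome_length=100, min_gap=20))
```

With the toy data above, the two long stretches without insertions are reported as candidates, while the short spacings between neighboring insertions are ignored. Whether a reported gap reflects essentiality or merely sparse sampling depends on how densely the library covers the genome.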
In light of the expansiveness of genomes, traditional low-throughput sequencing techniques to map transposon insertion sites present an unwieldy challenge: DNA from individual clones must be independently amplified, purified, and sequenced. Achieving greater resolution in discerning genomic regions of insertional tolerance is necessarily limited by increasing cost and labor constraints. Moreover, the low mapping throughput of existing methods permits analysis only of transposon libraries with low insertion complexity. This translates directly into poor genomic resolution, which also proves problematic because transposase-mediated insertions demonstrate partial sequence bias, and such insertion bias will distort the results of any analysis.
Ultra high-throughput sequencing strategies exist to enable a more saturated transposon mutagenesis strategy that could potentially overcome this bias issue. For example, the advent of primed synthesis in flow cells has allowed for massively parallel sequencing of millions of different short DNA templates. Illumina's sequencing by synthesis (SBS) technology, for instance, can generate up to 55 gigabases (Gb) of sequencing data in a day. In this technology, adaptors complementary to oligomers on a planar, optically transparent flow cell surface are first ligated to the ends of template DNA. Subsequently, adapted single-stranded DNAs are bound to the flow cell and amplified by a process known as solid-phase “bridge” PCR. Specifically, in each PCR cycle, the loose end of a tethered DNA template arches and hybridizes to a tethered oligomer located in the vicinity on the flow cell surface. As PCR proceeds, DNA polymerase creates a double-stranded bridge between the two attached termini, resulting in two anchored single-stranded templates upon denaturation. Subsequent rounds thus generate clusters of clonally amplified DNA of up to 1,000 identical copies, allowing for the direct visualization of fluorescently labeled deoxynucleoside triphosphates (dNTPs). Such a technique results in densities on the order of ten million single-molecule clusters per square centimeter. Following cluster generation, four labeled reversible dNTP terminators, along with primers and DNA polymerase, are introduced into the flow cell for the first cycle of sequencing. Incorporated dNTPs, each bearing a different, discernible fluorophore, are then visualized through laser excitation, after which the label is enzymatically cleaved and washed away for the next cycle. The sequencing cycles are repeated in this manner to determine the sequence of bases in a DNA template, one base at a time.
One of the key innovations underlying this system and related ultra high-throughput adaptor-based sequencing strategies involves the use of adaptor sequences, which allow for in-situ template “bridge” amplification for cluster generation. Such clonal clustering, as noted previously, enables direct fluorescent visualization of high densities of DNA samples for sequencing purposes. However, this approach typically calls for the use of conventional DNA isolation, fragmentation, and ligation protocols, a laborious and time-consuming process that is necessary to ensure that DNA of a suitably small size bears compatible adaptor ends for bridge amplification.
Moreover, attempts to utilize this ultra high-throughput sequencing strategy to overcome the insertion bias of low-density studies by pooling clones and sequencing transposon flanking regions en masse have ultimately resulted in short genomic reads on the order of approximately 16 bp. Given the repetitious nature of genomes, such short genomic sequences would prevent an accurate and robust identification of all transposon insertion sites and thus limit the dissection of the essential genome of an organism.
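The effect of read length on mappability can be illustrated with a minimal sketch that measures what fraction of positions in a sequence yield a read of length k occurring only once in that sequence. The function name `unique_kmer_fraction` is hypothetical, and the sketch makes simplifying assumptions: it considers a single strand only, treats the sequence as linear, and ignores sequencing errors. It is an illustration of the ambiguity problem rather than a description of any particular mapping method.

```python
from collections import Counter


def unique_kmer_fraction(genome: str, k: int) -> float:
    """Fraction of start positions whose k-base read occurs exactly once
    in `genome` (single strand, linear sequence, error-free reads assumed).
    """
    if k <= 0 or k > len(genome):
        raise ValueError("k must be between 1 and the genome length")
    total = len(genome) - k + 1
    # Count every k-base substring of the sequence.
    counts = Counter(genome[i:i + k] for i in range(total))
    # A position is unambiguous if its read is found nowhere else.
    unique = sum(1 for i in range(total) if counts[genome[i:i + k]] == 1)
    return unique / total


if __name__ == "__main__":
    toy = "ACGTACGTTT"  # toy sequence containing a repeated ACGT motif
    for k in (2, 4, 8):
        print(k, unique_kmer_fraction(toy, k))
```

On a repetitive sequence, short reads map to several locations and the unique fraction is low; as k grows, more positions become unambiguous. Running the same calculation on a real genome sequence at k = 16 and at longer read lengths would show how much of that genome a short read can resolve.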