In general, a sequence is an ordered selection of symbols drawn from an alphabet. The positions in a sequence of length n may be numbered from 0 to n−1. Alphabets may include, but are not limited to, the deoxyribonucleic acid (DNA) alphabet (A, C, G, T), the ribonucleic acid (RNA) alphabet (A, C, G, U) and the amino acid alphabet. Sequences that can be represented using such alphabets are called biological sequences (e.g., DNA sequences, RNA sequences, and protein sequences).
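By way of illustration only, the notions of alphabet and position numbering described above may be sketched as follows. The names DNA_ALPHABET, RNA_ALPHABET and is_over_alphabet are hypothetical and chosen for this sketch; the example string is arbitrary.

```python
# Illustrative alphabets; the names are assumptions made for this sketch.
DNA_ALPHABET = frozenset("ACGT")
RNA_ALPHABET = frozenset("ACGU")


def is_over_alphabet(seq, alphabet):
    """Return True if every symbol of seq is drawn from the given alphabet."""
    return all(symbol in alphabet for symbol in seq)


seq = "AACGCTTCGA"

# Positions in a sequence of length n are numbered 0 to n-1.
positions = list(enumerate(seq))

print(is_over_alphabet(seq, DNA_ALPHABET))
print(is_over_alphabet(seq, RNA_ALPHABET))
```

Here the example string is a valid DNA sequence but not a valid RNA sequence, because it contains T rather than U.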
DNA and RNA may be double stranded. It is typically (but not necessarily) the case that each source sequence in a set of source sequences (e.g., obtained from a DNA sequencing machine) is derived arbitrarily from one strand or the other. As a consequence, to interpret the set of source sequences correctly, it is necessary to consider, for each source sequence, the sequence that would arise from reading the complementary strand. This other sequence is called the reverse complement sequence. It is equivalent to the sequence obtained by reversing the original sequence and replacing each symbol with its complementary symbol. For sequence types for which reverse complementation is well defined, each symbol in the alphabet has a complementary symbol (e.g., for DNA, the complements are symmetric: A and T are complementary, as are C and G). For example, the reverse complement of the DNA sequence AACGCTTCGA (SEQ ID NO. 1) is TCGAAGCGTT (SEQ ID NO. 2).
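Reverse complementation as described above may be sketched, for the DNA alphabet only, as follows. This is a minimal illustration rather than a prescribed implementation; the names DNA_COMPLEMENT and reverse_complement are hypothetical, and symbols outside A, C, G and T are not handled.

```python
# Complement table for the DNA alphabet: A pairs with T, and C pairs with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Reverse the sequence and replace each symbol with its complement."""
    # A symbol outside the table raises KeyError; this sketch does not
    # attempt to handle ambiguity codes or other alphabets.
    return "".join(DNA_COMPLEMENT[base] for base in reversed(seq))


print(reverse_complement("AACGCTTCGA"))  # TCGAAGCGTT
```

Because the DNA complements are symmetric, applying the operation twice returns the original sequence.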
Genotypic characterization (genotyping) is an important method for the identification of organisms and the determination of the relationships between them, often using DNA sequence data. Common genotyping techniques include Multiple Locus Variable number tandem repeat Analysis (MLVA) and Multi-Locus Sequence Typing (MLST), which index genetic variation at defined genomic loci (as further explained below) and create multi-locus profiles that are collected in a database. A genome is the entire complement of DNA in a cell, while a locus is a specific position in the genome (e.g., a region encoding a gene, or a single letter (“base”) coordinate). In the case of bacteria, the domain of life to which these techniques are most commonly applied, the genome includes the chromosome(s) and any phage or plasmid sequences contained within the cell.
Bacteria are haploid organisms, i.e., in the usual state there is only a single copy of each chromosome per cell and therefore, in the majority of cases, only a single copy of each locus (exceptions arise when loci are duplicated on the same chromosome or on a plasmid). In contrast, non-haploid organisms have more than one copy of each chromosome per cell and therefore multiple copies of each locus. For example, humans are diploid: there are two copies of each chromosome in each normal cell (excluding germ cells and the X/Y chromosomes).