Various notational systems have been used to encode classes of chemical units by assigning a unique code to each chemical unit in the class. For example, a conventional notational system for encoding amino acids assigns a single letter of the alphabet to each known amino acid. A polymer of chemical units may be represented in such a notational system by a set of codes corresponding to the chemical units. Such notational systems have been used to encode polymers, such as proteins, in a computer-readable format. A polymer that has been represented in such a computer-readable format according to a notational system may be stored and processed by a computer.
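By way of illustration only, the single-letter amino acid notation described above may be sketched as follows. The table holds the standard one-letter codes for the twenty common amino acids, keyed by their three-letter abbreviations. The function names `encode_polymer` and `decode_polymer` are hypothetical and are chosen here only for the example.

```python
# Standard one-letter codes for the 20 common amino acids,
# keyed by their three-letter abbreviations.
ONE_LETTER = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Reverse table, used to recover residue names from a code string.
THREE_LETTER = {code: name for name, code in ONE_LETTER.items()}


def encode_polymer(residues):
    """Encode an ordered list of three-letter residue names as a code string."""
    return "".join(ONE_LETTER[name] for name in residues)


def decode_polymer(codes):
    """Recover the ordered list of residue names from a code string."""
    return [THREE_LETTER[code] for code in codes]


# Hypothetical four-residue peptide, used only to exercise the functions.
peptide = ["Met", "Ala", "Gly", "Lys"]
encoded = encode_polymer(peptide)
assert decode_polymer(encoded) == peptide
```

Because each chemical unit maps to exactly one code, the encoding is reversible, and the resulting string may be stored and processed like any other character data.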
Conventional notational schemes for representing chemical units have represented the chemical units as characters (e.g., A, T, G, and C for nucleic acids), and have represented polymers of chemical units as sequences or sets of characters. Various operations may be performed on such a notational representation of a chemical unit or a polymer composed of chemical units. For example, a user may search a database of chemical units for a query sequence of chemical units. In such a case, the user typically provides a character-based notational representation of the sequence in the form of a sequence of characters, which is compared against the character-based notational representations of sequences of chemical units stored in the database. Character-based searching algorithms, however, are typically slow because such algorithms search by comparing individual characters in the query sequence against individual characters in the sequences of chemical units stored in the database. The speed of such algorithms is therefore related to the length of the query sequence, resulting in particularly poor performance for long query sequences.
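The character-by-character comparison described above may be sketched, under simplifying assumptions, as a naive substring search. This is not a reproduction of any particular conventional tool. The function name `naive_search`, the comparison counter, and the sample sequences are hypothetical and are included only to make the cost of the comparison visible.

```python
def naive_search(query, database):
    """Search every stored sequence for the query, one character at a time.

    Returns (hits, comparisons), where hits is a list of
    (record_index, offset) pairs and comparisons counts the individual
    character comparisons performed.
    """
    hits = []
    comparisons = 0
    for record_index, sequence in enumerate(database):
        # Try every offset at which the query could still fit.
        for offset in range(len(sequence) - len(query) + 1):
            matched = True
            for k, query_char in enumerate(query):
                comparisons += 1
                if sequence[offset + k] != query_char:
                    matched = False
                    break
            if matched:
                hits.append((record_index, offset))
    return hits, comparisons


# Hypothetical nucleic acid records, used only for illustration.
database = ["ATGCATGC", "GGATCC"]
hits, comparisons = naive_search("ATG", database)
```

In the worst case each candidate offset requires as many character comparisons as the query has characters, so the work performed grows with the length of the query sequence.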
The study of molecular and cellular biology is focused on the macroscopic structure of cells. We now know that cells have a complex microstructure that determines the functionality of the cell. Much of the diversity associated with cellular structure and function is due to the ability of a cell to assemble various building blocks into diverse chemical compounds. The cell accomplishes this task by assembling polymers from a limited set of building blocks referred to as monomers. The key to the diverse functionality of polymers lies in the primary sequence of the monomers within the polymer. Knowledge of that sequence is integral to understanding the basis for cellular function, such as why a cell differentiates in a particular manner or how a cell will respond to treatment with a particular drug.
The ability to identify the structure of polymers by identifying their sequence of monomers is integral to the understanding of each active component and the role that component plays within a cell. By determining the sequences of polymers, it is possible to generate expression maps, to determine what proteins are expressed, to understand where mutations occur in a disease state, and to determine whether a polysaccharide functions better or loses function when a particular monomer is absent or mutated.