When performing data processing, a computer is used to take data, which may be represented as a structure in computer memory and/or as a file format on persistent computer storage, and to perform operations, called data operations, on the data. Data operations are typically performed on data that is demarcated into discrete sets, called data sets. Typical data operations on data sets in the course of data processing may include searching, which is retrieval of a desired subset of a data set; sorting, which is re-organizing the data set; and transformation, which is converting the data set from one representation to another.
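As a brief illustration, the three data operations named above can be sketched on a small in-memory data set. The list of reads, the search pattern "ACG", and the integer coding in CODES are all hypothetical values chosen for the example; they are not taken from any particular system.

```python
# Hypothetical data set: a list of short DNA reads (illustrative values only).
data_set = ["GATTACA", "ACGT", "TTGACA", "ACGTAC"]

# Searching: retrieve the desired subset (here, reads containing "ACG").
found = [read for read in data_set if "ACG" in read]

# Sorting: re-organize the data set (here, into lexicographic order).
ordered = sorted(data_set)

# Transformation: convert the data set from one representation to another
# (here, from text to lists of integer codes; the coding is an assumption).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
encoded = [[CODES[base] for base in read] for read in data_set]
```

Each operation leaves the original data set unchanged and produces a new result, which keeps the three steps independent of one another in this sketch.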
Over time, the processing power available for data processing has increased rapidly, but in many cases the amount of data to which data processing techniques are applied has increased even more rapidly. Accordingly, there is a need for improved searching, sorting, transformation, and other data operations.
Data operations are generally improved either by reducing the amount of working memory used to perform the operation, or by improving the processing efficiency of the operation so as to reduce processing time. In most cases, the amount of working memory and the processing efficiency present an optimization tradeoff. Reducing the amount of working memory used in an operation often results in lower processing efficiency. Conversely, increasing processing efficiency often results in a larger amount of memory used during processing. It is relatively rare to achieve reduced memory utilization and greater processing efficiency in the same optimization.
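The tradeoff described above can be sketched with two ways of searching a sequence for a short pattern. This is a minimal illustration under assumed names (find_scan, build_index) and an assumed sample sequence, not a description of any specific prior technique: the scan uses no extra memory but re-reads the whole sequence on every query, while the index answers queries with a single lookup at the cost of storing every substring position.

```python
from collections import defaultdict


def find_scan(sequence, pattern):
    """Low memory: rescan the entire sequence on every query."""
    k = len(pattern)
    return [i for i in range(len(sequence) - k + 1)
            if sequence[i:i + k] == pattern]


def build_index(sequence, k):
    """Higher memory: record the position of every k-length substring once."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


sequence = "GATTACAGATTACA"          # hypothetical sample data
index = build_index(sequence, 3)      # memory grows with the sequence length

scan_hits = find_scan(sequence, "TTA")   # time grows with the sequence length
index_hits = index.get("TTA", [])        # a single dictionary lookup
```

Both approaches return the same positions. The index holds one entry per position in the sequence, so for a very large data set it may not fit in the memory available.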
Nonetheless, for large data sets, which are data sets so large that performing data operations on them is too slow to enable interactive processing, improving processing efficiency at the expense of memory utilization may render the optimization impractical. Increasing the size of a very large data set may result in the amount of memory utilized being larger than the amount of memory available. Accordingly, even if an optimization's improvement to the processing of a data operation is significant, it may not be available for implementation because the amount of available memory is insufficient. Thus, many optimization techniques are impractical for large data set applications.
Presently, there are many large data set applications. Examples include document processing, image processing, multimedia processing, and bioinformatics. In the case of bioinformatics, the data processed comprises genetic information that defines an organism. Genetic information comprises a series of base pairs, namely adenine-thymine and guanine-cytosine. Generally, the more complex the organism, the more base pairs are used to define it. For example, the Escherichia coli bacterium uses approximately 4.6 million base pairs. In contrast, simple viruses may use as few as a few thousand base pairs.
A major application of bioinformatics is the analysis of genetic conditions in human beings in the search for medical therapies. The genetic information for a human being comprises approximately 3.2 billion base pairs. Accordingly, every byte allocated to a base pair in an effort to improve processing potentially adds an additional 3.2 GB of working memory. When performing sequence comparisons, with different instances of human beings or other organisms under analysis, the amount of memory used during data processing may rapidly expand to an unmanageable amount.
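The memory arithmetic above can be checked with a short calculation. The sketch below assumes the approximate figure of 3.2 billion base pairs stated in the text and, purely for comparison, a common two-bits-per-base packing (four possible letters fit in two bits). The packing and the helper names bytes_needed, pack, and unpack are illustrative assumptions, not a technique described in this document.

```python
BASE_PAIRS = 3_200_000_000  # approximate human genome size, as stated in the text


def bytes_needed(bits_per_base, n=BASE_PAIRS):
    """Bytes required to store n bases at the given bit width, rounded up."""
    return (n * bits_per_base + 7) // 8


one_byte_per_base = bytes_needed(8)   # one byte allocated to each base
two_bits_per_base = bytes_needed(2)   # four bases packed into each byte

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit coding
LETTERS = "ACGT"


def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(packed, length):
    """Recover the original string; length is needed because of padding."""
    return "".join(LETTERS[(packed[i // 4] >> (2 * (i % 4))) & 3]
                   for i in range(length))
```

Under these assumptions, one byte per base costs 3.2 billion bytes per genome, while the packed form costs 800 million bytes. The packed form needs extra shifting and masking on every access, which is another instance of trading processing work for memory.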
Accordingly, there is a need for techniques to improve processing speed of data operations on large data sets, such as in bioinformatics, while reducing the amount of memory used.