1. Field of the Invention
The invention generally relates to the field of data processing and more particularly to dictionary sorting of data.
2. Background Information
Sorting in general is well developed and optimized for putting a sequence of numbers into increasing or decreasing numerical order. See, for instance, Chapter 8 (Sorting) of WILLIAM H. PRESS, et al., NUMERICAL RECIPES IN C, Cambridge University Press, 1988. Routines for sorting other forms of data are often derived from the routines developed for sorting numbers. However, routines thus derived typically do not give optimal solutions to the problems associated with sorting non-numeric data, because non-numeric data typically has special characteristics that make it poorly suited to routines derived from numerical sorting.
For example, textual data is composed of characters, and an often-used sorting order for textual data is dictionary order. When two words or sentences are compared, the first character of each is compared first, then the second character of each is compared if the first characters were the same, and so forth. Thus, one comparison of text is constructed of several numerical comparisons. What is needed is a method of sorting that takes advantage of the characteristics of textual data.
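By way of illustration only, the character-by-character comparison described above may be sketched as follows. The function name dictionary_compare is hypothetical, and the sketch assumes that characters are ordered by their numeric code values and that a text which is a prefix of another sorts first; neither assumption is stated in the foregoing description.

```python
def dictionary_compare(a: str, b: str) -> int:
    """Return -1, 0, or 1 as `a` sorts before, equal to, or after `b`."""
    for ca, cb in zip(a, b):
        # Each step is one numerical comparison of two character codes.
        if ord(ca) != ord(cb):
            return -1 if ord(ca) < ord(cb) else 1
    # Every shared position matched. Assumption: the shorter text sorts first.
    if len(a) == len(b):
        return 0
    return -1 if len(a) < len(b) else 1


if __name__ == "__main__":
    # "apple" and "apply" agree on four characters and differ on the fifth,
    # so this single text comparison costs five numerical comparisons.
    print(dictionary_compare("apple", "apply"))
```

The number of numerical comparisons grows with the length of the shared prefix, which is the characteristic of textual data that a general-purpose numerical sort does not exploit.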
Moreover, dictionary sorting is an integral part of the Burrows-Wheeler transform as described by Burrows and Wheeler (M. Burrows and D. J. Wheeler, A Block-sorting Lossless Data Compression Algorithm, Digital Systems Research Center Research Report 124, http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html). Implementing this transform efficiently requires a method of sorting that is close to optimum for dictionary sorting of text. Thus, what is needed is a method of sorting textual data that is closer to optimal than the methods derived from methods of sorting numerical data.
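To show where dictionary sorting enters the transform, the following minimal sketch computes the forward Burrows-Wheeler transform by sorting all rotations of a block of text with a general-purpose sort. The function name bwt_by_sorting is hypothetical, the text is assumed to be non-empty, and the sketch is a naive illustration rather than the efficient implementation that the foregoing paragraph calls for.

```python
from typing import Tuple


def bwt_by_sorting(text: str) -> Tuple[str, int]:
    """Return (last column of the sorted rotations, row of the original text).

    Assumption: `text` is non-empty.
    """
    n = len(text)
    # Each rotation is identified by its starting index; the rotations are
    # put into dictionary order. This sort dominates the cost of the transform.
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    # The last character of the rotation starting at i is the character
    # immediately before position i, wrapping around at the start.
    last_column = "".join(text[(i - 1) % n] for i in order)
    return last_column, order.index(0)


if __name__ == "__main__":
    print(bwt_by_sorting("banana"))
```

Because the sort key here rebuilds an entire rotation for every comparison, the sketch makes plain why a sort designed for text is desirable for this transform.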
The invention involves a method of sorting a text document composed of a sequence of characters. The method comprises counting each character of the sequence of characters that is pointed to by a marker. The method further comprises sorting the markers into a set of groups, each group corresponding to a distinct value of the characters in the sequence of characters, the groups being created based on the count of each distinct value. The method further comprises repeating, for each group of the set of groups containing more than one marker, the steps of counting each character following the character previously counted for each marker in that group, and sorting the markers within that group into further groups of the set of groups, each further group corresponding to a distinct value of the characters in the sequence of characters and being created based on the count of each distinct value, until no group contains more than one marker.
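The counting and grouping steps summarized above may be sketched, under stated assumptions, as follows. The names sort_markers, char_at, and END are hypothetical. The sketch assumes that each marker is a starting position in the text, that one marker exists for every position, and that a marker which runs past the end of the text is treated as pointing to a value that sorts before every character; the end-of-text handling in particular is an assumption made so that the example terminates, and is not taken from the foregoing description.

```python
from collections import Counter


def sort_markers(text):
    """Return markers (start positions) ordered by the text that follows them."""
    END = -1  # assumed sentinel: past the end sorts before any character

    def char_at(marker, depth):
        pos = marker + depth
        return ord(text[pos]) if pos < len(text) else END

    result = []
    # Each work item is (group of markers, offset of the next character to count).
    stack = [(list(range(len(text))), 0)]
    while stack:
        group, depth = stack.pop()
        if len(group) == 1:
            result.append(group[0])  # a group of one marker is fully sorted
            continue
        # Count each distinct character value pointed to at this offset.
        counts = Counter(char_at(m, depth) for m in group)
        values = sorted(counts)
        # Convert the counts into the starting slot of each value's group.
        offsets, start = {}, 0
        for v in values:
            offsets[v] = start
            start += counts[v]
        # Place every marker into the group for its character value.
        placed = [None] * len(group)
        cursor = dict(offsets)
        for m in group:
            v = char_at(m, depth)
            placed[cursor[v]] = m
            cursor[v] += 1
        # Refine each further group on the following character. Pushing in
        # reverse makes the stack yield groups in ascending order.
        for v in reversed(values):
            sub = placed[offsets[v]:offsets[v] + counts[v]]
            stack.append((sub, depth + 1))
    return result


if __name__ == "__main__":
    print(sort_markers("banana"))
```

In this sketch no two texts are ever compared directly. Each pass only counts character values and distributes markers by those counts, and a group is revisited only while it still holds more than one marker.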