A common need of software applications is to sort data. Computer systems can work with many kinds of data, but they natively represent only binary values. Written human languages were not designed as mathematical representations, so a computer represents them indirectly, through artificial numerical encodings, rather than as glyphs or symbols written with paper and ink.
Because text strings are not a native data format for a computer, they are typically represented as a sequence of numbers, each representing one character of the text string. Numbers can be represented naturally in a binary system such as a computer, whereas text represents human language through symbols that stand for letters, concepts, or phonemes. Common systems for representing text therefore assign a number to each glyph or character of a chosen language or alphabet. Such systems have been used at least since Émile Baudot used 5-bit representations for letters, punctuation, and control characters in the late 1800s.
Thus, to represent a human-language text string in a computer, a sequence of characters or glyphs is entered as a sequence of the corresponding numerical representations, one for each character or glyph. A common Baudot-type representation or encoding scheme for text strings is the American Standard Code for Information Interchange, or ASCII. In an ASCII encoding scheme, each character is assigned a numeric value. For example, the capital letters from ‘A’ to ‘Z’ are assigned values from 65 to 90 (in decimal), inclusive, while the lower-case letters from ‘a’ to ‘z’ are assigned values from 97 to 122, inclusive.
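The mapping between characters and ASCII values described above may be illustrated with a minimal Python sketch. The function names `encode_ascii` and `decode_ascii` are hypothetical and are introduced only for illustration.

```python
def encode_ascii(text: str) -> list[int]:
    """Return the ASCII numeric value of each character in `text`."""
    # str.encode raises UnicodeEncodeError for characters outside ASCII.
    return list(text.encode("ascii"))


def decode_ascii(codes: list[int]) -> str:
    """Rebuild a text string from a sequence of ASCII numeric values."""
    return bytes(codes).decode("ascii")


codes = encode_ascii("Test")
print(codes)                 # [84, 101, 115, 116]
print(decode_ascii(codes))   # Test
```

In this sketch, the capital ‘T’ maps to 84 and the lower-case ‘t’ maps to 116, so the two forms of the same letter have different numerical representations.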
However, in some cases it is desirable to have different characters regarded as equivalent. For example, when sorting multiple text strings, it may be desirable to treat a capital ‘T’ as being equivalent to a lower-case ‘t.’ The word “test” may be represented by many different text strings, such as with all capital letters (TEST), mixed capital and lower-case letters (Test), or all lower-case letters (test), yet when sorting or comparing these three text strings, in some cases they should be regarded as equivalent. Thus, when using a conventional character set, it may be necessary to temporarily convert all three text strings to an equivalent text string having a normalized representation, such as all capitals (“TEST”), so that when the text strings are compared, they are determined to be equivalent.
Thus, because the same text may be represented in a number of different ways, text strings are typically pre-processed into a consistent, equivalent representation, e.g. all capitals, so that sorting functions and comparisons achieve the expected result. For example, without any pre-processing, a sort function may sort the text string “Test” before the text string “apple” because the capital ‘T’ in “Test” has a lower numerical value (84 in ASCII) than the lower-case ‘a’ (97 in ASCII) in “apple.” In most cases, this is not the expected result when text strings are to be sorted in alphabetical order. Pre-processing avoids this result, but it imposes extra computational overhead, and reformatting text in connection with each comparison can slow the sort process. In addition, if text is stored in a database, this overhead may be incurred each time data is accessed and sorted.
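The difference between sorting on raw numerical values and sorting on a normalized representation may be illustrated with a minimal Python sketch. The word list and the choice of upper-casing as the normalization are assumptions made for illustration only.

```python
words = ["apple", "Test", "banana"]

# Sorting on raw character values: 'T' (84) is less than 'a' (97),
# so "Test" is placed before "apple".
raw_order = sorted(words)

# Sorting on a normalized, all-capitals representation. The key function
# is invoked once per string on every sort, which is the pre-processing
# overhead discussed above.
normalized_order = sorted(words, key=str.upper)

print(raw_order)         # ['Test', 'apple', 'banana']
print(normalized_order)  # ['apple', 'banana', 'Test']
```

The normalized sort returns the expected alphabetical order, but only because each text string is converted before it is compared.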
An alternative approach is to store both a normalized version of a text string (such as a so-called sort key) and the original text string. Sorts may then be performed based on the normalized text string, while a request for the text string, such as to display it to a user, returns the original text string. However, storing two versions or copies of the same data uses additional resources and may be inefficient. Thus, it would be advantageous to represent text strings in a manner that allows a representation of a desired set of characters, but also allows for sorting without the need for pre-processing to ensure the sort function returns the expected results. It would also be advantageous for such a representation not to consume additional memory to store multiple versions of the same text string.
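The sort-key approach and its storage cost may be illustrated with a minimal Python sketch. The `StoredString` type, the `store` function, and the use of upper-casing as the sort key are hypothetical and are chosen only for illustration.

```python
from typing import NamedTuple


class StoredString(NamedTuple):
    sort_key: str   # normalized copy, used only for sorting
    original: str   # text string as entered, returned for display


def store(text: str) -> StoredString:
    """Compute the sort key once, when the text string is stored."""
    return StoredString(sort_key=text.upper(), original=text)


records = [store(w) for w in ["test", "apple", "Test", "TEST"]]

# No per-sort normalization is needed: the stored keys are compared directly.
records.sort(key=lambda r: r.sort_key)
originals = [r.original for r in records]
print(originals)  # ['apple', 'test', 'Test', 'TEST']

# Every character is held twice, once in the key and once in the original.
original_chars = sum(len(r.original) for r in records)
stored_chars = sum(len(r.sort_key) + len(r.original) for r in records)
print(original_chars, stored_chars)  # 17 34
```

In this sketch the three spellings of “test” share the same sort key and keep their insertion order, because Python's sort is stable, while the stored character count is double that of the original text strings.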