Computer systems and processors handle character strings, that is, sequences of letters, numbers, symbols, and the like, based on sets of standardized character codes. A common operation on character strings is sorting, also known as collation. Collation is one of the fundamental operations on computers, and is used in practically every application.
Generally, it is straightforward to determine the simple ordering of characters based on a primary “strength” difference. For example, “a” has a primary strength difference from “b”. However, characters can also differ from each other in more subtle ways at lower levels of strength, such as case, contractions, accent markings, etc. For example, character strings may sort differently based on whether they include upper-case versus lower-case characters (e.g., “A” versus “a”). Character strings may also sort differently based on whether they contain sequences that are treated as contractions or expansions. For example, in Slovak, “ch” is sorted as if it were contracted to a single letter after “h”. As another example, in German phonebook ordering, “ä” is sorted as if it were expanded to “ae”.
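The multiple levels of strength described above can be illustrated with a minimal Python sketch. It is not the method of any particular collation library; the function name `collation_key`, the `contractions` and `expansions` tables, and the weight values are all hypothetical and chosen only for illustration. The sketch builds a three-part key in which base letters are compared first (primary), accent marks second (secondary), and case third (tertiary). An expanded character is given no secondary weight here, so “ä” and “ae” tie under the expansion table.

```python
import unicodedata

def collation_key(text, contractions=None, expansions=None):
    """Return a toy (primary, secondary, tertiary) sort key for text.

    contractions: maps a lowercase two-character sequence to a primary weight.
    expansions: maps a lowercase character to the string it sorts as.
    Both tables are hypothetical and supplied by the caller.
    """
    contractions = contractions or {}
    expansions = expansions or {}
    primary, secondary, tertiary = [], [], []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair.lower() in contractions:
            # Contraction: two characters form one collation element.
            primary.append(contractions[pair.lower()])
            secondary.append(())
            tertiary.append(1 if pair[0].isupper() else 0)
            i += 2
            continue
        ch = text[i]
        i += 1
        upper = 1 if ch.isupper() else 0
        lower = ch.lower()
        if lower in expansions:
            # Expansion: one character yields several collation elements.
            for n, c in enumerate(expansions[lower]):
                primary.append((ord(c), 0))
                secondary.append(())
                tertiary.append(upper if n == 0 else 0)
            continue
        # Split a base letter from its combining accent marks.
        decomposed = unicodedata.normalize("NFD", lower)
        primary.append((ord(decomposed[0]), 0))
        secondary.append(tuple(ord(m) for m in decomposed[1:]))
        tertiary.append(upper)
    return (primary, secondary, tertiary)

# Hypothetical tailorings: "ch" gets a primary weight just after "h",
# and U+00E4 (a with diaeresis) sorts as "ae".
slovak = {"ch": (ord("h"), 1)}
german_phonebook = {"\u00e4": "ae"}

print(sorted(["chata", "cesta", "hora", "ihla"],
             key=lambda s: collation_key(s, contractions=slovak)))
print(sorted(["B\u00e4cker", "Bad", "Baden"],
             key=lambda s: collation_key(s, expansions=german_phonebook)))
```

With the contraction table, “chata” sorts after “hora” rather than after “cesta”; with the expansion table, “Bäcker” sorts after “Baden” rather than before “Bad”. Without either table, the same key falls back to plain base-letter ordering.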
Unfortunately, different languages, such as English, Swedish, Hungarian, and Japanese, have very different conventions for alphabetically ordering (or collating) strings of text. It can be quite difficult to precisely determine what the alphabetical order should be for a given language due to the multiple levels of strength in which characters may differ. In addition, across different languages, there can be tremendous variety in terms of how sequences of one or more characters are handled. Some nations have standards that specify how to perform alphabetic sorting, but many do not. Even if a standard exists, it may have multiple options. For example, Deutsches Institut für Normung (“DIN”) standard 5007 for German collation provides multiple options for sorting text. This often leads to a wide variety of implementations for sorting even under the same standard.
Collation may also vary by specific application, even within the same language. Dictionaries may sort differently than phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character. Collation can also be customized or configured according to user preference, such as whether to ignore punctuation, whether to put uppercase before lowercase (or vice versa), etc. Thus, collation implementations must often deal with complex linguistic conventions and provide for common customizations based on market or user preferences.
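As a rough illustration of such user-configurable behavior, the following Python sketch exposes two of the options mentioned above as parameters. The function `custom_key` and its parameters `ignore_punctuation` and `upper_first` are hypothetical names introduced here, and the handling is deliberately simplified (ASCII punctuation only, no accent level); it is a sketch under those assumptions rather than a description of any real collation API.

```python
import string

def custom_key(text, ignore_punctuation=False, upper_first=False):
    """Return a toy sort key honoring two hypothetical user options."""
    if ignore_punctuation:
        # Drop ASCII punctuation so it has no effect on ordering.
        text = "".join(c for c in text if c not in string.punctuation)
    # Primary comparison ignores case entirely.
    base = text.casefold()
    # Case is consulted only when the case-folded strings tie.
    if upper_first:
        case = [0 if c.isupper() else 1 for c in text]
    else:
        case = [1 if c.isupper() else 0 for c in text]
    return (base, case)

words = ["coop", "coat", "co-op"]
print(sorted(words, key=custom_key))
print(sorted(words, key=lambda s: custom_key(s, ignore_punctuation=True)))
print(sorted(["apple", "Apple"], key=lambda s: custom_key(s, upper_first=True)))
```

By default the hyphen in “co-op” participates in the comparison and places it before “coat”; with `ignore_punctuation=True`, “co-op” and “coop” compare as equal and keep their input order. The `upper_first` flag only reverses the tie-break between strings that differ in case alone.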
Despite these difficulties, it is increasingly important to provide collation tools and methods that can replicate the precise ordering used by different cultures and different systems. Collation is a key function in computer systems, for example, whenever a list of strings is presented to users in a sorted order so that they can easily and reliably find individual strings. Collation is also crucial for the operation of databases, not only in sorting records but also in selecting sets of records with fields within given bounds. Therefore, it would be desirable to provide methods and systems that are capable of determining the order appropriate for a given language, location, or application. It may also be desirable to provide methods and systems that can automatically gather and implement the unique rules for collation of a particular language.