Collation is the general term for the process and function of determining the sorting order of strings of characters. It is a key function in computer systems. Collation lets users find individual character strings more easily and reliably, so it is widely used in user interfaces and in searches. It is also crucial for the operation of databases, not only in sorting records but also in selecting sets of records whose fields fall within given bounds in a search.
Collation is not uniform; it varies according to language and culture. Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently from phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character. Collation is also commonly customized or configured according to user preference, for example by ignoring punctuation or not, or by putting uppercase before lowercase or vice versa. Linguistically correct searching needs to use the same mechanisms. For example, just as “v” and “w” sort as if they were the same base letter in Swedish, a loose search should pick up words that use either one of the letters.
Thus collation implementations deal with the often-complex linguistic conventions that communities of people have developed over the centuries for ordering text in their language, and provide for common customizations based on user preferences. Throughout, performance remains critical in terms of search time and storage. Binary sorts, for example those using B-Trees, depend on the value ascribed to each character. Binary sorts use the <, ≤, >, ≥, and = operators to choose between different branches of the B-Tree. Different languages assign different values to the same characters. For example, in Swedish, z<Ö, but in German, Ö<z. Other differences involve whether comparisons are case sensitive (CS) or case insensitive (CI), accent sensitive (AS) or accent insensitive (AI), width sensitive or width insensitive, and kana sensitive or kana insensitive.
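The sensitivity options just described can be illustrated with a minimal Python sketch. It is not taken from any particular database product. The helper name `collation_key` and its parameters are hypothetical, and kana sensitivity is omitted for brevity. Insensitivity is approximated by folding away the ignored distinction before comparison.

```python
import unicodedata

def collation_key(text, case_sensitive=True, accent_sensitive=True,
                  width_sensitive=True):
    """Hypothetical helper: fold away the distinctions a collation ignores."""
    if not width_sensitive:
        # NFKC maps full-width forms such as 'Ａ' to their ordinary forms.
        # Assumption: NFKC also folds other compatibility characters, which
        # is broader than width alone but adequate for this sketch.
        text = unicodedata.normalize("NFKC", text)
    if not accent_sensitive:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    if not case_sensitive:
        # casefold() is a more aggressive lower() intended for comparisons.
        text = text.casefold()
    return text

# Under a case-insensitive, accent-insensitive (CI, AI) comparison,
# the two strings below produce the same key.
ci_ai_equal = (
    collation_key("Résumé", case_sensitive=False, accent_sensitive=False)
    == collation_key("RESUME", case_sensitive=False, accent_sensitive=False)
)
```

Two strings compare as equal under a given set of sensitivities when their keys are equal, and sorting by the key yields the corresponding order for this simplified model.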
The conventions that people have developed over the centuries for collating text in their language are often quite complex. Languages vary not only regarding which types of sorts to use and in which order they are to be applied, but also in what constitutes a fundamental element for sorting. For example, Swedish treats ä as an individual letter, sorting it after z in the alphabet; German, however, sorts it either like ae or like other accented forms of a, thus following a in value. In Slovak, the digraph ch sorts as if it were a separate letter after h. Examples from other languages and scripts abound.
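The Swedish and German treatment of ä can be sketched with two simplified sort-key functions. This is an illustration under stated assumptions, not a complete collation algorithm: the names `swedish_key` and `german_key` are hypothetical, only the lowercase Swedish alphabet is modeled, and the German key uses the "accented form of a" convention rather than the "ae" convention.

```python
import unicodedata

# Assumption: a simplified Swedish alphabet in which å, ä, ö follow z.
SWEDISH_ALPHABET = unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzåäö")
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    lowered = unicodedata.normalize("NFC", word.lower())
    # Characters outside the modeled alphabet sort after all known letters.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in lowered]

def german_key(word):
    """Compare base letters first; accents only break ties."""
    lowered = unicodedata.normalize("NFC", word.lower())
    decomposed = unicodedata.normalize("NFD", lowered)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, lowered)

words = ["zebra", "äpple", "apple"]
swedish_order = sorted(words, key=swedish_key)  # ['apple', 'zebra', 'äpple']
german_order = sorted(words, key=german_key)    # ['apple', 'äpple', 'zebra']
```

The same three words come out in different orders, which is why a sorted index built under one collation cannot simply be reused under another.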
Databases use collation rules to search for terms in their stored data. A search for a given word or term using one collation can yield different results from the same search conducted using a different collation. This is intentional and expected, because collation rules are as distinct as the human languages they serve. A collation rule set for a given language is tailored to that language so that it yields language-specific results for its respective database. But, as indicated above, there are many collations for a single language. For example, a single language may have sensitivities according to character case, character width, accent use, and kana. Because each of the four sensitivities can be either on or off, there are 2 × 2 × 2 × 2 = 16 permutations of the different collations for a single language. Assuming there are 50 different languages, there are then 50 × 16 = 800 different possible collations.
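The arithmetic behind these figures can be checked with a short Python sketch that enumerates every on/off setting of the four sensitivities. The constant and function names are hypothetical, and the count of 50 languages is the assumption stated in the text.

```python
from itertools import product

SENSITIVITIES = ("case", "accent", "width", "kana")
LANGUAGE_COUNT = 50  # assumption taken from the text

def sensitivity_settings():
    """Return every on/off assignment of the four sensitivities."""
    return [dict(zip(SENSITIVITIES, flags))
            for flags in product((True, False), repeat=len(SENSITIVITIES))]

settings = sensitivity_settings()
collations_per_language = len(settings)                       # 16
total_collations = collations_per_language * LANGUAGE_COUNT   # 800
```

Each entry in `settings` is one distinct collation variant for a language, for example case sensitive, accent insensitive, width sensitive, and kana insensitive.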
FIG. 1 depicts a server 10, such as Microsoft's Structured Query Language Server (SQL Server™), available from Microsoft® in Redmond, Wash. The server 10 contains a centralized system resource database 20 that is available to provide data, methods, and services to databases 30, 40 and 50. Each of the databases 30, 40 and 50 can search the system resource database 20 using a database query. Each database may support a different human language and therefore may have the 16 collations associated with that specific language. For example, database 1 (30) may be a Japanese language database having sensitivities relating to case, width, accent and kana, and thus may have 16 different permutations of collations covering those sensitivities.
A query initiated in database 1 (30) may be run against the system resource database 20. The system resource database is required to be capable of accommodating 50 human language collation sets, each set having up to 16 different permutations. Accordingly, the system resource database should reasonably be expected to be capable of handling up to 800 collation rule sets. Given that up to 800 different search rule sets may be applicable, the system resource database 20 may need to be large to accommodate them. Searching up to 16 rule sets per language can also slow down the return of query results to the supported databases 30, 40 and 50.
Thus, there is a need for a method and system that support searches between databases and that can accommodate different human languages in a time-efficient and space-efficient manner. The present invention addresses the aforementioned needs and provides additional advantages as expressed herein.