This invention relates in general to database management systems performed by computers. More particularly, this invention relates to efficient processing of database queries involving selection against heterogeneous data.
Heterogeneous data access is becoming an important feature in database systems because it facilitates the integration of information from a variety of data sources. The performance of queries involving external data sources is greatly affected by the cost of transporting data over the network. This cost can be reduced by pushing down subqueries, particularly selections and projections, to the external data source. However, if the external data source uses a different collating sequence, predicates consisting of comparison of a column (also known as a field) with a string constant cannot be pushed down unchanged. A xe2x80x9cpredicatexe2x80x9d is a qualifier or condition for a search.
A collating sequence is essentially the computer equivalent of alphabetical order. One typical collating sequence is the order of characters based on their binary values in ASCII (American Standard Code for Information Interchange) and another typical collating sequence is the order of characters based on their binary values in EBCDIC (Extended Binary-Coded Decimal (BCD) Interchange Code). Although these two sequences have similar characters or collating elements, their collating sequences (sort orders) are significantly different, as shown in FIG. 1. For example, both collating sequences contain all the capital and lower case letters of the (English) alphabet, but in the ASCII capital letters all sort before lower case letters but in EBCDIC this is reversed. Also, in ASCII digits sort before letters but not in EBCDIC.
A collating sequence is separate from a character (code) set: two countries or languages may use the same character set but have different collating sequences. A list of different language identifiers having similar character sets but different collating sequences is shown in Table 1. The purpose of a collating sequence is to define a total order on character strings. This is done by means of weights assigned to collating elements, which most often correspond to a single character but may consist of a sequence of characters. For example, in Spanish, the sequence xe2x80x9c11xe2x80x9d is treated as a single collating element. In the simplest case, each character is assigned a unique weight. To determine the ordering of two strings, the weights of the characters are compared proceeding from left to right. If the two strings compare equal on the first level weights, they are compared on second level weights (if any), then on third level weights, and so on.
Many database systems, including SQL Server, support queries over heterogeneous data sources. If the system detects that some tables referenced in a query are managed by external data sources, the system decomposes the query into a set of single-source queries, submits them, and performs any additional processing needed to integrate the data returned from the local queries. External data sources differ greatly in their query capabilities and interfaces, ranging from simple one-table sources with no query capability to full-fledged SQL database systems. Many aspects of heterogeneity in multidatabase systems, such as data model differences, naming differences, format differences, structure differences, and conflicting data, have been studied and are described in Bright, M. W. et al., xe2x80x9cA taxonomy and current issues in multidatabase systemsxe2x80x9d, IEEE Computer, March 1992, pages 50-60.
Many data sources have selection (i.e., filtering) and projection capability. A query optimizer decomposes the query and generates an overall, efficient plan. Often this means delegating as much as possible of the processing to the external data sources, taking into account their processing capabilities. In particular, pushing down selections and projections to an external data source is usually desirable because it reduces the amount of data shipped over the network. However, if the external data source uses a different collating sequence, predicates- involving string comparisons cannot be pushed down unchanged. In practice, systems do not push down such predicates, thereby resulting in a loss of efficiency.
For example, consider a SQL Server running on a Windows NT system that uses the (binary) ASCII collating sequence (WINDOWS NT(copyright) is a registered trademark of Microsoft Corp.). Among other things, it provides access to a course table that is stored in a DB2 database running on a mainframe system that uses the (binary) EBCDIC collating sequence. (DB2 is a trademark of International Business Machines Corp.) Suppose the SQL Server receives the query xe2x80x9cselect CourseNo from course where CourseNo is between xe2x80x98CH020xe2x80x99 and xe2x80x98CS499xe2x80x99xe2x80x9d, and the table has 1000 records and the query will select 100 of the records. If the selection predicate can be pushed down and evaluated at the source, the data communication cost would be reduced by 90% because only the data satisfying the selection predicate would need to be transferred to SQL server.
However, the two systems use different collating sequences, and so the query cannot just be pushed unchanged. Doing so would produce the wrong result. As shown in FIG. 1, xe2x80x98Exe2x80x99 collates after xe2x80x980xe2x80x99 in the (binary) ASCII collating sequence but before xe2x80x980xe2x80x99 in the (binary) EBCDIC collating sequence. Thus, for example, courses with prefix xe2x80x98CHExe2x80x99 will be selected if the query is evaluated by SQL Server, but not if it is evaluated by DB2. Typical systems handle queries of this type by retrieving all course rows and evaluating the predicate at the query site. There is a need for a mechanism to convert comparisons between a string valued column (field) and a constant to correctly account for differences in collating sequences.
Many atomic SQL predicates are independent of the collating sequence; this is true for operators such as IS NULL and IS NOT NULL. Similarly, comparisons of numeric data types (integer, float, decimal) and dates are independent of the collating sequence. BETWEEN and NOT BETWEEN are simply shorthand notation for two comparisons. IN and NOT IN followed by a set of string constants can be transformed into a set of predicates with equals (=) and not equal ( less than  greater than ) operators. Therefore, the predicates requiring conversion are mainly string comparisons using one of the operators  less than , xe2x89xa6,  greater than , and xe2x89xa7. Predicates with operators=and  less than  greater than may require conversion when some characters have the same weight in the collating sequence.
Although the art of database query processing is well developed, there remain some problems inherent in this technology, particularly sending a query between databases having different collating sequences. Therefore, a need exists for compensating for differences in collating sequences using predicate conversion that overcomes the drawbacks of the prior art.
The present invention is directed to a method for use in a heterogeneous database environment including a first data source using a first collating sequence (I1) and a second data source using a second collating sequence (I2). The method comprises the steps of: receiving a query including a source predicate (e1) for the first collating sequence I1; converting the source predicate e1 to a target predicate (e2) for the second collating sequence I2; and submitting the target predicate e2 to the second data source.
According to one aspect of the present invention, the source predicate e1 is converted to the target predicate e2 in accordance with a conversion table.
According to another aspect of the present invention, the method further comprises the step of simplifying the target predicate e2 prior to submitting the converted query to the second data source. The step of simplifying comprises at least one of the steps of combining range values, exploiting domain constraints, and reducing to prime implicants. Combining range values comprises the steps of obtaining a set of intervals defined in the target predicate e2, and combining the set of intervals into at least one larger interval by taking the union of some of the intervals. Exploiting domain constraints comprises the steps of: detecting at least one external constraint at the second data source; and modifying the target predicate e2 taking into account this constraint. Reducing to prime implicants comprises the steps of: converting the target predicate e2 to conjunctive normal form; eliminating redundant atomic predicates from the target predicate e2; converting the target predicate e2 to disjunctive normal form; deleting contradictory terms and empty ranges from the target predicate e2; and converting the target predicate e2 to conjunctive normal form.
According to another aspect of the present invention, each of the first and second collating sequences comprises either ASCII or EBCDIC. (This is just an example, not a condition for predicate conversion. The source and target collating sequences are not limited to ASCII and EBCDIC.)
According to a further aspect of the present invention, the step of converting comprises a step of imprecise conversion of the source predicate e1. According to one method of imprecise conversion, the source predicate e1 comprises a plurality of predicates, and the step of imprecise converting comprises discarding at least one of the plurality of predicates. According to another method of imprecise converting, the source predicate e1 comprises a string of characters, and the step of imprecise conversion comprises truncating the string of characters. According to another method of imprecise conversion, the source predicate e1 comprises a plurality of ranges, and the step of imprecise conversion comprises merging at least two of the ranges into a larger range covering the two input ranges.
Another embodiment within the scope of this invention includes a conversion table for use in the method described above. Preferably, the conversion table comprises a plurality of rules for converting characters from the first collating sequence I1 operated on by one of a plurality of operators to the second collating sequence I2, such that the query comprising the source predicate e1 can be converted to the target predicate e2 by applying the rules. The operators comprise less than ( less than ), less than or equal to (xe2x89xa6), greater than ( greater than ), and greater than or equal to (xe2x89xa7). Each of the rules is either precise or imprecise.
Another embodiment within the scope of this invention includes a method for generating a conversion rule for a conversion table between a first collating sequence (I1) having a first plurality of characters, each having a respective weight, and a second collating sequence (I2) having a second plurality of characters, each having a respective weight. The method comprises the steps of: for a subject character operated on by an operator, the subject character being one of the first plurality of characters and the operator being one of less than ( less than ), less than or equal to (xe2x89xa6), greater than ( greater than ), and greater than or equal to (xe2x89xa7), determining which ones of the first plurality of characters in the first collating sequence have a weight that satisfies the operator operating on the subject character; detecting the characters in the second collating sequence that correspond to the first plurality of characters that have been determined to have a weight that satisfies the operator operating on the subject character; and determining at least one interval of the detected characters, where the conversion rule is responsive to the at least one interval.
According to one aspect of the present invention, the method further comprises the step of OR""ing each interval of the at least one interval together to form the conversion rule.
According to another aspect of the present invention, each respective weight in the first collating sequence is different and each respective weight in the second collating sequence is different.
According to another aspect of the present invention, each respective weight consists of only one level weight.
According to further aspects of the present invention, at least two respective weights in one of the first collating sequence and the second collating sequence are identical, and the method further comprises the steps of determining which of the characters in the first collating sequence have the same weight as the subject character; and prior to determining the at least one interval, detecting the characters in the second collating sequence that correspond to the first plurality of characters that have been determined to have the same weight as the subject character.
According to further aspects of the present invention, each respective weight comprises at least a primary weight and a secondary weight, and the method further comprises the steps of: detecting each character in the second collating sequence that has the same primary weight as the character that corresponds to the subject character; determining sequences of adjacent detected characters at the second collating sequence; determining which of the detected characters has a lowest primary weight; determining which of the detected characters has a highest primary weight; determining at least one interval of the characters of the second collating sequence having the lowest primary weight character and the highest primary weight character as a lowest and a highest endpoint, respectively; determining if there is more than one character in each the interval that has the lowest primary weight, and if so, changing the lowest endpoint to a character having a next lowest primary weight; and determining if there is more than one character in each the interval that has the highest primary weight, and if so, changing the highest endpoint to a character having a next highest primary weight.
Another embodiment within the scope of this invention includes a heterogeneous database system, comprising: a first data source using a first collating sequence (I1); a second data source using a second collating sequence (I2); a conversion table generator; a conversion table generated by the conversion table generator; and a predicate converter for receiving a query including a source predicate (e1) for I1 and, using the conversion table, converting e1 to a target predicate (e2) for I2.
According to one aspect of the present invention, the database system further comprises a predicate simplifier for simplifying the target predicate e2.
According to another aspect of the present invention, the first collating sequence I1 and the second collating sequence I2 each comprise a plurality of characters, each character having a weight, at least two of the characters in at least one of the first and second collating sequences having identical weights.
According to a further aspect of the present invention, the first collating sequence I1 and the second collating sequence I2 each comprise a plurality of characters, each character having multiple weights.
The foregoing and other aspects of the present invention will become apparent from the following detailed description of the invention when considered in conjunction with the accompanying drawings.