1. Technical Field
The present invention relates in general to distributed database searches and in particular to reliably searching a heterogeneous distributed database in which a diverse set of character mappings is employed. Still more particularly, the present invention relates to a data structure and to a case and character set insensitive search method and apparatus for reliably searching a distributed database which spans multiple system character encoding schemes for underlying data.
2. Description of the Related Art
Databases are employed by enterprises as key repositories of information. To a large extent, the value of the data stored within a database is determined by the reliability of accessing the stored information upon demand. In large databases with many persons entering the data, the integrity of the data may become compromised by differences in data entry techniques. Entries may be made, for instance, in various combinations of casings.
In distributed databases, particularly those which range across a variety of operating systems utilizing different character encoding schemes, such as the American Standard Code for Information Interchange (ASCII) or Unicode on Windows NT or PC and Unix servers/workstations versus the Extended Binary-Coded Decimal Interchange Code (EBCDIC) on IBM mainframes, data entry and character encoding variances pose a problem for matching search keys. Traditional matching methods for searching are based on exact matches, so that possible matches are missed, particularly for operating systems or character sets which do not support case mapping or where a discrepancy in character encoding exists.
For instance, an operator searching for records within a database while on the telephone with a customer may enter “david kumhyr” in the name field(s) of the database search engine's user interface. The search application may search the database on a remote Unix-based system and find no match to the original name data if the data was originally entered as “David Kumhyr”. A similar result may occur even when the entered text is identical, because the data will not match if the character encoding schemes differ. The hex encoded value for the first character (“D”) in the search key “David Kumhyr” is C4 in EBCDIC and 44 in ASCII (0044 in Unicode).
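The encoding mismatch described above can be reproduced with a short Python sketch. It assumes EBCDIC code page 037 (Python's `cp037` codec) as a representative EBCDIC variant; the text does not say which EBCDIC code page is involved, so that choice is illustrative only.

```python
# Compare the byte values of the search key under ASCII and EBCDIC.
# Assumption: code page 037 stands in for "EBCDIC"; other EBCDIC
# code pages exist and may differ for some characters.
key = "David Kumhyr"

ascii_bytes = key.encode("ascii")
ebcdic_bytes = key.encode("cp037")

first_ascii = ascii_bytes[0]     # byte value of "D" in ASCII
first_ebcdic = ebcdic_bytes[0]   # byte value of "D" in EBCDIC (cp037)
first_unicode = ord(key[0])      # Unicode code point of "D"

# An exact byte comparison fails across encodings even though the
# text is identical, and fails across casings within one encoding.
same_text_different_encoding = ascii_bytes == ebcdic_bytes
different_case_same_encoding = "david kumhyr".encode("ascii") == ascii_bytes

print(f"ASCII: {first_ascii:02X}  EBCDIC: {first_ebcdic:02X}  Unicode: {first_unicode:04X}")
```

Both comparisons evaluate to false, which is the failure mode an exact-match search exhibits in the two cases described.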
Fuzzy search algorithms applied to the problem described above tend to generate large quantities of non-matching data which must be further qualified to determine a match. A case and character set insensitive search method is therefore required. Often there is a need to create a search string which is a case insensitive equivalent to the base text. However, casing is a locale and language dependent operation, which is further dependent on the character encoding scheme employed, such as EBCDIC.
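As a brief illustration of why casing is language dependent, the following Python sketch compares two spellings of the same German word. The example word and the helper names `naive_equal` and `folded_equal` are illustrative and do not come from the text; the sketch only shows that simple lower-casing is not always a sufficient case-insensitive comparison.

```python
# The German sharp s ("ß") has no single-character upper-case form in
# common usage; its upper-case spelling is "SS". Lower-casing both
# sides therefore does not make the two spellings equal.
def naive_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison by simple lower-casing."""
    return a.lower() == b.lower()


def folded_equal(a: str, b: str) -> bool:
    """Case-insensitive comparison using Unicode case folding."""
    return a.casefold() == b.casefold()


lower_result = naive_equal("Straße", "STRASSE")
folded_result = folded_equal("Straße", "STRASSE")
```

Here `lower_result` is false and `folded_result` is true, because case folding maps “ß” to “ss” while lower-casing leaves it unchanged.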
It would be desirable, therefore, to provide a case and character set insensitive search method and apparatus. It would further be advantageous if the search method could be transparently employed across systems utilizing incongruent character set encodings.
It is therefore one object of the present invention to provide an improved method, system and computer program product for distributed database searches.
It is another object of the present invention to provide an improved method, system and computer program product for reliably searching a heterogeneous distributed database in which a diverse set of character mappings is employed.
It is yet another object of the present invention to provide a method, system and computer program product for reliable case and character set insensitive searching of distributed databases which span multiple system character encodings for underlying data.
The foregoing objects are achieved as is now described. A search string for searching data distributed among various hosts employing different character encoding schemes or having different case-mapping capabilities is entered in a multi-field text string class. The multi-field text string class includes methods for transliterating characters within the original search string based on defined character equivalence tables. When the search string is received at a data host, a comparison is made of the operating system run on the originating data processing system, identified in a sourceVariant field of the multi-field text string class, and the operating system run on the data host, identified in a targetVariant field of the multi-field text string class. If necessary, an appropriate character equivalence table is selected and a variant of the search string is generated by transliteration. The search string variant is then passed to a search engine to search local data, either with or without the original search string, and matches identified are returned as matches for the original search string. Accurate search results are therefore produced despite the presence of different character encoding schemes, such as EBCDIC versus ASCII/Unicode, or operating systems which do not support case mapping.
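The flow summarized above can be sketched in Python under stated assumptions. Only the sourceVariant and targetVariant fields are named in the text (rendered here as `source_variant` and `target_variant`); the class name `MultiFieldText`, the table registry `EQUIVALENCE_TABLES`, the operating system labels, and the exact-match search function are hypothetical stand-ins, not the claimed implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical character equivalence table: upper-case ASCII letters
# are treated as equivalent to their lower-case forms.
ASCII_CASE_FOLD: Dict[str, str] = {
    chr(c): chr(c).lower() for c in range(ord("A"), ord("Z") + 1)
}

# Hypothetical registry of tables keyed by (source OS, target OS).
EQUIVALENCE_TABLES: Dict[Tuple[str, str], Dict[str, str]] = {
    ("mvs", "unix"): ASCII_CASE_FOLD,
}


@dataclass
class MultiFieldText:
    base_text: str                        # original search string
    source_variant: str                   # OS of the originating system
    target_variant: Optional[str] = None  # OS of the data host

    def transliterate(self, table: Dict[str, str]) -> str:
        """Replace each character by its equivalent, if the table has one."""
        return "".join(table.get(ch, ch) for ch in self.base_text)

    def variant_for(self, host_os: str) -> Optional[str]:
        """Return a transliterated variant if source and target differ."""
        self.target_variant = host_os
        if self.source_variant == self.target_variant:
            return None  # same system type: no transliteration needed
        table = EQUIVALENCE_TABLES.get((self.source_variant, self.target_variant))
        return self.transliterate(table) if table is not None else None


def search_local(records: List[str], query: MultiFieldText, host_os: str) -> List[str]:
    """Search local data with the original string plus any variant."""
    keys = {query.base_text}
    variant = query.variant_for(host_os)
    if variant is not None:
        keys.add(variant)
    # Stand-in for the local search engine: exact match on any key.
    return [record for record in records if record in keys]


query = MultiFieldText("DAVID KUMHYR", source_variant="mvs")
matches = search_local(["david kumhyr", "jane doe"], query, host_os="unix")
```

In this sketch the variant “david kumhyr” is searched alongside the original string, and any record it matches is returned as a match for the original search string.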
The above as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed written description.