For many years, businesses have used computers to manage information such as numbers and text, primarily in the form of coded data. However, business data represents only a small portion of the world's information. As storage, communication, and information processing technologies advance and the cost of these technologies decreases, it becomes more feasible to digitize and store large volumes of various other types of data. Once digitized and stored, the data is required to be available for distribution on demand to users at their place of business, home, or other locations.
New digitization techniques have emerged in the last decade to digitize images, audio, and video, giving rise to a new type of digital information. These digital objects are significantly different from the business data that computers managed in the past, often requiring more advanced information system infrastructures with new capabilities, such as “digital libraries” or content management systems.
New digital technologies can do much more than replace physical objects with their electronic representations. These technologies enable instant access to information; support fast, accurate, and powerful search mechanisms; provide new “experiential” (i.e., virtual reality) user interfaces; and implement new ways of protecting the rights of information owners. These properties make digital library solutions attractive and acceptable to corporate information service organizations as well as to the information owners, publishers, and service providers.
Generally, business data is created by a business process, such as an airline ticket reservation, a deposit at a bank, or claims processing at an insurance company. Most of these processes have been automated by computers and produce business data in digital form such as text and numbers, i.e., structured coded data. In contrast, the use of digital data is not fully predictable. Consequently, digital data cannot be fully pre-structured because it is the creative result of a human being or it is the digitization of an object of the real world such as, for example, x-rays or geophysical mapping, rather than the output of a computer algorithm. While the present invention is described for example purposes in terms of digital data, it should be clear that the present invention is not limited to digital data.
The average size of business data in digital form is relatively small. A banking record that comprises a customer's name, address, phone number, account number, balance, etc., may represent only a few hundred characters, i.e., a few hundred or a few thousand bits. The digitization of information such as images, audio, or video produces a large set of bits called an “object” or binary large object (“blob”). For example, a digitized image may take as much as 30 MB of storage. The digitization of a movie, even after compression, may take as much as 3 GB to 4 GB of storage.
Digital information is typically stored as much larger objects, ever increasing in quantity and therefore requiring special storage mechanisms. Conventional business computer systems have not been designed to directly store such large objects. Specialized storage technologies may be required for certain types of information such as media streamers for video or music. Because certain digital information needs to be preserved or archived, special storage management functions are required for providing automated backup and migration to new storage technologies as they become available and as old technologies become obsolete.
For performance reasons, digital data is often placed in the proximity of the users with the system supporting multiple distributed object servers. Consequently, a logical separation between applications, indices, and data is required to ensure independence from any changes in the location of the data.
The indexing of business data is often embedded into the data itself. When the automated business process stores a person's name in the column “NAME”, it actually indexes that information. Digital information objects usually do not contain indexing information. Developers or librarians typically create this “meta data” or “metadata”. The indexing information for these objects is typically kept in standard business-like databases separate from the physical object.
In a digital library or a content management system, the digital object can be linked with the associated indexing information since both are available in digital form. Integration of this legacy catalog information with the digitized object is one of the advantages of content management or digital library technology. Different types of objects can be categorized differently as appropriate for each object type. Existing standards such as, for example, MARC records for libraries or Finding Aids for archiving of special collections can be used when appropriate.
The indexing information used for catalog searches in physical libraries is typically the name of the book, author, title, publisher, ISBN, etc., enriched by other information created by librarians. This other information may comprise abstracts, subjects, keywords, etc. In contrast, digital libraries may contain the entire content of books, images, music, films, etc.
Technologies are desired for full text searching, image content searching (searching based on color, texture, shape, etc.), video content searching, and audio content searching. A specialized search engine usually conducts each type of search. The integrated combination of catalog searches (for example, using SQL) with content searches provides powerful search and access functions. These technologies can also be used to partially automate further indexing, classification, and abstracting of objects based on content. The term multi-search refers to searches employing more than one search engine, for example, a text search and an image search.
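By way of illustration only, a multi-search that combines a catalog search with a content search may be sketched as follows. The sketch assumes a hypothetical catalog table queried with SQL (using the Python sqlite3 module) and a toy substring match standing in for a specialized full text search engine; the names build_catalog, CONTENT, catalog_search, text_search, and multi_search are assumptions made for this example and are not part of any described system.

```python
import sqlite3

# Hypothetical content store: object id -> full text of the digital object.
CONTENT = {
    1: "fishing boats leave the harbor",
    2: "neon signs and traffic",
    3: "boats on the river",
}


def build_catalog():
    """Create an in-memory catalog holding metadata separately from content."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE catalog (object_id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
    )
    conn.executemany(
        "INSERT INTO catalog VALUES (?, ?, ?)",
        [(1, "Harbor at Dawn", "Lee"), (2, "City Lights", "Lee"), (3, "Field Notes", "Okoro")],
    )
    return conn


def catalog_search(conn, author):
    """Catalog search: SQL over the indexing information."""
    rows = conn.execute("SELECT object_id FROM catalog WHERE author = ?", (author,))
    return {row[0] for row in rows}


def text_search(term):
    """Content search: a stand-in for a specialized full text engine."""
    return {oid for oid, text in CONTENT.items() if term.lower() in text.lower()}


def multi_search(conn, author, term):
    """Return ids of objects that satisfy both the catalog and content searches."""
    return sorted(catalog_search(conn, author) & text_search(term))
```

In this sketch each search engine returns a set of object identifiers, and the multi-search intersects the sets. Other combination rules, such as union or ranking, are equally possible.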
To harness the massive amounts of information spread throughout these many networks of varying types of content, a user desires to be able to simultaneously search numerous storage facilities without considering the particular implementation of each storage facility. In this context, the term datastore is used to refer to a generic data storage facility, whereas heterogeneous is used to indicate that the datastores need not be of the same type. A federated datastore is composed as an aggregation of several heterogeneous datastores configured dynamically by the application user.
Currently, the ability to search across many different types of datastores in many different geographical locations is achieved by the use of a federated datastore system, which provides mechanisms for conducting a federated multi-search and update across heterogeneous datastores. For example, each datastore may represent a company or division of a company. A division manager requires access to his or her local datastore but not to the datastores of other division managers. Conversely, a corporate officer may require access to the datastores of all the divisions, located, for example, in New York, San Francisco, London, and Hong Kong. A federated system is capable of searching all the databases, combining and aggregating the data into one report, and presenting the report to the corporate officer.
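For example purposes only, the aggregation performed by a federated datastore over heterogeneous datastores may be sketched as follows. The classes SqlDatastore, DictDatastore, and FederatedDatastore, the sales table, and the search method are hypothetical and assumed for illustration; an actual federated system would additionally handle access control, remote connections, and updates.

```python
import sqlite3


class SqlDatastore:
    """Hypothetical relational datastore for one division."""

    def __init__(self, rows):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        self.conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

    def search(self, min_amount):
        cursor = self.conn.execute(
            "SELECT region, amount FROM sales WHERE amount >= ?", (min_amount,)
        )
        return cursor.fetchall()


class DictDatastore:
    """Hypothetical non-relational datastore of a different type."""

    def __init__(self, data):
        self.data = data

    def search(self, min_amount):
        return [(region, amount) for region, amount in self.data.items() if amount >= min_amount]


class FederatedDatastore:
    """Aggregates heterogeneous datastores behind a single search call."""

    def __init__(self, members):
        self.members = members  # mapping: division name -> datastore

    def search(self, min_amount):
        report = []
        for division, store in self.members.items():
            # Each member answers in its own way; results are combined here.
            for region, amount in store.search(min_amount):
                report.append((division, region, amount))
        return sorted(report)
```

A division manager would query only the local datastore, whereas a corporate officer would query the FederatedDatastore and receive one combined report.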
In a transparent, heterogeneous information integration environment such as a federated datastore system, query capability and semantics vary in each of the remote data sources. In such an environment with diverse remote data sources, conventional federated query compilers analyze query elements of a query statement in every user input query according to the capability and semantics of the remote data sources. The conventional federated query compilers determine which query elements can be evaluated remotely. If an element in the query is supported by the remote data source and also provides the same semantics in the remote data source as it does in the federated server, then the federated query compiler sends the query element to the remote data source through one or more remote queries. Such a query element is capable of “pushdown” to the remote data source and is described as “pushdownable”. The result set of the remote query is returned to the federated server for any further local processing. Query results are then returned to the user.
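The pushdown determination described above may be sketched, under simplifying assumptions, as follows. Query elements are modeled as (operator, column, value) tuples, and the remote data source's capabilities are modeled as a hypothetical table mapping each supported operator to a flag indicating whether its remote semantics match those of the federated server. The names is_pushdownable and split_query_elements are assumed for illustration and do not correspond to any particular query compiler.

```python
def is_pushdownable(element, remote_capabilities):
    """An element is pushdownable only if the remote source supports its
    operator AND evaluates it with the same semantics as the federated server."""
    operator = element[0]
    # Operators missing from the table are treated as unsupported.
    return remote_capabilities.get(operator, False)


def split_query_elements(elements, remote_capabilities):
    """Partition query elements into those sent to the remote data source
    and those left for local processing at the federated server."""
    remote, local = [], []
    for element in elements:
        if is_pushdownable(element, remote_capabilities):
            remote.append(element)
        else:
            local.append(element)
    return remote, local


# Hypothetical capability table: operator -> True if semantics match.
capabilities = {
    "eq": True,        # supported, same semantics
    "gt": True,        # supported, same semantics
    "like": False,     # supported remotely, but semantics differ
}

query = [
    ("eq", "region", "EU"),
    ("like", "name", "Sm%"),
    ("gt", "amount", 100),
    ("soundex_eq", "name", "Smith"),  # not supported remotely at all
]

remote_elements, local_elements = split_query_elements(query, capabilities)
```

Here the "eq" and "gt" elements would be included in the remote query, while the "like" and "soundex_eq" elements would be evaluated at the federated server on the returned result set.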
Although this approach to managing queries in a heterogeneous environment has proven to be useful, it would be desirable to present additional improvements. The method of conventional query compilers has improved query performance by sending part of the original user input SQL to the remote data source for evaluation. However, a query element sometimes cannot be included in a remote query (i.e., the query statement element is not “pushdownable”) due to different capabilities or semantics in the remote data sources. In such a situation, the remote data source returns unfiltered data to the federated server. Consequently, the performance of such a query is poor because of the communication overhead required to transfer those non-qualifying rows from the remote data source to the federated server. Such communication overhead can be quite large when the size of the qualified data is small compared to the size of data returned without filtering from the remote data source.
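The communication overhead described above may be illustrated with a small sketch that counts the rows transferred from the remote data source to the federated server. The function rows_transferred, the sample rows, and the is_open predicate are hypothetical and assumed only for this example; no network transfer actually takes place.

```python
def rows_transferred(remote_rows, predicate, pushdownable):
    """Return (rows sent to the federated server, rows in the final result)."""
    if pushdownable:
        # The remote data source filters before sending.
        sent = [row for row in remote_rows if predicate(row)]
        result = sent
    else:
        # The remote data source returns unfiltered data;
        # the federated server filters locally.
        sent = list(remote_rows)
        result = [row for row in sent if predicate(row)]
    return len(sent), len(result)


# Hypothetical remote table: 10,000 rows, of which 10 qualify.
remote_rows = [
    {"id": i, "status": "open" if i % 1000 == 0 else "closed"}
    for i in range(10_000)
]


def is_open(row):
    return row["status"] == "open"


with_pushdown = rows_transferred(remote_rows, is_open, pushdownable=True)
without_pushdown = rows_transferred(remote_rows, is_open, pushdownable=False)
```

In this example both cases yield the same 10 qualifying rows, but without pushdown all 10,000 rows are transferred, so 9,990 non-qualifying rows cross the network only to be discarded.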
What is therefore needed is a system, a computer program product, and an associated method for performing an inexact query transformation in a heterogeneous environment. The need for such a solution has heretofore remained unsatisfied.