1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular, to providing an architecture to enable search gateways as part of a federated search.
2. Description of Related Art
The present invention relates to a system and method for representing and searching multiple heterogeneous datastores and managing the results of such searches. Datastore is a term used to refer to a generic data storage facility, such as a relational database, flat file, hierarchical database, etc. Heterogeneous is a term used to indicate that the datastores need not be similar to each other. For example, each datastore may store different types of data, such as image or text, or each datastore may be based on a different data model, such as Digital Library/VisualInfo or Domino Extended Search (DES).
For nearly half a century, computers have been used by businesses to manage information such as numbers and text, mainly in the form of coded data. However, business data represents only a small part of the world's information. As storage, communication, and information processing technologies advance, and as their costs come down, it becomes more feasible to digitize various other types of data, store large volumes of it, and distribute it on demand to users at their place of business or home.
New digitization technologies have emerged in the last decade to digitize images, audio, and video, giving birth to a new type of digital multimedia information. These multimedia objects are quite different from the business data that computers managed in the past, and often require more advanced information management system infrastructures with new capabilities. Such systems are often called “digital libraries.”
Introducing new digital technologies can do much more than just replace physical objects with their electronic representations. It enables instant access to information; supports fast, accurate, and powerful search mechanisms; provides new “experiential” (i.e., virtual reality) user interfaces; and implements new ways of protecting the rights of information owners. These properties make digital library solutions even more attractive and acceptable not only to corporate IS organizations, but also to information owners, publishers, and service providers.
Generally, business data is created by a business process (an airline ticket reservation, a deposit at a bank, and claim processing at an insurance company are examples). Most of these processes have been automated by computers and produce business data in digital form (text and numbers). The result is therefore usually structured, coded data. Multimedia data, by contrast, cannot be fully pre-structured (its use is not fully predictable) because it is the result of human creation or of the digitization of a real-world object (x-rays, geophysical mapping, etc.) rather than of a computer algorithm.
The average size of business data in digital form is relatively small. A banking record, including a customer's name, address, phone number, account number, balance, etc., represents at most a few hundred characters, i.e., a few thousand bits. The digitization of multimedia information (image, audio, video) produces a large set of bits called an “object” or “blob” (Binary Large Object). For example, a digitized image of the parchments from the Vatican Library may require the equivalent of 30 million characters (30 MB) of storage. The digitization of a movie, even after compression, may require the equivalent of several billion characters (3-4 GB).
Multimedia information is typically stored as much larger objects, ever increasing in quantity and therefore requiring special storage mechanisms. Classical business computer systems have not been designed to directly store such large objects. Specialized storage technologies may be required for certain types of information, e.g., media streamers for video or music. Because certain multimedia information needs to be preserved “forever,” it also requires special storage management functions that provide automated back-up and migration to new storage technologies as they become available and as old technologies become obsolete.
Finally, for performance reasons, the multimedia data is often placed in the proximity of the users with the system supporting multiple distributed object servers. This often requires a logical separation between applications, indices, and data to ensure independence from any changes in the location of the data.
The indexing of business data is often embedded in the data itself. When the automated business process stores a person's name in the column “NAME,” it actually indexes that information. Multimedia information objects usually do not contain indexing information. This “metadata” must be created separately by developers or librarians. The indexing information for multimedia information is often kept in “business-like” databases separate from the physical object.
In a Digital Library (DL), the multimedia object can be linked with the associated indexing information, since both are available in digital form. Integration of this legacy catalog information with the digitized object is crucial and is one of the great advantages of DL technology. Different types of objects can be categorized differently as appropriate for each object type. Existing standards such as MARC records for libraries, Finding Aids for archiving of special collections, etc., can be used when appropriate.
The indexing information used for catalog searches in physical libraries is mostly what one can read on the covers of the books (author's name, title, publisher, ISBN, etc.), enriched by other information created by librarians based on the content of the books (abstracts, subjects, keywords, etc.). In digital libraries, the entire content of books, images, music, films, etc. is available, and “new content” technologies are needed: technologies for full text searching, image content searching (searching based on color, texture, shape, etc.), video content searching, and audio content searching. The integrated combination of catalog searches (e.g., SQL) with content searches will provide more powerful search and access functions. These technologies can also be used to partially automate further indexing, classification, and abstracting of objects based on content.
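The combination of a catalog search with a content search can be illustrated with a minimal sketch. It is not part of the described system: the table name `catalog`, the `full_text` mapping, and the function `combined_search` are all hypothetical, and a simple substring test stands in for a real full text search engine.

```python
import sqlite3

# Hypothetical catalog: structured index data held in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE catalog (id INTEGER PRIMARY KEY, author TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?)",
    [(1, "Smith", "Old Maps"), (2, "Smith", "Bank Records"), (3, "Jones", "River Maps")],
)

# Hypothetical content store: the full text of each object, keyed by catalog id.
full_text = {
    1: "a survey of river charts and coastal maps",
    2: "ledgers of deposits and withdrawals",
    3: "river deltas seen from the air",
}


def combined_search(author, keyword):
    """Catalog search (SQL) narrowed by a content search (substring match)."""
    rows = conn.execute(
        "SELECT id, title FROM catalog WHERE author = ? ORDER BY id", (author,)
    ).fetchall()
    # Keep only the catalog hits whose content also mentions the keyword.
    return [title for (doc_id, title) in rows if keyword in full_text[doc_id]]


matches = combined_search("Smith", "river")
```

Here the SQL query selects by catalog attribute and the content test narrows the result, so neither search alone would return the same set.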
To harness the massive amounts of information spread across networks, it has become necessary for a user to search numerous storage facilities at the same time without having to consider the particular implementation of each storage facility.
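The idea of searching several storage facilities at once, without regard to how each is implemented, can be sketched as follows. This is an illustrative assumption only, not the architecture of the invention: the classes `ListStore` and `DictStore` and the function `federated_search` are hypothetical names, and the only thing the two stores share is a `search` method.

```python
class ListStore:
    """Hypothetical store backed by a list of text records."""

    def __init__(self, name, records):
        self.name = name
        self.records = records

    def search(self, term):
        return [r for r in self.records if term.lower() in r.lower()]


class DictStore:
    """Hypothetical store backed by a mapping of object id to description."""

    def __init__(self, name, table):
        self.name = name
        self.table = table

    def search(self, term):
        return [k for k, v in self.table.items() if term.lower() in v.lower()]


def federated_search(stores, term):
    """Send one query to every store and tag each hit with its origin."""
    results = []
    for store in stores:
        for hit in store.search(term):
            results.append((store.name, hit))
    return results


stores = [
    ListStore("text", ["Vatican parchment catalog", "Bank ledger"]),
    DictStore("images", {"img-001": "scan of a Vatican parchment"}),
]
hits = federated_search(stores, "vatican")
```

The caller issues a single query and never inspects how either store holds its data; each store answers in its own way and the results are merged.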
Object-oriented approaches are generally better suited for such complex data management. The term “object-oriented” refers to a software design method which uses “classes” and “objects” to model abstract or real objects. An “object” is the main building block of object-oriented programming, and is a programming unit which has both data and functionality (i.e., “methods”). A “class” defines the implementation of a particular kind of object, the variables and methods it uses, and the parent class it belongs to.
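As a minimal illustration of these terms, the following sketch defines a class, a subclass with a parent class, and an object. The names `Datastore` and `ImageDatastore` are hypothetical and chosen only for the example.

```python
class Datastore:
    """A class: defines the variables and methods its objects use."""

    def __init__(self, name):
        self.name = name  # data held by each object

    def describe(self):  # functionality, i.e., a method
        return f"datastore {self.name}"


class ImageDatastore(Datastore):
    """A subclass: Datastore is its parent class."""

    def describe(self):
        # Reuse the parent's behavior, then specialize it.
        return super().describe() + " (images)"


store = ImageDatastore("scans")  # an object: one instance of a class
```

Calling `store.describe()` runs the subclass method, which in turn calls the method it inherits from the parent class.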
Some known programming tools that can be used for developing search and result-management frameworks include IBM VisualAge C++, Microsoft Visual C++, Microsoft Visual J++, and Java.
There is a need in the art for an improved federated system. In particular, there is a need in the art for an architecture to enable search gateways as part of a federated search.