The invention relates to computer database systems and more specifically to distributed computer database systems.
The basis for communication, whether between people or between computer systems, is a shared background that allows the parties to understand each other. This involves sharing both of the following: (1) a language for communication; and (2) a domain conceptualization that defines the shared vocabulary along with relationships that may hold between the concepts denoted by the terms in the vocabulary.
The problem of translation between different languages is important, and many computer systems have been developed for this purpose. Translation between different domain conceptualizations is also important. Translation between domain conceptualizations is called mediation. Domain conceptualizations are also called ontologies. For example, the vocabulary of Americans differs from that of the British even though they share a common language. In the UK, one would say “lift” for what is called an “elevator” in the US. Mediation would be required in order to understand what is meant by these terms.
For a more complex example, the domain of medicine has a large vocabulary of terms for chemicals, genes, laboratory procedures, diseases, etc. Within medicine there are many subdomains that use different terminology for the same concept. Terminology can also vary from one company to another, and even small groups within a single company can have their own specialized vocabulary. Some will use the term “Munchausen Syndrome” while others prefer “Chronic factitious illness with physical symptoms”. Some might even prefer to expand the term “factitious illness” to “intentional production or feigning of symptoms or disabilities, either physical or psychological” to make it understandable to someone with minimal medical background.
The problem of mediation between domain conceptualizations is especially difficult for computer systems because they generally have no mechanism for dealing with miscommunication as a result of misunderstood terminology. For example, conventional search engines simply match words in a query with words in documents. Some search engines consider the possibility of synonymous words, but the fact that the words might belong to different domains is not considered.
For example, suppose that one wishes to find occurrences of “Job” in the Bible. Job is one of the persons mentioned in the Bible, and one of the books in the Bible is named after him. However, modern search engines do not generally understand this, and they will make errors such as matching “Job” with “work” because they regard these two words as synonymous.
Current search engines support only a very limited ontology with just a few concepts. Moreover, the ontology is inflexibly built into the search engine and only one ontology is supported. In general, indexes of current database systems are thus limited to a single ontology.
A collection of documents, data or other kinds of information objects will be called an object database. Information objects can be images, sound and video streams, as well as data objects such as text files and structured documents. Each information object is identified uniquely by an object identifier (OID). An OID can be an Internet Universal Resource Locator (URL) or some other form of identifier such as a local object identifier.
To assist in finding information in an object database, special search structures called indexes are employed. Current technology generally requires a separate index for each attribute or feature. Even the most sophisticated indexes currently available are limited to a very small number of attributes. Since each index can be as large as the database itself, this technology does not function well when there are hundreds or thousands of attributes, as is often the case when objects such as images, sound and video streams are directly indexed. Furthermore, there is considerable overhead associated with maintaining each index structure. This limits the number of attributes that can be indexed. Current systems are unable to scale up to support databases for which there are: many object types, including images, sound and video streams; millions of features; queries that involve many object types and features simultaneously; and new object types and features being continually added.
Further information can be had regarding some of the concepts discussed herein with reference to the following publications:
1 L. Aiello, J. Doyle, and S. Shapiro, editors. Proc. Fifth Intern. Conf. on Principles of Knowledge Representation and Reasoning. Morgan Kaufmann Publishers, San Mateo, Calif., 1996.
2 K. Baclawski. Distributed computer database system and method, December 1997. U.S. Pat. No. 5,694,593. Assigned to Northeastern University, Boston, Mass.
3 K. Baclawski and D. Simovici. An abstract model for semantically rich information retrieval. Technical report, Northeastern University, Boston, Mass., March 1994.
4 A. Campbell and S. Shapiro. Algorithms for ontological mediation. Technical report, State University of New York at Buffalo, Buffalo, N.Y., 1998.
5 A. Del Bimbo, editor. The Ninth International Conference on Image Analysis and Processing, volume 1311. Springer, September 1997.
6 N. Fridman Noy. Knowledge Representation for Intelligent Information Retrieval in Experimental Sciences. PhD thesis, College of Computer Science, Northeastern University, Boston, Mass., 1997.
7 R. Jain. Content-centric computing in visual systems. In The Ninth International Conference on Image Analysis and Processing, Volume II, pages 1-13, September 1997.
8 Y. Ohta. Knowledge-Based Interpretation of Outdoor Natural Color Scenes. Pitman, Boston, Mass., 1985.
9 G. Salton. Automatic Text Processing. Addison-Wesley, Reading, Mass., 1989.
10 G. Salton, J. Allan, and C. Buckley. Automatic structuring and retrieval of large text files. Comm. ACM, 37(2):97-108, February 1994.
11 A. Tversky. Features of similarity. Psychological review, 84(4):327-352, July 1977.
The disclosures of the publications referenced in this “Background of the Invention” are incorporated herein by reference.
It would be desirable to provide an information retrieval system that can retrieve information from a database, including documents, images and other forms of multimedia, taking into account ontologies and using a single indexing system, and otherwise overcome many disadvantages and limitations of current systems.
The invention resides in performing, preferably in parallel over a distributed network of computer nodes, ontology mediation and information retrieval in response to a user query in order to retrieve information objects conforming to target ontologies specified in the query.
Briefly, the invention can provide an information retrieval system for processing a query for word based and non-word based retrieval of information from a database by extracting a number of features from the query according to its ontology, fragmenting each of the features into feature fragments, and hashing each of the feature fragments into hashed feature fragments. The hashed feature fragments can be used in accessing a hash table for obtaining object identifiers therefrom that can be used for obtaining information from the database relevant to the query and to its target ontologies.
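By way of illustration only, the extract, fragment and hash steps described above might be sketched as follows. The text does not specify how features are extracted or fragmented, so everything in this sketch is an assumption made for the example: the word-based extractor, the overlapping three-character fragments, the truncated SHA-256 hash, and the function names extract_features, fragment, hash_fragment and hashed_fragments.

```python
import hashlib


def extract_features(query: str) -> list[str]:
    """Hypothetical extractor: lower-cased words stand in for features.

    A real extractor would depend on the ontology of the query.
    """
    return query.lower().split()


def fragment(feature: str, size: int = 3) -> list[str]:
    """Hypothetical fragmentation: overlapping character n-grams."""
    if len(feature) <= size:
        return [feature]
    return [feature[i:i + size] for i in range(len(feature) - size + 1)]


def hash_fragment(frag: str, bits: int = 32) -> int:
    """Map a fragment to a fixed-width integer.

    Truncated SHA-256 is an assumption; any well-distributed hash would do.
    """
    digest = hashlib.sha256(frag.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


def hashed_fragments(query: str) -> list[int]:
    """Run the full pipeline: features, then fragments, then hashes."""
    return [
        hash_fragment(frag)
        for feature in extract_features(query)
        for frag in fragment(feature)
    ]
```

Each integer returned by hashed_fragments would then serve as a key for the hash table lookup that yields object identifiers.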
In another aspect, the invention resides in an information indexing system for indexing information for facilitated retrieval from a database, by extracting a number of features from the information, fragmenting each of the features into feature fragments, and hashing each of the feature fragments into hashed feature fragments. The hashed feature fragments are used in accessing a hash table for storing object identifiers at locations determined by the hashed feature fragments and the ontology identifiers. The information retrieval apparatus can be implemented in a distributed computer database system.
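As a minimal sketch of the indexing side, assuming an in-memory table on a single machine, object identifiers could be stored under a key that combines a hashed feature fragment with an ontology identifier. The class name FragmentIndex, its methods, and the use of a Python dictionary in place of the hash table are hypothetical choices for this example.

```python
from collections import defaultdict


class FragmentIndex:
    """Hypothetical single-node index keyed by (hashed fragment, ontology id)."""

    def __init__(self) -> None:
        # Each key maps to the set of OIDs that share that fragment
        # within that ontology.
        self._table: dict[tuple[int, str], set[str]] = defaultdict(set)

    def add(self, oid: str, ontology_id: str, hashed_frags: list[int]) -> None:
        """Store the OID at every location determined by its hashed fragments."""
        for h in hashed_frags:
            self._table[(h, ontology_id)].add(oid)

    def lookup(self, hashed_frag: int, ontology_id: str) -> set[str]:
        """Return a copy of the OIDs stored for this fragment and ontology."""
        return set(self._table.get((hashed_frag, ontology_id), ()))
```

Because the ontology identifier is part of the key, the same hashed fragment can be indexed under several ontologies without the entries colliding.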
In general, the term “feature” as used herein means any information or knowledge associated with an information object or derived from the content of the information object, regardless of whether the information object represents a document, image or other multimedia, which has meaning within the applicable domain and conforms to the applicable ontology. Thus, for example, where the information object represents a photographic image of a human face, e.g., for entry in a photography contest, the features of the image include the eyes, nose and mouth because they can be perceived when the image is viewed by the judges. When the same image is used for skin disease diagnosis, the domain and ontology shift, and the features can include even blemishes that are not noticeable with the unaided eye.
More specifically, the distributed computer database system in accordance with an aspect of the invention can include one or more front end computers and one or more computer nodes interconnected by a network into a search engine for retrieval of objects processed by a variety of interrelated ontologies. Each object conforms to a specific ontology. A query is an object that conforms to a specific ontology, which is to be used for retrieval of objects conforming to one or more target ontologies. A query includes the ontology to be used for processing the query and the target ontologies of the objects to be retrieved. A query from a user is transmitted to one of the front end computers which forwards the query to one of the computer nodes, termed the home node, of the search engine. The home node extracts features from the query, according to its ontology. These features are then fragmented and the feature fragments hashed. Each hashed feature fragment and the list of target ontologies is transmitted to one node on the network. For example, a first portion of the hashed feature fragment can be used as an address index to identify the one node. Each node on the network that receives a hashed feature fragment uses the hashed feature fragment of the query to perform a search on its respective partition of the database. For example, a second portion of the hashed feature fragment can be used as an index into the node's local database. The results of the searches of the local databases include the object identifiers (OIDs) of objects that match the query and the ontologies within which they were processed, as well as equivalent hashed feature fragments within other ontologies. These other hashed feature fragments are forwarded, as needed, to their respective nodes, and this search process continues on those nodes and is repeated until the desired target ontologies are reached.
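The routing and mediation steps of this paragraph might be simulated in a single process as follows. This is a sketch under stated assumptions, not the claimed implementation: the 32-bit hash width, the use of the top 4 bits as the node address, and the names split_hash, Node and search are all hypothetical. As a simplification, forwarding stops as soon as a fragment belongs to a target ontology.

```python
from collections import deque

HASH_BITS = 32  # assumed width of a hashed feature fragment
NODE_BITS = 4   # assumed: the first 4 bits address one of 16 nodes


def split_hash(h: int) -> tuple[int, int]:
    """Split a hash into (node address, index into that node's local database)."""
    local_bits = HASH_BITS - NODE_BITS
    return h >> local_bits, h & ((1 << local_bits) - 1)


class Node:
    """Hypothetical node holding one partition of the database."""

    def __init__(self) -> None:
        # (local index, ontology) -> OIDs of matching objects
        self.objects: dict[tuple[int, str], set[str]] = {}
        # (local index, ontology) -> equivalent (hash, ontology) pairs
        self.equivalents: dict[tuple[int, str], list[tuple[int, str]]] = {}


def search(nodes: list[Node], start_hash: int, start_ontology: str,
           targets: set[str]) -> set[tuple[str, str]]:
    """Follow equivalent fragments across ontologies until targets are reached."""
    results: set[tuple[str, str]] = set()
    seen: set[tuple[int, str]] = set()
    queue = deque([(start_hash, start_ontology)])
    while queue:
        h, ontology = queue.popleft()
        if (h, ontology) in seen:
            continue  # guard against cycles between ontologies
        seen.add((h, ontology))
        node_id, key = split_hash(h)
        node = nodes[node_id]
        if ontology in targets:
            # Target reached: collect OIDs together with their ontology.
            for oid in node.objects.get((key, ontology), ()):
                results.add((oid, ontology))
        else:
            # Not a target: forward the equivalent fragments to their nodes.
            queue.extend(node.equivalents.get((key, ontology), ()))
    return results
```

In a deployed system each queue entry would be a network message to the addressed node, and the home node would gather the results.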
When the target ontologies are reached, the results of the searches of the local databases are gathered by the home node. The results of the query are then computed for each target ontology. The computation performed can include a similarity function based on the features that are in common with the query as well as the features that are in the query but not in the object. The similarity function is used to rank the objects. The OIDs of the objects that have the largest similarity value are transmitted to the front end node.
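One possible form of such a similarity function, offered only as an assumption, rewards features shared with the query and penalizes features that are in the query but missing from the object. The linear form, the weights common_weight and missing_weight, and the function names similarity and rank are hypothetical and do not come from the text.

```python
def similarity(query_feats, object_feats,
               common_weight: float = 1.0,
               missing_weight: float = 0.5) -> float:
    """Score an object against a query.

    Adds credit for features in common and subtracts a penalty for
    query features absent from the object. The weights are assumed.
    """
    q, o = set(query_feats), set(object_feats)
    return common_weight * len(q & o) - missing_weight * len(q - o)


def rank(query_feats, objects: dict, top_k: int = 10) -> list:
    """Return the OIDs with the largest similarity values, best first.

    `objects` maps each OID to the features found for that object.
    Ties are broken by OID so the ordering is deterministic.
    """
    scored = [(similarity(query_feats, feats), oid)
              for oid, feats in objects.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [oid for _, oid in scored[:top_k]]
```

With a query having features a, b and c, an object that has all three scores 3.0, an object that has only a scores 0.0, and an object sharing none of them scores -1.5.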
The return of the ranked OIDs as just described constitutes a basic level of service, called level 1. If requested, higher levels of service may be provided. For level 2 or level 3 service, the OIDs obtained in the basic service above are transmitted to the nodes on the network by using a portion of each OID as an addressing index. In addition, if level 3 service is requested, the features each object has in common with the query are transmitted along with the OIDs to the same nodes on the network. Each node on the network which receives an OID uses the OID to perform a search on its respective database for the corresponding object information. In level 2 service, auxiliary information is retrieved and transmitted to the front end node. The auxiliary information can include, e.g., the URL of the object or an object summary or both. For level 3 service, a dissimilarity value is computed based on the features that the object possesses but the query does not. The dissimilarity value as well as the auxiliary information about the object are transmitted to the home node. The dissimilarity values are gathered by the home node which uses them to modify the similarity values of the objects obtained in the first level of processing. The modified similarity values are used to rank the objects. The OIDs and any auxiliary information about the objects that have the largest similarity value are transmitted to the front end node. Regardless of the level of service requested, the front end node formats the response to the user based on the OIDs and any auxiliary information transmitted by the home node. For example, if the front end node is a World Wide Web server, then the front end node constructs a page in HTML format containing a reference to a URL and auxiliary information for each object. The front end transmits the formatted response to the user.
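The level 3 adjustment might be sketched as follows, taking the level 1 similarity values as given. The text says only that the dissimilarity values are used to modify the similarity values, so the subtraction, the weight extra_weight, and the names dissimilarity and rerank_level3 are assumptions made for this example.

```python
def dissimilarity(query_feats, object_feats, extra_weight: float = 0.25) -> float:
    """Penalty for features the object possesses but the query does not.

    The linear form and the weight are assumed for illustration.
    """
    return extra_weight * len(set(object_feats) - set(query_feats))


def rerank_level3(level1_scores: dict, query_feats,
                  object_feats_by_oid: dict, top_k: int = 10) -> list:
    """Modify level 1 similarity values by dissimilarity, then rank.

    Subtracting the dissimilarity is an assumption; ties are broken by OID.
    """
    adjusted = {
        oid: score - dissimilarity(query_feats, object_feats_by_oid[oid])
        for oid, score in level1_scores.items()
    }
    return sorted(adjusted, key=lambda oid: (-adjusted[oid], oid))[:top_k]
```

Under these assumptions, an object that carries many features unrelated to the query can fall below a more focused object that scored slightly lower at level 1.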
Accordingly, the invention can provide an information retrieval system that can retrieve information from a database, including documents, images and other forms of multimedia, taking into account ontologies and using a single indexing system, and otherwise overcome many disadvantages and limitations of current systems. The invention can also provide an information indexing system coordinated with the retrieval system for facilitated retrieval of the information. Such information indexing and retrieval systems can be based on a distributed model and can consequently be highly scalable, versatile, robust and economical.