This invention relates to data retrieval and learning systems.
In the context of distributed information systems (e.g., the Internet), there is a need to provide end users with a centralized access and search service for information residing in multiple heterogeneous on-line catalogs. Users should be able to view these on-line catalogs as if they all shared the very same access method, information classification and nomenclature. This concept is called “information integration” and is the subject of several research and development efforts. Among them are:
Stanford University Knowledge Systems Laboratory (KSL) Ontology Server Projects.
Microelectronics and Computer Technology Corporation (MCC): InfoSleuth Project (MCC, Austin, Tex.).
The main problems associated with information integration include dealing with the different conceptualization systems and selecting resources.
Dealing with different conceptualization systems includes providing access to relevant information that is accessible through different classification methods and described using non-identical nomenclatures. This requires bridging the gap between the different conceptualization systems: the one used by the user to describe his query and those used by each of the different information resources. These conceptualization differences range from different classification methods to different nomenclature. For example, consider a user searching for “RS232 Cable for Printer,” which is listed in one on-line catalog under the name “RS232 cable” in the sub-section called “Accessories” in the super-section called “Printers,” and in another on-line catalog under the name “Printer cable” in the section “Hardware accessories.” Bridging this gap is a very tough task, since it involves the formalization of “knowledge.”
Dealing with resource selection includes deciding which of the available information resources are relevant for a specific information request. For example, there is no point in accessing resources providing information about restaurants when the user is looking for an automobile. At the domain level, this is an easy task. However, in larger arrays of information resources from similar domains, the problem becomes harder.
The research projects listed above deal with different aspects of these problems and make different assumptions about the environment. However, prior to the present invention, there have been no general-purpose information integration systems. There are two main reasons for this:
1. There are no automatic mechanisms to “connect” to new information resources. Current solutions to the task of connecting to information resources are based on the assumption that “someone” (either the information requester or the information provider) provides an information source “wrapper” that enables “smooth” integration of the data.
2. There is no way to automatically create a large-scale conceptualization system. A current solution to the problem of creating a common unified conceptualization system is a manual solution provided by the Knowledge Systems Laboratory (KSL) at Stanford University. The KSL staff has developed a set of tools and services to support the process of manually building and achieving consensus on a common shared conceptualization system (termed “Ontology”).
It is only natural, then, that the lack of a real-world conceptualization system adversely affects both the quality of the information being retrieved (recall and precision) and the quality of the user-computer interaction. That is, real-world information integration requires the automatic acquisition of a conceptual knowledge base, i.e., a conceptualization system.
In recent years, the task of automatic knowledge acquisition was usually approached by corpus-based NLP. Free text documents were used as a source for learning different relations between words, e.g., by contextual similarity.
The emergence of a global standard computer network, and more specifically, the Internet, has led to the proliferation of classified on-line catalogs. This enables use of information navigation systems. One of the innovations of the present invention is the usage of the knowledge embedded in these very navigation systems as a new source for the knowledge acquisition task in order to generate a so-called unified classification information graph. Information navigation systems, by their nature, imply hierarchy relations between categories; hence they provide more precise category-relation information than free text does. The categories and the hierarchy relations between categories are utilized in the process of generating the unified classification information graph.
The present invention offers a solution to overcome the difficulties in the usage of multiple resources so as to generate the desired unified classification information graph. For example, the same piece of information may be expressed in different word order or levels of abstraction.
Since on-line catalogs are by nature subject to frequent (and occasionally also major) changes (e.g., new products/categories are added and/or others are deleted), it is important to ensure that all or at least most of the modifications that occur in the on-line catalogs will be reflected in the resulting unified classification information graph. Accordingly, one of the important advantages of the system is the dynamic nature thereof, i.e., the ability to dynamically scan the multiple information resources and update, whenever required, the resulting unified classification information graph.
Thus the invention fulfills a long felt need by providing a system and method for obtaining and integrating multiple classification information resources using a single unified access interface.
One aspect of the invention provides for a method for dynamically obtaining a unified classification information graph that provides a navigation system for a user to access sought information. The method includes providing multiple information resources that include a respective hierarchy of categories each of which is associated with a category; leaf categories in said hierarchy being connected to information pages. The method also includes generating a unified classification information graph utilizing at least the hierarchy of categories and the categories of said multiple information resources; said unified classification graph includes a hierarchy of unified categories; leaf unified categories in said hierarchy being connected to information pages. Information pages accessible through the hierarchy of said multiple information resources are also accessible through the hierarchy of said unified classification information graph.
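By way of non-limiting illustration only, the hierarchy of categories and the accessibility property described above may be sketched as follows. The class name `Category`, the helper functions and the example URLs are hypothetical and are not taken from the specification; the unified graph in the example is built by hand merely to show the property that every information page reachable through the resources is also reachable through the unified hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A node in a hierarchy of categories (hypothetical representation)."""
    name: str
    children: list["Category"] = field(default_factory=list)
    pages: list[str] = field(default_factory=list)  # used only on leaf categories


def reachable_pages(root: Category) -> set[str]:
    """Collect every information page connected to a leaf category under root."""
    if not root.children:
        return set(root.pages)
    found: set[str] = set()
    for child in root.children:
        found |= reachable_pages(child)
    return found


def covers(unified: Category, resources: list[Category]) -> bool:
    """True if every page reachable through any resource is reachable through unified."""
    needed: set[str] = set()
    for resource in resources:
        needed |= reachable_pages(resource)
    return needed <= reachable_pages(unified)


# Two catalogs listing a printer cable under different classifications.
catalog_a = Category("Printers", [
    Category("Accessories", [
        Category("RS232 cable", pages=["a.example/item/1"]),
    ]),
])
catalog_b = Category("Hardware accessories", [
    Category("Printer cable", pages=["b.example/item/9"]),
])

# A hand-built unified hierarchy, for illustration only.
unified = Category("Printers", [
    Category("Cables", pages=["a.example/item/1", "b.example/item/9"]),
])

print(covers(unified, [catalog_a, catalog_b]))
```

In this sketch only leaf categories carry pages, mirroring the statement that leaf categories in the hierarchy are connected to information pages.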
In one embodiment, the providing multiple information resources includes providing at least some of the multiple information resources that are located in sites of the Internet.
In another embodiment, the providing multiple information resources includes providing at least some of the multiple information resources that are located in databases.
In still another embodiment, the providing multiple information resources includes providing at least some of the multiple information resources that are located in an on-line catalog.
Still further, there is provided the step of associating categories in the hierarchy of categories in the multiple information resources with hyperlinks.
Yet still further, there is provided the step of associating categories in the hierarchy of categories in the multiple information resources with menus.
In one embodiment, the generating of a unified classification information graph includes:
initializing so as to generate a respective “link graph” that corresponds to each information resource. The link graph includes link graph categories.
normalizing the link graph categories so as to generate a classification graph that includes classification graph categories.
unifying the classification graph so as to generate the unified classification information graph.
In this embodiment there is further provided the step of providing URL pointers of the on-line catalog for generating the link graph.
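The initializing, normalizing and unifying steps may be sketched, purely as an illustrative assumption, as three small functions. The specification does not define the normalization rule or the merge rule at this point; the sketch assumes that normalization lowercases a category label and sorts its words (to absorb differences in word order), and that unification merges categories whose normalized paths are identical. The `SITE` mapping is a hypothetical in-memory stand-in for catalogs reached through URL pointers, and all function names are illustrative.

```python
import re

# Hypothetical stand-in for crawled catalogs: URL -> list of (link label, target URL).
# A target URL that has no entry of its own is treated as an information page.
SITE = {
    "a.example/": [("Printers", "a.example/printers")],
    "a.example/printers": [("Printer Cables", "a.example/printers/cables")],
    "a.example/printers/cables": [("RS232 cable", "a.example/item/1")],
    "b.example/": [("printers", "b.example/pr")],
    "b.example/pr": [("Cables, Printer", "b.example/pr/c")],
    "b.example/pr/c": [("Printer cable", "b.example/item/9")],
}


def initialize(root_url, site):
    """Step 1: follow links from a root URL and build a link graph,
    {category path (tuple of raw labels): [information page URLs]}."""
    link_graph = {}
    visited = set()
    stack = [(root_url, ())]
    while stack:
        url, path = stack.pop()
        if url in visited:  # guard against cyclic links
            continue
        visited.add(url)
        for label, target in site.get(url, []):
            if target in site:  # target is another category page
                stack.append((target, path + (label,)))
            else:               # target is an information page
                link_graph.setdefault(path, []).append(target)
    return link_graph


def normalize_label(label):
    """Assumed rule: lowercase, drop punctuation, sort the words."""
    return " ".join(sorted(re.findall(r"[a-z0-9]+", label.lower())))


def normalize(link_graph):
    """Step 2: turn a link graph into a classification graph with normalized labels."""
    classification = {}
    for path, pages in link_graph.items():
        key = tuple(normalize_label(label) for label in path)
        classification.setdefault(key, []).extend(pages)
    return classification


def unify(classification_graphs):
    """Step 3: merge classification graphs whose normalized paths coincide."""
    unified = {}
    for graph in classification_graphs:
        for path, pages in graph.items():
            bucket = unified.setdefault(path, [])
            for page in pages:
                if page not in bucket:
                    bucket.append(page)
    return unified


graphs = [normalize(initialize(root, SITE)) for root in ("a.example/", "b.example/")]
unified_graph = unify(graphs)
print(unified_graph)
```

Under these assumptions the two catalogs, which label the same category “Printer Cables” and “Cables, Printer,” collapse into a single unified category that points to the pages of both. A practical system would need a far richer normalization (e.g., handling differing levels of abstraction), which this sketch does not attempt.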
Another aspect of the invention provides for a machine having a memory that contains data representing a unified classification information graph generated by the above method.
Still further, there is provided memory for storing data accessible by an application program, which program is accessed by a user through a user interface for the user to access sought information. The application program is executed on a data processing system. The data includes a data structure stored in the memory, the data structure including a unified classification information graph generated from multiple information resources. The unified classification information graph includes a hierarchy of unified categories; leaf unified categories in said hierarchy being connected to information pages. Information pages that are accessible through the multiple information resources are also accessible through the hierarchy of the unified classification information graph.
The invention further provides for a system for dynamically obtaining a unified classification information graph that provides a navigation system for a user to access sought information. The system includes an input device receiving multiple information resources that include a respective hierarchy of categories each of which is associated with a category. Leaf categories in the hierarchy are connected to information pages. The system also includes a generator, generating a unified classification information graph utilizing at least the hierarchy of categories and the categories of said multiple information resources. The unified classification information graph includes a hierarchy of unified categories. Leaf unified categories in the hierarchy are connected to information pages. Information pages accessible through the hierarchy of the multiple information resources are also accessible through the hierarchy of said unified classification information graph.
Another aspect of the invention provides for use with a unified classification information graph generated by the above method, a method for retrieving information of interest. The method includes providing a user query, and identifying unified categories in the unified classification information graph which substantially match said query. According to the latter embodiment there is further provided the step of identifying the at least one information page in the unified classification information graph that is connected to the unified categories.
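The retrieval method may be sketched as follows, by way of example only. The specification does not state here how a “substantial” match is measured; the sketch assumes word overlap (Jaccard similarity) between the query and the words of a unified category path, with a hypothetical threshold of 0.5. The representation of the unified graph as a mapping from category paths to pages, the function names and the example URLs are likewise assumptions.

```python
import re


def tokens(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(unified_graph, query, threshold=0.5):
    """Return (category path, pages) pairs whose path words overlap the query
    by at least `threshold` (Jaccard similarity), best match first."""
    query_words = tokens(query)
    if not query_words:
        return []
    scored = []
    for path, pages in unified_graph.items():
        category_words = set()
        for label in path:
            category_words |= tokens(label)
        score = len(query_words & category_words) / len(query_words | category_words)
        if score >= threshold:
            scored.append((score, path, pages))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(path, pages) for _, path, pages in scored]


# Hypothetical unified graph: category path -> connected information pages.
example_graph = {
    ("printers", "cables"): ["a.example/item/1", "b.example/item/9"],
    ("printers", "toner"): ["a.example/item/2"],
}

results = search(example_graph, "Printers cables")
print(results)
```

The first step of the method corresponds to the scoring loop (identifying unified categories that match the query), and the further step corresponds to returning the pages connected to those categories.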
Preferably, any information page that is connected to a leaf unified category in the unified classification information graph contains information that can be described by the unified category information of said unified leaf category. Unified category information stands for the unified category of the leaf category and the unified categories of all its ancestors in the hierarchy.
Still further, preferably, all the information pages in the multiple information resources that contain information that can be described by the unified category information of said leaf unified category are connected to the latter.
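As a non-limiting illustration of the term “unified category information,” the following sketch collects the unified category of a leaf together with the unified categories of all its ancestors. The parent-pointer mapping and the function name are hypothetical, and the hierarchy is assumed to be free of cycles.

```python
def unified_category_information(leaf, parent):
    """Return the leaf's unified category followed by those of all its ancestors."""
    info = [leaf]
    current = leaf
    while current in parent:  # walk upward until the root is reached
        current = parent[current]
        info.append(current)
    return info


# Hypothetical hierarchy expressed as child -> parent.
parent_of = {"RS232 cable": "Accessories", "Accessories": "Printers"}

info = unified_category_information("RS232 cable", parent_of)
print(info)
```

Under the preferred properties above, a page connected to the leaf “RS232 cable” would be describable by all three categories in the returned list, and every page so describable would be connected to that leaf.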