This invention relates to data retrieval and learning systems.
In the context of distributed information systems (e.g. the Internet), there is a need to provide end users with centralized access and search services for information residing in multiple heterogeneous on-line catalogs. These on-line catalogs should appear to the users as if they all shared the very same access method, information classification and nomenclature. This concept is called 'information integration' and is the subject of several research and development efforts. Among them are:
Stanford University, Knowledge Systems Laboratory (KSL) Ontology Server Projects
Microelectronics and Computer Technology Corporation (MCC): InfoSleuth
There are three main problems associated with information integration:
1) Different conceptualization systems: providing access to relevant information that is accessible through different classification methods and described using non-identical nomenclatures. This means bridging the gap between the different conceptualization systems, namely the one used by the user to describe his query and those used by each of the different information resources. These conceptualization differences range from classification method to nomenclature (e.g. the user is looking for "RS232 Cable for Printer", which is listed in one on-line catalog under the name "RS232 cable" in the sub-section "Accessories" of the super-section "Printers", and in another on-line catalog under the name "Printer cable" in the section "Hardware accessories"). This is a very difficult task, since it involves the formalization of "knowledge".
2) Resource selection: deciding which of the available information resources is relevant for a specific information request (e.g. there is no point in accessing resources providing information about restaurants when the user is looking for automobiles). At the domain level this is an easy task. However, in larger arrays of information resources from similar domains, the problem becomes harder.
All the research projects listed above deal with different aspects of these problems, making different assumptions about the environment. However, to date, there are no general-purpose information integration systems at all. There are two main reasons for this:
1. There are no automatic mechanisms to "connect" to new information resources. Current solutions to the task of connecting to information resources are based on the assumption that "someone" (either the information requester or the information provider) provides an information source "wrapper" that enables "smooth" integration of the data.
2. There is no way to automatically create a large-scale conceptualization system. The current solution to the problem of creating a common unified conceptualization system is a manual solution provided by the Knowledge Systems Laboratory (KSL) at Stanford University. The KSL staff has developed a set of tools and services to support the process of manually building and achieving consensus on a common shared conceptualization system (termed "Ontology").
It is only natural, then, that the lack of a real-world conceptualization system adversely affects both the quality of the information being retrieved (recall and precision) and the quality of the user-computer interaction. That is, real-world information integration requires the automatic acquisition of a conceptual knowledge base (conceptualization system).
In recent years, the task of automatic knowledge acquisition was usually approached by corpus-based NLP. Free text documents were used as a source for learning different relations between words (e.g., contextual similarity).
The emergence of a global standard computer network, and more specifically the Internet, has led to the proliferation of classified on-line catalogs. This makes it possible to use the information navigation systems of these catalogs. One of the innovations of the present invention is the use of the knowledge embedded in these very navigation systems as a new source for the knowledge acquisition task, in order to generate a so-called unified classification information graph. Information navigation systems, by their nature, imply hierarchy relations between categories, and hence make it possible to learn more precise category-relation information than free text does. The categories and the hierarchy relations between categories are utilized in the process of generating the unified classification information graph.
The present invention shows how to overcome the difficulties in using multiple resources (e.g. the same piece of information may be expressed with a different word order or at a different level of abstraction) so as to generate the desired unified classification information graph.
Since on-line catalogs are by nature subject to frequent (and occasionally also major) changes (e.g. new products/categories are added and/or others are deleted), it is important to ensure that all, or at least most, of the modifications that occurred in the on-line catalogs are reflected in the resulting unified classification information graph. Accordingly, one of the important advantages of the system is its dynamic nature, i.e. the ability to dynamically scan the multiple information resources and update, whenever required, the resulting unified classification information graph.
The invention thus fulfills a long-felt need by providing a system and method for obtaining and integrating multiple classification information resources using a single unified access interface.
By one aspect, the invention provides for a method for dynamically obtaining a unified classification information graph which provides a navigation system for a user to access sought information, comprising:
(a) providing multiple information resources that include a respective hierarchy of categories, each of which is associated with a category; leaf categories in said hierarchy being connected to information pages;
(b) generating a unified classification information graph utilizing at least the hierarchy of categories and the categories of said multiple information resources; said unified classification graph includes hierarchy of unified categories; leaf unified categories in said hierarchy being connected to information pages;
whereby, information pages accessible through the hierarchy of said multiple information resources are also accessible through the hierarchy of said unified classification information graph.
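By way of non-limiting illustration only, the property stated above (pages reachable through each source hierarchy remain reachable through the unified graph) may be sketched as follows. The nested-dictionary catalog layout, the class UnifiedCategory, the function names and the page identifiers are all assumptions made for this sketch and are not taken from the specification. Categories are merged here on exact name equality, which is a deliberate simplification.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedCategory:
    """One node of a (hypothetical) unified classification information graph."""
    name: str
    children: dict = field(default_factory=dict)  # child name -> UnifiedCategory
    pages: list = field(default_factory=list)     # information pages on leaves


def merge_catalog(node, hierarchy):
    """Merge one catalog hierarchy into the subtree rooted at `node`.

    `hierarchy` maps a category name either to a nested dict (sub-categories)
    or to a list of information-page identifiers (leaf category).
    """
    for name, content in hierarchy.items():
        child = node.children.setdefault(name, UnifiedCategory(name))
        if isinstance(content, dict):
            merge_catalog(child, content)
        else:
            child.pages.extend(content)


def build_unified_graph(catalogs):
    """Build a unified graph from several catalog hierarchies."""
    root = UnifiedCategory("root")
    for hierarchy in catalogs:
        merge_catalog(root, hierarchy)
    return root


def reachable_pages(node):
    """Collect every information page reachable from `node`."""
    pages = set(node.pages)
    for child in node.children.values():
        pages |= reachable_pages(child)
    return pages


# Two made-up catalogs that share part of their hierarchy.
catalog_a = {"Printers": {"Accessories": {"RS232 cable": ["a/page1"]}}}
catalog_b = {"Printers": {"Accessories": {"Parallel cable": ["b/page7"]}}}

unified = build_unified_graph([catalog_a, catalog_b])
accessories = unified.children["Printers"].children["Accessories"]
```

In this sketch both leaf categories end up under the single unified category "Accessories", and every page of either catalog is reachable from the unified root.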
By one embodiment, said step (a) includes providing at least some of said multiple information resources that are located in sites of the Internet.
By another embodiment, said step (a) includes providing at least some of said multiple information resources that are located in databases.
By still another embodiment, said step (a) includes providing at least some of said multiple information resources that are located in on-line catalogs.
Still further there is provided the step of associating categories in said hierarchy of categories in said multiple information resources with hyper-links.
Yet still further there is provided the step of associating categories in said hierarchy of categories in said multiple information resources with menus.
By one embodiment said step (b) includes:
(i) initialization so as to generate a respective link graph that corresponds to each information resource; said link graph includes link graph categories;
(ii) normalizing the link graph categories so as to generate a classification graph that includes classification graph categories; and
(iii) unifying said classification graph so as to generate said unified classification information graph.
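By way of non-limiting illustration only, the three stages (i) to (iii) may be sketched as a small pipeline. The specification does not disclose the normalization rules at this point, so the rule used here (lower-casing and sorting the words of a category name, to neutralize differences in word order) is merely an assumption, as are all function names, the "root" label and the sample resources. Identifying a category by its label alone is a further simplification of this sketch.

```python
ROOT = "root"  # hypothetical label for the entry point of each resource


def initialize(resource):
    """Stage (i): derive a link graph, as (parent, child) edges, from a
    resource given as a nested dict of category labels."""
    edges = []

    def walk(parent, subtree):
        for label, child in subtree.items():
            edges.append((parent, label))
            walk(label, child)

    walk(ROOT, resource)
    return edges


def normalize_label(label):
    """Assumed normalization rule: lower-case and sort the words."""
    return " ".join(sorted(label.lower().split()))


def normalize(link_graph):
    """Stage (ii): turn link graph categories into classification graph
    categories by normalizing both ends of every edge."""
    return [(normalize_label(p), normalize_label(c)) for p, c in link_graph]


def unify(classification_graphs):
    """Stage (iii): union the edges of all classification graphs into one
    mapping from a unified category to its set of child categories."""
    unified = {}
    for graph in classification_graphs:
        for parent, child in graph:
            unified.setdefault(parent, set()).add(child)
    return unified


# Two made-up resources naming the same category with different word order.
resource_1 = {"Printers": {"Cable Printer": {}}}
resource_2 = {"printers": {"Printer cable": {}}}

link_graphs = [initialize(r) for r in (resource_1, resource_2)]
classification_graphs = [normalize(g) for g in link_graphs]
unified = unify(classification_graphs)
```

Under the assumed rule, "Cable Printer" and "Printer cable" normalize to the same classification graph category, so the unified result contains a single child category under "printers".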
By this embodiment there is further provided the step of providing a URL pointer of said on-line catalog for generating said link graph.
By another aspect the invention provides for a machine having a memory which contains data representing a unified classification information graph which was generated by the above method.
Still further, there is provided a memory for storing data for access by an application program, which application program is accessed by a user through a user interface for the user to access sought information; the application program being executed on a data processing system; the data comprising:
data structure stored in said memory, which data structure includes a unified classification information graph generated from multiple information resources;
said unified classification graph includes hierarchy of unified categories; leaf unified categories in said hierarchy being connected to information pages;
whereby, information pages accessible through the multiple information resources are also accessible through the hierarchy of said unified classification information graph.
The invention further provides for a system for dynamically obtaining a unified classification information graph which provides a navigation system for a user to access sought information, comprising:
input device receiving multiple information resources that include a respective hierarchy of categories, each of which is associated with a category; leaf categories in said hierarchy being connected to information pages;
generator, generating a unified classification information graph utilizing at least the hierarchy of categories and the categories of said multiple information resources; said unified classification graph includes hierarchy of unified categories; leaf unified categories in said hierarchy being connected to information pages;
whereby, information pages accessible through the hierarchy of said multiple information resources are also accessible through the hierarchy of said unified classification information graph.
By another aspect the invention provides for use with a unified classification information graph generated by the above method, a method for retrieving information of interest comprising:
(i) providing a user query; and
(ii) identifying unified categories in said unified classification information graph which substantially match said query.
According to the latter embodiment there is further provided the step of:
identifying the at least one information page in said unified classification information graph that is connected to said unified categories.
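By way of non-limiting illustration only, the retrieval steps above may be sketched as follows. The specification does not define here what "substantially match" means; this sketch assumes, purely for illustration, a word-overlap (Jaccard) score compared against a threshold of 0.5. The function names, the flat category-to-pages mapping and the sample data are likewise hypothetical.

```python
def tokens(text):
    """Split a query or category name into a set of lower-case words."""
    return set(text.lower().split())


def match_score(query, category_name):
    """Assumed similarity measure: Jaccard overlap of the two word sets."""
    q, c = tokens(query), tokens(category_name)
    if not q or not c:
        return 0.0
    return len(q & c) / len(q | c)


def find_matching_categories(query, category_pages, threshold=0.5):
    """Step (ii): unified categories whose score reaches the threshold."""
    return sorted(
        name for name in category_pages
        if match_score(query, name) >= threshold
    )


def pages_for(categories, category_pages):
    """Further step: information pages connected to the matched categories,
    without duplicates and in first-seen order."""
    pages = []
    for name in categories:
        for page in category_pages[name]:
            if page not in pages:
                pages.append(page)
    return pages


# Made-up unified categories and the pages connected to them.
category_pages = {
    "printer cable": ["a/page1", "b/page7"],
    "printer ink": ["a/page2"],
    "restaurants": ["c/page3"],
}

matched = find_matching_categories("RS232 printer cable", category_pages)
result_pages = pages_for(matched, category_pages)
```

With the assumed threshold, only the unified category "printer cable" matches the sample query, and the pages of both source catalogs connected to it are returned.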
Preferably, any information page that is connected to a leaf unified category in the unified classification information graph contains information that can be described by the unified category information of said unified leaf category. Unified category information stands for the unified category of the leaf category and the unified categories of all its ancestors in the hierarchy.
Still further preferably, all the information pages in the multiple information resources that contain information that can be described by the unified category information of said leaf unified category are connected to the latter.
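By way of non-limiting illustration only, the notion of unified category information, and the two preferred properties stated above, may be sketched as follows. The parent mapping, the per-page descriptor terms and the subset test used to decide whether a page "can be described by" the unified category information are all assumptions of this sketch; the specification does not prescribe such a test.

```python
def unified_category_information(leaf, parent_of):
    """Return the unified category of a leaf followed by the unified
    categories of all its ancestors, from the leaf up to the root."""
    info = []
    node = leaf
    while node is not None:
        info.append(node)
        node = parent_of[node]
    return info


def describable_pages(leaf, parent_of, page_terms):
    """Pages whose (assumed) descriptor terms cover the whole unified
    category information of `leaf`."""
    info = set(unified_category_information(leaf, parent_of))
    return {page for page, terms in page_terms.items() if info <= terms}


# Made-up hierarchy: each unified category maps to its parent (None = root).
parent_of = {
    "rs232 cable": "accessories",
    "accessories": "printers",
    "printers": None,
}

# Made-up descriptor terms for pages found in the information resources.
page_terms = {
    "a/page1": {"rs232 cable", "accessories", "printers"},
    "b/page7": {"rs232 cable", "accessories", "printers", "hardware"},
    "c/page3": {"restaurants"},
}

# Pages connected to the leaf unified category in the unified graph.
connected = {"rs232 cable": {"a/page1", "b/page7"}}

leaf = "rs232 cable"
info = unified_category_information(leaf, parent_of)
describable = describable_pages(leaf, parent_of, page_terms)

# First preferred property: every connected page is describable.
every_connected_page_is_describable = connected[leaf] <= describable
# Second preferred property: every describable page is connected.
every_describable_page_is_connected = describable <= connected[leaf]
```

In this made-up example both properties hold, since the set of connected pages and the set of describable pages coincide.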