The invention relates to methods and apparatus for the classification of information sources and the display of information to a user.
The increasing popularity of high-speed computer networking has made large amounts of data available to individuals. Methods used in the past for dealing with information were adequate when the amount of information was small, but they do not scale up to handle the enormous amount of information that is now easily accessible.
Research is a fundamental activity of knowledge workers, whether they are scientists, engineers or business executives. While each discipline may have its own interpretation of research, the primary meaning of the word is a “careful and thorough search.” In most cases, the thing one is searching for is information. In other words, one of the most important activities of modern educated individuals is searching for information. Whole industries have arisen to meet the need for thorough searching. These include libraries, newspapers, magazines, abstracting services and online search services.
Not surprisingly, the search process itself has been studied at least since the 1930s, and a standard model was developed by the mid-1960s. In this model, the searcher has an “information need” which the searcher tries to satisfy using a large collection or “corpus” of information sources. The information sources that satisfy the searcher's needs are the “relevant” information sources. The searcher expresses an information need using a formal statement called a “query.” Queries may be expressed using topics, categories and/or words. The query is then given to a search intermediary. In the past, the intermediary was a person who specialized in searching. It is more common today for the intermediary to be a computer system. Such systems are called information retrieval systems or online search engines. The search intermediary tries to match the topics, categories and/or words from the query with information sources in the corpus. The intermediary responds with a set of information sources that, so it is hoped, satisfies the searcher's needs.
Also, in accordance with the standard model, another very commonly used technique to find information in a corpus is to start with a document and then follow citations or references within the document to find other documents in the corpus. References in these documents are then used to find further documents. This technique is called “browsing” and online browsing tools are now becoming very popular. Such tools allow a searcher to quickly follow references contained in information sources, often by simply “clicking” on a word or picture within the information source. In the standard model for information retrieval, a sharp distinction is made between searching using queries and searching using references.
Computerized search engines have been developed to assist in information retrieval. Some are primarily based on matching words in a query with words in text documents. In practice, this means that this type of search engine cannot search effectively for features of images and other kinds of multimedia. Non-word based techniques currently employ approaches to extracting relevant information that are different and distinct from those used in word based systems and generally involve extracting data “features” from the raw data. Features of images, sound and video streams can be represented in a computer system as a set of data structures stored in a database.
Features can be as simple as the value of an attribute such as brightness of an image, but many features are more complicated and are thus represented using a complex data structure. Typically, features can be extracted from structured documents by parsing the document to produce data structures, and can be extracted from unstructured documents by using one of the many feature extraction algorithms that have been developed for implementation on a computer. As in the case of structured documents, feature extraction from an unstructured document produces data structures.
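By way of illustration only, the simplest kind of feature mentioned above, a single attribute value such as image brightness, might be extracted as in the following sketch. The function name `mean_brightness`, the nested-list image format and the dictionary layout of the resulting data structure are assumptions made for this example and are not taken from any particular feature extraction algorithm.

```python
def mean_brightness(pixels):
    """Average intensity of a grayscale image given as rows of 0-255 values."""
    values = [value for row in pixels for value in row]
    return sum(values) / len(values)


# Hypothetical 2x2 grayscale image (illustrative data only).
image = [[0, 64], [128, 255]]

# The extracted feature is stored as a small data structure with one component.
feature = {"name": "brightness", "value": mean_brightness(image)}
print(feature)
```

More complicated features would replace the single `value` entry with several inter-related components.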
A large variety of feature extraction algorithms has been developed for media such as sound, images and video streams. For a discussion of such algorithms, see The Ninth International Conference on Image Analysis and Processing, A. Del Bimbo, editor, v. 1311, Springer Verlag and Company, September 1997, which is incorporated in its entirety by reference.
The data structures that represent features typically conform to a “data model” for the database that determines the kinds of components and attribute values that are allowed. Each feature can have one or more values associated with components of the data structure that represents the feature. In the simplest case, the data structure can have a single component with an associated value, and the feature can be represented by one attribute of the object. Features that are more complex can be represented by several inter-related components, each of which may have attribute values. The data model for features at the domain level is often called an “ontology.” An ontology models knowledge within a particular domain, such as, for example, medicine. An ontology can include a concept network, specialized vocabulary, syntactic forms and inference rules. In particular, an ontology specifies the features that objects can possess as well as how to extract features from objects. When the extracted features are represented as a computer data structure, the data structure is called a “knowledge representation” of the information source.
In the standard model, the quality of a search is measured using two numbers. The first number represents how thorough the search was. It is the fraction of the total number of relevant information sources that are presented to the searcher. This number is called the “recall.” If the recall is less than 100%, then some relevant information sources have been missed. The second number represents the fraction of the total number of information sources that are presented to the searcher that are judged to be relevant. This number is called the “precision.” If the precision is less than 100%, then some irrelevant information sources were presented to the searcher.
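The two measures defined above can be computed as in the following minimal sketch. The function name, the source identifiers and the convention of returning 1.0 when a set is empty are assumptions made for this example only.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a single search.

    retrieved: set of source identifiers presented to the searcher
    relevant:  set of source identifiers that satisfy the information need
    """
    hits = retrieved & relevant
    # Assumed convention: an empty denominator yields 1.0 (nothing was missed
    # or nothing irrelevant was shown).
    recall = len(hits) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    return recall, precision


# Hypothetical search: 5 relevant sources exist, 4 are presented, 3 of those are relevant.
relevant = {"d1", "d2", "d3", "d4", "d5"}
retrieved = {"d1", "d2", "d3", "d9"}
recall, precision = recall_and_precision(retrieved, relevant)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this example two relevant sources are missed, so the recall is 3/5, and one irrelevant source is presented, so the precision is 3/4.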
The recall can always be increased by adding many more information sources to those already presented, which can decrease the precision. Similarly, the precision can be increased by reducing the number of references retrieved and presented to the searcher, which can decrease the recall. Ideally, the recall and precision should be balanced so as to achieve a search that is as careful and thorough as possible. However, typical online search engines can achieve only about 60% recall and 40% precision. Surprisingly, these performance rates have not changed significantly in the last 20 years.
The standard model for information retrieval uses recall and precision as measures of “relevance.” Relevance is a central concept in human (as opposed to computer) communication. This was recognized already in the 1940s when information science was first being formed as a discipline. The first formal in-depth discussion of relevance occurred in 1959, and the topic was discussed intensively during the 1960s and early 1970s. As a result of such discussions, researchers began to study relevance from a human perspective. The two best-known studies were by Cuadra and Katter and by Rees and Schultz, both of which appeared in 1967. The main conclusions of these studies are that the recall and precision rates used in the standard model for information retrieval do not accurately represent how people perceive relevance. People perceive an information source to be relevant if it extends their knowledge and, thus, relevance is determined by the difference between what is known and what is yet to be known. For example, if a search uncovers an information source that is already known to a searcher, the searcher will consider the source to be redundant rather than relevant. However, in accordance with the standard model for information retrieval, such a source would be considered perfectly relevant.
Therefore, there is a need for a search tool that improves the recall and precision of searches and also produces results that are perceived as relevant by the searcher.
In accordance with one embodiment, both the information sources and queries are processed to generate knowledge representations that consist of graph structures. The knowledge representation graph structures are converted into graph structure views and the graph structure views for both the query and the information sources are then displayed to a searcher. By manipulating the graph structure views for each information source, the searcher can examine the source for relevance.
In accordance with another embodiment, available information sources are classified by comparing the knowledge representation of a query with the knowledge representations of the information sources by matching the graph structures with graph matching algorithms. Those information sources that have a substructure that matches the query in full, or in part, are classified by the largest matching substructure of the query. Thus, it is possible for a searcher to request the “next occurrence” of a knowledge representation graph structure in an information source. In this case, the computer system searches the current information source knowledge representation for another substructure that matches the query graph structure occurring at a subsequent point in the information source. Similarly, requesting a “previous occurrence” causes the system to search for a matching substructure occurring at a previous point in the information source.
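The classification and the occurrence requests described above might be sketched as follows. This is a simplified illustration and not a general graph matching algorithm: edges are assumed to match only when their concept labels and edge type are identical, each information source is assumed to be a list of (position, edges) pairs, and all names and data are hypothetical.

```python
from collections import defaultdict


def matched_part(query, source_edges):
    """Edges of the query that also occur in the source (label equality only)."""
    return frozenset(query) & frozenset(source_edges)


def classify(query, sources):
    """Group source names by the part of the query that each source matches."""
    classes = defaultdict(list)
    for name, located_edges in sources.items():
        all_edges = {edge for _, edges in located_edges for edge in edges}
        part = matched_part(query, all_edges)
        if part:
            classes[part].append(name)
    return dict(classes)


def next_occurrence(query, located_edges, position):
    """Smallest position after `position` whose edges contain the whole query."""
    later = [p for p, edges in located_edges
             if p > position and set(query) <= set(edges)]
    return min(later, default=None)


def previous_occurrence(query, located_edges, position):
    """Largest position before `position` whose edges contain the whole query."""
    earlier = [p for p, edges in located_edges
               if p < position and set(query) <= set(edges)]
    return max(earlier, default=None)


# Hypothetical edges: (source concept, edge type, target concept).
e1 = ("aspirin", "treats", "headache")
e2 = ("aspirin", "irritates", "stomach")
e3 = ("ibuprofen", "treats", "fever")

query = {e1, e2}
sources = {
    "report_a": [(0, {e1, e2}), (1, {e3}), (2, {e1, e2})],
    "report_b": [(0, {e1})],
}

classes = classify(query, sources)
print(classes[frozenset({e1, e2})])                    # sources matching the full query
print(classes[frozenset({e1})])                        # sources matching part of it
print(next_occurrence(query, sources["report_a"], 0))
print(previous_occurrence(query, sources["report_a"], 2))
```

A fuller implementation would replace the label-equality test with a subgraph matching algorithm, so that structurally equivalent substructures are also found.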
In still another embodiment, information sources are classified by constructing hierarchies of knowledge representations. The simplest construction is obtained by using the knowledge representation of a query as the top of the hierarchy. The structures in the hierarchy are then substructures of the query. The hierarchy of structures may also be constructed by using the knowledge representation of the query as the bottom of the hierarchy. Structures in the hierarchy, in this case, are structures that contain the query. Views of this hierarchy can be displayed to a searcher with a substructure view being displayed adjacent to the information source from which it was derived.
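The simplest construction, with the query at the top of the hierarchy, might be sketched as follows. The sketch assumes that a substructure is any non-empty subset of the query's edges, which ignores connectivity, and it enumerates all subsets, which is practical only for small queries. The function names and sample edges are hypothetical.

```python
from itertools import combinations


def substructure_hierarchy(query):
    """Levels of the hierarchy, from the full query down to single edges."""
    edges = sorted(query)
    return [
        [frozenset(subset) for subset in combinations(edges, size)]
        for size in range(len(edges), 0, -1)
    ]


def place(query, source_edges):
    """Node of the hierarchy under which a source is filed (None if no match)."""
    part = frozenset(query) & frozenset(source_edges)
    return part or None


# Hypothetical query with three edges: (source concept, edge type, target concept).
query = {
    ("aspirin", "treats", "headache"),
    ("aspirin", "irritates", "stomach"),
    ("headache", "symptom_of", "migraine"),
}

levels = substructure_hierarchy(query)
for depth, level in enumerate(levels):
    print(depth, len(level))

# A source containing only one query edge is filed under that single-edge node.
node = place(query, {("aspirin", "treats", "headache"), ("tea", "contains", "caffeine")})
print(node)
```

The alternative construction, with the query at the bottom, would instead collect structures from the information sources that contain every edge of the query.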
In accordance with yet another embodiment, the graph structure corresponding to a knowledge representation consists of vertices joined by directed edges. Each vertex represents a concept that can be visually portrayed as a word, phrase and/or icon. A vertex may also contain a category that is visually portrayed either textually or by a distinct shape, color and/or icon. An edge may be labeled by an edge type. Different types of edges can be distinguished by using a textual label or by using a distinct shape, color and/or icon. Two vertices that are joined by an edge are called adjacent vertices. The categories, concepts and edge types used to construct the graph structure are specified by an ontology for the knowledge domain.
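One possible in-memory form of such a graph structure is sketched below. The class names, the `ONTOLOGY` dictionary and the medical concepts are assumptions chosen for illustration; an actual ontology would be considerably richer.

```python
from dataclasses import dataclass, field

# Hypothetical ontology for a medical domain (illustrative names only).
ONTOLOGY = {
    "categories": {"Drug", "Symptom", "Organ"},
    "edge_types": {"treats", "irritates"},
}


@dataclass(frozen=True)
class Vertex:
    concept: str   # portrayed as a word, phrase and/or icon
    category: str  # portrayed textually or by shape, color and/or icon


@dataclass
class KnowledgeGraph:
    vertices: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, edge_type, target) triples

    def add_edge(self, source, edge_type, target):
        """Add a directed, typed edge after checking it against the ontology."""
        if edge_type not in ONTOLOGY["edge_types"]:
            raise ValueError(f"edge type not in ontology: {edge_type}")
        for vertex in (source, target):
            if vertex.category not in ONTOLOGY["categories"]:
                raise ValueError(f"category not in ontology: {vertex.category}")
        self.vertices.update((source, target))
        self.edges.add((source, edge_type, target))

    def adjacent(self, vertex):
        """Vertices joined to `vertex` by an edge in either direction."""
        outgoing = {t for s, _, t in self.edges if s == vertex}
        incoming = {s for s, _, t in self.edges if t == vertex}
        return outgoing | incoming


aspirin = Vertex("aspirin", "Drug")
headache = Vertex("headache", "Symptom")
stomach = Vertex("stomach", "Organ")

graph = KnowledgeGraph()
graph.add_edge(aspirin, "treats", headache)
graph.add_edge(aspirin, "irritates", stomach)
print(sorted(v.concept for v in graph.adjacent(aspirin)))
```

Adjacency is treated here as independent of edge direction, which follows the definition of adjacent vertices given above.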
In accordance with a further embodiment, the vertices of a graph structure view can be displayed on a computer screen next to the corresponding items, such as words, phrases and visual features, of an information source view. Selecting a vertex in the graph structure view causes the selected vertex and vertices adjacent to the selected vertex to be “highlighted.” In addition, the corresponding items in the information source view are highlighted. Similarly, selecting a feature in the information source view causes the corresponding vertex in the graph structure to be highlighted. Highlighting can be accomplished by using the same feature (such as the same color or the same location on the screen) for corresponding parts of the two views.
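The linked highlighting of the two views might be sketched as follows. The sketch assumes that each vertex corresponds to one character span of a text source and that selection returns the set of things to highlight rather than drawing anything. The sample text, edges and function names are hypothetical.

```python
text = "Aspirin relieves headache but may irritate the stomach"

# Hypothetical graph: (source concept, edge type, target concept).
edges = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "irritates", "stomach"),
]


def find_span(concept):
    """Character span (start, end) of the first mention of a concept in the text."""
    start = text.lower().find(concept)
    return (start, start + len(concept))


# Link between the graph structure view and the information source view.
spans = {c: find_span(c) for edge in edges for c in (edge[0], edge[2])}


def select_vertex(vertex):
    """Highlight the vertex, its adjacent vertices and their items in the text."""
    adjacent = {t for s, _, t in edges if s == vertex}
    adjacent |= {s for s, _, t in edges if t == vertex}
    highlighted = {vertex} | adjacent
    return highlighted, sorted(spans[v] for v in highlighted)


def select_item(offset):
    """Vertex whose text span contains the selected character offset, if any."""
    for vertex, (start, end) in spans.items():
        if start <= offset < end:
            return vertex
    return None


vertices, text_spans = select_vertex("headache")
print(sorted(vertices), [text[a:b] for a, b in text_spans])
print(select_item(text.index("stomach")))
```

Selecting the headache vertex highlights that vertex and the adjacent aspirin vertex, together with their words in the text, while the stomach vertex is left unhighlighted because it is not adjacent to the selection.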
By selecting a succession of vertices in the graph structure view, a searcher can perform knowledge navigation of the information source. By successively selecting items in the information source view, a searcher can perform knowledge exploration of the information source.