The World Wide Web (WWW) contains a vast amount of information which is generally not indexed. As is well-known, a user seeking specific information on the WWW is faced with a problem in trying to locate such information efficiently. In many cases, this leads to a sub-problem for a user who needs to select one or more documents from a finite set of documents. Typically, this problem is encountered when a user must choose from among the many results returned by a search engine query.
Examples of other sets of documents encountered include: the set of documents gathered by an agent; the set of documents accessible through links from the current document; a user's set of bookmarked documents; and the various forms of a navigation history list. An aspect of the present invention is to provide a representation of a structured document that can be easily scanned, so that when presented with a number of links to documents, a user can more quickly select the relevant documents of interest without being forced to examine each one.
Such sets of document links are typically presented as lists of document titles, sometimes with additional information such as a text summary or a date, as well as a hyperlink to the subject document. A typical example is shown in FIG. 1, which gives partial results of a query on the Alta Vista search engine on the keyword "snowboarding". The summarized information is useful, especially if the pre-selection of the documents in the group is accurate. Unfortunately, it is a cumbersome format, for it requires the user to read through a lot of text in order to determine which are the relevant documents. Sets of the kind described above often contain dozens, sometimes hundreds of documents, so that reading all such passages is generally impracticable, requiring unreasonable expenditure of time and effort. Moreover, since WWW document titles do not always reveal the nature of the contents, the user may be forced to retrieve and evaluate a large number of documents for more detailed perusal in order to determine which pages are useful.
Related research has been conducted in the fields of information visualization and document summarizing. Work in document summarization attempts to abstract the essence of a document through text-based semantic analysis. See, for example, Paice, C. Constructing Literature Abstracts by Computer: Techniques and Prospects. Information Processing and Management, 26:171-186, 1990. The goal is usually to output a text display, and when applied to the WWW, this gives results much like those of the search engines: too dense to scan quickly.
Work in information visualization has provided a visual abstraction to the results of semantic analysis. See, for example, Mukherjea, S., and Foley, J. Visualizing the World-Wide Web with the Navigational View Builder. Technical Report #95-09 of the Graphics, Visualization and Usability Center, Georgia Institute of Technology, 1995 and Preiser, U., Structured Visualization of Search Results, To appear in: Data Highways and Information Flooding, a Challenge for Classification and Data Analysis, Proceedings of the 21st Annual Conference of the Gesellschaft fuer Klassifikation e.V.
However, such systems have generally provided abstractions which are quite sophisticated and difficult for an untrained user to interpret. In this respect, they do not provide much improvement over text-based systems for quick scanning. Both types of systems are based mainly on free-text analysis for their semantic distillation. Unfortunately, free-text analysis often requires some domain knowledge in order to provide useful results. The WWW offers such a diverse collection of documents that such domain knowledge generally cannot be assumed on the part of a user.
Some approaches require the Web document author to include additional meta-information on their pages. While this technique can allow very specific information about a page's contents to be obtained immediately and without error, it has the drawback of relying heavily on the author's perception of what is needed, and the author may not have the same goals as the Web user. Another problem is that an author may not invest the effort to include all of the document's meta-information, especially if several systems require different meta-information.
It is herein recognized that only a small fraction of Web documents currently contain complete meta-data, and other pages are completely opaque to meta-data reliant systems. Furthermore, some authors may include arguably incorrect meta-information which may be primarily intended to attract readers to their document rather than to accurately represent the document's content. It is herein recognized that the expanding commercial nature of the WWW is likely to increase the motivation for such information being entered. See, for example, Ross, E., Drive Search Engine Traffic To Your Site, Avatar Online Magazine, http://www.avatarmag.com/columns/sitemanage/default.htm, February 1997.
In accordance with the principles of the present invention, simple visual cues like shape, color and icons are used to represent features which can be detected in a document syntactically. Combinations of well-recognized visual cues are utilized to make the system in accordance with the invention useful to an untrained user. By basing a semantic abstraction on syntactic features, the intractability of arbitrary text analysis is largely avoided, although the approach is complementary to such analysis. By not requiring any special web-page formatting like meta-tags, the system in accordance with the invention can be used universally on HTML WWW documents.
It is herein recognized that it may be helpful to a clearer understanding of the invention to liken an aspect of the system in accordance with the invention to caricature drawings which one might find, for example, as part of a political cartoon. In such a cartoon, the entire set of famous humans can be distinguished by a representation which varies only over a few dimensions or characteristics: the size and shape of eyes, ears, nose, head and hair. The capacity of the human brain to recognize patterns of visual information allows one to identify the lampooned character by the exaggerated representations of appurtenant basic features.
It is also herein recognized that in many cases, humans can recognize caricatures of people more easily than more precisely accurate drawings. See the so-called "caricature advantage" in, for example, Rhodes, G., Brennan, S., and Carey, S. Identification and ratings of caricatures: Implications for mental representations of faces. Cognitive Psychology, 19(4), 1987, 473-497 and Benson, P. J., and Perrett, D. I. Perception and recognition of photographic quality caricatures: Implications for the recognition of natural images. European Journal of Cognitive Psychology, 3(1), 1991, 105-135.
In accordance with an aspect of the invention, a method for managing document information on an information net, such as the World Wide Web (WWW), comprises the steps of inputting a structured document; extracting selected document properties from the structured document; forming a feature vector representative of the properties; and outputting the feature vector.
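The steps of inputting a structured document, extracting properties, and forming a feature vector might be sketched as follows. This is only an illustrative sketch in Python using the standard `html.parser` module; the `FeatureVector` fields, the `PropertyExtractor` class, and the function `extract_feature_vector` are hypothetical names, and the particular properties counted are assumptions chosen for illustration rather than the property set of the invention.

```python
from dataclasses import dataclass, astuple
from html.parser import HTMLParser


@dataclass
class FeatureVector:
    # Hypothetical property set chosen for illustration only.
    text_chars: int = 0   # non-whitespace characters of rendered text
    images: int = 0       # number of <img> elements
    links: int = 0        # number of <a href=...> anchors
    headings: int = 0     # number of <h1>..<h6> elements


class PropertyExtractor(HTMLParser):
    """Counts a few syntactic features while the document is parsed."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP = {"script", "style"}  # content that is not rendered as text

    def __init__(self):
        super().__init__()
        self.features = FeatureVector()
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "img":
            self.features.images += 1
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            self.features.links += 1
        elif tag in self.HEADINGS:
            self.features.headings += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            # Count non-whitespace characters only.
            self.features.text_chars += len("".join(data.split()))


def extract_feature_vector(html: str) -> FeatureVector:
    """Input a structured document and output its feature vector."""
    parser = PropertyExtractor()
    parser.feed(html)
    parser.close()
    return parser.features
```

Calling `astuple(extract_feature_vector(page))` on a page held in a string `page` yields the vector as a plain tuple, which is one convenient form for the outputting step.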
In accordance with another aspect of the invention, the method includes a step of forming a caricature derived from the feature vector.
In accordance with another aspect of the invention, the step of forming a caricature derived from the feature vector comprises: inputting a caricature template; utilizing the feature vector and the caricature template to map features of the structured document to visual representations; generating a caricature specification from the visual representations; rendering the caricature specification; and visually displaying the caricature specification.
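The sequence recited above (template input, mapping of features to visual representations, specification, rendering) can be illustrated with a small sketch. Everything named here is an assumption made for illustration: the `TEMPLATE` dictionary, the thresholds inside it, the feature names, and the choice of SVG as the rendering target are not taken from the specification.

```python
# Hypothetical caricature template: each visual attribute is driven by one
# feature through a mapping function. Thresholds are arbitrary examples.
TEMPLATE = {
    "size": ("text_chars", lambda n: 20 + min(n, 10000) // 250),  # radius, px
    "color": ("link_density", lambda d: "red" if d > 0.5 else "steelblue"),
    "shape": ("images", lambda n: "rect" if n > 3 else "circle"),
}


def build_specification(features, template):
    """Map document features to visual representations via the template."""
    return {attr: fn(features[name]) for attr, (name, fn) in template.items()}


def render_svg(spec):
    """Render a caricature specification as a minimal SVG string."""
    r = spec["size"]
    side = 2 * r
    color = spec["color"]
    if spec["shape"] == "circle":
        body = f'<circle cx="{r}" cy="{r}" r="{r}" fill="{color}"/>'
    else:
        body = f'<rect width="{side}" height="{side}" fill="{color}"/>'
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{side}" height="{side}">{body}</svg>'
    )


# Example feature vector, expressed as a mapping for readability.
features = {"text_chars": 5000, "link_density": 0.8, "images": 1}
spec = build_specification(features, TEMPLATE)
svg = render_svg(spec)
```

In this sketch the display step is left to whatever viewer consumes the SVG string, for example a browser. Keeping the template as data separate from the rendering code means a different template could be supplied without changing the renderer.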
In accordance with another aspect of the invention, the step of extracting selected document properties comprises extracting properties relating to media content data from the structured document.
In accordance with another aspect of the invention, the step of extracting selected document properties comprises a step of extracting a representative image from the structured document.
In accordance with another aspect of the invention, the step of extracting selected document properties comprises a step of extracting properties relating to link density from the structured document by determining the ratio of the sum of the number of text characters within hyperlink anchors and a weighted number of hyperlinks within images and maps to the total number of rendered text characters.
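One possible reading of the link density computation is sketched below in Python. It assumes the numerator is the sum of the anchor text characters and a weighted count of image and map hyperlinks. The weight value (`image_link_weight`), the convention of returning 0.0 for a document with no rendered text, and the names `LinkDensityParser` and `link_density` are all assumptions introduced for this illustration.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Collects the counts needed for a link density ratio."""

    def __init__(self):
        super().__init__()
        self._a_stack = []      # one bool per open <a>: does it carry href?
        self._skip_depth = 0    # inside <script>/<style>, text is not rendered
        self.anchor_chars = 0   # text characters inside hyperlink anchors
        self.total_chars = 0    # all rendered text characters
        self.image_links = 0    # hyperlinks carried by images and maps

    def handle_starttag(self, tag, attrs):
        has_href = any(name == "href" for name, _ in attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            self._a_stack.append(has_href)
        elif tag == "img" and any(self._a_stack):
            self.image_links += 1   # image used as a hyperlink
        elif tag == "area" and has_href:
            self.image_links += 1   # clickable region of an image map

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._a_stack:
            self._a_stack.pop()

    def handle_data(self, data):
        if self._skip_depth:
            return
        n = len("".join(data.split()))  # ignore whitespace
        self.total_chars += n
        if any(self._a_stack):
            self.anchor_chars += n


def link_density(html: str, image_link_weight: float = 10.0) -> float:
    """(anchor chars + weight * image/map links) / total rendered chars."""
    p = LinkDensityParser()
    p.feed(html)
    p.close()
    if p.total_chars == 0:
        return 0.0  # assumed convention for documents with no text
    return (p.anchor_chars + image_link_weight * p.image_links) / p.total_chars
```

The weight expresses how many text characters one image or map hyperlink is taken to be worth; under this reading, a page consisting mostly of navigation links scores high, while a page of running prose with few links scores near zero.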
In accordance with another aspect of the invention, a method for managing document information on the World Wide Web (WWW) comprises the steps of: inputting a structured document; extracting selected document properties from the structured document, comprising any of: basic document properties, media content data, link density, document complexity, and a representative image from the structured document; forming a feature vector representative of the properties; inputting a caricature template; forming a caricature from the template and the feature vector; and visually displaying the caricature.
In accordance with another aspect of the invention, the caricature is displayed utilizing its attributes of sizes to represent one of the document properties.
In accordance with another aspect of the invention, the caricature is displayed utilizing its attributes of colors to represent one of the document properties.
In accordance with another aspect of the invention, the caricature is displayed utilizing its attributes of shapes to represent one of the document properties.