The invention relates to information storage and retrieval and more particularly to an efficient scheme for assigning data objects in a collection to clusters based on similarities in their contents and characteristics.
Computer users are increasingly finding navigating document collections to be difficult because of the increasing size of such collections. For example, the World Wide Web on the Internet includes millions of individual pages. Moreover, large companies' internal Intranets often include repositories filled with many thousands of documents.
It is frequently true that the documents on the Web and in Intranet repositories are not very well indexed. Consequently, finding desired information in such a large collection, unless the identity, location, or characteristics of a specific document are well known, can be much like looking for a needle in a haystack.
The World Wide Web is a loosely interlinked collection of documents (mostly text and images) located on servers distributed over the Internet. Generally speaking, each document has an address, or Uniform Resource Locator (URL), in the exemplary form "http://www.server.net/directory/file.html". In that notation, the "http:" specifies the protocol by which the document is to be delivered, in this case the "HyperText Transfer Protocol." The "www.server.net" specifies the name of a computer, or server, on which the document resides; "directory" refers to a directory or folder on the server in which the document resides; and "file.html" specifies the name of the file.
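The URL notation described above can be illustrated with a minimal sketch that splits the exemplary address into its parts using Python's standard library. The function name `split_url` and the dictionary keys are illustrative assumptions, not terms from this disclosure.

```python
from urllib.parse import urlparse
import posixpath


def split_url(url):
    """Split a URL into protocol, server, directory, and file name."""
    parts = urlparse(url)
    # The path component looks like "/directory/file.html".
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,              # e.g. "http"
        "server": parts.netloc,                # e.g. "www.server.net"
        "directory": directory.lstrip("/"),    # e.g. "directory"
        "file": filename,                      # e.g. "file.html"
    }


example = split_url("http://www.server.net/directory/file.html")
print(example)
```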
Most documents on the Web are in HTML (HyperText Markup Language) format, which allows for formatting to be applied to the document, external content (such as images and other multimedia data types) to be introduced within the document, and "hotlinks" or "links" to other documents to be placed within the document, among other things. "Hotlinking" allows a user to navigate between documents on the Web simply by selecting an item of interest within a page. For example, a Web page about reprographic technology might have a hotlink to the Xerox corporate web site. By selecting the hotlink (often by clicking a marked word, image, or area with a pointing device, such as a mouse), the user's Web browser is instructed to follow the hotlink (usually via a URL, frequently invisible to the user, associated with the hotlink) and read a different document.
Obviously, a user cannot be expected to remember a URL for each and every document on the Internet, or even those documents in a smaller collection of preferred documents. Accordingly, navigation assistance is not only helpful, but necessary.
Accordingly, when a user desires to find information on the Internet (or other large network) that is not already represented in the user's bookmark collection, the user will frequently turn to a "search engine" to locate the information. A search engine serves as an index into the content stored on the Internet.
There are two primary categories of search engines: those that include documents and Web sites that are analyzed and used to populate a hierarchy of subject-matter categories (e.g., Yahoo), and those that "crawl" the Web or document collections to build a searchable database of terms, allowing keyword searches on page content (such as AltaVista, Excite, and Infoseek, among many others).
Also known are recommendation systems, which are capable of providing Web site recommendations based on criteria provided by a user or by comparison to a single preferred document (e.g., Firefly, Excite's "more like this" feature).
"Google" (www.google.com) is an example of a search engine that incorporates several recommendation-system-like features. It operates in a similar manner to traditional keyword-based search engines, in that a search begins by the user's entry of one or more search terms used in a pattern-matching analysis of documents on the Web. It differs from traditional keyword-based search engines (such as AltaVista), in that search results are ranked based on a metric of page "importance," which differs from the number of occurrences of the desired search terms (and simple variations upon that theme).
Google's metric of importance is based upon two primary factors: the number of pages (elsewhere on the Web) that link to a page (i.e., "inlinks," defining the retrieved page as an "authority"), and the number of pages that the retrieved page links to (i.e., "outlinks," defining the retrieved page as a "hub"). A page's inlinks and outlinks are weighted, based on the Google-determined importance of the linked pages, resulting in an importance score for each retrieved page. The search results are presented in order of decreasing score, with the most important pages presented first. It should be noted that Google's page importance metric is based on the pattern of links on the Web as a whole, and is not limited (and at this time cannot be limited) to the preferences of a single user or group of users.
Another recent non-traditional search engine is IBM's CLEVER (CLient-side EigenVector Enhanced Retrieval) system. CLEVER, like Google, operates like a traditional search engine, and uses inlinks/authorities and outlinks/hubs as metrics of page importance. Again, importance (based on links throughout the Web) is used to rank search results. Unlike Google, CLEVER uses page content (e.g., the words surrounding inlinks and outlinks) to attempt to classify a page's subject matter. Also, CLEVER does not use its own database of Web content; rather, it uses an external hub, such as an index built by another search engine, to define initial communities of documents on the Web. From hubs on the Web that frequently represent people's interests, CLEVER is able to identify communities, and from those communities, identify related or important pages.
Direct Hit is a service that cooperates with traditional search engines (such as HotBot), attempting to determine which pages returned in a batch of results are interesting or important, as perceived by users who have previously performed similar searches. Direct Hit tracks which pages in a list of search results are accessed most frequently; it is also able to track the amount of time users spend at the linked sites before returning to the search results. The most popular sites are promoted (i.e., given higher scores) for future searches.
Alexa is a system that is capable of tracking a user's actions while browsing. By doing so, Alexa maintains a database of users' browsing histories. Page importance is derived from other users' browsing histories. Accordingly, at any point (not just in the context of a search), Alexa can provide a user with information on related pages, derived from overall traffic patterns, link structures, page content, and editorial suggestions.
Knowledge Pump, a Xerox system, provides community-based recommendations by initially allowing users to identify their interests and "experts" in the areas of those interests. Knowledge Pump is then able to "push" relevant information to the users based on those preferences; this is accomplished by monitoring network traffic to create profiles of users, including their interests and "communities of practice," thereby refining the community specifications. However, Knowledge Pump does not presently perform any enhanced search and retrieval actions like the search-engine-based systems described above.
While the foregoing systems and services blend traditional search engine and recommendation system capabilities to some degree, it should be recognized that none of them are presently adaptable to provide search-engine-like capabilities while taking into account the preferences of a smaller group than the Internet as a whole. In particular, it would be beneficial to be able to incorporate community- or cluster-based recommendations into a system that is capable of retrieving previously unknown documents from the Internet or other collection of documents.
Accordingly, when dealing with a large collection, or corpus, of documents, it is useful to be able to search, browse, retrieve, and view those documents based on their content. However, this is difficult in many cases because of limitations in the documents. For example, many kinds of information are available in a typical collection of documents such as the files on the World Wide Web. There are text files, HTML (HyperText Markup Language) documents including both text and images, images by themselves, sound files, multimedia files, and other types of content.
To easily browse and retrieve images, each image in a collection ideally should be labeled with descriptive information including the objects in the image and a general description of the image. However, identification of the objects in an unrestricted collection of images, such as those on the web, is a difficult task. Methods for automatically identifying objects are usually restricted to a particular domain, such as machine parts. And having humans identify each image is an onerous undertaking, and in some cases impossible, as on the web.
Much research in information retrieval has focused on retrieving text documents based on their textual content or on retrieving image documents based on their visual features. Moreover, with the explosion of information on the web and corporate intranets, users are inundated with hits when searching for specific information. The task of sorting through the results to find what is really desired is often tedious and time-consuming. Recently, a number of search engines have added functionality that permits users to augment queries from traditional keyword entries through the use of metadata (e.g., HotBot, Infoseek). The metadata may take on various forms, such as language, dates, location of the site, or whether other modalities such as images, video or audio are present.
Recently, however, there has been some research on the use of multi-modal features for retrieval. Presented herein are several approaches allowing a user to locate desired information based on the multi-modal features of documents in the collection, as well as similarities among users' browsing habits.
Set forth herein is an approach to document browsing and retrieval in which a user iteratively narrows a search using both the image and text associated with the image, as well as other types of information related to the document, such as usage. Disparate types of information such as text, image features and usage are referred to as "modalities." Multi-modal clustering hence is the grouping of objects that have data from several modalities associated with them.
The text surrounding or associated with an image often provides an indication of its context. The method proposed herein permits the use of multi-modal information, such as text and image features, for performing browsing and retrieval (of images, in the exemplary case described herein). This method is applicable more generally to other applications in which the elements (e.g., documents, phrases, or images) of a collection can be described by multiple characteristics, or features.
One difficulty in the use of multiple features in search and browsing is the combination of the information from the different features. This is commonly handled in image retrieval tasks by having weights associated with each feature (usually image features such as color histogram, texture, and shape) that can be set by the user. With each revision of the weights, a new search must be performed. However, in employing a heterogeneous set of multi-modal features, it is often difficult to assign weights to the importance of different features. In systems that employ metadata, the metadata usually has finite, discrete values, and a Boolean system that includes or excludes particular values can be used. Extending the concept to multi-modal features that may not be discrete exacerbates the question of how to combine the features.
Current image retrieval systems (such as QBIC, Virage, and Smith and Chang) commonly display a random selection of images or allow an initial text query as a starting point. In the latter case, a set of images with that associated text is returned. The user selects the image most similar to what they are looking for; a search using the selected image as the query is then performed, and the most similar images are displayed. This process is repeated as the user finds images closer to what is desired. In some systems, the user can directly specify image features such as color distribution and can also specify weights on different features, such as color histograms, texture, and shape. In web pages, text such as URLs may also provide clues to the content of the image. Current image retrieval technology also allows the use of URL, alt tags, and hyperlink text to index images on the web. One approach also attempts to determine for each word surrounding an image caption whether it is likely to be a caption word and then matches caption words to "visual foci" or regions of images (such as the foreground). The Webseek image search engine and MARS-2 allow for relevance feedback on images by marking them as positive or negative exemplars.
In contrast to those image-based retrieval systems, there are text-based search engines that provide the ability to group results or identify more documents that are similar to a specific document. Entire topics or specific words in a topic can be required or excluded. A new search is then performed with the new query, or a narrowing search is performed on the previously returned set of results. The Excite search engine has a "more like this" functionality that performs a search using one particular document as the example for a new search; it refines the query by basing it on the selected document and performing a new search. This approach is unlike the method set forth herein, as it does not allow for searching based on multiple features in multiple modalities.
Decision trees, such as CART or ID3, perform iterative splitting of data. A tree is created by selecting a feature for splitting at each node. As in the present method, a different feature may be selected each time, or a combination of features may be used to define an aggregate similarity measure. The selection of features in creating a decision tree is usually performed automatically from a set of data, based on some criteria such as minimizing classification error or maximizing mutual information.
Accordingly, there is a need for a system that is capable of flexibly handling multi-modal information in a variety of contexts and applications. It is useful to be able to perform queries, while also subsequently refining and adjusting search results by characteristics other than direct text content, namely image characteristics and indirect text characteristics. It is also useful to be able to track individuals' information access habits by way of the characteristics of the documents those users access, thereby enabling a recommendation system in which users are assigned to similar clusters.
This disclosure sets forth a framework for multi-modal browsing and clustering, and describes a system advantageously employing that framework to enhance browsing, searching, retrieving and recommending content in a collection of documents.
Clustering of large data sets is important for exploratory data analysis, visualization, statistical generalization, and recommendation systems. Most clustering algorithms rely on a similarity measure between objects. This proposal sets forth a data representation model and an associated similarity measure for multi-modal data. This approach is relevant to data sets where each object has several disparate types of information associated with it, which are called modalities. Examples of such data sets include the pages of a World Wide Web site (modalities here could be text, inlinks, outlinks, image characteristics, text genre, etc.).
A primary feature of the present invention resides in its novel data representation model. Each modality within each document is described herein by an n-dimensional vector, thereby facilitating quantitative analysis of the relationships among the documents in the collection.
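As a minimal sketch of this data representation, each document below is a mapping from modality name to a numeric vector, and documents are compared one modality at a time. The cosine measure, the modality names, and the vector values are assumptions chosen for illustration; the similarity measure actually used by the invention is described elsewhere in the specification.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def modality_similarities(doc_a, doc_b):
    """Compare two documents separately in every modality they share."""
    return {m: cosine(doc_a[m], doc_b[m]) for m in doc_a if m in doc_b}


# Hypothetical documents: one vector per modality.
doc_a = {"text": [1.0, 0.0, 2.0], "inlinks": [0.0, 1.0]}
doc_b = {"text": [2.0, 0.0, 4.0], "inlinks": [1.0, 0.0]}

sims = modality_similarities(doc_a, doc_b)
print(sims)  # text vectors are parallel; inlink vectors are orthogonal
```

Keeping one vector per modality, rather than concatenating everything into a single vector, lets each modality be clustered or weighted on its own.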
In one application of the invention, a method is described for serially using document features in different spaces (i.e., different modalities) to browse and retrieve information. One embodiment of the method uses image and text features for browsing and retrieval of images, although the method applies generally to any set of distinct features. The method takes advantage of multiple ways in which a user can specify items of interest. For example, in images, features from the text and image modalities can be used to describe the images. The method is similar to the method set forth in U.S. Pat. No. 5,442,778 and in D. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey, "Scatter/Gather: A cluster-based approach to browsing large document collections," Proc. 15th Ann. Int'l SIGIR'92, 1992 ("Scatter/Gather") in that selection of clusters, followed by reclustering of the selected clusters, is performed iteratively. It extends the Scatter/Gather paradigm in at least two respects: each clustering may be performed on a different feature (e.g., surrounding text, image URL, image color histogram, genre of the surrounding text); and a "map" function identifies the most similar clusters with respect to a specified feature. The latter function permits identification of additional similar images that may have been ruled out due to missing feature values for these images. The image clusters are represented by selecting a small number of representative images from each cluster.
In an alternative application of the invention, various document features in different modalities are appropriately weighted and combined to form clusters representative of overall similarity.
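One simple way to picture such a combination is a weighted average of per-modality similarity scores. The sketch below is an assumption made for illustration only: the function name `combined_similarity`, the modality names, and the weights are hypothetical, and the weighting scheme used by the invention may differ.

```python
def combined_similarity(per_modality, weights):
    """Weighted average of per-modality similarities.

    Modalities without a weight are ignored, so a missing feature
    does not drag the overall score toward zero.
    """
    shared = [m for m in per_modality if m in weights]
    total_weight = sum(weights[m] for m in shared)
    if total_weight == 0:
        return 0.0
    return sum(weights[m] * per_modality[m] for m in shared) / total_weight


# Hypothetical similarity scores between two images, one per modality.
sims = {"text": 0.8, "color": 0.2, "url": 0.5}
# Hypothetical weights: surrounding text counts twice as much as the others.
weights = {"text": 2.0, "color": 1.0, "url": 1.0}

score = combined_similarity(sims, weights)
print(round(score, 3))
```

The resulting score, approximately 0.575 here, could then feed any clustering algorithm that needs a single pairwise similarity.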
Various alternative embodiments of the invention also enable clustering users and documents according to one or more features, recommending documents based on user clusters' prior browsing behaviors, and visually representing clusters of either documents or users, graphically and textually.
Initially, a system for representing users and documents in vector space and for performing browsing and retrieval on a collection of web images and associated text on an HTML page is described. Browsing is combined with retrieval to help a user locate interesting portions of the corpus or collection of information, without the need to formulate a query well matched to the corpus. Multi-modal information, in the form of text surrounding an image and some simple image features, is used in this process. Using the system, a user progressively narrows a collection to a small number of elements of interest, similar to the Scatter/Gather system developed for text browsing, except the Scatter/Gather method is extended hereby to use multi-modal features. As stated above, some collection elements may have unknown or undefined values for some features; a method is presented for incorporating these elements into the result-set. This method also provides a way to handle the case when a search is narrowed to a part of the space near a boundary between two clusters. A number of examples are provided.
It is envisioned that, analogous to a database with various metadata fields, the documents in the present collection are characterized by many different features, or (probably non-orthogonal) "dimensions," many of which are derived from the contents of the unstructured documents.
Multi-modal features may take on many forms, such as user information, text genre, or analysis of images. The features used in the present invention can be considered a form of metadata, derived from the data (text and images, for example) and its context, and assigned automatically or semi-automatically, in contrast to current image search systems, in which metadata is typically assigned manually. Table 1 lists several possible features (all of which will be described in greater detail below); it will be recognized that various other features and modalities are also usable in the invention, and that the features of Table 1 are exemplary only.
Methods are presented herein for combining rich "multi-modal" features to help users satisfy their information needs. At one end of the spectrum, this involves ad-hoc retrieval (applied to images), providing simple, rapid access to information pertinent to a user's needs. At the other end, this involves analyzing document collections and their users. The common scenario is the World Wide Web, which consists of the kind of unstructured documents that are typical of many large document collections.
Accordingly, this specification presents methods of information access to a collection of web images and associated text on an HTML page. The method permits the use of multi-modal information, such as text and image features, for performing browsing and retrieval of images and their associated documents or document regions. In the described approaches, text features derived from the text surrounding or associated with an image, which often provide an indication of its content, are used together with image features. The novelty of this approach lies in the way it makes text and image features transparent to users, enabling them to successively narrow down their search to the images of interest. This is particularly useful when a user has difficulty in formulating a query well matched to the corpus, especially when working with an unfamiliar or heterogeneous corpus, such as the web, where the vocabulary used in the corpus or the image descriptors are unknown.
The methods presented herein are premised on an advantageous data representation model in which document (and user) features are embedded into multi-dimensional vector spaces. This data representation model facilitates the use of a consistent and symmetric similarity measure, which will be described in detail below. With the data representation and similarity models set forth herein, it is possible to represent users and clusters of users based on the contents and features of the documents accessed by those users (i.e., collection use data), thereby improving the ability to cluster users according to their similarities.
Furthermore, a recommendation system based on multi-modal user clusters is possible with the collection of multi-modal collection use data as described below. A set of clusters is induced from a training set of users. A user desiring a recommendation is assigned to the nearest cluster, and that cluster's preferred documents are recommended to the user.
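The recommendation step just described can be sketched as follows, assuming the user clusters have already been induced and each carries a centroid and a list of preferred documents. The Euclidean distance, the cluster contents, and the `already_seen` filter are illustrative assumptions rather than details taken from this disclosure.

```python
import math


def nearest_cluster(user_vector, clusters):
    """Return the cluster whose centroid is closest to the user's vector."""
    return min(clusters, key=lambda c: math.dist(user_vector, c["centroid"]))


def recommend(user_vector, clusters, already_seen=()):
    """Recommend the nearest cluster's preferred documents, skipping seen ones."""
    cluster = nearest_cluster(user_vector, clusters)
    return [doc for doc in cluster["preferred"] if doc not in already_seen]


# Hypothetical clusters induced from a training set of users.
clusters = [
    {"name": "imaging", "centroid": [0.9, 0.1],
     "preferred": ["scanner.html", "toner.html"]},
    {"name": "retrieval", "centroid": [0.1, 0.9],
     "preferred": ["clustering.html", "indexing.html"]},
]

# A hypothetical user whose access pattern lies near the "retrieval" cluster.
user = [0.2, 0.7]
suggestions = recommend(user, clusters, already_seen={"indexing.html"})
print(suggestions)
```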
Finally, this disclosure sets forth improved methods of visually representing clusters of documents and clusters of users. While documents are frequently stored hierarchically, enabling a hierarchical visual representation, the same is not usually true for users. Accordingly, the present invention allows for a view of user data by way of a hierarchical view of the documents accessed or likely to be accessed by the appropriate users. Documents and clusters of documents can be visualized similarly, and also textually by way of clusters' "salient dimensions."
Although the use of clustering in image retrieval is not new, it has usually been used for preprocessing, either to aid a human during the database population stage, or to cluster the images offline so that distance searches during queries are performed within clusters. In the present invention, iterative clustering and selection of cluster subsets can help a user identify images of interest. Clustering is used for interactive searching and presentation, and relevance feedback is implicit in the user's choice of clusters. Because the user is dealing with clusters, not individual images, the feedback step is also easier to perform.
The various forms of multi-modal clustering set forth herein can be used for information access: for browsing a collection in order to find a document; for understanding a collection that is new to the user; and for dealing with cases of "nothing found" (in which clustering can help the user reformulate his or her query by formulating it in the vocabulary that is appropriate for the collection).
Accordingly, in an embodiment of the present invention, a method for clustering data objects relies on an efficient method of selecting initial cluster centers. This is performed by "wavefront clustering," which identifies as initial centers points situated between a representative centroid and several randomly chosen data objects.
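A minimal sketch of this initialization follows, under stated assumptions: the representative centroid is taken to be the mean of all data objects, and each initial center is placed a fraction `alpha` of the way from that centroid toward a randomly chosen object. The parameter `alpha`, its default of 0.5, the seeded random generator, and the function names are hypothetical and are not specified in this passage.

```python
import random


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def wavefront_centers(points, k, alpha=0.5, seed=0):
    """Pick k initial centers lying between the global centroid and k random objects.

    alpha = 0 places every center on the centroid; alpha = 1 places each
    center on its randomly chosen data object.
    """
    rng = random.Random(seed)       # seeded so the sketch is reproducible
    c = centroid(points)
    chosen = rng.sample(points, k)  # k distinct data objects
    return [[ci + alpha * (xi - ci) for ci, xi in zip(c, x)] for x in chosen]


# Hypothetical two-dimensional data objects.
points = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0], [2.0, 8.0]]
centers = wavefront_centers(points, k=3)
print(centers)
```

The centers produced this way could then seed an iterative clustering procedure such as k-means.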