For almost as long as computers have existed, their designers and users have sought improvements to the user interface. Especially as computing power has increased, a greater portion of the available processing capacity has been devoted to improved interface design. Recent examples have been Microsoft Windows variants and Internet web browsers. Graphic interfaces provide significant flexibility to present data using various paradigms, and modern examples support use of data objects and applets. Traditional human-computer interfaces have emphasized uniformity and consistency; thus, experienced users had a shortened learning curve for use of software and systems, while novice users often required extensive instruction before profitable use of a system. More recently, intuitive, adaptable and adaptive software interfaces have been proposed, which potentially allow faster adoption of the system by new users but which require continued attention by experienced users due to the possibility of interface transformation.
While many computer applications are used both on personal computers and networked systems, the field of information retrieval and database access for casual users has garnered considerable interest. The Internet presents a vast, relatively unstructured repository for information, leading to a need for Internet search engines and access portals based on Internet navigation. At this time, the Internet is gaining popularity because of its “universal” access, low access and information distribution costs, and suitability for conducting commercial transactions. However, this popularity, in conjunction with the non-standardized methods of presenting data and fantastic growth rate, has made locating desired information and navigation through the vast space difficult. Thus, improvements in human consumer interfaces for relatively unstructured data sets are desirable, wherein subjective improvements and wholesale adoption of new paradigms may both be valuable, including improved methods for searching and navigating the Internet.
Generally speaking, search engines for the World Wide Web (WWW, or simply “Web”) aid users in locating resources among the estimated present one billion addressable sites on the Web. Search engines for the Web generally employ a type of computer software called a “spider” to scan the Web and build a proprietary database that represents a subset of the resources available on the Web. Major known commercial search engines include such names as Yahoo, Excite, and Infoseek. Also known in the field are “metasearch engines,” such as Dogpile and Metasearch, which compile and summarize the results of other search engines without generally themselves controlling an underlying database or using their own spider. All the search engines and metasearch engines, which are servers, operate with the aid of browsers, which are clients, and deliver to the client a dynamically generated web page which includes a list of hyperlinked uniform resource locators (URLs) for directly accessing the referenced documents themselves by the web browser.
A Uniform Resource Identifier (RFC 1630) is the name for the standard generic object in the World Wide Web. Internet space is inhabited by many points of content. A URI (Uniform Resource Identifier) is the way you identify any of those points of content, whether it be a page of text, a video or sound clip, a still or animated image, or a program. The most common form of URI is the Web page address, which is a particular form or subset of URI called a Uniform Resource Locator (URL). A URI typically describes: the mechanism used to access the resource; the specific computer that the resource is housed in; and the specific name of the resource (a file name) on the computer. Another kind of URI is the Uniform Resource Name (URN). A URN is a form of URI that has “institutional persistence,” which means that its exact location may change from time to time, but some agency will be able to find it.
The structure of the World Wide Web includes multiple servers at distinct nodes of the Internet, each of which hosts a web server which transmits a web page in hypertext markup language (HTML) or extensible markup language (XML) (or a similar scheme) using the hypertext transfer protocol (HTTP). Each web page may include embedded hypertext linkages, which direct the client browser to other web pages, which may be hosted within any server on the network. A domain name server translates a domain name into an Internet protocol (IP) address, which identifies the appropriate server. Thus, Internet web resources, which are typically the aforementioned web pages, are typically referenced with a URL, which provides the domain name or IP address of the server, as well as a hierarchical address for defining a resource of the server, e.g., a directory path on a server system.
A hypermedia collection may be represented by a directed graph having nodes that represent resources and arcs that represent embedded links between resources. Typically, a user interface, such as a browser, is utilized to access hyperlinked information resources. The user interface displays information “pages” or segments and provides a mechanism by which that user may follow the embedded hyperlinks. Many user interfaces allow selection of hyperlinked information via a pointing device, such as a mouse. Once selected, the system retrieves the information resource corresponding to the embedded hyperlink. As hyperlinked information networks become more ubiquitous, they continue to grow in complexity and magnitude, often containing hundreds of thousands of hyperlinked resources. Hyperlinked networks may be centralized, i.e. exist within a single computer or application, or distributed, existing over many computers separated by thousands of kilometers. These networks are typically dynamic and evolve over time in two dimensions. First, the information content of some resources may change over time, so that following the same link at different times may lead to a resource with slightly different, or entirely different information. Second, the very structure of the networked information resources may change over time, the typical change being the addition of documents and links. The dynamic nature of these networks has significant ramifications in the design and implementation of modern information retrieval systems.
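The directed-graph view described above can be illustrated with a minimal Python sketch. The resource names, the `links` table, and the `reachable` helper are hypothetical and serve only to show nodes, arcs, and link following; they are not drawn from any particular system.

```python
from collections import deque

# Hypothetical hypermedia collection: each key is a resource (node) and each
# value lists the resources its embedded hyperlinks point to (arcs).
links = {
    "home": ["news", "about"],
    "news": ["story1", "home"],
    "about": [],
    "story1": ["news"],
    "orphan": ["home"],
}

def reachable(graph, start):
    """Return the resources reachable from start by following links (BFS order)."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen

visited = reachable(links, "home")
```

Because the arcs are directed, the resource named "orphan" links to "home" but cannot itself be reached from "home"; a user who only follows embedded hyperlinks would never encounter it.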
One approach to assisting users in locating information of interest within a collection is to add structure to the collection. For example, information is often sorted and classified so that a large portion of the collection need not be searched. However, this type of structure often requires some familiarity with the classification system, to avoid elimination of relevant resources by improperly limiting the search to a particular classification or group of classifications.
Another approach used to locate information of interest to a user is to couple resources through cross-referencing. Conventional cross-referencing of publications using citations provides the user enough information to retrieve a related publication, such as the author, title of publication, date of publication, and the like. However, the retrieval process is often time-consuming and cumbersome. A more convenient, automated method of cross-referencing related documents utilizes hypertext or hyperlinks. Hyperlink systems allow authors or editors to embed links within their resources to other portions of those resources or to related resources in one or more collections that may be locally accessed, or remotely accessed via a network. Users of hypermedia systems can then browse through the resources by following the various links embedded by the authors or editors. These systems greatly simplify the task of locating and retrieving the documents when compared to a traditional citation, since the hyperlink is usually transparent to the user. Once selected, the system utilizes the embedded hyperlink to retrieve the associated resource and present it to the user, typically in a matter of seconds. The retrieved resource may contain additional hyperlinks to other related information that can be retrieved in a similar manner.
It is well known to provide search engines for text records which are distributed over a number of record sets. For example, the Internet presently exists as literally millions of web servers and tens of millions or more of distinct web page uniform resource locators (URLs). A growing trend is to provide web servers as appliances or control devices, and thus without “content” of general interest. On the other hand, the traditional hypertext transfer protocol (HTTP) servers, or “web servers”, include text records of interest to a variety of potential users. Also, by tradition, the web pages, and particularly those with human readable text, are indexed by Internet search engines, thereby making this vast library available to the public.
Recently, the number and variety of Internet web pages have continued to grow at a high rate, resulting in a potentially large number of records that meet any reasonably broad or important search criteria. Likewise, even with this large number of records available, it can be difficult to locate certain types of information for which records do in fact exist, because of the limits of natural language parsers or Boolean text searches, as well as the generic ranking algorithms employed.
The proliferation of resources on the Web presents a major challenge to search engines, all of which employ proprietary tools to sift the enormous document load to find materials and sites relevant to a user's needs. Generally speaking, the procedure followed in making a search is as follows. The user enters a string of words onto a character-based “edit line” and then strikes the “enter” key on the keyboard or selects a search button using a pointing device. The string of words may be fashioned by the user into a Boolean logical sentence, employing the words “AND,” “OR,” and “NOT,” but more typically the user enters a set of words in so-called “natural language” that lack logical connectors, and software called a “parser” takes the user's natural language query and estimates which logical connections exist among the words. Such parsers have improved markedly in recent years through employment of techniques of artificial intelligence and semantic analysis. Having parsed the phrase, the search engine then searches its database, derived from a spider that has previously scanned the Web, for materials relevant to the query. This process entails a latency period while the user waits for the search engine to return results. The search engine then returns, it is hoped, references to relevant web pages or documents, identified by their URLs or a hypertext linkage to title information as a set of hits, to the user, often parceled out at the rate of ten per request. If further hits are desired, there is a wait while a request for further hits is processed, and this typically entails another, fresh search and another latency period, wherein the search engine is instructed to return ten hits starting at the next, previously undisplayed, record. Often, each returned hypertext markup language (HTML) page is accompanied by advertising information, which subsidizes the cost of the search engine and search process.
This advertising information is often called a “banner ad”, and may be targeted to the particular user based on an identification of the user by a login procedure, an Internet cookie, or based on a prior search strategy. Other times, the banner ads are static or simply cycle between a few options.
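The search procedure described above, in which a Boolean query is evaluated against a previously built database and the hits are parceled out a page at a time, might be sketched as follows in Python. The page texts, the example URLs, and the `build_index`, `search`, and `page_of` helpers are all hypothetical, and the sketch omits the natural language parser entirely.

```python
# Hypothetical three-page collection; a real engine's database is built by its spider.
pages = {
    "http://example.com/a": "baseball bat and glove",
    "http://example.com/b": "the bat is a flying mammal",
    "http://example.com/c": "cricket bat reviews",
}

def build_index(docs):
    """Map each word to the set of URLs whose text contains it."""
    index = {}
    for url, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, all_urls, must=(), any_of=(), must_not=()):
    """Evaluate (AND of must) AND (OR of any_of) AND NOT (any of must_not)."""
    hits = set(all_urls)
    for word in must:
        hits &= index.get(word, set())
    if any_of:
        hits &= set().union(*(index.get(w, set()) for w in any_of))
    for word in must_not:
        hits -= index.get(word, set())
    return sorted(hits)

def page_of(hits, start, size=10):
    """Return one page of hits beginning at the given record offset."""
    return hits[start:start + size]

index = build_index(pages)
results = search(index, pages, must=["bat"], must_not=["mammal"])
first_page = page_of(results, 0, size=1)
next_page = page_of(results, 1, size=1)
```

In this sketch the full hit list is kept in memory, so the second page is a simple slice; the engines described above instead typically repeat the search for each further page, which is the source of the additional latency.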
A well-recognized problem with existing search engines is the tendency to return hits for a query that are so incredibly numerous, sometimes in the hundreds, thousands, or even millions, that it is impractical for the user to wade through them and find relevant results. Many users, probably the majority, would say that the existing technology returns far too much “garbage” in relation to pertinent results. This has led to the desire among many users for an improved search engine, and in particular an improved Internet search engine.
In response to the garbage problem, search engines have sought to develop unique proprietary approaches to gauging the relevance of results in relation to a user's query. Such technologies employ algorithms for limiting the records returned in the selection process (the search) and/or for sorting selected results from the database according to a rank or weighting, which may be predetermined or computed on the fly. The known techniques include counting the frequency or proximity of keywords, measuring the frequency of user visits to a site or the persistence of users on that site, using human librarians to estimate the value of a site and to quantify or rank it, measuring the extent to which the site is linked to other sites through ties called “hyperlinks” (see, Google.com and Clever.com), measuring how much economic investment is going into a site (Thunderstone.com), taking polls of users, or even ranking relevance in certain cases according to advertisers' willingness to bid the highest price for good position within ranked lists. As a result of relevance testing procedures, many search engines return hits in presumed rank order of relevance, and some place a percentage next to each hit which is said to represent the probability that the hit is relevant to the query, with the hits arranged in descending percentage order.
However, despite the apparent sophistication of many of the relevance testing techniques employed, the results typically fall short of the promise. Thus, there remains a need for a search engine for uncontrolled databases that provides to the user results which accurately correspond to the desired information sought.
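One of the simpler techniques listed above, counting keyword frequency and presenting hits in descending percentage order, can be sketched as follows in Python. The scoring rule (keyword occurrences divided by document length, scaled so that the best hit shows 100) and the sample documents are assumptions made for illustration only; the percentage here is a normalized score, not a calibrated probability of relevance.

```python
def rank_by_keyword_frequency(docs, query):
    """Rank documents by the share of their words that match the query terms."""
    terms = query.lower().split()
    scores = {}
    for url, text in docs.items():
        words = text.lower().split()
        count = sum(words.count(t) for t in terms)
        if count:
            scores[url] = count / len(words)
    if not scores:
        return []
    best = max(scores.values())
    # Sort by descending score, breaking ties by URL for a stable order.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(url, round(100 * s / best)) for url, s in ordered]

# Hypothetical documents keyed by placeholder identifiers.
docs = {
    "u1": "bat bat ball",
    "u2": "bat cave",
    "u3": "ball game",
}
ranked = rank_by_keyword_frequency(docs, "bat")
```

Documents containing none of the query terms are dropped rather than listed at zero percent, which corresponds to limiting the records returned as well as sorting them.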
Advertisers are generally willing to pay more to deliver an impression (e.g., a banner ad or other type of advertisement) to users who are especially sensitive to advertisements for their products or are seeking to purchase products corresponding to those sold by the advertisers, and the economic model often provides greater compensation in the event of a “click through”, which is a positive action taken by the user to interact with the ad to receive further information.
This principle, of course, actually operates correspondingly in traditional media. For example, a bicycle manufacturer is generally willing to pay more per subscriber to place advertisements in a magazine having content directed to bicycle buffs than in a general interest publication. However, this principle has not operated very extensively in the search engine marketplace, partly because there is little differentiation among the known characteristics of the users of particular search engines, and because, even after a search inquiry is submitted, there may be little basis on which to judge what the user's intention or interest really is, owing to the generality or ambiguity of the user's request, so that even after a search request is processed, it may be impossible to estimate the salient economic, demographic, purchasing or interest characteristics of the user in the context of a particular search. In fact, some “cookie” based mechanisms provide long-term persistence of presumed characteristics even when these might be determined to be clearly erroneous. Thus, the existing techniques tend to exaggerate short term, ignorance based or antithetical interests of the user, since these represent the available data set. For example, if a child seeks to research the evils of cigar smoking for a school class project, a search engine might classify the user as a person interested in cigar smoking and cigar paraphernalia, which is clearly not the case. Further, the demographics of a cigar aficionado might tempt an advertiser of distilled liquors to solicit this person as a potential client. The presumed interest in cigars and liquor might then result in adult-oriented materials being presented. Clearly, the simple presumptions that are behind this parade of horribles may often result in erroneous conclusions.
Another inherent problem with the present technology of search engines is that the user, to make a request for information, must use words from natural language, and such words are inherently ambiguous. For example, suppose the user enters the word “bat” as a search query to a search engine to search the database generated by its associated spider, and produce a set of ranked results according to the relevance algorithms. The word bat, however, has several possible meanings. The user could mean a “baseball bat”, or the mammalian bat, or maybe even a third or fourth meaning. Because the technology of existing search engines cannot generally distinguish various users' intentions, typically such engines will return results for all possible meanings, resulting in many irrelevant or even ludicrous or offensive results.
Yet another problem with existing search engine technologies relates to the problem of organizing results of a search for future use. Internet browsers, which are presently complex software applications that remain operative during the course of a search, can be used to store a particular URL for future use, but the lists of URLs created in this way tend to become very long and are difficult to organize. Therefore, if a user cannot remember a pertinent URL (many of which are long or obscure), the user may be forced to go search again for resources that user might wish were ready at hand for another use. On the other hand, in some instances, it may be more efficient to conduct a new search rather than recalling a saved search.
Although a few search engines for the mass market exist that charge a fee for use, this model has not been popular or successful. Instead, most search engines offer free access, subject to user tolerating background advertising or pitches for electronic commerce sales or paid links to sites that offer goods and services, including the aforementioned banner ads. These advertisements are typically paid for by sponsors on a per impression basis (each time a user opens the page on which the banner ad appears) or on a “click-through basis” (normally a higher charge, because user has decided to select the ad and “open it up” by activating an underlying hyper-link). In addition, most search engines seek “partners” with whom they mutually share hyperlinks to each other's sites. Finally, the search engines may seek to offer shopping services or merchandise opportunities, and the engines may offer these either globally to all users, or on a context sensitive basis responsive to a user's particular search.
Therefore, the art requires improved searching strategies and tools to provide increased efficiency in locating a user's desired content, while preventing dilution of the best records with those that are redundant, off-topic or irrelevant, or directed to a different audience.
The art also requires an improved user interface for accessing advanced search functionality from massive database engines.
Definition of Search Domain
Multiple database search systems are well known. For example, Dialog Information Services (now known as Knight-Ridder Information, Inc.), provides several hundred databases (also known as “collections”) available to searchers. In this case, each collection is a separate accounting unit. Some of these databases contain bibliographic abstracts, while others contain full-text documents. In use, a user is able to define a search query, which can be executed against a single or a plurality of databases. While tools are available to assist the user in defining the database(s) against which to search, fundamentally, the user manually selects individual databases which are of interest, for example based on his past experience, or manually selects a group of databases, selected by the information provider and related to a particular topic. When a query is applied to the group of databases, the information service retrieves the number of hits in each database, and often collates them to avoid duplication and to rank them according to a single criterion. The user then accesses databases of interest to view individual records.
As vast public networks, such as the Internet, become available, new opportunities in searching have become available, not only to searching professionals, but to lay users. New types of information providers are arising who use public, as well as private, databases to provide bibliographic research data and documents to users. When a user has an interest in a topic, he may not know what resources can be assembled for a search, nor the location of the resources. Since the resources frequently change, a user will have less interest in the source of the reply compared to the relevance of the reply. It is well known that distributed collections can be treated as a single collection. Typically, each sub-collection is searched individually, and the reports of all components are combined in a single list. The single list can then be ranked by search engines in accordance with an algorithm and given a weight, taking into account the nature of a particular collection, the determined relevance to the search query, and searcher-entered parameters. Methods are also available for normalizing document scores to obtain scores that would be obtained if individual document collections were merged into a single, unified collection.
One existing problem in the prior art is that the scores for each document are not absolute, but dependent on the statistics of each collection and on the algorithms associated with the search engines. A second existing problem is that the standard prior art procedure requires two passes. In a first pass, statistics are collected from each search engine in order to compute the weight for each query term. In a second pass, the information from the first pass is passed back to the search engine for each collection, which then assigns a particular weight or score to each hit or identified document. A third problem that exists is that the prior art requires that all collections use the same type of search engine, or at least a bridge component to translate into a common format.
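The two-pass procedure described above might be sketched as follows in Python. The `Collection` class, the logarithmic term weight, and the sample documents are assumptions chosen for illustration; the prior art systems referred to may compute their statistics and weights quite differently.

```python
import math

class Collection:
    """Hypothetical stand-in for one search engine and its document collection."""

    def __init__(self, docs):
        self.docs = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    def stats(self, terms):
        # Pass 1: report the collection size and per-term document frequency.
        df = {t: sum(1 for words in self.docs.values() if t in words) for t in terms}
        return len(self.docs), df

    def score(self, weights):
        # Pass 2: score local documents using the globally computed term weights.
        out = {}
        for doc_id, words in self.docs.items():
            s = sum(words.count(t) * wt for t, wt in weights.items())
            if s > 0:
                out[doc_id] = s
        return out

def two_pass_search(collections, query):
    terms = query.lower().split()
    total_docs = 0
    total_df = dict.fromkeys(terms, 0)
    for c in collections:                      # first pass over every collection
        n, df = c.stats(terms)
        total_docs += n
        for t in terms:
            total_df[t] += df[t]
    # Assumed weighting: rarer terms across all collections weigh more.
    weights = {t: math.log(1 + total_docs / total_df[t]) for t in terms if total_df[t]}
    merged = {}
    for c in collections:                      # second pass over every collection
        merged.update(c.score(weights))
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))

a = Collection({"a1": "solar power", "a2": "wind power"})
b = Collection({"b1": "solar solar panels"})
ranked = two_pass_search([a, b], "solar")
```

Each collection is contacted twice per query, and every collection must understand the same weight format, which corresponds to the second and third problems noted above.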
U.S. Pat. No. 5,659,732, expressly incorporated herein by reference, proposes a method for searching multiple collections on a single pass, with ranking of documents on a consistent basis so that if the same document appears in two different databases, it would be scored the same when the results are merged. In this system, it is not required that the same search engine be used for all collections. Each participating search engine server returns statistics about each query term in each of the documents returned. A final relevance score is then computed at the client end, rather than in the respective server. In this manner, all relevance scores are processed at the client in the same manner regardless of differences in the search engines.
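The single-pass arrangement summarized above, in which servers return statistics and the client computes the final relevance score, might look roughly like the following Python sketch. The two server functions, the shape of the returned statistics, and the scoring formula are hypothetical; they are not taken from U.S. Pat. No. 5,659,732, and only illustrate why client-side scoring yields the same score for the same document regardless of which server reported it.

```python
def server_a(query_terms):
    # Hypothetical server: returns raw term counts and document length per hit.
    return [
        {"doc": "report-17", "length": 100, "tf": {"solar": 4}},
        {"doc": "memo-3", "length": 50, "tf": {"solar": 1}},
    ]

def server_b(query_terms):
    # A second hypothetical server that happens to hold the same report.
    return [{"doc": "report-17", "length": 100, "tf": {"solar": 4}}]

def client_score(hit, query_terms):
    """Assumed scoring rule applied at the client: matched terms per word."""
    return sum(hit["tf"].get(t, 0) for t in query_terms) / hit["length"]

def merged_results(servers, query_terms):
    """Query each server once, score every hit locally, and merge duplicates."""
    best = {}
    for server in servers:
        for hit in server(query_terms):
            s = client_score(hit, query_terms)
            best[hit["doc"]] = max(s, best.get(hit["doc"], 0.0))
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))

results = merged_results([server_a, server_b], ["solar"])
```

Since the servers only report statistics, they need not share a search engine implementation; the single scoring function at the client is what keeps the merged ranking consistent.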
U.S. Pat. No. 5,634,051, expressly incorporated herein by reference, proposes an information storage, searching and retrieval system for a large domain of archived data of various types, in which the results of a search are organized into discrete types of documents and groups of document types so that users may easily identify relevant information. The system includes means for storing a large domain of data contained in multiple source records, at least some of the source records comprising individual documents of multiple document types; means for searching substantially all of the domain with a single search query to identify documents responsive to the query; and means for categorizing documents responsive to the query based on document type, including means for generating a summary of the number of documents responsive to the query which fall within various predetermined categories of document types. The means for categorizing documents and generating the summary preferably includes a plurality of predetermined sets of categories of document types, and further includes means for automatically customizing the summary by automatically selecting one of the sets of categories, based on the identity of the user or a characteristic of the user (such as the user's professional position, technical discipline, industry identity, etc.), for use in preparing the summary. In this way, the summary for an individual user is automatically customized to a format that is more easily and efficiently utilized and assimilated. Alternately, the set of categories selected may be set up to allow the user to select a desired set of categories for use in summarizing the search results.
According to U.S. Pat. No. 5,634,051, expressly incorporated herein by reference, a process of storing, searching and retrieving information for use with a large domain of archived data of various types involves storing in electronically retrievable form a large domain of data contained in documents obtained from multiple source records, at least some of the source records containing documents of multiple types; generating an electronically executable search query; electronically searching at least a substantial portion of such data based on the query to identify documents responsive to the query; and organizing documents responsive to the query and presenting a summary of the number of documents responsive to the query by type of document independently of the source record from which such documents were obtained. According to a preferred embodiment thereof, the method also involves defining one or more sets of categories of document types, each category corresponding to one or more document types, selecting one of the sets of categories for use in presenting a summary of the results of the search, and then sorting documents responsive to the query by document type utilizing the selected set of categories, facilitating the presentation of a summary of the number of documents responsive to the query which fall within each category in the selected set of categories. The selection of the set of categories to be utilized may be performed automatically based on predetermined criteria relating to the identity of or a personal characteristic of the user (such as the user's professional background, etc.), or the user may be allowed to select the set of categories to be used. 
The query generation process may contain a knowledge base including a thesaurus that has predetermined and embedded complex search queries, or use natural language processing, or fuzzy logic, or tree structures, or hierarchical relationship or a set of commands that allow persons seeking information to formulate their queries. The search process can utilize any available index and search engine techniques including Boolean, vector, and probabilistic, as long as a substantial portion of the entire domain of archived textual data is searched for each query and all documents found are returned to the organizing process. The sorting/categorization process prepares the search results for presentation by assembling the various document types retrieved by the search engine and then arranging these basic document types into sometimes-broader categories that are readily understood by and relevant to the user. The search results are then presented to the user and arranged by category along with an indication as to the number of relevant documents found in each category. The user may then examine search results in multiple formats, allowing the user to view as much of the document as the user deems necessary. According to the present invention, the self-expressed limits of this patent may be relaxed, allowing use in conjunction with other techniques to achieve a useful result.
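The sorting/categorization step described above, in which document types are grouped into broader categories selected according to a characteristic of the user and a count is presented per category, can be sketched in Python as follows. The category sets, profession names, and document types are invented for illustration and do not come from U.S. Pat. No. 5,634,051.

```python
from collections import Counter

# Hypothetical category sets: each maps a raw document type to a broader
# category suited to one kind of user.
CATEGORY_SETS = {
    "engineer": {
        "patent": "Technical",
        "journal article": "Technical",
        "press release": "Business",
        "annual report": "Business",
    },
    "analyst": {
        "patent": "Intellectual property",
        "journal article": "Research",
        "press release": "News",
        "annual report": "Financials",
    },
}

def summarize(hits, profession):
    """Count responsive documents per category for the user's category set."""
    mapping = CATEGORY_SETS[profession]
    return Counter(mapping.get(hit["type"], "Other") for hit in hits)

# Hypothetical search results, identified only by title and document type.
hits = [
    {"title": "Turbine blade coating", "type": "patent"},
    {"title": "Blade cooling method", "type": "patent"},
    {"title": "New plant announced", "type": "press release"},
    {"title": "Internal note", "type": "memo"},
]
engineer_summary = summarize(hits, "engineer")
analyst_summary = summarize(hits, "analyst")
```

The same four hits yield different summaries for the two user types, which is the customization effect described above; document types absent from a category set fall into a catch-all "Other" category in this sketch.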
Information retrieval systems are designed to store and retrieve information provided by publishers covering different subjects. Both static information, such as works of literature and reference books, and dynamic information, such as newspapers and periodicals, are stored in these systems. Information retrieval engines are provided within prior art information retrieval systems in order to receive search queries from users and perform searches through the stored information. Most information retrieval systems seek to provide the user with all stored information relevant to the query. However, many existing searching/retrieval systems are not adapted to identify the best or most relevant information yielded by the query search. Such systems typically return query results to the user in such a way that the user must retrieve and examine every document returned by the query in order to determine which documents are most relevant. It is therefore desirable to have a document searching system which not only returns a list of relevant information to the user based on a search query, but also returns the list to the user in such a form that the user can readily identify which information returned from the search is most relevant to the query topic. The system may also provide a ranking or sorting algorithm over which the user may exert control, to assist in defining relevancy.
Existing systems for searching and retrieving files from databases, based on user queries, are directed primarily to the searching and retrieval of textual documents. However, there is a growing volume of multi-media information being published that is not primarily textual. Such multi-media information corresponds, for example, to still images, motion video sequences and digital audio sequences, which may be stored and retrieved by digital computers. It would be desirable from the point of view of an individual using an information searching/retrieval system to be able to query a library or database and identify not only text documents, but also multi-media files that are responsive and relevant to the user's query. Moreover, it would be desirable if the searching system could not only return to the user a single list identifying both text and multi-media information responsive to the query search, but also enable the user to readily identify which of the text and multi-media files were most relevant to the query topic.
It is well known in the prior art of information retrieval systems to permit a user to specify a selected subject within a larger group of subjects for searching. For example, a user may wish to search only sports literature, medical literature or art literature. This avoids unnecessary searching through database documents that are not relevant to the user's subject of interest. In order to provide this capability, information retrieval systems must generally categorize documents received from publishers (or drawn from accessible databases) according to their subject, prior to adding them to the database. If this analysis were instead performed after receiving a search query, the query response would be slowed and the same analysis potentially performed many times. However, present techniques for topically analyzing incoming documents often require a human individual to read each incoming document and make a determination regarding its subject. This process is very time consuming and expensive, as there is often a large number of incoming documents to be processed. The subject classification process may be further complicated if certain documents should properly be categorized in more than one subject. Automated systems for categorizing documents have been developed, for example based on semantic structures; however, these may be of variable quality or make erroneous conclusions.
Many publishers that provide documents to proprietary information retrieval systems require record keeping in order to ensure accurate royalty payments. Record keeping permits the publishers to determine the interest level in various documents produced by the publisher, and potentially the demographics of users retrieving such documents. Thus, it would be desirable to have a searching/retrieval system that tracked not only how often each document stored in the system database was retrieved by users, but also the demographics or respective user profile of the users retrieving the documents and the query searches used to identify and retrieve such documents.
U.S. Pat. Nos. 5,640,553, 5,717,914, 5,737,734, and 5,742,816, expressly incorporated herein by reference, are directed to a method and apparatus for identifying textual documents and multi-media files corresponding to a search topic. A plurality of document records, each of which is representative of at least one textual document, are stored, and a plurality of multi-media records, each of which is representative of at least one multi-media file, are also stored. The document records have associated text information fields from one of the textual documents. The multi-media records have multi-media information fields for representing only digital images (i.e., still images or motion video image sequences), digital audio or graphics information, and associated text fields associated with the multi-media information fields. A single search query corresponding to the search topic is received, preferably in a natural language format, and an index database is searched in accordance therewith to simultaneously identify document and multi-media records. The index database has a plurality of search terms corresponding to terms represented by the text information fields and the associated text fields, as well as a table for associating each of the document and multi-media records with one or more of the search terms. A search result list having entries representative of both textual documents and multi-media files related to the single search query is generated, with links to the underlying data files.
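The single-query, shared-index arrangement described above can be illustrated with a minimal Python sketch. It is not the implementation of the cited patents: the names `Record`, `build_index` and `search`, the whitespace tokenization, and the ranking by number of matched terms are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    kind: str   # "document" or "multimedia"
    text: str   # text field (document) or associated text field (multi-media)
    link: str   # location of the underlying data file


def build_index(records):
    """Map each lower-cased search term to the ids of the records containing it."""
    index = defaultdict(set)
    for rec in records:
        for term in rec.text.lower().split():
            index[term].add(rec.record_id)
    return index


def search(query, index, records):
    """Return records of either kind matching any query term, best match first."""
    by_id = {rec.record_id: rec for rec in records}
    hits = defaultdict(int)
    for term in query.lower().split():
        for record_id in index.get(term, ()):
            hits[record_id] += 1
    # Assumed ranking: more matched terms first, ties broken by record id.
    ranked = sorted(hits, key=lambda rid: (-hits[rid], rid))
    return [by_id[rid] for rid in ranked]
```

Because both record types share one term table, a single call to `search` returns a mixed result list, each entry carrying the link to its underlying file.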
The Collection Selection Problem
In order to maximize the desirability for users to access a particular private document collection, and preferably related sets of private collections, a collection access provider may acquire licensed rights to make available a wide variety of individual collections of content related documents as discrete databases that can be manually selected for search by a user. Typically, searches and retrievals of information from the discrete databases are subject to specific access fees determined based on the relative commercial worth of the information maintained in the individual databases. Consequently, access fees are typically calculated on the number of documents that are variously searched, reviewed, and/or retrieved in preparation of a search report from a particular database. A known problem in providing access to multiple databases is the relative difficulty or inefficiency in identifying an optimal database or set of databases that should be searched to obtain the best search report for some particular unstructured, or ad hoc, database query. In order to support even the possibility of ad hoc queries, the database search must be conducted on a full text or content established basis.
Existing full text search engines typically allow a user to search many databases simultaneously. For example, commercial private collection access providers, such as Dialog, allow a user to search some 500 or more different databases either individually or in manually selected sets. Consequently, the selection of a most appropriate set of databases to search may place a substantial burden on the user for each query. The user must manually determine and select a particular set of databases that must, by definition, contain the desired results to a query. Such a database set selection is difficult since the selection is made preemptively and independent of the query. This burden may be even more of an issue where access fees are charged for conducting a search against a database even where no search responsive documents are found or examined. In the aggregate, this problem is typically referred to as the “collection selection problem.” The collection selection problem is complicated further when the opportunity and desire exist to search any combination of public and private document collections. The Internet effectively provides the opportunity to access many, quite disparately located and maintained, databases. The importance of solving the collection selection problem thus derives from the user's desire to ensure that, for a given ad hoc query, the best and most comprehensive set of possible documents will be returned for examination and potential use at minimum cost. The collection selection problem is formidable even when dealing with a single collection provider. Dialog, an exemplary collection access provider, alone provides access to over 500 separate databases, many with indistinct summary statements of scope and overlapping coverage of topics. With over 50,000 major databases estimated presently available on the Internet, the collection selection problem is therefore impractical for a user to solve reliably and efficiently.
Some approaches to providing automated or at least semi-automated solutions to the collection selection problem have been developed. Known techniques, such as WAIS (wide area information server), utilize a “server of servers” approach. A “master” database is created to contain documents that describe the contents of other “client” databases, as may be potentially available on the Internet. A user first selects and searches the master database to identify a set of client databases that can then be searched for the best results for a given query. In many instances, a master WAIS database is constructed and updated manually. The master database can also be generated at least semi-automatically through the use of automatons (similar to spiders, but which must probe database servers, rather than available, typically non-dynamically generated web pages) that collect information freely from the Internet. Such automatons, however, are often imperfect, if not simply incorrect, in their assessments of client databases. Even at best, certain client databases, including typically private and proprietary document collections, may block access by the automatons and are thus completely unrepresented in the master database. Even where database access can be obtained and document summaries automatically generated, the scaling of the master database becomes problematic if only due to the incomplete, summary, and mischaracterized nature of the document summary entries in the master database. Manual intervention to prepare and improve automaton generated document summaries may enhance the usefulness of the master database, but at great cost. When any manual intervention is required, however, the scaling of the master database comes at least at the expense of the useful content of the master database document summary entries.
With greatly increased scale, often only abbreviated document titles or small fractions of the client database documents can be collected as summaries into the master database. As scale increases, succinct manually generated summaries of client database documents become increasingly desired, if not required, to provide any adequate content for the master database document entries. Unfortunately, even at only a modest scale, a master database of manually generated or modified document summaries becomes an impracticable construct to build or maintain.
Perhaps one of the most advanced scalable approaches to constructing and using a meaningful master database is a system known as GLOSS (Glossary-of-Servers Server). An automaton is typically used to prepare a master database document for each client database that is to be included within GLOSS. Each master database document effectively stores the frequency of whatever potential query terms occur within the corresponding client collection of documents. The master database documents are then stored as the master records that collectively form the master database. In response to a user query, GLOSS operates against the master database documents to estimate the number of relevant client collection documents that exist in the respective client collections. These relevant document estimates are determined from a calculation based on the combined query term frequencies within each of the master database documents. GLOSS then assumes that client databases ranked as having the greatest number of combined query term occurrences are the most relevant databases to then search. Utilizing a relevance system based on term frequency inherently constrains the type and effectiveness of queries that can be meaningfully directed against the master database. In addition, the estimator used by GLOSS is by definition nonspecific to any client document. The GLOSS system is therefore highly subject to failures to identify client databases that may contain only a relatively few instances of the query terms, yet may contain relevant documents.
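A term-frequency estimator of the kind described above can be sketched in Python as follows. The passage does not give the exact calculation, so the formula used here (collection size multiplied by the fraction of documents containing each query term, under an assumption that terms occur independently) is an illustrative choice, and the names `estimate_matches` and `rank_collections` are hypothetical.

```python
def estimate_matches(summary, query_terms):
    """Estimate how many documents in a collection contain all query terms.

    `summary` holds the collection size and per-term document frequencies.
    Assumes terms occur independently of one another (an illustrative choice).
    """
    n = summary["num_docs"]
    if n == 0:
        return 0.0
    estimate = float(n)
    for term in query_terms:
        estimate *= summary["doc_freq"].get(term, 0) / n
    return estimate


def rank_collections(summaries, query_terms):
    """Order collection names by estimated number of matching documents."""
    scored = {name: estimate_matches(s, query_terms) for name, s in summaries.items()}
    return sorted(scored, key=lambda name: (-scored[name], name))
```

Under this sketch a small, specialized collection ranks below a large general one whenever its absolute term counts are lower, even if a far larger share of its documents is relevant, which is the failure mode noted above.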
Other approaches to establishing a quantitative basis for selecting client database sets include the use of comprehensive indexing strategies, ranking systems based on training queries, expert systems using rule-based deduction methodologies, and inference networks. These approaches are used to examine knowledge base descriptions of client document collections. Indexing and ranking systems both operate typically against the client databases directly to, in effect, create categorizations of the client databases against search term occurrences. All possible query terms are indexed in the case of comprehensive indexing, while a limited set of predefined or static query terms are used in the case of simple ranking. Indexing thus generates a master database of selectable completeness that is nonetheless useable for selecting a most likely relevant set of client databases for a particular query. Ranking also generates a master database, though based on the results of a limited set of broad test queries intended to collectively categorize subsets of the available client databases. In effect, categorization by fixed query term results in generally orthogonal lists of ranked client database sets. Expert system approaches typically operate on client database scope and content descriptions to deduce or establish a basis for subsequently deducing a most likely set of databases that will likely contain the most relevant documents for a particular query. Finally, inference networks utilize a term-frequency based probabilistic approach to estimating the relevance of a particular client database as against other client databases. The known implementations of inference networks are unable to accurately rank the potential relevance of client databases of diverse size and differences in the generation of summaries for each of the client databases considered. 
Thus, the known approaches to solving the client database collection selection problem are generally viewed as inefficient in the assembly, construction, and maintenance of a master document database. These known systems are also viewed as often ineffective in identifying the likely most relevant documents within entire sets of collections because real world collections are often highly variable in size, scope, and content or cannot be uniformly characterized by existing quantitative approaches.
Another and perhaps practically most significant limitation of these known systems is that each must be self-contained in order to operate. This is a direct result of each system utilizing a proprietary algorithm, whether implemented as a manual operation or through the operation of an automaton, to universally assemble the information necessary to create or populate the master database documents from the raw collection documents. As such, these known systems cannot depend on one-another or on any other indexing systems; each must be responsible for both the total generation and subsequent exclusive utilization of their master database summary record documents. Consequently, there remains a need for an enhanced system of handling the collection selection problem in view of the ever-increasing number and scale of collections available on the Internet and the increasing variety of the collections, both in terms of existing organization and informational content.
U.S. Pat. Nos. 5,640,553, 5,717,914, 5,737,734, and 5,742,816, expressly incorporated herein by reference, are directed to a computer-implemented method and apparatus for composing a composite document on a selected topic from a plurality of information sources by searching the plurality of information sources and identifying, displaying and copying files corresponding to the selected topic. A plurality of records, each of which is representative of at least one information file, are stored in a database. A single search query corresponding to the search topic is received. The database is searched in accordance with the single search query to identify records related to the single search query. A search result list is then generated having entries representative of information files identified during the database search, and the search result list is displayed in a first display window open on a user display. Inputs representative of at least first and second selected entries from the search result list are received from the user, the first and second selected entries respectively corresponding to first and second information files. A second display window for displaying at least a portion of the first information file is opened on the user display, a third display window for displaying at least a portion of the second information file is opened on the user display, and a document composition window for receiving portions of the first and second information files is opened on the user display. The composite document is then composed by copying portions of the first and second information files from the second and third display windows, respectively, to the document composition window. The system also supports user accounting for system use.
U.S. Pat. No. 5,845,278, expressly incorporated herein by reference, provides a method of selecting a subset of a plurality of document collections for searching, in response to a predetermined query, based on accessing a meta-information data file that correlates the query significant search terms present in a particular document collection with normalized document usage frequencies of such terms within the documents of each document collection and a normalized document collection frequency of documents that include the search significant terms within the set of document collections. By access to the meta-information data file, a relevance score for each of the document collections is determined. The method then returns an identification of the subset of the plurality of document collections having the highest relevance scores for use in evaluating the predetermined query. The meta-information data file may be constructed to include document normalized term frequencies and other contextual information that can be evaluated in the application of a query against a particular document collection. This contextual information may include term proximity, capitalization, and phraseology as well as document specific information such as, but not limited to collection name, document type, authors, date of publication, publisher, keywords, summary description of contents, price, language, country of publication, number of documents included in collection, and publication name. An advantage of this type of system is that the method provides for both automated and manual description to be used in selecting collections that contain the most likely relevant documents in relation to an ad hoc query.
U.S. Pat. No. 5,845,278 thus relates to a method of selecting a subset of a plurality of document collections for searching in response to a predetermined query, based on accessing a meta-information data file that describes the query significant search terms that are present in a particular document collection correlated to normalized document usage frequencies of such terms within the documents of each document collection. By access to the meta-information data file, a relevance score for each of the document collections is determined, and an identification of the subset of the plurality of document collections having the highest relevance scores is returned for use in evaluating the predetermined query. The meta-information data file may be constructed to include document normalized term frequencies and other contextual information that can be evaluated in the application of a query against a particular document collection. This other contextual information may include term proximity, capitalization, and phraseology as well as document specific information such as, but not limited to, collection name, document type, document title, authors, date of publication, publisher, keywords, summary description of contents, price, language, country of publication, and publication name. Statistical data for the collection may include information such as the number of documents in the collection, the total size of the collection, the average document size and average number of words in the base document collection.
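By way of illustration only, collection selection from a meta-information table might be sketched in Python as below. The weighting is a TF-IDF-style product chosen for this example and is not the formula of U.S. Pat. No. 5,845,278; the function name `score_collections`, the table layout, and the `top_k` parameter are assumptions.

```python
import math


def score_collections(meta, query_terms, top_k=2):
    """Rank collections for a query using a meta-information table.

    `meta` maps collection name -> {"num_docs": int, "doc_freq": {term: int}}.
    """
    num_collections = len(meta)
    scores = {}
    for name, info in meta.items():
        score = 0.0
        for term in query_terms:
            # Normalized usage: fraction of this collection's documents using the term.
            docs = info["num_docs"]
            usage = info["doc_freq"].get(term, 0) / docs if docs else 0.0
            # Number of collections in which the term appears at all.
            containing = sum(1 for m in meta.values() if m["doc_freq"].get(term, 0) > 0)
            if containing == 0:
                continue
            # Assumed weighting: terms found in fewer collections count for more.
            rarity = math.log(1 + num_collections / containing)
            score += usage * rarity
        scores[name] = score
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return ranked[:top_k]
```

Normalizing by collection size means a small collection dense in the query terms can outrank a large one that merely contains more raw occurrences.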
U.S. Pat. No. 5,878,423, expressly incorporated herein by reference, relates to an index associated with a database that is dynamically processed in an information retrieval system to create a set of questions for use when processing a data inquiry from a user. The index, a structured guide used when searching the database, has different information domains. After one of these domains is selected, a particular order of the index categories within the selected domain is determined, typically by referring to the order lookup table within the index. A script corresponds to the selected domain. Within the script, there are questions corresponding to each index category within the selected domain. These questions are dynamically used to prompt the user. Only the questions corresponding to active index categories are arranged into the set of questions having a question set order corresponding to the particular order of the index categories. In an iterative process, the first question is identified and used to prompt the user to select a term from a scaled down vocabulary of terms (i.e., only those terms associated with the first question and corresponding first index category). Upon selecting the term, a search of the database is performed by the search engine module based upon the selected term. If there is too much information returned from the search, the next question is identified and the iterative process is repeated. Thus, this general technique may be used to iteratively select appropriate collections.
The index is provided having a variety of domains and a variety of terms. In addition to the previous description of an index, an index may be generally described as a data structure which maintains terms associated with information in a database, index categories associated with the terms, domains of particular index categories, and group headings. Each of the group headings may be hierarchically related to each other and correspond to information in the database in a vertical fashion. In other words, a hierarchical relationship between each of the group headings creates a vertical hierarchy with one or more levels. One of the domains is selected from the index. The selected domain has a variety of index categories, and is associated with a portion of the terms in the index. Each of the index categories is associated with a question so as to provide a set of questions for the selected domain. Next, the particular order of index categories in the selected domain is determined, corresponding to the particular order associated with the index categories in the selected domain. Generally, if any of the index categories are inactive (or the proposed question appears to have no discriminating power), the question corresponding to the inactive index category is deleted from the set. Deleting such questions from the set dynamically adapts the set to include only questions related to available information within the database. This, in turn, allows for a more contextual and appropriate response to selections made by a user and permits the data inquiry to be processed more intelligently. Furthermore, deleting such questions from the set avoids wasting valuable transaction processing time and the user's time. Next, the first question in the set is identified from the remaining questions in the set. Typically, the terms of the index are scaled to include only those terms associated with the index category corresponding to the identified question.
The user is then prompted to select one of the scaled terms. The database is searched for information associated with the selected term. If the amount of information retrieved from the database during the search exceeds a predetermined threshold, the method identifies the next question in the question set order and repeats the above-described steps. However, if the amount of information does not exceed the predetermined threshold, then the information is delivered to the user. Delivery is typically accomplished by transmitting a signal having the information back to the user. From these described steps, the set of questions is dynamically created for use when processing the inquiry from the user. According to the present invention, the user may be provided with a query status with each successive screen, to allow him to determine an appropriate threshold or determine when to examine the search results manually.
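The iterative question-and-threshold loop described above can be sketched roughly as follows. This is a simplified Python illustration, not the method of U.S. Pat. No. 5,878,423: the record layout, the `answer_fn` callback standing in for the user prompt, and the rule that a category offering fewer than two distinct terms has no discriminating power are all assumptions.

```python
def narrow_search(records, questions, answer_fn, threshold):
    """Ask ordered questions until the result set is small enough.

    records   -- list of dicts mapping index category -> term
    questions -- ordered list of (category, prompt) pairs for the selected domain
    answer_fn -- callable(prompt, choices) -> chosen term (stands in for the user)
    threshold -- maximum number of records to deliver without asking more
    """
    results = list(records)
    asked = []
    for category, prompt in questions:
        # Scale the vocabulary down to terms that still occur for this category.
        choices = sorted({r[category] for r in results if category in r})
        if len(choices) < 2:
            # Inactive category or no discriminating power: drop the question.
            continue
        term = answer_fn(prompt, choices)
        asked.append(category)
        results = [r for r in results if r.get(category) == term]
        if len(results) <= threshold:
            break  # few enough results to deliver to the user
    return results, asked
```

The returned `asked` list records which questions were actually put to the user, which could also serve as the basis for a per-screen query status.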
Targeted Advertising
The current wide-ranging use of computer systems provides a relatively large potential market to providers of electronic content or information. These providers may include, for example, advertisers and other information publishers such as newspaper and magazine publishers. A cost, however, is involved with providing electronic information to individual consumers. For example, hardware and maintenance costs are involved in establishing and maintaining information servers and networks. One source that can be secured to provide the monetary resources necessary to establish and maintain such an electronic information distribution network is commercial advertisers. These advertisers provide electronic information to end users of the system by way of electronically delivered advertisements, in an attempt to sell products and services to the end users. The value of a group of end users, however, may be different for each of the respective advertisers, based on the product or services each advertiser is trying to sell and the class or classification of the user. Thus, it would be beneficial to provide a system that allows individual advertisers to pay all or part of the cost of such a network, based on the value each advertiser places on the end users the advertiser is given access to. In addition, advertisers often desire to target particular audiences for their advertisements. These targeted audiences are the audiences that an advertiser believes are most likely to be influenced by the advertisement or otherwise provide revenues or profits. By selectively targeting particular audiences the advertiser is able to expend his or her advertising resources in an efficient manner.
Thus, it would be beneficial to provide a system that allows electronic advertisers to target specific audiences, and thus not require advertisers to provide a single advertisement to the entire population, the majority of which may have no interest whatsoever in the product or service being advertised or susceptibility to the advertisement.
U.S. Pat. No. 5,724,521, expressly incorporated herein by reference, provides a method and apparatus for providing electronic advertisements to end users in a consumer best-fit pricing manner, which includes an index database, a user profile database, and a consumer scale matching process. The index database provides storage space for the titles of electronic advertisements. The user profile database provides storage for a set of characteristics that corresponds to individual end users of the apparatus. The consumer scale matching process is coupled to the content database and the user profile database and compares the characteristics of the individual end users with a consumer scale associated with the electronic advertisement. The apparatus then charges a fee to the advertiser, based on the comparison by the matching process. In one embodiment, a consumer scale is generated for each of multiple electronic advertisements. These advertisements are then transferred to multiple yellow page servers, and the titles associated with the advertisements are subsequently transferred to multiple metering servers. At the metering servers, a determination is made as to where the characteristics of the end users served by each of the metering servers fall on the consumer scale. The higher the characteristics of the end users served by a particular metering server fall, the higher the fee charged to the advertiser.
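The idea of charging according to where an audience falls on a consumer scale can be sketched as below. This Python example is a hypothetical stand-in for the matching process of U.S. Pat. No. 5,724,521: the weighted-match scoring, the function names `scale_position` and `advertiser_fee`, and the linear fee rule are assumptions, not details from the patent.

```python
def scale_position(profile, consumer_scale):
    """Return a 0..1 position for a user profile on an advertisement's consumer scale.

    consumer_scale maps (variable, desired value) -> weight; this weighted-match
    scheme is a hypothetical stand-in for the patent's matching process.
    """
    total = sum(consumer_scale.values())
    if total == 0:
        return 0.0
    matched = sum(
        weight
        for (variable, value), weight in consumer_scale.items()
        if profile.get(variable) == value
    )
    return matched / total


def advertiser_fee(profiles, consumer_scale, max_fee_per_user):
    """Charge more for audiences that sit higher on the consumer scale."""
    return sum(max_fee_per_user * scale_position(p, consumer_scale) for p in profiles)
```

In this sketch an advertiser pays the full per-user fee only for end users matching every weighted characteristic, and proportionally less for partial matches.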
Each client system is provided with an interface, such as a graphic user interface (GUI), that allows the end user to participate in the system. The GUI contains fields that receive or correspond to inputs entered by the end user. The fields may include the user's name and possibly a password. The GUI may also have hidden fields relating to “consumer variables.” Consumer variables refer to demographic, psychographic and other profile information. Demographic information refers to the vital statistics of individuals, such as age, sex, income and marital status. Psychographic information refers to the lifestyle and behavioral characteristics of individuals, such as likes and dislikes, color preferences and personality traits that show consumer behavioral characteristics. Thus, the consumer variables, or user profile data, refer to information such as marital status, color preferences, favorite sizes and shapes, preferred learning modes, employer, job title, mailing address, phone number, personal and business areas of interest, the willingness to participate in a survey, along with various lifestyle information. The end user initially enters the requested data and the non-identifying information is transferred to the metering server. That is, the information associated with the end user is compiled and transferred to the metering server without any indication of the identity of the user (for example, the name and phone number are not included in the computation). The GUI also allows the user to receive inquiries, request information and consume information by viewing, storing, printing, etc. The client system may also be provided with tools to create content, advertisements, etc. in the same manner as a publisher/advertiser.
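The separation of identifying fields from the profile data sent to the metering server might look like the following minimal Python sketch. The field names and the exact set treated as identifying are assumptions; the passage itself names only the user's name and phone number as excluded.

```python
# Hypothetical field names; the source names only name and phone number as excluded.
IDENTIFYING_FIELDS = frozenset({"name", "password", "phone_number"})


def non_identifying_profile(profile):
    """Return a copy of the profile without fields that identify the end user."""
    return {key: value for key, value in profile.items() if key not in IDENTIFYING_FIELDS}
```

Only the returned dictionary would be transferred; the original profile stays on the client system unchanged.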
Structured Information Presentation
As the amount of information available to a computer user increases, the problem of coherently presenting the range of available information to the computer user in a manner which allows the user to comprehend the overall scope of the available information becomes more significant. Furthermore, coherent presentation of the relationship between a chosen data unit of the available information to the rest of the available information also becomes more significant with the increase of information available to the user. Most of the existing methods utilize lists (e.g., fundamentally formatted character-based output), not graphic models, to indicate the structure of the available information. The main problem associated with the use of lists is the difficulty of indicating the size and complexity of the database containing the available information. In addition, because the lists are presented in a two-dimensional format, the manner of indicating the relationship between various data units of the available information is restricted to the two-dimensional space. Furthermore, because presentation of the lists normally requires a significant part of the screen, the user is forced to reduce the amount of screen occupied by the list when textual and visual information contained in the database is sought to be viewed. When this occurs, the user's current “position” relative to other data units of the available information is lost. Subsequently, when the user desires to reposition to some other data unit (topic), the screen space occupied by the lists must be enlarged. The repeated sequence of adjusting the screen space occupied by the lists tends to distract the user, thereby reducing productivity.
One attempt to alleviate the above-described problem is illustrated by U.S. Pat. No. 5,021,976, expressly incorporated herein by reference, which discloses a system for enabling a user to interact with visual representations of information structures stored in a computer. In a system of this type, a set of mathematical relationships is provided in the computer to define a plurality of parameters which may be of interest to the user, which mathematical relationships are also capable of indicating a degree of correlation between the defined parameters and segments of information contained in a defined information system. In addition, an “automatic icon” with multiple visual features is provided to enable the user to visualize the degree of correlation between the parameters of interest to the user and the particular data unit stored in the computer that is being examined by computer. As the degree of correlation for a given parameter changes, the visual feature representing that parameter will change its appearance.
Another attempt to coherently present a large body of information to a computer user is illustrated by U.S. Pat. No. 5,297,253, expressly incorporated herein by reference, which discloses a computer-user-interface navigational system for examining data units stored in the memory of a computer system. In this navigational system, the user interface shows a continuously and automatically updated visual representation of the hierarchical structure of the information accessed. By using an input/output device to manipulate icons that appear in a navigational panel, the user can navigate through the information hierarchy. As the user traverses the information hierarchy, a node icon representing each level in the hierarchy accessed by the user is displayed. The user is also able to directly select any level in the information hierarchy between the entry point and the level at which the user is currently located.
Yet another approach to coherently presenting a large body of information to a computer user is “SEMNET,” described in: Raymonde Guindon, ed., Cognitive Science and Its Applications for Human-Computer Interaction, (Hillsdale, N.J.: Lawrence Erlbaum Associates, Inc., 1988), 201-232. SEMNET is a three-dimensional graphical interface system that allows the users to examine specific data units of an information base while maintaining the global perspective of the entire information base. The SEMNET developers propose organizing component data units of an information base into various levels of hierarchy. At the lowest level of hierarchy, the most basic data units are organized into various sets, or cluster-objects of related information. At the next level of hierarchy, related cluster-objects from the lower hierarchical level are organized into a higher-level cluster-object. Continuing in this manner, SEMNET achieves a hierarchical organization of the information base. In the graphic display, related data units within a cluster-object are connected by lines, or arcs. In addition, using a “fisheye view” graphic presentation, SEMNET displays the most basic data units near the chosen data unit but only cluster-objects of increasing hierarchy as the distance increases from the chosen data unit. In this manner, the user is able to visualize the organization of the information base relative to the chosen data unit. See, U.S. Pat. No. 5,963,965, expressly incorporated herein by reference.
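The “fisheye view” selection rule described above, with full detail near the chosen data unit and only cluster-objects farther away, can be approximated by the Python sketch below. It is not SEMNET's algorithm: the nested-dictionary representation, the function name `fisheye_view`, and the use of tree depth as a proxy for distance are assumptions for illustration.

```python
def fisheye_view(tree, path):
    """List what to display for the data unit reached by `path`.

    tree -- nested dicts (cluster-objects) whose leaves are None (basic data units)
    path -- keys from the root down to the chosen data unit
    Returns (label, depth) pairs: the chosen unit's own cluster is shown unit by
    unit, and every cluster off the path is shown only as a collapsed object.
    """
    view = []
    node = tree
    for depth, key in enumerate(path):
        for name, child in node.items():
            if name == key:
                continue  # on the path: expanded at the next level down
            kind = "cluster" if isinstance(child, dict) else "unit"
            view.append((f"{kind}:{name}", depth))
        if depth == len(path) - 1:
            view.append((f"chosen:{key}", depth))
        node = node[key]
    return view
```

Clusters that branch off higher in the hierarchy appear as single collapsed entries, so the display stays compact while still showing where the chosen unit sits in the whole information base.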
U.S. Pat. No. 5,812,134, expressly incorporated herein by reference, relates to a system for interactive, dynamic, three-dimensional presentation of a database structure, seeking to allow the user to efficiently navigate through the database to examine the desired information. The system graphically depicts the organization of the information base as “molecules” consisting of structured parallel “threads” of connected nodes, each encompassing a specific aspect of the overall database. Within a given thread, the component nodes, which share a commonality of subject, are arranged in a natural, linear progression that reflects the organizational structure of the information subject represented by the thread, thereby providing the user with a visual guide suggesting the appropriate sequence of nodes to be viewed. By providing a hierarchical representation of the organizational structure of the entire database, the navigational system provides the user with both the “position” of the information unit being currently examined relative to the remainder of the database, as well as the information regarding the overall size and complexity of the database. The system also provides the user with the capability to define one or more “customized” navigation “paths” over the database, as well as copy and modify existing units of information.
The system therefore provides an interface system for presenting on a monitor of a computer system a dynamic, graphic representation of organization of one of a portion of information and entire information within an information base, the entire information within the information base being organized into at least first hierarchical level having at least a plurality of first-sublevel information units, a plurality of second-sublevel information units, and at least one third-sublevel information unit, each of the first-sublevel, second-sublevel and third-sublevel information units having an identifier, each of the second-sublevel information units comprising at least one first-sublevel information unit, the at least one third-sublevel information unit comprising a plurality of the second-sublevel information units, the dynamic, graphic representation implying a specific search path that a user may take in examining the available information, the interface system comprising means for generating a coded data map reflecting the organization of the entire information within the information base based at least on the identifiers for each of the first-sublevel, second-sublevel and third-sublevel information units; and means for presenting on the monitor the dynamic, graphic representation of the organization of the one of the portion of information and the entire information within the information base, the graphic presentation means generating the dynamic graphic representation based on the data map, the dynamic graphic representation comprising at least one molecule for the first hierarchical level of organization, the at least one first-hierarchical-level molecule having at least one first-hierarchical-level thread of multiple first-hierarchical-level nodes connected in sequence, each of the multiple first-hierarchical-level nodes representing one of the plurality of second-sublevel information units, and the at least one first-hierarchical-level thread representing the at least one third-sublevel information unit; wherein the sequence of first-hierarchical-level nodes in the at least one first-hierarchical-level thread corresponds to an organization of the at least one third-sublevel information unit. Thus, a taxonomy is constructed and employed to assist the user.
U.S. Pat. No. 5,774,357, expressly incorporated herein by reference, relates to a system that is adaptive to either or both of a user input and a data environment. Therefore, the user interface itself and/or the data presented through the user interface, such as a web browser, may vary in dependence on a user characteristic and the content of the data.
U.S. Pat. No. 5,886,698, expressly incorporated herein by reference, relates to a system wherein images representing search results are displayed on a screen of a computer system. The search results are responsive to a search in a database initiated by a user by entering a keyword or keywords via an input device. The images are displayed in such a way that an image corresponding to the highest matching value is the largest in size, while remaining matches are represented by images in direct proportion to their relevance to the keyword. In addition, the relevance of an image is indicated by its proximity to the keyword displayed on the screen: the closer the displayed image to the keyword, the more relevant the match represented by that image is to that keyword. This display operation is equivalent to two simultaneous searches with Boolean operators “OR” and “AND”. A graphical squeegee may be dragged across images representing search results in order to filter the results based on a keyword. The squeegee is displayed as a vertical bar and is associated with a keyword. As the squeegee is moved across the screen, images relevant to the keyword are moved while remaining images are not moved.
U.S. Pat. No. 5,918,236, expressly incorporated herein by reference, relates to a system for generating and displaying point of view and generic “gists” for use in a document browsing system. Each point of view gist provides a synopsis or abstract that reflects the content of a document from a predetermined point of view or slant. A content processing system analyzes documents to generate a thematic profile for use by the point of view gist processing. The point of view gist processing generates point of view gists based on the different themes or topics contained in a document by identifying paragraphs from the document that include content relating to a theme for which the point of view gist is based. In one embodiment, the user of a document browsing system inputs topics to select a particular point of view gist for a document. A document browsing system may also display point of view gists based on a navigation history of categories selected by a user through use of a document browsing system. In another embodiment, a document browsing system generates and displays generic gists, which include content relating to the document themes.
Intelligent Searching
When a user of an information searching/retrieval system enters a search query into the system, the query must be parsed. Based on the parsed query, a listing of stored documents relevant to the query is provided to the user for review. In the prior art, it is known to use semantic networks when parsing a query. Semantic networks make it possible to identify words not appearing in the query, but which logically correspond to or are associated with the words used in the query. The number of words used to search the database is then expanded by including the corresponding words or associated words identified by the semantic network in the search instructions. This procedure is used to increase the number of relevant documents located by the information searching/retrieval system. Although semantic networks may be useful for finding additional relevant documents responsive to a query, this technique also tends to increase the number of irrelevant documents located by the search, unless other techniques are also used.
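The query expansion described above can be sketched in a few lines of Python. This is an illustration only: the semantic network is modeled as a plain dictionary mapping a word to its associated words, and the network contents, the sample documents, and the function names `expand_query` and `search` are hypothetical and not drawn from any cited system.

```python
# Hypothetical semantic network: each word maps to a set of associated words.
SEMANTIC_NETWORK = {
    "car": {"automobile", "vehicle"},
    "battery": {"cell"},
}

# Hypothetical document collection, keyed by document id.
DOCUMENTS = {
    1: "replacing a car battery",
    2: "automobile maintenance schedule",
    3: "vehicle registration fees",
    4: "cooking with cheese",
}


def expand_query(query, network):
    """Return the query terms plus any associated words found in the network."""
    terms = set(query.lower().split())
    expanded = set(terms)
    for term in terms:
        expanded |= network.get(term, set())
    return expanded


def search(documents, terms):
    """Return the sorted ids of documents containing at least one term."""
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if terms & set(text.lower().split())
    )


plain = search(DOCUMENTS, set("car battery".split()))
expanded = search(DOCUMENTS, expand_query("car battery", SEMANTIC_NETWORK))
```

In this toy collection the expanded search retrieves documents 2 and 3 in addition to document 1. Document 3 (registration fees) shows how expansion can also pull in material of doubtful relevance to the original query.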
U.S. Pat. No. 5,855,015, expressly incorporated herein by reference, relates to a system and method for adaptively traversing a network of linked textual or multi-media information, which utilizes one or more heuristics to explore the network and present information to a user. An exploration or search heuristic governs activity while examining and exploring the linked information resources, while a presentation heuristic controls presentation of a manageable amount of information resources to the user. The system and method accept relevance feedback from the user, which is used to refine future search, retrieval, and presentation of information resources. The user may present an information query of various degrees of specificity or the system and method may search and present information resources based entirely on relevance feedback from the user.
Many information retrieval systems and methods focus primarily on selecting information based on a formatted query. The particular format often varies significantly from one system to the next depending upon the particular type of information and the structure of the information database. These approaches assume the existence of a collection of information and a user-specified query, with the task of the search engine being to find those documents that satisfy the query. A significant amount of research and development relating to information retrieval has focused on techniques for determining the degree of similarity between “information units”, i.e. a sentence, document, file, graphic, image, sound bite, or the like, or between the user query and an information unit. As the amount of information in the collection grows, the number of information units that correspond to the query will likely grow as well. As a result, it becomes necessary to make queries increasingly more precise so that the system will return a manageable number of results. It is therefore desirable for a system and method to facilitate sophisticated query construction without requiring an unreasonable amount of time or effort to be expended by the user.
One powerful approach to this problem utilizes a technique referred to as relevance feedback. In a system employing relevance feedback, a few resources that are determined to be interesting, or similar to a user query, are presented to the user who provides feedback to the system pertaining to the relevance of the resources. The user feedback is used to update the query, in an attempt to generate increasingly more precise queries resulting in retrieval of increasingly more relevant resources. A variety of implementations of the general technique of relevance feedback are possible, depending upon the particular structure of the query, the structure of the information, and the method for updating the query based on the feedback.
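One well-known instance of this general technique is a Rocchio-style update, in which a term-weight query vector is moved toward resources the user judged relevant and away from those judged irrelevant. The sketch below is illustrative only; the vector representation, the function name `update_query`, and the weights `alpha`, `beta` and `gamma` are assumptions and are not taken from the systems discussed here.

```python
from collections import defaultdict


def update_query(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update of a term-weight query vector.

    `query` is a dict of term -> weight; `relevant` and `irrelevant` are
    lists of such dicts, one per resource the user gave feedback on.
    The default alpha/beta/gamma values are illustrative assumptions.
    """
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for docs, factor in ((relevant, beta), (irrelevant, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                # Each feedback set contributes its average term weight.
                updated[term] += factor * weight / len(docs)
    # Terms whose weight is not positive are dropped from the new query.
    return {term: weight for term, weight in updated.items() if weight > 0}


query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}]
irrelevant = [{"jaguar": 1.0, "car": 1.0}]
new_query = update_query(query, relevant, irrelevant)
```

After one round of feedback the query gains the term "cat" from the relevant resource, while "car", which appeared only in the irrelevant resource, does not enter the query.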
Traditional information search and retrieval techniques have been applied to hyperlinked information networks. One direct approach utilizing standard information retrieval techniques consists of systematically exploring the network and generating a catalog, index, or map of links associated with documents containing information of interest. This index is then used to retrieve the relevant information based on a user query without employing the hyperlinked structure. This approach is difficult to apply to large, dynamic hyperlinked information networks that may be too large to search exhaustively. Furthermore, the dynamic nature of such networks requires repetitively searching and updating the hyperlink index. This task involves continually accessing various network server locations, which requires a significant amount of network bandwidth, computing resources, and time. In addition, standard information retrieval techniques require the user to articulate or characterize information of interest. Frequently, however, users may be able to easily recognize a document meeting their information need, but may have difficulty expressing that need explicitly in a format appropriate for the information retrieval system. In these cases, the manual examination of search results is a critical part of the search process.
U.S. Pat. No. 5,855,015, expressly incorporated herein by reference, proposes a system for retrieval of hyperlinked information resources which does not require a specific user query to locate information resources of interest, and which actively explores a hyperlinked network to present interesting resources to a user. Heuristics and relevance feedback may be used to refine an exploration technique, or to present resources of interest to a user. The proposed system continually adapts to changing user interests. A system for retrieval of hyperlinked information resources is provided which includes a user interface connected to a programmed microprocessor which is operative to explore the hyperlinked information resources using a first heuristic to select at least one information resource, to present the at least one information resource to the user via the user interface based on a second heuristic, to accept feedback from the user via the user interface, the feedback being indicative of relevance of the at least one information resource, and to modify the first and second heuristics based on the feedback. The patent also proposes a method for retrieval of hyperlinked information resources that includes exploring the hyperlinked information resources using a first heuristic to select at least one information resource, presenting the at least one information resource to the user via a user interface based on a second heuristic, accepting feedback from the user via the user interface indicative of relevance of the at least one information resource, and modifying the first and second heuristics based on the feedback. In one embodiment, the system utilizes a series of training examples, each having an associated ranking, to develop the first and second heuristics that may be the same, similar, or distinct. The heuristics utilize a metric indicative of the relevance of a particular resource to select and present the most relevant information to the user. 
The user provides feedback, such as a score or rating, for each information resource presented. This feedback is utilized to modify the heuristics so that subsequent exploration will be guided toward more desirable information resources.
The system actively explores a hyperlinked network and presents a manageable amount (controllable by the user) of information to the user without a specific information query. (Of course, the present invention permits such a specific information query, and thus is not limited in this way). Thus, the method allows selection of information of interest that may have been excluded by a precisely articulated query. Furthermore, rather than inundating the user with information selected from a general, broad query, the amount of information presented to the user is limited so as to minimize the time and effort required to review the information. This system provides the ability to automatically learn the interests of the user based on a number of ranked training examples. Once exploration and presentation heuristics are developed, a hyperlinked network may be explored, retrieving and presenting information resources based upon the heuristics established by the training examples. The system is capable of continually adapting the exploration and presentation heuristics so as to accommodate changing user interests in addition to facilitating operation in a dynamic hyperlinked information environment.
U.S. Pat. No. 5,890,152, expressly incorporated herein by reference, relates to a Personal Feedback browser and Personal Profile database for obtaining media files from the Internet. A Personal Feedback browser selects media files based on user-specified information stored in the Personal Profile database. The Personal Profile database includes Profile Objects that represent the interests, attitude/aptitude, reading comprehension and tastes of a user. Profile Objects are bundles of key words/key phrases having assigned weight values. Profile Objects can be positioned a specified distance from a Self Object. The distance from the Profile Object to the Self Object represents the effect the Profile Object has in filtering and/or selecting media files for that user. The Personal Feedback browser includes a media evaluation software program for evaluating media files based on a personal profile database. The Personal Profile database is also adjusted based upon user selection and absorption of media files.
U.S. Pat. No. 5,920,854, expressly incorporated herein by reference, establishes a collection search system that is responsive to a user query applied against a collection of documents to provide a search report. The collection search system includes a collection index including first predetermined single word and multiple word phrases as indexed terms occurring in the collection of documents, a linguistic parser that identifies a list of search terms from a user query, the linguistic parser identifying the list from second predetermined single words and multiple word phrases, and a search engine coupled to receive the list from the linguistic parser. The search engine operates to intersect the list with the collection index to identify a predetermined document from the collection of documents. The search engine includes an accumulator for summing a relevancy score for the predetermined document that is then related to the intersection of the predetermined document with the list. An advantage of this system is that the search engine utilizes selective multi-word indexing to speed the search by the effective inclusion of proximity relations as part of the document index retrieval. Furthermore, multiple identifications of a document, both on the basis of single word and phrase index identifications, result in a desirable bias of the search report score towards the most relevant documents. Another advantage of this system is that the index database utilized handles both word and phrase terms as a single data type, with correspondingly simplified merge and join relational database operators. Through the combined handling of both words and phrases, the system usually requires only a single disk access to retrieve a term list from a collection index.
The index database operations needed to support term searching and combination can be effectively achieved utilizing just merge and join relational operators, further simplifying and enhancing the intrinsic speed of the index database management system.
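The combined word-and-phrase indexing and the score accumulator described above can be sketched as follows. This is a simplified illustration, not the patented implementation: the index contents, the phrase list, the function names, and the choice of adding one point per matching term are all assumptions.

```python
# Hypothetical collection index. Single words and multi-word phrases are
# stored as the same data type (a string term mapping to document ids).
COLLECTION_INDEX = {
    "search": {1, 2, 3},
    "engine": {1, 3},
    "search engine": {1},
    "index": {2},
}
KNOWN_PHRASES = {"search engine"}


def parse_query(query, phrases):
    """Return the unique query words followed by any known phrases in the query."""
    words = query.lower().split()
    terms = list(dict.fromkeys(words))  # unique words, original order kept
    for size in range(2, len(words) + 1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in phrases and candidate not in terms:
                terms.append(candidate)
    return terms


def score(terms, index):
    """Sum one point per matching term for each document (assumed weighting)."""
    accumulator = {}
    for term in terms:
        for doc_id in index.get(term, ()):
            accumulator[doc_id] = accumulator.get(doc_id, 0) + 1
    return sorted(accumulator.items(), key=lambda item: (-item[1], item[0]))


ranking = score(parse_query("search engine", KNOWN_PHRASES), COLLECTION_INDEX)
```

Document 1 is identified three times (by both words and by the phrase), so its accumulated score is biased upward relative to documents that merely contain the individual words.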
U.S. Pat. No. 5,920,859, expressly incorporated herein by reference, relates to a search engine for retrieving documents pertinent to a query that indexes documents in accordance with hyperlinks pointing to those documents. The indexer traverses the hypertext database and finds hypertext information including the address of the document the hyperlinks point to and the anchor text of each hyperlink. The information is stored in an inverted index file, which may also be used to calculate document link vectors for each hyperlink pointing to a particular document. When a query is entered, the search engine finds all document vectors for documents having the query terms in their anchor text. A query vector is also calculated, and the dot product of the query vector and each document link vector is calculated. The dot products relating to a particular document are summed to determine the relevance ranking for each document.
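The ranking step just described (one vector per inbound hyperlink, a dot product with the query vector, and a sum per target document) can be sketched as below. The sample hyperlinks, the function names, and the use of raw term counts as vector components are assumptions made for illustration; the cited patent may weight terms differently.

```python
from collections import Counter, defaultdict

# Hypothetical hyperlinks: (address of the target document, anchor text).
HYPERLINKS = [
    ("a.html", "cheap flights"),
    ("a.html", "flights to paris"),
    ("b.html", "paris hotels"),
]


def build_link_vectors(hyperlinks):
    """Map each target document to one term-count vector per inbound link."""
    vectors = defaultdict(list)
    for target, anchor in hyperlinks:
        vectors[target].append(Counter(anchor.lower().split()))
    return vectors


def dot(u, v):
    """Dot product of two sparse term-count vectors."""
    return sum(u[term] * v[term] for term in u if term in v)


def rank(query, link_vectors):
    """Sum the per-link dot products for each document and sort by score."""
    query_vector = Counter(query.lower().split())
    scores = {}
    for target, vectors in link_vectors.items():
        total = sum(dot(query_vector, vector) for vector in vectors)
        if total > 0:
            scores[target] = total
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = rank("paris flights", build_link_vectors(HYPERLINKS))
```

Note that the documents themselves are never read in this sketch; a document is ranked purely by what other pages say about it in their anchor text.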
Use of Transactional Data for Marketing
In recent years, the field of data mining, or extracting useful information from bodies of accumulated raw data, has provided a fertile new frontier for database and software technologies. While numerous types of data may make use of data mining technology, a few particularly illuminating examples have been those of mining information, useful to retail merchants, from databases of customer sales transactions, and mining information from databases of commercial passenger airline travel. Customer purchasing patterns over time can provide invaluable marketing information for a wide variety of applications. For example, retailers can create more effective store displays, and can more effectively control inventory, than otherwise would be possible, if they know that, given a consumer's purchase of a first set of items, the same consumer can be expected, with some degree of probability, to purchase a particular second set of items along with the first set. In other words, it would be helpful from a marketing standpoint to know association rules between item-sets (different products) in a transaction (a customer shopping transaction). To illustrate, it would be helpful for a retailer of automotive parts and supplies to be aware of an association rule expressing the fact that 90% of the consumers who purchase automobile batteries and battery cables also purchase battery post brushes and battery post cleanser. (In the terminology of the data mining field, the latter are referred to as the “consequent.”) It will be appreciated that advertisers, too, can benefit from a thorough knowledge of such consumer purchasing tendencies. Still further, catalogue companies can conduct more effective mass mailings if they know the tendencies of consumers to purchase particular sets of items with other sets of items.
It is possible to build large databases of consumer transactions. The ubiquitous bar-code reader can almost instantaneously read so-called basket data, i.e., when a particular item from a particular lot was purchased by a consumer, how many items the consumer purchased, and so on, for automatic electronic storage of the basket data. Further, when the purchase is made with, for example, a credit card, the identity of the purchaser can be almost instantaneously known, recorded, and stored along with the basket data. As alluded to above, however, building a transaction database is only part of the marketing challenge. Another important part is the mining of the database for useful information. Such database mining becomes increasingly problematic as the size of databases expands into the gigabyte, and indeed the terabyte, range. Much work, in the data mining field, has gone to the task of finding patterns of measurable levels of consistency or predictability, in the accumulated data. For instance, where the data documents retail customer purchase transactions, purchasing tendencies, and, hence, particular regimes of data mining can be classified many ways. One type of purchasing tendency has been called an “association rule.” In a conventional data mining system, working on a database of supermarket customer purchase records, there might be an association rule that, to a given percent certainty, a customer buying a first product (say, Brie cheese) will also buy a second product (say, Chardonnay wine). It thus may generally be stated that a conventional association rule states a condition precedent (purchase of the first product) and a condition subsequent or “consequent” (purchase of the second product), and declares that, with, say 80% certainty, if the condition precedent is satisfied, the consequent will be satisfied, also. 
Methods for mining transaction databases to discover association rules have been disclosed in Agrawal et al., “Mining Association Rules Between Sets of Items in Large Databases”, Proc. of the ACM SigMod Conf. on Management of Data, May 1993, pp. 207-216, and in Houtsma et al., “Set-Oriented Mining of Association Rules”, IBM Research Report RJ 9567, October, 1993. See also, Agrawal et al., U.S. Pat. Nos. 5,615,341, 5,796,209, 5,724,573 and 5,812,997. However, association rules have been limited in scope, in the sense that the conditions precedent and subsequent fall within the same column or field of the database. In the above example, for instance, cheese and wine both fall within the category of supermarket items purchased.
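The "percent certainty" of an association rule is commonly called its confidence, and the fraction of all transactions containing the items is called its support; both terms are standard in the data mining literature. The following sketch computes them over a small, invented basket data set. The transactions and function names are hypothetical, and the code does not implement the rule discovery algorithms of the cited references, only the two measures.

```python
# Hypothetical basket data: each transaction is the set of items purchased.
TRANSACTIONS = [
    {"brie", "chardonnay", "bread"},
    {"brie", "chardonnay"},
    {"brie", "crackers"},
    {"bread", "milk"},
    {"brie", "chardonnay", "crackers"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for transaction in transactions if itemset <= transaction)


def support(itemset, transactions):
    """Fraction of all transactions that contain the item set."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also have the consequent."""
    matched = count(antecedent, transactions)
    if matched == 0:
        return 0.0
    both = count(set(antecedent) | set(consequent), transactions)
    return both / matched


rule_confidence = confidence({"brie"}, {"chardonnay"}, TRANSACTIONS)
rule_support = support({"brie", "chardonnay"}, TRANSACTIONS)
```

In this invented data, four of the five baskets contain Brie and three of those four also contain Chardonnay, so the rule "Brie implies Chardonnay" holds with 75% confidence.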
U.S. Pat. No. 5,844,305, expressly incorporated herein by reference, relates to a system and method for extracting highly correlated elements (a “categorical cluster”) from a body of data. It is generally understood that the data includes a plurality of records, the records contain elements from among a set of common fields, the elements have respective values, and some of the values are common to different ones of the records. In an initialization step, for each of the elements in the records, an associated value, having an initial value, is assigned. Then, a computation is performed, to update the associated values based on the associated values of other elements. The computation is preferably performed iteratively to produce the next set of updated values. After the computation is completed, or after all the desired iterations are completed, the final results, i.e., the updated associated values are used to derive a categorical cluster rule. The categorical cluster rule provides the owner of the data with advantageously useful information from the data.
Tracking of User Activity
Frequency programs have been developed by the travel industry to promote customer loyalty. An example of such a program is a “frequent flyer” program. According to such a program, when a traveler books a flight, a certain amount of “mileage points” is calculated by a formula using the distance of the destination as a parameter. However, the mileage points are not awarded until the traveler actually takes the flight. When a traveler has accumulated a sufficient number of mileage points, he may redeem these points for an award chosen from a specific list of awards specified by the program. Thus, for example, the traveler may redeem the points for a free flight ticket or a free rental car. In order to redeem the accumulated points, the traveler generally needs to request a certificate, and use the issued certificate as payment for the free travel. While the above program may induce customer loyalty, it has the disadvantage that the selection of prizes can be made only from the limited list of awards provided by the company. For example, a traveler may redeem the certificate for flights between only those destinations to which the carrier has a regular service. Another disadvantage is that the customer generally needs to plan ahead in sufficient time to order and receive the award certificate. According to another type of frequency and award program, a credit instrument is provided and credit points are accumulated instead of the mileage points. In such programs, bonus points are awarded by using a formula in which the price paid for merchandise is a parameter. Thus, upon each purchase a certain number of bonus points are awarded, which translate to a dollar credit amount. According to these programs, the customer receives a credit instrument that may be accepted by many enrolled retailers, so that the selection of prizes available is enhanced. An example of such a program is disclosed in E.P.A. 308,224.
However, while such programs may enhance the selection of prizes, there is still the problem of obtaining the credit instrument for redeeming the awarded points. In addition, the enrollee must allow for processing time before the bonus points are recorded and made available as redeemable credit. Thus, the immediacy effect of the reward is lacking in these conventional incentive programs. U.S. Pat. No. 5,774,870, expressly incorporated herein by reference, provides on-line access to product information, product purchases using an on-line electronic order form, award catalogs, and award redemption using an on-line electronic redemption form. Bonus points are awarded immediately upon purchase of the merchandise, and are immediately made available for redemption.
These reward programs have the direct consequence that the user has an incentive to uniquely identify himself in order to be able to collect the reward after a number of uses, and to use the services associated with the reward program in distinction to similar services provided by others. Therefore, by providing a reward program, the value of personalization is increased for the user, thereby incentivizing the user to comply with the acquisition of personal information by the system.
Relevance Ranking
Web search services typically need to support a number of specific search capabilities to be at least perceived as a useful document locator service within the Web community. These capabilities include performing relatively complete searches of all of the available Web information, providing fast user-query response times, and developing an appropriate relevance ranking of the documents identified through the index search, among others. In order to support a relatively complete search over any collection of documents, the derived document collection index managed by a Web search service may store a list of the terms, or individual words, that occur within the indexed document collection. Words, particularly simple verbs, conjunctions and prepositions, are often preemptively excluded from the term index as presumptively carrying no significant informational weight. Various heuristics can be employed to identify other words that appear too frequently within a document collection to likely serve to contextually differentiate the various documents of the collection. As can be expected, these heuristics are often complex and difficult to implement without losing substantive information from the index. Furthermore, as these heuristics generally operate on a collection-wide basis to minimize unbalanced loss of information, a distributed database architecture for storing the document collection becomes prohibitively complex to implement, slow in terms of query response time, and quite limited in providing global relevancy ranking.
In order to improve query response time, conventional Web search services often strive to minimize the size of their indexes. A minimum index format provides identifications of any number of documents against particular indexed terms. Thus, word terms of a client query can be matched against the collection index terms to identify documents within the collection that have at least one occurrence of the query terms. A conventional relevancy score can be based on the combined frequency of occurrence of the query terms on a per document basis. Other weighting heuristics, such as the number of times that any of the query terms occur within a document, can also be used. These relevance-ranking systems typically presume that increasing occurrences of specific query terms within a document means that the document is more likely relevant and responsive to the query. A query report listing the identified documents ranked according to relevancy score is then presented to the client user. Simple occurrence indexes as described above are, nonetheless, quite large. In general, a term occurrence index maintained in a conventional relational database management system will be approximately 30% of the total size of the entire collection. At the expense of index size, proximity information is conventionally utilized to improve document relevancy scoring. The basic occurrence index is expanded into a proximity index by storing location-of-occurrence information with the document identifications for each of the indexed terms in a document collection. Storing the expanded term-proximity information results in the size of the index typically being on the order of 60 to 70 percent of the total size of the document collection. The term-proximity information provides an additional basis for evaluating the relevancy of the various documents responsive to a particular client query.
Conventional search engines can post-process the client query identified documents to take into account the relative proximity of the search terms in individual documents. In effect, a revised relevancy ranking of the documents is generated based on whether, and to what degree, query terms are grouped in close proximity to one another within the identified document. Again, the conventional presumption is that the closer the proximity of the terms, the more likely the document will be particularly relevant and responsive to the client query.
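The difference between a basic occurrence index and a proximity index, and the post-processing step that uses term proximity, can be sketched as follows. The sample documents and function names are hypothetical, and the way the two signals are combined (term frequency plus the reciprocal of the smallest distance between the two query terms) is an assumption chosen only to make the idea concrete.

```python
from collections import defaultdict

# Hypothetical document collection, keyed by document id.
DOCUMENTS = {
    1: "data mining finds patterns in data",
    2: "mining for gold and storing data",
    3: "data about coal mining",
}


def build_proximity_index(documents):
    """Map term -> {doc_id: [word positions]}, i.e. occurrences plus locations."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index


def frequency_score(index, terms, doc_id):
    """Combined number of occurrences of the query terms in one document."""
    return sum(len(index.get(term, {}).get(doc_id, [])) for term in terms)


def min_distance(index, term_a, term_b, doc_id):
    """Smallest word distance between the two terms, or None if either is absent."""
    positions_a = index.get(term_a, {}).get(doc_id, [])
    positions_b = index.get(term_b, {}).get(doc_id, [])
    return min((abs(a - b) for a in positions_a for b in positions_b), default=None)


def rank(index, term_a, term_b):
    """Rank documents by term frequency plus an assumed proximity bonus."""
    candidates = set(index.get(term_a, {})) | set(index.get(term_b, {}))
    scored = []
    for doc_id in candidates:
        frequency = frequency_score(index, (term_a, term_b), doc_id)
        distance = min_distance(index, term_a, term_b, doc_id)
        bonus = 1.0 / distance if distance else 0.0
        scored.append((doc_id, frequency + bonus))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


INDEX = build_proximity_index(DOCUMENTS)
ranking = rank(INDEX, "data", "mining")
```

Documents 2 and 3 contain the same number of query-term occurrences, so a pure frequency score cannot separate them; the stored positions allow document 3, in which the terms lie closer together, to be ranked higher.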
Various schemes can be utilized to further weight and balance the relevancy scores derived from term frequency and term proximity. While a number of such schemes are known, the schemes operate on the reasonable and necessary premise that all relevant documents need to be initially identified from the collection before a final relative relevancy score can be computed. The relative relevancy is then calculated based on the full set of query-identified documents. Thus, existing search systems cannot effectively operate against a document collection index that, due perhaps to size or to support parallel access, is fragmented over multiple server systems or against multiple collection indexes that are served from multiple distributed servers. Furthermore, to produce a proper, consistent ranking of the full set of query-identified documents, the ranking scores conventionally must be calculated over the full set of identified documents. Large amounts of information must therefore be pooled from the potentially multiple index servers in order to perform the aggregate relevancy scoring. Consequently, the convenience, as well as capacity and performance, potentially realized by use of distributed servers is not generally realized in the implementation of conventional search systems.
Another significant limitation of conventional search systems relates to the need to ensure the timeliness of the information maintained in the collection indexes. For large collections, the collection indexes need to be rebuilt to add or remove individual document-to-term relations. The process of building and rebuilding a collection index is quite time consuming. The rapid rate of document collection content changes, however, requires that the indexes be updated frequently to include new index references to added or exchanged documents. Known index preparation functions and procedures are unfortunately one, if not many, orders of magnitude slower than the typical rate of document collection content change. Ever larger and faster monolithic computer systems are therefore required to reduce the document collection indexing time. While computer performance continues to steadily increase, the rate of document additions and changes appears to be far greater. Furthermore, any increase in computer performance comes at a much-increased cost. Thus, practical considerations have generally become limiting factors on the performance, size and assured timeliness in searching collections for query-identified documents. Consequently, there is a clear and present need for a collection search system that is scalable without loss of performance or repeatable accuracy and that can be actively maintained current substantially in real-time.
U.S. Pat. No. 5,924,090, expressly incorporated herein by reference, relates to a system for searching a database of records that organizes results of the search into a set of most relevant categories, enabling a user to obtain, with a few mouse clicks, only those records that are most relevant. In response to a search instruction from the user, the search apparatus searches the database, which can include Internet records and premium content records, to generate a search result list corresponding to a selected set of the records. The search apparatus processes the search result list to dynamically create a set of search result categories. Each search result category is associated with a subset of the records within the search result list having one or more common characteristics. The categories can be displayed as a plurality of folders on the user's display. For the foregoing categorization method and apparatus to work, each record within the database is classified according to various meta-data attributes (e.g., subject, type, source, and language characteristics). Because such a task is too large to perform manually, substantially all of the records are automatically classified by a classification system into the proper categories. The classification system automatically determines the various meta-data attributes when such attributes are not editorially available from the source. If the number of retrieved records is less than a particular value (e.g., 20), a grouping processor is bypassed. Otherwise, the grouping processor processes a portion of the search result list to dynamically create a set of search result categories, wherein each search result category is associated with a subset of the records in the search result list.
For example, the portion of the search result list processed can be the first two hundred (or one hundred) most relevant records within the selected set of records. The grouping processor performs a plurality of processing steps to dynamically create the set of search result categories. A record processor identifies various characteristics (e.g., subject, type, source and language) associated with each record in the search result list. The candidate generator identifies common characteristics associated with the records in the search result list and compiles a list of candidate categories. The candidate generator utilizes various rules, which are described below, to compile the list. The weighting processor weights each candidate category as a function of the identified common characteristics of the records within the candidate category. Also, the weighting processor utilizes various weighting rules, which are described below, to weight the candidate categories. The display processor selects a plurality of search result categories (e.g., 5 to 10) corresponding to the candidate categories having the highest weight and provides a graphical representation of the search result categories for display on the user's monitor. The search result categories can be displayed as a plurality of icons on the monitor (e.g., folders). When a particular search result category is selected by the user, the display processor also can provide a graphical representation of the number of records in the search result category, additional search result categories and a list of the most relevant records for display. The user can select a search result category and view additional search result categories (if the number of records is greater than a particular value) along with the list of records included in that category. To narrow the search, the user can provide additional search terms (i.e., a refine instruction).
Upon receiving the additional terms, the search processor searches the database and generates another search result list corresponding to a refined set of the records. Alternatively, the user can (effectively) refine the search simply by successively opening up additional search result categories. See, http://www.northernlight.com.
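The grouping steps described above can be sketched as follows. This is a minimal illustration and not the method of the referenced patent: the function name `group_results`, the record layout (a dictionary per record), the weighting rule (a category's weight is simply its record count) and the requirement that a category contain at least two records are all assumptions, since the actual candidate and weighting rules are not given here. The thresholds of 20 and 200 and the category limit follow the example values in the text.

```python
from collections import defaultdict

BYPASS_THRESHOLD = 20  # below this many records, grouping is bypassed
PORTION_SIZE = 200     # only the most relevant records are processed
MAX_CATEGORIES = 5     # the text gives "5 to 10" as an example range
ATTRIBUTES = ("subject", "type", "source", "language")


def group_results(records, bypass=BYPASS_THRESHOLD, portion=PORTION_SIZE,
                  max_categories=MAX_CATEGORIES):
    """Return up to max_categories (category, records) pairs.

    `records` is assumed to be sorted by descending relevance, with each
    record a dict carrying some of the ATTRIBUTES as keys.
    """
    if len(records) < bypass:
        return []  # grouping processor bypassed

    head = records[:portion]

    # Candidate generation: collect records sharing an attribute value.
    candidates = defaultdict(list)
    for index, record in enumerate(head):
        for attribute in ATTRIBUTES:
            value = record.get(attribute)
            if value is not None:
                candidates[(attribute, value)].append(index)

    # Weighting (hypothetical rule): weight equals the number of records;
    # categories holding a single record share nothing and are dropped.
    weighted = [(len(members), category, members)
                for category, members in candidates.items()
                if len(members) >= 2]
    weighted.sort(key=lambda item: (-item[0], item[1]))

    # Display selection: keep the highest-weighted categories.
    return [(category, [head[i] for i in members])
            for _, category, members in weighted[:max_categories]]


# Hypothetical search result list of 30 records.
records = [{"subject": "sports" if i % 3 == 0 else "finance",
            "type": "news", "language": "en", "source": f"site{i}"}
           for i in range(30)]

for category, members in group_results(records):
    print(category, len(members))
```

Because each category keeps its member records in relevance order, opening a category yields the list of its most relevant records directly, which corresponds to refining the search by successively opening categories.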