1. Field of the Invention
The present invention relates generally to data processing of documents and, more particularly, to searching and retrieving information and knowledge and synchronizing search results with advertisers' products and services, and in particular to methods and systems for query formulation, user relevancy feedback, results management, personalization of knowledge, and monetization of content and knowledge.
2. Description of Related Art
Today's major World Wide Web (the “Web”) search engines, such as those provided by Google and Yahoo, crawl the Web and index billions of Web pages maintained in their respective repositories. Rudimentary processing of the information inherent in these ultra-large and substantially unstructured datasets has already led to the creation of a multi-billion dollar industry. The underlying model is deceptively simple: (i) content creators from around the globe publish their content on the web and link it to other content via hyperlinks (in fact, the web, having at first grown in an organic and voluntary fashion, has now become almost the first choice as a publication medium: it is now the norm to publish one's content on the web first, and often in preference to any other communication medium); (ii) users search this repository for information required for their everyday decision-making, the number of users and searches having increased exponentially since the inception of the web; and (iii) advertisers pitch their products and services as the users browse and search for information. There is, however, a growing realization in the industry that, in order for this model to continue to work and for the industry to keep growing, the three primary stakeholders, namely the users of the web, the advertisers, and the content owners, have to be served better.
Users can experience a variety of problems using current search technologies. Searching for information is an imprecise process in which users often do not have a clear vision of their goals and may have only a fuzzy understanding of what they want, recognizing it only when they see it. The facility to search is generally limited to keywords and Boolean functions of these keywords. Keywords and their Boolean functions are notoriously inefficient in capturing user intent.
FIG. 1 depicts a conventional search engine-user interaction process. Conventional search engines are much more data-oriented (i.e., keyword-oriented). Usually, they return as hits a linear list of documents containing the keywords entered by the user. One can also use sophisticated Boolean functions of these keywords as search criteria. Typically, these search systems present a long, disorganized list of several hundred thousand, or even millions, of documents ordered according to the underlying search system's global ranking algorithm and a “proximity” score that determines, sometimes arbitrarily, the relevance of the keywords entered by the user to the document under consideration. Most users do not understand Boolean expressions or Boolean models, or how to express their search requests in terms of Boolean expressions. A majority of the Boolean expressions constructed by users consist of a sequence of keywords. As a result, the long and disorganized list of documents returned frequently fails to directly address users' information needs.
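By way of illustration only, the keyword-oriented retrieval described above can be sketched as a Boolean AND lookup over an inverted index. The corpus, the document identifiers, and the function names `build_index` and `boolean_and` are hypothetical and are not taken from any particular search engine; ranking is omitted entirely.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text (assumption for illustration).
DOCS = {
    "d1": "computer virus removal software",
    "d2": "flu virus infection treatment",
    "d3": "software download for computer games",
}

def build_index(docs):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, keywords):
    """Return the sorted ids of documents containing every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    if not sets:
        return []
    return sorted(set.intersection(*sets))

INDEX = build_index(DOCS)
hits_virus = boolean_and(INDEX, ["virus"])             # both senses of "virus" match
hits_both = boolean_and(INDEX, ["computer", "virus"])  # narrowed only by adding a keyword
```

In this sketch the single keyword "virus" matches documents about both computer viruses and viral infections, and the only way to narrow the result is to add further keywords. The lookup has no notion of which sense the user intended.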
In conventional search engines, the use of page rank as the primary criterion for determining whether a document is relevant is known to have serious deficiencies. For example, if the SONY Corporation's home page (which has a high page rank) adds only one piece of information about heart disease, then this document will be displayed very high on a list returned responsive to a search for “heart disease,” even though it is a very isolated document and is probably not very relevant, given Sony's business models. Thus, sorting via page rank can often lose the context of the documents.
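The deficiency described above can be illustrated with a minimal sketch that orders matching pages by a global score alone and compares the result with an ordering by how concentrated the query term is in each page. The page names, scores, and word counts are invented for illustration, and the term-density ordering is only a simple contrast. It is not the ranking method of any actual search engine or of the present invention.

```python
# Hypothetical pages (all values are assumptions for illustration).
PAGES = [
    {"name": "large-corporate-home-page", "global_score": 0.95,
     "mentions": 1, "words": 5000},
    {"name": "cardiology-clinic-page", "global_score": 0.40,
     "mentions": 60, "words": 1200},
]

def rank_by_global_score(pages):
    """Order pages that mention the query term by global score only."""
    matching = [p for p in pages if p["mentions"] > 0]
    ordered = sorted(matching, key=lambda p: p["global_score"], reverse=True)
    return [p["name"] for p in ordered]

def rank_by_term_density(pages):
    """Order pages that mention the query term by mentions per word."""
    matching = [p for p in pages if p["mentions"] > 0]
    ordered = sorted(matching, key=lambda p: p["mentions"] / p["words"],
                     reverse=True)
    return [p["name"] for p in ordered]

global_order = rank_by_global_score(PAGES)
density_order = rank_by_term_density(PAGES)
```

Under these assumed values, the global-score ordering places the corporate home page first even though it mentions the query term once, while the density ordering places the topically focused page first.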
Conventional search engines often require users to spend a large amount of time reformulating their search expressions to satisfy their information needs: these conventional web search engines contain an underlying assumption that users' information needs are static. However, users' information needs, and subsequently their search expressions, continuously change and often take new and unexpected directions upon assimilation of the information retrieved throughout the search process. Often, the original goal of the search may be only partially fulfilled. In addition, users' information needs are generally not satisfied by a single, final retrieval of a set of documents, but rather by a series of selections and bits of information found along the way.
Furthermore, conventional search engines require separate, and often manual, processing by users that generally includes scanning result information, viewing lists of titles, reading the titles in result sets, reading the retrieved documents themselves, scanning thesaurus structures, manually constructing lists of topics related to query terms, separately documenting additional keywords associated with topics of interest, and following hypertext links within the documents related to search results. Users repeat these steps until, by chance, the users' query expression matches the search engines' underlying page ranking schemes such that the “keyword-relevant” result set corresponds to the “user-relevant information.” Users often lose track of the path taken from the initial query to the desired information. When the same search is subsequently initiated, there is no guarantee that the same search process can be reproduced to achieve the information goal.
Advertisers are directly affected by the problems experienced by users of conventional search engines. The current dominant practice associates keywords with products and services. Thus, companies end up buying hundreds of thousands of keywords so that the keywords will cover all the meanings and intents with which users may be searching the web, and then spend millions of dollars analyzing the keyword return on investment (ROI). Keyword-based advertisement creates unusual problems: for example, a Google™ search for the word “virus” returns a preponderance of pages related to computer viruses, leading related software companies to bid heavily for the keyword “virus.” This leaves no room for pharmaceutical companies selling drugs for viral infections. Thus, software producers and pharmaceutical companies are forced to compete with each other although no overlap in their respective sectors is apparent. Therefore, keyword-based advertisement does not reach intended potential customers, and severely limits Internet ad-billboard space.
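The “virus” example can be sketched, under stated assumptions, as a keyword with a limited number of advertisement slots awarded to the highest bidders. The advertiser names, bid amounts, and slot count are hypothetical, and the simple highest-bid rule stands in for whatever auction a given provider actually uses.

```python
# Hypothetical bids in cents per click for one keyword (assumed values).
BIDS = {
    "virus": [
        ("antivirus-vendor-a", 250),
        ("antivirus-vendor-b", 220),
        ("antiviral-drug-maker", 90),
    ],
}

def ads_for_keyword(bids, keyword, slots):
    """Return the advertisers shown for a keyword, highest bid first."""
    ranked = sorted(bids.get(keyword, []), key=lambda b: b[1], reverse=True)
    return [advertiser for advertiser, _ in ranked[:slots]]

# With two slots, the lower-bidding pharmaceutical advertiser is not shown,
# even for a user who is searching for information on viral infections.
shown = ads_for_keyword(BIDS, "virus", slots=2)
```

Because the slots are keyed to the keyword rather than to the user's intent, advertisers from unrelated sectors compete for the same space.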
In conventional revenue generation systems for search engines, a user performs a search and the search results are displayed along with advertisements that match keywords used in the search. If the user then clicks on an advertisement, the search engine provider receives a share of the revenue paid by the advertiser according to the cost-per-click (CPC) model. The basic premise underlying this model is that the user was satisfied with the quality and information content of the documents returned by the search results. While a search engine can sort and organize a given set of documents, it neither creates nor controls the quality of information in the documents. Thus, content owners who make high-quality searches possible are excluded from the revenue sharing equation.
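The arithmetic of the cost-per-click model described above can be sketched as follows. The click count, the cost per click, and the provider's percentage share are assumed values chosen for illustration, and amounts are held in integer cents to avoid rounding error.

```python
def advertiser_charge_cents(clicks, cost_per_click_cents):
    """Total amount the advertiser pays for the clicks received."""
    return clicks * cost_per_click_cents

def provider_share_cents(charge_cents, provider_share_percent):
    """Portion of the charge kept by the search engine provider."""
    return charge_cents * provider_share_percent // 100

# Assumed figures: 1,200 clicks at 35 cents each, provider share of 70 percent.
charge = advertiser_charge_cents(clicks=1200, cost_per_click_cents=35)
provider = provider_share_cents(charge, provider_share_percent=70)

# Under the conventional model described above, the owner of the content
# that satisfied the search receives no part of this revenue.
content_owner = 0
```

The content owner does not appear as a term anywhere in this calculation. That is the exclusion from the revenue sharing equation noted above.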