1. Field of the Invention
The present invention relates to search engines, and more particularly, to search engine methods and systems that provide relevant and timely topics.
2. Background of the Invention
The world economic order is shifting from one based on manufacturing to one based on the generation, organization and use of information. To successfully manage this transition, organizations must collect and classify vast amounts of data so that it may be searched and retrieved in a meaningful manner. Traditional techniques to classify data may be divided into four approaches: (1) manual; (2) unsupervised learning; (3) supervised learning; and (4) hybrid approaches.
Manual classification relies on individuals reviewing and indexing data against a predetermined list of categories. For example, the National Library of Medicine's MEDLINE® (Medical Literature, Analysis, and Retrieval System Online) database of journal articles uses this approach. While manual approaches benefit from the ability of humans to determine what concepts the data represent, they also suffer from the drawbacks of high cost, human error and a relatively low rate of processing.
Unsupervised classification techniques rely on computer software to examine the content of data and make initial judgments as to which classification the data belong to. Many unsupervised classification technologies rely on Bayesian clustering algorithms. While reducing the cost of analyzing large data collections, unsupervised learning techniques often return classifications that have no obvious basis in the underlying business or technical aspects of the data. This disconnect between the data's business or technical framework and the derived classifications makes it difficult for users to effectively query the resulting classifications.
Supervised classification techniques attempt to overcome this drawback by relying on individuals to “train” the classification engines so that derived classifications more closely reflect what a human would produce. Illustrative supervised classification technologies include semantic networks and neural networks. While supervised systems generally derive classifications more attuned to what a human would generate, they often require substantial training and tuning by expert operators and, in addition, often rely for their results on data that is more consistent or homogeneous than is often possible to obtain in practice.
Hybrid systems attempt to fuse the benefits of manual classification methods with the speed and processing capabilities employed by unsupervised and supervised systems. In known hybrid systems, human operators are used to derive “rules of thumb” which drive the underlying classification engines.
No known data classification approach provides a fast, low-cost and substantially automated means to classify large amounts of data that is consistent with the semantic content of the data itself. Thus, it would be beneficial to provide a mechanism to determine a collection of topics that are explicitly related to both the domain of interest and the data corpus analyzed. Commonly owned, co-pending U.S. patent application, Ser. No. 10/086,026, entitled Topic Identification and Use Thereof in Information Retrieval Systems, filed on Feb. 26, 2002 by Paul Odom, provides such a mechanism.
At the same time, the emergence of the Information Age has created a wealth of information that is available electronically. Unfortunately, much of this information is often inaccessible to individuals because they do not know where to look for it or, if they do know where to look, the information cannot be found efficiently. For example, suppose an individual is working at his desk and his boss requests that he find an electronic copy of a memo that the individual sent last month. The memo contains information that was obtained from a website, which included a spreadsheet that had data extracted from a division report.
The boss would like the individual to send a copy of the email and the references back to him as soon as possible. Also, he would like the individual to check for additional references to see if the conclusions in the memo need to be updated. The boss requires that the project be completed within fifteen minutes. The worker is not disorganized but, as is common, does not have total recall of how the information was gathered or where the email is stored. After thirty minutes, the worker finally finds the email. But the worker still needs to search for additional information as requested by his boss. The end result is that, because no efficient search mechanism existed, the worker has missed his boss's deadline.
Situations like the above example commonly occur within the workplace, and involve not just email, but all forms of electronically stored information. Studies of office workers show that it is not unusual for some workers to spend more than 10% of each work day looking for information. The same studies claim that fewer than half of those searches are successful. Databases, data warehouses, document management systems, and file searches are often too difficult or too “hit and miss” to be used effectively and efficiently. Corporate enterprises and government organizations have spent billions of dollars to aggregate and integrate information so that it will be more accessible. Of course, an individual can get answers if he is a database or document system expert and if he remembers the exact title, the exact phrasing used in the document, or the ever elusive primary key associated with the document of interest. Unfortunately, more often than not, this level of detail is not available to assist in finding the information.
Internet-based searches are oftentimes even more frustrating and less productive. For example, it is not particularly useful to learn that there are approximately 6,120,000 answers to the search criteria just entered. Ads associated with search engines are also often frustratingly irrelevant to a search, and are therefore of little interest to users and of minimal value to the advertiser. Search engine ads attempt to identify promising content with which to be associated. Unfortunately, these associations are often not very relevant either. For example, a user who enters “plasma injectors” may receive several ads for plasma televisions. Individuals have learned that keyword ads are not usually very useful, so they often ignore keyword ads completely.
Furthermore, because website popularity has nothing to do with what might be relevant among the thousands of search results, search results driven by website popularity often prove useless. Meanwhile, at the search engine's operations facility, an army of personnel and massive server farms hum away to deliver potentially hundreds of thousands of results for every search query that an individual enters.
Web searching, search advertising, and enterprise searching are not consistently providing acceptable search resolution for the user. The missing ingredient in current search technology is “true relevance”. Relevance can only be defined by the user for a specific search. Relevancy has no predictable pattern. No generalized algorithm is going to repeatably produce relevant information, because in the end, any generalization is arbitrary.
What has occurred, so far in the industry, is a fragmentation of search applications as vendors try to address niche search markets in an attempt to improve relevancy by narrowing the domain. For example, sites that are product specific, area-of-interest specific, group specific, or subject specific, have all been implemented. So far, there have been no successful generalized search applications that consistently provide high levels of relevancy.
Present search and topification algorithms generally assume that topics are relatively static. However, the relevance of topics to a particular search query is based not only on what appears in the content of the query, but can also be a function of current events. For example, if an individual had conducted a search of the Internet in January 2006 using the search string “NFL,” then one would expect the topics Denver vs. Pittsburgh and Charlotte vs. Seattle to be of interest, since these were the team pairings in the American Football Conference and National Football Conference championship games, respectively. This set of topics is time sensitive to the playoffs. While a search engine may have these topics in its database, these topics would be part of tens of thousands of possible topic results for a query using the term “NFL.” During the January 2006 time frame, the “Denver vs. Pittsburgh” and “Charlotte vs. Seattle” topics would likely be very meaningful topic results. Unfortunately, search engines do not directly factor in time relevancy, and these topics would be mixed in with the tens of thousands of other possible topic results. Thus, a user would not likely receive search results as relevant as would be desired.
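By way of illustration only, the effect of factoring time relevancy into topic ranking can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular search engine or of the claimed method: the field names (`base_score`, `active_from`, `active_to`, `boost`), the scores, and the January 2006 date window are all hypothetical values chosen to mirror the “NFL” example above.

```python
from datetime import date

def time_weight(topic, today):
    """Return the topic's boost if `today` falls inside its active window, else 1.0.

    The active window and boost are hypothetical fields used for illustration.
    """
    start = topic.get("active_from")
    end = topic.get("active_to")
    if start is not None and end is not None and start <= today <= end:
        return topic.get("boost", 2.0)
    return 1.0

def rank_topics(topics, today):
    """Order topics by static score multiplied by the time-dependent weight."""
    return sorted(
        topics,
        key=lambda t: t["base_score"] * time_weight(t, today),
        reverse=True,
    )

# Hypothetical topic records for the query "NFL"; all numbers are made up.
playoff_window = {
    "active_from": date(2006, 1, 1),
    "active_to": date(2006, 1, 31),
    "boost": 3.0,
}
topics = [
    {"name": "NFL history", "base_score": 0.9},
    {"name": "Denver vs. Pittsburgh", "base_score": 0.5, **playoff_window},
    {"name": "Charlotte vs. Seattle", "base_score": 0.5, **playoff_window},
    {"name": "NFL draft", "base_score": 0.7},
]

for topic in rank_topics(topics, date(2006, 1, 20)):
    print(topic["name"])
```

With a query date inside the hypothetical window, the two playoff topics rank ahead of the static topics; with a date outside the window, they fall back to the order given by their static scores alone.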
Another shortcoming of current search engines that display topics or search results is that they do not display topics associated with every subject matter domain related to a search constraint entered by a user. Rather, a search engine may show only the search results or topics that are most popular, without regard to the different subject matter domains to which the search results may belong. For example, suppose a user enters the search constraint Jaguar. The data items belonging to the search results may include topics that correspond to subject matter domains including autos (e.g., there is a car named Jaguar), animals (e.g., there is an animal called Jaguar), software (e.g., there is a software package referred to as Jaguar), resorts (e.g., there are resorts in South America referred to as Jaguar resorts), football (e.g., there is a football team referred to as the Jacksonville Jaguars) and games (e.g., there is a game referred to as Jaguar). Search engines that provide results based only on the popularity of website hits might display only topics or search results associated with the subject matter domain autos. Or, at the very least, items associated with resorts would be on page 27 of the search results. More often than not, a user probably would be looking for data items in the subject matter domain autos. However, a reasonable proportion of users may also be interested in other domains that may be less popular. For these users, the search results displayed would not be particularly relevant, and their specific areas of interest would be difficult to find. Thus, a user once again may not receive search results relevant to his or her particular area of interest.
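The difference between popularity-only ordering and an ordering that covers every subject matter domain can be illustrated with a short sketch. The sketch below is offered only as an example under stated assumptions: the record fields (`title`, `domain`, `popularity`), the sample items, and the round-robin interleaving strategy are hypothetical and are not drawn from any particular search engine.

```python
from collections import defaultdict
from itertools import zip_longest

def cover_domains(results, limit):
    """Interleave results so that each domain's best item is shown before
    any domain's second-best item (a simple round-robin, assumed here)."""
    by_domain = defaultdict(list)
    # Group items by domain, most popular first within each domain.
    for item in sorted(results, key=lambda r: r["popularity"], reverse=True):
        by_domain[item["domain"]].append(item)
    # Take one item per domain per round; shorter lists are padded with None.
    interleaved = [
        item
        for tier in zip_longest(*by_domain.values())
        for item in tier
        if item is not None
    ]
    return interleaved[:limit]

# Hypothetical results for the search constraint "Jaguar"; numbers are made up.
results = [
    {"title": "Jaguar cars", "domain": "autos", "popularity": 100},
    {"title": "Jaguar dealers", "domain": "autos", "popularity": 95},
    {"title": "Jaguar car reviews", "domain": "autos", "popularity": 90},
    {"title": "Jacksonville Jaguars", "domain": "football", "popularity": 80},
    {"title": "Jaguar (animal)", "domain": "animals", "popularity": 60},
    {"title": "Jaguar software", "domain": "software", "popularity": 30},
    {"title": "Jaguar game", "domain": "games", "popularity": 20},
    {"title": "Jaguar resorts", "domain": "resorts", "popularity": 5},
]

by_popularity = sorted(results, key=lambda r: r["popularity"], reverse=True)[:6]
by_coverage = cover_domains(results, 6)

print([r["domain"] for r in by_popularity])
print([r["domain"] for r in by_coverage])
```

In this made-up data set, the first six items by popularity alone include three autos items and omit games and resorts entirely, whereas the interleaved ordering shows one item from each of the six domains.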
What are needed are search methods and systems that can efficiently generate search results to identify and display topics by considering, at any given time, the relative significance of a topic based on current events and that ensure coverage of all subject matter domains associated with a search constraint.