With the advent of modern technology, including the Internet, a wealth of information is available to computer users. Users can automatically retrieve a multitude of different documents by searching the Internet. However, the wealth of information has become so overwhelming that there is a need to organize, classify, and filter information according to different criteria.
It is common for computer users connected to the Internet to utilize web browsers and search engines to locate web pages of particular interest. Search engines, such as Google, index hundreds of millions of web pages maintained by computers all over the world. The users compose queries, and the search engine identifies pages that match the queries according to the subject matter of the pages.
In many instances, particularly when a query is short, broad, or not well defined, the result set can be overwhelmingly large, for example thousands of pages. Furthermore, many of the pages returned are irrelevant and not of a quality suitable to provide the desired information. This is because “quality” is in practice impossible to define in general, whether explicitly, or through a series of steps of a computer program.
Many companies and researchers have developed methods that use the text of a document to identify its topic automatically. This process is called text categorization or text classification. For example, a press release may be categorized automatically as concerning the computer industry or the automobile industry. Such methods of text classification group articles or documents according to subject matter, not quality.
Some ranking approaches utilize user feedback. These approaches require users to supply relevance information in order to improve the ranking iteratively. However, studies have shown that users are generally reluctant to provide relevance feedback.
Within the context of email, it is also known in the art to utilize text routing or filtering in order to classify and select messages. Text routing is the process of deciding where or to whom to send a message or document. Such a classification system utilizes criteria based on the desired recipient. One common application of text filtering is to identify low-priority email messages automatically. The purpose of such methods is generally to identify unsolicited commercial email. For instance, unwanted advertising has become a problem endemic to email, with users receiving vast amounts of unwanted email, known as ‘spam’. Such documents are undesirable because the recipient has no interest in receiving such correspondence.
It is also known within the art that many email carriers may automatically filter such correspondence. For instance, the Hotmail service of Microsoft may categorize messages that are sent to a number of email addresses, rather than to a single recipient, into a folder marked “Bulk Mail”. While the intended recipients may desire to read such emails, they are categorized and placed in a different folder automatically because of the number of intended recipients or the sender's email identity. While sorting according to the identity of the sender or the number of recipients represents an advancement in the art, it is still problematic in that it applies only to email and does not provide a fine-grained ranking of messages.
A few companies and researchers have developed software methods that attempt to predict how an individual user will perceive the relevance of a document. The major drawback of these methods is that they require detailed information about the preferences of each user in order to be beneficial for that user.
Learning processes are also known within the art, wherein a program is capable of learning or remembering which documents may be preferred by a user. However, to date these technologies have faced similar problems in that they are generally topic based or user/recipient based. That is to say, documents are deemed desirable or undesirable because of their subject matter or because of the sender's or receiver's identity. So, while such processes represent advancements, there is a need for a system and method that utilizes a learning process in order to select documents according to their quality, rather than topic or user/recipient identity.
Another problem with the aforementioned technology is that, because of the sheer amount of information being delivered, it is impractical for wireless and telephony applications. In many of these applications, bandwidth for transmitting information to a device is limited, expensive, or both. Additionally, many of these applications use devices, whether screen based, voice based, or otherwise, that can present only a limited amount of information to the user. By filtering and limiting the result set of a query to only information of a high quality, as performed by the invention described herein, the restricted bandwidth and restricted presentation capacity can be used more efficiently.
There also exists a need for a method and system capable of filtering documents according to their quality when not connected to the Internet. For instance, many companies with a vast array of internal documents may desire to select certain documents not only according to their subject matter, but also according to their quality.