1. Field of the Invention
The present invention relates to online information filtering in general, and more particularly, information filtering of Web or Intranet searching results. The method employs a rich supervised learning paradigm by accepting relevance feedback to cluster the information and more particularly it employs an efficient user-interaction-based method of text representation and cluster neighbourhood analysis providing a personalized information filtering for online search applications.
2. The Background of the Invention
The problem of information overload overwhelms almost every Web surfer or user scouring an Intranet for information. Information seekers on the Internet go through one search engine or another to research a topic they are interested in. It is estimated that there are over 2500 search services. Some are directory-based, where the user drills down through various levels of pre-classified information to arrive at one or two documents they might be interested in. The others are keyword-driven search engines, where the user specifies the keywords that drive the search process and the search engine brings up numerous results, which the user has to browse to find out whether they are of any relevance. Most search services have a combination of both.
Most of these search engines offer few or no personalized features. The user is treated as an anonymous visitor who gets inundated with a lot of irrelevant information. For instance, if the user is searching for information about the insect, Cricket, all the documents that relate to the game of Cricket will be nonsense to the user. But most search engines have little or no facility for the user to specify and interact with the search such that they get the right type of information. If instead the user could be recognised under a Topic profile (as if his or her interests were known), the search would be far more effective at putting accurate information in front of the user.
Another big limitation with most search engines is that the amount of time and expertise spent in researching a subject area is never remembered. There is nothing like a "stop and resume" interface. The work involved in researching and judging documents as relevant and irrelevant has to be repeated over and over each time the search engine is used to look for information in that subject area.
Publicly indexed information available to a Web user is exploding as days pass by. A typical search engine throws up hundreds of results for a user query. A very good document could be at the bottom of the pile. Not all of these hundreds of results will be useful to the user. Instead the user would like the information to be presented in a classified manner either by relevancy or by the nature of the concept the documents cover and the concepts the user likes.
Information filtering algorithms are designed to sort through large volumes of dynamically generated information and present the user with those that are likely to satisfy his/her information requirement. With the growth of the Internet and other networked information, research in the development of information filtering algorithms has exploded in recent years. A number of ideas and algorithms have emerged.
Some of the earlier approaches have adopted what is known as the classical supervised learning paradigm. In this paradigm, when a new item (document) arrives, the learning agent suggests a classification, the supervisor (user) provides a classification, and the difference is used to adjust parameters of the learning algorithm. In such a paradigm, the agent's classification and the user's classification can be independent processes. The user can also give a classification even before seeing the agent's classification.
Learning itself can be either "supervised" or "unsupervised". In supervised learning networks the input and the desired output are both submitted to the network and the network has to "learn" to produce answers as close as possible to the correct answer. In an unsupervised learning network the answer for each input is omitted and the networks have to learn the correlation between input data and to organise inputs into categories from these correlations.
Supervised learning is a process that incorporates an external teacher. It employs Artificial Neural Networks that are particularly good at dealing with such ill-structured documentation handling and classification tasks that are usually characterised by a lack of pre-defined rules. The network is given a set of training patterns and the outputs are compared with desired values. The weights are modified in order to minimise the output error. Supervised algorithms rely on the principle of minimal disturbance, trying to reduce the output error with minimal disturbance to responses already learned.
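The weight adjustment described above can be illustrated with a minimal sketch. It assumes a single linear unit trained with the Widrow-Hoff (least-mean-squares) rule, which is one common supervised rule of this kind and is not necessarily the rule used by any particular prior system. The names `lms_update` and `train`, the learning rate and the two training patterns are hypothetical.

```python
def lms_update(weights, inputs, target, rate=0.1):
    """One supervised step: compare the output with the desired value
    and nudge each weight in proportion to the error (assumed LMS rule)."""
    output = sum(w * x for w, x in zip(weights, inputs))
    error = target - output
    # A small rate keeps each correction small, so responses that were
    # already learned are disturbed as little as possible.
    return [w + rate * error * x for w, x in zip(weights, inputs)]


def train(patterns, n_inputs, rate=0.1, epochs=200):
    """Present every (inputs, desired output) pattern repeatedly."""
    weights = [0.0] * n_inputs
    for _ in range(epochs):
        for inputs, target in patterns:
            weights = lms_update(weights, inputs, target, rate)
    return weights


# Hypothetical training set: +1.0 marks a relevant pattern, -1.0 an irrelevant one.
patterns = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
weights = train(patterns, n_inputs=2)
```

With these two orthogonal patterns the weights settle close to 1.0 and -1.0, so the output error on both training patterns becomes negligible.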
The application of supervised learning paradigms will improve the performance of a search system. While an unsupervised approach may be easier to implement, since it does not require external intervention, a supervised approach could provide much better results in situations where a thesaurus or a knowledge base already exists or when a human expert can interact with the system. The objective is to employ neural techniques to add the "intelligence" needed in order to fulfil the user requirements better. Systems employing these models exhibit some of the features of the biological prototypes such as the capability to learn by example and to generalise beyond the training data.
Both supervised and unsupervised approaches rely upon a technique of document representation. It is a numerical representation of the document, which is used to produce an ordered document map.
One of the standard practices of document representation in information retrieval (IR) systems is the Vector Space information paradigm. This approach encodes the document set to generate the vectors necessary to train the document map. Each document is represented as a vector (V) of weights of keywords identified in the document. The word weight is calculated using the Term Frequency*Inverse Document Frequency (TFIDF) scheme, which calculates the "interestingness" value of the word. Such formulae are used to calculate word weights and used to train the networks to create the information map.
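As a minimal sketch of the TFIDF weighting, the following assumes whitespace tokenisation, raw term counts for term frequency, and log(N / df) for inverse document frequency. Many variants of the formula exist, and the function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Hypothetical three-document collection.
docs = [
    "cricket insect chirps at night",
    "cricket match score tonight",
    "cricket bat and ball",
]
vectors = tfidf_vectors(docs)
```

Under these assumptions a word that occurs in every document, such as "cricket" here, receives a weight of zero, while a word confined to one document, such as "insect", receives the highest weight.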
Document representation techniques are used in the classification of textual documents by grouping (or clustering) similar concepts/terms as a category or topic, a process calling for cluster analysis. Two approaches to cluster analysis exist: the serial, statistical approach and the parallel, neural network approach.
In the serial approach, classes of similar documents are basically found by doing pairwise comparisons among all of the key elements. This clustering technique is serial in nature in that pairwise comparisons are made one at a time and the classification structure is created in a serial order. The parallel neural network approach is based on establishing multiple connections among the documents and clusters of documents allowing for independent, parallel comparisons.
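The serial approach can be sketched as follows, assuming documents are already represented as keyword-weight vectors, that similarity is measured by cosine, and that two documents are grouped when their similarity reaches a fixed threshold. The threshold, the merge rule and the names `cosine` and `serial_cluster` are assumptions made for illustration only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def serial_cluster(vectors, threshold=0.5):
    """Compare every pair one at a time and merge the groups of similar pairs."""
    labels = list(range(len(vectors)))  # each document starts in its own class
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            old, new = labels[j], labels[i]
            labels = [new if lab == old else lab for lab in labels]
    return labels


# Hypothetical keyword-weight vectors for three documents.
vectors = [
    {"cricket": 1.0, "insect": 2.0},
    {"cricket": 1.0, "insect": 1.5},
    {"cricket": 1.0, "match": 2.0},
]
labels = serial_cluster(vectors)
```

Here the first two documents end up in the same class and the third stays on its own. The number of comparisons grows quadratically with the number of documents, which is the cost the parallel approach tries to avoid.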
A significant number of text-based classification algorithms for documents are based on supervised learning techniques such as Bayesian probability, decision trees or rule induction, linear discriminant analysis, logistic regression, and backpropagation-like neural networks.
In spite of the many complex techniques researched to solve the problem of information filtering, searching (especially on the Internet) remains an unresolved problem.
Accordingly, it is the object of our invention to provide a more effective and efficient neural network based supervised learning process that learns incrementally as documents arrive and the user grades them by providing feedback to the learning agent. The technique described in this invention can be called Supervised Clustering Analysis.
There is a need for personalization of the information searching either on the Web environment or in an Intranet supporting user participation in grading and ranking documents.
There is a need for automatic learning agents, which build subject profiles in which a user's feedback is captured and used to learn about the user's interests for pushing the right information and collaborating with others.
These and other needs are attained by the present invention, where whenever a new document arrives, a learning agent suggests a classification and also provides an explanation by pointing out the main key-phrases of the document responsible for its classification.
The user looks at this, gives his/her classification and then provides hints by showing a list of key-phrases that, according to the user, are truly responsible for the classification suggested by the user. In this way the user's classification is truly of a feedback nature.
The algorithm behind the invention has three major parts.
Document Representation
Feedback and Learning
Classification
This information filtering process is designed to sort through large volumes of dynamically generated textual information, to incrementally learn as new text documents arrive and the user grades them by providing feedback.
Text-based documents, either dynamically retrieved from the Web or available in a textual repository on an Intranet, are represented by applying keyword weightings after capturing the user's reasoning for classifying the document as relevant or irrelevant.
When a new item (document) arrives, the learning agent suggests a classification and also provides an explanation by pointing out the main features (key-phrases) of the item (document) responsible for its classification. The user looks at this and provides hints by showing a list of features (key-phrases) that are truly responsible for a particular way of classifying the document.
This interaction method contributes to the learning process. The apparatus includes a feedback-based clustering scheme that models the user's interest profiles and a simple neural adaptation method for learning the cluster centres, to provide personalized information filtering for information seekers.
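The interaction between classification, user hints and cluster-centre learning can be sketched as follows. This passage does not state the update rule, so the sketch assumes three things: the agent suggests the class whose centre is nearest to the document vector, the user's key-phrases are emphasised by multiplying their weights, and the chosen centre is moved a fraction of the way towards the hinted vector. All names (`classify`, `apply_hints`, `adapt_centre`), the boost factor and the rate are hypothetical.

```python
def sq_distance(u, v):
    """Squared Euclidean distance between two sparse {term: weight} vectors."""
    return sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in set(u) | set(v))


def classify(centres, doc):
    """Suggest the class whose cluster centre lies nearest to the document."""
    return min(centres, key=lambda label: sq_distance(centres[label], doc))


def apply_hints(doc, key_phrases, boost=2.0):
    """Emphasise the key-phrases the user says drive the classification."""
    return {t: w * boost if t in key_phrases else w for t, w in doc.items()}


def adapt_centre(centre, doc, rate=0.5):
    """Move the centre a fraction `rate` of the way towards the document."""
    terms = set(centre) | set(doc)
    return {t: centre.get(t, 0.0) + rate * (doc.get(t, 0.0) - centre.get(t, 0.0))
            for t in terms}


# Hypothetical interest profile with one centre per class.
centres = {"relevant": {"insect": 1.0}, "irrelevant": {"match": 1.0}}
doc = {"cricket": 1.0, "chirp": 1.0, "match": 0.5}

suggested = classify(centres, doc)           # the agent's suggestion
hinted = apply_hints(doc, {"chirp"})         # the user points at "chirp"
centres["relevant"] = adapt_centre(centres["relevant"], hinted)  # user grades it relevant
after_feedback = classify(centres, doc)
```

In this example the agent first suggests "irrelevant" because of the word "match". After the user grades the document as relevant and points at "chirp", the relevant centre moves towards the hinted vector and the same document is then classified as "relevant".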