Today, users are overwhelmed by information. Information overload is a problem for two reasons.
The first reason is that it requires a “knowledge worker” to locate pertinent information. The second reason is that pertinent information is seldom found, because the search is abandoned before the right information is located.
According to Outsell, July 2001, “In today's business, knowledge workers spend an average of 10 hours per week searching for information”.
At a very basic level, a knowledge worker uses a search engine to look for information. The search engine looks for results by matching the worker's query with information that is tagged or indexed within a plurality of documents. Today the “tagged information” is created manually. Because manual tagging is very expensive and time-consuming, much of the available information is not tagged, and when it is tagged, the tagging is not done at a granular level. The granular level refers to a level that is more specific and fine-tuned than a non-granular level. The outcome is that the knowledge worker cannot find the information at the right time, because the information he or she seeks has not been tagged or identified within the plurality of documents.
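The matching step described above can be illustrated with a minimal sketch. It assumes a simple in-memory index from tags to documents; the names `build_tag_index` and `search`, and the sample documents, are hypothetical and chosen only for illustration. The sketch shows why an untagged document is never returned, whatever the query.

```python
from collections import defaultdict

def build_tag_index(tagged_docs):
    """Map each (lower-cased) tag to the set of document ids carrying it."""
    index = defaultdict(set)
    for doc_id, tags in tagged_docs.items():
        for tag in tags:
            index[tag.lower()].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents tagged with every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample data: doc3 exists but was never tagged manually.
tagged_docs = {
    "doc1": ["finance", "report"],
    "doc2": ["finance"],
    "doc3": [],
}
index = build_tag_index(tagged_docs)

print(search(index, "finance"))
print(search(index, "finance report"))
```

In this sketch `doc3` cannot be reached by any query, which mirrors the situation in which information that was never tagged stays invisible to the knowledge worker.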
Two types of approach are available in the domain of text categorization. A first approach is a categorization based on keywords. A second approach is a categorization based on data from texts of a pre-categorized training corpus.
Both approaches have their pros and cons. The keyword approach provides acceptable results as long as the manually identified keywords are found in the text. In contrast, the statistical approach, which uses all the words of the texts in a training corpus, must be able to distinguish accurate returns from a much larger group of inaccurate returns. However, both approaches are limited when faced with ambiguity resolution with respect to the language and taxonomy used.
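The contrast between the two approaches can be sketched as follows. This is only an illustration under stated assumptions: the keyword lists, the training corpus, and the function names are hypothetical, and a small naive Bayes classifier with add-one smoothing stands in for the statistical approach, which the text above does not specify further.

```python
import math
from collections import Counter, defaultdict

def categorize_by_keywords(text, keywords_by_category):
    """Keyword approach: pick the category whose keywords occur most often."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in keywords_by_category.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: no keyword was found

def train(corpus):
    """Count words per category in a pre-categorized training corpus."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in corpus:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts

def categorize_statistically(text, word_counts, doc_counts):
    """Statistical approach (assumed here: naive Bayes, add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_cat, best_score = None, -math.inf
    for cat in doc_counts:
        total_words = sum(word_counts[cat].values())
        score = math.log(doc_counts[cat] / total_docs)
        for w in text.lower().split():
            score += math.log((word_counts[cat][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Hypothetical keywords and training corpus.
keywords = {"finance": {"bank", "loan"}, "sports": {"match", "goal"}}
corpus = [
    ("the bank approved the loan", "finance"),
    ("interest rates rose at the bank", "finance"),
    ("the team won the match", "sports"),
    ("a late goal decided the match", "sports"),
]
word_counts, doc_counts = train(corpus)

print(categorize_by_keywords("rates rose again", keywords))
print(categorize_statistically("rates rose again", word_counts, doc_counts))
print(categorize_by_keywords("we walked along the river bank", keywords))
```

With this toy data the keyword approach returns nothing for a text that contains none of the listed keywords, while the statistical stand-in still assigns a category. Both assign the finance category to the sentence about a river bank, which is the kind of ambiguity neither approach resolves on its own.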
There is therefore a need for a method and apparatus that will overcome the above-identified drawbacks.