The internet contains vast numbers of web pages stored in computer files located all over the world. More and more files are constantly being created and placed on the internet. The vast number of internet files and the speed at which the internet is growing make it impossible to use human labor to classify and organize those files into meaningful categories. Yet there currently exists no system that will automatically analyze web pages or computer files and arrange them into meaningful categories that will facilitate the retrieval of relevant information from the internet or intranets.
Yahoo (www.yahoo.com) is a popular search engine that manually classifies web pages into subjects (such as Arts & Humanities, Business & Economy, Computers & Internet, and Education, each of which is further classified into sub-categories, thereby forming a directory structure). The manual classification process usually begins with users who submit suggested subjects for their web sites or web pages. The sites are then placed in categories by people (called Surfers) who visit and evaluate the suggestions and decide where they best belong. By using this manual process, Yahoo ensures the classification is done in the best humanly possible way. However, since the manual process is labor intensive and relatively slow compared to the rapid growth of web pages, Yahoo can now classify only a small percentage of web pages (estimated to be less than 10%). This manual process simply cannot keep up with the explosive growth of the web. Thus, the percentage of manually classified web pages is estimated to be getting smaller and smaller.
Most search engines (such as AltaVista, Excite, Go (formerly Infoseek), DirectHit, Google, and Lycos) do not provide classification of web pages (or provide only rudimentary manual grouping of a small number of pages). With the exception of DirectHit, these search engines rank search results based on factors such as the location of the keywords and the number of occurrences of the keywords. For example, if the keywords are located in the title of a web page, then the web page is ranked higher than other web pages that contain the same keywords in the body.
DirectHit (www.directhit.com), on the other hand, ranks search results based on the usage history of millions of Internet searchers. This ranking is based on a number of usage factors, such as the number of users who select a web page and the amount of time the users spend at the web page. By presenting the higher ranked pages first, one can see and find the most popular pages or sites.
Northern Light (www.northernlight.com) is one of the first search engines to incorporate automatic web-page classification. Northern Light organizes search results into categories by subject, type, source, and language. The categories are arranged into hierarchical folders much like a directory structure. The arrangements and the choices of the categories are unique to each search and generated based on the results of the search.
The automated categorization of web documents has been investigated for many years. For example, Northern Light received U.S. Pat. No. 5,924,090 for their classification mechanisms. Mladenic (1998) (citations for all references given herein are provided at the end of this specification) has investigated the automatic construction of web directories, such as Yahoo. In a similar application, Craven et al. (1998) applied first-order inductive learning techniques to automatically populate an ontology of classes and relations of interest to users. Pazzani and Billsus (1997) applied Bayesian classifiers to the creation and revision of user profiles. WebWatcher (Joachims et al., 1997) performs as a learning apprentice that perceives a user's actions when browsing on the Internet, and learns to rate links on the basis of the current page and the user's interests. For constructing web page classifiers, several solutions have been proposed in the literature, such as Bayesian classifiers (Pazzani & Billsus, 1997), decision trees (Apte et al., 1994), adaptations of Rocchio's algorithm to text categorization (Ittner et al., 1995), and k-nearest neighbor (Masand et al., 1992). An empirical comparison of these techniques has been performed by Pazzani and Billsus (1997). The conclusion was that the Bayesian approach leads to performance at least as good as the other approaches.
The prior art also includes methods of text learning and document classification. Text learning techniques are used to extract (or abstract) key information from documents, and the extracted information is used to represent (or describe) a document or a category in a concise way. A simple but limited document representation (or description) is the bag-of-words technique (Koller 1998, Lang 1995). To represent a given document, the technique simply extracts key words from the document and uses those words as the representation of that document. To make the representation concise, many common words (also called stop words), such as pronouns, conjunctions and articles, are not included in the representation.
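By way of illustration only, the bag-of-words representation described above can be sketched in a few lines of Python. The function name `bag_of_words` and the short stop-word list are assumptions made for this sketch; they are not taken from the cited references, and practical systems use far longer stop-word lists.

```python
import re

# Illustrative stop-word list (an assumption of this sketch, not from the references).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "to",
              "is", "was", "it", "he", "she", "they", "that", "this", "with"}

def bag_of_words(text):
    """Return the set of lowercased words in text that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

words = bag_of_words("The portrait hung in the house on the plantation.")
print(sorted(words))
```

The result is an unordered set of key words, which is why the representation is called a "bag": word order and word counts are discarded.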
Various derivatives from the bag-of-words technique have also been proposed. For example, Mladenic (1998) extends the bag-of-words concept to a bag-of-phrases, which was shown by Chan (1999) to be a better choice than using single words. Experiments have shown that a phrase consisting of two to three words is sufficient in most classification systems.
Another extension of this concept is to associate each phrase (or term) with a weight that indicates the number of occurrences of that phrase in the document (Salton 1987). To increase the accuracy of counting the occurrences, the various forms of a word, such as its plural or past tense, are treated as the same as the original word by a process called “stemming.” Each phrase together with its associated weight is considered a feature of the document. All the extracted features of a document are grouped to form a vector called a “feature vector” representation of that document.
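The weighting and stemming steps can be illustrated with the following minimal Python sketch. It counts single words rather than multi-word phrases for brevity. The function `naive_stem` is a deliberately crude suffix stripper written for this illustration; it is an assumption standing in for a real stemming algorithm and is not the method of any cited reference.

```python
import re
from collections import Counter

def naive_stem(word):
    """Strip a few common English suffixes (crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left unchanged.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_weights(text, stop_words=frozenset({"the", "a", "an", "and", "of", "in"})):
    """Map each stemmed, non-stop word to its number of occurrences in text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in stop_words)

weights = term_weights("The houses and the house in Louisiana")
print(weights["house"], weights["louisiana"])
```

In this sketch, "houses" and "house" are counted under the same stem, so the weight of that feature reflects both occurrences.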
As an example, assume the block of text seen on the left in FIG. 1 represents a text file. The chart on the right in FIG. 1 represents the number of occurrences of particular words in the text. One possible way to form a feature vector representing this text would be to list the number of occurrences of each key (i.e., uncommon) word. However, because of the large number of different words appearing in an average text document, typically only a limited number of the most frequently used words will be selected as features. Thus, if the features chosen to represent the document in FIG. 1 were “plantation”, “Louisiana”, “house”, “portrait” and “fireplace”, the feature vector could be represented as (2, 2, 1, 1, 1). It is also typical to normalize the feature values, for example, by dividing each feature value by the sum of the feature values (in this case 7), thus giving the example feature vector as (0.29, 0.29, 0.14, 0.14, 0.14). Obtaining a feature vector representative of multiple files is accomplished by a normalized sum of the individual feature vectors, e.g., let C be the normalized sum of vectors A and B, then
$$C_i = \frac{A_i + B_i}{\sum (A_i + B_i)} \qquad (1)$$

Likewise, the similarity of vectors A and B may be determined by their dot product or

$$\frac{\sum (A_i \times B_i)}{|A| \times |B|} \qquad (2)$$

While a text file was given as the preceding example, it will of course be understood that a feature vector could represent a webpage or any other electronic document or item of information.
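The normalization, the normalized sum of equation (1), and the similarity measure of equation (2) can be sketched in Python as follows. The function names are assumptions made for this illustration. Reading the denominator of equation (2) as the product of the lengths of the two vectors is also an assumption of this sketch, and the vectors are assumed to be non-zero.

```python
import math

def normalize(v):
    """Divide each feature value by the sum of all feature values."""
    total = sum(v)
    return [x / total for x in v]

def normalized_sum(a, b):
    """Equation (1): C_i = (A_i + B_i) / sum(A_i + B_i)."""
    total = sum(x + y for x, y in zip(a, b))
    return [(x + y) / total for x, y in zip(a, b)]

def similarity(a, b):
    """Equation (2): dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc = normalize([2, 2, 1, 1, 1])      # the FIG. 1 example vector
print([round(x, 2) for x in doc])     # [0.29, 0.29, 0.14, 0.14, 0.14]
```

Under this reading, two vectors pointing in the same direction have a similarity of 1, and two vectors sharing no features have a similarity of 0.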
One way to represent a category or a folder representing many files is by using a vector representation similar to that described above for documents. In this case, a set of training documents for a category is provided. Text learning techniques extract the common terms among the documents and use those terms to form a vector representation of the category. One such technique is called Term Frequency Inverse Document Frequency (TFIDF) (Salton 1987). The TFIDF representation extends the feature vector concept further to account for the number of occurrences of a term in all training documents. It represents each category as a vector of terms that are abstracted from all training documents. Each training document Dj is represented by a vector Vj, and each element of the vector Vj is the product of the term frequency TF(Wi, Dj) and the inverse document frequency IDF(Wi), where TF(Wi, Dj) is the number of occurrences of the term Wi in the document Dj. IDF(Wi) is the product of the total number of training documents T and the inverse of DF(Wi), where DF(Wi) is the number of documents containing the term Wi. That is:
$$\mathrm{IDF}(W_i) = \frac{T}{\mathrm{DF}(W_i)}$$
The logarithm log(T/DF(Wi)) is often used instead of the simple product. A single vector is formed by combining all the vectors Vj, where j ranges from 1 to T. Each element of the single vector is the average value of all the corresponding elements in Vj (j from 1 to T). Other more sophisticated techniques are available, such as PrTFIDF (Joachims 1997). Joachims extended the TFIDF representation into a probabilistic setting by incorporating probabilistic techniques into the simple TFIDF.
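The construction of a TFIDF category vector described above can be sketched in Python as follows. The function name `tfidf_category_vector`, its `use_log` parameter, and the toy training documents are assumptions made for this illustration. The sketch further assumes that the T training documents are those supplied for a single category and that each document is already tokenized.

```python
import math
from collections import Counter

def tfidf_category_vector(training_docs, use_log=False):
    """Average the TFIDF vectors Vj of the T training documents of one category.

    training_docs: list of token lists, one list per training document.
    Returns (vocab, vector), where vector[i] corresponds to term vocab[i].
    """
    T = len(training_docs)
    df = Counter()                     # DF(Wi): number of documents containing Wi
    for doc in training_docs:
        df.update(set(doc))
    vocab = sorted(df)

    def idf(w):
        ratio = T / df[w]              # IDF(Wi) = T / DF(Wi)
        return math.log(ratio) if use_log else ratio

    vectors = []
    for doc in training_docs:
        tf = Counter(doc)              # TF(Wi, Dj): occurrences of Wi in Dj
        vectors.append([tf[w] * idf(w) for w in vocab])

    # Each element of the category vector is the average over all Vj.
    category = [sum(col) / T for col in zip(*vectors)]
    return vocab, category

docs = [["house", "house", "portrait"], ["house", "fireplace"]]
vocab, vector = tfidf_category_vector(docs)
print(vocab, vector)
```

With the logarithmic variant (`use_log=True`), a term that appears in every training document receives a weight of zero, since log(T/T) = 0.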
Once each category is represented by a vector and a document is also represented by a vector, classifying the document is done by comparing the vector of the document to the vector of each category. The dot product (equation 2) between the vectors is usually used in the comparison. The result of the dot product is a value which is used to measure the similarity between the document and a category. The document is assigned to the category that results in the highest similarity among all the categories. Other more sophisticated classification algorithms and models have been proposed, including multivariate regression models (Fuhr 1991, Schutze 1995), nearest neighbor classifiers (Yang 1997), Bayesian classifiers (Lewis 1994), decision trees (Lewis 1994), Support Vector Machines (Dumais 2000, Joachims 1998), and voted classification (Weiss 1999). Tree structures appear in all of these systems. Some proposed systems focus on classification algorithms to improve the accuracy of assigning documents to catalogs (Joachims 1997), while others take the classification structure into account (Koller 1998). Nevertheless, there are many improvements which are still needed in conventional classification systems.
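The basic vector-comparison classifier described at the start of the preceding paragraph can be sketched in Python as follows. The function names, the category names, and the numeric vectors are hypothetical values chosen for this illustration. The similarity function divides the dot product by the product of the vector lengths, which is an assumed reading of equation (2), and all vectors are assumed to be non-zero and to share the same feature ordering.

```python
import math

def similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def classify(doc_vector, category_vectors):
    """Return the name of the category whose vector is most similar to the document."""
    return max(category_vectors,
               key=lambda name: similarity(doc_vector, category_vectors[name]))

# Hypothetical category vectors over three shared features.
categories = {
    "architecture": [0.5, 0.4, 0.1],
    "cooking": [0.0, 0.1, 0.9],
}
print(classify([2, 2, 1], categories))   # architecture
```

This simple scheme compares the document against every category vector, so its cost grows linearly with the number of categories.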