The present invention relates generally to classifying information. More particularly, the invention provides a method and system for automatic classification or clustering, such as automatic classification or clustering of documents. In one aspect, the present invention provides a way of classification that correctly associates items (e.g., documents) to be classified with one or more appropriate pre-defined categories, which define the items. Clustering can be used to group items into clusters, which serve as categories.
In the early days, ancient people scribed markings that served as information onto natural objects such as stone and wood. That is, cave people cut markings of trees, people, and wild animals onto the sides of cave walls. These caves with markings have stood in place for thousands of years to preserve a piece of history by way of the markings. Unfortunately, it was difficult to scribe markings on cave walls, which were extremely hard and prone to damage. It was also impossible to transport such walls, which were fixed. Accordingly, people began to scribe markings on wild animal skins after they were removed and cured. The skins were easily transportable and also much easier to scribe than hard cave walls. Wild animals, however, became scarce and often difficult to find and hunt down due to increasing populations of people, which tended to reduce the wild animal population. Additionally, the skins often became damaged and rancid when subjected to wetness from rain and snow.
Thin sheets of wood eventually served as a medium for markings. The wood became thinner and eventually “paper” was discovered. Wood pulp that was bound together formed conventional paper and paper products. Paper was used to preserve large amounts of written information. Paper was often bundled and bound in the form of books. Books were easily transportable and also easy to read. Thereafter, computers were used to form and store information in the form of electronic files. Such files were easily stored on hard disk media. People began connecting these computers together in the form of a network and then began sending information from one computer to another in the form of electronic files. The networks were first local to a specific office or region. Then, the networks grew and eventually connected computers all over the world. A commonly known network for connecting such computers in a worldwide manner is the Internet.
Now that millions of people have computers, which are connected to other computers over the Internet, there has been an explosion of information that can be accessed at the touch of a button. With the explosion of such information, people began categorizing documents into categories. Such categories were frequently used in classification and also used in clustering. As merely an example, U.S. Pat. Nos. 5,787,422 and 5,999,927, in the name of Tukey, et al. (herein “Tukey”), describe techniques for defining such categories in a conventional manner. The conventional technique, however, only placed an item into a single category. The conventional technique also placed a number of items into the best fitting and second best fitting clusters. Other conventional techniques placed a single item into many categories, which tends to overly dilute the value of classification for users.
Such conventional techniques often fail to recognize a significant fraction of the content of the item, or fail to provide meaningful information due to combinatorial explosion. Such failings, which were discovered by the present inventor, are described in more detail below. Other related art includes the following: Fraley, C. and A. E. Raftery, How many clusters? Which clustering method? Answers via model-based cluster analysis, Computer Journal, 41, 578-588, 1998; M. Iwayama and T. Tokunaga, Hierarchical Bayesian clustering for automatic text classification, In Proceedings of the International Joint Conference on Artificial Intelligence, 1995; C. Fraley, Algorithms for model-based Gaussian hierarchical clustering, SIAM Journal on Scientific Computing, 20:270-281, 1999; Jain & Dubes, Algorithms for Clustering Data, Prentice Hall, 1988; P. Willett, Document Clustering Using an Inverted File Approach, Journal of Information Science, Vol. 2 (1980), pp. 223-31; Hofmann, T. and Puzicha, J., Statistical models for cooccurrence data, AI-MEMO 1625, Artificial Intelligence Laboratory, Massachusetts Institute of Technology (1998); U.S. Pat. No. 5,832,182, Zhang, et al., Method and system for data clustering for very large databases; U.S. Pat. No. 5,864,855, Ruocco, et al., Parallel document clustering process; U.S. Pat. No. 5,857,179, Vaithyanathan, et al., Method and apparatus for information access employing overlapping clusters; U.S. Pat. No. 6,003,029, Agrawal, et al., among others.
From the above, it is seen that an improved way of organizing information is highly desirable.