Many computational tasks can be formulated as problems that require learning and classification, in particular when the number of categories is large. For example, in a number of existing text categorization domains, such as categorizing web pages into topic hierarchies, the number of categories currently ranges in the hundreds of thousands. In the task of language modeling, each possible word or phrase to be predicted may be viewed as its own category, so the number of categories can easily exceed hundreds of thousands. For papers on language modeling, see, for example, R. Rosenfeld, Two Decades of Statistical Language Modeling: Where Do We Go From Here, Proceedings of the IEEE, 88(8), 2000; J. T. Goodman, A Bit of Progress in Language Modeling, Computer Speech and Language, 15(4):403-434, October 2001; and Y. Even-Zohar and D. Roth, A Classification Approach to Word Prediction, In Annual meeting of the North American Association of Computational Linguistics (NAACL), 2000. For a paper that also discusses large scale text categorization, see, for example, O. Madani and W. Greiner, Learning When Concepts Abound, Technical Report, Yahoo! Research, 2006. Similarly, visual categories are numerous. See, for example, J. Z. Wang, J. Li, and G. Wiederhold, SIMPLIcity: Semantics-sensitive Integrated Matching for Picture Libraries, IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947-963, 2001. In addition, decades of research in cognitive psychology have stressed the importance of categories (concepts) to basic cognition. See, for example, G. L. Murphy, The Big Book of Concepts, MIT Press, 2002. The number of categories necessary for general human level intelligence can easily exceed millions. Developing successful learning and classification techniques that can scale to a possibly unbounded number of instances as well as myriad categories has the potential to significantly impact applications as well as contribute to our understanding of intelligence.
However, efficient learning and classification of instances from large collections of objects is a difficult task in the face of myriad categories.
An important subproblem is the recall problem, where on presentation of an instance, a small set of candidate categories should be quickly identified and output without missing the true categories. Typically an instance is represented by a vector of feature values. Accurately and efficiently reducing the full set of categories to a small set of candidates that still includes the right category for the instance requires both high recall and high precision. Recently, an approach based on learning an inverted index from features to categories was explored. See O. Madani and W. Greiner, Learning When Concepts Abound, Technical Report, Yahoo! Research, 2006. In that work, classifiers corresponding to the retrieved categories could be applied for precise categorization of the instance. Unfortunately, this approach relies substantially on classifiers. Although the approach is functional, training and applying classifiers take time and space. A learning and categorization method that does not require classifiers but has similar or better categorization accuracy would be very useful.
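To make the idea of an inverted index from features to categories concrete, the following is a minimal sketch in Python. It is illustrative only and is not the method of the cited technical report: the function names (build_index, retrieve), the count-based weighting, the per-feature cap max_categories_per_feature, and the toy training data are all assumptions introduced here. Each feature maps to a short list of categories weighted by how often they co-occurred with that feature, and retrieval sums those weights over the active features of an instance to produce a small ranked candidate set.

```python
from collections import defaultdict


def build_index(examples, max_categories_per_feature=3):
    """Build a feature -> {category: weight} index (hypothetical weighting).

    examples: iterable of (features, categories), where features is a dict
    of feature name -> value and categories is a set of true category labels.
    """
    totals = defaultdict(float)                    # total value seen per feature
    co = defaultdict(lambda: defaultdict(float))   # feature -> category -> value
    for features, categories in examples:
        for f, v in features.items():
            totals[f] += v
            for c in categories:
                co[f][c] += v
    index = {}
    for f, cats in co.items():
        # Keep only the strongest categories per feature so lookups stay cheap.
        ranked = sorted(cats.items(), key=lambda kv: (-kv[1], kv[0]))
        index[f] = {c: w / totals[f] for c, w in ranked[:max_categories_per_feature]}
    return index


def retrieve(index, features, k=2):
    """Return up to k candidate categories, ranked by summed feature weights."""
    scores = defaultdict(float)
    for f, v in features.items():
        for c, w in index.get(f, {}).items():
            scores[c] += v * w
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [c for c, _ in ranked[:k]]


# Toy data, invented purely for illustration.
training = [
    ({"goal": 1.0, "match": 1.0}, {"sports"}),
    ({"match": 1.0, "election": 1.0}, {"politics"}),
    ({"goal": 1.0, "league": 1.0}, {"sports"}),
    ({"vote": 1.0, "election": 1.0}, {"politics"}),
]
index = build_index(training)
candidates = retrieve(index, {"goal": 1.0, "match": 1.0}, k=2)
print(candidates)
```

In this sketch no per-category classifier is trained or applied; the cost of retrieval depends on the number of active features and the number of categories kept per feature, not on the total number of categories.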