Statistical techniques for dimensionality reduction, such as independent component analysis (ICA), have been shown to be effective in embedding large sets of documents in low-dimensional spaces, both for classification and for similarity-based retrieval. These techniques have been applied to retail product catalog “understanding,” where the unstructured text associated with each item can include such information as the product name and description.
One standard technique for improving the performance of such dimensionality-reduction methods is to preprocess the set of words associated with each item. Preprocessing (also sometimes known as normalization) is typically accomplished by first removing very common words (e.g., “an”, “the”, etc.), and then applying techniques such as stemming (i.e., removing suffixes and prefixes from words to expose common root forms) or truncation, and/or down-selection to a “relevant” subset of the vocabulary. Down-selection is also known as word filtering or word selection. It should be understood that any or all of the aforementioned preprocessing techniques (common word removal, stemming, truncation, and word filtering) may be used in any particular application. After word selection, each document in the set (e.g., each product description in a catalog) is converted into a vector of word counts. The vectors are combined into a matrix, with one row per document and one column per selected word, that is then reduced using a dimensionality-reduction scheme such as ICA.
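The preprocessing steps above can be illustrated with a minimal Python sketch. It is only one possible arrangement: the stop-word list, the suffix rules, and the function names (`crude_stem`, `preprocess`, `count_matrix`) are hypothetical choices made for illustration, and the crude suffix stripping stands in for a real stemmer. The final reduction step (e.g., ICA) is not shown; the resulting count matrix would be passed to whatever dimensionality-reduction routine is in use.

```python
import re
from collections import Counter

# Hypothetical stop-word list and suffix rules, chosen only for illustration.
STOP_WORDS = {"a", "an", "the", "and", "of", "for", "with"}
SUFFIXES = ("ing", "s")


def crude_stem(word):
    """Strip one common suffix; a simple stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, remove very common words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def count_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    token_lists = [preprocess(d) for d in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for word, n in Counter(tokens).items():
            row[index[word]] = n
        matrix.append(row)
    return vocabulary, matrix


# Two made-up product descriptions standing in for catalog items.
catalog = [
    "The running shoes for trail running",
    "A leather shoe with laces",
]
vocabulary, matrix = count_matrix(catalog)
```

In this sketch, “shoes” and “shoe” collapse to the same column after stemming, which is the effect that normalization is meant to achieve before the matrix is reduced.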
Traditional statistical methods for word selection use catalog-based relevance measures, for example removing all words that appear in most of the items or in only a few of them. An alternate approach to word selection is described herein.
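For comparison, the traditional catalog-based filter mentioned above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the function name `filter_by_document_frequency` and the threshold parameters `min_fraction` and `max_fraction` are assumptions introduced here, and the example items are made up.

```python
from collections import Counter


def filter_by_document_frequency(token_lists, min_fraction, max_fraction):
    """Keep only words whose share of items lies in [min_fraction, max_fraction]."""
    n_items = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        # Count each word once per item, however often it occurs there.
        doc_freq.update(set(tokens))
    keep = {
        word
        for word, df in doc_freq.items()
        if min_fraction <= df / n_items <= max_fraction
    }
    return [[t for t in tokens if t in keep] for tokens in token_lists]


# Made-up, already tokenized catalog items.
items = [
    ["red", "cotton", "shirt"],
    ["blue", "cotton", "shirt"],
    ["green", "wool", "shirt"],
    ["red", "wool", "shirt"],
]
filtered = filter_by_document_frequency(items, min_fraction=0.3, max_fraction=0.9)
```

With these thresholds, “shirt” is dropped because it appears in every item, and “blue” and “green” are dropped because each appears in only one item of four.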