Classification algorithms, as a subset of statistical machine learning techniques, are well known in the art. A classification task of particular interest is the extraction of attribute-value pairs from natural language documents that describe various products. Various techniques for performing such attribute-value extraction are described in our commonly-assigned, prior U.S. patent application Ser. No. 11/742,215 (the “'215 application”) and/or U.S. patent application Ser. No. 11/742,244 (the “'244 application”), the teachings of which prior applications have been incorporated herein by the reference above. As noted therein, retailers have been collecting a growing amount of sales data containing customer information and related transactions. These data warehouses also contain product information that is often very sparse and limited. Treating products as atomic entities hinders the effectiveness of many applications that businesses currently use to analyze transactional data, such applications including product recommendation, demand forecasting, assortment optimization, and assortment comparison. While many retailers have recently realized this and are working towards enriching product databases with attribute-value pairs, the work is currently done completely manually, e.g., through inspection of product descriptions that are available in an internal database or through publicly available channels (such as the World Wide Web), or by looking at the actual product packaging in a retail environment.
While our prior U.S. patent applications describe techniques that beneficially automate these tasks, further improvements are possible. For example, because classification techniques applied to text determine probabilistic classifications of words and phrases, the reliability of such classifications can be degraded to the extent that the text includes substantial amounts of extraneous information. Such extraneous text is, relative to the desired extraction results, similar to noise relative to a desired signal. Thus, it would be desirable to eliminate such extraneous information from text to be analyzed. Furthermore, it is known that certain classification techniques provide advantages or operate more reliably on certain types of data as compared to other classification techniques. Because no one classification technique is perfectly suited for every situation and type of input data, it would be beneficial to leverage the advantages of various techniques in order to arrive at the best possible results.