Retailers have been collecting a growing amount of data from various sources in hopes of improving business performance based on analysis of such data. For example, most retailers have terabytes of transaction data containing customer information and related transactions. These data warehouses also contain product information, but that information is often very sparse and limited. For example, most retailers treat products as “atomic” entities with very few related attributes (typically brand, size, or color). Nevertheless, retailers currently try to use transactional data for various applications, such as demand forecasting, assortment optimization, product recommendations, assortment comparison across retailers/manufacturers or product supplier selection. However, treating products as atomic entities hinders the effectiveness of these applications. Representations of products in terms of attributes and attribute values would significantly improve, both in terms of efficiency and efficacy, the above-mentioned applications. As used hereinafter, attributes describe a generalized quality, property, or characteristic of a product, whereas values assign a specific quantity, quality, configuration, etc. to an otherwise generic attribute.
For example, assume a grocery store wants to forecast sales of “Tropicana Low Pulp Vitamin-D Fortified Orange Juice 1-liter plastic bottle”. Typically, they would look at sales of the same product from the same time last year and adjust that number based on some new information. If this particular product is new, however, data from previous years will obviously not be available. In contrast, representing the product as a set of attribute-value pairs (e.g., Brand: Tropicana; Pulp: Low; Fortified with: Vitamin-D; Size: 1 liter; Bottle Type: Plastic) would enable use of data from other products having identical or similar attributes, thereby enabling a more accurate forecast. Even if the product is not new, representing it in terms of attribute-value pairs allows comparison with other related products and improved forecasts.
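The forecasting benefit described above can be illustrated with a minimal sketch. It is an illustration under stated assumptions only, not a technique the source describes. The catalog entries, the sales figures, the "OtherBrand" name, the attribute-overlap similarity measure, and the function names `similarity` and `forecast_new_product` are all hypothetical. The sketch estimates sales of a new product from previously sold products that share attribute values with it.

```python
# Hypothetical catalog: each entry pairs a product's attribute-value pairs
# with an illustrative (made-up) prior-year unit sales figure.
CATALOG = [
    ({"Brand": "Tropicana", "Pulp": "Low", "Size": "1 liter", "Bottle Type": "Plastic"}, 120),
    ({"Brand": "Tropicana", "Pulp": "High", "Size": "1 liter", "Bottle Type": "Plastic"}, 90),
    ({"Brand": "OtherBrand", "Pulp": "Low", "Size": "2 liter", "Bottle Type": "Carton"}, 60),
]


def similarity(a, b):
    """Fraction of attributes (over the union of both key sets) with equal values."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    matches = sum(1 for k in keys if k in a and k in b and a[k] == b[k])
    return matches / len(keys)


def forecast_new_product(new_attrs, catalog, min_similarity=0.5):
    """Similarity-weighted average of sales over sufficiently similar products.

    Returns None when no catalog product reaches min_similarity.
    """
    scored = [(similarity(new_attrs, attrs), sales) for attrs, sales in catalog]
    scored = [(w, s) for w, s in scored if w >= min_similarity]
    if not scored:
        return None
    total_weight = sum(w for w, _ in scored)
    return sum(w * s for w, s in scored) / total_weight


# The new product from the example above, expressed as attribute-value pairs.
new_product = {
    "Brand": "Tropicana",
    "Pulp": "Low",
    "Fortified with": "Vitamin-D",
    "Size": "1 liter",
    "Bottle Type": "Plastic",
}
estimate = forecast_new_product(new_product, CATALOG)
```

A product treated as an atomic entity offers no basis for such a comparison, whereas here the two Tropicana 1-liter products contribute to the estimate even though neither is identical to the new product.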
Many retailers have realized this recently and are trying to enrich their product databases with attributes and corresponding values for each product. However, this is typically done using a manual process in which product descriptions (often obtained from an internal database, the World Wide Web or actual product packaging) are individually inspected, making the process relatively inefficient and expensive. Automation of this type of processing would greatly improve efficiency and reduce overall expense.
To this end, techniques for extracting information from text documents are well known. However, such techniques have not been applied to the problem of extracting product attributes and values. For example, recently proposed techniques extract product features and their polarity (i.e., “good”, “bad”, “useful”, etc.) from online user reviews. While these techniques attempt to describe a product as a vector of attributes, they do not address the extraction of values or the association of the extracted attributes and values with each other. Other techniques encompass information extraction with the goal of filling templates whereby certain parts of a text document are extracted as relevant facts. However, these techniques start with a definitive list of template slots, akin to attributes, rather than deriving such attributes directly from the documents themselves. Additional work has been performed in the area of extracting named entities from documents using so-called semi-supervised learning, discussed in further detail below. However, while these techniques essentially perform classification of words/phrases as attributes or values, such classifications are performed independently of each other, and attribute-value pairs are not determined. Further still, such classification techniques have not been applied to the determination of product attributes and values. Recently, Silver Creek Systems, Inc. has offered its “DATALENS” system as a means for developing “understanding” of, for example, a company's products through analysis of product descriptions. Relying on user intervention to identify attributes and values manually, at least in part, the “DATALENS” system uses non-classification-based techniques (i.e., the development of schemas in which core terms are further described by their attributes and values) to transform such product descriptions from one or more (often idiosyncratic) language domains into other, more useful language domains.
Thus, it would be advantageous to provide techniques that allow for the establishment of product attribute-value pairs through the automatic extraction of product attributes and values while overcoming the limitations of prior art techniques.