Electronic commerce is a burgeoning industry. Buyers go online to find products that they used to walk through stores to locate. Part of the attractiveness of electronic shopping is the ease of locating products.
Vendors rely on taxonomies of product categories to help customers find products. A taxonomy typically is a multilevel, hierarchical classification of products, though many other approaches to categorizing products can be considered taxonomies. Many online vendors offer products from a variety of sources. The sources may offer similar, competing products or they may offer the same product at different prices. For customers who are shopping for the best price, it is particularly important for products to be properly classified in a taxonomy, so that similar and identical products are assigned to the same category.
Each time an online vendor receives product information to post in an electronic catalog, the product information needs to be classified. In some cases, parts of a printed catalog are updated and the entire catalog is resubmitted. In other cases, information from multiple vendors needs to be combined into a single catalog. The information supplied may include catalog content, images, buyer specific contract pricing, and inventory availability. In any case, the classification process is tedious, time consuming, relatively expensive and error prone. Therefore, it is desirable to have an automatic classification system which is capable of learning from previous product classifications and also capable of combining information from multiple vendors.
Substantial efforts have been devoted to automatic text classification, such as automated article clipping services. For instance, the Text REtrieval Conference (TREC) has been sponsored by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA) to bring together information retrieval researchers from around the world. The SMART/TREC collections of data have become standard test collections for information retrieval research. Many papers have grown out of TREC-related work, such as Amitabh Kumar Singhal's dissertation, Term Weighting Revisited (Cornell, January 1997). However, that work is necessarily focused on text classification rather than product categorization, because the data collections consist of published articles, abstracts, and U.S. patents. None of these collections includes typical product information. The data items in these collections tend to include far more text than a typical product description for a catalog. Other work is ongoing in the area of web search engines, which attempt to retrieve the web pages most relevant to a user query.
Accordingly, it is desired to extend past work on information retrieval, taking into account the nature of product information, to generate an automatic product classification system in support of building catalogs for electronic commerce.
The present invention may be practiced as either a method or a device embodying the method. One aspect of the present invention is a method of machine learning to automatically categorize items from a plurality of pre-categorized items, including counting a frequency of term usage by category for text fields, weighting the frequency by category based on a frequency of usage in other categories, and determining a distribution by category for values in one or more numeric fields. Terms may be single words or both single words and phrases. Numeric fields may include prices or dimensions of a product to be listed in a product catalog. Weightings of frequencies may be stored in a sparse matrix, a B-tree or other suitable data structure. The weighting of frequency of use may be determined by a term frequency-inverse document frequency (TF-IDF) ranking algorithm or any of a variety of other well-known ranking algorithms. The pre-categorized data used for machine learning may be filtered to eliminate outliers based, for instance, on standard deviations from a mean value or on a percentile of high and low outliers to be eliminated. An alternative aspect of the present invention for machine learning is learning the category assignments of particular, pre-categorized items. This proceeds on an item-by-item basis, instead of a category-by-category basis. This alternate embodiment includes counting a frequency of term usage by item for text fields, weighting the frequency by category based on a frequency of usage in other items or categories, and determining a distribution by category for values in one or more numeric fields. Related aspects of the first embodiment may apply to this alternate embodiment.
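One possible sketch of the learning phase described above follows, in Python. The field names ("category", "name", "price"), the whitespace tokenization, and the particular TF-IDF-style formula are illustrative assumptions only; the method is not limited to any particular implementation. The sketch counts term frequencies by category, discounts terms used across many categories, and records a mean and standard deviation by category for each numeric field:

```python
import math
from collections import Counter, defaultdict

def train_category_model(items, text_fields, numeric_fields):
    """Learn per-category term weights and numeric-field statistics
    from pre-categorized items (an illustrative sketch)."""
    term_counts = defaultdict(Counter)                     # category -> term frequencies
    numeric_vals = defaultdict(lambda: defaultdict(list))  # category -> field -> values
    for item in items:
        cat = item["category"]
        for field in text_fields:
            for term in item.get(field, "").lower().split():
                term_counts[cat][term] += 1
        for field in numeric_fields:
            if field in item:
                numeric_vals[cat][field].append(item[field])

    # TF-IDF-style weighting: discount a term by the number of
    # categories in which it appears, so category-specific terms dominate.
    n_cats = len(term_counts)
    cat_freq = Counter()
    for counts in term_counts.values():
        cat_freq.update(counts.keys())
    weights = {
        cat: {t: c * math.log(n_cats / cat_freq[t]) for t, c in counts.items()}
        for cat, counts in term_counts.items()
    }

    # Per-category distribution (mean, standard deviation) for numeric fields.
    stats = {}
    for cat, fields in numeric_vals.items():
        stats[cat] = {}
        for field, vals in fields.items():
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            stats[cat][field] = (mean, math.sqrt(var))
    return weights, stats
```

Outlier filtering, phrase terms, and sparse-matrix or B-tree storage of the weights would be layered on top of this basic structure.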
Another aspect of the present invention is automatically categorizing an item having both text and numeric fields. This aspect of the invention may include parsing terms from text fields of an uncategorized item, identifying categories associated with the terms, calculating ranking scores for the terms in the identified categories, and adjusting the ranking scores based on distributions for numeric fields associated with the item. Ranking scores may be normalized based on the number of terms in an uncategorized item. The invention may further include selecting one or more categories to assign an item to based on adjusted ranking scores. Alternative categories may be rank ordered and items flagged for review by a human user. The calculation of ranking scores for identified categories may include summing the weighted frequencies for terms parsed from text fields and normalizing the sum of frequencies based on the number of terms parsed. Alternatively, it may include summing by text field the weighted frequencies of the parsed terms, combining the sums across text fields according to a predetermined weighting formula, and normalizing the combined sum of weighted frequencies. One predetermined weighting formula would assign a greater weight to a text field containing a shorter description of the uncategorized item than to a text field containing a longer description. Adjusting such ranking scores may involve applying an additive or multiplicative factor or a decision rule. Another alternative embodiment of the present invention is automatic categorization based on comparing terms of an uncategorized item to terms of previously categorized items, instead of terms in categories. In this embodiment, previously identified items are the subject of ranking scores, instead of categories. The categories to which the pre-categorized items are assigned are used as a template for assigning additional items.
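The category-scoring steps above might be sketched as follows, assuming a per-category term-weight table and per-category numeric statistics have already been learned. The Gaussian-style multiplicative adjustment for numeric fields is one illustrative choice of adjustment factor, not prescribed by the invention; an additive factor or a decision rule would serve equally well:

```python
import math

def score_item(item, weights, stats, text_fields, numeric_fields):
    """Rank candidate categories for an uncategorized item (illustrative):
    sum per-category term weights, normalize by term count, and
    multiplicatively adjust by how typical the numeric fields are."""
    terms = []
    for field in text_fields:
        terms.extend(item.get(field, "").lower().split())
    scores = {}
    for cat, term_weights in weights.items():
        raw = sum(term_weights.get(t, 0.0) for t in terms)
        score = raw / max(len(terms), 1)   # normalize by number of terms parsed
        # Adjust by the category's numeric distributions, e.g. penalize
        # a price far from the category's typical price.
        for field in numeric_fields:
            if field in item and field in stats.get(cat, {}):
                mean, std = stats[cat][field]
                z = abs(item[field] - mean) / std if std else 0.0
                score *= math.exp(-0.5 * z * z)   # assumed Gaussian-style factor
        scores[cat] = score
    # Rank-ordered alternatives, best first, for assignment or human review.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Summing per text field and combining the sums with field-specific weights (e.g. favoring a short-description field) would replace the single `raw` sum above.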
This is particularly useful when multiple vendors offer to sell the same item and parrot the manufacturer's description of the product. The first and alternate embodiments can be used together, for instance, relying on the alternate embodiment when a threshold ranking score is achieved and relying on the first embodiment otherwise.
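The item-based alternate embodiment might be sketched as below. Here a simple Jaccard overlap stands in for a weighted ranking score against each pre-categorized library item (an illustrative simplification), and the new item inherits the category of its best match only when the match clears a threshold, falling back to a category-based method otherwise:

```python
def match_item(item_terms, library, threshold=0.5):
    """Item-based categorization sketch: score each pre-categorized
    library item by term overlap with the new item and inherit the
    best match's category when the score clears the threshold."""
    best_cat, best_score = None, 0.0
    new = set(item_terms)
    for lib_terms, category in library:
        lib = set(lib_terms)
        # Jaccard overlap stands in for a weighted ranking score here.
        score = len(new & lib) / len(new | lib)
        if score > best_score:
            best_cat, best_score = category, score
    if best_score >= threshold:
        return best_cat, best_score   # e.g. identical manufacturer text
    return None, best_score           # caller falls back to category scoring
```

A vendor parroting a manufacturer's description produces a near-perfect overlap with an already-categorized copy of the same item, which is exactly the case this embodiment handles well.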
The present invention further includes a user interface and method for assisting a human user in verifying or correcting automatic category assignments. Category assignments normally will be made with confidence scores. A method implementing the present invention may include selecting an automatically categorized item having a low confidence score, displaying alternative categorizations of the selected item together with their confidence scores, and resetting the assigned category of a displayed item based on user input. Preferably, a single action by a user will indicate a preferred category for reassignment or a confirmation of the automatically selected category. It is also preferred to display alternative categorizations sorted by confidence score, preferably with the category assignment having the most favorable confidence score appearing prominently at the top of a list. It also will be helpful for a user to be able to see the details behind the calculation of one or more selected confidence scores, particularly during an initial phase when a library of pre-categorized items is being assembled as a basis for automatic classification.
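The data side of such a review interface might be sketched as follows, with hypothetical function and structure names: items whose best confidence score falls below a threshold are queued for review with their alternatives sorted best-first, and a single user action confirms or reassigns the category:

```python
def review_queue(assignments, threshold):
    """Select automatically categorized items whose best confidence
    score falls below the threshold; alternatives are sorted so the
    most favorable score appears at the top of the list."""
    queue = []
    for item_id, alternatives in assignments.items():
        ranked = sorted(alternatives, key=lambda a: a[1], reverse=True)
        if ranked[0][1] < threshold:
            queue.append((item_id, ranked))
    return queue

def confirm(assignments, item_id, category):
    """Single-action reassignment or confirmation: the chosen
    category becomes the assignment, with full confidence."""
    assignments[item_id] = [(category, 1.0)]
```

Displaying the term weights and numeric adjustments behind a selected score would be a natural extension while the library of pre-categorized items is still being assembled.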