The present invention relates to query processing, and more specifically, to an item name normalization approach for processing queries.
Information is typically retrieved from an information system by submitting a search query to the information system, where the search query specifies a set of search criteria. The information system processes the search query against a set of searchable items and provides search results to a user. For example, in the context of online shopping over the Internet, a user may submit a word-based search query that specifies the type of item and the brand name of the item that the user wishes to purchase. As used herein, the term “item name” refers to information used to identify an item. Thus, “item name” may, for example, refer to the brand name of an item, the model name of the item, or a short description of the item, which may include the brand name of the item. For example, a user who is shopping for a winter-camping sleeping bag may submit a word-based search query that specifies “Lands' End sub-zero sleeping bag”. Thus, “Lands' End sub-zero sleeping bag” is an item name that describes the type of item (i.e., “sleeping bag”), a subclass of that item (i.e., “sub-zero”), and the brand name of the item (i.e., “Lands' End”) that the user is interested in purchasing.
As used herein, the term “search results” refers to data that indicates the item names that satisfy a search query. One problem with using word-based search queries to retrieve information is that word-based search queries sometimes do not accurately reflect the intent of the user, and thus the user is often dissatisfied with the search results. For example, assume that “Lands' End sub-zero sleeping bag” is a valid item name. Further assume that a user who is interested in purchasing a sub-zero sleeping bag made by Lands' End submits a search query that does not exactly match that item name, such as “Landsend Company sub-zero sleeping bag”. The search results for such a query may be a null set because no item name matches the search query “Landsend Company sub-zero sleeping bag”.
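By way of illustration only, the null-set outcome described above can be sketched as an exact-match lookup. The catalog contents and the function name `exact_match_search` are hypothetical and are assumed here solely for the example; they are not taken from any particular information system.

```python
# Hypothetical set of valid item names (assumed for illustration only).
ITEM_NAMES = {"Lands' End sub-zero sleeping bag"}


def exact_match_search(query, item_names):
    """Return the item names that are identical to the query string."""
    return {name for name in item_names if name == query}


# The user's wording differs from the stored item name, so nothing matches.
results = exact_match_search("Landsend Company sub-zero sleeping bag", ITEM_NAMES)
print(results)
```

Under this assumed exact-match behavior, the query returns an empty set even though the catalog contains the item the user intends.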
Another problem may be that the various sources from which item names are extracted may themselves provide inconsistent information on item names. Also, such sources may provide different information on prices and other product information associated with the item names. The following example illustrates the problem of inconsistent item names as well as the problem of different information associated with the item names in the context of online catalog shopping.
FIG. 1A is a table 100 that shows brand names 101, 103, 105, 107 and 109. Brand names 101, 103, 105, 107 and 109 are really variations of the brand name “Lands' End”. Similarly, FIG. 1B is a table 110 that shows item names 112, 114, 116, 118 and item name sources 112a, 114a, 116a, 118a. Item names 112, 114, 116 and 118 are variations of the same item name. Variations of an item name will henceforth be referred to as “item name variants”. Assume that each item name variant in table 110 is extracted from a different shopping catalog. For example, item name variant 112 is extracted from item name source 112a, namely, “Catalog A”. Similarly, item name variants 114, 116, 118 are extracted from item name sources 114a, 116a, 118a, respectively. Further assume that each item name source provides different information on the item name variants. For example, assume that item name source 112a indicates that item name variant 112 is priced at $10 and available in red, blue, green and yellow; item name source 114a indicates that item name variant 114 is priced at $11 and available in green and yellow only; item name source 116a indicates that item name variant 116 is priced at $9 and available in yellow only; and item name source 118a indicates that item name variant 118 is priced at $15 and available in 36 colors.
If, for example, a user submits the search query “Landsend Company sweater for girls”, only item name variant 114 would satisfy the search query. Thus the user may believe that the sweater is available only in green and yellow and is priced at $11. The user may in fact be cost conscious and thus may prefer the $9 sweater described by item name source 116a. Alternatively, the user may be more concerned with having a range of colors from which to select and thus would probably prefer the information from item name source 118a, which indicates that the sweater is available in 36 colors.
Given the current demand for query processing in the context of online shopping and the limitations in the prior approaches, an approach for processing queries that does not suffer from limitations associated with conventional query processing approaches is highly desirable. In particular, an approach for processing queries that addresses the problem of multiple variants of an item name and the inconsistent information associated with an item name is highly desirable.
According to one aspect of the invention, a method is provided for normalizing item names. One or more clusters of item name variants are determined, wherein the item name variants are extracted from an initial set of documents and wherein each cluster of item name variants is a cluster of similar item name variants. A normalized item name that is logically associated with each cluster is determined. The item name variants in each cluster are mapped to the normalized item name associated with that cluster to create an initial set of mapping information. A dictionary is created using the mapping information.
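By way of illustration only, the steps of this aspect (clustering similar variants, determining a normalized item name per cluster, mapping the variants, and creating a dictionary) can be sketched as follows. The summary does not specify a similarity measure or a rule for choosing the normalized item name, so the sketch assumes a string-similarity ratio from Python's standard `difflib` module, a greedy grouping against the first member of each cluster, and a "most frequent variant" rule. The function names, the threshold of 0.8, and the sample variants are likewise hypothetical.

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def _key(name):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def cluster_variants(variants, threshold=0.8):
    """Greedily group each variant with the first cluster whose first
    member has a sufficiently similar key (assumed similarity rule)."""
    clusters = []
    for variant in variants:
        for cluster in clusters:
            ratio = SequenceMatcher(None, _key(variant), _key(cluster[0])).ratio()
            if ratio >= threshold:
                cluster.append(variant)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([variant])
    return clusters


def normalized_name(cluster):
    """Assumed rule: the most frequent variant; ties go to the first seen."""
    return Counter(cluster).most_common(1)[0][0]


def build_dictionary(variants, threshold=0.8):
    """Map every variant to the normalized item name of its cluster."""
    mapping = {}
    for cluster in cluster_variants(variants, threshold):
        canonical = normalized_name(cluster)
        for variant in cluster:
            mapping[variant] = canonical
    return mapping


# Hypothetical variants, as if extracted from an initial set of documents.
variants = ["Lands' End", "Lands' End", "Landsend", "Lands End", "Patagonia"]
dictionary = build_dictionary(variants)
```

With such a dictionary, a query term could be replaced by its normalized item name before the search is performed, so that differently spelled variants resolve to the same item name. Other similarity measures or selection rules could be substituted without changing the overall flow.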