The present invention relates generally to search engines. More particularly, the invention is a computer-implemented search system and method that uses searching to build and maintain a database. The invention allows users of the database to be presented with data that maintains a high level of accuracy and reliability and minimizes the occurrence of redundant data. Users of the database may then be presented with unique instances of data rather than data elements that appear to be different but are actually duplicates of data elements already entered into the database.
Modern information resources, including data found on global information networks, form large databases that need to be searched to extract useful information. Existing database searching technology provides the capability to search through these databases. It is also important to be able to search for a value or item in a database within its particular data context to reduce the number of irrelevant “matches” reported by a database searching program. Traditional search methods of exact, partial and range retrieval paradigms fail to satisfy the content-based retrieval needs of many emerging data processing applications.
Databases of all sizes constantly need to be updated to include new data elements or updated information about existing data elements. The information may be single data elements that need to be added to the database or data feeds containing multiple data elements. This is important for many different applications for which databases are used, but is especially important in maintaining large databases that contain manufacturing or consumer product data, employee data for large companies, or any other type of data where feeds of updated and new data need to be entered on a daily or even more frequent basis. The data feeds need to be examined to determine if the data is new or updated, or if it already exists in the target databases.
Manufacturers and retailers often refer to the same product by using different product identifiers such as product description, category, and identification numbers. This can cause confusion when attempting to build a database of unique products because entries for products are often repeated in a database and appear to be unique products or inventory when they are not. Because the entries may not be exact matches, traditional database search methods fail to satisfy content-based retrieval needs for comparing a product to an existing database to determine if it is already entered into that database. The process of examining product entries and comparing them to a database to determine if they should be added to the database, or likewise removing redundant entries from a database, has traditionally been accomplished by either manually examining the entries or by using a computer program to attempt to identify the same products. In most cases, it is accomplished by a combination of using a computer program to generate a list of potential duplicate candidates followed by a manual examination of those candidates. Since the product and inventory databases are often very large, both the computerized process and the manual process can take a long time.
The present invention uses similarity-scoring techniques to identify redundant database content. For example, search techniques may be used to identify products and inventory that may have different product (or inventory) descriptions, categories, identification numbers and the like, but are actually the same product or inventory. The present invention provides for using optimized comparisons to allow a unique database to be maintained by reducing or eliminating redundant entries.
Comparisons are used to determine whether new or updated products should be added to an existing unique product taxonomy database. The new or updated data is compared with the existing database to determine if the product is unique, not unique, or a possible duplicate product match. The database may be hierarchical and divided into distinct taxonomy fragments, or categories, such as those pertaining to retail electronic commerce. If the product is unique, it may be entered in the database.
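The three outcomes named above (unique, not unique, possible duplicate) can be illustrated with a minimal Python sketch. It assumes the comparison yields a similarity score between 0 and 1 and that two cut-off values separate the outcomes; the function name `classify` and both cut-off values are hypothetical and are not taken from the invention.

```python
def classify(score, duplicate_above=0.9, review_above=0.6):
    """Map a similarity score in [0, 1] to one of three outcomes.

    Both cut-offs are illustrative assumptions, not values from the source.
    """
    if score > duplicate_above:
        return "not unique"          # treated as already present in the database
    if score > review_above:
        return "possible duplicate"  # left for further examination
    return "unique"                  # may be entered in the database
```

Under these assumed cut-offs, a score of 0.95 would be classified as not unique, 0.7 as a possible duplicate, and 0.2 as unique.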
The method of storing and maintaining a unique product and inventory database allows a database to be used in commerce where multiple manufacturers and retailers are involved. When joining data from multiple data sources, there is a need to maintain a unique, non-redundant collection of data, due to variations in product identity between manufacturers and retailers. The present invention is a method for maintaining a manufacturer and retailer inventory database so that unique data can be retrieved from the database. The collection of unique data aids the eventual consumer in identifying the similarities and differences between unique manufacturer products and the consumer retail organizations that supply the products.
An embodiment of the invention is a computer-implemented method for combining data elements to build and maintain a unique database comprised of data entries, which comprises using at least one candidate data element that is a candidate to be added to existing data elements in the unique database, performing a comparison between the candidate data element and the existing data elements in the unique database, and computing a similarity score that represents a similarity between the at least one candidate data element and the existing data elements in the unique database. The method may further comprise determining if the candidate data element should be entered into the unique database based on the similarity score. The method may further comprise rejecting the candidate data element for entry into the unique database if the similarity score is greater than a similarity score threshold. The method may further comprise selecting the candidate data element as a candidate for entry into the unique database if the similarity score is equal to or less than a similarity score threshold. The candidate data element may be entered into the unique database. The computing of a similarity score may comprise separating the unique database into at least one selected category, developing a schema for the selected categories, assigning the candidate data element to at least one of the selected categories, formulating a similarity score command for each candidate data element based on the selected categories to which the candidate data element belongs, sending the similarity score command to a similarity score function, and performing a search using the similarity score command and the unique database whereby a similarity score result is returned from the search function that represents the similarity score between the at least one candidate data element and the existing data elements in the unique database.
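The flow of this embodiment (categories, a schema per category, a similarity score, and a threshold test before entry) can be sketched in Python as follows. This is only an illustration under stated assumptions: the scoring here uses the standard library's `difflib.SequenceMatcher` as a stand-in for the similarity score function, which the text does not specify, and the names `SCHEMAS`, `field_similarity`, `similarity_score`, `try_enter`, and the default threshold of 0.85 are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical schemas: the fields compared within each category.
SCHEMAS = {
    "cameras": ("brand", "model"),
    "televisions": ("brand", "model", "screen_size"),
}


def field_similarity(a, b):
    """Similarity in [0, 1] between two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def similarity_score(candidate, database):
    """Best mean field similarity between the candidate and any same-category entry."""
    fields = SCHEMAS[candidate["category"]]
    best = 0.0
    for entry in database:
        if entry["category"] != candidate["category"]:
            continue  # only compare within the candidate's category
        total = sum(field_similarity(candidate[f], entry[f]) for f in fields)
        best = max(best, total / len(fields))
    return best


def try_enter(candidate, database, threshold=0.85):
    """Append the candidate unless its score is greater than the threshold."""
    score = similarity_score(candidate, database)
    if score > threshold:
        return False, score  # rejected as a likely duplicate
    database.append(candidate)  # score <= threshold: entered as unique
    return True, score
```

With a database holding one camera entry (brand "Acme", model "X100"), a candidate with brand "ACME" and model " x100" would be rejected in this sketch because the normalized fields are identical, while a candidate from a category with no existing entries would score 0.0 and be entered.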
Another embodiment of the invention may be a computer-readable medium containing instructions for controlling a computer system to implement the method described above.
Another embodiment of the invention is a system for combining data elements to build and maintain a unique database, which comprises candidate data elements, unique data elements contained in a database, a similar scoring engine for comparing the candidate data elements with the unique data elements and means for entering the candidate data elements into the database based on the comparison of the similar scoring engine. The candidate data elements may be entered into the database if the similar scoring engine determines a similarity score result set that is less than or equal to a predetermined threshold value. The system may further comprise means for performing a secondary similar score validation check on the candidate data elements prior to entering the candidate data elements into the database.
The present invention, which relies on similarity scoring, solves the aforementioned needs. The present invention comprises computer-readable media having computer-executable instructions for performing the methods described above.