The present invention relates to search engines in general and, in particular, to improving the efficiency of known search-engine technology by allowing search engines to organize heterogeneous data into clusters of similar information more efficiently and accurately.
Search engines must search through enormous amounts of extrinsic data and return results quickly enough to provide the real-time response required by interactive user interfaces. One way in which a search algorithm may speed up this process is by organizing searchable data items into clusters before starting a search, where each cluster contains items that have similar attributes. This step may reduce the number of data items that a search engine must consider when responding to a search request. For example, it is easier for a search engine to respond to a search request related to computer monitors if the engine needs to search through only a single cluster that contains monitor specifications previously retrieved from Web sites of monitor manufacturers.
Clustering technology, however, introduces another set of problems. Organizing data into clusters of similar records requires an engine to process an enormous amount of third-party information. Each information source may use a proprietary or source-specific schema to format and organize data. Inconsistencies among these source schemas make it harder for a search engine to compare records retrieved from different sources and to accurately determine whether entities described by different records have similar attributes.
Current search-engine technology addresses these problems by formatting all data items stored in a single cluster according to a common, cluster-specific objective schema. Importing data into such a cluster, however, requires a search engine to translate each imported data item from the data item's original source-schema format into the cluster's objective-schema format. This translation comprises mapping each relevant attribute of an imported data item onto one or more attributes of the objective-schema format. This mapping must follow a distinct set of rules that is a function of both the source schema of the data item (which depends on the source from which the data item is retrieved) and the objective schema of the cluster in which the data item will be stored.
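By way of illustration only, the dependence of a mapping-rule set on both the source schema and the objective schema may be sketched as a lookup table keyed by that pair. The schema names (`vendor_a`, `vendor_b`, `monitor_cluster`), the attribute names, and the `translate` function below are hypothetical and are not drawn from any known search engine; the sketch merely shows one possible arrangement.

```python
from typing import Any, Callable, Dict, Tuple

# One rule maps a source attribute onto an objective-schema attribute:
# source attribute name -> (objective attribute name, value converter).
Rule = Tuple[str, Callable[[Any], Any]]
RuleSet = Dict[str, Rule]

# Hypothetical rule table. Each rule set is selected by the pair
# (source schema, objective schema), so a new source schema needs a new entry.
MAPPING_RULES: Dict[Tuple[str, str], RuleSet] = {
    ("vendor_a", "monitor_cluster"): {
        "diag_in": ("diagonal_cm", lambda v: round(float(v) * 2.54, 1)),
        "res": ("resolution", str),
    },
    ("vendor_b", "monitor_cluster"): {
        "screen_size_cm": ("diagonal_cm", float),
        "pixels": ("resolution", lambda v: f"{v[0]}x{v[1]}"),
    },
}


def translate(record: Dict[str, Any], source_schema: str,
              objective_schema: str) -> Dict[str, Any]:
    """Translate a record from its source schema into the objective schema."""
    key = (source_schema, objective_schema)
    if key not in MAPPING_RULES:
        # No rules exist yet for this previously unknown schema pair.
        raise KeyError(f"no mapping rules for {key}")
    rules = MAPPING_RULES[key]
    # Attributes without a rule are treated as irrelevant and dropped.
    return {
        target: convert(record[attr])
        for attr, (target, convert) in rules.items()
        if attr in record
    }


record_a = {"diag_in": "27", "res": "2560x1440", "sku": "A-1"}
record_b = {"screen_size_cm": 68.6, "pixels": (2560, 1440)}
print(translate(record_a, "vendor_a", "monitor_cluster"))
print(translate(record_b, "vendor_b", "monitor_cluster"))
```

In this sketch, two records formatted under different source schemas translate into the same objective-schema record, which is what allows them to be compared within a single cluster.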
Known search engines have no way to automatically identify mapping rules capable of accurately and reliably correlating attributes of imported data records (which are based on source schemas) to corresponding attributes of a cluster's objective schema. This procedure can be especially difficult when the search engine imports data from many sources that each use a different schema to format and store data. Two imported records retrieved from databases comprising heterogeneous schemas may thus represent identical semantic values in completely different ways. For example, a monitor's 24-bit color-depth attribute may be represented in one schema as a “true color” string, in a second schema as a “24” 16-bit integer value, and in a third schema as a “16.777216E+6” floating-point value.
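The color-depth example may be made concrete with the following sketch, which reduces the three representations to a single bits-per-pixel value. The schema labels (`schema_1` through `schema_3`) and the function names are hypothetical, and the sketch assumes that the floating-point value denotes a count of displayable colors (2^24 = 16,777,216). The correct interpretation of each value is known only from its source schema, which is why the converter is selected by schema rather than inferred from the value itself.

```python
import math
from typing import Any, Callable, Dict

# Hypothetical table of named color depths, in bits per pixel.
NAMED_DEPTHS = {"true color": 24}


def from_name(value: str) -> int:
    """Schema 1: depth given as a descriptive string."""
    return NAMED_DEPTHS[value.strip().lower()]


def from_bits(value: int) -> int:
    """Schema 2: depth given directly as an integer number of bits."""
    return int(value)


def from_color_count(value: float) -> int:
    """Schema 3 (assumed): depth given as the number of displayable colors."""
    return round(math.log2(value))


# The converter depends on the source schema, not on the value alone.
CONVERTERS: Dict[str, Callable[[Any], int]] = {
    "schema_1": from_name,
    "schema_2": from_bits,
    "schema_3": from_color_count,
}


def color_depth_bits(schema: str, value: Any) -> int:
    """Normalize a schema-specific color-depth value to bits per pixel."""
    return CONVERTERS[schema](value)


for schema, raw in [("schema_1", "true color"),
                    ("schema_2", 24),
                    ("schema_3", 16.777216E+6)]:
    print(schema, repr(raw), "->", color_depth_bits(schema, raw))
```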
Determining which data items are similar enough to be clustered together thus requires determining how to interpret attributes of those data items defined by various source schemas, and then accurately mapping attributes from each source schema onto attributes defined by a cluster's objective schema. Current clustering methods perform this task, if at all, through a cumbersome brute-force method that loses the original source schemas after performing the mapping and that requires a search engine to derive new mapping rules every time data is discovered at a new data source that uses a previously unknown source schema.
Clustering search engines, therefore, although potentially more efficient than search engines that do not incorporate clustering technology, suffer from technical problems of their own that affect their efficiency and accuracy.