Many computing systems and end-users today need to interact with numerous different target databases in order to extract needed information. The input data these computing systems and end-users use to access these target databases is often diverse. In general, it is necessary to map this diverse input data from the varying sources to the target database entries in order to extract information. However, inconsistencies between the various input data and the various database entries often make this task difficult.
In particular, target databases often contain errors, imperfections, and inconsistencies (such as incomplete, ambiguous, and incorrect entries), both within a single database and from one database to another. Similarly, the various input data sources (whether computing systems or end-users) often contain errors, imperfections, and inconsistencies, both within a single source and across sources. In general, the errors and inconsistencies across the target databases and input data can be minor (e.g., the database lists the full word “street” while the input data uses the abbreviation “St.”) or major (e.g., all entries in the database refer to “Route 1” and the input data is entered as an alternative name, “Main St.”). However, regardless of the degree of the errors/inconsistencies, it is still necessary and often critical to be able to determine a match between the databases and input data.
As an example of this problem, assume a user is searching for information about a building at a particular street address in order to identify any issues that might affect the value of the property. Relevant information could lie in county and town real-estate records, in state and federal tax records, in mortgage records across one or more financial entities, in newspaper archives, etc. Assume further there is an Internet-based service, for example, that can search these multiple sources on behalf of the user and return all relevant information based on the input address of the property. However, each of the target databases corresponding to these entities potentially has its own way of representing addresses. For example, the county records might maintain addresses in an abbreviated way because the records rely on block and lot numbers for accurate identification. The state records might be stored in an early computing system that is limited in the number of characters used to represent street names. The financial records might be widely inconsistent in how they represent street names and indicators because different users entered the data and have different preferences. In addition, all of these databases are likely to contain records with data entry errors, such as typographical or spelling mistakes. Overall, there is a strong possibility that the service will have a difficult time successfully matching an input address entered by the user with each of the various databases.
Overcoming such mapping issues and finding a match between possibly erroneous/imperfect input data from a variety of sources and an entry in multiple different databases can be an onerous task. More specifically, assume there are “M” different sources for input data, each of which has its own characteristic type of variations/inconsistencies and errors. For example, input data sources can include databases, data obtained via user interface applications, data collected by customer service representatives during a phone conversation, transcribed hand-written notes, voice-recognition output, etc. Assume further there are “N” different target databases that can be searched for any given input data request from any of the “M” sources. For example, in addition to the above property search example, a computing system may need to access customer records from different service providers, consumer data from different marketing firms, legal data from different jurisdictions, etc. Again, each of these “N” databases will have its own characteristic variations/inconsistencies and errors. Mapping input data from any of the “M” input sources to any of the “N” target databases involves defining a set of rules for mapping each source to each target. Because of the errors/inconsistencies of any given input source and any given database, the complexity of any of these sets of rules can be high, which makes the rules difficult to define and leads to excessive processing for any given query, especially if the query spans multiple target databases. Just as significantly, the total number of rule sets that need to be defined is on the order of “M×N”. Making this situation worse, it is possible that the rules will need to be field specific, so that “M×N” sets of rules will need to be defined for each field.
The possibly high value of the “M×N” product is an indication of the difficulty of trying to perform the direct mapping between multiple input sources and the entries of one or more target databases.
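The growth of the “M×N” product can be illustrated with a short sketch. The source names, database names, and field names below are hypothetical and chosen only for illustration; the function simply counts one rule set per source/target pair, optionally multiplied by the number of fields.

```python
from itertools import product

# Hypothetical names, used only to illustrate the counting argument.
input_sources = ["web_form", "call_center", "handwritten_notes", "voice_recognition"]
target_databases = ["county_records", "state_tax_records", "mortgage_records"]
fields = ["street_address", "owner_name"]


def count_rule_sets(sources, targets, fields=None):
    """Count the rule sets needed for direct source-to-target mapping.

    One rule set is assumed per (source, target) pair; if the rules are
    field specific, one rule set is assumed per pair per field.
    """
    pairs = list(product(sources, targets))
    if fields is None:
        return len(pairs)
    return len(pairs) * len(fields)


per_pair = count_rule_sets(input_sources, target_databases)
per_field = count_rule_sets(input_sources, target_databases, fields)
print(per_pair, per_field)  # 12 rule sets, or 24 if field specific
```

Even with only four sources and three targets, twelve rule sets are needed, and adding a single new input source requires a new rule set for every existing target database.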
There are several existing approaches for overcoming the above-described mapping problem. One approach is to limit the interaction with a target database to a specific set of choices, thereby limiting the manner in which data is expressed. For example, pull-down boxes could be used to enter data into a database system and to specify input data when doing a search. This approach reduces errors and inconsistencies within the database and between the database and input data. However, this approach does not allow for free-form flexibility and is only feasible when the number of possible values for a given database entry is small enough that the resulting list is not daunting to users.
A second approach is to define multiple sets of rules to perform dynamic mappings between the various input data sources and the multiple target database variations. As described above, this approach involves defining “M×N” sets of rules for mapping the various input data sources to each target variation, taking into account all possible errors and inconsistencies. While this approach allows for free-form flexibility, it is an onerous task if the number of variations in the target databases and/or input data sources is large. In addition, given the possible complexity of the rules, this method leads to more processing for any given query.
A third approach is to “cleanse” the data across the multiple databases using “cleansing rules,” thereby removing entry errors and creating consistency both within and across the databases (e.g., all address entries within the databases are cleansed to use the full word “street”). A computing system could then apply these “cleansing rules” to the input data prior to accessing the databases so that the input data is consistent with the database entries. Alternatively, the computing system could utilize a single set of rules that map the more free-form input data to the data representation that is now common across the multiple target databases. In general, this third approach is well suited for situations where the target databases are commonly controlled/owned and where the target data across the multiple databases is static. However, if the target data across the multiple databases is dynamic, the database information would have to be cleansed continuously in order to handle updates to the data. Any entries added or changed since the last cleansing may not follow the common representation, which presents a problem of consistency.
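As a minimal sketch of the cleansing idea, assume a small, hypothetical table of cleansing rules that expands street-type abbreviations. The same function is applied to both the database entry and the input data, so that both end up in one common representation. The rule table and the function name are illustrative assumptions, not part of any particular system.

```python
import re

# Hypothetical cleansing rules: abbreviation -> canonical full word.
CLEANSING_RULES = {
    "st": "street",
    "ave": "avenue",
    "rd": "road",
}


def cleanse(address):
    """Lower-case the address, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(CLEANSING_RULES.get(tok, tok) for tok in tokens)


database_entry = cleanse("12 Main Street")
user_input = cleanse("12 Main St.")
print(database_entry == user_input)  # both cleanse to "12 main street"
```

Rules of this kind only resolve minor inconsistencies. They do not help when the input uses an alternative name altogether (such as “Main St.” for “Route 1”), and a context-free rule can misfire, for instance by expanding “St” in a name where it stands for “Saint”.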
Although the second and third approaches address the matching issue and allow for free-form flexibility with respect to the input data and database entries, they introduce additional problems with respect to partial matches. Specifically, when defining rules for matching as in the second and third approaches, it is also possible to define rules that detect matches that are not exact (i.e., partial mapping rules). For example, a partial match might involve accepting records that match the input once common misspellings or typographical errors are taken into account (e.g., “Mian Street” is treated as equivalent to “Main Street”). Partial mapping offers the advantage of possibly identifying intended records that might be hidden by data entry errors. However, it also has the disadvantage of potentially identifying items that are not matches at all. If either the input or target data can be presumed to be correct, the probability of such mismatches is likely to be low. However, in cases where the possibility of error exists in both the input and the target data, the probability of mismatches is substantially higher.
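The trade-off in partial matching can be sketched with a generic string-similarity measure. The example below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a partial mapping rule; the threshold of 0.8 and the sample records are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def partial_matches(query, records, threshold=0.8):
    """Return every record whose similarity to the query meets the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return [r for r in records if similarity(query, r) >= threshold]


# Hypothetical target records; only "Main Street" is the intended match.
records = ["Main Street", "Maine Street", "Marin Street", "Oak Avenue"]
hits = partial_matches("Mian Street", records)
print(hits)
```

With these sample records, the misspelled input “Mian Street” recovers the intended “Main Street”, but similar names such as “Maine Street” also clear the threshold. Raising the threshold reduces such false matches at the cost of missing more genuine ones, and the problem grows when the target records may themselves be misspelled.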