The present specification relates to data management. In particular, it relates to identifying entries in different sources of data, e.g., databases, that correspond to the same object, such as the same book, the same restaurant or the same phone number.
Many organizations maintain databases of entries or records containing data about relevant objects. Each entry may be divided into fields, where each field includes data about a particular attribute of the object which is represented by the entry. For example, a book database maintained by an online bookseller may include an entry for each book it sells, where the entry for a particular book may include information such as the book title, author, etc. As another example, an entry in a database of businesses may include information such as business name, address, phone number, etc.
Generally, the structure and semantics of the entries will vary among independently managed databases. For example, in one database, the business address may be stored in a single field, whereas, in another database, the same information might be stored across multiple fields. In addition, abbreviations, synonyms, and other differences in data recording conventions between the databases will result in different data representations of the same information. Furthermore, the data quality may differ between the databases due to a variety of factors, including data entry errors and missing data. As a result of these differences, it may be difficult to determine whether two entries refer to the same object by directly comparing the data in the entries.
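By way of illustration only, the following minimal Python sketch shows how two entries for the same business might differ in field structure and recording conventions, so that a direct comparison fails while a comparison after simple normalization succeeds. The entries, the field names, the abbreviation table, and the helper functions `normalize`, `address_a`, and `address_b` are all hypothetical and are not taken from any particular database or from the techniques described in this specification.

```python
import re

# Hypothetical entries for the same business from two independently
# managed databases. Database A stores the address in a single field;
# database B spreads it across several fields.
entry_a = {"name": "Joe's Pizza", "address": "123 Main St., Springfield, IL 62701"}
entry_b = {
    "business_name": "Joes Pizza",
    "street": "123 Main Street",
    "city": "Springfield",
    "state": "IL",
    "zip": "62701",
}

# Assumed, deliberately tiny abbreviation table for illustration.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def normalize(text):
    """Lowercase, drop apostrophes and punctuation, expand abbreviations."""
    text = text.lower().replace("'", "")
    tokens = re.findall(r"[a-z0-9]+", text)
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def address_a(entry):
    """Normalized address for the single-field layout."""
    return normalize(entry["address"])


def address_b(entry):
    """Normalized address for the multi-field layout."""
    parts = (entry[key] for key in ("street", "city", "state", "zip"))
    return normalize(" ".join(parts))


# Direct comparison of raw field values does not recognize the match.
direct_match = entry_a["address"] == entry_b["street"]

# Comparison after normalization does, for this particular pair.
normalized_match = (
    normalize(entry_a["name"]) == normalize(entry_b["business_name"])
    and address_a(entry_a) == address_b(entry_b)
)

print(direct_match, normalized_match)
```

Hand-written normalization of this kind covers only the differences anticipated in advance. It does not address data entry errors or missing data, which is part of why direct comparison remains unreliable in general.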
It is therefore useful to provide techniques for determining whether entries in different databases refer to the same object.