Many data-driven applications, including web-based applications, rely heavily on textual data that originates from diverse data sources. This often results in multiple, differing representations of the same items (or entities) in the data. For instance, a data set may include a collection of citations that represent academic publications, and multiple citations within the collection may represent the same academic publication. However, because these citations may originate from a variety of sources, the citations that represent the same academic publication may differ from one another. In particular, the citations may include numerous variations, such as listing all authors or only some of the authors, using abbreviations, including or excluding different elements (e.g., author, title, venue, volume information, page information, publication date, etc.), including misspellings, and reordering elements, to name a few.
Recognizing these different (and possibly erroneous) representations of the same items facilitates consolidating and cleaning the data and creating cohesion in the data. In some cases, particular applications can be applied only after the representations of items in the data have been matched. However, it is difficult to obtain high accuracy when matching different representations of the same item. The difficulty is further exacerbated when matching is to be performed over a large collection of data.
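The difficulty described above can be illustrated with a minimal sketch. The three citations below are hypothetical and invented purely for illustration, and the token-overlap (Jaccard) score is a generic similarity measure chosen for this example, not a technique taken from this document. The sketch shows that an exact string comparison fails for two citations of the same publication, while a simple token-based score separates them from an unrelated citation only partially.

```python
import re


def tokens(citation):
    """Lowercase the citation and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", citation.lower()))


def jaccard(a, b):
    """Token-set overlap: |intersection| / |union| of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty citations are treated as identical
    return len(ta & tb) / len(ta | tb)


# Hypothetical citations: cite_a and cite_b are meant to denote the same
# publication (abbreviated venue, extra author, missing pages, reordered
# names); cite_c denotes a different publication.
cite_a = ("J. Smith and A. Jones. Matching Records at Scale. "
          "Proc. of the Example Conf., pp. 1-10, 2001.")
cite_b = ("Smith, J., Jones, A., Lee, K. Matching records at scale. "
          "Example Conference, 2001.")
cite_c = ("R. Brown. A Different Paper Entirely. "
          "Journal of Examples, vol. 3, 1999.")

exact_match = cite_a == cite_b      # False: the strings differ
sim_ab = jaccard(cite_a, cite_b)    # same publication, partial overlap
sim_ac = jaccard(cite_a, cite_c)    # different publication, low overlap

print(exact_match, round(sim_ab, 3), round(sim_ac, 3))
```

Even in this small case, the score for the two matching citations is well below 1, so any fixed threshold trades missed matches against false matches. Comparing every pair in this way also grows quadratically with the size of the collection, which is one reason matching over large data sets is harder still.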