Social networks have become repositories for massive quantities of personal data, including users' job titles, skills and qualifications, current and previous employers, education, and other information. A key impediment to effectively using this data, however, is that users can enter it into their network profiles in any format and language. This lack of standardization makes the data difficult to search, analyze, and aggregate. A prerequisite for doing so effectively is the ability to recognize data variants, including variants in different languages, that are semantically equivalent.
In one approach for identifying data variants that are semantically equivalent, data from different languages is treated independently. For each language, a person manually reviews a collection of user-entered data, defines a data term or phrase that is representative of multiple variants of the user-entered data, and creates a look-up table that maps the user-entered data to the representative data term or phrase. However, this approach can be extremely time-consuming, and the results may be limited to the user-entered data variants that have been manually mapped.
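By way of illustration only, the manual look-up-table approach described above might be sketched as follows. The table contents, the language codes, and the names `LOOKUP_TABLES`, `normalize`, and `lookup` are hypothetical and are not taken from any particular system; the whitespace and case folding is likewise an assumption added for the example.

```python
from typing import Optional

# Hypothetical hand-built tables, one per language, each maintained
# independently by a human reviewer.
LOOKUP_TABLES: dict[str, dict[str, str]] = {
    "en": {
        "software engineer": "Software Engineer",
        "sw engineer": "Software Engineer",
        "software eng.": "Software Engineer",
    },
    "de": {
        "softwareentwickler": "Softwareentwickler",
        "software-entwickler": "Softwareentwickler",
    },
}


def normalize(raw: str) -> str:
    """Lowercase the entry and collapse runs of whitespace (assumed step)."""
    return " ".join(raw.lower().split())


def lookup(raw: str, language: str) -> Optional[str]:
    """Return the representative term for a user entry, or None if the
    variant has not been manually mapped for that language."""
    table = LOOKUP_TABLES.get(language, {})
    return table.get(normalize(raw))
```

In this sketch, a mapped variant such as "SW  Engineer" resolves to its representative term, while an unmapped variant such as "Sr. Software Engineer" yields no result. A German entry is also not recognized when queried against the English table, which reflects the limitation that each language is treated independently.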