1. Field
This relates generally to normalizing massive amounts of data, and more specifically to a method of normalizing massive amounts of raw data using labeled n-grams.
2. Description of Related Art
Social networks have become repositories for massive quantities of personal data, including users' job titles, current and previous employers, education, and other information. This data can be used for many purposes, including recruiting. A key impediment to effectively using this data, however, is that the data can be entered by users into their network profiles in any format. As a result, there is no standardization of the data. For example, the same job title may be entered in multiple formats, using different spellings, different abbreviations, or even different words. Accordingly, a recruiter or demographer searching user profiles for a job title, e.g., “Registered Nurse,” may not find users with the titles “R.N.” or “Reg. Nurse,” although they are semantically equivalent. This lack of standardization makes it difficult to search, analyze, and aggregate the data. Thus, a prerequisite for effectively searching, analyzing, and aggregating the data is the ability to recognize data variants having the same meaning as equivalent.
In one approach for identifying data variants that are semantically equivalent, a person may manually review a collection of user-entered data, define a data term or phrase that is representative of multiple variants of user-entered data, and create a look-up table that maps the user-entered data to a representative data term or phrase. However, this approach can be extremely time-consuming and the results may be limited to user-entered data variants that have been manually mapped.
What is needed is an efficient method and system for recognizing and identifying data variations in massive quantities of data to enable effective searching, analysis, and aggregation of the variations.