1. Field of the Invention
This invention pertains in general to systems and methods for identifying specific information referenced in a selection of text, and more specifically to a concept synonym matching engine for identifying and extracting concepts referenced in a selection of text and matching these to defined concepts in the presence of errors or variations in the description of those concepts.
2. Description of the Related Art
Identifying concepts described in text is fundamental to the problem of building intelligent conceptual search engines. A common problem in concept identification or matching systems is the difficulty of correctly identifying a concept cited in a selection of text and matching the concept to a separate set of defined concepts, especially where the selection of text is likely to include some variations or errors. For example, one of the most difficult and time-consuming tasks in corporate staffing is the screening of hundreds or thousands of resumés to find the right candidate for a particular job. With the numerous job search websites available today, companies now have access to a very large resumé base, and thus a large set of potential candidates to fill their job openings. However, actually identifying the resumés of the most qualified candidates within a very large database by searching for key terms within the resumés can be a great challenge. An employer ideally would like to define a set of concepts of interest or desired features in a job description (e.g., attendance at a particular university, a grade point average above a certain value, employment at particular companies, particular types of job experience or abilities, and the like), and then automatically and in real time identify and match those concepts to the resumés. However, if the system is not effective at matching concepts, the employer may risk missing a number of potentially good candidates, and thus risk project delays and missed deals while the candidate search progresses. Employers may also receive numerous poor matches for the job description, through which the employer must spend time and money sifting to find the few good candidates buried within the pile of resumés.
This problem can be accentuated when the selections of text through which a matching system searches originate from a number of sources and the text is unstructured, making it more difficult to search for selections of text within these various documents. For example, in a job search, an employer may receive resumés from numerous different sources, in many different formats, using different types of fonts, with different textual arrangements on a page, and the like. An employer may receive resumés as hard copies, by e-mail, through job search websites (e.g., formatted according to the job search website's requirements), through the enterprise's own job search system (e.g., formatted according to the enterprise's own requirements), and the like. Matching systems may not be able to identify specific concepts amongst the unstructured text or the various document formats.
Additionally, the identification and matching must commonly be performed in the presence of errors or variations in the description of those concepts. If the system is unable to recognize misspelled words or cannot equate the different terms and abbreviations that may be used to describe one concept (e.g., “University of California, Berkeley” or “UC Berkeley” in a resumé), the system may again miss numerous proper matches.
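By way of illustration only, the following minimal Python sketch shows the kind of variation described above: several surface forms, including a misspelled one, that should all resolve to a single defined concept. The synonym table, the function names `normalize` and `match_concept`, and the similarity cutoff of 0.85 are hypothetical and chosen for this example; the approximate string comparison from the Python standard library is used as a stand-in and is not a description of any particular matching system.

```python
import difflib
import re

# Hypothetical synonym table: each normalized surface form maps to one
# canonical concept. The entries are illustrative assumptions.
SYNONYMS = {
    "university of california berkeley": "UC Berkeley",
    "uc berkeley": "UC Berkeley",
}


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(cleaned.split())


def match_concept(text, cutoff=0.85):
    """Return the canonical concept for `text`, or None if nothing is close.

    An exact lookup on the normalized form is tried first. If that fails,
    the closest known surface form is accepted only when its similarity
    ratio is at least `cutoff` (an assumed threshold).
    """
    key = normalize(text)
    if key in SYNONYMS:
        return SYNONYMS[key]
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    return SYNONYMS[close[0]] if close else None


if __name__ == "__main__":
    for phrase in (
        "University of California, Berkeley",
        "UC Berkeley",
        "Univeristy of California, Berkely",  # misspelled variant
        "Stanford University",                # not in the table
    ):
        print(phrase, "->", match_concept(phrase))
```

A system limited to the exact lookup in this sketch would miss the misspelled variant, which is the failure mode described above.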
Classification technologies currently used for some types of matching are able to perform broad generalizations and high-level matching of concepts in a selection of text. However, these technologies tend to fail when required to search through very short sentences or strings of text, and they often have trouble performing a match that involves only a few very specific words in a selection of text. Natural language processing technologies are also commonly used in the concept matching context. However, these technologies commonly require some sort of structure in the text (e.g., noun phrases and verb phrases in a subject, action, object sentence structure or other typical types of structures for text). The natural language processing technologies require this structure to be able to determine the parts of speech in the text and to extract concepts. Thus, these technologies are unable to reliably extract concepts from a series of words or a string of text that is unstructured, such as a set of words separated by commas that might be used in a resumé to define a list of skills of the job candidate. In addition, while these natural language technologies may be able to identify some terms within a selection of text, they are typically not meant to match the text or its terms against a taxonomy or a previously-defined concept (e.g., matching skills in a resumé against a pre-defined collection of skill concepts).