The exemplary embodiment relates to entity extraction and finds particular application in connection with the extraction of entities from semi-structured documents, such as resumes.
Recruiting companies and large enterprises receive hundreds of resumes every day. In order to classify, search, or sort the resumes, recruiters use dedicated tools called Applicant Tracking Systems (ATS). These tools populate a database with information about candidates, such as personal information, skills, and companies they have worked for. This information may be captured in different ways, such as manual entry by a recruiter, manual entry by the candidate on a recruitment platform, or automatic extraction from the candidate's resume.
While a resume is a well-defined document, with fairly standard sections (personal information, education, experience, etc.), the format and presentation may vary widely. Also, multiple file formats are possible (PDF, Microsoft Office Word document, text file, HTML, etc.), the order of the sections may vary (e.g., the education section may be at the beginning or at the end), and the content of the sections may take many different forms (lists of structured paragraphs, tables, full sentences, or lists of words). Additionally, depending on the professional domain, there may be a specific vocabulary or style. It is therefore difficult to ensure that all the pertinent information is extracted.
Various methods of information extraction have been used to extract information from text, such as rule-based and machine learning methods. For an overview, see Klügl, “Context-specific Consistencies in Information Extraction,” Würzburg University Press, 2014, which discusses resumes as a particular example. That work describes the UIMA Ruta rule-based system for information extraction and general natural language processing, as well as machine learning techniques based on Conditional Random Fields (CRF) and extensions.
One approach known as “stacked Conditional Random Fields” exploits context-specific consistencies by combining two linear-chain CRFs in a stacked learning framework. The first CRF adds high-quality features to the input of a second CRF. Both CRFs work on the same data, but with different feature sets. While this method improves the results over a single CRF, it is less effective when the entities are far apart, as in a resume where a single block of text describing the candidate's experience can be several hundred words long.
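The data flow of such a stacked arrangement can be illustrated with a minimal sketch. It is not a CRF implementation: both stages are trivial rule-based stand-ins, and the function names, the "ORG" label, and the "same surface form was tagged elsewhere in the document" feature are all assumptions chosen for illustration. The sketch only shows how the output of a first tagger may be turned into document-level consistency features for a second tagger that works on the same tokens.

```python
def stage_one(tokens):
    """Stand-in for the first CRF (hypothetical rule): a capitalized
    token directly after the word 'at' is tagged ORG."""
    labels = []
    for i, tok in enumerate(tokens):
        is_org = i > 0 and tokens[i - 1].lower() == "at" and tok[:1].isupper()
        labels.append("ORG" if is_org else "O")
    return labels


def consistency_features(tokens, labels):
    """Derive one extra boolean feature per token from the first-stage
    output: was this surface form tagged ORG anywhere in the document?"""
    org_forms = {tok for tok, lab in zip(tokens, labels) if lab == "ORG"}
    return [tok in org_forms for tok in tokens]


def stage_two(first_labels, seen_as_org):
    """Stand-in for the second CRF: same tokens, but with the
    first-stage labels and the consistency feature as input."""
    return [
        "ORG" if lab == "ORG" or flag else "O"
        for lab, flag in zip(first_labels, seen_as_org)
    ]


def stacked_tag(tokens):
    """Run both stages on the same token sequence."""
    first = stage_one(tokens)
    feats = consistency_features(tokens, first)
    return stage_two(first, feats)


tokens = "Engineer at Acme 2010-2014 . Acme promoted me in 2012".split()
first_pass = stage_one(tokens)
final = stacked_tag(tokens)
print(list(zip(tokens, first_pass, final)))
```

In this toy input, the second mention of "Acme" has no local cue, so the first stage misses it, and the second stage recovers it through the consistency feature. The limitation noted above still applies: such features help only when the first stage finds at least one reliable instance to propagate.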
Methods to address the wide spacing of entities can involve higher-order models, such as Comb-chain CRF and Skyp-chain CRF, which add long-range dependencies. Comb-chain CRF uses a classifier trained to detect the boundaries of the entities. The output functions influence the model to assign a higher likelihood to label sequences that conform with the description of the classifier. Skyp-chain CRF is a variant of skip-chain CRF, but instead of creating additional edges between labels whose tokens are similar or identical, it adds long-range dependencies based on the patterns occurring in the predicted label sequence and the classification result. These two extensions of CRF provide improvements in performance over linear-chain CRF for both resume and reference extraction. However, they both rely on boundary detection to detect consistencies, which can miss some entities.
One system designed for entity detection in resumes is described in Amit Singh, et al., “PROSPECT: a system for screening candidates for recruitment,” Proc. 19th ACM Int'l Conf. on Information and Knowledge Management, pp. 659-668, 2010. The PROSPECT system is a web portal for screening candidates. The system includes text extraction, resume segmentation, and information extraction components, which use approaches such as SVMs or CRFs. The number of years of experience in a specific domain is computed and used to rank resumes, with higher scores given to the resumes that more closely match the requirements of the job offer (e.g., “at least 6 years of J2EE experience”).
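The experience-based ranking described above can be sketched as follows. This is an illustrative sketch only and is not taken from the PROSPECT paper: the position records, the capped ratio used as a score, and all function names are assumptions, and overlapping positions are simply summed.

```python
from datetime import date


def months_between(start, end):
    """Whole months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def years_of_experience(positions, skill, today):
    """Sum the duration of every position that lists the skill.
    An end date of None means the position is still held."""
    total_months = 0
    for pos in positions:
        if skill in pos["skills"]:
            end = pos["end"] or today
            total_months += months_between(pos["start"], end)
    return total_months / 12


def rank_resumes(resumes, skill, required_years, today):
    """Score each resume by how closely it meets the requirement
    (assumed score: years / required, capped at 1.0), best first."""
    scored = []
    for name, positions in resumes.items():
        years = years_of_experience(positions, skill, today)
        score = min(years / required_years, 1.0)
        scored.append((name, years, score))
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Hypothetical, already-extracted data for two candidates.
TODAY = date(2015, 1, 1)
RESUMES = {
    "candidate_a": [
        {"start": date(2006, 1, 1), "end": date(2010, 1, 1), "skills": {"J2EE", "SQL"}},
        {"start": date(2010, 1, 1), "end": None, "skills": {"J2EE"}},
    ],
    "candidate_b": [
        {"start": date(2008, 1, 1), "end": date(2012, 1, 1), "skills": {"Python"}},
        {"start": date(2012, 1, 1), "end": None, "skills": {"J2EE"}},
    ],
}

# Requirement modeled on "at least 6 years of J2EE experience".
ranking = rank_resumes(RESUMES, "J2EE", 6, TODAY)
for name, years, score in ranking:
    print(f"{name}: {years:.1f} years, score {score:.2f}")
```

The sketch presupposes that dates and skills have already been extracted correctly, so any entity missed in the extraction step directly lowers the computed experience and the resulting rank.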
A cascaded model for entity extraction is described in Kun Yu, et al., “Resume information extraction with cascaded hybrid model,” Proc. 43rd ACL Annual Meeting, pp. 499-506, 2005. A first step entails segmentation into identified sections (personal information, education, experience). Entity extraction is performed in a second step. Based on the type of section, the most appropriate entity extraction technique is used, either HMM or classification.
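The two-step cascade can be sketched as a segmenter followed by a per-section dispatch. The sketch below is a simplified illustration under stated assumptions: section headings are matched literally, and simple regular expressions stand in for the HMM and classification models of the cited work. All names, patterns, and the sample resume are hypothetical.

```python
import re

# Assumed heading text mapped to section type.
SECTION_HEADINGS = {
    "personal information": "personal",
    "education": "education",
    "experience": "experience",
}


def segment(text):
    """Step 1: split the resume into sections keyed by section type."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        key = SECTION_HEADINGS.get(line.lower())
        if key:
            current = key
            sections[current] = []
        elif current and line:
            sections[current].append(line)
    return sections


def extract_personal(lines):
    """Stand-in extractor: e-mail addresses."""
    pattern = r"[\w.+-]+@[\w-]+\.[\w.-]+"
    return {"emails": [m for line in lines for m in re.findall(pattern, line)]}


def extract_education(lines):
    """Stand-in extractor: lines that mention a degree."""
    return {"degrees": [l for l in lines if re.search(r"\b(BSc|MSc|PhD)\b", l)]}


def extract_experience(lines):
    """Stand-in extractor: (start year, end year) pairs."""
    pattern = r"\b((?:19|20)\d{2})\s*-\s*((?:19|20)\d{2})\b"
    return {"periods": [m for line in lines for m in re.findall(pattern, line)]}


# Step 2: the extractor is chosen by section type.
EXTRACTORS = {
    "personal": extract_personal,
    "education": extract_education,
    "experience": extract_experience,
}


def extract(text):
    """Run the cascade: segment first, then extract per section."""
    return {name: EXTRACTORS[name](lines) for name, lines in segment(text).items()}


RESUME = """Personal Information
Jane Doe
jane.doe@example.com

Education
MSc in Computer Science, 2008

Experience
Acme Corp, 2008-2012
Initech, 2012 - 2015
"""

result = extract(RESUME)
print(result)
```

A design consequence of the cascade is that errors propagate: a section that the first step fails to identify, or assigns to the wrong type, is never seen by the appropriate extractor in the second step.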
A system that automatically extracts information from resumes to populate a database and allow searching is described in Kopparapu, “Automatic extraction of usable information from unstructured resumes to aid search,” 2010 IEEE Int'l Conf. on Progress in Informatics and Computing (PIC), vol. 1, pp. 99-103, 2010. Other work on resumes is described in Kaczmarek, “Information Extraction from CV,” Proc. Business Information Systems, pp. 1-7, 2005; and Maheshwari et al., “An approach to extract special skills to improve the performance of resume selection,” Databases in Networked Information Systems, pp. 256-273, 2010. Work has also been done on information extraction from job descriptions. See, Ciravegna, et al., “Learning Pinocchio—Adaptive Information Extraction,” Natural Language Engineering 1 (1): 1-21, 2001.
Many of these systems do not have reliably good performance and tend to miss some of the information sought.