Not Applicable
1. Technical Field
The present invention relates in general to a method and apparatus for making predictions about entities represented in text documents. It more particularly relates to a more effective and accurate method and apparatus for the analysis and retrieval of text documents, such as employment résumés, job postings or other documents contained in computerized databases.
2. Background Art
The challenge for personnel managers is not just to find qualified people. A job change is expensive for the old employee, the new employee, and the employer. It has been estimated that the total cost for all three may, in some instances, be as much as $50,000. To reduce these costs, it is important for personnel managers to find well-matched employees who will stay with the company as long as possible and who will rise within the organization.
Personnel managers once largely relied on résumés from unsolicited job applications and replies to newspaper help-wanted advertisements. This presented a number of problems. One problem has been that the number of résumés from these sources can be large and can require significant skilled-employee time even for sorting. Résumés received unsolicited or in response to newspaper advertisements would present primarily a local pool of job applicants. Frequently most of the résumés are from people unsuited for the position. Also, a résumé oftentimes only described an applicant's past and present and did not predict longevity or promotion path.
One attempt at finding a solution to the oftentimes perplexing problems of locating qualified, long-term employees has been to resort to outside parties, such as temporary agencies and head-hunters. The first temporary agency started in approximately 1940 (Kelly Girl, now Kelly Services having a website at www.kellyservices.com) by supplying lower-level employees to business. Temporary agencies now offer more technical and high-level employees. The use of head-hunters and recruiters for candidate searches is commonplace today. While this approach to finding employees may simplify hiring for a business, it does not simplify the problem of efficiently finding qualified people. It merely moves the problem from the employer to the intermediary. It does not address finding qualified employees who will remain with, and rise within, the company.
In recent years, computer bulletin boards and internet newsgroups have appeared, enabling a job-seeker to post a résumé or an employer to post a job posting, which is an advertisement of a job opening. These bulletin boards and internet newsgroups, such as those found at services identified as misc.jobs.resumes and misc.jobs.offered, are collectively known as “job boards.” More recently, World Wide Web sites have been launched for the same purpose. For example, there are websites at www.jobtrak.com and www.monster.com.
On internet job boards, the geographic range of applicants has widened, and the absolute number of résumés for a typical personnel manager to examine has greatly increased. At the same time, the increasing prevalence of submission of résumés in electronic format in response to newspaper advertisements and job board postings has increased the need to search in-house computerized databases of résumés more efficiently and precisely. With as many as a million résumés in a database such as the one found at the website www.monster.com, the sheer number of résumés to review presents a daunting task. Because of the ubiquity of computer databases, the need to search efficiently and to select a single document or a few documents out of many has become a substantial problem. Such a massive text document retrieval problem is not by any means limited to résumés. The massive text document retrieval problem has been addressed in various ways.
For example, reference may be made to the following U.S. patents: U.S. Pat. No. 4,839,853, COMPUTER INFORMATION RETRIEVAL USING LATENT SEMANTIC STRUCTURE; U.S. Pat. No. 5,051,947, HIGH-SPEED SINGLE-PASS TEXTUAL SEARCH PROCESSOR FOR LOCATING EXACT AND INEXACT MATCHES OF A SEARCH PATTERN IN A TEXTUAL STREAM; U.S. Pat. No. 5,164,899, METHOD AND APPARATUS FOR COMPUTER UNDERSTANDING AND MANIPULATION OF MINIMALLY FORMATTED TEXT DOCUMENTS; U.S. Pat. No. 5,197,004, METHOD AND APPARATUS FOR AUTOMATIC CATEGORIZATION OF APPLICANTS FROM RESUMES; U.S. Pat. No. 5,301,109, COMPUTERIZED CROSS-LANGUAGE DOCUMENT RETRIEVAL USING LATENT SEMANTIC INDEXING; U.S. Pat. No. 5,559,940, METHOD AND SYSTEM FOR REAL-TIME INFORMATION ANALYSIS OF TEXTUAL MATERIAL; U.S. Pat. No. 5,619,709, SYSTEM AND METHOD OF CONTEXT VECTOR GENERATION AND RETRIEVAL; U.S. Pat. No. 5,592,375, COMPUTER-ASSISTED SYSTEM FOR INTERACTIVELY BROKERING GOODS FOR SERVICES BETWEEN BUYERS AND SELLERS; U.S. Pat. No. 5,659,766, METHOD AND APPARATUS FOR INFERRING THE TOPICAL CONTENT OF A DOCUMENT BASED UPON ITS LEXICAL CONTENT WITHOUT SUPERVISION; U.S. Pat. No. 5,796,926, METHOD AND APPARATUS FOR LEARNING INFORMATION EXTRACTION PATTERNS FROM EXAMPLES; U.S. Pat. No. 5,832,497, ELECTRONIC AUTOMATED INFORMATION EXCHANGE AND MANAGEMENT SYSTEM; U.S. Pat. No. 5,963,940, NATURAL LANGUAGE INFORMATION RETRIEVAL SYSTEM AND METHOD; and U.S. Pat. No. 6,006,221, MULTILINGUAL DOCUMENT RETRIEVAL SYSTEM AND METHOD USING SEMANTIC VECTOR MATCHING.
Also, reference may be made to the following publications: “Information Extraction using HMMs and Shrinkage,” Dayne Freitag and Andrew Kachites McCallum, Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, AAAI Technical Report WS-99-11, July 1999; “Learning Hidden Markov Model Structure for Information Extraction,” Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld, Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, AAAI Technical Report WS-99-11, July 1999; “Boosted Wrapper Induction,” Dayne Freitag and Nicholas Kushmerick, to appear in Proceedings of AAAI-2000, July 2000; “Indexing by Latent Semantic Analysis,” Scott Deerwester, et al., Journal of the Am. Soc. for Information Science, 41(6):391-407, 1990; and “Probabilistic Latent Semantic Indexing,” Thomas Hofmann, EECS Department, UC Berkeley, Proceedings of the Twenty-Second Annual SIGIR Conference on Research and Development in Information Retrieval.
Each one of the foregoing patents and publications is incorporated herein by reference, as if fully set forth herein.
Early document searches were based on keywords as text strings. However, in a large database, simple keyword searches oftentimes return too many irrelevant documents, because many words and phrases have more than one meaning (polysemy). For example, being a secretary in the state department is not the same as being Secretary of State.
If only a few keywords are used, large numbers of documents are returned. Keyword searches may also miss many relevant documents because of synonymy. The writer of a document may use one word for a concept, and the person who enters the keywords uses a synonym, or even the same word in a different form, such as “Mgr” instead of “Manager.” Another problem with keyword searches is the fact that terms cannot be readily weighted.
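The polysemy and synonymy problems described above can be illustrated with a minimal sketch of a text-string keyword search. The sample documents and the `keyword_search` function are hypothetical and are provided for illustration only; they are not taken from any of the referenced systems.

```python
# Hypothetical sample documents, keyed by an arbitrary id.
documents = {
    "a": "Served as Secretary of State from 1997 to 2001.",
    "b": "Worked as a secretary in the State Department.",
    "c": "Sales Mgr for the western region.",
    "d": "Sales Manager for the eastern region.",
}

def keyword_search(docs, keywords):
    """Return ids of documents that contain every keyword as a
    case-insensitive text string."""
    wanted = [k.lower() for k in keywords]
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if all(k in text.lower() for k in wanted)
    )

# Polysemy: both documents match although they describe different jobs.
polysemy_hits = keyword_search(documents, ["secretary", "state"])

# Synonymy: the document that says "Mgr" is missed.
synonymy_hits = keyword_search(documents, ["manager"])

print(polysemy_hits, synonymy_hits)
```

In this sketch every match counts equally, so there is no way to express that one keyword matters more than another.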
Keyword searches can be readily refined by use of Boolean logic, which allows the use of logical operators such as AND, NOT, OR, and comparative operators such as GREATER THAN, LESS THAN, or EQUALS. However, it is difficult to consider more than a few characteristics with Boolean logic. Also, the fundamental problems of a text-string keyword search still remain a concern. At the present time, most search engines still use keyword or Boolean searches. These searches can become complex, but they still suffer from the intrinsic limitations of keyword searches. In short, it is not possible to find a word that is not present in a text document, and the terms cannot be weighted.
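A Boolean refinement of this kind may be sketched as follows. The record layout, the `years` field, and the `boolean_search` function are assumptions made for illustration; the sketch combines AND, OR and NOT over words with a GREATER THAN test on a numeric field.

```python
import re

# Hypothetical résumé records; "years" is an assumed numeric field.
resumes = [
    {"id": 1, "text": "Sales Manager, led a team of ten", "years": 8},
    {"id": 2, "text": "Sales Mgr for regional accounts", "years": 12},
    {"id": 3, "text": "Office manager and part-time sales clerk", "years": 3},
]

def words(text):
    """Lower-cased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def boolean_search(records, all_of=(), any_of=(), none_of=(), min_years=None):
    """AND over all_of, OR over any_of, NOT over none_of,
    and GREATER THAN min_years on the numeric field."""
    hits = []
    for rec in records:
        w = words(rec["text"])
        if not all(t in w for t in all_of):
            continue
        if any_of and not any(t in w for t in any_of):
            continue
        if any(t in w for t in none_of):
            continue
        if min_years is not None and not rec["years"] > min_years:
            continue
        hits.append(rec["id"])
    return hits

# sales AND manager AND NOT clerk
strict = boolean_search(resumes, all_of=["sales", "manager"], none_of=["clerk"])

# sales AND (manager OR mgr) AND years GREATER THAN 5
widened = boolean_search(
    resumes, all_of=["sales"], any_of=["manager", "mgr"], min_years=5
)

print(strict, widened)
```

The second query finds the “Mgr” record only because the variant was listed by hand, and the result is an unranked list in which no term carries more weight than another.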
In an attempt to overcome these problems, natural language processing (NLP) techniques, including hidden Markov models, have been applied to the problems of information extraction and retrieval. Some World Wide Web search engines, such as Alta Vista and Google, use latent semantic analysis (U.S. Pat. No. 4,839,853), which is the application of singular value decomposition to documents.
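By way of illustration only, the application of singular value decomposition to a small term-document matrix may be sketched as follows. The four sample documents, the choice of two latent dimensions, and all function names are assumptions. To keep the sketch self-contained, a simple power iteration with deflation stands in for a library singular value decomposition routine; it is not represented to be the method of the referenced patent.

```python
import math

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    nx, ny = math.sqrt(dot(x, x)), math.sqrt(dot(y, y))
    return dot(x, y) / (nx * ny) if nx and ny else 0.0

def truncated_svd(a, k, iters=300):
    """Top-k singular triplets (sigma, u, v) of a, found by
    power iteration on A^T A with deflation."""
    at = [list(col) for col in zip(*a)]
    triplets = []
    for _ in range(k):
        v = [1.0 + j for j in range(len(at))]  # fixed start vector
        for _ in range(iters):
            for _, _, prev in triplets:  # project out vectors already found
                c = dot(v, prev)
                v = [x - c * p for x, p in zip(v, prev)]
            v = matvec(at, matvec(a, v))
            norm = math.sqrt(dot(v, v))
            v = [x / norm for x in v]
        av = matvec(a, v)
        sigma = math.sqrt(dot(av, av))
        triplets.append((sigma, [x / sigma for x in av], v))
    return triplets

terms = ["manager", "mgr", "sales", "team", "python", "code"]
docs = ["sales manager team", "sales mgr team", "python code", "code python"]
# Term-document count matrix: one row per term, one column per document.
A = [[d.split().count(t) for d in docs] for t in terms]

svd = truncated_svd(A, k=2)

def doc_vector(j):
    """Coordinates of document j in the latent space."""
    return [v[j] for _, _, v in svd]

def fold_in(query):
    """Project a query's term counts into the latent space."""
    q = [query.split().count(t) for t in terms]
    return [dot(q, u) / sigma for sigma, u, _ in svd]

query = "manager"
raw = cosine([query.split().count(t) for t in terms], [row[1] for row in A])
latent = cosine(fold_in(query), doc_vector(1))
unrelated = cosine(fold_in(query), doc_vector(2))
print(raw, latent, unrelated)
```

In this toy corpus the query “manager” shares no word with the second document, which says “mgr,” so the raw cosine similarity is zero. In the two-dimensional latent space the two are nearly identical, because “manager” and “mgr” occur with the same neighboring terms, while the unrelated programming documents remain far away.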
Latent semantic analysis has also been used for cross-language document retrieval (U.S. Pat. Nos. 5,301,109 and 6,006,221), to infer the topical content of a document (U.S. Pat. No. 5,659,766), and to extract information from documents based on pattern-learning engines (U.S. Pat. No. 5,796,926). Natural Language Processing has also been used (U.S. Pat. No. 5,963,940) to extract meaning from text documents. One attempt at simplifying the problem for résumés was a method for categorizing résumés in a database (U.S. Pat. No. 5,197,004).
These techniques have generated improved search results as compared to prior known techniques, but matching a job posting with a résumé remains difficult, and results are imperfect. If these techniques are applied to a massive number of résumés and job postings, they provide only a coarse categorization of a given résumé or job posting. Such techniques are not capable of determining the suitability of a candidate for a given position. For example, certain characteristics such as the willingness and ability for employment longevity or likelihood for moving along a job promotion path may be important to an employer or a candidate.
Therefore, it would be highly desirable to have a new and improved technique for information analysis of text documents from a large number of such documents in a highly effective and efficient manner and for making predictions about entities represented by such documents. Such a technique should be useable for résumés and job postings, but could also be used, in general, for many different types and kinds of text documents, as will become apparent to those skilled in the art.
Therefore, the principal object of the present invention is to provide a new and improved method and apparatus for making predictions about entities represented by documents.
Another object of the present invention is to provide a new and improved method for information analysis of text documents to find high-quality matches, such as between résumés and job postings, and for retrieval from large numbers of such documents in a highly effective and efficient manner.
A further object of the present invention is to provide such a new and improved method and apparatus for selecting and retrieving from a large database of documents, those documents containing representations of entities.
Briefly, the above and further objects are realized by providing a new and improved method and apparatus for information analysis of text documents where predictive models are employed to make forecasts about entities represented in the documents. In the employment search example, the suitability of a candidate for a given employment opportunity can be predicted, as well as certain other characteristics that might be desired by an employer, such as the employment longevity or expected next promotion.
A method and apparatus are disclosed for information analysis of text documents or the like, from a large number of such documents. Predictive models are executed responsive to variables derived from canonical documents to determine documents containing desired attributes or characteristics. The canonical documents are derived from standardized documents, which, in turn, are derived from original documents.
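The chain of derivation summarized above (original document, standardized document, canonical document, variables, predictive model) may be pictured with the following minimal sketch. Every detail in it is an assumption made for illustration: the field layout, the abbreviation table, the two variables, and the logistic model with made-up weights are hypothetical and are not taken from the disclosure.

```python
import math
import re

# Hypothetical abbreviation table used to normalize job titles.
CANONICAL_WORDS = {"mgr": "manager", "sr": "senior", "engr": "engineer"}

def standardize(original):
    """Original document -> standardized document:
    split raw 'Label: value' lines into named fields."""
    fields = {}
    for line in original.strip().splitlines():
        label, _, value = line.partition(":")
        fields[label.strip().lower()] = value.strip()
    return fields

def canonicalize(standardized):
    """Standardized document -> canonical document:
    normalize vocabulary and convert values to fixed types."""
    title_words = re.findall(r"[a-z]+", standardized.get("title", "").lower())
    title = " ".join(CANONICAL_WORDS.get(w, w) for w in title_words)
    years = float(standardized.get("years", "0"))
    return {"title": title, "years": years}

def derive_variables(canonical):
    """Canonical document -> numeric variables for a model."""
    is_manager = 1.0 if "manager" in canonical["title"].split() else 0.0
    return [is_manager, canonical["years"]]

def predict(variables, weights=(0.8, 0.1), bias=-1.0):
    """Toy logistic model with made-up weights; returns a score in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, variables))
    return 1.0 / (1.0 + math.exp(-z))

original = """
Title: Sr. Sales Mgr
Years: 12
"""

canonical = canonicalize(standardize(original))
score = predict(derive_variables(canonical))
print(canonical, score)
```

Separating the stages in this way means the model only ever sees variables computed from a uniform representation, regardless of how the original document was worded or laid out.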