The present invention relates generally to methods and systems for automated text processing, and specifically to methods for automated coding of textual data.
The tasks involved in conducting a large-scale survey, such as a population census, generally fall into three essential stages:
Data collection, generally either by filling out paper forms or electronic data entry;
Data coding, in which data collected in free text form are converted into unambiguous codes, typically numbers or alphanumeric values; and
Data analysis.
The present patent application is concerned with the coding stage. In response to a given question, such as “What is your occupation?”, there are typically many different answers that can correspond to the same code. As a simple example, the responses “I drive heavy trucks” and “driver of a heavy truck” should receive the same code. A computer, however, will have a difficult time recognizing this fact. Because of such ambiguities, coding has not generally been automated up to now. The personnel engaged to perform the coding must have a high level of expertise, including familiarity with coding procedures and with a large catalog of codes that is typically provided for this purpose. For example, coders must know whether such job descriptions as “childcare worker,” “babysitter,” “nanny” and “playgroup assistant” fall under the same coding classification or different ones. The same coder must be capable of coding “semi-trailer driver” and “driver of a heavy truck.” Because of the huge volume of data to be coded, with relatively little computer assistance, and the high level of skill that is required, the coding stage is generally the single most expensive activity in a census.
The Inference Group, of Manuka, Australia, offers a system known as “Precision Data” for automated coding of textual data. The system is described at www.inferencegroup.com.au. Precision Data offers two types of automated coding: automatic coding, performed by a computer strictly without human intervention, giving either one or no answer; and computer-assisted coding, wherein the computer output may be zero, one or several answers. In the latter case, a human coder must choose a code from a list suggested by the system. Precision Data is based on a coding engine, which is described as a “semi-linguistic” system. The engine parses input phrases, looks up words and other objects in a dictionary, and calculates a confidence level. The dictionary links the words to a classification index. A selection algorithm is then used to determine if there is an acceptable coding match. Coding parameters can be set to control how strict or loose a match must be in order to be acceptable. The system has a user interface with different levels of user access.
It is an object of the present invention to provide improved methods and systems for automated coding of textual data.
It is a further object of some aspects of the present invention to provide interactive methods for automated coding that make more efficient use of human coding resources.
It is still a further object of some aspects of the present invention to provide methods and systems for automatic coding of textual data with enhanced accuracy and speed.
In some preferred embodiments of the present invention, an automatic text coding system receives a collection of reference phrases along with their corresponding codes, which have been assigned by one or more human experts. After preprocessing the text to remove superfluous words and characters, the system analyzes the phrases to generate respective code lists for all of the remaining words. The code list for any given word includes the codes assigned to all of the phrases in which the word appeared. Preferably, a weight is assigned to each code in the code list, which reflects the likelihood of the code being the correct one when the given word appears in an unknown phrase. Thus, the system prepares the code lists substantially autonomously, based on coding results known to be correct.
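By way of illustration only, the generation of code lists from reference phrases might be sketched as follows. The stop-word list, the example codes "T01" and "M07", the function names, and the choice of weight (the fraction of reference phrases containing the word that received a given code) are all assumptions made for this sketch; the description above requires only that the weight reflect the likelihood of the code being correct.

```python
from collections import Counter, defaultdict

# Hypothetical stop-word list; the text says only that superfluous
# words and characters are removed.
STOP_WORDS = {"a", "an", "the", "of", "i"}


def preprocess(phrase):
    """Lowercase, replace punctuation with spaces, and drop stop words."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in phrase.lower()
    )
    return [word for word in cleaned.split() if word not in STOP_WORDS]


def build_code_lists(reference):
    """Map each word to {code: weight} from (phrase, code) pairs.

    Assumed weight: the share of reference phrases containing the word
    that were assigned the code by the human expert.
    """
    counts = defaultdict(Counter)
    for phrase, code in reference:
        # Count each word once per phrase, even if it repeats.
        for word in set(preprocess(phrase)):
            counts[word][code] += 1
    return {
        word: {code: n / sum(by_code.values()) for code, n in by_code.items()}
        for word, by_code in counts.items()
    }


# Illustrative reference collection with made-up codes.
reference = [
    ("I drive heavy trucks", "T01"),
    ("driver of a heavy truck", "T01"),
    ("heavy machinery mechanic", "M07"),
]
code_lists = build_code_lists(reference)
```

In this sketch the word "heavy" appears under both codes, so its code list carries two weighted entries, while "truck" appears under only one. No stemming is applied, so "truck" and "trucks" are treated as distinct words.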
The system subsequently uses the code lists to code further phrases whose coding is not known a priori. For each phrase, the system computes a respective cumulative matching score for each of the codes that appears in the code list of one or more of the words in the phrase. The matching score of a given code is determined by summing the weights listed for that code in the code lists of all of the words in the phrase (although the weight may be zero in some of the code lists). Preferably, the sum is weighted to account for factors such as the order of the words in the phrase. When the system finds that for a given phrase, one of the codes has a cumulative matching score much higher than the score of any other code, it unequivocally selects the code with the highest score. Furthermore, if the phrase exactly matches one of the phrases in the collection that was coded by human experts, the code assigned by the expert is preferably selected automatically.
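The scoring and selection just described might be sketched as follows, purely as an example. The geometric position factor, the thresholds `min_score` and `min_ratio`, the sample weights, and the function names are assumptions introduced here; the description states only that the sum is preferably weighted for factors such as word order and that the winning score must be much higher than the others.

```python
def matching_scores(words, code_lists, position_decay=0.9):
    """Sum each code's weights over the words of a phrase.

    Assumption: the word at index i is scaled by position_decay ** i,
    one simple way of letting word order influence the sum.
    """
    scores = {}
    for i, word in enumerate(words):
        for code, weight in code_lists.get(word, {}).items():
            scores[code] = scores.get(code, 0.0) + weight * position_decay ** i
    return scores


def select_code(phrase, code_lists, exact_matches, min_score=0.5, min_ratio=2.0):
    """Return a code when the choice is clear-cut, otherwise None."""
    key = phrase.lower().strip()
    # A phrase already coded by a human expert takes the expert's code.
    if key in exact_matches:
        return exact_matches[key]
    scores = matching_scores(key.split(), code_lists)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        return None
    # Accept only if the best score dominates the runner-up.
    if len(ranked) == 1 or ranked[0][1] >= min_ratio * ranked[1][1]:
        return ranked[0][0]
    return None


# Illustrative code lists and expert-coded phrases (made-up values).
code_lists = {
    "heavy": {"T01": 0.6, "M07": 0.4},
    "truck": {"T01": 1.0},
    "mechanic": {"M07": 1.0},
}
exact_matches = {"driver of a heavy truck": "T01"}

clear = select_code("heavy truck", code_lists, exact_matches)
ambiguous = select_code("heavy", code_lists, exact_matches)
```

Here "heavy truck" yields one dominant code, whereas "heavy" alone leaves two codes with comparable scores, so the function returns None and the phrase would be handled by one of the interactive paths.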
In some preferred embodiments of the present invention, if there are a number of candidate codes for a given phrase that have roughly comparable cumulative scores, the system passes the phrase to a human specialist. Typically, multiple specialists are available, each with a particular field or fields of expertise. The system automatically chooses the most appropriate specialist, typically one who is expert in a category to which the candidate code with the highest score belongs. The system presents the human specialist with the candidate code or codes in the specialist's field of expertise. The specialist verifies or rejects the code (or indicates that he or she is unable to decide). In the case of rejection, if the next candidate code is in a different category, the phrase is passed on to another specialist with expertise in that category. The system thus makes optimal use of the human resources at its disposal, increasing the speed at which ambiguous phrases can be handled while reducing the level of training and ability required of most of the human operators.
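The routing of an ambiguous phrase among specialists might be sketched as follows. The function name, the `ask` callback standing in for the specialist's workstation, the category and specialist names, and the treatment of an "unable to decide" answer as equivalent to a rejection are assumptions made for this illustration.

```python
def route_for_verification(candidates, category_of, specialists, ask):
    """Offer candidate codes, best first, to the matching specialist.

    candidates:  codes sorted by descending matching score
    category_of: mapping from code to its category
    specialists: mapping from category to a specialist identifier
    ask:         callback ask(specialist, code) returning True (verified),
                 False (rejected) or None (unable to decide)
    Returns the verified code, or None if no candidate was verified.
    """
    for code in candidates:
        specialist = specialists.get(category_of[code])
        if specialist is None:
            continue  # no specialist is available for this category
        if ask(specialist, code):
            return code
        # Assumption: a rejection or an undecided answer both move on
        # to the next candidate, possibly with a different specialist.
    return None


# Illustrative data with made-up codes, categories and specialists.
category_of = {"T01": "transport", "M07": "mechanical"}
specialists = {"transport": "alice", "mechanical": "bob"}
asked = []


def ask(specialist, code):
    """Stand-in for a workstation dialog; verifies only code M07."""
    asked.append((specialist, code))
    return code == "M07"


verified = route_for_verification(["T01", "M07"], category_of, specialists, ask)
```

In this example the first specialist rejects the top candidate, the next candidate belongs to a different category, and the phrase therefore moves to the second specialist. A return value of None would correspond to passing the phrase on for manual coding.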
It may also occur that the system is unable to find any codes with sufficient cumulative weights, or that there is an excessive number of codes, or that the chosen specialist (or specialists) rejected all of the candidate codes or was unable to reach a decision. In such a case, the phrase is passed to an expert human operator for manual coding. Optionally, methods of natural language processing, as are known in the art, are first applied in order to classify the field of the phrase, so that it can be routed to an operator with the appropriate field of expertise. Preferably, after the phrase has been coded, the phrase and its assigned code are added to the collection of reference phrases with known codes. The assigned code, with the appropriate weights, is then automatically added to the code lists of the words in the phrase, as described above. In this manner, the system automatically learns from the phrases that it was unable to code automatically.
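The learning step might be sketched as follows, under the assumption that the system stores raw word-and-code counts and derives weights from them on demand, so that adding a newly coded phrase is a cheap incremental update. The class name `CodeListStore`, its methods, and the example codes are hypothetical, and the relative-frequency weight is only one possible choice.

```python
from collections import Counter, defaultdict


class CodeListStore:
    """Reference phrases plus per-word code counts (illustrative only)."""

    def __init__(self):
        self.reference = {}                 # phrase -> assigned code
        self.counts = defaultdict(Counter)  # word -> Counter of codes

    def learn(self, phrase, code):
        """Add a human-coded phrase and update the affected code lists."""
        key = phrase.lower().strip()
        if key in self.reference:
            return  # assumption: an already-known phrase is not counted twice
        self.reference[key] = code
        for word in set(key.split()):
            self.counts[word][code] += 1

    def weights(self, word):
        """Return {code: weight} for a word, as relative frequencies."""
        by_code = self.counts.get(word)
        if not by_code:
            return {}
        total = sum(by_code.values())
        return {code: n / total for code, n in by_code.items()}


store = CodeListStore()
store.learn("heavy truck driver", "T01")
store.learn("heavy machinery mechanic", "M07")
# A phrase that could not be coded automatically is coded by an expert
# and fed back, so that later phrases benefit from it.
store.learn("semi-trailer driver", "T01")
```

After the third call, the word "semi-trailer", previously unknown to the store, has a code list of its own, so a later phrase containing it can be scored without human help.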
Preferred embodiments of the present invention are thus based on a combination of a number of component inventive concepts. These concepts include automatic routing of phrases with suggested candidate codes to appropriately-specialized human operators, and automatic learning of codes from previously-coded text. It will be understood, however, that these inventive concepts may also be used independently of one another. Furthermore, while preferred embodiments described herein are directed to coding of text phrases, the principles of the present invention may also be applied in automated coding of data of other types. Such coding may be used, for example, in classifying images (as in automated visual inspection or sorting) or sounds. All such applications are considered to be within the scope of the present invention.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for automated coding of a text phrase relative to a catalog of codes, including:
finding a plurality of the codes that are candidates for coding of the phrase;
identifying a category to which one or more of the candidate codes belong; and
conveying the phrase together with the one or more candidate codes in the identified category to a human operator specialized in the identified category, for verification by the operator of one of the candidate codes in the category for assignment to the phrase.
Preferably, finding the plurality of the codes includes examining lists of the codes respectively associated with the words in the phrase, and selecting the candidate codes from the lists. Most preferably, the method includes providing a collection of reference phrases and codes respectively assigned to the reference phrases by a human operator, and generating the lists of the codes responsive to the words in the reference phrases, the list for each word including the codes assigned to the reference phrases containing the word.
Additionally or alternatively, each of the lists of the codes includes respective weights assigned to the codes with respect to the word with which the list is associated, and selecting the candidate codes includes computing matching scores for the codes based on the weights of the codes associated with the words in the phrase, and designating one or more of the candidate codes whose matching scores meet a predetermined criterion. Preferably, designating the candidate codes includes finding a set of one or more candidate codes whose matching scores are substantially greater than those of all of the other codes. Alternatively or additionally, when there is a single candidate code whose matching score meets the criterion, the method includes returning the single code, without conveying the phrase to the human operator.
Preferably, the method includes:
providing a collection of reference phrases and codes respectively assigned to the reference phrases by a human operator;
generating the lists of the codes responsive to the words in the reference phrases, the list for each word including the codes assigned to the reference phrases containing the word; and
computing the weights to be assigned to each code in the code lists so as to indicate a likelihood of the code being the one that the human operator would assign to an unknown phrase when the given word appears in the unknown phrase.
Preferably, conveying the phrase together with the codes to the human operator includes presenting the candidate codes to the operator in a predetermined sequence and receiving a binary input from the operator to verify or reject the codes in the sequence. Most preferably, when the operator is unable to verify any of the candidate codes, the method includes passing the phrase to an expert operator for manual coding.
Further preferably, when a single one of the candidate codes meets a predetermined matching criterion, while none of the other candidate codes meets the criterion, the method includes returning the single code, without conveying the phrase to the human operator.
There is also provided, in accordance with a preferred embodiment of the present invention, a method for automated coding of text phrases made up of words, the method including:
providing a collection of reference phrases and codes respectively assigned to the reference phrases by one or more human operators;
generating for the words in the reference phrases respective lists of the codes, the list for each word including the codes assigned to the reference phrases containing the word;
receiving an input phrase to which a code is to be assigned; and
processing the input phrase using the lists of codes so as to determine whether one or more candidate codes, selected from the lists, meet a criterion for assignment to the phrase.
Preferably, the input phrase includes first and second input phrases, and the method includes, if the candidate codes do not meet the criterion with respect to the first input phrase, passing the first input phrase to one of the human operators for assignment of a code to the first input phrase, and repeating the step of generating the lists of codes using the words in the first input phrase, wherein processing the input phrase includes processing the second input phrase using the lists of codes generated using the words in the first input phrase.
Preferably, generating the lists of the codes includes assigning respective weights to the codes in each list with respect to the word with which the list is associated, and processing the input phrase includes computing matching scores for the codes based on the weights of the codes associated with the words in the phrase, and designating a set of one or more of the candidate codes whose matching scores meet a predetermined criterion.
Preferably, processing the input phrase includes:
when the set of one or more candidate codes includes a single code meeting the criterion, returning the single code; and
when the set of one or more candidate codes does not include a single code meeting the criterion, routing the phrase to a human operator for coding.
There is further provided, in accordance with a preferred embodiment of the present invention, apparatus for automated coding of a text phrase relative to a catalog of codes, including:
a plurality of coding workstations, adapted to be operated by human operators having respective fields of specialization; and
a coding server, coupled to communicate with the workstations, and operative to find a plurality of the codes that are candidates for coding of the phrase, to identify a category to which one or more of the candidate codes belong, and to convey the phrase together with the one or more candidate codes in the identified category to one of the workstations that is operated by one of the human operators whose field of specialization includes the identified category, for verification by the operator of one of the candidate codes in the category for assignment to the phrase.
There is additionally provided, in accordance with a preferred embodiment of the present invention, apparatus for automated coding of text phrases made up of words, the apparatus including a coding server, adapted to receive a collection of reference phrases and codes respectively assigned to the reference phrases by one or more human operators, to generate for the words in the reference phrases respective lists of the codes, the list for each word including the codes assigned to the reference phrases containing the word, to receive an input phrase to which a code is to be assigned, and to process the input phrase using the lists of codes so as to determine whether one or more candidate codes, selected from the lists, meet a criterion for assignment to the phrase.
Preferably, the apparatus includes one or more coding workstations, in communication with the server and adapted to be operated by the human operators.
There is moreover provided, in accordance with a preferred embodiment of the present invention, a computer software product for automated coding of a text phrase relative to a catalog of codes, including a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to find a plurality of the codes that are candidates for coding of the phrase, to identify a category to which one or more of the candidate codes belong, and to convey the phrase together with the one or more candidate codes in the identified category to a human operator specialized in the identified category, for verification by the operator of one of the candidate codes in the category for assignment to the phrase.
There is furthermore provided, in accordance with a preferred embodiment of the present invention, a computer software product for automated coding of text phrases made up of words, the product including a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer, upon receiving a collection of reference phrases and codes respectively assigned to the reference phrases by one or more human operators, to generate for the words in the reference phrases respective lists of the codes, the list for each word including the codes assigned to the reference phrases containing the word, and upon further receiving an input phrase to which no code has yet been assigned, to process the input phrase using the lists of codes so as to determine whether one or more candidate codes, selected from the lists, meet a criterion for assignment to the phrase.