FIG. 1 depicts an optical character recognition (OCR) system 10 having an optical scanner 12 connected to a data processing system 14. The optical scanner 12 scans a written or printed page of text and reads the individual characters printed or written thereon. Typically, the scanner 12 is capable of recognizing any character from a predefined set and returning a symbolic representation (a code) associated with each scanned character. These codes are fed to the data processing system 14 for further processing.
The data processing system 14 shown in FIG. 1 comprises a CPU 16 interconnected to a main memory 18 via a bus 20. Also connected to the bus 20 are a disk memory 22 and an I/O interface 24. The I/O interface 24 is connected to the optical scanner 12. In this manner, the data processing system 14 may receive the characters read in by the optical scanner 12 via the I/O interface 24.
Frequently, it is desired to process printed text. The text itself may be scanned by the OCR system to extract the individual characters. From there, the symbolic representations of each character are transmitted to the data processing system for further processing in accordance with a desired application such as translating the written text to another language or interpreting the written text.
Occasionally it is desirable to morphologize the text, i.e., to segment the words of the text into morphemes and then relate the morphemes to one another. A morpheme is an indivisible word fragment which can convey meaning. For instance, the word "gun" is a morpheme. It conveys a certain meaning to the reader and cannot be divided into a smaller unit and still convey the same meaning. The word "guns," on the other hand, has two morphemes, "gun" and "s." The first morpheme conveys the same meaning as before. But, by placing the second morpheme "s" adjacent to the first morpheme, the sense of plurality is conveyed. Many morphemes may be placed into a single word to convey a more complicated meaning. For instance, the word "gunfighter" contains three morphemes: "gun," "fight" and "er."
As can be seen above, some morphemes, such as "gun" and "fight," can stand alone as words. Others, such as "s" and "ing," are word fragments and cannot stand alone. Still others appear to be composed of further divisible units, such as "together" or the phrase "in order to." However, these phrases and words are also morphemes, as they cannot be further divided without drastically changing their meaning.
In the English language it is often a simple task to determine where a particular word begins and ends simply by using the "space" character as a word delimiter. This is not so simple in other languages such as Japanese, Chinese or Korean. In these languages, each character within a sentence is chosen from a large set and is approximately evenly spaced in relation to the other characters. Further, the placement of one character in a sentence may drastically alter the way characters are parsed by the reader into words. Nonetheless, these texts exhibit morpheme properties. Characters alone, or strings of characters, form indivisible units or morphemes which convey a particular meaning. Again, the morpheme may be a whole phrase, word, word fragment or semantic unit which merely conveys information regarding another morpheme.
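By way of illustration only, the difference between delimited and undelimited text may be sketched in Python as follows. The function name all_segmentations and the small English dictionary are hypothetical, and an unspaced English string is used merely as a stand-in for a sentence written without word delimiters.

```python
def all_segmentations(text, dictionary):
    """Return every way of dividing `text` into entries of `dictionary`."""
    if not text:
        return [[]]  # one way to segment nothing: the empty segmentation
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep this head and segment the remaining characters recursively.
            for tail in all_segmentations(text[end:], dictionary):
                results.append([head] + tail)
    return results


# Hypothetical toy dictionary, assumed for this example only.
TOY_DICTIONARY = {"no", "now", "here", "where", "nowhere"}

# With a space delimiter the division is unambiguous.
spaced = "now here".split()

# Without a delimiter, several dictionary-consistent divisions exist.
unspaced = all_segmentations("nowhere", TOY_DICTIONARY)
```

Here the spaced input yields a single division, whereas the unspaced input admits three ("no" + "where," "now" + "here," and "nowhere"), so some further criterion is needed to choose among them.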
The goal of morphology is to segment a text into morphemes and then relate the morphemes to one another. Several morphology methods have been previously disclosed (see, e.g., Japanese Pat. No. 61-210479, Japanese Pat. No. 60-20234).
Certain languages, such as Japanese, have a structured grammar in which two adjacent morphemes must obey certain rules of connection. These rules dictate whether two morphemes may be placed adjacent to one another. Two prior morphology methods have exploited the connection relationship between morphemes in the Japanese language. A first approach, called Longest Match, segments the sentence of characters, i.e., divides the sentence into morphemes, one at a time. Initially, the parser attempts to string together the longest series of characters, starting from the beginning of the sentence, which is listed in a Japanese dictionary as a recognizable morpheme. It is then determined whether this morpheme may be connected to the beginning of the sentence. If this morpheme cannot be the first morpheme in the sentence (i.e., the above-generated morpheme is not conjunctive with the beginning of the sentence), the parser returns to the step in which the longest morpheme was formed. Then, the second longest morpheme is formed and tested in the same manner.
After the first morpheme is divided, a second morpheme is divided in a similar manner. The longest possible morpheme is formed from the remaining characters of the sentence, starting with the character following where the first morpheme left off. The second morpheme so formed is tested to determine if it is conjunctive with the first morpheme. If the two morphemes are not conjunctive, the second morpheme is reformulated, as described above, i.e., the second morpheme is formed from the next-longest string of characters remaining in the sentence and tested. It may be appreciated that reformulation may occur for any morpheme in the parsing analysis. Further, although several morphemes may appear to be successfully divided initially, it may later turn out that one was incorrectly divided. The parser therefore has the ability to backtrack to any step in the parsing process, including backtracking to redivide any previously segmented morpheme. FIG. 2 illustrates such a case.
Depicted therein is an exemplary goal tree illustrating possible states of the above-described Longest Match process. Each node of the tree depicts a possible state of the process after a morpheme is divided from a sentence of characters represented by the characters A through M. The root node 400 of the goal tree represents the state of a sentence with no divided morphemes. The root node 400 has three child nodes 426, 406, 401 showing three possible divisions of the first morpheme from the sentence in accordance with the above stated criteria (caret marks, i.e., "Λ", delimit divided morphemes in FIG. 2). Each child node 426, 406, 401 depicts the state of the sentence after the formation of one of three morphemes from the sentence which is both listed in the dictionary and conjunctive with the beginning of a sentence. Thus, "A," "ABC" and "ABCD" are all morphemes listed in the dictionary which are conjunctive with the beginning of a sentence.
As depicted in FIG. 2, of the morphemes that may be formed with the characters A-M which are conjunctive with the beginning of a sentence, "ABCD" is the longest. In the Longest Match process, this would be the first attempted division of the first morpheme. Thus, node 401 depicts the first state of the Longest Match process.
The Longest Match process then proceeds to divide a second morpheme from the characters E-M. As depicted by nodes 402 and 403, only two morphemes may be formed from the characters E-M which are both listed in the dictionary and conjunctive with the first morpheme "ABCD." Again, however, the Longest Match process forms and tests morphemes in the order of decreasing length. Thus, the longer morpheme, "EFGH," would be formed from the remaining characters E-M and successfully tested. This would result in the state as depicted by node 402, with the morphemes "ABCD" and "EFGH" divided from the sentence.
As depicted in FIG. 2, node 402 has no children. This means that no morpheme listed in the dictionary can be formed with the remaining characters I-M or that, if morphemes can be formed, they are not conjunctive with the morpheme "EFGH." Hence, the Longest Match parser "backtracks" to the state of node 401. With the state of the Longest Match parser as depicted in node 401, the parser attempts to form, from the characters E-M, a different second morpheme which is conjunctive with "ABCD." As depicted in node 403, the next longest morpheme which is conjunctive with "ABCD" is "EF." The Longest Match process thereafter attempts to divide the third and fourth morphemes ("GH," "IJK," respectively), traversing the states of nodes 403 to 405 in a similar manner.
When node 405 is reached, the parser determines that no fifth morpheme may be formed from the remaining characters L and M which is conjunctive with the fourth morpheme "IJK." The parser then backtracks to node 404 to redivide the fourth morpheme from the characters I-M. Since no other morphemes may be formed therefrom which are conjunctive with the morpheme "GH," the parser backtracks to node 403 to redivide the third morpheme from the characters G-M. Since that is not possible, the parser backtracks to node 401 to redivide the second morpheme from the characters E-M. Since all of the possible morphemes which are conjunctive with "ABCD" have been tried, the parser backtracks to the root node 400 to redivide the first morpheme from the characters A-M. At this point, the first morpheme is redivided as "ABC" as depicted at node 406. It may be appreciated that processing continues in the above-described manner through the numbered nodes in numerical order until node 425 is reached. At this point, all of the morphemes are divided from the sentence.
As can be seen from the goal tree of FIG. 2, the Longest Match process uses very limited criteria to form morphemes. The morphemes are formed in decreasing order of length and then tested to determine if they are conjunctive. Occasionally, however, a morpheme originally thought to be correct later turns out to be incorrect. At this point, the Longest Match process backtracks to redivide the morphemes of the sentence in the reverse of the order in which they were divided. As such, the efficiency of the algorithm suffers, because many poor choices made during segmentation are not discovered until much later.
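By way of illustration only, the Longest Match process described above may be sketched in Python as follows. The dictionary, the set of non-conjunctive pairs and all function names are hypothetical; they are loosely modeled on the portion of FIG. 2 described in the text, and the entries "GHIJ" and "KLM" are assumed solely so that the toy sentence can be fully segmented. The sketch is not a reproduction of the complete goal tree of FIG. 2.

```python
SENTENCE = "ABCDEFGHIJKLM"
START = "^"  # stands for the beginning of the sentence

# Hypothetical toy dictionary of recognizable morphemes.
DICTIONARY = {"A", "ABC", "ABCD", "D", "DEF", "EF", "EFGH",
              "GH", "GHIJ", "IJK", "KLM"}

# Hypothetical connection rules: pairs that may NOT be adjacent.
NOT_CONJUNCTIVE = {("EFGH", "IJK"), ("EF", "GHIJ")}


def conjunctive(prev, piece):
    """True if `piece` may directly follow `prev` under the toy rules."""
    return (prev, piece) not in NOT_CONJUNCTIVE


def candidates(text, pos, prev):
    """Dictionary morphemes starting at `pos`, conjunctive with `prev`, longest first."""
    found = []
    for end in range(len(text), pos, -1):  # decreasing length
        piece = text[pos:end]
        if piece in DICTIONARY and conjunctive(prev, piece):
            found.append(piece)
    return found


def longest_match(text, pos=0, prev=START, trace=None):
    """Return a full segmentation as a list of morphemes, or None on failure."""
    if pos == len(text):
        return []  # every character has been divided into a morpheme
    for piece in candidates(text, pos, prev):
        if trace is not None:
            trace.append(piece)  # record every morpheme that is attempted
        rest = longest_match(text, pos + len(piece), piece, trace)
        if rest is not None:
            return [piece] + rest
        # Otherwise backtrack and try the next-longest candidate.
    return None


attempts = []
segmentation = longest_match(SENTENCE, trace=attempts)
```

Under these assumed tables the sketch arrives at "ABC," "DEF," "GHIJ," "KLM," but only after forming and abandoning "ABCD," "EFGH," "EF," "GH" and "IJK," which mirrors the late discovery of poor choices noted above.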
The other approach, called the Parse List Method, attempts to more selectively search for the morphemes of the sentence. At each state of the solution, the parser, in this method, determines all of the possible choices for dividing the next morpheme from the remaining characters in the sentence. For instance, using the example of FIG. 2, the parser would first determine that three morphemes "A," "ABC" and "ABCD" could be formed from the characters A-M and are conjunctive with the beginning of the sentence.
Each partial solution is assigned a weight in accordance with some formula for determining the likelihood of segmenting the entire sentence using this particular partial solution. The partial solution that seems best suited to succeed is selected and processing continues on this partial solution. All potential second morphemes which are listed in the dictionary and conjunctive with the first morpheme are formed from the remaining characters following the first morpheme. For instance, assume "ABCD" appeared most likely to lead to a fully segmented sentence. Both "EF" and "EFGH" would be formed and tested. Each new partial solution thereby formed is also assigned a weight. Again, the weights of all of the partial solutions are compared and the partial solution which seems best suited to succeed is continued. It may appear, for instance, at this stage, that the first divided morpheme, "ABCD," which previously looked promising, is unlikely to lead to a final solution (i.e., a fully segmented sentence). In such a case, the most promising partial solution will be one of the other states having a different first divided morpheme, i.e., the states of nodes 426, 406 of FIG. 2. Thus, the most promising partial solution is continued, i.e., all of the choices for the next parsed morpheme of this partial solution will be explored and assigned weights. For example, suppose that, after comparing the weights of nodes 402, 403, 406 and 426, node 406 appears to have the most promising solution. In such a case, nodes 407, 416 are examined by dividing "D" and "DEF" from the sentence as the second morpheme. This process continues until the final solution is achieved.
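By way of illustration only, the Parse List Method may be sketched in Python as a best-first search over partial solutions. The weighting formula is not specified in the text; the default used here (the number of characters not yet segmented, lower being more promising) is purely an assumption, as are the toy dictionary, the non-conjunctive pairs and the function names.

```python
import heapq
from itertools import count

SENTENCE = "ABCDEFGHIJKLM"
START = "^"  # stands for the beginning of the sentence

# Hypothetical toy dictionary and connection rules, assumed for this example.
DICTIONARY = {"A", "ABC", "ABCD", "D", "DEF", "EF", "EFGH",
              "GH", "GHIJ", "IJK", "KLM"}
NOT_CONJUNCTIVE = {("EFGH", "IJK"), ("EF", "GHIJ")}


def remaining_characters(text, pos, morphemes):
    """Assumed weight: characters still unsegmented (lower is more promising)."""
    return len(text) - pos


def parse_list(text, weight=remaining_characters):
    """Return (segmentation or None, number of partial solutions weighed)."""
    tie = count()  # unique tie-breaker so the heap never compares lists
    frontier = [(weight(text, 0, []), next(tie), 0, [])]
    weighed = 1
    while frontier:
        # Continue the partial solution that currently looks most promising.
        _, _, pos, morphemes = heapq.heappop(frontier)
        if pos == len(text):
            return morphemes, weighed
        prev = morphemes[-1] if morphemes else START
        # Form and weigh EVERY possible next morpheme before continuing.
        for end in range(pos + 1, len(text) + 1):
            piece = text[pos:end]
            if piece in DICTIONARY and (prev, piece) not in NOT_CONJUNCTIVE:
                extended = morphemes + [piece]
                entry = (weight(text, end, extended), next(tie), end, extended)
                heapq.heappush(frontier, entry)
                weighed += 1
    return None, weighed


segmentation, weighed = parse_list(SENTENCE)
```

The counter returned alongside the segmentation makes visible how many partial solutions had to be weighed, each of which incurs the cost of the weighting formula; a different assumed formula may be supplied through the weight parameter.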
The Parse List Method, although theoretically an optimal process, proves to be inadequate in practice. This is because the computation of weights cannot be 100% accurate. Further, at each stage of the morphological process, every potential morpheme of the most promising partial solution must be evaluated and assigned a weight before processing continues. This reduces efficiency by requiring the evaluation of many choices for the next morpheme in a sentence. Finally, the formula which determines the weights for each partial solution may have a significant time requirement to calculate the weight for each partial solution. All of these considerations reduce the efficiency of the algorithm.
It is therefore an object of the present invention to provide a method and system for morphologizing texts which is efficient and reduces the amount of backtracking.