The present invention relates generally to a computer-based method for identifying text. More particularly, the present invention relates to a lattice and method for identifying and normalizing orthographic variations in Japanese text.
Word segmentation refers to the process of identifying the individual words that make up an expression of language, such as text. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, performing natural language parsing and understanding, and searching a collection of documents for specific words or phrases, all of which benefit from an identification of individual words.
Performing word segmentation of English text is rather straightforward, since spaces and punctuation marks generally delimit the individual words in the text. In Japanese text, however, word boundaries are implicit rather than explicit. That is, Japanese text typically does not include spaces or punctuation between words. Therefore, segmentation cannot be performed in the same manner as English word segmentation. Other characteristics of Japanese text further complicate the matter. For example, potential word candidate records may overlap (causing ambiguities for the parser) or there may be gaps where no suitable record is found (causing a broken span). Also, the language includes four different scripts that are in common use: kanji, hiragana, katakana and roman. Furthermore, these different scripts can be mixed within lexical entries. Additionally, many Japanese words have a variety of acceptable spellings and certain characters are optional.
Existing segmenting methods involve adding orthographic variations to the lexicon as they are encountered (requiring a long-term maintenance commitment), or lexicalizing all possible variations (requiring a much larger lexicon). An accurate and efficient approach to automatically performing Japanese word segmentation would have significant utility.
The present invention provides a solution to this and other problems and offers other advantages over the prior art.
The present invention relates to a lattice and method for identifying and normalizing orthographic variations in Japanese text.
One embodiment of the present invention is directed to a computer-readable medium having stored thereon a data structure that includes multiple data fields collectively representing a Japanese lexical entry. The multiple data fields include a plurality of multi-form data fields. Each multi-form data field is capable of holding data representing a word element of the lexical entry. Each multi-form data field includes two subfields. The first subfield contains data representing a primary form of the corresponding word element. The second subfield contains data representing an alternate form of the corresponding word element.
In an illustrative embodiment of the invention the data structure includes a lattice of the form:
[W:ab][X:c] . . . [Y:def]
where W, X and Y each represent a primary-orthography character; a, b, c, d, e, and f each represent an alternate orthography character; ab, c, and def represent an alternate representation to W, X and Y, respectively; and the lattice as a whole represents a plurality of orthographic forms of the lexical entry.
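By way of illustration only, the lattice notation above can be sketched in Python. The names WordElement, parse_lattice and all_forms are hypothetical and do not appear in the specification; the sketch assumes every word element is written in bracketed primary:alternate form, and it uses the placeholder letters of the notation rather than actual Japanese characters.

```python
import re
from itertools import product
from typing import List, NamedTuple


class WordElement(NamedTuple):
    primary: str    # primary-orthography form (W, X, Y in the notation)
    alternate: str  # alternate-orthography form (ab, c, def in the notation)


def parse_lattice(text: str) -> List[WordElement]:
    """Parse a string such as '[W:ab][X:c][Y:def]' into word elements."""
    return [WordElement(p, a)
            for p, a in re.findall(r"\[([^:\]]+):([^\]]+)\]", text)]


def all_forms(lattice: List[WordElement]) -> List[str]:
    """Enumerate every orthographic form the lattice represents."""
    return ["".join(choice) for choice in product(*lattice)]


lattice = parse_lattice("[W:ab][X:c][Y:def]")
forms = all_forms(lattice)
# Three two-way choices give eight forms, from 'WXY' to 'abcdef'.
```

The enumeration is shown only to make concrete what the lattice as a whole represents; a lexicon holding the lattice need not store the expanded forms.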
Another embodiment of the present invention is directed to a method of normalizing orthographic variations in the Japanese language. According to this method, an orthography lattice is maintained for each of multiple lexical entries. Each lattice represents a plurality of orthographic forms of the lexical entry and includes at least one word-element representation representing multiple forms of a word element of the lexical entry. Each word-element representation includes a primary form of the word element and an alternate form of the word element. Each lattice is normalized to produce a normalized form that includes the primary form of each word element representation of the lattice and that does not include the alternate form of each word element representation.
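The normalization step described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: a lattice is modeled as a list of (primary, alternate) pairs, and the function name normalize is hypothetical.

```python
from typing import List, Tuple

# Assumed model: one (primary form, alternate form) pair per word element.
Lattice = List[Tuple[str, str]]


def normalize(lattice: Lattice) -> str:
    """Keep the primary form of every word element and drop every alternate."""
    return "".join(primary for primary, _alternate in lattice)


lattice = [("W", "ab"), ("X", "c"), ("Y", "def")]
normalized = normalize(lattice)  # 'WXY'
```

Every orthographic form represented by the lattice thus maps to the same normalized form.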
Another embodiment of the present invention is directed to a method of segmenting Japanese text. According to the method, an orthography lattice is stored for each of multiple lexical entries. Each lattice represents a plurality of orthographic forms of the lexical entry and includes at least one word-element representation representing a plurality of forms of a word element of the lexical entry. Each word-element representation includes a primary form of the word element and an alternate form of the word element. A sequence of input characters is received and the input sequence is evaluated against the plurality of lattices. If any orthographic form of one of the lexical entries is present in the input sequence, a normalized form of that lexical entry is generated that comprises the primary form of each word-element representation of the lattice corresponding to the entry and that does not include the alternate form of each word-element representation.
Another embodiment of the present invention is directed to another method of segmenting Japanese text. According to the method, an orthography lattice is stored for each of multiple lexical entries. Each lattice represents a plurality of orthographic forms of the lexical entry. Each lattice includes at least one word-element representation. Each word-element representation represents multiple different forms of the corresponding word element of the lexical entry. Each word-element representation can include a primary form of the word element and an alternate form of the word element. A character input that is part of an input string is received. The received character input is compared to the first word-element representation of each lattice. If the received character input matches either the primary form or the alternate form of the first word-element representation of a particular lattice, the subsequent characters in the input string are compared to further word-element representations in the particular lattice in order to ascertain whether any orthographic forms of the lexical entry corresponding to the particular lattice are present in the input string beginning with the received character input. In an illustrative aspect of this embodiment of the invention, if any orthographic forms of the lexical entry corresponding to the particular lattice are present in the input string beginning with the received character input, a normalized representation of the lexical entry is generated which includes the primary form of each word-element representation of the lattice and that does not include the alternate form of each word-element representation.
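The comparison just described may be sketched as follows, purely as an illustration. The helper names match_at and candidates are hypothetical, each lattice is assumed to be a non-empty list of (primary, alternate) pairs, and the placeholder letters of the lattice notation stand in for Japanese characters.

```python
from typing import Iterator, List, Optional, Tuple

Lattice = List[Tuple[str, str]]  # (primary form, alternate form) per element


def match_at(lattice: Lattice, text: str, pos: int) -> Optional[int]:
    """Return the end index if any form of the lattice starts at pos."""
    if not lattice:
        return pos  # every word element has been matched
    for form in lattice[0]:  # try the primary form, then the alternate form
        if text.startswith(form, pos):
            end = match_at(lattice[1:], text, pos + len(form))
            if end is not None:
                return end
    return None


def candidates(lexicon: List[Lattice], text: str) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, end, normalized form) for each entry found in text."""
    for pos in range(len(text)):
        for lattice in lexicon:
            end = match_at(lattice, text, pos)
            if end is not None:
                yield pos, end, "".join(p for p, _ in lattice)


lexicon = [[("W", "ab"), ("X", "c")]]
found = list(candidates(lexicon, "zzabXqWc"))
# found == [(2, 5, 'WX'), (6, 8, 'WX')]
```

In this sketch the mixed forms abX and Wc are both recognized and both yield the normalized form WX, without either form being separately lexicalized.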
Another embodiment of the present invention is directed to yet another method of segmenting Japanese text. According to this method, an orthography lattice is stored for each of a plurality of lexical entries. An all-alternate-orthography form is also stored for each lexical entry. Each all-alternate-orthography form consists exclusively of alternate orthography characters and does not contain any primary orthography characters. An input character that is part of an input string of characters is received. It is determined whether the received input character is a primary orthography character or an alternate orthography character. If the received input character is an alternate orthography character, the input character is compared to the first character of each stored all-alternate-orthography form. Then, if the input character matches the first character of a particular all-alternate-orthography form, subsequent characters in the input string are compared to further characters in the particular all-alternate-orthography form. In this way, it is ascertained whether the all-alternate-orthography form of the corresponding lexical entry is present in the input string beginning with the received input character. If, on the other hand, the received input character is a primary orthography character, the input character is compared to the primary form of the first word-element representation of each lattice. Then, if the received input character matches the primary form of the first word-element representation of a particular lattice, subsequent characters in the input string are compared to further word-element representations in the particular lattice, thereby ascertaining whether any orthographic forms of the lexical entry corresponding to the particular lattice are present in the input string beginning with the received input character. 
In an illustrative aspect of this embodiment of the invention, if any orthographic forms of the lexical entry corresponding to the particular lattice are present in the input string beginning with the received input character, a normalized representation of the lexical entry is generated which includes the primary form of each word-element representation of the lattice corresponding to the entry and that does not include the alternate form of each word-element representation.
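The dispatch on character type described above may be sketched as follows. This is a hedged illustration only: the names is_primary, match_rest and lookup are hypothetical, uppercase placeholder letters stand in for primary-orthography characters and lowercase letters for alternate-orthography characters (an actual system would presumably test the script of the character instead), and every word element is assumed to carry both forms.

```python
from typing import List, Optional, Tuple

Lattice = List[Tuple[str, str]]  # (primary form, alternate form) per element


def is_primary(ch: str) -> bool:
    # Stand-in test: uppercase placeholder = primary-orthography character.
    return ch.isupper()


def match_rest(lattice: Lattice, text: str, pos: int) -> Optional[int]:
    """Match the remaining word elements, each in either form."""
    if not lattice:
        return pos
    for form in lattice[0]:
        if text.startswith(form, pos):
            end = match_rest(lattice[1:], text, pos + len(form))
            if end is not None:
                return end
    return None


def lookup(lexicon: List[Lattice], text: str, pos: int) -> Optional[Tuple[str, int]]:
    """Return (normalized form, end index) for an entry starting at pos."""
    ch = text[pos]
    for lattice in lexicon:
        normalized = "".join(p for p, _ in lattice)
        if is_primary(ch):
            # Primary character: compare with the primary form of the
            # first word element, then continue through the lattice.
            first_primary = lattice[0][0]
            if text.startswith(first_primary, pos):
                end = match_rest(lattice[1:], text, pos + len(first_primary))
                if end is not None:
                    return normalized, end
        else:
            # Alternate character: compare only with the stored
            # all-alternate-orthography form of the entry.
            all_alternate = "".join(a for _, a in lattice)
            if text.startswith(all_alternate, pos):
                return normalized, pos + len(all_alternate)
    return None


lexicon = [[("W", "ab"), ("X", "c")]]
```

Under these assumptions, lookup(lexicon, "abc", 0) and lookup(lexicon, "Wc", 0) both find the entry, whereas lookup(lexicon, "abX", 0) returns None, because a mixed form that begins with alternate characters is not reachable from its first character by this dispatch alone.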
In an illustrative embodiment of the above method, for each lexical entry that contains two or more word elements, a look-back indicator is stored for each non-initial word element in the lexical entry that contains a primary/alternate orthography pair. Each look-back indicator includes data that indicates the primary form of the corresponding word element, the primary form of the first word element in the lexical entry, and the first character of an alternate-orthography form of the first word element in the lexical entry. Each look-back indicator also indicates the difference in character position between the corresponding word element and the first character of the alternate-orthography form of the first word element in the lexical entry when all of the word elements occurring before the corresponding word element in the corresponding lexical entry are alternate form word elements. If the received input character is a primary-orthography character, the input character is compared to the primary form of the word element corresponding to each of a plurality of the look-back indicators.
If the received input character matches the primary form of the word element corresponding to a particular look-back indicator, the character in the input string that precedes the received input character by the difference indicated by the look-back indicator is evaluated. If the evaluated character matches the first character of the alternate-orthography form of the first word element indicated by the look-back indicator, and each character between the received input character and the evaluated character in the input string is an alternate orthography character, the primary form of the first word element in the lexical entry, as indicated by the look-back indicator, is compared to the primary form of the first word-element representation of each lattice. If the primary form of the first word element in the lexical entry, as indicated by the look-back indicator, matches the primary form of the first word-element representation of a particular lattice, the alternate form of the first word-element representation of the particular lattice is compared to the evaluated character and subsequent characters in the input string. If the alternate form of the first word-element representation of the particular lattice matches the evaluated character and subsequent characters in the input string, further subsequent characters in the input string are compared to further word-element representations in the particular lattice. In this way, it is determined whether any orthographic forms of the lexical entry corresponding to the particular lattice are present in the input string beginning with the evaluated character. 
In an illustrative aspect of this embodiment of the invention, if any orthographic forms of the lexical entry corresponding to the particular lattice are present in the input string beginning with the evaluated character, a normalized representation of the lexical entry is generated which includes the primary form of each word-element representation of the lattice and that does not include the alternate form of each word-element representation.
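The look-back procedure may be sketched as follows. This is one possible reading of the description, offered as an illustration rather than as the claimed implementation. The names LookBack, build_lookbacks and look_back_match are hypothetical; the sketch assumes that every word element has both a primary and an alternate form, that each primary form is a single character, and that uppercase placeholder letters denote primary-orthography characters while lowercase letters denote alternate-orthography characters.

```python
from typing import List, NamedTuple, Optional, Tuple

Lattice = List[Tuple[str, str]]  # (primary form, alternate form) per element


class LookBack(NamedTuple):
    element_primary: str  # primary form of the non-initial word element
    first_primary: str    # primary form of the first word element
    first_alt_char: str   # first character of the first element's alternate
    offset: int           # characters before the element if all are alternate


def is_primary(ch: str) -> bool:
    return ch.isupper()  # stand-in for a real script test


def build_lookbacks(lattice: Lattice) -> List[LookBack]:
    first_primary, first_alt = lattice[0]
    result, offset = [], len(first_alt)
    for primary, alternate in lattice[1:]:
        result.append(LookBack(primary, first_primary, first_alt[0], offset))
        offset += len(alternate)
    return result


def match_rest(lattice: Lattice, text: str, pos: int) -> Optional[int]:
    if not lattice:
        return pos
    for form in lattice[0]:
        if text.startswith(form, pos):
            end = match_rest(lattice[1:], text, pos + len(form))
            if end is not None:
                return end
    return None


def look_back_match(lexicon: List[Lattice], lookbacks: List[LookBack],
                    text: str, pos: int) -> Optional[Tuple[str, int, int]]:
    """Return (normalized form, start, end) found by looking back from pos."""
    for lb in lookbacks:
        start = pos - lb.offset
        if lb.element_primary != text[pos] or start < 0:
            continue
        if text[start] != lb.first_alt_char:
            continue
        if any(is_primary(c) for c in text[start:pos]):
            continue  # every intervening character must be alternate
        for lattice in lexicon:
            first_primary, first_alt = lattice[0]
            if first_primary == lb.first_primary and text.startswith(first_alt, start):
                end = match_rest(lattice[1:], text, start + len(first_alt))
                if end is not None:
                    return "".join(p for p, _ in lattice), start, end
    return None


lattice = [("W", "ab"), ("X", "c"), ("Y", "def")]
lookbacks = build_lookbacks(lattice)
result = look_back_match([lattice], lookbacks, "zzabXdef", 4)
# result == ('WXY', 2, 8): the entry begins two characters before 'X'
```

In this sketch, encountering the primary character X at position 4 causes the character two positions earlier to be evaluated, after which the lattice is matched forward from that evaluated character.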