1. Technical Field
The present invention relates to natural language analysis performed in areas such as machine translation, and, in particular, to a process for breaking text into sentences.
2. Description of the Related Art
The first process to be performed in natural language processing in a field such as machine translation is sentence segmentation, that is, breaking the natural-language text to be processed into sentences. Each sentence resulting from this process is used as the unit on which a desired process such as translation is performed. If text is broken at an improper point, the desired process cannot be accomplished properly. In machine translation, a long source sentence may cause the amount of data to be processed to grow explosively. If text that should be broken into two sentences is left unbroken, analyzing it can require much time and may end in a failure of the analysis.
In order to break text at a proper point, it is desirable that erroneous sentence segmentation be corrected, based on an understanding of the meaning of the text, while the desired process such as translation is being performed. However, such correction cannot be made at the current level of natural language processing technology. Therefore, accurate sentence segmentation must be performed in the initial stage.
If the language being processed is English, sentences end with periods and, as a general rule, text can be broken into sentences based on whether a period is present. However, a period may also serve as an abbreviation marker in addition to marking the end of a sentence, and therefore text may contain a word with a period (hereinafter called a “period word”), such as “U.S.”. Consequently, determining where text can be broken into sentences is not always a simple task.
A sentence segmentation procedure for English text according to the prior art will be described below. Simple English text that contains no period words can be broken into sentences simply based on whether the word following a single period is capitalized. For example, consider the following text: “I have a pen. You have a book.” Because “pen.” cannot be a period word, the period following “pen” can be considered a single period indicating a sentence end. In addition, the word following it, “You”, is capitalized. Therefore, the text is broken at this period.
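The simple rule described above can be sketched as follows. This is a minimal illustration only, not the procedure of any particular system; the function name `split_simple` and the use of a regular expression are assumptions made for the example, and the sketch deliberately ignores period words.

```python
import re

def split_simple(text):
    """Break text after a period when the following word is capitalized.

    Hypothetical helper for illustration; it assumes the text contains
    no period words such as "Mr." or "U.S.".
    """
    sentences = []
    start = 0
    # A period, then whitespace, then an uppercase letter marks a boundary.
    for match in re.finditer(r"\.\s+(?=[A-Z])", text):
        end = match.start() + 1  # keep the period with its sentence
        sentences.append(text[start:end].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_simple("I have a pen. You have a book."))
# ['I have a pen.', 'You have a book.']
```

Applied to text that does contain period words, the same rule would break after “Mr.” in “Mr. Smith”, which is why the prior art adds a dictionary.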
Some period words, such as “Mr.”, cannot appear at sentence ends. Other period words, such as “U.S.”, can appear either at a sentence end or in the middle of a sentence. Period words of the latter kind cannot by themselves be used as markers for determining sentence boundaries. Period words such as “Mr.” that cannot appear at sentence ends are stored in a dictionary as data. The dictionary is referenced in the sentence segmentation process so that text is not broken immediately after such a word. On the other hand, a word such as “U.S.” that cannot be used as a marker for determining a sentence boundary is not stored by itself, unlike words such as “Mr.”. Instead, a word string containing that word, for example “U.S. President”, is stored in the dictionary. If a word string containing “U.S.” that is used in the text is not stored in the dictionary, the text is broken immediately after “U.S.”. Conversely, if the word string containing “U.S.” is stored in the dictionary, the text is not broken immediately after “U.S.”.
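The dictionary-based procedure described above might look roughly like the following sketch. The dictionary contents, the names `NON_FINAL_PERIOD_WORDS`, `STORED_WORD_STRINGS`, and `split_with_dictionary`, and whitespace tokenization are all assumptions made for illustration; applying the capitalization check to the remaining periods is likewise an assumption carried over from the simple rule.

```python
# Hypothetical dictionary contents, for illustration only.
NON_FINAL_PERIOD_WORDS = {"Mr."}          # period words that never end a sentence
STORED_WORD_STRINGS = {"U.S. President"}  # word strings that are not broken internally

def split_with_dictionary(text):
    """Break text at periods unless the dictionary says not to."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is None:
            break_here = True                 # end of text
        elif token in NON_FINAL_PERIOD_WORDS:
            break_here = False                # e.g. "Mr." never ends a sentence
        elif f"{token} {nxt}" in STORED_WORD_STRINGS:
            break_here = False                # e.g. "U.S. President" is stored
        else:
            break_here = nxt[0].isupper()     # assumed capitalization check
        if break_here:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_with_dictionary("Mr. Smith has a pen. You have a book."))
print(split_with_dictionary("He met the U.S. President yesterday."))
print(split_with_dictionary("He went to the U.S. He likes it."))
```

In this sketch the first text is broken only after “pen.”, the second is kept whole because “U.S. President” is stored, and the third is broken after “U.S.” because “U.S. He” is not stored.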
As described above, natural language analysis requires a highly accurate sentence segmentation in its preliminary stage. It is important that text that should be broken into two sentences be separated without fail and the possibility of breaking text that should be considered a single sentence into two sentences should be minimized.
If text to be processed is English text, how periods are treated is important in sentence segmentation. According to prior art, period words that cannot appear at sentence ends are stored in a dictionary and if a period in text is a part of a period word stored in the dictionary, the text is not broken by that period.
However, because the prior art provides no other criteria for making the determination, text containing a period word that can appear both in the middle and at the end of a sentence is mechanically broken at the period. Thus, there is a limit to how far the accuracy of sentence segmentation can be improved in the prior art.
In addition, according to the prior art, word strings containing a word such as “U.S.”, which does not by itself indicate whether text should be broken immediately after it, are stored in the dictionary, and whether text should be broken is determined based on whether the word string containing that word in the text is stored in the dictionary. However, some passages of text should be broken immediately after such a word, while other passages containing the same word string should not. Therefore, mechanical break determination may result in erroneous sentence segmentation. For example, text containing the above-mentioned word string “U.S. President” should seldom be broken immediately after “U.S.”, yet the following text must be broken immediately after “U.S.” in order to be analyzed properly: “Japanese Prime Minister Junichiro Koizumi went to U.S. President Bush welcomed him.” In such a case, the text may not be broken properly regardless of whether the word string “U.S. President” is stored in the dictionary.
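The dilemma can be made concrete with a small sketch. The function `split_after_period_words` is a hypothetical, simplified stand-in for the mechanical dictionary lookup (it breaks after every period-final word unless the word and its successor form a stored word string), and the second test sentence is an invented counterpart to the example above.

```python
def split_after_period_words(text, stored_word_strings):
    """Break after a period-final token unless it starts a stored word string.

    Hypothetical helper; a simplified model of mechanical break determination.
    """
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        is_last = i + 1 == len(tokens)
        if token.endswith(".") and (
            is_last or f"{token} {tokens[i + 1]}" not in stored_word_strings
        ):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

# Should be two sentences (example from the text above).
two = "Japanese Prime Minister Junichiro Koizumi went to U.S. President Bush welcomed him."
# Should be one sentence (invented counterpart for illustration).
one = "U.S. President Bush welcomed him."

for stored in ({"U.S. President"}, set()):
    print(len(split_after_period_words(two, stored)),
          len(split_after_period_words(one, stored)))
```

With “U.S. President” stored, both texts come out as a single sentence, so the first is under-segmented; with the entry removed, both are broken after “U.S.”, so the second is over-segmented. Neither dictionary configuration handles both texts.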