1. Field of the Invention
The invention relates in general to a method for converting a speech utterance into a concept sequence. More particularly, this invention relates to a method that adopts a probabilistic scoring function to determine the positions of speech recognition errors and to restore an incorrect concept sequence to a correct concept sequence when speech recognition errors produce such an incorrect concept sequence.
2. Description of the Related Art
The natural language understanding technique converts the spoken languages of human beings into a data format that a computer can understand, so that the computer can provide various services for users in different applications. For example, in machine translation, a sentence in the user's native language can be translated into a foreign language. In addition to text-based applications, the natural language understanding technique can also be used for speech-related applications. Commonly, the speech is first converted into words, and the words are then processed by natural language understanding.
Generally speaking, a speech-related application system (such as a spoken dialogue system) includes a speech recognition module and a language understanding module. The speech recognition module converts the utterance spoken by a user into a set of possible word sequences, while the language understanding module analyzes the word sequence set to determine the intention of the user. The user's intention is expressed as a semantic frame.
FIG. 1 shows a flow chart for understanding the utterance spoken by a user. For example, assume that the utterance signal in block S100 is produced by pronouncing the Mandarin words “Qing-Wen (tell me) Hsin-Chu (Hsinchu) Jin-Tien (today) Zao-Shang (morning) Hei-Bu-Hei (will it) Xia-Yu (rain)?”, which means “Will it rain in Hsinchu this morning?”. Then, the speech recognition module in block S102 converts the utterance signal into a set of hypothetical sentences, called a sentence list, which is placed in block S104. For example, the sentence list may include the sentences “Qing-Wen (tell me) Hsin-Chu (Hsinchu) Jin-Tien (today) Zao-Shang (morning) Hei-Bu-Hei (will it) Xia-Yu (rain)?”, “Qi-Wen (temperature) Hsin-Chu (Hsinchu) Jin-Tien (today) Zao-Shang (morning) Hei-Bu-Hei (will it) Xia-Yu (rain)?”, and “Qing-Wen (tell me) Hsin-Chu (Hsinchu) Qing-Tien (sunny day) Zao-Shang (morning) Hei-Bu-Hei (will it) Xia-Yu (rain)?”.
In block S106, the sentence list is then analyzed by the language understanding module according to the linguistic knowledge and domain knowledge. The intention of the user's utterance can thus be determined and represented by a semantic frame in block S108.
In general, a natural language understanding system requires a predefined grammar to analyze and understand an input sequence. In the past, most natural language systems aimed at processing well-formed sentences that are grammatical with respect to the predefined grammar. If the input sentence is ill-formed (i.e., ungrammatical with respect to the predefined grammar), it is simply ignored. However, in real applications, ill-formed sentences are inevitable. In a spoken dialogue system in particular, unpredictable speech recognition errors often make the hypothetical sentences erroneous.
To analyze ill-formed sentences, robust parsing has gradually attracted considerable attention. The typical method is to partially parse the ill-formed sentence into recognizable phrases (i.e., partial parses), and to select certain partial parses for further post-processing. In this approach, the system usually adopts heuristic rules to select partial parses. Those rules are usually system-specific and, therefore, hard to reuse in other systems. In addition, since partial parsing does not recover the errors in an ill-formed sentence, this approach can extract only very limited information from the ill-formed sentence.
Another way to handle ill-formed sentences is to recover the errors in the sentences. In the past, this error recovery approach focused on searching for the best-fitting parse tree among all alternatives that could be generated by the predefined system grammar. However, the system grammar, usually defined for analysis purposes, tends to over-generate in many cases. That is, such approaches might produce sentences that are syntactically well-formed but semantically meaningless. Furthermore, the number of structures that can be generated by the system grammar is generally too large to search exhaustively, so heuristic rules must be applied to reduce the computation cost. However, those rules are usually system-specific and not easy to reuse in other systems.
Recently, the concept-based approach has been proposed to deal with ill-formed sentences. In this approach, the system grammar only defines the structures of phrases, called concepts. A sentence is parsed into a sequence of concepts. For example, in FIG. 2, the sentence “Qing-Wen (tell me) Hsin-Chu (Hsinchu) Jin-Tien (today) Zao-Shang (morning) Hei-Bu-Hei (will it) Xia-Yu (rain)” is parsed into the concept sequence “Query Location Date Topic”. Unlike conventional language analysis, the legitimacy of a concept sequence is not specified by grammar rules. Instead, an N-gram stochastic model is used to estimate the likelihood of a concept sequence.
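The parsing of a sentence into a concept sequence can be illustrated with a minimal Python sketch. The lexicon `CONCEPT_LEXICON` and the function `parse_concepts` are hypothetical names introduced here for illustration only, and greedy longest-match lookup is an assumed simplification that stands in for a real concept grammar.

```python
# Hypothetical toy lexicon mapping word phrases to concept labels.
# The entries only cover the example sentences; they are assumptions,
# not the grammar of any actual system.
CONCEPT_LEXICON = {
    ("Qing-Wen",): "Query",
    ("Hsin-Chu",): "Location",
    ("Jin-Tien", "Zao-Shang"): "Date",
    ("Jin-Tien",): "Date",
    ("Zao-Shang",): "Date",
    ("Hei-Bu-Hei", "Xia-Yu"): "Topic",
    ("Qi-Wen",): "Topic",
    ("Qing-Tien",): "Topic",
}


def parse_concepts(words):
    """Greedy longest-match parse of a word list into (concept, phrase) pairs."""
    max_len = max(len(key) for key in CONCEPT_LEXICON)
    parsed = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in CONCEPT_LEXICON:
                parsed.append((CONCEPT_LEXICON[phrase], phrase))
                i += n
                break
        else:
            # No concept covers this word: skip it.
            i += 1
    return parsed


sentence = "Qing-Wen Hsin-Chu Jin-Tien Zao-Shang Hei-Bu-Hei Xia-Yu".split()
concepts = [concept for concept, _ in parse_concepts(sentence)]
print(" ".join(concepts))  # Query Location Date Topic
```

Note that the sketch only builds individual concepts; it places no constraint on the order in which concepts may follow one another.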
Generally speaking, in the concept-based approach, the system grammar only specifies how to construct individual concept parses but does not place any constraints on how to assemble concepts into a concept sequence. All possible concept sequences of the hypothetical sentences are ranked by the N-gram stochastic model to choose the most probable concept sequence. This approach works well if the correct sentence is included in the hypothetical sentence set. However, because speech recognition techniques are imperfect, speech recognition errors are inevitable, especially when recognizing spontaneous speech. The hypothetical sentence set may therefore not include the correct sentence. In such a case, the language understanding module is forced to select an incorrect concept sequence to interpret the user's intention. For example, when the user says “Qing-Wen (tell me) Jin-Tien (today) Hsin-Chu (Hsinchu) Zao-Shang (morning) Hei-Bu-Hei (will it) Xia-Yu (rain)”, the speech recognizer may output only two incorrect hypothetical sentences, “Qi-Wen (temperature) Hsin-Chu (Hsinchu) Jin-Tien (today) Zao-Shang (morning) Hei-Bu-Hei (will it) Xia-Yu (rain)” and “Qing-Wen (tell me) Hsin-Chu (Hsinchu) Qing-Tien (sunny day) Hei-Bu-Hei (will it) Xia-Yu (rain)”. In this case, the language understanding module can only select between the two incorrect concept sequences “Topic (Qi-Wen) Location (Hsin-Chu) Date (Jin-Tien Zao-Shang) Topic (Hei-Bu-Hei-Xia-Yu)” and “Query (Qing-Wen) Location (Hsin-Chu) Topic (Qing-Tien) Date (Zao-Shang) Topic (Hei-Bu-Hei-Xia-Yu)”.
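The ranking step and its limitation can be sketched as follows, assuming a bigram model (N = 2) with add-one smoothing. The training sequences, the function `train_bigram`, and the smoothing choice are all assumptions made for illustration; the source does not specify the order of the N-gram model or how it is estimated.

```python
import math
from collections import Counter


def train_bigram(sequences, alpha=1.0):
    """Return a log-likelihood function from an add-alpha smoothed bigram model."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    vocab_size = len(vocab)

    def log_prob(seq):
        padded = ["<s>"] + list(seq) + ["</s>"]
        return sum(
            math.log((bigram_counts[(prev, cur)] + alpha)
                     / (context_counts[prev] + alpha * vocab_size))
            for prev, cur in zip(padded, padded[1:])
        )

    return log_prob


# Hypothetical training corpus of concept sequences (assumed data).
corpus = [
    ("Query", "Location", "Date", "Topic"),
    ("Query", "Date", "Location", "Topic"),
    ("Query", "Date", "Topic"),
    ("Location", "Date", "Topic"),
    ("Query", "Location", "Topic"),
]
log_prob = train_bigram(corpus)

# The two incorrect concept sequences from the example above.
hypotheses = [
    ("Topic", "Location", "Date", "Topic"),
    ("Query", "Location", "Topic", "Date", "Topic"),
]
best = max(hypotheses, key=log_prob)
print("selected:", " ".join(best))
```

Whatever scores the model assigns, `max` can only return one of the supplied hypotheses. A well-formed sequence such as “Query Location Date Topic” scores higher under this toy model, but it is never selected because it is not among the candidates.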
The problem with the above language understanding method is that the stochastic N-gram model can only provide the likelihood of a concept sequence. It cannot detect whether the concept sequence is correct, let alone recover from errors.