1. Field of the Invention
The present invention relates to a system and method of semi-supervised learning for spoken language understanding using semantic role labeling.
2. Introduction
The present invention relates to natural language dialog systems. A spoken dialog system includes some basic components such as an automatic speech recognition module, a spoken language understanding module and a speech generation module such as a text-to-speech module. FIG. 1 is a functional block diagram of an exemplary natural language spoken dialog system 100. Natural language spoken dialog system 100 may include an automatic speech recognition (ASR) module 102, a spoken language understanding (SLU) module 104, a dialog management (DM) module 106, a spoken language generation (SLG) module 108 and a text-to-speech (TTS) module 110 or other type of module for generating speech.
ASR module 102 may analyze speech input and may provide a transcription of the speech input as output. SLU module 104 may receive the transcribed input and may use a natural language understanding model to analyze the group of words that are included in the transcribed input to derive a meaning from the input. The role of DM module 106 is to receive the derived meaning from SLU module 104 and generate a natural language response to help the user to achieve the task that the system is designed to support. DM module 106 may receive the meaning of the speech input from SLU module 104 and may determine an action, such as, for example, providing a response, based on the input. SLG module 108 may generate a transcription of one or more words in response to the action provided by DM module 106. TTS module 110 may receive the transcription as input and may provide generated audible speech as output based on the transcribed speech. There are variations that may be employed. For example, the audible speech may be generated by means other than a specific TTS module as shown.
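The data flow among the five modules described above can be illustrated with a minimal Python sketch. The class name `DialogPipeline`, the method `respond`, and all five placeholder functions are hypothetical names introduced here for illustration only; each placeholder merely tags its input so that the order of processing is visible, and none of them performs real recognition, understanding, or synthesis.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialogPipeline:
    """Chains the five modules in the order described for system 100."""
    asr: Callable[[bytes], str]   # ASR module 102: speech input -> transcription
    slu: Callable[[str], str]     # SLU module 104: transcription -> derived meaning
    dm: Callable[[str], str]      # DM module 106: meaning -> action
    slg: Callable[[str], str]     # SLG module 108: action -> response text
    tts: Callable[[str], bytes]   # TTS module 110: response text -> audible speech

    def respond(self, speech_input: bytes) -> bytes:
        transcription = self.asr(speech_input)
        meaning = self.slu(transcription)
        action = self.dm(meaning)
        response_text = self.slg(action)
        return self.tts(response_text)


# Placeholder modules (assumptions): "audio" is simply UTF-8 text here.
pipeline = DialogPipeline(
    asr=lambda audio: audio.decode("utf-8"),
    slu=lambda text: f"meaning({text})",
    dm=lambda meaning: f"respond_to[{meaning}]",
    slg=lambda action: f"reply for {action}",
    tts=lambda text: text.encode("utf-8"),
)

print(pipeline.respond(b"hello"))
```

Because each module is passed in as a callable, any one of them could be swapped for a different implementation, which mirrors the statement that audible speech may be generated by means other than a specific TTS module.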
Thus, the modules of system 100 may recognize speech input, such as speech utterances, may transcribe the speech input, may identify (or understand) the meaning of the transcribed speech, may determine an appropriate response to the speech input, may generate text of the appropriate response and, from that text, may generate audible “speech” from system 100, which the user then hears. In this manner, the user can carry on a natural language dialog with system 100. Those of ordinary skill in the art will understand the programming languages and means for generating and training ASR module 102 or any of the other modules in the spoken dialog system. The present invention focuses primarily on the spoken language understanding module but may apply to other components as well.
Spoken language understanding aims to extract the meaning of speech utterances. In the last decade, a variety of practical goal-oriented spoken dialog systems (SDS) have been built for limited domains, especially for call routing. These systems aim to identify the intent expressed in a person's natural language speech and take the appropriate action to satisfy the request. In such systems, typically, the speaker's utterance is first recognized using ASR module 102. Then, the intent of the speaker is identified from the recognized sequence, using SLU module 104. Finally, DM module 106 interacts with the user in a natural way and helps the user to achieve the task that the system is designed to support. As an example, consider the utterance “I have a question about my bill.” Assuming that the utterance is recognized correctly, the corresponding intent (call-type) would be Ask(Bill). The action that needs to be taken depends on DM module 106. It may ask the user to further specify the problem or route the call to the billing department.
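The billing example above can be sketched in Python as follows. This is only an illustration under stated assumptions: a keyword lookup stands in for a trained call-type classifier, and the names `CallType`, `classify`, `choose_action`, the two lookup tables, the `clarify_first` flag, and the fallback action for unrecognized utterances are all hypothetical and are not taken from the system described.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallType:
    """An intent written as Act(Topic), e.g. Ask(Bill)."""
    act: str
    topic: str

    def __str__(self) -> str:
        return f"{self.act}({self.topic})"


# Assumption: a toy keyword table replaces a statistical classifier.
KEYWORD_TO_CALL_TYPE = {"bill": CallType("Ask", "Bill")}
# Assumption: a toy routing table used by the dialog manager.
DEPARTMENT_FOR = {CallType("Ask", "Bill"): "billing department"}


def classify(recognized_text: str) -> Optional[CallType]:
    """Return the call-type for a recognized utterance, or None if unknown."""
    words = [w.strip(".,?!") for w in recognized_text.lower().split()]
    for keyword, call_type in KEYWORD_TO_CALL_TYPE.items():
        if keyword in words:
            return call_type
    return None


def choose_action(call_type: Optional[CallType], clarify_first: bool = False) -> str:
    """Pick one of the two actions named in the text for a known call-type."""
    if call_type is None:
        return "ask the user to rephrase"  # hypothetical fallback
    if clarify_first:
        return "ask the user to further specify the problem"
    return f"route the call to the {DEPARTMENT_FOR[call_type]}"


intent = classify("I have a question about my bill.")
print(intent, "->", choose_action(intent))
```

Keeping the classifier and the action policy in separate functions reflects the point that the same call-type may lead to different actions depending on the dialog manager.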
For call-type classification, one can use a domain-dependent statistical approach as in previous work. But this approach has some serious drawbacks. First, training statistical models for intent classification requires large amounts of labeled in-domain data, which is very expensive and time-consuming to prepare. Rule-based methods, if used for these tasks, require human expertise and therefore suffer from similar problems. Moreover, the preparation of the labeling guide (i.e., designing the intents) for a given spoken language understanding task is also time-consuming and involves non-trivial design decisions. These decisions depend on the expert who is designing the task structure and the frequency of the intents for a given task. Furthermore, one expects the intents to be clearly defined in order to ease the job of the classifier and the human labelers.
Another issue is the consistency between different tasks. This is important for manually labeling the data quickly and correctly and for making the labeled data re-usable across different applications. For example, in most applications, utterances like “I want to talk to a human not a machine” appear, and they can be processed similarly.
On the other hand, in the computational linguistics domain, task-independent semantic representations have been proposed over the last few decades. Two notable studies are the well-known FrameNet and PropBank projects. This disclosure focuses on the PropBank project, which aims at creating a corpus of text annotated with information about basic semantic propositions. Predicate/argument relations are added to the syntactic trees of the existing Penn Treebank, which is mostly grammatical written text. Very recently, the PropBank corpus has been used for semantic role labeling (SRL) as the shared task at CoNLL-2004. SRL aims to assign “who did what to whom” structures to sentences without considering the application that uses this information. More formally, given a predicate of the sentence, the goal of SRL is to identify all of its arguments and their semantic roles.
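The kind of output SRL produces can be illustrated with a small Python data structure. This is a sketch only: the sentence, the class name `Proposition`, and the helper `who_did_what_to_whom` are hypothetical, the annotation is written by hand rather than produced by a labeler, and the numbered labels (Arg0, Arg1, Arg2) follow the PropBank convention of numbering a predicate's arguments.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Proposition:
    """One predicate together with its arguments, keyed by semantic role."""
    predicate: str
    arguments: Dict[str, str] = field(default_factory=dict)


# Hand-written example annotation (assumption, not labeler output).
sentence = "John gave Mary a book"
labeled = Proposition(
    predicate="gave",
    arguments={
        "Arg0": "John",    # who did it (the giver)
        "Arg1": "a book",  # what was given
        "Arg2": "Mary",    # to whom it was given
    },
)


def who_did_what_to_whom(
    p: Proposition,
) -> Tuple[Optional[str], str, Optional[str], Optional[str]]:
    """Read the proposition back as (who, did, what, to whom)."""
    return (
        p.arguments.get("Arg0"),
        p.predicate,
        p.arguments.get("Arg1"),
        p.arguments.get("Arg2"),
    )


print(who_did_what_to_whom(labeled))
```

Nothing in this structure refers to a particular application, which is the sense in which such a representation is task independent.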
The relationship between the arguments of the predicates in a sentence and named entities has previously been exploited by those who have used SRL for information extraction.