1. Technical Field
The present disclosure relates to training spoken dialog systems and more specifically to generating user models with automatically transcribed dialog data.
2. Introduction
Under ideal conditions, designers of dialog managers in spoken dialog systems would try different dialog management strategies on the actual user population that will be using the spoken dialog system and select the one that works best. However, users are typically unwilling to endure this kind of extensive experimentation because they view it as too time-consuming, boring, or pointless. One alternative to this tedious experimentation is to build a model of user behavior. Designers can then experiment as much as needed to refine the dialog manager in the spoken dialog system using the model, without troubling actual users. Of course, only a high-quality user model that accurately reflects user actions can provide relevant and useful results for such experiments. One known method of building a user model is to estimate a model based on transcribed corpora of human-computer dialogs. However, hundreds or even thousands of transcriptions are required, and manual dialog transcription is expensive. Worse, user simulations are created for whole user populations instead of for individuals because of limited quantities of transcribed data for individual users. Consequently, these corpora are frequently too small, too sparse, and/or not specific enough for practical use. Further, spoken dialog system designers must often periodically evaluate the spoken dialog system with real users, which is also expensive and time-consuming.
In the prior art, a human transcriptionist listens to each of hundreds or thousands of user utterances and manually enters the words that were spoken. These transcriptions allow prior art systems to estimate a user behavior model and an automatic speech recognition (ASR) model, which together are used to create user simulations. The user behavior model takes the dialog history as input and predicts a distribution over user actions (such as answering a question, remaining silent, hanging up, etc.), and the ASR model takes the user action as input and predicts a distribution over ASR results (such as whether an error is made, a confidence score, etc.).
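The two-model arrangement described above can be illustrated with a minimal sketch. It is not taken from any particular prior art system: the toy corpus, the action and result labels, and the function names are all hypothetical, and the dialog history is reduced to the most recent system action purely to keep the example small. Both models are estimated by simple relative-frequency counting over transcribed turns, and a simulated turn is produced by sampling a user action and then an ASR result conditioned on that action.

```python
import random
from collections import Counter, defaultdict

# Hypothetical transcribed turns: (system_action, true_user_action, asr_result).
# In this sketch the "dialog history" is just the last system action (an assumption).
TRANSCRIBED_TURNS = [
    ("ask_city", "answer", "correct"),
    ("ask_city", "answer", "error"),
    ("ask_city", "silent", "no_input"),
    ("confirm_city", "answer", "correct"),
    ("confirm_city", "hang_up", "no_input"),
    ("confirm_city", "answer", "correct"),
]


def estimate_conditional(pairs):
    """Estimate P(outcome | context) by relative frequency."""
    counts = defaultdict(Counter)
    for context, outcome in pairs:
        counts[context][outcome] += 1
    return {
        context: {o: n / sum(c.values()) for o, n in c.items()}
        for context, c in counts.items()
    }


# User behavior model: dialog history -> distribution over user actions.
user_model = estimate_conditional((s, u) for s, u, _ in TRANSCRIBED_TURNS)
# ASR model: user action -> distribution over ASR results.
asr_model = estimate_conditional((u, r) for _, u, r in TRANSCRIBED_TURNS)


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def simulate_turn(system_action, rng):
    """Simulate one user turn: sample an action, then an ASR result for it."""
    user_action = sample(user_model[system_action], rng)
    asr_result = sample(asr_model[user_action], rng)
    return user_action, asr_result


if __name__ == "__main__":
    rng = random.Random(0)
    print(user_model["ask_city"])
    print(simulate_turn("ask_city", rng))
```

A dialog manager under development could call a function like `simulate_turn` in place of a real user. The quality of the resulting experiments is limited by how well the counted distributions reflect real behavior.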
In practice, ASR errors are isolated and independent, so it is feasible to build the ASR model with relatively few parameters (fewer than a thousand transcriptions is often sufficient). However, user behavior depends heavily on the dialog history, and capturing this in the user behavior model requires much more training data.
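The difference in data requirements can be made concrete with a rough parameter count. The sketch below assumes simple tabular (categorical) models, which is an illustrative assumption rather than a description of any specific system, and the action counts used are hypothetical. Under that assumption, the ASR model needs one distribution per user action, while a user behavior model conditioned on the last few system actions needs one distribution per possible history, so its size grows exponentially with the history length.

```python
def asr_model_params(n_user_actions, n_asr_results):
    """Free parameters of a tabular P(asr_result | user_action)."""
    return n_user_actions * (n_asr_results - 1)


def user_model_params(n_system_actions, n_user_actions, history_len):
    """Free parameters of a tabular P(user_action | last history_len system actions)."""
    n_histories = n_system_actions ** history_len
    return n_histories * (n_user_actions - 1)


if __name__ == "__main__":
    # Hypothetical sizes: 10 system actions, 5 user actions, 4 ASR result types.
    print("ASR model:", asr_model_params(5, 4))
    for k in (1, 2, 3):
        print(f"user model, history length {k}:", user_model_params(10, 5, k))
```

With these hypothetical sizes the ASR model has 15 free parameters, while the user behavior model has 40, 400, and 4000 for history lengths of 1, 2, and 3.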
One problem with the prior art approach is that transcribing dialog data is slow and expensive, so the number of transcriptions available for training the user behavior model is limited. As a result, user behavior models are impoverished and cannot effectively account for dialog history. Moreover, these user behavior models cover a whole population of users and do not model individual differences. Since dialog systems are trained on user simulations, these limitations set an upper bound on the effectiveness of the optimization process. To realize the potential gains of machine-learning approaches to building dialog systems, user behavior models need to be estimated from many more dialogs than can feasibly be transcribed.