1. Technical Field
The present disclosure relates to spoken dialog systems and more specifically to tracking a distribution over multiple dialog states in a spoken dialog system.
2. Introduction
Speech recognition and automated dialog technologies are imperfect, and the output from automatic speech recognition (ASR) engines often contains errors. Spoken dialog systems cope with these errors in various ways. Traditional systems track a single dialog state using a form structure. For example, in the travel domain, a form can contain fields for “departure city” and “arrival city.” If the caller says “I want to fly to Boston,” then the traditional system populates the “arrival city” field with the value BOSTON.
The conventional approach is problematic in that it requires numerous heuristics to decide how to interpret the results from the speech recognizer. Conflicts arise when the speech recognizer detects a different value for a field that has already been populated. Dealing with such conflicts is particularly difficult because the system must inevitably discard either the old or the new information. In sum, there is no principled way to create all of these heuristics. Many are based on intuition, and thus conventional systems discard much useful information.
One alternative is to maintain a probability distribution over all possible forms, otherwise known as dialog states. This approach assigns a probability of correctness to every possible dialog state rather than tracking a single dialog state. In practice, such systems cannot track all the possible dialog states because they are far too numerous, even for a dialog of modest size. Instead, the system tracks probabilities for groups of dialog states, called partitions. Initially, one partition contains all dialog states. As the dialog progresses, the system splits partitions as needed to capture distinctions implied by the items on the ASR N-best list. For example, if the system recognizes “to boston,” then one partition represents all itineraries to Boston, and another represents all itineraries to other cities. Then if the system recognizes “from new york,” the system creates four partitions: (1) from new york to boston, (2) from [any city but new york] to boston, (3) from new york to [any city but boston], and (4) from [any city but new york] to [any city but boston]. The dialog system tracks a probability of correctness for each partition, updated based on ASR score, agreement with the user's profile, etc. Such a system accommodates conflicting evidence by splitting partitions and shifting probability mass between partitions. All of the information on the N-best list can be used by comparing each N-best list entry to each partition.
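The splitting step described above can be sketched as follows. This is a minimal illustration rather than the implementation of any particular system: the names `Constraint`, `split`, `split_all`, and `describe` are hypothetical, a partition is assumed to be a mapping from slot name to a constraint, and the probability update (ASR score, user profile, etc.) is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Constraint:
    """Hypothetical slot constraint: equals `value`, or is anything but `excluded`."""
    value: Optional[str] = None
    excluded: FrozenSet[str] = frozenset()


# Assumption: a partition maps slot names to constraints.
# An empty mapping stands for the initial partition holding all dialog states.
Partition = Dict[str, Constraint]


def split(partition: Partition, slot: str, value: str) -> List[Partition]:
    """Split one partition on `slot == value`, if that distinction is new."""
    current = partition.get(slot, Constraint())
    if current.value is not None or value in current.excluded:
        # The partition already commits to a value, or already excludes this one.
        return [partition]
    matching = {**partition, slot: Constraint(value=value)}
    rest = {**partition, slot: Constraint(excluded=current.excluded | {value})}
    return [matching, rest]


def split_all(partitions: List[Partition], slot: str, value: str) -> List[Partition]:
    """Apply one recognized slot/value pair to every existing partition."""
    result: List[Partition] = []
    for partition in partitions:
        result.extend(split(partition, slot, value))
    return result


def describe(partition: Partition) -> str:
    """Render a partition in the bracket notation used in the text."""
    if not partition:
        return "[all dialog states]"
    parts = []
    for slot in sorted(partition):
        constraint = partition[slot]
        if constraint.value is not None:
            parts.append(f"{slot} {constraint.value}")
        else:
            parts.append(f"{slot} [any but {', '.join(sorted(constraint.excluded))}]")
    return ", ".join(parts)


if __name__ == "__main__":
    partitions: List[Partition] = [{}]
    partitions = split_all(partitions, "to", "boston")
    partitions = split_all(partitions, "from", "new york")
    for partition in partitions:
        print(describe(partition))
```

Run on the two recognitions from the example, the sketch yields the four partitions listed in the text. A real tracker would also attach a belief to each partition and divide the parent's probability mass between the two children at each split.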
However, as the dialog progresses, this splitting operation produces an ever-increasing number of partitions. One way to prevent the number of partitions from becoming so large that updates are not possible in real time is to recombine (merge) low-probability partitions and ignore the distinctions between the dialog states they represent. For example, if the two partitions “Flights from Boston to New York” and “Flights from [any city but Boston] to New York” are recombined, the resulting partition would be “Flights from [anywhere] to New York.”
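A minimal sketch of recombining two complementary partitions is shown below. The representation is an assumption made for illustration: each slot carries either `("is", value)` or `("not", excluded_values)`, and `merge_pair` is a hypothetical helper. How a system chooses which low-probability partitions to recombine is not specified here and is left out.

```python
from typing import Dict, FrozenSet, Optional, Tuple, Union

# Assumption: ("is", value) or ("not", frozenset_of_excluded_values) per slot.
Constraint = Tuple[str, Union[str, FrozenSet[str]]]
Partition = Dict[str, Constraint]


def merge_pair(a: Partition, b: Partition) -> Optional[Partition]:
    """Merge two partitions that are complementary on exactly one slot.

    Returns the merged partition, or None if the two partitions do not form
    such a pair (for example, if they differ on more than one slot).
    """
    differing = [s for s in set(a) | set(b) if a.get(s) != b.get(s)]
    if len(differing) != 1:
        return None
    slot = differing[0]
    first, second = a.get(slot), b.get(slot)
    if first is None or second is None:
        return None
    if first[0] == "not":
        first, second = second, first  # put the ("is", value) side first
    if first[0] != "is" or second[0] != "not" or first[1] not in second[1]:
        return None
    merged = {s: c for s, c in a.items() if s != slot}
    remaining = second[1] - {first[1]}
    if remaining:
        # Other values are still excluded, so the slot keeps a narrower exclusion.
        merged[slot] = ("not", frozenset(remaining))
    return merged


if __name__ == "__main__":
    from_boston = {"from": ("is", "boston"), "to": ("is", "new york")}
    not_from_boston = {"from": ("not", frozenset({"boston"})), "to": ("is", "new york")}
    beliefs = (0.02, 0.03)  # made-up beliefs for the two partitions
    print(merge_pair(from_boston, not_from_boston), sum(beliefs))
```

Because the two partitions cover disjoint sets of dialog states, the belief in the merged partition is the sum of the two original beliefs. In the example, the "from" constraint disappears entirely, which corresponds to “Flights from [anywhere] to New York.”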
Current techniques perform recombination at the end of each update. They first perform all possible splits considering the entire N-best list, then compute the new belief in this larger set of partitions, and finally recombine low-belief partitions. While this limits growth in the number of partitions across updates, it does not limit growth within an update. The problem is that the number of partitions is, at worst, exponential in the length of the ASR N-best list. As a result, the number of N-best entries that can be considered is limited to a small number: only 2 or 3 ASR N-best hypotheses in state-of-the-art systems.
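The worst-case growth within a single update can be illustrated with a short count. The sketch assumes, purely for illustration, that every N-best entry introduces a distinction that splits every existing partition in two; `worst_case_partitions` is a hypothetical name, and real N-best lists typically produce fewer splits than this bound.

```python
def worst_case_partitions(n_best_length: int) -> int:
    """Count partitions when each N-best entry splits every existing partition.

    Each partition is modeled as a tuple of booleans recording, per N-best
    entry, whether its dialog states match that entry's distinction.
    """
    partitions = [()]  # initially one partition holds all dialog states
    for _ in range(n_best_length):
        partitions = [p + (matches,) for p in partitions for matches in (True, False)]
    return len(partitions)


if __name__ == "__main__":
    for n in (1, 2, 3, 10, 20):
        print(n, worst_case_partitions(n))
```

Under this assumption the count doubles with each entry, giving 8 partitions for 3 entries but 1024 for 10, all of which must exist before any end-of-update recombination runs.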
In sum, while partition-based methods are promising, they currently cannot make use of more than a very limited number of entries on the N-best list. So despite their theoretical promise, in practice their ability to improve whole-dialog accuracy rates, task completion, and user satisfaction is substantially limited.