This invention relates to a man-machine dialogue system and a method for realising the same. The techniques proposed here can be applied to diverse input and output modalities, for example, graphics devices and touch sensitive devices, but a particular example of this invention is a spoken dialogue system, in which system and user communicate primarily through speech.
Speech generation and recognition technology is maturing rapidly, and attention is switching to the problems associated with the use of such technology in real applications, particularly in applications which allow a human to use voice to interact directly with a computer-based information control system. Apart from the simplest of cases, such systems involve a sequence of interactions between the user and the machine. They therefore involve dialogue and discourse management issues in addition to those associated with prompting the user and interpreting the responses. The systems are often referred to as spoken dialogue systems (SDS).
Examples of interactive speech recognition systems include telephone directory systems and travel information and reservation systems. With such systems, information is obtained from a user by recognising input speech that is provided in response to prompt questions from the system.
The key issues in designing an SDS are how to specify and control the dialogue flow; how to constrain the recogniser to work within a limited domain; how to interpret the recognition output; and how to generate contextually appropriate responses to the user. The design criteria which motivate the solutions to these issues are many and varied but a key goal is to produce a system which allows a user to complete the required tasks quickly and accurately.
In recent years many attempts have been made to realise such systems with varying degrees of success.
Today's state-of-the-art dialogue systems aim to understand “natural language” responses by users and typically use a mixed initiative approach, in which the user is not constrained to answer the system's direct questions. For example, in response to the question “Where do you want to go to?”, a user should be allowed to say “To Edinburgh tomorrow evening”. This answers the direct question and anticipates a later question. The approach has a number of consequences. It means, for example, that recognition accuracy must be high and parsing algorithms sophisticated in order to interpret what the user says. It raises problems with respect to the modularity and re-usability of dialogue sub-components. It can cause serious instabilities in the system due to increased chances of misunderstanding caused by recognition inaccuracies. Lastly, however much care is put into the design of such systems, misunderstandings will inevitably arise, and the system will be required to back-track, a non-trivial process in complex systems.
A further problem associated with such systems is that they can be highly labour intensive to develop and test, making them expensive to install and maintain.
Current approaches to dialogue control are varied but typically involve specification using flow-charts and run-time control based on some form of augmented transition network. Essentially, such a network consists of a set of states linked together by directional transitions. States represent some action of the SDS, such as a question/answer cycle, data processing, a simple junction or a sub-dialogue which expands to another network. Transitions, and their associated transition conditions, determine the course of dialogue flow after the state has been executed.
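By way of illustration only, the state-and-transition arrangement described above may be sketched as follows. This is a minimal sketch and not a definitive implementation: the names `State`, `run_dialogue`, `make_question` and `build_travel_dialogue` are hypothetical, the user's replies are supplied as a scripted list rather than by a recogniser, and sub-dialogues are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Context = Dict[str, str]  # shared dialogue data, e.g. {"destination": "Edinburgh"}


@dataclass
class State:
    """One node of the network: an action plus ordered outgoing transitions."""
    name: str
    action: Callable[[Context], None]
    # Each transition is (condition, target state name); the first true one wins.
    transitions: List[Tuple[Callable[[Context], bool], str]] = field(default_factory=list)


def run_dialogue(states: Dict[str, State], start: str, ctx: Context,
                 max_steps: int = 50) -> List[str]:
    """Execute states until none of the current state's transitions fires."""
    visited: List[str] = []
    current: Optional[str] = start
    while current is not None and len(visited) < max_steps:
        state = states[current]
        visited.append(current)
        state.action(ctx)  # execute the state first ...
        # ... then let the transition conditions choose the next state.
        current = next((target for cond, target in state.transitions if cond(ctx)), None)
    return visited


def make_question(slot: str, answers: List[str]) -> Callable[[Context], None]:
    """Hypothetical question/answer cycle: take the next scripted reply."""
    def action(ctx: Context) -> None:
        reply = answers.pop(0) if answers else ""
        if reply:  # an empty reply stands for a failed recognition
            ctx[slot] = reply
    return action


def build_travel_dialogue(scripted_answers: List[str]) -> Dict[str, State]:
    answers = list(scripted_answers)
    return {
        "ask_destination": State(
            "ask_destination", make_question("destination", answers),
            [(lambda c: "destination" in c, "ask_date"),
             (lambda c: True, "ask_destination")]),  # re-ask on failure
        "ask_date": State(
            "ask_date", make_question("date", answers),
            [(lambda c: "date" in c, "confirm"),
             (lambda c: True, "ask_date")]),
        # Final state: no outgoing transitions, so the dialogue ends here.
        "confirm": State("confirm", lambda c: c.update(confirmed="yes")),
    }
```

In this sketch a failed question/answer cycle simply loops back to the same state, which shows how the transition conditions, rather than the states themselves, determine the course of the dialogue.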
Similar networks may also be used in an SDS in order to provide syntactic constraints for speech input. Grammatical rules may be expressed in the form of finite state networks. Again, these consist of states and transitions, but in this case the states represent some acoustic phenomenon, such as a word, to be matched with incoming speech data, or, as before, sub-networks or simple junctions.
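A finite state network of this kind may be sketched as follows. The sketch is illustrative only and rests on stated assumptions: it matches a sequence of already-recognised word tokens rather than acoustic data, sub-networks are omitted, and the names `Network`, `GRAMMAR` and `accepts` are hypothetical. Each state carries either a word to be matched or `None` for a simple junction.

```python
from typing import Dict, List, Optional, Set, Tuple

# state name -> (word to match, or None for a junction; successor state names)
Network = Dict[str, Tuple[Optional[str], List[str]]]

# Hypothetical grammar: "to (edinburgh | london) [tomorrow]"
GRAMMAR: Network = {
    "start": (None, ["to"]),
    "to": ("to", ["edinburgh", "london"]),
    "edinburgh": ("edinburgh", ["tomorrow", "end"]),
    "london": ("london", ["tomorrow", "end"]),
    "tomorrow": ("tomorrow", ["end"]),
    "end": (None, []),
}


def accepts(network: Network, start: str, final: str, words: List[str]) -> bool:
    """Return True if some path from start to final matches every word."""
    stack: List[Tuple[str, int]] = [(start, 0)]  # (state, words consumed so far)
    seen: Set[Tuple[str, int]] = set()
    while stack:
        state, pos = stack.pop()
        if (state, pos) in seen:
            continue
        seen.add((state, pos))
        label, successors = network[state]
        if label is not None:
            # A word state must match the next input word, otherwise this path dies.
            if pos >= len(words) or words[pos] != label:
                continue
            pos += 1
        if state == final and pos == len(words):
            return True
        stack.extend((nxt, pos) for nxt in successors)
    return False
```

Under these assumptions, `accepts(GRAMMAR, "start", "end", ["to", "edinburgh", "tomorrow"])` succeeds, whereas an utterance outside the grammar, such as `["edinburgh"]` alone, is rejected. This is the sense in which the network constrains the recogniser to a limited domain.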
Speech output is typically derived from a simple text string which is then translated by the output device into speech by using some form of text-to-speech algorithm or by concatenating previously recorded acoustic waveforms. This may be represented as a very simple network with a single permissible path and each output word representing a state.
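The single-path representation of an output prompt may be sketched as follows, purely by way of example. The names `prompt_to_network` and `read_out` and the state labels `w0`, `w1`, and so on are hypothetical, and the step that would pass each word to a text-to-speech algorithm or a recorded waveform is replaced here by collecting the words in a list.

```python
from typing import Dict, List, Tuple

# state name -> (output word, successor state names)
OutputNetwork = Dict[str, Tuple[str, List[str]]]


def prompt_to_network(text: str) -> OutputNetwork:
    """Turn a text string into a chain of states, one per output word."""
    words = text.split()
    network: OutputNetwork = {}
    for i, word in enumerate(words):
        # Every state has exactly one successor, except the last, which has none.
        successors = [f"w{i + 1}"] if i + 1 < len(words) else []
        network[f"w{i}"] = (word, successors)
    return network


def read_out(network: OutputNetwork, start: str = "w0") -> List[str]:
    """Follow the single permissible path and return the words in order."""
    spoken: List[str] = []
    state = start if start in network else None
    while state is not None:
        word, successors = network[state]
        spoken.append(word)  # stand-in for sending the word to the output device
        state = successors[0] if successors else None
    return spoken
```

Representing output in the same state-and-transition form as the dialogue and grammar networks is a design choice of this sketch; it is intended only to show that a fixed prompt is the degenerate case of a network with no branching.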