Speech technology has now reached a high level of performance, which has led to its increasing use in many critical systems. Research performed by aeronautical companies and regulatory institutions, in collaboration with speech technology expert groups, has produced large speech and text databases, along with new speech and text processing models adapted to specific field requirements. An important area of critical application that may benefit from these capabilities is the control of aerial vehicles; Air Traffic Control (ATC) and interfaces for Unmanned Aerial Vehicles (UAVs) benefit in particular. UAVs are of particular interest to the present invention.
When developing a UAV control interface, it is usual to include the following speech-processing modules: a speech recogniser for converting natural speech into a sequence of words; a natural language understanding module that extracts the main semantic concepts from the text (the commands to be executed and their corresponding data for UAV control); and a response generation module for creating a natural response to the pilot, which is converted into speech by a speech synthesiser. The response confirms the command received.
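The flow through these modules can be sketched as follows. This is a minimal, hypothetical skeleton: the function names, the command vocabulary, and the semantic-frame format are all invented for illustration, and each stand-in function replaces an entire subsystem described in this section.

```python
# Hypothetical sketch of the voice-control pipeline; all names and the
# command format are illustrative assumptions, not a real implementation.

def recognize(audio: str) -> str:
    # Stand-in for the speech recogniser: audio -> word sequence.
    # (Here the "audio" is already text, for demonstration only.)
    return audio.lower()

def understand(words: str) -> dict:
    # Stand-in for natural language understanding: words -> semantic concepts.
    tokens = words.split()
    if "waypoint" in tokens:
        return {"COMMAND": "GOTO", "WAYPOINT_CODE": tokens[-1].upper()}
    return {}

def generate_response(frame: dict) -> str:
    # Stand-in for response generation: concepts -> confirmation sentence.
    if frame.get("COMMAND") == "GOTO":
        return f"Proceeding to waypoint {frame['WAYPOINT_CODE']}."
    return "Command not understood, please repeat."

def synthesize(text: str) -> None:
    # Stand-in for the text-to-speech back end.
    print(text)

frame = understand(recognize("Go to waypoint A01"))
synthesize(generate_response(frame))
```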
Speech recognition software developed so far is based on two sources of knowledge: acoustic modelling and language modelling. For acoustic modelling, current speech recognition systems are based on hidden Markov models (HMMs). For each allophone (a characteristic pronunciation of a phoneme), one HMM is calculated as the result of a training process carried out on a speech database. A speech database consists of several hours of transcribed speech (files combining speech and text, in which the speech signal can be correlated with the words pronounced by the speaker). The size of the database determines the versatility and robustness of the speech recognition. Database acquisition is a very costly process because it requires linguistics experts to transcribe by hand the speech pronounced by many different speakers.
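The core computation an HMM supports can be illustrated with the forward algorithm, which scores how well a model explains an observation sequence. The sketch below uses a toy discrete two-state model with invented parameters; real acoustic models use continuous observation densities over spectral features, but the recursion is the same in spirit.

```python
# Minimal discrete-HMM forward algorithm (illustrative only; parameters
# and the two-symbol observation alphabet are invented for the example).

def forward(pi, A, B, obs):
    """Return P(obs | model) for an HMM with initial probabilities pi,
    transition matrix A and emission matrix B (discrete observations)."""
    n = len(pi)
    # Initialisation: probability of starting in each state and emitting obs[0].
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    # Induction: sum over all predecessor states at each time step.
    for o in obs[1:]:
        alpha = [sum(alpha[p] * A[p][s] for p in range(n)) * B[s][o]
                 for s in range(n)]
    # Termination: total probability over all final states.
    return sum(alpha)

# Toy two-state model with a binary observation alphabet {0, 1}.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward(pi, A, B, [0, 1, 0]))
```

In training, such likelihoods are maximised over the transcribed speech database; in recognition, the model yielding the highest score for the observed acoustics wins.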
Language modelling complements acoustic modelling with information about the most probable word sequences. There are several techniques for language modelling, including grammar-based language modelling and statistical language modelling (N-gram).
Grammar-based language modelling consists of defining all possible sentences that the system can recognise. Any other word sequence, not foreseen in these sentences, is rejected. This model is easier for a non-expert to generate, but it is very strict and does not deal well with the spontaneous or stressed speech found in real-life situations.
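A toy illustration of this strictness, with an invented grammar: the model enumerates every admissible sentence, so a spontaneous-speech variant with a filler word falls outside the grammar and is rejected outright.

```python
# Toy grammar-based language model (sentences and waypoint codes invented).
# The grammar is just the finite set of all admissible word sequences.
ALLOWED = {
    ("go", "to", "waypoint", code) for code in ("a01", "a02", "b01")
} | {("return", "to", "base")}

def accepts(sentence: str) -> bool:
    # A sentence is recognisable only if it matches the grammar exactly.
    return tuple(sentence.lower().split()) in ALLOWED

print(accepts("Go to waypoint A01"))        # in-grammar: accepted
print(accepts("eh go go to waypoint A01"))  # spontaneous filler: rejected
```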
Statistical language modelling consists of computing the probability of a word given the N−1 previous words. For example, a 3-gram model consists of the probabilities of each possible word preceded by any combination of two words. The statistical model is generated automatically from application-oriented text (a set of sentences), applying a smoothing process for unseen sequences. This smoothing means that all word sequences are permitted to some extent (there are no forbidden word sequences), fulfilling the role of a fundamental robustness factor. This is very important when modelling spontaneous speech, as it accommodates word repetitions, doubts, etc.
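The sketch below builds a 3-gram model with add-one smoothing from a tiny invented corpus (sentence boundaries are ignored for brevity). A trigram seen in training outranks an unseen one, but the unseen trigram still keeps a non-zero probability, which is exactly the robustness property described above.

```python
# Illustrative 3-gram language model with add-one (Laplace) smoothing.
# The training corpus is invented; boundaries between sentences are
# ignored to keep the example short.
from collections import Counter

corpus = [
    "go to waypoint a01",
    "go to waypoint a02",
    "return to base",
]
tokens = " ".join(corpus).split()
vocab = set(tokens)
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))  # trigram counts
bi = Counter(zip(tokens, tokens[1:]))               # bigram counts

def p(w, w1, w2):
    """P(w | w1 w2) with add-one smoothing over the vocabulary."""
    return (tri[(w1, w2, w)] + 1) / (bi[(w1, w2)] + len(vocab))

# Seen trigram vs. unseen trigram: both non-zero, the seen one more likely.
print(p("waypoint", "go", "to"), p("base", "go", "to"))
```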
So far, the speech recognition systems incorporated in UAV interfaces have been commercial programs such as those provided by Microsoft™ and Nuance™. These recognisers are integrated by the UAV interface developer, typically an expert on UAV task assignment and piloting but not necessarily a speech technology expert. Although speech recognition systems are evolving into more robust and user-friendly software engines, there are still important limitations in their configuration that drastically affect speech recognition performance. One important aspect is the language modelling: the commercial recognition engines offer the possibility of defining a grammar-based model (easy for a non-expert to define), but this configuration is not flexible enough for the spontaneous or stressed speech that often appears in UAV control interfaces.
To understand spoken commands, one must extract the semantic information or “meaning” (within the specific application domain) from the speech recogniser output (i.e. the sequence of words it provides). The semantic information may be represented by means of a frame containing semantic concepts. A semantic concept consists of an identifier or attribute, and a value; for example, the attribute could be “WAYPOINT_CODE” and the value “A01”. Usually, natural language understanding is performed by rule-based techniques: the relations between semantic concepts and sequences of words or other concepts are defined by hand by an expert. Rule-based techniques can be classified into two types, top-down and bottom-up strategies.
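A semantic frame can be sketched as a dictionary of attribute–value concepts filled in by hand-written rules. The rules below are invented examples (a regular expression for the waypoint code and a keyword test for the command); a real system would hold many such expert-defined rules.

```python
# Illustrative rule-based frame extraction. The concept names
# (WAYPOINT_CODE, COMMAND) follow the example in the text; the
# extraction rules themselves are invented for this sketch.
import re

def extract_frame(words: str) -> dict:
    """Map a recognised word sequence to a frame of semantic concepts."""
    frame = {}
    text = words.lower()
    # Rule 1: a waypoint code is a letter followed by digits after "waypoint".
    m = re.search(r"\bwaypoint\s+([a-z]\d+)\b", text)
    if m:
        frame["WAYPOINT_CODE"] = m.group(1).upper()
    # Rule 2: the phrase "go to" signals a GOTO command.
    if "go to" in text:
        frame["COMMAND"] = "GOTO"
    return frame

print(extract_frame("go to waypoint a01"))
```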
In a top-down strategy, the rules look for semantic concepts from a global analysis of the whole sentence. This strategy tries to match all the words in the sentence to a sequence of semantic concepts. This technique is not flexible or robust enough to deal with errors in the word sequence provided by the speech recogniser; even a single error may cause the semantic analysis to fail. Most previous attempts at speech interfaces for UAV command and control use rule-based techniques with a top-down strategy.
In a bottom-up strategy, the semantic analysis starts from each word individually and extends the analysis to neighbouring context words or other already-built conceptual islands. This extension is performed to find specific combinations of words and/or concepts (blocks) that generate a higher-level semantic concept. The rules implemented by the expert define these relations. This strategy is more robust against speech recognition errors and is necessary when a statistical language model is used in the speech recognition software.
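The island-building idea can be sketched as two passes: a word-level pass that maps known words to low-level concepts (silently skipping fillers and repetitions), then a combination pass that joins adjacent islands into higher-level concepts. The lexicon and combination rules below are invented for illustration.

```python
# Sketch of a bottom-up understanding strategy (rules are invented).
# Unknown words ("eh", "the") never block the analysis; they simply
# produce no island, which is what makes the strategy robust.
LEXICON = {"go": "MOVE", "to": "MOVE", "waypoint": "WP_KW",
           "a01": "WP_CODE", "a02": "WP_CODE"}

def parse(words):
    # Pass 1: build word-level concept islands, ignoring unknown words.
    islands = [(LEXICON[w], w) for w in words if w in LEXICON]
    # Pass 2: combine adjacent islands into higher-level concepts.
    frame = {}
    for (c1, _), (c2, w2) in zip(islands, islands[1:]):
        if c1 == "WP_KW" and c2 == "WP_CODE":
            frame["WAYPOINT_CODE"] = w2.upper()
        if c1 == "MOVE":
            frame["COMMAND"] = "GOTO"
    return frame

# Robust to filler words and repetitions typical of spontaneous speech:
print(parse("eh go go to the waypoint a01".split()))
```

A top-down analyser would reject this sentence outright because of the filler and repeated words; the bottom-up pass still recovers both concepts.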
The response generation module translates the understood concepts into a natural language sentence used to confirm the command back to the pilot. These sentences can be fixed or can be built from templates with variable fields, which are filled in with the information obtained from the semantic interpretation of the previous sentence. Both kinds of response generation module have been used in the past for UAV command and control. Finally, the natural language sentence is converted into speech by a text-to-speech conversion system (a speech synthesiser).
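The template variant can be sketched as follows; the template strings and the fall-back fixed sentence are invented for the example, and the frame format matches the one used earlier in this section.

```python
# Illustrative template-based response generation: variable fields in a
# fixed sentence are filled from the semantic frame of the previous turn.
# Template wording is invented for this sketch.
TEMPLATES = {
    "GOTO": "Roger, proceeding to waypoint {WAYPOINT_CODE}.",
    "RTB": "Roger, returning to base.",
}

def respond(frame: dict) -> str:
    """Pick a template by command and fill its fields from the frame."""
    template = TEMPLATES.get(frame.get("COMMAND"),
                             "Command not understood, please repeat.")
    return template.format(**frame)

print(respond({"COMMAND": "GOTO", "WAYPOINT_CODE": "A01"}))
```

The returned sentence would then be passed to the speech synthesiser for playback to the pilot.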