Voice-controlled systems, or voice-controlled user interfaces, such as the interactive voice response (IVR) and interactive media response (IMR) systems of contact centers, typically include speech recognition systems for converting audio signals containing speech into machine-readable messages. In other words, the speech recognition systems may be used to parse user speech into commands or other user input, based on the semantics of the words contained in the speech.
Speech recognition systems also appear in other contexts, such as voice controlled user interfaces for intelligent personal assistants (such as Apple® Siri®, Amazon® Alexa®, and Google® Now), navigation systems, and televisions.
These voice-controlled user interfaces often set constraints on the type of input expected, based on the context. For example, when a voice-controlled user interface expects the user to supply a telephone number in the United States, the user interface may expect the user to provide ten digits (e.g., “two one two eight six seven five three oh nine”). These constraints or rules specifying expected speech recognition inputs (including DTMF inputs) may be referred to as grammars.
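For illustration, a grammar of this kind might be expressed in SRGS XML form roughly as follows (the rule names here are illustrative, not taken from any particular deployment):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative SRGS grammar accepting exactly ten spoken digits -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" mode="voice" xml:lang="en-US" root="phone-number">
  <rule id="phone-number">
    <!-- repeat="10" constrains the input to exactly ten digits -->
    <item repeat="10"><ruleref uri="#digit"/></item>
  </rule>
  <rule id="digit">
    <one-of>
      <item>zero</item>
      <item>oh</item>
      <item>one</item>
      <item>two</item>
      <item>three</item>
      <item>four</item>
      <item>five</item>
      <item>six</item>
      <item>seven</item>
      <item>eight</item>
      <item>nine</item>
    </one-of>
  </rule>
</grammar>
```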
Voice extensible markup language (VoiceXML) is a digital document standard for specifying interactive media and voice dialogs between humans and computers. The Speech Recognition Grammar Specification (SRGS) Version 1.0 is a standard published by the World Wide Web Consortium (W3C) that defines a syntax for representing grammars for use in speech recognition, thereby allowing developers to specify the words and/or structure of a user input that the speech recognizer should expect to receive.
The VoiceXML document may specify a script for an interaction between a caller and an interactive voice response (IVR) or interactive media response (IMR) system 122. For example, the VoiceXML may specify a greeting that is played to a caller when the caller is first connected to the IMR 122. The greeting may include a request for the user to provide identifying information, such as a customer account number. The VoiceXML script may specify that the IMR 122 is to wait for the caller to provide the account number, where the account number is expected to meet particular conditions (e.g., a particular number of digits, such as a 16-digit account number, or one digit, three alphabetic characters, and three more digits). The VoiceXML script may refer to an “account number” identifier, which identifies a corresponding “account number” grammar that is defined in a grammar document. The grammar document may be specified using, for example, SRGS, and defines the particular constraints of one or more named grammars. Accordingly, in this example, the grammar document may define an “account number” grammar and the constraints on the account number (e.g., 16 numeric digits, or one digit, three alphabetic characters, and three more digits).
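As a sketch of how such a script might reference a named grammar (the file names, field names, and prompt text below are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative VoiceXML dialog; the grammar file name is hypothetical -->
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0">
  <form id="identify-caller">
    <field name="accountnumber">
      <!-- External SRGS grammar document defining the
           "account number" constraints -->
      <grammar src="account-number.grxml" type="application/srgs+xml"/>
      <prompt>Welcome. Please say your account number.</prompt>
      <filled>
        <!-- Executed once the recognizer matches the grammar -->
        <submit next="lookup-account.cgi"/>
      </filled>
    </field>
  </form>
</vxml>
```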
Once the speech recognition system determines the most likely input (e.g., utterance) it heard, the speech recognition system extracts the semantic meaning from that input and returns that semantic meaning to the VoiceXML interpreter (so that the VoiceXML interpreter can take an action in response to the user input). This semantic interpretation is specified via the Semantic Interpretation for Speech Recognition (SISR) standard. SISR is used inside SRGS to specify the semantic results associated with the grammars, e.g., the set of ECMAScript (or JavaScript) assignments that create the semantic structure returned by the speech recognizer.
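For example, SISR tags embedded in an SRGS grammar might map several spoken phrasings onto a single semantic value roughly as follows (the rule and phrasings are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative SRGS grammar with SISR tags: several phrasings
     map to one Boolean semantic result -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" mode="voice" xml:lang="en-US"
         root="confirm" tag-format="semantics/1.0">
  <rule id="confirm">
    <one-of>
      <!-- The ECMAScript assignments build the semantic structure
           returned to the VoiceXML interpreter -->
      <item>yes <tag>out = true;</tag></item>
      <item>yeah <tag>out = true;</tag></item>
      <item>no <tag>out = false;</tag></item>
      <item>nope <tag>out = false;</tag></item>
    </one-of>
  </rule>
</grammar>
```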
The current VoiceXML standard defines parameters for configuring grammars based on the built-in digits and Boolean types. See, for example, McGlashan, Scott, et al., “Voice Extensible Markup Language (VoiceXML) Version 2.0,” W3C Recommendation, 16 Mar. 2004, Appendix P, which defines “Builtin Grammar Types.” For example, the digits type may be used with “minlength” and “maxlength” parameters to specify a range for the number of digits to expect. As another example, the Boolean type may be parameterized to specify, in a dual-tone multi-frequency (DTMF) signaling system (or “touch tones”), which keypress corresponds to “yes” and which keypress corresponds to “no.”
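In VoiceXML 2.0, these built-in grammar parameters are appended to the field type, as in the following sketch (the field names and prompts are hypothetical):

```xml
<!-- Illustrative VoiceXML fields using built-in grammar parameters
     (see VoiceXML 2.0, Appendix P, "Builtin Grammar Types") -->
<field name="zip" type="digits?minlength=5;maxlength=5">
  <prompt>Please say or enter your five digit ZIP code.</prompt>
</field>

<!-- For the DTMF boolean grammar, the y and n parameters select
     which keypress means "yes" and which means "no" -->
<field name="confirmed" type="boolean?y=1;n=2">
  <prompt>Press one for yes, or two for no.</prompt>
</field>
```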
However, the set of parameters for configuring these grammars is limited. For example, for the digits grammar, the VoiceXML standard merely provides the “length” parameter in addition to the aforementioned “minlength” and “maxlength” parameters for specifying a number of digits, and, for the Boolean DTMF grammar, merely the aforementioned parameters for specifying the “yes” and “no” keypresses.