Automatic Speech Recognition (ASR)
The object of automatic speech recognition is to acquire an acoustic signal representative of speech, i.e., speech signals, and determine the words that were spoken by pattern matching. Speech recognizers typically have a set of stored acoustic and language models represented as patterns in a computer database. These models are then compared to the acquired signals. The contents of the computer database, how the database is trained, and the techniques used to determine the best match are distinguishing features of different types of speech recognition systems.
Various speech recognition methods are known. Segmental model methods assume that there are distinct phonetic units, e.g., phonemes, in spoken language that can be characterized by a set of properties in the speech signal over time. Input speech signals are segmented into discrete sections in which the acoustic properties represent one or more phonetic units, and labels are attached to these sections according to these properties. A valid vocabulary word, consistent with the constraints of the speech recognition task, is then determined from the sequence of assigned phonetic labels.
Template-based methods use the speech patterns directly without explicit feature determination and segmentation. A template-based speech recognition system is initially trained using known speech patterns. During recognition, unknown speech signals are compared with each possible pattern acquired during the training and classified according to how well the unknown patterns match the known patterns.
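As an illustration only, template matching can be sketched with dynamic time warping (DTW) as the similarity measure. DTW is an assumption made for this sketch; the description above does not name a particular matching technique. The one-dimensional feature sequences, the `TEMPLATES` table, and the function names are likewise hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def classify(unknown, templates):
    """Return the label of the stored template that best matches the input."""
    return min(templates, key=lambda label: dtw_distance(unknown, templates[label]))


# Hypothetical templates acquired during training (toy values, not real speech).
TEMPLATES = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 2.0, 4.0],
}
```

In this sketch, a slower rendition of a stored pattern, such as `[1.0, 1.0, 3.0, 4.0, 4.0, 3.0, 1.0]`, still aligns with the "yes" template because the warping absorbs the difference in speaking rate.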
Hybrid methods combine certain features of the above-mentioned segmental model and template-based methods. In certain systems, more than just acoustic information is used in the recognition process. Also, neural networks have been used for speech recognition. For example, in one such network, a pattern classifier detects the acoustic feature vectors, convolves the vectors with filters matched to the acoustic features, and sums up the results over time.
ASR Enabled Systems
ASR enabled systems include two major categories, i.e., information retrieval (IR) systems, and command and control (CC) systems.
Information Retrieval (IR)
In general, the information retrieval (IR) system searches content stored in a database based on a spoken query. The content can include any type of multimedia content such as, but not limited to, text, images, audio and video. The query includes key words or phrases. Many IR systems allow the user to specify additional constraints to be applied during the search. For instance, a constraint can specify that all returned content has a range of attributes. Typically, the query and the constraints are specified as text.
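A minimal sketch of a keyword query with an attribute-range constraint might look as follows. The record layout (`id`, `text`, `attributes`), the sample `DOCUMENTS`, and the function names are assumptions made for illustration and are not part of any described system.

```python
def satisfies(doc, constraints):
    """True if every constrained attribute of doc lies in its allowed range."""
    for name, (low, high) in constraints.items():
        value = doc["attributes"].get(name)
        if value is None or not (low <= value <= high):
            return False
    return True


def search(documents, keywords, constraints=None):
    """Return ids of documents containing all keywords and meeting constraints."""
    constraints = constraints or {}
    hits = []
    for doc in documents:
        text = doc["text"].lower()
        if all(k.lower() in text for k in keywords) and satisfies(doc, constraints):
            hits.append(doc["id"])
    return hits


# Hypothetical content database.
DOCUMENTS = [
    {"id": "a", "text": "Jazz concert recording", "attributes": {"year": 1999}},
    {"id": "b", "text": "Jazz piano lesson", "attributes": {"year": 2015}},
    {"id": "c", "text": "Rock concert recording", "attributes": {"year": 2015}},
]
```

Here `search(DOCUMENTS, ["jazz"], {"year": (2010, 2020)})` narrows the keyword matches to those whose `year` attribute falls within the stated range.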
For some applications, textual input and output are difficult, if not impossible. These applications include, for example, searching a database while operating a machine or a vehicle, or applications with a limited-functionality keyboard or display, such as a telephone. For such applications, ASR enabled IR systems are preferred.
An example of the ASR enabled IR system is described in U.S. Pat. No. 7,542,966, “Method and system for retrieving documents with spoken queries,” issued to Wolf et al. on Jun. 2, 2009.
Command and Control (CC)
ASR enabled CC systems recognize spoken commands and interpret them into machine-understandable commands. Non-limiting examples of the spoken commands are “call” a specified telephone number, or “play” a specified song. A number of ASR enabled CC systems have been developed due to recent advancements in speech recognition software. Typically, those systems operate in a particular environment using a particular context for the spoken commands.
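The interpretation step can be sketched as a mapping from a recognized command word to a machine-understandable instruction. The sketch assumes the recognizer has already produced text; the `COMMANDS` table and the instruction names `DIAL_NUMBER` and `PLAY_SONG` are hypothetical.

```python
# Hypothetical mapping from spoken command words to machine instructions.
COMMANDS = {"call": "DIAL_NUMBER", "play": "PLAY_SONG"}


def interpret(recognized_text):
    """Map recognized text to (instruction, argument), or None if unknown."""
    words = recognized_text.strip().split(maxsplit=1)
    if not words:
        return None
    verb = words[0].lower()
    if verb not in COMMANDS:
        return None
    # Everything after the command word is passed through as the argument.
    argument = words[1] if len(words) > 1 else ""
    return (COMMANDS[verb], argument)
```

For example, the recognized text "call 555 0100" would be interpreted as the pair `("DIAL_NUMBER", "555 0100")`, while text that begins with a word outside the table is rejected.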
Contextual ASR Enabled Systems
Large vocabularies and complex language models slow ASR enabled systems and require more resources, such as memory and processing power. Large vocabularies can also reduce the accuracy of the systems. Therefore, most ASR enabled systems have small vocabularies and simple language models, typically associated with a relevant context. For example, U.S. Pat. No. 4,989,253 discloses an ASR enabled system for moving and focusing a microscope. That system uses the context associated with microscopes. Also, U.S. Pat. No. 5,970,457 discloses an ASR enabled system for operating medical equipment, such as surgical tools, in accordance with the spoken commands associated with an appropriate context.
However, a number of ASR enabled systems need to include multiple vocabularies and language models useful for different contexts. Such systems are usually configured to activate the appropriate vocabulary and language model based on a particular context of interest selected by a user.
As defined herein, the context of the ASR enabled system is, but is not limited to, a vocabulary, a language model, a grammar, a domain, a database, and/or a subsystem with related contextual functionality. For example, the functionalities related to music, contacts, restaurants, or points of historical interest would each have separate and distinguishable contexts. An ASR enabled system that utilizes multiple contexts is a contextual ASR enabled system.
Accordingly, for the contextual ASR enabled systems, it is necessary to specify the context for the spoken queries or the spoken commands.
ASR Enabled Systems Employing PTT Functionality
There are different types of ASR systems that distinguish intended speech input from background noise, or background speech. Always-listening systems employ a lexical analysis of the recognized audio signal to detect keywords, e.g., “computer,” which are intended to activate the ASR enabled systems for further input.
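Keyword-based activation can be sketched as a scan of the recognized word sequence for the activating keyword. The function name `extract_command` and the representation of recognizer output as a list of word strings are assumptions made for this sketch.

```python
def extract_command(tokens, wake_word="computer"):
    """Return the words following the wake word, or None if it was not spoken."""
    lowered = [t.lower() for t in tokens]
    if wake_word not in lowered:
        return None  # treat the audio as background speech
    start = lowered.index(wake_word) + 1
    return tokens[start:]
```

Input that never contains the keyword is treated as background speech, and only the words following the keyword are passed on as intended input.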
Another type of ASR enabled system makes use of other input cues modeled after human-to-human discourse, such as direction of gaze.
Yet another type of ASR system uses push-to-talk (PTT) functionality. A PTT control, e.g., a button, is used to mark the beginning of a stream of audio signal as intended speech input. In some implementations, the end of the speech input is determined automatically by analyzing, for example, the amplitude or signal-to-noise ratio (SNR) of the acquired signal. In other implementations, the user is required to keep the button depressed until the user is finished speaking, with the release of the button explicitly marking the end of the input signal.
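Automatic determination of the end of the speech input from the signal amplitude can be sketched as follows. The sketch assumes that the audio following the PTT press is available as a list of non-empty sample frames; the `threshold` and `hangover` parameters and the function names are hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one non-empty frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def find_end_of_speech(frames, threshold, hangover):
    """Index of the first frame after speech ends, or len(frames) if none.

    frames[0] is the first frame captured after the PTT press. Speech is
    considered finished once `hangover` consecutive frames fall below
    `threshold` after at least one frame has reached it.
    """
    speech_seen = False
    quiet = 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) >= threshold:
            speech_seen = True
            quiet = 0
        elif speech_seen:
            quiet += 1
            if quiet >= hangover:
                return i - hangover + 1
    return len(frames)
```

Requiring several consecutive quiet frames keeps short pauses between words from being mistaken for the end of the input, and ignoring silence before the first loud frame tolerates a delay between the button press and the start of speech.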
Embedded ASR Systems
Sometimes, it is necessary to embed the ASR enabled system directly in a physical device rather than to implement the ASR enabled system on network-based computing resources. Scenarios where such embedding may be necessary include those where a persistent network connection cannot be assumed. In those scenarios, even if the ASR enabled system involves updating databases on network computers, it is necessary to obtain information through human-machine interaction conducted independently on the device. Then, after the network communication channel is restored, the updated information collected on the device can be synchronized with the network-based database.
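The collect-then-synchronize behavior can be sketched as a local queue that is flushed once the connection is available. The class name `OfflineUpdateQueue`, the `connected` flag, and the use of a plain list to stand in for the network-based database are assumptions made for illustration.

```python
class OfflineUpdateQueue:
    """Holds updates collected on the device until the network is available."""

    def __init__(self):
        self.pending = []

    def record(self, update):
        # Called during on-device interaction, with or without a connection.
        self.pending.append(update)

    def synchronize(self, remote_db, connected):
        """Push pending updates to remote_db; return how many were sent."""
        if not connected:
            return 0
        count = len(self.pending)
        remote_db.extend(self.pending)
        self.pending.clear()
        return count
```

Updates recorded while the channel is down remain in `pending` and are transferred in the order in which they were collected once `synchronize` is called with a working connection.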
As defined herein, an embedded ASR system is one in which all speech signal processing necessary to perform CC or IR takes place on a device, typically having an attached wired or wireless microphone. Some of the data required to generate, modify, or activate the embedded ASR system can be downloaded from different devices via wired or wireless data channels. However, at the time of ASR processing, all data resides in a memory associated with the device.
As described above, it is advantageous to use different types of ASR systems, such as IR and CC systems, in conjunction with a particular context or a plurality of contexts. Also, due to their limited memory and CPU resources, some embedded ASR systems have limitations which do not necessarily apply to desktop or server-based ASR systems. For example, desktop or server-based systems might be able to process a music-retrieval instruction, such as searching for a particular artist, from any state of the system. However, the embedded ASR system, e.g., an ASR system in a vehicle, might require the user to switch to an appropriate contextual state first, and would allow the user to provide only speech input relevant to that particular contextual state.
Typically, the embedded ASR system is associated with multiple different contexts. For example, music can be one context. While the embedded ASR system is in the music context state, the system expects user speech input to be relevant to music, and the system is configured to execute only functions relevant to retrieving music. Navigation and contacts are other non-limiting examples of contexts of the ASR system.
For example, in the embedded ASR system with a user interface employing a PTT button, to search for a musical performer, the user has to push the PTT button and pronounce a contextual instruction, e.g., a code word such as “music,” to switch the ASR system into a music contextual state. After speaking the code word, the user can input a spoken instruction for the music retrieval. If the user inputs music-related spoken instructions while the system is in some other contextual state, the ASR system fails.
FIG. 1 shows a conventional embedded ASR system. After a PTT button 105 is pressed, the system expects speech input containing contextual instructions 110-112. After recognizing 120 the contextual instruction, the system transitions to an appropriate contextual state 130-132. Accordingly, after recognizing a subsequent speech input 133-135, the system activates an appropriate function 136-138.
However, complex tasks, such as music retrieval and destination entry, interfere with other user operations, e.g., driving a vehicle, especially when the durations of the tasks increase. Hence, it is often desired to reduce the number of steps needed to activate a function with speech input in the embedded ASR system.
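The two-step flow of FIG. 1 can be sketched as a small state machine. The sketch is illustrative only: the class name, the per-context word lists, and the returned status strings are hypothetical, and recognition is assumed to have already produced text.

```python
class ConventionalEmbeddedASR:
    # Hypothetical per-context vocabularies; only the context names
    # (music, navigation, contacts) come from the description.
    CONTEXTS = {
        "music": {"play", "artist"},
        "navigation": {"go", "route"},
        "contacts": {"call", "dial"},
    }

    def __init__(self):
        self.state = None  # no speech input is expected before the PTT press

    def press_ptt(self):
        # After the button press, a contextual instruction is expected.
        self.state = "awaiting_context"

    def hear(self, utterance):
        words = utterance.lower().split()
        if self.state is None or not words:
            return "ignored"
        if self.state == "awaiting_context":
            if words[0] in self.CONTEXTS:
                self.state = words[0]  # transition to the contextual state
                return f"entered {words[0]} context"
            return "failed: contextual instruction expected"
        if words[0] in self.CONTEXTS[self.state]:
            return f"{self.state} function: {' '.join(words)}"
        return f"failed: not valid in {self.state} context"
```

In this sketch, a music request spoken immediately after the PTT press fails, because the code word "music" must be recognized first; once the music contextual state is entered, a request that belongs to another context, such as "call alice", is rejected.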