1. Field of the Invention
The present invention generally relates to systems for processing electrical signals representing acoustic waveforms and, more particularly, to speech and speaker detection and recognition and other processing of signals containing human speech.
2. Description of the Prior Art
Many electronic devices require input from a user in order to convey to the device particular information required to determine or perform a desired function or, in a trivially simple case, when a desired function is to be performed as would be indicated by, for example, activation of an on/off switch. When multiple different inputs are possible, a keyboard comprising an array of two or more switches has been the input device of choice in recent years.
However, keyboards of any type have inherent disadvantages. Most evidently, keyboards include a plurality of distributed actuable areas, each generally including moving parts subject to wear and damage and which must be sized to be actuated by a portion of the body unless a stylus or other separate mechanical expedient is employed. Accordingly, in many types of devices, such as input panels for security systems and electronic calculators, the size of the device is often determined by the dimensions of the keypad rather than the electronic contents of the housing. Additionally, numerous keystrokes may be required (e.g. to specify an operation, enter a security code, etc.), which slows operation and increases the possibility that erroneous actuation may occur.
Perhaps more importantly, use of a keyboard inherently requires knowledge of particular keystrokes or combinations thereof which are associated with functions or data which must be input. For example, a combination of numbers for actuation of a lock for secured areas of a building or a vehicle requires the authorized user to remember the number sequence as well as to correctly actuate corresponding switches in sequence to control initiation of a desired function. Therefore, use of a keyboard or other manually manipulated input structure requires action which is not optimally natural or expeditious for the user. Further, for security systems in particular, the security resides in the limitation of knowledge of a keystroke sequence and not in the security system itself since the security system cannot identify the individual actuating the keys.
In an effort to provide a more naturally usable, convenient and rapid interface and to increase the capabilities thereof, numerous approaches to voice or sound detection and recognition systems have been proposed and implemented with some degree of success. However, many aspects of an acoustically communicated signal have defeated proper operation of such systems. For example, of numerous known speech analysis algorithms, none is uniformly functional for different voices, accents, formant variation and the like, and one algorithm may be markedly superior to another for a particular utterance (particularly when mixed with other background acoustic signals) for reasons which may not be readily apparent. Nevertheless, some empirical information has been gathered which can generally assign an algorithm to a particular signal which can then be expected to at least perform correctly, if not always optimally, for a particular utterance or segment thereof. Algorithm assignment becomes especially critical now that speech recognition systems are also used to transcribe remote (e.g. telephone) or recorded (e.g. broadcast news) speech signals.
Another aspect of acoustically communicated signals which affects both algorithm choice and successful performance is the fact that few speech signals, as a practical matter, are purely speech. Unless special provisions are made which are often economically prohibitive or incompatible with the required environment of the device (e.g. a work place, an automobile, etc.), background signals will invariably be included in an acoustically communicated signal.
Background may include the following non-exhaustive list of contributions: street noise, background speech, music, studio noise, static noise, mechanical noise, air circulation noise, electrical noise and/or any combination thereof. The acoustically communicated signal can also be distorted by the communication channel (e.g. telephone, microphone, etc.). Signal components respectively attributable to speech and various types of background are not easily separated using previously known techniques and no successful technique of reliably doing so under all conditions is known.
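By way of illustration only, the combination of speech and background described above may be modeled as a simple additive mixture in which a device receives only the sum of the two contributions. The following minimal sketch assumes sampled signals of equal length and a chosen speech-to-background ratio in decibels; the names rms and mix_at_snr are hypothetical and do not appear in any prior system discussed here.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def mix_at_snr(speech, background, snr_db):
    """Add background to speech so that the speech-to-background
    RMS ratio equals snr_db (in decibels).

    Assumption: both inputs are equal-length sequences of float samples.
    """
    if len(speech) != len(background):
        raise ValueError("speech and background must have equal length")
    background_level = rms(background)
    if background_level == 0.0:
        # Silent background: the observed signal is the speech alone.
        return list(speech)
    # Gain that places the background snr_db below the speech level.
    gain = rms(speech) / (background_level * 10 ** (snr_db / 20))
    return [s + gain * b for s, b in zip(speech, background)]

# Example: a 200 Hz tone standing in for speech and a 50 Hz tone
# standing in for background hum, both sampled at 8 kHz.
rate = 8000
speech = [math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]
hum = [math.sin(2 * math.pi * 50 * n / rate) for n in range(rate)]
observed = mix_at_snr(speech, hum, snr_db=10.0)
```

In this sketch only the sequence observed is available to a recognition system; the individual speech and background contributions are not, which is the separation difficulty noted above.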