In 1964, a group of computer scientists marveled over the new computer being delivered to the computer center at the Courant Institute for Mathematical Sciences at New York University. The machine was the latest introduction from Control Data Corporation, a Model CDC 6600, whose speed and memory capacity far outstripped the 7.094 K random access memory capacity of the now humble IBM 7094 that it replaced. In a portent of things to come, they little suspected that IBM, within months, would obsolete the long-heralded CDC 6600 with its IBM 360, a machine which, incredibly, had an unheard-of 360 K of RAM, all built with discrete components, a conservative move in the face of concerns about the reliability of the then-new integrated circuit technology. This impressive machine came to be housed in a room about eighteen feet square, surrounded by ten or so air conditioners necessary to keep the system from overheating and failing. A half dozen tape decks, nearly a meter across and as tall as a man, and several key punch machines, the size of garment outer table serving machines, completed the installation.
Thirty-five years later, changes in technology have been remarkable. Tiny laptop computing devices fly at speeds a thousand times that of those early powerhouse computers and boast thousands of times the memory. Instead of huge reels of recording tape, hard disks with capacities on the order of eighteen GB are found in those same laptop computing devices. These devices, with their huge memory and computing capabilities, move as freely in the business world as people, under the arm, in a bag, or on the lap of a businessman flying across the ocean. No doubt, this technology lies at the foundations of the most remarkable, reliable and completely unanticipated bull market in the history of business.
Just as certainly, the future holds the promise of similar progress.
Notwithstanding the gargantuan magnitude of the progress made in computing during the last third of the 20th century, the world of computing has been largely self-contained. The vast majority of all computing tasks involved computers talking to other computers, or otherwise communicating through electrical input signals whose characteristics are almost completely determined. In this respect, computers are completely unlike the humans whom they serve, humans whose communications vary in almost infinite ways, regardless of the method of communication, be it voice, writing, or other means. If computing is to continue to make progress, computers must become integrated into usual human communications modalities.
And, indeed, this is already happening. From a slow start at becoming an important factor in the marketplace about a decade ago, speech recognition technology holds just such a promise. A human voice interface to a computer is probably the most natural, evolutionarily defined modality for the human-computer communications interface. While humans customarily write, gesture and, to a limited extent, use other communications modes, voice communication remains predominant. This is not surprising, insofar as speech has probably been evolving in human beings for many millions of years. This is believed to be the case because even relatively primitive forms of life have fairly highly developed “speech” characteristics. For example, much work has been done in the study of the use of sounds by whales to communicate various items of information. Likewise, scientists have identified and cataloged uniform global communications patterns in chimpanzees.
Because communication by speech is so natural, a direct result of its having evolved over such a large fraction of the history of the species, speech communications impose an extremely low level of cognitive overhead on the brain, thus providing a facile communications interface while allowing the brain to perform a number of other functions simultaneously. We see this in everyday life. For example, people engaged in sports activities routinely combine complex physical tasks, situational analysis, and exchange of information through speech, sometimes transmitting and receiving audible information simultaneously while doing all of these other tasks.
Clearly, the mind is well adapted to simultaneously control other tasks while communicating and receiving audible information in the form of speech. It is thus no surprise that virtually every culture on earth has devised its own highly sophisticated audible language.
In view of the above, it is easily understood why voice recognition technology has come to be the Holy Grail of computing. While useful work began to be done with this technology about ten years ago, users obtained only performance which left much to be desired. Individual quirks, regional pronunciations, speech defects and impediments, bad habits and the like pepper virtually everybody's speech to some extent. And this is no small matter. Good speech recognition requires not only good technology, it also requires recognizable speech.
Toward this end, speech recognition programs generally have an error correction dialog window which is used to train the system to the features of an individual user's voice, as will be more fully described below. The motivation behind such a technique is apparent when one considers and analyzes the schemes typically used in speech recognition systems.
Early on, speech recognition was proposed through the use of a series of bandpass filters. These proposals grew out of the use of spectrographic analysis for the purpose of speaker identification. More particularly, it was discovered that if one made a spectral print of a person saying a particular word, wherein the x-axis represented time and y-axis represented frequency, with the intensity of sound at the various frequencies being displayed in shades of gray or in black and white, the pattern made by almost every speaker was unique, largely as a function of physiology, and speakers could be identified by their spectrographic “prints.” Interestingly enough, however, the very diversity which this technique showed suggested to persons working in the field the likelihood that commonalities, as opposed to differences, could be used to identify words regardless of speaker. Hence the proposal for a series of bandpass filters to generate spectrographs for the purpose of speech recognition.
While such an approach was logical given the state of technology in the 1960s, the problems were also apparent. Obtaining a high quality factor, or “Q”, in electrical filters comprising inductors and capacitors is extremely difficult at audio frequencies.
This is due to a number of factors. First of all, obtaining resonance at these frequencies necessitates the use of large capacitors and inductors. Large capacitors exhibit substantial resistive leakage. Large values of inductance, in turn, require long lengths of wire for the windings and, accordingly, entail high resistance. The result is that the selectivity of the filters is extremely poor and the ability to separate different passbands is compromised. Finally, the approach was almost fatally flawed, from a mass-market standpoint, by the fact that these tuned electrical circuits were very large and mechanically cumbersome, as well as very expensive.
However, in the late 1960's, electrical engineers began to model the action of electrical circuits in the digital domain. This work was done by determining, using classical analytic techniques, the mathematical characteristics of the electrical circuit, and then solving these equations for various electrical inputs. In the 1970's, it was well understood that the emerging digital technology was going to be powerful enough to perform a wide variety of computing tasks previously defaulted to the analog world. Thus, it was inevitable that the original approaches to voice recognition through the concept of using banks of tuned circuits would eventually come to be executed in the digital domain.
In a typical speech recognition system, an acoustic signal received by a microphone is input into a voice board which digitizes the signal. The computer then generates a spectrogram which, for a series of discrete time intervals, records those frequency ranges at which sound exists and the intensity of sound in each of those frequency ranges. The spectrogram, referred to in the art as a token, is thus a series of spectrographic displays, one for each of a plurality of time intervals which together form an audible sound to be recognized. Each spectrographic display shows the distribution of energy as a function of frequency during the time interval. Sampling rates of 6,000 to 16,000 samples per second are typical, and are used to generate about fifty spectrum intervals per second for an audible sound to be recognized.
In a typical system, quantitative spectral analysis is done for seven frequency ranges, resulting in eight spectral parameters for each fiftieth of a second, or spectral sample period. While the idea that a spectral analysis over time can be a reliable recognition strategy may be counterintuitive given the human perspective of listening to envelope, tonal variation and inflection, an objective view of the strategy shows that exactly this information is laid out in an easy-to-process spectral analysis matrix.
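By way of illustration only, the framing and band analysis described above may be sketched in Python as follows. The sketch assumes an input sampled at 8,000 samples per second, seven equal-width frequency ranges, and that the eighth spectral parameter is the overall energy of the frame; none of these choices is specified above. The names SAMPLE_RATE, frame_parameters and make_token are hypothetical. A naive discrete Fourier transform is used for clarity rather than speed.

```python
import cmath
import math

SAMPLE_RATE = 8000        # assumed; within the 6,000 to 16,000 range
FRAMES_PER_SECOND = 50    # one spectral sample period = 1/50 second
NUM_BANDS = 7             # frequency ranges analyzed per frame


def frame_parameters(frame):
    """Return [overall_energy, band_1, ..., band_7] for one frame.

    Assumption: the eighth parameter is the overall energy, and the
    seven ranges are of equal width up to half the sampling rate.
    """
    n = len(frame)
    half = n // 2
    # Naive DFT power for bins 1..half (bin 0, the DC term, is skipped).
    power = []
    for k in range(1, half + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        power.append(abs(acc) ** 2 / n)
    # Accumulate bin power into the seven frequency ranges.
    bands = [0.0] * NUM_BANDS
    for idx, p in enumerate(power):
        bands[idx * NUM_BANDS // len(power)] += p
    return [sum(bands)] + bands


def make_token(samples):
    """Split samples into 1/50 s frames; return one parameter row per frame."""
    step = SAMPLE_RATE // FRAMES_PER_SECOND
    return [frame_parameters(samples[s:s + step])
            for s in range(0, len(samples) - step + 1, step)]


if __name__ == "__main__":
    # A 0.1 second, 500 Hz test tone yields five frames of eight parameters.
    tone = [math.sin(2 * math.pi * 500 * i / SAMPLE_RATE) for i in range(800)]
    for row in make_token(tone):
        print([round(v, 3) for v in row])
```

For such a 500 Hz tone, essentially all of the energy falls in the lowest of the seven ranges, so each row of the resulting matrix is dominated by its first band value.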
Based on the theoretical underpinnings of the above recognition strategy, development of a speech recognition system involves the input of vocabulary into the hard drive of a computer in the form of the above described spectral analysis matrix, with one or more spectral analysis matrices for each word in the vocabulary of the system. These matrices then serve as word models.
More advanced systems, such as those handling so-called “natural” speech (that is, continuous strings of words), must contend with the natural tendency of speakers to blend, on occasion, the end of one word into the beginning of another and, less frequently, to separate words into two parts, sometimes with association of the parts with different words. In such systems, models are also developed for these artifacts of the language to be recognized (hereinafter included within the term “phrase”).
Once broken down into a spectral picture over time of frequency energy distributions, recognition of speech is reduced to comparison of known spectral pictures for particular sounds to the sound to be recognized, and achieving recognition through the determination of that model which best matches the unknown speech sound to be recognized. But this picture, while in principle correct, is an unrealistic simplification of the problem of speech recognition.
After a database of word models has been input into the system, comparison of an audible sound to the models in the database can be used as a reliable means for speech recognition. However, there are many differences in the speech patterns of users. For example, different speakers speak at different rates. Thus, for one speaker, a word may take a certain period of time, while for another speaker, the same word may take a longer period of time. Moreover, different speakers have voices of different pitch. In addition, speakers may give different inflections, emphasis, duration and so forth to the syllables of a word, depending on the speaker. Even a single speaker will speak in different ways on different occasions.
Accordingly, effective speech recognition requires normalization of spoken sounds to word and phrase models in the database. In other words, the encoded received sound or token must be normalized to have a duration equal to that of the model. This technology is referred to as time aligning, and results in stretching out or compressing the spoken sound or word to fit it against the model of the word or sound with the objective of achieving the best match between the model and the sound input into the system. Of course, it is possible to leave the sound unchanged and stretch or compress the model.
In accordance with existing technology, each of the spectral sample periods for the sound to be recognized is compared against the corresponding spectral sample period of the model which is being rated. The cumulative score for all of the sample periods in the sound against the model is a quality rating for the match. The quality ratings for all the proposed matches are then compared and the proposed match having the highest quality rating is output to the system, usually in the form of a computer display of the word or phrase.
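A minimal sketch of such time aligning is given below, assuming the simplest possible scheme: the token is stretched or compressed uniformly by linear interpolation between neighboring frames until it has as many frames as the model. This uniform scheme is an assumption made for illustration; practical systems commonly use non-uniform alignment by dynamic programming, which is not shown. The function name time_align is hypothetical.

```python
def time_align(token, target_len):
    """Linearly stretch or compress a token to target_len frames.

    token is a list of frames, each frame a list of spectral parameters.
    Assumption: uniform (linear) warping, chosen only for illustration.
    """
    if not token or target_len <= 0:
        raise ValueError("token must be non-empty and target_len positive")
    if target_len == 1 or len(token) == 1:
        # Nothing to interpolate between; repeat the first frame.
        return [list(token[0]) for _ in range(target_len)]
    scale = (len(token) - 1) / (target_len - 1)
    aligned = []
    for j in range(target_len):
        pos = j * scale                    # fractional position in the token
        lo = int(pos)
        hi = min(lo + 1, len(token) - 1)
        frac = pos - lo
        aligned.append([(1 - frac) * a + frac * b
                        for a, b in zip(token[lo], token[hi])])
    return aligned


if __name__ == "__main__":
    short = [[0.0], [2.0], [4.0]]          # three frames, one parameter each
    print(time_align(short, 5))            # stretched to five frames
```

The same function serves to stretch or compress the model instead of the token, since it makes no distinction between the two.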
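The rating and selection just described can be sketched as follows. The sketch assumes that the token and every model have already been time aligned to the same number of frames, and it uses the negative sum of squared differences as the per-period score so that a higher cumulative value means a better match. That particular scoring function is an assumption, as are the names quality and recognize.

```python
def quality(token, model):
    """Cumulative score over all sample periods; higher is a closer match.

    Assumption: the per-period score is the negative squared distance
    between corresponding spectral parameters.
    """
    if len(token) != len(model):
        raise ValueError("token and model must be time aligned to equal length")
    total = 0.0
    for t_frame, m_frame in zip(token, model):
        total -= sum((t - m) ** 2 for t, m in zip(t_frame, m_frame))
    return total


def recognize(token, models):
    """Rate the token against every model and return (label, rating) of the best."""
    ratings = {label: quality(token, model) for label, model in models.items()}
    best = max(ratings, key=ratings.get)
    return best, ratings[best]


if __name__ == "__main__":
    # Two toy word models, each two sample periods of two parameters.
    models = {
        "yes": [[1.0, 0.0], [1.0, 0.0]],
        "no": [[0.0, 1.0], [0.0, 1.0]],
    }
    spoken = [[0.9, 0.1], [0.8, 0.2]]
    print(recognize(spoken, models))
```

In this toy vocabulary the spoken token lies much closer to the “yes” model than to the “no” model, so “yes” receives the higher quality rating and is the word output.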
However, even this relatively complex system fails to achieve adequate quality in the recognition of human speech. Accordingly, most commercial systems do a contextual analysis and also require or strongly recommend a period of additional training, during which the above matching functions are performed with respect to a preselected text. During this process, the model is amended to take into account the individual characteristics of the person training the system. Finally, during use, an error correction dialog box is used when the user detects an error, inputs this information into the system and thus causes the word model to become adapted to the user's speech. This supplemental training of the system may also be enhanced by inviting the user, during the error correction dialog, to speak the word, as well as other words that may be confused with the word by the system, into the system to further train the recognition engine and database.
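One simple way in which a word model might be adapted to a user's speech is sketched below. The update rule, which moves each model parameter a fixed fraction of the way toward the corresponding parameter of the user's time-aligned token, is purely an assumption made for illustration; the text above does not specify how commercial systems amend their models. The name adapt_model and the parameter rate are hypothetical.

```python
def adapt_model(model, user_token, rate=0.2):
    """Return a new model moved a fraction `rate` toward the user's token.

    Assumption: model and user_token are already time aligned, and a
    simple weighted average is an acceptable stand-in for real training.
    """
    if len(model) != len(user_token):
        raise ValueError("model and token must have the same number of frames")
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie between 0 and 1")
    return [[(1 - rate) * m + rate * u for m, u in zip(m_frame, u_frame)]
            for m_frame, u_frame in zip(model, user_token)]


if __name__ == "__main__":
    model = [[0.0, 4.0]]
    correction = [[2.0, 0.0]]              # token spoken during error correction
    print(adapt_model(model, correction, rate=0.5))
```

Applying the function each time the user corrects a word causes the stored model to drift gradually toward that user's pronunciation, with the rate controlling how quickly the global model is overridden.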
As is apparent from the above discussion, the development of speech recognition systems has centered on assembling a database of sound models likely to have a high degree of correlation to the speech to be recognized by the speech recognition engine. Such assembly of the database takes two forms. The first is the input of global information using one or more speakers to develop a global database. The second method in which the database is assembled is the training of the database to a particular user's speech, typically done both during a training session with preselected text, and on an ad hoc basis through use of the error correction dialog window in the speech recognition program.
In over fifty years of work, Arthur Lessac has developed a complete voice system reflecting, for the first time, the basic relationship between music and speech. His discovery and development were done empirically but were related to much formal academic work. From early on, his work linked an understanding of music and singing with voice theory, and it rests on his decision to make a radical departure from traditional methods of studying and teaching voice. Very early in his speech work, Lessac decided that teaching or learning by imitating others was insufficient and damaging. He determined to develop a system of learning based upon sensation and feeling and kinesthetic feedback principles. This required extensive practical and formal study of the natural functioning of the body and the voice.
During almost this same fifty-year period, music historians began to go beyond studies of the history of western classical music. Inter-cultural studies linked western, eastern, African and other music. Related anthropological, archeological, historic and music work began to provide some insight into the origins of speech and music. Since these origins were before the time of recorded history, little progress was made until a number of studies of still-existing primitive tribes. No one has, as yet, described the whole relationship between music and speech as has Lessac. However, there are indications that recent studies would support his main thesis.
Today no complete vocal system compares to the Lessac system. A voice system must deal with two functional aspects and one operational aspect of speech. Functionally, speech consists of vowels and consonants. Operationally, there is the linking together within a word, sentence, paragraph or speech of the different sounds where different emphasis can vary meaning. The connection between vowel sounds and music has long been recognized, though never in a phonetic system. However, the corresponding connection between the functional characteristics of consonants and musical instruments, and the relationship between speech and a musical score, have never before been developed.
Voice and speech theory and training today depend heavily upon the International Phonetic Alphabet (IPA). The IPA was created a century ago by a committee of Western European scholars. The IPA is fine for mapping sound. It does remove orthographic traps, and it provides the student with a guide to specific vowel and consonant sounds in other languages that are missing in his own, although even in this context it does little more than any other alphabet when the spelling of a given language (Spanish, for example) is simplified. But it is a weak and artificial tool for teaching live people how they should sound. It is cumbersome, complicated, and outdated. It encourages a non-creative approach that is acoustic, imitative and mechanical. And it includes too many vocal traps.
A symbol from the IPA system maps all of the possible sounds of the language, separating out deviations due to regional genesis which do not discriminate meaning within the culture. This symbol must then be learned or memorized in conjunction with a sound (thus, involving the ear) in order to be understood and then spoken.
And, the IPA does not deal at all with the operational linking together of sounds within words, phrases and larger units of speech. It is not a vocal system—merely an attempt at some definition of comparative sounds.
Functionally, Lessac vowels are “numeric and kinesthetic”, and Lessac consonants are “imagistic, kinesthetic and potentially numeric” in contrast to the purely symbolic nature of the IPA vowel and consonant phonetics.
Operationally, Lessac's methods of “exploration” and the elimination of any basic difference between singing and speaking utilize the basic musical qualities in all uses of the voice.
At the same time, the Lessac voice system includes and adapts important elements from previous systems of acoustic phonetics, articulatory phonetics, auditory phonetics and physiological phonetics.
In the Lessac Vowel System, the numbers directly connect to a structure and kinesthetic feel which, when replicated, creates the desired sound without necessitating control by the ear, and thus avoids the conditioned pitfalls of a poor vocal environment. Based on a direct transfer from numeric description to action, this method of learning leaves no room for intervening influences to dilute or interfere with the process. In addition, the vowel is fed by a vibratory and resonance feel that aids in enforcing the phonetic value and provides a significant qualitative component to what in other systems remains, by and large, a quantitative dimension.
In this way, the Lessac vowel system eliminates the IPA concept of front, middle and back vowels, or high and low vowels; it discourages the mechanistic handling of glottal, pharyngeal, velar, palatal, retroflex, dental, labial manipulations; it avoids reliance upon the ear for essential control.
The Lessac Consonant System (described at pages 129-179 of Arthur Lessac's THE USE AND TRAINING OF THE HUMAN VOICE, Drama Book Publishers 1967) relates consonants to musical instruments. Each of the consonants reflects a musical instrument and involves both the sound and the image of the playing of the instrument: images of touch, rhythm, melody, size and subtlety.
To understand the instrument means to understand not only the sound itself but also the kinesthetic feel of the way the instrument is played and the different uses to which it can be put. It is an aesthetic construction and functions as a physical image.
In conventional voice and speech training, even when the habit is more or less automatic, the sight of a “T” or a “V” will prepare the tongue and gum-ridge or the lips to execute the action to produce the desired explosive or fricative sound, but the sound that comes out is often unanticipated, irregular, defective and undetected by the ear. The impression often is that there must be at least a half dozen ways of making the sound.
In the Lessac Consonant System, the picture of a snare drum with a “T” written on the picture will, after one has been taught the aesthetics of a drum beat, bypass and cut through the complexities of tongue manipulation, the memories of imitation, the listening by ear, etc. The student will not only make a perfect “T” sound but will thereby also know how to feel the drumbeats of the “K”, “P”, “D”, “B”, and “G” without any additional training. What is more, once the concept is clear, one can ask a deaf person, or a foreigner, whether Chinese or French, to feel an “R” (trombone), a “V” (cello), an “S” (sound effect), or a “CH” (cymbal). The result has been shown to be clear and perfect every time without ear judgment, mental confusion, physical or physiological gymnastics, and unaffected by any previous cultural or sectional influences that might work against this articulation.
Traditionally, the study of voice and speech is divided into different disciplines—voice for singing, voice for speech, diction, public speaking, therapy, etc. However, fundamental Lessac concepts serve all disciplines. All voice and speech is basically musical with the difference between speaking and singing being a relative one.
Traditionally, consonants have been thought of as “articulated” sounds—primarily important for intelligibility. The Lessac instrumental approach to consonants suggests a reversal of the relative position of vowels and consonants in singing and speaking. In singing, the vowels make the principal artistic contribution; the consonants a utilitarian one. But, in general speech, the consonants carry most of the melody and rhythm, while the vowels serve primarily for emphasis.
As the student comes to understand that the voice and speech with its vowels and consonants have a symphonic quality and range, and that one can “play” the voice in a musical and instrumental way, one comes to use another, total image in speaking, namely, the image of an orchestra playing a piece of music.
In teaching through an organized and related group of images, the Lessac approach directs focus to the exploration at hand and perhaps obviates most of the inhibitory and competing response patterns that a normal learning situation implicitly contains. It is sometimes difficult to communicate, but when communicated, it contains a tremendous amount of information in a “chunked” and, therefore, memorized state. Through a special kind of learning, images chunk information.
Many people on first understanding the Lessac voice theory assume that his use of musical instruments to teach consonants and his overall musical approach is simply a useful teacher's analogy or, if they disagree with it, a “trick” of some kind. However, studies of the origins of music suggest that the relationship between music and speech and, within that, between consonants and musical instruments is a fundamental one. In all cultures, the development of specific instruments and vocal sounds appears to have been an interrelated process. Certain instruments were built to mirror the image or sound of the vocal instrument, and certain vocal sounds were made to mirror pleasing instrumental images or sounds, such as basic percussive sounds, the twang of a bow string or the tone of an early horn.
The Lessac consonant system applied to English reflects the instruments of Western European culture and its symphony orchestra. Though the basic instruments are the same in all cultures (the drum, the horn, the stringed instrument, etc.), specific variations reflecting specific different consonant sounds remain to be defined as the Lessac voice system is applied to languages in other cultural settings.