This invention relates generally to voice communication systems, and more specifically to a compressed voice digital communication system using a very low bit rate speech vocoder for voice messaging.
Communications systems, such as paging systems, have had to compromise on the length of messages, the number of users, and convenience to the user in order to operate profitably. The number of users and the length of the messages have been limited to avoid overcrowding of the channel and to avoid long transmission delays. The user's convenience has thereby been directly affected by the channel capacity, the number of users on the channel, system features, and the type of messaging. In a paging system, tone-only pagers that simply alerted the user to call a predetermined telephone number offered the highest channel capacity but were somewhat inconvenient to the users. Conventional analog voice pagers allowed the user to receive a more detailed message, but severely limited the number of users on a given channel. Analog voice pagers, being real-time devices, also had the disadvantage of not providing the user with a way of storing and repeating the message received. The introduction of digital pagers with numeric and alphanumeric displays and memories overcame many of the problems associated with the older pagers. These digital pagers improved the message handling capacity of the paging channel, and provided the user with a way of storing messages for later review.
Although the digital pagers with numeric and alphanumeric displays offered many advantages, some users still preferred pagers with voice announcements. In an attempt to provide this service over a limited capacity digital channel, various digital voice compression techniques and synthesis techniques have been tried, each with its own level of success and limitation. Voice compression methods based on vocoder techniques currently offer a highly promising approach to voice compression. Of the low data rate vocoders, the multi-band excitation (MBE) vocoder is among the most natural sounding.
The vocoder analyzes short segments of speech, called speech frames, and characterizes the speech in terms of several parameters that are digitized and encoded for transmission. The speech characteristics that are typically analyzed include voicing characteristics, pitch, frame energy, and spectral characteristics. Vocoder synthesizers use these parameters to reconstruct the original speech by mimicking the human voice mechanism. Vocoder synthesizers model the human voice as an excitation source, controlled by the pitch and frame energy parameters, followed by a spectrum shaping stage controlled by the spectral parameters.
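The framing step and the per-frame parameter set described above can be sketched as follows. This is only an illustrative layout: the names `FrameParameters` and `split_frames`, the field types, and the choice of non-overlapping frames are assumptions, not details taken from the text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FrameParameters:
    """Hypothetical container for the per-frame parameters named in the text."""
    voiced: bool           # voicing characteristic
    pitch_hz: float        # fundamental frequency of the voiced portion
    energy: float          # frame energy (loudness)
    spectrum: List[float]  # spectral characteristics


def split_frames(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Cut a sample sequence into consecutive, non-overlapping frames.

    Assumption: a trailing partial frame is dropped.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    count = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len]) for i in range(count)]
```

An analyzer would fill one `FrameParameters` record per frame returned by `split_frames`, and a synthesizer would consume those records in order.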
The voicing characteristic identifies the repetitiveness of the speech waveform within a frame. Speech consists of periods where the speech waveform has a repetitive nature and periods where no repetitive characteristics can be detected. The periods where the waveform has a periodic repetitive characteristic are said to be voiced. Periods where the waveform seems to have a totally random characteristic are said to be unvoiced. The voiced/unvoiced characteristics are used by the vocoder speech synthesizer to determine the type of excitation signal which will be used to reproduce that segment of speech. Due to the complexity and irregularities of human speech production, no single parameter can determine in a fully reliable manner when a speech frame is voiced or unvoiced.
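As an illustration of a single voicing indicator, the sketch below computes the zero-crossing rate of a frame, a widely used measure that tends to be low for periodic (voiced) frames and high for noise-like (unvoiced) frames. It is not named in the text and is shown only as an example of the kind of parameter that, on its own, cannot be fully reliable; the function name and the test signals are assumptions.

```python
import math
import random


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        raise ValueError("frame needs at least two samples")
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)


# Illustrative frames at an assumed 8000 Hz sample rate.
tone = [math.sin(2 * math.pi * 100 * n / 8000 + 0.1) for n in range(400)]
rng = random.Random(0)  # fixed seed so the example is repeatable
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]
```

Here `zero_crossing_rate(tone)` is far smaller than `zero_crossing_rate(noise)`, but real speech frames fall between these extremes, which is why a practical decision combines several measures.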
Pitch is the fundamental frequency of the repetitive portion of the voiced waveform. Pitch is typically measured in terms of the time period of the repetitive segments of the voiced portion of the speech waveform. The speech waveform is highly complex and very rich in harmonics, which makes it very difficult to extract pitch information. Changes in pitch frequency must be smoothly tracked for an MBE vocoder synthesizer to smoothly reconstruct the original speech. Most vocoders employ a time-domain auto-correlation function to perform pitch detection and tracking. Auto-correlation is a very computationally intensive and time-consuming process. It has also been observed that conventional auto-correlation methods are unreliable when used with speech derived from a telephone network. The frequency response of the telephone network (300 Hz to 3400 Hz) deeply attenuates the low frequencies of a speech signal that has a low pitch frequency (the range of the fundamental pitch frequency of the human voice is 50 Hz to 400 Hz). Because of the deep attenuation of the fundamental frequency, pitch trackers can erroneously identify the second or third harmonic as the fundamental frequency. The human auditory process is very sensitive to changes in pitch, and the perceived quality of the reconstructed speech is strongly affected by the accuracy of the derived pitch, so when a pitch tracker erroneously identifies the second or third harmonic as the fundamental frequency, the synthesized signal can be misunderstood.
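A minimal sketch of the conventional time-domain auto-correlation pitch detector mentioned above is given here for orientation. The function names, the unnormalized correlation, and the single-frame search are assumptions; the 50 Hz to 400 Hz search range is the pitch range stated in the text. The nested summation also shows why the method is computationally costly: every candidate lag requires a pass over the frame.

```python
import math


def autocorrelation(frame, lag):
    """Unnormalized auto-correlation of one frame at a given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def estimate_pitch_hz(frame, sample_rate, min_hz=50.0, max_hz=400.0):
    """Return the pitch whose period (lag) maximizes the auto-correlation."""
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    return sample_rate / best_lag


# Illustrative voiced frame: 100 Hz fundamental plus two harmonics,
# sampled at an assumed 8000 Hz.
frame = [math.sin(2 * math.pi * 100 * n / 8000)
         + 0.5 * math.sin(2 * math.pi * 200 * n / 8000)
         + 0.25 * math.sin(2 * math.pi * 300 * n / 8000)
         for n in range(480)]
```

For this clean synthetic frame, `estimate_pitch_hz(frame, 8000)` recovers the 100 Hz fundamental. The sketch does not model band-limited telephone speech, noise, or pitch variation within a frame.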
Frame energy is a measure of the normalized average RMS power of the speech frame. This parameter defines the loudness of the speech during the speech frame.
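The measure can be sketched as a plain root-mean-square computation over the samples of one frame. The text does not specify how the value is normalized, so no further normalization is applied here; the function name `frame_rms` is an assumption.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one speech frame."""
    if not frame:
        raise ValueError("frame must not be empty")
    return math.sqrt(sum(x * x for x in frame) / len(frame))


# One full period of a sine wave with amplitude 2 (illustrative input).
sine = [2.0 * math.sin(2 * math.pi * n / 80) for n in range(80)]
```

For a sine wave measured over whole periods the RMS level is the amplitude divided by the square root of two, so `frame_rms(sine)` is close to 1.414.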
The spectral characteristics define the relative amplitude of the harmonics and the fundamental pitch frequency during the voiced portions of speech and the relative spectral shape of the noise-like unvoiced speech segments. The data transmitted defines the spectral characteristics of the reconstructed speech signal. Non-optimum spectral shaping results in poor reconstruction of the voice by an MBE vocoder synthesizer and poor noise suppression.
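One way to picture the relative harmonic amplitudes of a voiced frame is to evaluate a discrete Fourier sum at each multiple of the pitch frequency, as in the sketch below. This is an assumed illustration, not the analysis method of any particular vocoder: the function names are hypothetical, the pitch is taken as already known, and the frame is assumed to span a whole number of pitch periods.

```python
import cmath
import math


def harmonic_amplitudes(frame, sample_rate, pitch_hz, count):
    """Amplitude of the first `count` harmonics of pitch_hz in the frame."""
    n_samples = len(frame)
    amps = []
    for k in range(1, count + 1):
        freq = k * pitch_hz
        # Correlate the frame with a complex exponential at the harmonic.
        acc = sum(x * cmath.exp(-2j * math.pi * freq * n / sample_rate)
                  for n, x in enumerate(frame))
        amps.append(2.0 * abs(acc) / n_samples)
    return amps


def relative_to_fundamental(amps):
    """Scale amplitudes so that the fundamental has amplitude 1."""
    return [a / amps[0] for a in amps]


# Illustrative frame: harmonics at 100, 200, 300 Hz with amplitudes
# 1, 0.5, 0.25, sampled at an assumed 8000 Hz (five full pitch periods).
frame = [math.sin(2 * math.pi * 100 * n / 8000)
         + 0.5 * math.sin(2 * math.pi * 200 * n / 8000)
         + 0.25 * math.sin(2 * math.pi * 300 * n / 8000)
         for n in range(400)]
```

Calling `relative_to_fundamental(harmonic_amplitudes(frame, 8000, 100.0, 3))` yields values close to 1, 0.5 and 0.25. An error in the assumed pitch would place the evaluation frequencies between the true harmonics and distort the result.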
The human voice, during a voiced period, has portions of the spectrum that are voiced and portions that are unvoiced. MBE vocoders produce natural sounding voice because the excitation source, during a voiced period, is a mixture of voiced and unvoiced frequency bands. The speech spectrum is divided into a number of frequency bands, and a determination is made for each band as to whether that band is voiced or unvoiced. The MBE speech synthesizer generates an additional set of data to control the excitation of the voiced speech frames. In conventional MBE vocoders, the band voiced/unvoiced decision metric is pitch dependent and computationally intensive. Errors in pitch lead to errors in the band voiced/unvoiced decision, which in turn degrade the synthesized speech quality. Transmission of the band voiced/unvoiced data also substantially increases the quantity of data that must be transmitted.
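The interaction between pitch and the per-band decisions can be pictured with the sketch below, which assigns each pitch harmonic the voiced/unvoiced flag of the band it falls in. Equal-width bands, the function name `harmonic_voicing`, and the default 4000 Hz bandwidth are simplifying assumptions made for illustration; they are not taken from the text.

```python
def harmonic_voicing(pitch_hz, band_flags, bandwidth_hz=4000.0):
    """Pair each harmonic below bandwidth_hz with its band's voiced flag.

    band_flags holds one boolean per equal-width band (True = voiced).
    """
    band_width = bandwidth_hz / len(band_flags)
    result = []
    k = 1
    while k * pitch_hz < bandwidth_hz:
        freq = k * pitch_hz
        band = int(freq // band_width)  # index of the band holding freq
        result.append((freq, band_flags[band]))
        k += 1
    return result
```

With `harmonic_voicing(100.0, [True, False], bandwidth_hz=400.0)`, the 100 Hz harmonic is marked voiced while the 200 Hz and 300 Hz harmonics are marked unvoiced. If the pitch were wrongly taken as 200 Hz, the harmonic at 100 Hz would be missing altogether, which illustrates how a pitch error propagates into the excitation.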
Conventional MBE synthesizers require information on the phase relationship of the harmonics of the pitch signal to accurately reproduce speech. Transmission of phase information further increases the data required to be transmitted.
Conventional MBE synthesizers can generate natural sounding speech at a data rate of 2400 to 6400 bits per second. MBE synthesizers are being used in a number of commercial mobile communications systems, such as the INMARSAT (International Maritime Satellite Organization) system and the ASTRO™ portable transceiver manufactured by Motorola Inc. of Schaumburg, Ill. The standard MBE vocoder compression methods, currently used very successfully by two-way radios, fail to provide the degree of compression required for use on a paging channel. Voice messages that are digitally encoded using the current state of the art would monopolize such a large portion of the paging channel capacity that they may render the system commercially unsuccessful.
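The channel-loading concern can be made concrete with simple arithmetic, sketched below. Only the 2400 and 6400 bits per second vocoder rates come from the text; the message length and the channel rate used in the example are hypothetical values chosen for illustration, and protocol overhead is ignored.

```python
def message_bits(vocoder_bps, speech_seconds):
    """Number of bits needed to carry a voice message at a given vocoder rate."""
    return vocoder_bps * speech_seconds


def airtime_seconds(vocoder_bps, speech_seconds, channel_bps):
    """Channel time occupied by the message, ignoring protocol overhead."""
    return message_bits(vocoder_bps, speech_seconds) / channel_bps
```

Under these assumptions, a 10 second message encoded at 2400 bits per second needs 24000 bits and would occupy 3.75 seconds of a hypothetical 6400 bits per second channel; at 6400 bits per second the same message would occupy the channel for its full 10 seconds.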
Accordingly, what is needed for optimal utilization of a channel in a communication system, such as a paging channel in a paging system or a data channel in a non-real-time one-way or two-way data communications system, is an apparatus that simply and accurately determines the voiced and unvoiced portions of speech, accurately determines and tracks the fundamental pitch frequency when the frequency spectrum of the fundamental pitch components is severely attenuated, and significantly reduces the amount of data necessary for the transmission of the voiced/unvoiced band information. What is also needed is a method or apparatus that digitally encodes voice messages in such a way that the resulting data is very highly compressed while maintaining acceptable speech quality, and can be mixed with the normal data sent over the communication channel.