There exist a number of fields in which it is desirable to modify the characteristics of a signal, particularly speech or other sound signals, in order to achieve a desired result. For example, in the coding of speech for transmission purposes, it is desirable to compress the speech to thereby reduce the amount of data that is to be transmitted. At the receiving end of the transmission, the compressed speech is expanded to reproduce the original sounds. The time scale modification of speech is also useful in the playback of recorded information. For example, a secretary who is transcribing recorded dictation may desire to speed up or slow down the playback rate, so that the words are reproduced at a rate that matches the typing speed. Of course, when the playback speed differs from the original recording speed, the pitch of the reproduced sound is altered, so that it does not sound natural. Consequently, it is desirable to modify the pitch of the recorded sound in conjunction with the time scale modification, so that the reproduction will sound more natural.
Another area in which the modification of sounds is useful is in sound-source separation. For example, when two people are speaking simultaneously, it is desirable to be able to separate the sounds from the two speakers and reproduce them individually. Similarly, when a person is speaking in a noisy environment, it is desirable to be able to separate the speaker's voice from the background noises.
In each of these areas, as well as others, the signal to be acted upon is first analyzed, to determine its component parts. Some of these component parts can then be modified, to produce a particular result, e.g. separation of the component parts into two groups to separate the voices of two speakers. Each group of component parts can then be separately resynthesized, to audibly reproduce the voices of the individual speakers or otherwise process them individually.
In the past, the analysis of sound, particularly speech, has typically been carried out with respect to the spectral content of the sound, i.e. its component frequencies. The various types of analysis which use this approach rely upon linear models of the human auditory system. In fact, however, the auditory system is nonlinear in nature. Of particular interest in this regard is the cochlea, i.e. that portion of the inner ear which transforms the pressure waves of a sound into electrical impulses, or neuron firings, that are transmitted to the brain. The cochlea essentially functions as a bank of filters, whose bandwidths change at different sound levels. Similarly, neurons change their sensitivity as they adapt to sound, and the inner hair cells produce nonlinear rectified versions of the sound. This ability of the ear to adapt to changes in sound makes it difficult to describe auditory perception in terms of linear concepts, such as the spectrum or Fourier transform of a sound.
Therefore, a different, and perhaps more useful, approach to the analysis of sound is from the standpoint of its temporal content. More particularly, an auditory signal has characteristic periodicity information that remains undisturbed by most nonlinear transformations. Even if the bandwidth, amplitude and phase characteristics of a signal are changing, its repetitive characteristics do not. Furthermore, sounds with the same periodicity typically come from the same source. Thus, the auditory system operates under the assumption that sound fragments with a consistent periodicity can be combined and assigned to a single source.
Along these lines, an analytical tool has been developed which provides a visual representation of the temporal content of a signal. This tool, which is called a correlogram, represents the signal as a three-dimensional function of time, frequency and periodicity. To generate a correlogram, a one-dimensional acoustic pressure signal is processed in a cochlear model. This model produces a two-dimensional map of neural firing rate as a function of time and distance along the basilar membrane of the cochlea. Then, by measuring the periodicities of the output signals from the cochlear model, a third dimension is added to produce the correlogram. The information contained in the correlogram can be used in a variety of ways. In addition to sound visualization, it can be used for pitch detection and modification, as well as sound separation. For further information regarding the correlogram and its applications, see Slaney et al., "On The Importance of Time--A Temporal Representation of Sound," published in Visual Representation of Speech Signals, edited by Martin Cooke, Steve Beet and Malcolm Crawford, 1993, John Wiley & Sons Ltd., the disclosure of which is incorporated herein by reference.
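By way of illustration only, the two stages described above can be sketched in Python. The sketch is not the cochlear model of the cited reference: the function names, the Gaussian band-pass filters, the `rel_bandwidth` parameter and the use of short-time autocorrelation as the periodicity measure are all assumptions chosen to keep the example short.

```python
import numpy as np

def crude_cochlear_model(signal, sample_rate, center_freqs, rel_bandwidth=0.25):
    """Hypothetical stand-in for a cochlear model.

    Applies FFT-domain Gaussian band-pass filters (one per channel) and
    half-wave rectifies each output. Returns shape (channels, samples).
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    channels = []
    for fc in center_freqs:
        # Assumed filter shape: Gaussian with width proportional to fc.
        gain = np.exp(-0.5 * ((freqs - fc) / (rel_bandwidth * fc)) ** 2)
        band = np.fft.irfft(spectrum * gain, n=len(signal))
        channels.append(np.maximum(band, 0.0))  # half-wave rectification
    return np.array(channels)

def correlogram(cochleagram, frame_len, hop, max_lag):
    """Short-time autocorrelation of every channel.

    Returns shape (frames, channels, max_lag): time, frequency channel
    and periodicity (lag in samples).
    """
    n_channels, n_samples = cochleagram.shape
    starts = range(0, n_samples - frame_len - max_lag + 1, hop)
    out = np.zeros((len(starts), n_channels, max_lag))
    for f, s in enumerate(starts):
        window = cochleagram[:, s:s + frame_len]
        for lag in range(max_lag):
            shifted = cochleagram[:, s + lag:s + lag + frame_len]
            out[f, :, lag] = np.sum(window * shifted, axis=1)
    return out
```

Summing the result over the channel axis gives a single function of lag per frame; for a periodic input, its largest peak away from zero lag would be expected at the period of the sound, which is one way such a representation can support pitch detection.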
Heretofore, there has been no known technique for resynthesizing the information in a correlogram into a waveform that can be used to produce an audible sound or be otherwise processed. Part of the difficulty lies in the fact that, as a result of the signal processing that takes place to produce the correlogram, information regarding the phase content of the original signal is suppressed. Thus it is not possible to simply reverse the signal processing in order to reproduce the original sound. Rather, additional steps must be carried out to recover the suppressed phase information. This problem is further exacerbated if the correlogram is modified prior to resynthesis, since the modification may result in the loss of additional information.
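The loss of phase can be illustrated with a short Python sketch, on the assumption that the periodicity measurement is an autocorrelation. An autocorrelation depends only on the power spectrum of its input, so two different waveforms with the same spectral magnitudes yield the same result. The function names below are hypothetical, and a circular autocorrelation is used for simplicity.

```python
import numpy as np

def circular_autocorrelation(x):
    """Circular autocorrelation computed from the power spectrum."""
    power = np.abs(np.fft.rfft(x)) ** 2  # phase is discarded here
    return np.fft.irfft(power, n=len(x))

def randomize_phase(x, seed=0):
    """Return a waveform with the same spectral magnitudes as x
    but with randomly chosen phases."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    phases[0] = 0.0           # the DC bin must stay real
    if len(x) % 2 == 0:
        phases[-1] = 0.0      # the Nyquist bin must stay real
    return np.fft.irfft(np.abs(spectrum) * np.exp(1j * phases), n=len(x))
```

Because `x` and `randomize_phase(x)` share one autocorrelation, the autocorrelation alone cannot determine which waveform produced it; any inversion has to supply the phase from some other constraint.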
Accordingly, it is the general objective of the present invention to provide a system and process for analyzing a signal, such as sound, with respect to its component features and reconstructing the signal from those features. Although not limited thereto, the present invention is particularly directed to a process which enables information in a correlogram to be inverted to produce a waveform that can be used to produce an audible sound or otherwise processed, for example in an automatic speech recognition system.