A Mellin-Transform Information Extractor for Vibration Sources.
This application is based on the inventors' work, “A Mathematical Framework for Auditory Processing: A Mellin Transform of a Stabilized Wavelet Transform?” (Irino et al., ATR Technical Report, Jan. 29, 1999), the description of which is incorporated herein by reference in its entirety.
1. Field of the Invention
The present invention relates to an improvement in time-series data analysis, which has conventionally been conducted using the Fourier transform or a statistical approach such as an autoregressive model. The present invention is applicable to, for example, tone recognition, speaker recognition by voice, speech recognition and analysis of architectural acoustics, as well as to signal analysis, encoding, signal separation and signal enhancement of voice or music. Besides acoustic signals and the like, the present invention is widely applicable to analysis of mechanical vibration such as mechanical sounds and seismic waves, analysis of biological signals such as brain waves, cardiac pulse sounds, ultrasonic echoes and nerve cell signals, and analysis of signals from sensors for collecting general time-series data.
2. Description of the Background Art
Conventionally, the fundamental step in information processing was to find the spectrogram, that is, a “time-frequency representation” of the signal. What is obtained using a fast digital transform (for example, a fast Fourier transform) or using linear predictive analysis is always a vector which directly corresponds to a spectrum, that is, a frequency representation at a certain time point, and a time sequence of such vectors constitutes a representation corresponding to a spectrogram. Such a representation derives from the spectral representation of signals originating from the Fourier transform. The sound spectrogram is the most popular representation for features of a voice signal, for example. The sound spectrogram is a visual representation of the change over time in the voice spectrum, using a density representation, level contour representation or color representation for easier understanding.
Because this spectral representation is a better representation for signal features than the waveform, because the human auditory system is not very sensitive to relative phase relationships between signals consisting solely of a plurality of sine waves and because a method of efficient calculation of such relations has been established, the spectral representation was thought to be the most suitable for information processing of voice and the like, and therefore the spectral representation has come to be widely used.
Conventionally, the performance of various signal processing systems has been improved to the extreme by applying the spectral representation described above to almost anything. It seems, however, that the improvement in performance by this approach has almost reached the limit. In the field of speech recognition apparatus, for example, it is generally necessary to train the system on a number of human speakers in advance. However, even speech recognition apparatus which has already gone through the learning process with a large number of adult male and female speakers would not recognize the voice of a child. The basic reason for this is that vocal tracts, vocal cords and the like of an adult and a child are different in physical size, and therefore spectral structures and the pitch of the speech are different, and as a result, feature vectors extracted from the respective speakers are different.
As a solution to this problem, the speech recognition apparatus may be trained with the speech of a large number of children, or speech recognition apparatus designed especially for children may be prepared together with the apparatus for discriminating an adult from a child. At present, however, large scale data bases of children's speech are not available, and hence such speech recognition apparatus for children only cannot readily be constructed. Further, even if such a large scale data base of children's speech is built up taking much time and labor, the above described solution could not be very efficient.
In order to solve this problem, a representation capable of automatically normalizing the physical size of the vocal tract or vocal cords is indispensable, and such normalization is difficult with a spectrogram. Although only speech recognition has been described as an example, there are many situations which require acoustic feature extraction that is invariant regardless of the physical size of the sound source, for example, analysis of sounds from musical instruments and analysis of combustion engine sound. A solution to the problem is needed in a wide variety of fields, including analysis of not only acoustic signals but also mechanical vibration such as mechanical sounds and seismic waves, analysis of biological signals such as brain waves, cardiac pulse sounds, ultrasonic echoes and nerve cell signals, and analysis of signals from sensors for collecting general time-series data.
Therefore, an object of the present invention is to provide a method of signal processing which can overcome the essential limit imposed by spectral representation described with reference to the examples above using a representation not dependent on physical size of the signal source, as well as to provide an apparatus utilizing the method.
Another object of the present invention is to provide a method of signal processing capable of extracting a signal feature invariant regardless of the physical size of the signal source using a representation not dependent on the physical size of the signal source, as well as to provide an apparatus using the method.
A still further object of the present invention is to provide a method of signal processing capable of extracting a signal feature invariant regardless of the physical size of a signal source using a representation of which shape is invariant regardless of expansion or reduction along a time axis of a signal waveform, as well as to provide an apparatus utilizing the method.
An additional object of the present invention is to provide a method of signal processing capable of extracting a signal feature invariant regardless of physical size of a signal source by obtaining and utilizing a representation of which shape is invariant regardless of expansion or reduction along a time axis of a signal waveform, using the Mellin transform, and to provide an apparatus utilizing the method.
A still further object of the present invention is to provide a method of signal processing capable of extracting a signal feature invariant regardless of physical size of a signal source by obtaining and utilizing a time expression of which shape is invariant regardless of expansion or reduction along a time axis of a signal waveform using the Mellin transform, overcoming the “shift varying” characteristic of the Mellin transform, as well as to provide an apparatus utilizing the method.
The method of signal processing in accordance with an aspect of the present invention includes the steps of: wavelet-transforming an input signal in a computer; and extracting features of the signal by performing, in a computer, a Mellin transform on the output of the wavelet transform step in synchrony with the input signal.
As the output of the wavelet transform step is synchronized with the input signal, a start point for the Mellin transform analysis is determined, and hence the Mellin transform of the input signal becomes possible despite the shift varying nature of the Mellin transform. The Mellin transform is characterized by the fact that the magnitude distribution of the output thereof is unchanged regardless of expansion or reduction of a signal waveform on the time axis. Therefore, the Mellin transform used in signal processing enables extraction of a feature invariant regardless of the physical size of the signal source from the signal.
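The scale-invariance property stated above can be checked numerically. The following is a minimal sketch, not the implementation of the invention. It assumes the standard definition M(s) = ∫ f(t) t^(s-1) dt evaluated at s = sigma + jc, computes it as a Fourier transform over log-time u = ln t, and uses a hypothetical damped tone as the test signal. The function names, the grid limits and the choice sigma = 0.5 are illustrative assumptions.

```python
import numpy as np

def mellin_magnitude(f, sigma=0.5, u_min=-12.0, u_max=12.0, n=4096):
    """|M(sigma + j*c)| of f, computed as a Fourier transform in log-time u = ln t."""
    u = np.linspace(u_min, u_max, n, endpoint=False)
    du = u[1] - u[0]
    # With t = exp(u), f(t) * t**(s-1) dt becomes f(exp(u)) * exp(sigma*u) * exp(j*c*u) du.
    g = f(np.exp(u)) * np.exp(sigma * u)
    return np.abs(np.fft.rfft(g)) * du

def damped_tone(t):
    """Hypothetical test signal: an exponentially damped sinusoid starting at t = 0."""
    return np.exp(-t) * np.sin(2 * np.pi * t)

a = 2.0  # time-compression factor (a smaller source in the document's terms)
m_ref = mellin_magnitude(damped_tone)
m_scaled = mellin_magnitude(lambda t: damped_tone(a * t))

# Compare the two magnitude distributions over the first 50 transform bins.
ratio = m_scaled[:50] / m_ref[:50]
```

Under these assumptions, compressing the waveform by a factor a only multiplies the magnitude by the constant a^(-sigma), so the shape of the magnitude distribution is unchanged. The fixed origin t = 0 of the test signal plays the role of the start point that the synchronization provides.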
Preferably, the feature extraction step includes the steps of: transforming a representation corresponding to the running spectrum obtained from the wavelet transform step to a time-interval/logarithmic-frequency representation by stabilizing the representation in time with signal synchronization while maintaining the fine structure of the response waveform; and performing a process corresponding to the Mellin transform on the time-interval/logarithmic-frequency representation along a line on which a product or ratio between the time interval and the frequency is constant.
As the process corresponding to the Mellin transform is performed along a line on which the product or ratio between the time interval and the frequency is constant, a Mellin image is obtained which is a representation invariant regardless of expansion or reduction of periodicity and the physical size of the sound source. The sound source image can be represented by the Mellin image.
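One way to see why lines of constant product between time interval and frequency are the natural paths is the dilation covariance of the continuous wavelet transform. The notation below (kernel ψ, scale a, shift b, dilation factor α, kernel centre frequency f_ψ) is introduced here for illustration and is not taken from the specification.

```latex
W_x(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\,
           \psi^{*}\!\left(\frac{t-b}{a}\right) dt ,
\qquad
x_\alpha(t) = x(\alpha t),\ \alpha > 0
\;\Longrightarrow\;
W_{x_\alpha}(a,b) = \frac{1}{\sqrt{\alpha}}\, W_x(\alpha a,\ \alpha b).

\text{With } f = f_\psi / a \text{ and } \tau = b:\quad
(f,\ \tau)\ \longmapsto\ \left(\frac{f}{\alpha},\ \alpha\tau\right),
\qquad \tau f \ \text{is unchanged}.
```

Under this reading, expanding or compressing the waveform in time moves every point of the representation along a curve on which the product τf is constant, so a transform taken along such a curve only sees a shift in logarithmic frequency.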
According to another aspect, the signal processing apparatus of the present invention includes a wavelet transform unit for wavelet transform of an input signal which has been transformed to a predetermined format allowing processing by a computer; and a feature extraction unit for extracting a signal feature by Mellin-transforming the output of the wavelet transform unit in synchronization with the input signal.
As the output of the wavelet transform unit is synchronized with the input signal, an origin for the Mellin transform analysis is determined, and hence the Mellin transform of the input signal becomes possible regardless of the shift varying nature of the Mellin transform. The Mellin transform has a property that the magnitude distribution of its output is invariant regardless of expansion or reduction of the signal waveform on the time axis. Therefore, use of the Mellin transform in signal processing enables extraction of features which are invariant regardless of the physical size of the signal's source.
Preferably, the feature extraction unit includes a unit for transforming a representation corresponding to a running spectrum obtained by the wavelet transform unit to a time-interval/logarithmic-frequency representation that is stabilized in time in synchrony with the signal while maintaining the fine structure of the response waveform, and a unit for performing a process corresponding to the Mellin transform on the time-interval/logarithmic-frequency representation along a line on which the product or ratio between the time interval and the frequency is constant.
As a process corresponding to the Mellin transform is performed along the line on which the product or ratio between the time interval and the frequency is constant, a Mellin image is obtained which is a representation invariant regardless of expansion or reduction of the physical size and periodicity of the sound source. Sound source information can be represented by the Mellin image.
According to a still further aspect, the signal processing apparatus in accordance with the present invention includes: a wavelet filter bank including a plurality of wavelet filters performing transform by wavelets having the same wavelet kernel function and frequencies different from each other, each connected to receive an input signal; an auditory figure extracting unit connected to receive an output of the wavelet filter bank for extracting an auditory figure from the output of the wavelet filter bank; a size-shape image generating unit for generating a size-shape image of an input signal from the auditory figure extracted by the auditory figure extracting unit; and a feature extracting unit for extracting features of the input signal from the size-shape image.
The auditory figure is a time-stabilized output of the wavelet filter bank, and as the size-shape image is generated therefrom, the subsequent Mellin transform is facilitated. The Mellin transform enables generation of a representation not dependent on the physical size or periodicity of the signal's source, and therefore the signal from the signal's source can be analyzed based on a feature not dependent on the physical size or periodicity of the signal's source.
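A wavelet filter bank of the kind described, in which every filter is a dilated copy of one kernel, might be sketched as follows. This is an illustrative assumption only: the gammatone-like kernel, the parameter `cycles`, the channel spacing and the function name `wavelet_filterbank` are hypothetical and are not specified in this section.

```python
import numpy as np

def wavelet_filterbank(x, fs, fc, cycles=8.0):
    """Filter x with dilated copies of one kernel, one per centre frequency in fc."""
    out = np.empty((len(fc), len(x)))
    for i, f in enumerate(fc):
        # Kernel argument in cycles of the channel frequency: h = t * f.
        h = np.arange(int(cycles * fs / f)) * f / fs
        # Hypothetical kernel; its shape depends on h only, so all channels share it.
        kernel = h ** 3 * np.exp(-2.0 * h) * np.cos(2 * np.pi * h)
        # The factor f / fs keeps the peak gain the same in every channel.
        out[i] = np.convolve(x, kernel * f / fs)[:len(x)]
    return out

fs = 16000.0
fc = 250.0 * 2.0 ** (np.arange(17) / 4.0)   # 250 Hz to 4 kHz, four channels per octave
t = np.arange(1600) / fs
tone = np.sin(2 * np.pi * 1000.0 * t)       # test input: a 1 kHz tone

outputs = wavelet_filterbank(tone, fs, fc)
# Steady-state level per channel, skipping the onset transient.
rms = np.sqrt(np.mean(outputs[:, 800:] ** 2, axis=1))
best_channel = int(np.argmax(rms))
```

In this sketch the channel whose centre frequency matches the tone responds most strongly, and because the kernel depends only on the product of time and channel frequency, the impulse responses of all channels have the same shape when time is measured in cycles.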
Preferably, the feature extracting unit includes a Mellin-image generating unit for generating the Mellin image by performing the Fourier transform on the size-shape image along an impulse response line of each wavelet filter.
As the process corresponding to the Fourier transform is performed along the impulse response line, a Mellin image which is a representation invariant regardless of expansion or reduction of the physical size or periodicity of the sound source can be obtained. The sound source information can be represented by the Mellin image.
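The passage above can be illustrated with a toy example. The sketch below assumes that the size-shape image is obtained by re-expressing each channel of the auditory figure on the axis h = (time interval) × (channel frequency), so that impulse response lines become lines of constant h, and that the Mellin image is the magnitude of a plain Fourier transform taken across the logarithmically spaced channels. The toy auditory figure, the grids and all function names are hypothetical, and any weighting used in an actual implementation is omitted.

```python
import numpy as np

def size_shape_image(af, tau, fc, h):
    """Resample each channel of af[channel, tau] onto the axis h = tau * fc."""
    return np.stack([np.interp(h / f, tau, row, left=0.0, right=0.0)
                     for f, row in zip(fc, af)])

def mellin_image(ssi):
    """Magnitude of the Fourier transform along the log-frequency (channel) axis."""
    return np.abs(np.fft.fft(ssi, axis=0))

def toy_auditory_figure(tau, fc, f0, width=0.3):
    """Hypothetical figure: a resonance near f0 with a damped response in each channel."""
    env = np.exp(-0.5 * (np.log(fc / f0) / width) ** 2)
    cycles = tau[None, :] * fc[:, None]
    return env[:, None] * np.exp(-0.5 * cycles) * np.cos(2 * np.pi * cycles)

n_ch = 64
fc = 100.0 * 2.0 ** (6.0 * np.arange(n_ch) / n_ch)   # log-spaced channels over six octaves
tau = np.arange(8001) / 200000.0                      # time-interval axis, 0 to 40 ms
h = np.linspace(0.0, 4.0, 81)                         # time interval x frequency, in cycles

alpha = fc[8] / fc[0]   # size change equal to a shift of eight channels
mi_large = mellin_image(size_shape_image(toy_auditory_figure(tau, fc, 500.0), tau, fc, h))
mi_small = mellin_image(size_shape_image(toy_auditory_figure(tau, fc, 500.0 * alpha), tau, fc, h))

# Largest difference between the two Mellin images, relative to the peak value.
max_diff = np.max(np.abs(mi_large - mi_small)) / mi_large.max()
```

In this toy setting, changing the source size moves the pattern along the log-frequency axis of the size-shape image without changing its shape, and the Fourier magnitude along that axis removes the shift, so the two Mellin images agree up to interpolation error.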
Preferably, the auditory figure extracting unit includes: a strobed temporal integrating unit for performing strobed temporal integration of the output of each channel of the wavelet filter bank to generate a stabilized auditory image, by detecting periodicity included in the output of the wavelet filter bank; and a stabilized auditory image extracting unit for extracting, as the auditory figure, one period of a prescribed order of the stabilized auditory image obtained by strobed temporal integration, based on the periodicity detected by the strobed temporal integrating unit.
As the output of the wavelet filter bank does not have a fixed starting point, the Mellin transform is not applicable to the output as it is. When a signal having periodicity or quasi-periodicity such as a voiced sound of speech or constant musical sound is input, the output of the wavelet filter bank involves periodicity, and by strobed temporal integration using the periodicity, the output of the wavelet filter bank can be stabilized in time to generate a stabilized auditory image. Once the stabilized auditory image is generated, one period of a prescribed order is selected therefrom as needed, whereby a representation invariant regardless of expansion or reduction of the physical size and periodicity of the sound source can be obtained by the subsequent Mellin transform.
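A heavily simplified sketch of strobed temporal integration for a single filter bank channel is given below. It assumes that strobe points are local maxima exceeding a fixed fraction of the channel maximum and that integration is a plain average of the segments following each strobe; the actual strobe criterion and integration rule are not given in this section. The function names, the threshold and the synthetic channel output are hypothetical.

```python
import numpy as np

def find_strobes(y, rel_threshold=0.8):
    """Indices of local maxima of y that exceed rel_threshold * max(y)."""
    inner = y[1:-1]
    is_peak = (inner > y[:-2]) & (inner >= y[2:]) & (inner > rel_threshold * y.max())
    return np.flatnonzero(is_peak) + 1

def stabilized_image(y, strobes, width):
    """Average the segments of length width that start at each strobe point."""
    segments = [y[s:s + width] for s in strobes if s + width <= len(y)]
    return np.mean(segments, axis=0)

# Synthetic channel output: a damped resonance re-excited every 100 samples.
m = np.arange(1000)
resonance = np.exp(-m / 20.0) * np.cos(2 * np.pi * m / 10.0)
y = np.zeros(1000)
for start in range(0, 1000, 100):
    y[start:] += resonance[:1000 - start]

strobes = find_strobes(y)
period = int(np.median(np.diff(strobes)))      # detected periodicity in samples
sai = stabilized_image(y, strobes, width=150)  # time-stabilized response
auditory_figure = sai[:period]                 # here, the first period is taken
```

Because every segment starts at a strobe point, the averaged response has a fixed origin, which is the condition the Mellin transform needs. The sketch takes the first period as the auditory figure; which period is the one "of a prescribed order" is left open here.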
The foregoing and other objects, features, aspects and advantages of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.