In ordinary audio processing applications involving common audio output interfaces, such as the speakers of televisions, computers, mobile phones and telephones, or microphones, the audio output contains waveforms distributed across different frequency bands. The sounds chiefly include human voice, background sounds, noise, and other miscellaneous sounds. To alter the acoustic effect of certain sounds, or to emphasize certain sounds, further audio processing of those sounds is required.
More specifically, it is the human speech content among the output sounds that most often needs to be emphasized. For instance, by enhancing the frequency bands of dialogue between leading characters in a movie, or of human speech in a telephone conversation, the enhanced bands become more distinguishable and clearer against less important background sounds and noise. Distinct presentation and accurate identification of speech are therefore key concerns in audio processing techniques.
The aforementioned human speech enhancement technique is already applied in the prior art. FIG. 1 shows a waveform schematic diagram in which a specific band is enhanced according to the prior art. The upper waveform is the original sound output waveform, with the horizontal axis representing frequency and the vertical axis representing the amplitude of the waveform output. The lower waveform in the diagram shows the processed waveform. Ordinary human voices have a frequency range of between 500 Hz and 6 kHz, or even 7 kHz; any sound frequency falling outside this range lies outside the frequency range of ordinary human voices. As shown in the diagram, a common speech enhancement technique directly selects signals within a band of 1 kHz to 3 kHz from the band of output sounds, and processes the selected signals to generate output signals. Alternatively, a time-domain filter is used to perform bandpass filtering and enhancement on signals of a certain band. According to such prior art, the desired band of human voice is indeed enhanced. However, background sounds, noise and minor audio content that co-exist in the same band are enhanced at the same time, such that the speech does not sound distinguishable or clear. Some existing digital and analog televisions implement the above method or a similar method for enhancing speech outputs.
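The time-domain approach described above can be illustrated with a minimal sketch. The prior art does not specify the filter design, so the second-order (biquad) band-pass section, the 48 kHz sampling frequency, the boost factor and all function names below are assumptions chosen only for illustration. The sketch adds a band-passed copy of the signal (1 kHz to 3 kHz) back onto the original, which raises every component in that band, whether it is speech or noise.

```python
import math

def bandpass_coeffs(f_lo, f_hi, fs):
    """Biquad band-pass with 0 dB peak gain (Audio EQ Cookbook form).

    Assumed design: centre at the geometric mean of the band edges,
    Q = centre / bandwidth. Coefficients are normalised so a[0] == 1.
    """
    f0 = math.sqrt(f_lo * f_hi)
    q = f0 / (f_hi - f_lo)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def biquad(x, b, a):
    """Run a direct-form I biquad over the sample list x."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def enhance_band(x, fs, f_lo=1000.0, f_hi=3000.0, boost=1.0):
    """Add `boost` times the band-passed signal back onto the input."""
    b, a = bandpass_coeffs(f_lo, f_hi, fs)
    band = biquad(x, b, a)
    return [xn + boost * bn for xn, bn in zip(x, band)]

def tone_gain(freq, fs=48000, n=9600, skip=2400):
    """RMS gain seen by a pure tone, ignoring the start-up transient."""
    x = [math.sin(2.0 * math.pi * freq * i / fs) for i in range(n)]
    y = enhance_band(x, fs)

    def rms(s):
        return math.sqrt(sum(v * v for v in s) / len(s))

    return rms(y[skip:]) / rms(x[skip:])
```

Under these assumed parameters, calling tone_gain(1500.0) and tone_gain(2500.0) shows that both in-band components are raised substantially, while tone_gain(200.0) stays close to unity. A noise component at 2.5 kHz is therefore boosted along with speech at 1.5 kHz, which is the drawback noted above.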
FIG. 2 shows a schematic diagram of a system operation for speech enhancement according to the prior art. This technique processes single-channel audio signals in the frequency domain, and performs digital processing on the signals sampled at a sampling frequency (FS). Commonly used sampling frequencies of audio signals include 44.1 kHz, 48 kHz and 32 kHz. The frequency domain signals are acquired from the time domain signals by using the Fast Fourier Transform (FFT). Using a speech enhancement operator 10 in the diagram, various operations are performed in the frequency domain on the sampled signals at a specific frequency resolution, so as to remove the frequencies of non-primary background sounds and noise, or to enhance the frequencies of the required speech. With such a procedure, the speech band accounts for a substantial proportion of the output results obtained. The output results are processed using the inverse FFT (IFFT) to return to time domain signals for further audio output.
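The FFT, operator and IFFT chain of FIG. 2 can be sketched for a single frame as follows. The internals of the speech enhancement operator 10 are not disclosed in this passage, so the per-bin gain rule below (a fixed gain inside an assumed 500 Hz to 6 kHz speech band and a fixed attenuation elsewhere), the gain values and all function names are hypothetical stand-ins. A practical system would also window and overlap successive frames, which is omitted here.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(spectrum):
    """Inverse FFT computed via the conjugate trick."""
    n = len(spectrum)
    conj = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in conj]

def speech_operator(spectrum, fs, f_lo=500.0, f_hi=6000.0,
                    speech_gain=2.0, other_gain=0.5):
    """Hypothetical stand-in for operator 10: scale each frequency bin.

    Bin k and its mirror n - k get the same gain, so a real input
    frame yields a real output frame.
    """
    n = len(spectrum)
    out = []
    for k, v in enumerate(spectrum):
        f = min(k, n - k) * fs / n  # frequency represented by bin k
        gain = speech_gain if f_lo <= f <= f_hi else other_gain
        out.append(gain * v)
    return out

def enhance_frame(frame, fs):
    """Time domain -> FFT -> operator -> IFFT -> time domain."""
    processed = speech_operator(fft(frame), fs)
    return [v.real for v in ifft(processed)]
```

For a 1024-sample frame at 48 kHz, the frequency resolution is 48000 / 1024 = 46.875 Hz per bin. A component at 1.5 kHz therefore falls in bin 32 and is amplified, while a component at 12 kHz falls in bin 256 and is attenuated.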
The abovementioned technique, including the speech enhancement operator 10, is prevalent in the audio output functions of telephones and mobile phones, and is particularly widely applied in GSM mobile phones. Processing methods for this technique include spectral subtraction, energy-constrained signal subspace approaches, modified spectral subtraction, and linear prediction residual methods. For common stereo sound outputs, however, speech enhancement is still generally accomplished by processing the left-channel and right-channel audio signals individually.
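Of the processing methods listed above, spectral subtraction is the simplest to illustrate. The following is a minimal sketch of the basic magnitude form only: an estimate of the noise magnitude is subtracted from the magnitude of each frequency bin, and the phase of the noisy signal is kept. How the noise estimate is obtained is not described in this passage, so `noise_mag` is assumed to be supplied by the caller, and the function name and `floor` parameter are illustrative.

```python
import cmath

def spectral_subtract(noisy_spectrum, noise_mag, floor=0.0):
    """Basic magnitude spectral subtraction on one frame.

    noisy_spectrum: complex frequency bins of the noisy frame.
    noise_mag: assumed per-bin noise magnitude estimate (same length).
    floor: fraction of the noisy magnitude kept as a lower bound, so
           the result never goes negative.
    """
    cleaned = []
    for y, n in zip(noisy_spectrum, noise_mag):
        mag = max(abs(y) - n, floor * abs(y))
        # Reuse the noisy phase; only the magnitude is modified.
        cleaned.append(cmath.rect(mag, cmath.phase(y)))
    return cleaned
```

In a stereo system that processes the channels individually, a routine of this kind would run once on the left-channel spectrum and once on the right-channel spectrum for every frame.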
Although the method shown in FIG. 1 accomplishes speech enhancement without FFT and IFFT transformation, it has the drawback that the processed results are neither obvious nor distinguishable, and it fails to effectively strengthen human speech or filter out other minor sounds. The technique shown in FIG. 2, by using the FFT, is capable of acquiring human speech or background sounds at a particular frequency resolution in the frequency domain, and of performing the corresponding human speech enhancement or background sound filtering. Yet, when this technique is applied to process the left and right channels individually, the system inevitably requires a large amount of system memory, such as DRAM or SRAM, during operation. In addition, after processing by the speech enhancement operator 10 in the frequency domain using the FFT, the IFFT is applied to return to time domain output signals. Performing the FFT and IFFT transformations also requires a large amount of system memory and extensive processor resources. Therefore, a primary object of the invention is to overcome the aforementioned drawbacks of the prior art techniques.