1. Field of the Invention
The present invention relates to the field of digital signal processing of audio, and, more particularly, to using a loudness-level-reference segment of audio to normalize relative audio levels among different audio files when combining content of the audio files.
2. Description of the Related Art
Speech recognition engines are tested and trained using speech recordings. These speech recordings include both speech and background noise. In order to comprehensively test or train a speech recognition engine, many different voices and background noises are required. The different combinations are meant to simulate real-world conditions in which the speech recognition engine will operate.
For example, a speech recognition engine used by an interactive voice response (IVR) system can expect to be used by talkers in different background environments, each having different ambient noises and ambient noise levels. Background environments can include an interior of a car, a crowd, a public transportation environment, a business environment, a household environment, and the like.
One technique for obtaining audio files needed by the speech recognition engine is to record audio of a number of different talkers in each of a number of different audio environments. This technique, referred to as a real environment audio recording technique, is very expensive in terms of required man hours for talkers and recording operators to obtain the audio files. Additionally, the resulting audio files can require significant storage space.
Another technique for obtaining the audio files is to record talkers once in a sound room or an environment with minimal ambient noise and to record background sounds once for a number of different environments. The talker audio files are then combined with the background sounds to generate audio files with talker content and with ambient noise for different environments. This technique, referred to as a post-recording mixing technique, is much less expensive. Fewer total recordings are necessary to obtain the desired combinations of talkers and environments. Additionally, the audio files can be combined when needed, which conserves storage space of a speech processing system.
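The savings of the post-recording mixing technique can be illustrated with a minimal sketch. The sketch assumes that audio is held as plain lists of floating-point samples at a common sample rate and that mixing is a sample-wise sum; the names `mix`, `recordings_needed`, `talkers`, and `environments`, and all sample values, are hypothetical and chosen only for illustration. For N talkers and M environments, the real environment audio recording technique requires N x M recordings, whereas the post-recording mixing technique requires N + M (for example, 100 talkers and 10 environments give 1000 recordings versus 110).

```python
from itertools import product


def recordings_needed(num_talkers, num_environments):
    """Return (real-environment count, post-recording-mixing count)."""
    return num_talkers * num_environments, num_talkers + num_environments


def mix(speech, background):
    """Sum speech and background sample by sample.

    The background is looped so that it covers the full speech length.
    No level adjustment is performed in this sketch.
    """
    if not background:
        raise ValueError("background must contain at least one sample")
    return [s + background[i % len(background)] for i, s in enumerate(speech)]


# Hypothetical stored recordings: one per talker, one per environment.
talkers = {
    "talker_a": [0.10, -0.20, 0.30],
    "talker_b": [0.00, 0.40, -0.10],
}
environments = {
    "car": [0.01, -0.01],
    "crowd": [0.05, 0.02],
    "office": [0.00, 0.01],
}

# Combined files are generated only when needed, not stored in advance.
combined = {
    (talker, env): mix(talkers[talker], environments[env])
    for talker, env in product(talkers, environments)
}
```

In this sketch only five recordings are stored, yet all six talker and environment combinations can be produced on demand.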
The post-recording mixing technique poses a number of challenges. One challenge is to ensure that the audio level of the background sounds is appropriate relative to the audio level of the speech. When the relative audio levels are mismatched, the combined audio file does not properly simulate a live situation. Accordingly, the tests and/or training activities that are based upon the combined audio are inaccurate.
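The relative-level challenge can be made concrete with a brief sketch. The sketch assumes, purely for illustration, that audio level is measured as root-mean-square (RMS) amplitude and that the desired relationship is expressed as a speech-to-background ratio in decibels; neither assumption is taken from the techniques described above, and the names `rms`, `level_ratio_db`, `scale_background`, and `target_ratio_db` are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty sample sequence."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def level_ratio_db(speech, background):
    """Speech level relative to background level, in decibels."""
    return 20.0 * math.log10(rms(speech) / rms(background))


def scale_background(speech, background, target_ratio_db):
    """Return a copy of background scaled to the target speech-to-background ratio.

    Assumes neither sequence is silent (an RMS of zero would divide by zero).
    """
    desired_background_rms = rms(speech) / (10.0 ** (target_ratio_db / 20.0))
    gain = desired_background_rms / rms(background)
    return [gain * x for x in background]


# Hypothetical recordings made at the same level (ratio of 0 dB).
speech = [0.5, -0.5, 0.5, -0.5]
background = [0.5, -0.5, 0.5, -0.5]

# Place the background 20 dB below the speech before combining.
scaled = scale_background(speech, background, target_ratio_db=20.0)
```

A sketch of this kind adjusts levels only after an appropriate target ratio is known; it does not determine what ratio would have existed in a live situation, which is the difficulty noted above.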
One conventional means for adjusting audio levels is to have a human agent manually adjust the audio levels of the two component files. Results from a manual adjustment technique are highly dependent upon the skill of the human agent, are generally not subject to verification, and require significant time.
Another conventional means is to calibrate all recording devices to an equivalent audio recording level. When a recorded sound has a particularly high audio level relative to the calibrated settings, clipping can occur. Similarly, when a recorded sound has a particularly low audio level relative to the calibrated settings, the resulting recording can be of relatively low quality. What is needed is a solution for implementing the post-recording mixing technique that is not subject to the drawbacks inherent in conventional implementations of the post-recording mixing technique.