Embedding information in audio signals, or audio steganography, is vital for secure covert transmission of information such as battlefield data and banking transactions via open audio channels. On another level, watermarking of audio signals for digital rights management is becoming an increasingly important technique for preventing illegal copying, file sharing, etc. Audio steganography, encompassing information hiding and rights management, is thus gaining widespread significance in secure communication and consumer applications. A steganography system, in general, is expected to meet three key requirements, namely, imperceptibility of embedding, correct recovery of embedded information, and large payload. Practical audio embedding systems, however, face hard challenges in fulfilling all three requirements simultaneously because of the large power and dynamic range of hearing and the wide range of audible frequencies of the human auditory system (HAS). These challenges are more difficult to surmount than those faced by image and video steganography systems, which benefit from the relatively low visual acuity and the large cover image/video size available for embedding.
One of the commonly employed techniques to overcome the embedding limitations due to the acute sensitivity of the HAS is to embed data in the auditorily masked spectral regions. Frequency masking is a psychoacoustic property of the HAS that renders weaker tones inaudible in the presence of a stronger tone (or noise). A large body of embedding work has been reported with varying degrees of imperceptibility, data recovery and payload, all exploiting the frequency masking effect for watermarking and authentication applications.
Psychoacoustical, or auditory, masking is a perceptual property of the HAS in which the presence of a strong tone makes a weaker tone in its temporal or spectral neighborhood imperceptible. This property arises because of the low differential range of the HAS even though the dynamic range covers 80 dB below ambient level. In temporal masking, a faint tone goes undetected when it appears immediately before or after a strong tone. Frequency masking occurs when the human ear cannot perceive frequencies at a lower power level if these frequencies are present in the vicinity of tone or noise-like frequencies at a higher level. Additionally, a weak pure tone is masked by wide-band noise if the tone occurs within a critical band. The masked sound becomes inaudible in the presence of another louder sound; the masked sound is still present, however.
By exploiting the limitation of the HAS in not perceiving masked sounds, an audio signal can be efficiently coded for transmission and storage as in ISO-MPEG audio compression and in Advanced Audio Coder algorithms. While the coder represents the original audio by changing its characteristics, a listener still perceives the same quality in the coded audio as the original. The same principle is extended to embedding information by utilizing the frequency masking phenomenon directly or indirectly.
A general steganography procedure employing the frequency masking property begins with the calculation of the masker frequencies (tonal and noise-like) and their power levels from the normalized power spectral density (PSD) of each frame of cover speech. A global (frame) threshold of hearing based on the maskers present in the frame is then determined. The sound pressure level for quiet, below which a signal is generally inaudible, is also obtained. As an example, the normalized power spectral density, threshold of hearing, and the absolute quiet threshold are shown in FIG. 1 for a frame of speech. The spectral component around 1000 Hz in this figure, for instance, is inaudible, or masked, because its PSD is below the global masking threshold level at that frequency. With the threshold at approximately 75 dB and the PSD at 52 dB, raising the PSD of the signal at 1000 Hz by as much as 15 dB (to 67 dB) will still leave the component inaudible. (Raising the level much closer to the threshold may alter the threshold itself if the other components within the critical band are lower than the new level at 1000 Hz.) In addition to modifying the PSD, the phase at 1000 Hz can also be changed without causing a noticeable perceptual difference. Many other such ‘psychoacoustical perceptual holes,’ or masked points, can be detected over the range of frequencies present in the signal frame. The PSD values and/or the phase values at these holes can be modified in accordance with the information to be embedded, with little effect on the perceptual quality of the frame. Alternatively, the phase in the perceptually significant regions can be changed by a small value. Here, the inability of the HAS to perceive absolute phase, as opposed to relative phase, is used to achieve imperceptible embedding.
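The search for masked points can be illustrated with a minimal Python sketch. It assumes the per-bin PSD and global masking threshold (both in dB) have already been produced by a psychoacoustic model, which is not shown; the function name `find_masked_bins`, the safety margin `margin_db`, and the three-bin toy frame are hypothetical and chosen only to mirror the 52 dB / 75 dB example at 1000 Hz.

```python
import numpy as np

def find_masked_bins(psd_db, threshold_db, margin_db=3.0):
    """Return indices of bins whose PSD lies below the masking threshold
    by more than margin_db, together with their headroom in dB.
    margin_db is an assumed safety margin, not a value from the text."""
    psd_db = np.asarray(psd_db, dtype=float)
    threshold_db = np.asarray(threshold_db, dtype=float)
    headroom_db = threshold_db - psd_db          # how far each bin sits below the threshold
    masked = np.flatnonzero(headroom_db > margin_db)
    return masked, headroom_db[masked]

# Toy frame: bin 1 stands in for the 1000 Hz component (PSD 52 dB, threshold 75 dB).
psd_db = [80.0, 52.0, 70.0]
threshold_db = [60.0, 75.0, 71.0]
masked, headroom = find_masked_bins(psd_db, threshold_db)

# The 15 dB increase discussed in the text keeps the component under the threshold.
boosted_db = psd_db[1] + 15.0
still_masked = boosted_db < threshold_db[1]
```

In this toy frame only bin 1 qualifies as a masked point; bin 0 is audible, and bin 2 sits too close to the threshold for the assumed margin.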
In employing frequency-masked regions directly for data embedding, phase and/or amplitude of spectral components at one or more frequencies in the masked set are altered in accordance with the data. To accommodate varying quantization levels and noise in transmission, spectral amplitude modification is generally carried out as a ratio of the frame threshold. Examples of direct embedding in frequency-masked regions can be found in U.S. Patent Application Publication 2003/0176934 and U.S. Patent Application Publication 2005/0159831, which is incorporated by reference herein.
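One way to picture amplitude modification expressed as a ratio of the frame threshold is the following Python sketch. It is an illustrative assumption and not the method of the cited publications: a bit is written by placing the level of a masked bin at a fixed offset (a ratio, in dB) below the threshold, and read back by comparing the observed offset with the midpoint. The names `RATIO_DB`, `embed_bit`, and `extract_bit`, the offset values, and the assumption that the receiver recomputes the same threshold are all hypothetical.

```python
import numpy as np

# Hypothetical offsets (dB below the frame threshold) used for bit 0 and bit 1.
RATIO_DB = (-20.0, -8.0)

def embed_bit(psd_db, threshold_db, bin_idx, bit):
    """Set the level of one masked bin to a fixed ratio of the threshold."""
    out = np.array(psd_db, dtype=float)
    out[bin_idx] = threshold_db[bin_idx] + RATIO_DB[bit]
    return out

def extract_bit(psd_db, threshold_db, bin_idx):
    """Decide the bit from the level relative to the threshold."""
    offset = psd_db[bin_idx] - threshold_db[bin_idx]
    return int(offset > 0.5 * (RATIO_DB[0] + RATIO_DB[1]))

psd_db = np.array([80.0, 52.0, 70.0])
threshold_db = np.array([60.0, 75.0, 71.0])

stego_db = embed_bit(psd_db, threshold_db, bin_idx=1, bit=1)
recovered = extract_bit(stego_db, threshold_db, bin_idx=1)

# Because the bit is carried by a ratio, a uniform gain change that shifts
# both the level and the threshold leaves the decision unchanged.
gain_db = -6.0
recovered_after_gain = extract_bit(stego_db + gain_db, threshold_db + gain_db, bin_idx=1)
```

Both embedded levels stay below the threshold, so the modified component remains masked under the stated assumptions.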
Embedding in temporally masked regions, typically for watermarking an audio signal, modifies the envelope of the audio with a preselected random sequence of data such that the modification is inaudible. Due to the small size and selection of data, however, temporal masking is primarily suited for watermarking applications.
Several steganography methods using indirect exploitation of frequency masking have been recently proposed with varying degrees of success. These methods typically alter speech samples by a small amount so that inaudibility is achieved without explicitly locating masked regions.
Cepstral domain features have been used extensively in speech and speaker recognition systems, and speech analysis applications. Complex cepstrum $\hat{x}[n]$ of a frame of speech $x[n]$ is defined as the inverse Fourier transform of the complex logarithm of the spectrum of the frame, as given by

$$\hat{x}[n] = F^{-1}\big[\ln\big[F(x[n])\big]\big] = \frac{1}{2\pi}\int_{-\pi}^{\pi} \ln X(e^{j\omega})\, e^{j\omega n}\, d\omega \tag{1}$$

where

$$X(e^{j\omega}) = F\big[x[n]\big] = \sum_{k=-\infty}^{\infty} x[k]\, e^{-j\omega k} = \left|X(e^{j\omega})\right| e^{j\theta(\omega)} \tag{2}$$

is the discrete Fourier transform of $x[n]$, with the inverse transform given by

$$x[n] = \frac{1}{2\pi}\int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega, \tag{3}$$

and

$$\ln X(e^{j\omega}) = \ln\left|X(e^{j\omega})\right| + j\theta(\omega), \qquad \theta(\omega) = \arg\big[X(e^{j\omega})\big] \tag{4}$$

is the complex logarithm of the DFT of $x[n]$.
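As an illustration of Eqs. (1) to (4), the following Python sketch computes a complex cepstrum on a finite DFT grid and inverts it. It is a minimal sketch under stated assumptions, not the implementation used in any cited work: the function names, the FFT length, and the test frame are hypothetical; the FFT length is assumed even; the frame is assumed to have no spectral zeros and a positive DC value; and the phase is unwrapped and its linear (pure delay) term removed, which is a common practical convention rather than something specified in the text.

```python
import numpy as np

def complex_cepstrum(x, n_fft):
    """Complex cepstrum of a real frame on an n_fft-point DFT grid (n_fft even).
    Returns the cepstrum and the integer delay whose linear phase was removed."""
    X = np.fft.fft(x, n_fft)
    phase = np.unwrap(np.angle(X))                 # theta(omega) of Eq. (4), made continuous
    center = n_fft // 2
    ndelay = int(np.round(phase[center] / np.pi))  # linear-phase term, in samples
    phase = phase - np.pi * ndelay * np.arange(n_fft) / center
    log_X = np.log(np.abs(X)) + 1j * phase         # complex logarithm, Eq. (4)
    return np.fft.ifft(log_X).real, ndelay         # inverse transform, Eq. (1)

def inverse_complex_cepstrum(c, ndelay):
    """Rebuild the frame from its complex cepstrum and the removed delay."""
    n_fft = len(c)
    center = n_fft // 2
    log_X = np.fft.fft(c)
    phase = log_X.imag + np.pi * ndelay * np.arange(n_fft) / center
    X = np.exp(log_X.real) * np.exp(1j * phase)
    return np.fft.ifft(X).real

# Hypothetical test frame: a decaying exponential delayed by three samples.
frame = np.concatenate([np.zeros(3), 0.8 ** np.arange(64)])
cep, ndelay = complex_cepstrum(frame, n_fft=256)
rebuilt = inverse_complex_cepstrum(cep, ndelay)
```

Because the phase term of Eq. (4) is retained, the frame can be rebuilt from its cepstrum, which is what makes cepstrum-modified speech obtainable after embedding.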
While real cepstrum (without the phase information given by the second term in Eq. (4)) is typically used in speech analysis and speaker identification applications, complex cepstrum is needed for embedding and watermarking to obtain the cepstrum-modified speech. If a frame of speech samples is represented by

$$x[n] = e[n] * h[n] \tag{5}$$

where $e[n]$ is the excitation source signal and $h[n]$ is the vocal tract system model, Eq. (4) above becomes

$$\ln\big[X(e^{j\omega})\big] = \ln\big[E(e^{j\omega})\big] + \ln\big[H(e^{j\omega})\big] \tag{6}$$
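The step from Eq. (5) to Eq. (6) can be spelled out as a short worked derivation. The symbols $\hat{e}[n]$ and $\hat{h}[n]$ are introduced here only for illustration (they do not appear in the text above) and denote the complex cepstra of $e[n]$ and $h[n]$ as defined by Eq. (1):

$$
\begin{aligned}
x[n] = e[n] * h[n] \;&\Longrightarrow\; X(e^{j\omega}) = E(e^{j\omega})\, H(e^{j\omega}) \\
&\Longrightarrow\; \ln X(e^{j\omega}) = \ln E(e^{j\omega}) + \ln H(e^{j\omega}) \\
&\Longrightarrow\; \hat{x}[n] = \hat{e}[n] + \hat{h}[n]
\end{aligned}
$$

The first line is the convolution theorem. The second line assumes the phase $\theta(\omega)$ is taken as a continuous (unwrapped) function of $\omega$, so that the complex logarithm of the product equals the sum of the logarithms. The third line follows from the linearity of the inverse transform in Eq. (1).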
The ability of the cepstrum of a frame of speech to separate the excitation source from the vocal tract system model, as seen above, indicates that modification for data embedding can be carried out in either of the two parts of speech. Whether the resulting cepstrum-modified speech is perceptually indistinguishable from the original speech may depend upon the extent of changes made to the pitch (the high-quefrency excitation term) and/or the formants (the low-quefrency vocal tract term), for instance.
The Fourier transform turns the convolution in Eq. (5) into a product, and the logarithm turns that product into the addition in Eq. (6). Since the excitation source typically is a periodic pulse source (for voiced speech) or noise (for unvoiced speech) while the vocal tract model has a slowly varying spectral envelope, the two terms of the sum occupy different regions after the inverse Fourier transform of the complex log spectrum in Eq. (6): the vocal tract model is transformed to lower indices in the cepstral (“time”, or quefrency) domain and the excitation to higher cepstral indices or quefrencies. Any modification carried out in the cepstral domain in accordance with data, therefore, alters the speech source, system, or both, depending on the quefrencies involved.
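The separation by quefrency can be checked numerically with a toy Python sketch. Everything in it is an assumption made for illustration: the excitation is reduced to two pulses spaced `P` samples apart (a stand-in for a periodic pulse source), the vocal tract is a short minimum-phase FIR filter, the lifter cutoff of 20 samples is arbitrary, and `complex_cepstrum_minphase` uses the principal-value logarithm, which is valid only because the phase of these particular signals never wraps.

```python
import numpy as np

def complex_cepstrum_minphase(x, n_fft=1024):
    """Complex cepstrum via Eq. (1) with the principal-value complex log.
    Valid only when the phase stays inside (-pi, pi), as for the toy signals here."""
    X = np.fft.fft(x, n_fft)
    return np.fft.ifft(np.log(X)).real

P = 40                                  # assumed pulse spacing in samples
e = np.zeros(P + 1)
e[0], e[P] = 1.0, 0.5                   # toy excitation: two pulses P samples apart
h = np.array([1.0, -0.3, 0.1])          # toy vocal tract: short minimum-phase filter
x = np.convolve(e, h)                   # Eq. (5): x[n] = e[n] * h[n]

c_x = complex_cepstrum_minphase(x)
c_e = complex_cepstrum_minphase(e)
c_h = complex_cepstrum_minphase(h)

# Lifter: keep quefrencies below the cutoff as the system part, the rest as the source part.
cutoff = 20
low = np.arange(len(c_x)) < cutoff
system_part = np.where(low, c_x, 0.0)
source_part = np.where(low, 0.0, c_x)
```

In this toy case the cepstrum of the convolved frame is the sum of the two individual cepstra, the low-quefrency part matches the filter, and the excitation appears as a peak at quefrency `P`. Real speech frames would additionally require phase unwrapping and windowing, which this sketch omits.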
Prior work employing cepstral domain feature modification for embedding includes adding a pseudorandom noise sequence for watermarking, with some success. Other prior work has observed that the statistical mean of the cepstrum varies less than the individual cepstral coefficients and that statistical mean manipulation is more robust than a correlation-based approach for embedding and detection. More recently, prior work has shown that by modifying the cepstral mean values in the vicinity of rising energy points, frame synchronization and robustness against attacks can be achieved.