Conventional telephone networks, such as the public switched telephone network (PSTN) and some mobile networks, limit audio to a frequency range of between around 300 Hz and 3,400 Hz. For example, in a typical PSTN call, an analog audio signal is converted into a digital format, transmitted through the network, and converted back to an analog signal. The analog signal may, for instance, be processed using 8-bit pulse code modulation (PCM) at an 8,000 Hz sample rate, which results in a digital signal having a frequency range of between around 300 Hz and 3,400 Hz. Generally, a signal having a frequency range of between around 0 Hz and 4,000 Hz is considered a narrowband (NB) signal.
In contrast, a wideband (WB) signal may have a greater frequency range, e.g., a frequency range between around 0 Hz and 8,000 Hz or greater. A WB signal generally provides a more accurate digital representation of analog sound. For instance, the available frequency range of a WB signal allows high frequency speech components, such as portions having a frequency range between 3,000 Hz and 8,000 Hz, to be better represented. While an NB speech signal is typically intelligible to a human listener, the NB speech signal can lack some high frequency speech components found in uncompressed or analog speech and, as such, the NB speech signal can sound less natural to human listeners.
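The NB and WB upper limits described above are consistent with the Nyquist limit, under which a sampled signal can represent frequencies only up to half its sample rate. The following is a minimal sketch of that arithmetic. The function name `pcm_properties`, the 16,000 Hz wideband sample rate, and the 16-bit wideband sample size are illustrative assumptions and do not appear in the description above.

```python
def pcm_properties(sample_rate_hz: int, bits_per_sample: int) -> dict:
    """Return the Nyquist limit and raw bit rate of a mono PCM stream."""
    return {
        # Highest frequency a sampled signal can represent (half the sample rate).
        "nyquist_hz": sample_rate_hz / 2,
        # Uncompressed bit rate of a single channel.
        "bit_rate_bps": sample_rate_hz * bits_per_sample,
    }


# 8-bit PCM at 8,000 Hz, as in the PSTN example above.
nb = pcm_properties(sample_rate_hz=8000, bits_per_sample=8)

# Assumed wideband configuration (16,000 Hz, 16-bit), for comparison only.
wb = pcm_properties(sample_rate_hz=16000, bits_per_sample=16)

print(nb["nyquist_hz"], wb["nyquist_hz"])
```

Under these assumptions, the 8,000 Hz stream cannot represent content above 4,000 Hz, whereas the 16,000 Hz stream can represent content up to 8,000 Hz.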
High frequency speech components are parts of speech, or portions thereof, that generally include frequency ranges outside that of an NB speech signal. For example, fricatives, e.g., the “s” sound in “sat,” the “f” sound in “fat,” and the “th” sound in “thatch,” and other phonemes, such as the “v” sound in “vine” or the “t” sound in “time,” may be high frequency speech components and may have at least some frequencies above 3,000 Hz or 4,000 Hz. When fricatives and other high frequency components are processed for an NB speech signal, some portions of the high frequency components (referred to hereinafter as missing frequency components) may be outside the frequency range of the NB speech signal and, therefore, not included in the NB signal. Since high frequency speech components may be only partially captured in an NB speech signal, clarity issues that can annoy human listeners, such as lisping and whistling artifacts, may be introduced or exacerbated in the NB speech signal.
Bandwidth extension (BWE) generally involves artificially extending or expanding a frequency range or bandwidth of a signal. For example, BWE algorithms may be usable to convert NB signals to WB signals. BWE algorithms are especially useful for converting NB speech signals to WB speech signals at endpoints and/or gateways, such as for interoperability between the PSTN and voice over Internet protocol (VoIP) applications.
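As a rough illustration of what extending a frequency range can mean, the sketch below uses spectral folding, a simple, well-known technique in which inserting a zero after every sample doubles the sample rate and mirrors the NB spectrum into the upper band. It is shown only as an example and is not necessarily the kind of BWE algorithm contemplated here. The helper names, the 1,000 Hz test tone, and the 160-sample frame length are assumptions.

```python
import cmath
import math

NB_RATE_HZ = 8000
WB_RATE_HZ = 16000


def upsample_by_zero_insertion(samples):
    """Double the sample rate by inserting a zero after every sample."""
    out = []
    for s in samples:
        out.extend((s, 0.0))
    return out


def magnitude_at(samples, freq_hz, sample_rate_hz):
    """Magnitude of a single DFT probe at freq_hz, normalised by length."""
    acc = sum(
        s * cmath.exp(-2j * math.pi * freq_hz * n / sample_rate_hz)
        for n, s in enumerate(samples)
    )
    return abs(acc) / len(samples)


# A 1,000 Hz tone sampled at the NB rate (160 samples = 20 ms).
nb_tone = [math.sin(2 * math.pi * 1000 * n / NB_RATE_HZ) for n in range(160)]
wb_signal = upsample_by_zero_insertion(nb_tone)

low = magnitude_at(wb_signal, 1000, WB_RATE_HZ)    # original component
image = magnitude_at(wb_signal, 7000, WB_RATE_HZ)  # mirrored around 4,000 Hz
empty = magnitude_at(wb_signal, 5000, WB_RATE_HZ)  # no energy expected here
```

The mirrored component at 7,000 Hz carries the same energy as the original 1,000 Hz tone, which shows how the upper band becomes populated. A practical BWE system would typically shape or estimate that upper band rather than use a raw mirror image.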
Detection of speech frames with high frequency speech components can be useful for generating, from an NB speech signal, a WB speech signal having enhanced clarity. For example, by detecting speech frames containing high frequency speech components and estimating missing frequency components associated with such speech frames, speech quality and sound clarity can be enhanced in a generated WB speech signal. For instance, lisping and whistling characteristics found in the NB speech signal can be alleviated in the generated WB speech signal, thereby making the WB speech signal more natural and pleasant to human listeners.
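One simple way to flag frames that are likely to contain high frequency content is to measure the zero-crossing rate, which rises with the dominant frequency of a frame. The sketch below illustrates that generic heuristic only; it is not presented as the detection method of this disclosure. The function names, the 0.5 threshold, the 20 ms frame length, and the synthetic test tones are all assumptions.

```python
import math

SAMPLE_RATE_HZ = 8000
FRAME_LEN = 160  # 20 ms at 8,000 Hz (assumed frame size)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def looks_high_frequency(frame, zcr_threshold=0.5):
    """Flag a frame whose zero-crossing rate exceeds an assumed threshold."""
    return zero_crossing_rate(frame) > zcr_threshold


def tone(freq_hz):
    """Synthetic test frame; the small phase offset avoids exact zeros."""
    return [
        math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE_HZ + 0.1)
        for n in range(FRAME_LEN)
    ]


vowel_like = tone(200)       # low frequency stand-in for a voiced sound
fricative_like = tone(3000)  # high frequency stand-in for a fricative

print(looks_high_frequency(vowel_like), looks_high_frequency(fricative_like))
```

Real speech is noisier than these pure tones, so a practical detector would likely combine several features and tune its thresholds on recorded data.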
Accordingly, in light of these difficulties, a need exists for improved methods, systems, and computer readable media for fricatives and high frequencies detection.