1. Field of the Invention
The invention relates to systems and methods for applying reverberation to audio with selection by a server of at least one reverberation filter for application to the audio (e.g., at least one input audio stream asserted to the server from at least one client device) and application of at least one selected filter to the audio by a client device (or by the server and the client device). Typical embodiments are systems and methods which implement a voice-over internet protocol (VoIP), in which audio asserted to the server from each client device is indicative of speech by an audio source in a virtual environment (e.g., a multi-player game environment) shared by all the client devices.
2. Background of the Invention
Throughout this disclosure, including in the claims, the expression performing an operation “on” signals or data (e.g., filtering, scaling, or transforming the signals or data) is used in a broad sense to denote performing the operation directly on the signals or data, or on processed versions of the signals or data (e.g., on versions of the signals that have undergone preliminary filtering prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements audio signal processing may be referred to as an audio processing system, and a system including such a subsystem (e.g., a system that generates X output signals in response to audio signals and non-audio signals, in which the subsystem generates the audio signals and the non-audio signals are received from an external source) may also be referred to as an audio processing system.
For networked virtual environments, such as social communities or massively multiplayer on-line (MMO) games, meaningful interaction through voice conversation with real people can be a valuable feature. First adopted through side-clients enabling telephone-quality, walkie-talkie-style communication, voice services are becoming more integrated and are now connecting hundreds of millions of users on PCs, game consoles and cell phones. In the next few years, voice communications through social or gaming environments will represent a significant portion of the total voice minutes. The goal of immersive voice is to make the audio component seamless and the technology transparent, creating an immediate feeling of connectedness or presence in the user. Research suggests that effective immersive voice is a function of the voice fidelity as well as the plausibility, consistency and perceptual level of engagement of the user.
Historically, immersive voice has been primarily associated with some form of spatial audio capture and reproduction. Spatialized voice communication has been extensively explored in the context of teleconferencing applications with a limited number of clients or endpoints, starting with early Bell Laboratories experiments in stereophonic telephony in 1930. In the 1990s, several experiments were conducted relying on multi-channel or binaural acquisition and rendering. Initially, voice in games and other virtual environments was typically mono and functioned much like a traditional conference call for members of a particular team. Early research efforts to build immersive communication environments include the Massive system, which built a 3D virtual environment with voice for teleconferencing. Building upon previous work in voice-over internet protocol (VoIP) and immersive teleconferencing, and taking advantage of advances in commodity computer and audio hardware, integrated voice services quickly evolved to offer high-quality spatialized voice.
Contrary to traditional teleconferencing applications involving a relatively low number of participants, recent networked applications require serving hundreds to thousands of clients in a single virtual world. Typical MMO games can support over 5000 players in one virtual world. In successful games, there can be many parallel copies of the virtual worlds, leading to millions of people playing simultaneously. The players can be spread over large real-world distances. The worlds themselves can have very dense voice scenes, with hundreds of people within visual range in popular parts of the map. In order to provide immersive voice in these environments in a scalable and cost-effective manner, it is important to consider the delivery costs, such as bandwidth and the number of servers required, and how they scale with the number of people in the environment. In order to keep server costs down, it is important to support thousands of players on each physical server. It is also important to ensure that bandwidth costs are kept low even in very crowded scenes.
As a result, scalable VoIP servers generally implement a combination of voice packet forwarding and mixing of the voice streams on the server. In mixing mode, the server creates a simplified representation of the voice scene audible to each client by grouping different voice streams together to create clusters. The audio mixture corresponding to all the voice streams grouped in a cluster is computed on the server and streamed back to the client. In general, the number of clusters is significantly lower than the number of connected clients, which limits the required bandwidth.
If the number of active talkers is small, the voice streams can be directly forwarded to each client, in which case any further processing must be carried out client-side.
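The choice between forwarding and mixing described above can be sketched as follows. This is a minimal illustration only: the threshold `FORWARD_LIMIT`, the grid-based grouping with `CELL_SIZE`, and the function name `build_client_payload` are assumptions introduced here, not details taken from any particular VoIP server.

```python
from collections import defaultdict

FORWARD_LIMIT = 3   # hypothetical: forward raw streams at or below this talker count
CELL_SIZE = 10.0    # hypothetical: cluster cell size in virtual-world units


def build_client_payload(listener_pos, talkers):
    """Return ("forward", streams) or ("mix", clusters) for one listener.

    talkers: list of (position, samples), where position is an (x, y) tuple
    and samples is a list of floats holding the current audio frame.
    """
    if len(talkers) <= FORWARD_LIMIT:
        # Few active talkers: pass the streams through unchanged, so any
        # further processing has to happen on the client.
        return "forward", [samples for _, samples in talkers]

    # Many talkers: group streams by grid cell relative to the listener.
    groups = defaultdict(list)
    for (x, y), samples in talkers:
        cell = (int((x - listener_pos[0]) // CELL_SIZE),
                int((y - listener_pos[1]) // CELL_SIZE))
        groups[cell].append(samples)

    # Mix (sum sample by sample) the streams inside each cluster.
    clusters = {cell: [sum(frame) for frame in zip(*streams)]
                for cell, streams in groups.items()}
    return "mix", clusters
```

With this grouping, the number of streams sent to a client in mixing mode is bounded by the number of occupied cells rather than by the number of talkers.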
Modeling the effects of distance, occlusion and reverberation on the voice signals is of primary importance in environments where participants can communicate realistically from multiple areas or rooms. The most advanced VoIP systems currently implement direct line-of-sight occlusion modeling as well as simplified diffraction effects resulting in unrealistic proximity cues. For MMO games where localizing teammates and enemies is of primary importance, rendering inappropriate distance cues can lead to a tactical disadvantage.
Due to the high computing cost and the difficulty of combining reverberation processing with clustering or spatial scene simplification, no previous work has been able to render convincing early sound scattering and reverberation effects capable of conveying realistic proximity cues for large numbers of participants.
Sound reverberation effects due to sound scattering off wall surfaces carry major cues related to the size of the environment and distance to sound sources. Therefore, reverberation helps users to establish a better sense of presence in virtual environments and is arguably one of the most important audio effects to simulate in virtual environment applications supporting VoIP communication.
Client-server solutions have been proposed for dynamically computing sound propagation paths between clients connected in a virtual environment, but they have been limited to applications with very few concurrent clients and cannot scale to massive environments.
In current video games, reverberation effects are either directly pre-rendered into the sound effects or implemented at run-time using dynamic artificial reverberation filters. Parameters of the reverberation decay can be directly manipulated by the sound designer to achieve a desired effect without requiring any geometrical modeling.
While simplifying the authoring process, traditional artificial reverberators suffer from a number of issues. They impose a “single room” model and constrain the shape of the decay profile (e.g., exponential). They make limited use of geometry and therefore fail to convincingly model coupled or outdoor spaces or provide finer-grain surface proximity effects. Finally, they do not scale to accommodate large numbers of concurrent effects. Recently a number of geometrical approaches have been presented to model dynamic sound reflection and diffraction interactively.
A practical approach to simulating acoustics of a virtual environment is to precompute the acoustical response at several locations throughout the environment in an off-line process so that the results can be efficiently re-used to process audio signals at run-time (e.g., during game play). A main benefit of such off-line computation is that high-order scattering (reflection/diffraction) can be simulated, providing improved proximity cues and distance perception. The acoustical response of an environment can be represented by a set of predetermined reverberation filters which can be stored, for later use (e.g., during game play) to process a dry signal in order to impart a reverberant characteristic to the dry signal. A method for generating such a set of reverberation filters is described in the paper by Nicolas Tsingos, entitled “Pre-Computing Geometry-Based Reverberation Effects for Games,” AES 35th International Conference on Audio for Games, 2009 (“Tsingos”).
As described in Tsingos, to implement such an off-line computation the acoustical response of the virtual environment can be determined (sampled) for pairs of key locations in the environment, each key location acting in turn as a source location or a listener location. At run-time the current locations of each desired source and listener pair are then used to access the closest pre-sampled pair of key locations, and the desired acoustical response associated with the closest sampled pair is returned. To properly sample discontinuities created by wall boundaries, the environment can be partitioned into zones, the acoustical response of each zone can be determined (sampled) for pairs of key locations in the zone, and a predetermined acoustical response associated with a sampled pair of key locations (closest to the locations of the desired source and listener) in a zone is returned only for desired sources and listeners located in that zone.
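The run-time lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the method of Tsingos: the sample layout (dictionaries with `zone`, `src`, `lst` and `response` keys), the `zone_of` callback, and the use of summed Euclidean distances as the closeness measure are all hypothetical. Handling of a source and listener in different zones is not covered by the passage above, so the sketch simply returns `None` in that case.

```python
import math


def closest_response(samples, zone_of, source, listener):
    """Return the pre-sampled response closest to (source, listener).

    samples: list of dicts with keys "zone", "src", "lst", "response".
    zone_of: function mapping a position to a zone identifier (hypothetical).
    """
    zone = zone_of(source)
    if zone_of(listener) != zone:
        return None  # the zone rule only covers same-zone pairs

    best, best_cost = None, math.inf
    for sample in samples:
        if sample["zone"] != zone:
            continue  # never cross a wall boundary
        # Assumed closeness measure: summed source and listener distances.
        cost = (math.dist(sample["src"], source)
                + math.dist(sample["lst"], listener))
        if cost < best_cost:
            best, best_cost = sample["response"], cost
    return best
```

Restricting the search to one zone means that a geometrically nearer sample on the other side of a wall is never returned.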
The early reflections present in reverberation filters (which simulate a virtual environment's effect on emitted sound) generally vary significantly depending on the considered pairs of source and listening points. In contrast, the later parts of such reverberation filters are generally more consistent throughout the environment. For this reason, it is customary in architectural acoustics to separate the early part and late part of the reverberation determined by a reverberation filter.
A typical, compact representation of a reverberation filter (which simulates a virtual environment's effect on emitted sound) is its energy decay profile through time (e.g., as determined by integrating the energy of an acoustic signal emitted from a source in the environment as a function of its arrival time at a listener, and quantizing the energy values into a number of decay blocks (each decay block corresponding to a different arrival time range) at the desired sampling rate as described in Tsingos). For example, the lower graph in FIG. 1 represents the energy decay profile (in one frequency sub-band) of an exemplary reverberation filter of this type. If diffuse energy exchanges are modeled, the energy of each diffuse ray can also be directly integrated into the profile during the ray-tracing step. Additional parameters (e.g., a ratio of directional-to-diffuse energy as well as principal direction of incidence at the listener for reflected sound) can also determine or characterize a reverberation filter which simulates a virtual environment's effect on emitted sound. For example, the upper graph in FIG. 1 represents a diffusiveness index (a ratio of directional-to-diffuse energy) as a function of time, of the filter whose energy decay profile is shown in the lower graph of FIG. 1.
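As a rough illustration of how an energy decay profile of this kind might be accumulated, the sketch below bins the energy of each arriving contribution by arrival time. The function name, the `(arrival_time, energy)` input format, and the fixed block duration are assumptions made for the example.

```python
def energy_decay_profile(arrivals, block_duration, num_blocks):
    """Accumulate arriving energy into decay blocks by arrival time.

    arrivals: iterable of (arrival_time_seconds, energy) pairs, e.g. one
    pair per traced ray reaching the listener (hypothetical input format).
    """
    profile = [0.0] * num_blocks
    for arrival_time, energy in arrivals:
        if arrival_time < 0:
            continue  # ignore non-causal entries
        index = int(arrival_time / block_duration)
        if index < num_blocks:
            profile[index] += energy  # integrate energy within the block
    return profile
```

Contributions arriving after the last block are dropped here; a real implementation would choose the number of blocks to cover the full decay.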
Stored data that determine a reverberation filter (for a source-listener pair in a virtual environment) can be of several different types. For example, a decay block structure including attenuation values (e.g., in dB) for different frequencies can be stored to model the filter's time-frequency envelope (e.g., an attenuation value is stored for each of a predetermined number of frequency bands, for each time window of the filter). As described in Tsingos, one can also compute and include in the stored decay block structure a principal direction and a diffusiveness index indicative of the ratio of directional-to-diffuse energy (e.g., 1 is pure directional, 0 is pure diffuse) for each time window of the filter (e.g., diffusiveness index data determining the upper graph of FIG. 1).
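One possible in-memory layout for such a decay block is sketched below. The class and field names are hypothetical, and the conversion assumes that a positive stored value denotes attenuation of amplitude in dB; the stored format may use a different convention.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DecayBlock:
    """Stored data for one time window of a reverberation filter."""
    attenuation_db: List[float]            # one value per frequency band
    direction: Tuple[float, float, float]  # principal direction of incidence
    diffusiveness: float                   # 1.0 pure directional, 0.0 pure diffuse

    def band_gains(self) -> List[float]:
        # Assumed convention: positive dB values attenuate amplitude.
        return [10.0 ** (-a / 20.0) for a in self.attenuation_db]
```

For example, `DecayBlock([0.0, 20.0], (1.0, 0.0, 0.0), 0.8).band_gains()` gives a gain of 1.0 for the first band and 0.1 for the second.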
FIG. 2 illustrates an exemplary processing pipeline (described in the above-cited Tsingos paper) for implementing a reverberation filter which simulates a virtual environment's effect on emitted sound, and applying the filter to an input signal.
The “4-band Decay Profile” identified in FIG. 2 represents a set of four attenuation values (each for a different frequency band) of the filter's time-frequency envelope, for each time window of the filter. For example, values A1 in FIG. 2 are the four attenuation values for the first time window (corresponding to the earliest reverb), and values A2 in FIG. 2 are the four attenuation values for the second time window. The relatively small set of values comprising the 4-band Decay Profile can be stored. In order to apply the filter to an input audio signal, the stored values can be read from storage, and interpolation can then be performed on the filter attenuation values for each time window to generate the “15-band Decay Profile” for the filter. The 15-band Decay Profile comprises fifteen interpolated attenuation values (one for each of the fifteen frequency sub-bands of a fifteen-band partition of the frequency domain) per time window. For example, the four values A1 in FIG. 2 for the first time window are interpolated to generate fifteen interpolated values IA1 for the first time window. In alternative implementations, the decay profile has more than (or fewer than) four bands, and/or the decay profile (having N bands) is upsampled to more than or fewer than fifteen bands (e.g., a four-band profile for each time window is upsampled to more than fifteen bands). Typically, the number of sub-bands used during the reverberation processing will depend on how many are imposed by the codec that is used to transmit the voice data (since most codecs use a sub-band/filter-bank structure to encode the audio).
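The interpolation step can be illustrated with the sketch below, which upsamples the attenuation values of one time window. Linear interpolation at evenly spaced band positions is an assumption made for the example; the interpolation rule and the band spacing are not specified above. The function name `upsample_bands` is hypothetical.

```python
def upsample_bands(values, target_bands=15):
    """Linearly interpolate per-band attenuation values to target_bands bands.

    values: the stored attenuation values of one time window (e.g. four).
    target_bands: number of output bands, assumed to be at least 2.
    """
    n = len(values)
    if n == 1:
        return [values[0]] * target_bands
    out = []
    for i in range(target_bands):
        # Position of output band i on the input band axis [0, n - 1].
        pos = i * (n - 1) / (target_bands - 1)
        lo = min(int(pos), n - 2)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[lo + 1] * frac)
    return out
```

Calling `upsample_bands(a1)` on the four stored values of a time window yields fifteen values whose first and last entries equal the first and last stored values.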
The fine-grain temporal structure of the reverberation impulse response is modeled as noise (e.g., white noise). Thus, for each time window of the filter, the fine-grain temporal structure of the filter is a burst of precomputed noise attenuated by the attenuation value (of the filter's time-frequency envelope) for the time window. For example, values N1 in FIG. 2 are the noise for the first time window (corresponding to the earliest reverb), and values N2 in FIG. 2 are the noise for the second time window. A short-time Fourier transform (STFT) or another time-to-frequency-domain transform (e.g., the Modified Discrete Cosine Transform or “MDCT”) is applied to the noise for each time window, to generate noise frequency coefficients for each time window. For example, values NC1 in FIG. 2 are the noise frequency coefficients for the first time window and values NC2 in FIG. 2 are the noise frequency coefficients for the second time window.
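The generation of noise frequency coefficients can be sketched as follows. For brevity the example applies a plain discrete Fourier transform to each non-overlapping window, which is a simplification of an STFT (no analysis window, no overlap); the function names and the uniform noise distribution are assumptions.

```python
import cmath
import random


def dft(frame):
    """Naive discrete Fourier transform of a list of real samples."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def noise_coefficients(num_windows, window_len, seed=0):
    """Return one block of noise frequency coefficients per time window."""
    rng = random.Random(seed)  # fixed seed: the noise is precomputed once
    windows = [[rng.uniform(-1.0, 1.0) for _ in range(window_len)]
               for _ in range(num_windows)]
    return [dft(window) for window in windows]
```

Because the noise does not depend on the environment, blocks such as these could be computed once and shared by every stored filter.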
The input audio signal (typically a speech signal) to be filtered by the reverberation filter consists of audio data frames, each corresponding to a different time window of the input audio signal. For example, values S1 in FIG. 2 are the input audio data frame for a first time window, values S2 in FIG. 2 are the input audio data frame for a second time window (the time window prior to the first time window), and values SN in FIG. 2 are the input audio data frame for the Nth time window (which occurs “N−1” time windows before the first time window). A short-time Fourier transform (STFT) or other time-to-frequency-domain transform is applied to each frame of input audio data, to generate input frequency coefficients for each time window. For example, values SC1 in FIG. 2 are a first frame of input frequency coefficients (for the first time window) and values SC2 in FIG. 2 are a second frame of input frequency coefficients (for the second time window).
At run-time, a dry audio signal is processed (convolved) with a pre-computed reverberation filter to produce a reverberant (“wet”) signal that conveys the acoustics of the simulated space. To determine the reverberation filter (by convolution in the frequency domain), the coefficients of each block of noise frequency coefficients are multiplied (e.g., complex multiplied) with the corresponding attenuation values of the corresponding block of filter attenuation values (e.g., coefficients of block NC1 of noise frequency coefficients are multiplied with filter attenuation values IA1 of FIG. 2). To apply the reverberation filter to the input audio signal (by convolution in the frequency domain), the values that determine each time window of the reverberation filter are multiplied with corresponding frequency components of the input audio signal (in the same time window). More specifically, for each frame of input audio data (starting with the “N”th frame of the input audio signal), the reverberation filter is applied to the frame and to each of the N−1 previous frames (each having a different delay time relative to the frame). For each frame of input (dry) audio data (starting with the “N”th frame of the input audio signal), the frequency components of the input audio data frame (and of the N−1 previous frames) are multiplied with the corresponding values (for the relevant time window) of the reverberation filter, and the products are summed (over all time windows) to generate a frame of output (wet) audio data (the “output frame” indicated in the diagram labeled “Frequency-domain reverberation” in FIG. 2). In the Frequency-domain reverberation diagram of FIG. 2, the input audio signal frame labeled “SC1” is the Nth frame of input frequency coefficients (e.g., values SC1 of FIG. 2), the input audio signal frame labeled “SC2” is the “N−1”th frame of input frequency coefficients (e.g., values SC2 of FIG. 2, delayed by one delay time, t), and the input audio signal frame labeled “SCN” is the first frame of input frequency coefficients (e.g., values SCN of FIG. 2, delayed by delay time (N−1)t).
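The multiply-and-sum structure described above can be sketched as follows. The sketch works directly on frames of frequency coefficients and omits the forward and inverse transforms as well as the zero-padding and overlap-add needed to avoid circular aliasing in a real block convolution. The names `make_filter` and `FrequencyDomainReverb` are hypothetical, and frames that are not yet available (before N frames have been received) are treated as silence.

```python
from collections import deque


def make_filter(noise_coeffs, gains):
    """Scale each block of noise coefficients by its per-bin gains.

    noise_coeffs[k][b] and gains[k][b] refer to time window k, frequency bin b.
    """
    return [[nc * g for nc, g in zip(nc_block, gain_block)]
            for nc_block, gain_block in zip(noise_coeffs, gains)]


class FrequencyDomainReverb:
    def __init__(self, filter_blocks):
        self.filter_blocks = filter_blocks
        # Holds the newest N input frames; index 0 is the current frame.
        self.history = deque(maxlen=len(filter_blocks))

    def process(self, input_coeffs):
        """Consume one dry frame of coefficients, return one wet frame."""
        self.history.appendleft(list(input_coeffs))
        out = [0j] * len(input_coeffs)
        # The frame delayed by k windows meets filter time window k.
        for frame, block in zip(self.history, self.filter_blocks):
            for b in range(len(out)):
                out[b] += frame[b] * block[b]
        return out
```

Here the newest frame plays the role of SC1, the frame before it SC2, and so on; each is multiplied with the filter block for the matching delay, and the products are summed into the output frame.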
In various embodiments of the invention, application of a reverberation filter to an audio signal (by a client and/or a server) is performed in the frequency domain (as in the FIG. 2 example) or in the time domain, or hybrid time-frequency domain reverberation filtering is performed.
The inventors have recognized three main issues to be addressed in order to provide reverberation processing for massively multi-participant VoIP systems.
First, voice conferencing servers for massive environments generally mix or pass through the voice streams from clients based on voice activity and local spatial density of the clients. As a result, the reverberation processing can be alternatively performed on the server (when the server is mixing) or on the client (when the server is forwarding). When mixing, the server typically also performs some form of spatial simplification of the voice scene for each client by grouping neighboring sources into clusters. Choosing an appropriate reverberation for a group of sources is thus a key issue in this context.
Second, given the prohibitive cost of reverberation processing it is desirable to split the required processing between the client and server. Configuring the server to provide reverb indicative of a specific early scattering for each pair of connected clients on the server will provide better proximity cues and distance impressions. Providing the complementary late part of the reverberation processing on the client will provide improved impression of the virtual acoustical space. Since late reverberation varies more smoothly (than early reverberation) across a typical virtual environment, the same late reverberation processing can be used for groups of nearby sources. A challenge is to adapt the processing to varying clusters (of voice streams, or other sound streams, from multiple sources) and provide the required information to the client so that early and late reverberation can be appropriately recombined. The server must pass on to the client the information about which reverberation filter to use for each cluster as well as provide information that the client can use to reconstruct the dry signal (in the mixing case) so as to apply the late part of the reverberation.
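The division of a filter into an early part (for the server) and a complementary late part (for the client) can be pictured with the sketch below. It is only an illustration: the function name, the fixed split time, and the idea of returning the index of the first late block so that the client can preserve the correct delay are assumptions, and the sketch does not address how the dry signal is reconstructed on the client.

```python
def split_filter(blocks, block_duration, split_time):
    """Split per-window filter blocks into early and late parts.

    Returns (early, late, late_offset), where late_offset is the index of
    the first late block, i.e. the delay (in windows) the client must keep.
    """
    boundary = round(split_time / block_duration)
    boundary = max(0, min(len(blocks), boundary))
    early = blocks[:boundary]   # source-listener specific, applied by the server
    late = blocks[boundary:]    # smoother across space, applied by the client
    return early, late, boundary
```

For example, with 20 ms blocks and a hypothetical split time of 80 ms, the first four blocks form the early part and the remaining blocks form the late part, with a late offset of four windows.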
Third, bandwidth cost is a primary constraint. As a result, the information required for reverberation processing on the client must be provided with minimal bandwidth requirements.