1. Field of the Invention
This invention relates to the field of digital audio, and, more specifically, to digital audio applications in a network environment.
Sun, Sun Microsystems, the Sun logo, Sparc, Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
2. Background Art
Computers and computer networks are used to exchange information in many fields such as media, commerce, and telecommunications, for example. One form of information that is commonly exchanged is audio data, i.e., data representing a digitized sound or sequence of sounds. Voice telephone transmissions and video conferencing feeds are examples of telecommunication information which include audio data. Other examples of audio data include audio streams or files associated with digitized music, radio and television performances, or portions thereof, though audio data may be associated with any type of sound waveform. It is also possible to synthesize sound waveforms by artificially generating audio data having desired magnitude and frequency characteristics.
For the purposes of this discussion, the exchange of information between computers on a network occurs between a computer acting as a “transmitter” and a computer acting as a “receiver.” In audio applications, the information contains audio data, and the services provided by the transmitter are associated with the processing and transmission of the audio data. A problem with current network systems is that multiple services, provided by one or more computers acting as transmitters, may provide audio data using different audio protocols. The complexity of the receiver is necessarily increased by the need to accommodate each of the different audio protocols. Further problems associated with the transmission of audio data over a network include errors in the audio signal caused by packet loss, as well as undesirable latency in real-time, or time-critical, audio-related applications such as video conferencing. The following description of audio technology and of an example network scheme is provided to give a better understanding of the problems involved in transmitting audio data over a network.
Audio data technology allows for the capture, storage, transmission and reproduction of sound. To understand how sound can be represented electronically as audio data, it is useful to understand the general nature of sound. Sound refers to a pressure wave propagated through a medium, such as the air. A pressure wave of this sort may be generated, for example, by the vibration of the vocal cords in a human throat, as when speaking or singing, or by a collision of two objects, where a portion of the energy of the collision is dissipated as a pressure wave. The medium through which the pressure wave is propagated attenuates the pressure wave over time in accordance with the physical characteristics, or “acoustic properties,” of the medium.
When pressure waves meet the eardrum of a human ear, the eardrum flexes and vibrates in response. The vibration, or modulation, in the eardrum is interpreted by the brain as a sound. An electronic capture mechanism, such as a microphone, has a similar mechanism for detecting pressure waves and generating an electronic signal containing corresponding audio data. A sensor mechanism in the microphone is physically modulated by a pressure wave, and the modulation is electro-mechanically transformed into an electronic signal. The electronic signal may be transmitted or stored directly, or, as is now typically done, the electronic signal may first be digitized (i.e., sampled and quantized). A sound is reproduced from audio data by transforming the electronic signal back into a pressure wave, for example, by electro-mechanically modulating a membrane to create the appropriate pressure wave.
The electronic signal corresponding to a captured sound may be graphically represented by a sound waveform, such as sound waveform 100 illustrated in FIG. 1A. The vertical axis of FIG. 1A, as well as that of FIGS. 1B and 1C, represents the amplitude of the sound waveform, with the horizontal axis representing time over a period of one millisecond. Sound waveform 100 is a continuous waveform. FIGS. 1B and 1C illustrate discrete sampled waveforms generated by sampling sound waveform 100 at sampling rates of twenty-four kilohertz and eight kilohertz, respectively.
A sampling rate is expressed in hertz or samples per second. A sampling rate of twenty-four kilohertz implies that twenty-four thousand samples are taken per second, or one sample is taken approximately every forty-two microseconds. As one would expect, the sampled waveform of FIG. 1C, with a sampling rate of eight kilohertz, has one-third as many samples as the sampled waveform of FIG. 1B.
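The arithmetic above can be checked with a short sketch. The function names below are illustrative only and do not appear in the description; the one-millisecond window is taken from the horizontal axis of FIGS. 1A-1C.

```python
def sample_period_us(rate_hz: float) -> float:
    """Return the time between consecutive samples, in microseconds."""
    return 1_000_000.0 / rate_hz


def samples_in(duration_ms: float, rate_hz: float) -> int:
    """Return the number of whole sample periods that fit in duration_ms."""
    return int(duration_ms * rate_hz / 1000.0)


# Twenty-four kilohertz: roughly one sample every forty-two microseconds.
print(round(sample_period_us(24_000), 1))  # 41.7

# Over the one-millisecond window, eight kilohertz yields one-third
# as many samples as twenty-four kilohertz.
print(samples_in(1.0, 24_000), samples_in(1.0, 8_000))  # 24 8
```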
Higher sample rates generally entail correspondingly greater resource costs in terms of storage and transmission bandwidth requirements to accommodate the data associated with the larger number of samples. However, a higher sampling rate generally provides a more precise reproduction of a sound waveform. The ability to reproduce an original waveform from a set of sampled data is determined by the frequency characteristics of the original waveform and the Nyquist limit of the sample rate. Every signal or waveform has frequency characteristics. A relatively fast changing signal level is associated with higher frequency behavior, whereas a signal level that changes slowly is associated with lower frequency behavior. Most signals have frequency contributions across a broad spectrum of frequencies. The frequencies associated with audible signals, and hence sound waveforms, reside generally within the range of 20 to 20,000 hertz.
According to Nyquist theory, an original waveform can be reconstructed from sampled data if the original waveform does not contain frequencies in excess of one-half of the sampling rate. That is, if an original waveform is bandlimited below ten kilohertz, a sampling rate of twenty kilohertz or higher would be sufficient to reproduce the original waveform without distortion. When relatively low sampling rates are used, it is common to pre-filter waveforms to bandlimit frequency behavior and prevent or diminish distortion caused by the sampling process. However, filtering of a sound waveform may result in lower sound quality because higher frequency components of the waveform are attenuated.
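The Nyquist condition, and the distortion that results when it is violated, can be illustrated as follows. This is a minimal sketch for a single pure tone; the function names are hypothetical, and the strict inequality in the check is a conservative assumption rather than something stated in the description.

```python
def nyquist_limit_hz(rate_hz: float) -> float:
    """Return one-half of the sampling rate."""
    return rate_hz / 2.0


def is_representable(freq_hz: float, rate_hz: float) -> bool:
    """True if a tone at freq_hz lies below the Nyquist limit of rate_hz."""
    return freq_hz < nyquist_limit_hz(rate_hz)


def alias_frequency_hz(freq_hz: float, rate_hz: float) -> float:
    """Apparent frequency of a pure tone after sampling at rate_hz.

    Tones above the Nyquist limit fold back into the range
    [0, rate_hz / 2], which is the distortion pre-filtering avoids.
    """
    folded = freq_hz % rate_hz
    return min(folded, rate_hz - folded)


# A waveform bandlimited below ten kilohertz, sampled at twenty kilohertz.
print(is_representable(9_000, 20_000))   # True

# A five-kilohertz tone sampled at eight kilohertz is not representable
# and appears as a three-kilohertz tone instead.
print(is_representable(5_000, 8_000))    # False
print(alias_frequency_hz(5_000, 8_000))  # 3000
```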
Different audio protocols may use different sample rates for audio data. A receiver that is generating sound output from audio data needs to be able to handle the different possible sample rates of the different audio protocols to maintain correct timing intervals between samples during reconstruction of the sound waveform from the audio data samples.
Another aspect of audio data that differs between audio protocols is the quantization scheme used to quantize or digitize the amplitude of the sampled audio data into digital values that can be represented by a fixed number of bits. The number of bits used to represent each sample of audio data is the resolution of the given audio protocol. Typically, for M bits of resolution, 2^M possible digital values or quantization levels exist for sample quantization. For example, eight bits of resolution provide 2^8, or 256, quantization levels. Higher resolution typically provides for better sound reproduction as sound samples are more precisely represented. Higher resolution also entails higher costs in storage resources and transmission bandwidth to support the larger number of bits.
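The relationship between resolution, quantization levels and bandwidth cost can be sketched as below. The function names are illustrative, and the bit-rate figure assumes uncompressed samples with no packet overhead, which the description does not specify.

```python
def quantization_levels(bits: int) -> int:
    """Return the number of quantization levels for a given resolution."""
    return 2 ** bits


def raw_bit_rate(rate_hz: int, bits: int, channels: int = 1) -> int:
    """Return bits per second for uncompressed samples (assumed: no overhead)."""
    return rate_hz * bits * channels


print(quantization_levels(8))   # 256
print(quantization_levels(16))  # 65536

# Doubling the resolution doubles the raw bandwidth at a fixed sample rate.
print(raw_bit_rate(8_000, 8), raw_bit_rate(8_000, 16))  # 64000 128000
```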
Just as there are different possible resolutions for audio data, there are also different quantization schemes for distributing the quantization levels across an amplitude range. FIGS. 2A and 2B illustrate examples of linear and non-linear quantization functions, respectively. The horizontal axis of each of FIGS. 2A and 2B represents the sample value of the audio data prior to quantization. The vertical axis of each figure represents the quantization levels of the audio data after quantization is performed. A stair-step function is implemented where all sample values within fixed ranges along the horizontal axis are assigned to discrete quantization levels on the vertical axis.
In the linear quantization function of FIG. 2A, the quantization levels are evenly distributed across the range of values. The result is a stair function that approximates a straight line having a slope of one. In the non-linear quantization function of FIG. 2B, quantization levels are distributed with greater numbers of quantization levels near zero amplitude and fewer quantization levels as the amplitude increases. The result is a stair function that approximates a parabolic or logarithmic curve. An advantage of non-linear quantization schemes is that there is greater relative resolution near zero amplitude, providing improvements in signal-to-noise ratio. A disadvantage of non-linear quantization schemes is that they are more complex to implement than the linear scheme. Different audio data protocols may specify a linear quantization scheme or one of several different commonly-used non-linear quantization schemes.
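The difference between the two kinds of quantization function can be illustrated with a sketch. The description does not name a particular non-linear scheme; mu-law companding with mu = 255 is used here only as one commonly used logarithmic example, and the normalization of samples to the range [-1.0, 1.0] is likewise an assumption.

```python
import math


def quantize_linear(x: float, bits: int = 8) -> int:
    """Map x in [-1.0, 1.0] to one of 2**bits evenly spaced level indices."""
    levels = 2 ** bits
    x = max(-1.0, min(1.0, x))
    return min(levels - 1, int((x + 1.0) / 2.0 * levels))


def compress_mu_law(x: float, mu: float = 255.0) -> float:
    """Logarithmic compression: expands small amplitudes, squeezes large ones."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def quantize_mu_law(x: float, bits: int = 8, mu: float = 255.0) -> int:
    """Non-linear quantization: compress first, then quantize evenly."""
    return quantize_linear(compress_mu_law(x, mu), bits)


# Two quiet samples near zero amplitude fall on the same linear level
# but on distinct non-linear levels.
quiet_a, quiet_b = 0.001, 0.005
print(quantize_linear(quiet_a) == quantize_linear(quiet_b))  # True
print(quantize_mu_law(quiet_a) == quantize_mu_law(quiet_b))  # False
```

The extra logarithm per sample is a small instance of the added implementation complexity noted above for non-linear schemes.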
Audio data has been described above in terms of a single sound waveform. It is possible for multiple sounds, such as multiple voices or instruments, to be represented in a single composite sound waveform by superposition of the individual sound waveforms associated with each sound. The composite waveform thus contains the sound information of all of the sounds. It is also possible to send audio data with multiple “channels.” Each channel of audio data contains the sound information (e.g., digitized samples) of a sound waveform. Each channel may be output from a different audio output device (e.g., speaker), or multiple channels may be “mixed” into a composite sound waveform for output from a single audio output device.
Multiple channels are often used to provide a spatial effect for sound reproduction, such as with two-channel stereo audio or four-channel surround sound. The spatial effect is created by outputting specific audio channels from pre-positioned speakers. Stereo audio, for example, specifies a left channel and a right channel, meaning that a first channel of audio data is reproduced from a speaker positioned to the left of a listener, and a second channel of audio data is reproduced from a speaker positioned to the right of the listener. More complex systems may use greater numbers of channels and output devices. The particular channel arrangement may vary for different audio protocols.
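Mixing multiple channels into a composite waveform for a single output device amounts to superposition of the per-channel samples. The sketch below assumes sixteen-bit signed samples and clips sums that exceed that range; both choices, and the function name, are illustrative assumptions rather than part of the description.

```python
def mix_channels(channels, lo=-32768, hi=32767):
    """Sum equal-length channels sample by sample, clipping to [lo, hi]."""
    if not channels:
        return []
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    # zip(*channels) yields one tuple per sample instant, one entry per channel.
    return [max(lo, min(hi, sum(frame))) for frame in zip(*channels)]


left = [100, -200, 30000]
right = [50, 100, 10000]
print(mix_channels([left, right]))  # [150, -100, 32767]
```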
As has been described, audio protocols may vary in sample rate, bit resolution, quantization scheme, and channel arrangement. These variations allow for a large number of different possible audio protocols. It becomes problematic for a receiver on a network to handle all possible audio protocols that might be used by different transmitters acting as audio data sources on the network. The problems associated with multiple audio protocols are described below with reference to the sample network system illustrated in FIG. 3. FIG. 3 illustrates a sample network system comprising multiple transmitters 300A-300C for sourcing audio data and a single receiver 303 acting as a destination computer. Receiver 303 is equipped with one or more speakers for providing sound output associated with received audio data.
In the example of FIG. 3, transmitters 300A, 300B and 300C, and receiver 303 are coupled together via network 302, which may be, for example, a local area network (LAN). Transmitter 300A transmits audio data along network connection 301A to network 302 using audio protocol A. Transmitter 300B transmits audio data along network connection 301B to network 302 using audio protocol B. Transmitter 300C transmits audio data along network connection 301C to network 302 using audio protocol C. Thus, receiver 303 may receive audio data over network connection 305 from network 302 under any of audio protocols A, B or C, as well as any other protocols used by other transmitters connected to network 302, or used by multiple services embodied within one of transmitters 300A-300C.
Receiver 303 may be equipped with different hardware for audio processing to support each audio protocol, but this increases the complexity of the receiver, and necessitates hardware upgrades when new audio protocols are developed. For systems wherein it is a goal to minimize processing and hardware requirements for a receiver, the added complexity of supporting multiple protocols is undesirable.
In addition to the problems associated with multiple audio protocols, audio systems also suffer from problems associated with latency and packet loss. Latency refers to the time delay between the receipt of audio data at a receiver and the output of a corresponding pressure wave from an audio output device of the receiver. Audio latency is particularly problematic in applications where the audio output is intended to be synchronized with other events, such as video output. For example, latency in the audio portion of a video teleconferencing communication or a television transmission may result in a timing mismatch between the visual cues on a display, such as a character's mouth moving, and the associated audio output, such as the speech associated with the mouth movements. Such timing mismatches may result in an unsatisfactory audio/visual presentation.
Packet loss is a common occurrence on many network connections, and can result in the loss of many samples of audio data. Audio data is transmitted over a network as a group of samples encapsulated within a data packet. When a packet is received at a receiver, the samples are extracted from the packet and used to reconstruct the sound waveform. When packet loss occurs, many samples of audio data are left out of the reconstruction of the sound waveform.
For streaming audio, the audio data is extracted from its respective packet and processed immediately for output. Typically, it is not possible for a receiver to request that a transmitter retransmit a lost packet, and for the transmitter to respond with the lost packet in sufficient time for the receiver to correct the audio output. The corresponding portion of the sound waveform would already have been emitted from the output device as a pressure wave. The loss of audio data through packet loss can result in unwanted degradation of output sound quality, usually in the form of periods of silence, particularly with poor network connections where packet loss occurs relatively frequently.
A method and apparatus of supporting an audio protocol in a network environment is described. In an embodiment of the invention, audio processing and hardware requirements associated with a receiver are minimized by specifying a single audio protocol for transmission of audio data between transmitters on a network and the receiver. The protocol specifies a sampling rate, bit resolution and quantization scheme which allow for high sound quality and further minimize the complexity of the receiver. Transmitters are equipped with drivers to provide for conversion of audio data into the designated protocol as needed.
Aspects of the designated protocol are provided to compensate for problems associated with transmitting audio streams over a network. The designated protocol specifies a format for interleaving audio samples within data packets to minimize errors which are the result of consecutive missing audio data samples due to packet loss. The receiver may further compensate for missing audio data samples through interpolation. In accordance with the designated protocol, a sequence size is specified to govern how the audio data is processed. The transmitter controls the sequence size adaptively to maintain audio latency within a limit specified for each audio application. The designated protocol also provides for determination of a mix mode and a number of channels for specifying how audio data with multiple channels is mixed and routed among multiple audio output devices.
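The benefit of interleaving combined with interpolation can be illustrated with a sketch. The packet layout of the designated protocol is not given in this section, so the round-robin layout below, the use of None to mark a lost packet, and the neighbor-averaging concealment are all assumptions chosen for illustration.

```python
def interleave(samples, num_packets):
    """Packet k carries samples k, k + num_packets, k + 2 * num_packets, ..."""
    return [samples[k::num_packets] for k in range(num_packets)]


def deinterleave(packets, total):
    """Rebuild the sequence; a lost packet (None) leaves isolated None gaps."""
    n = len(packets)
    out = [None] * total
    for k, payload in enumerate(packets):
        if payload is not None:
            out[k::n] = payload
    return out


def conceal(samples):
    """Fill each gap from the nearest received neighbors on either side."""
    out = list(samples)
    for i, s in enumerate(samples):
        if s is not None:
            continue
        prev = next((samples[j] for j in range(i - 1, -1, -1)
                     if samples[j] is not None), None)
        nxt = next((samples[j] for j in range(i + 1, len(samples))
                    if samples[j] is not None), None)
        if prev is None and nxt is None:
            out[i] = 0              # nothing received at all: output silence
        elif prev is None:
            out[i] = nxt
        elif nxt is None:
            out[i] = prev
        else:
            out[i] = (prev + nxt) / 2
    return out


original = [0, 10, 20, 30, 40, 50, 60, 70]
packets = interleave(original, 4)
packets[1] = None                   # simulate loss of one packet in transit
received = deinterleave(packets, len(original))
print(received)                     # [0, None, 20, 30, 40, None, 60, 70]
print(conceal(received))
```

Without interleaving, the same lost packet would remove a run of consecutive samples, leaving a gap that neighbor averaging cannot plausibly fill.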