In U.S. Pat. No. 6,653,545 ('545), Redmann et al. teach a mechanism enabling remotely situated musicians to collaborate using electronic instruments, for instance, commonly available MIDI devices.
The '545 system operates by intercepting the musical events generated by the locally performing musician, e.g. his MIDI controller's output stream. These musical events are sent to each of two places: First, and immediately, to all of the remote musicians via a communication channel. Second, to a local delay where the musical event is held for substantially the same amount of time as is required for the communication channel to transport the events to the others. Upon arrival at the remote location(s), and upon expiration of the local delay, the musical event is played at each of the stations; e.g., the MIDI stream is sent to a MIDI sound generator at each location.
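The two-path handling described above can be illustrated with a minimal sketch. It is not the '545 implementation: the `Station` class, the `perform` function, and the fixed 40 ms channel latency are hypothetical, and time is modeled as plain integers rather than a real clock.

```python
import heapq


class Station:
    """Hypothetical station that plays each event at its scheduled time."""

    def __init__(self):
        self._pending = []  # min-heap of (play_time_ms, event)
        self.played = []    # (play_time_ms, event), in playback order

    def schedule(self, play_time_ms, event):
        heapq.heappush(self._pending, (play_time_ms, event))

    def advance_to(self, now_ms):
        """Play every event whose scheduled time has been reached."""
        while self._pending and self._pending[0][0] <= now_ms:
            self.played.append(heapq.heappop(self._pending))


def perform(event, now_ms, local, remote, channel_latency_ms):
    # Path 1: sent immediately; it arrives after the channel latency.
    remote.schedule(now_ms + channel_latency_ms, event)
    # Path 2: held locally for substantially the same amount of time.
    local.schedule(now_ms + channel_latency_ms, event)


local, remote = Station(), Station()
perform("note_on C4", 0, local, remote, channel_latency_ms=40)
perform("note_off C4", 250, local, remote, channel_latency_ms=40)
for station in (local, remote):
    station.advance_to(1000)
print(local.played == remote.played)
```

Because both paths add the same delay, each station renders the event at the same moment, which is the property the local delay is meant to provide.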
The use of MIDI or a similar event-driven representation of a musician's performance has the strong advantage of being a compact data format. A dataset produced by such a system is considerably smaller than essentially all other representations of musical performance, including MP3 files.
However, the '545 system suffers from two significant drawbacks:
First, there are significantly more musicians whose instrument of choice is an acoustic instrument and who own no acoustic-performance-to-MIDI converter. This is not to say such converters do not exist. For instance, MIDI controllers that generate musical events from a musician's guitar performance are available, such as the G-50 manufactured by Roland Corporation U.S. of Los Angeles, Calif. and the GI-20 manufactured by Yamaha Corporation of America of Buena Park, Calif. MIDI events generated by these devices are best rendered on their companion instrument synthesizers 180, Roland's XV 2020 and Yamaha's MU 90R, respectively. Additionally, devices that are played like wind or valve instruments, but generate MIDI controller signals, are also available. However, though the “converter boxes” are easily obtained, they are found on only a small fraction of guitars and other traditional acoustic instruments. Moreover, even for musicians who do use MIDI devices, it is frequently the case that their remote jam partners do not have the same MIDI sound generators or software synthesizers. As a result, the remote musicians do not hear the same instrumentation that the originating musician hears and intends.
Second, while the '545 patent teaches a Voice over Internet Protocol (VoIP) approach to providing an intercom with which participants can talk to each other, this technology is completely unsuitable for vocal performances. The buffering typical of the receiving end of a streaming media implementation adds a relatively large amount of latency, generally in excess of 50 ms and often amounting to several seconds, to allow late packets to take their place in the stream and to provide time for the re-send of a dropped packet to be requested and performed.
Nonetheless, individuals have attempted long distance jams using VoIP services such as Skype, by Skype Technologies, S.A. of Luxembourg. The results, however, have been reported as musically unsatisfying, primarily because of the latencies encountered.
A lower latency approach is to use “plain old telephone service” (POTS), which provides a low latency, high reliability transport for audio. Two musicians, each with a speakerphone, can jam. Such a solution suffers two primary drawbacks. First, the bandwidth of POTS is limited to a little less than 4,000 Hz, which seriously degrades the perceived audio quality of music. Second, although the latency is typically small, each performer hears the other play ‘behind’ the beat; that is, each hears the other performing late. In an otherwise unregulated jam, both players will ‘slow down’ to accommodate the other's tardiness, and the result is an ever-slower tempo. Even if one of the players has a metronome to govern the beat, the remote player will sound to the metronome-owning musician as if he is playing late by twice the communications channel latency.
Though VoIP services today do not typically achieve latencies as low as POTS, that is not expected to remain the case. A number of improvements to Internet packet handling have been defined and, over the coming years, will be pervasively fielded. These include prioritization for VoIP data packets, so that VoIP data are provided low latency routes, and priority handling by intermediate routers, so that packets are not queued behind, say, file transfers, including music downloads. Such improvements to the Internet protocols will result in VoIP transport latencies approaching those of POTS systems.
Currently, because of bandwidth limitations common over network connections such as the Internet, it is desirable for a VoIP connection that the audio be compressed, or coded. Upon receipt at a remote station, the coded audio signal requires decompression, or decoding. The matched pair of algorithms that COmpresses (or COdes) and DECompresses (or DECodes) a signal in this way is known collectively as a CODEC, and there are many well known CODEC algorithms. CODECs may be implemented in hardware, or software, or a combination of the two.
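The "twice the latency" figure for the metronome case can be checked with a short sketch. The function name, the 500 ms beat spacing, and the 30 ms one-way latency are illustrative assumptions, and the remote player is assumed to play exactly in time with what he hears.

```python
def heard_back_times(beat_times_ms, one_way_latency_ms):
    """Times at which the metronome owner hears the remote player's notes."""
    # The remote player hears each beat one channel latency late and plays on it.
    remote_plays = [t + one_way_latency_ms for t in beat_times_ms]
    # Those notes need one more channel latency to travel back.
    return [t + one_way_latency_ms for t in remote_plays]


beats = [0, 500, 1000, 1500]  # metronome beats at 120 beats per minute
heard = heard_back_times(beats, one_way_latency_ms=30)
lateness = [h - b for h, b in zip(heard, beats)]
print(lateness)
```

Under these assumptions every note comes back late by the same amount, two channel latencies, no matter how accurately the remote player performs.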
Presently, the most popular CODEC for audio is MP3. MP3 achieves a high degree of compression by disregarding information corresponding to attributes of the audio that human beings don't notice. MP3 is readily able to compress digitized audio to less than a tenth its uncompressed size, and to restore the audio signal to a good facsimile of the original, at least as far as most human listeners are concerned.
However, many CODECs such as MP3 require all of the original compressed audio stream to be received for reconstruction of a continuous audio signal. There is little the MP3 CODEC can do during an interval for which no representative packet is received: the reconstructed audio will cut out. The Internet is an environment prone to packet loss. To overcome this, when audio is streamed over the Internet (compressed or not), consecutive packets are buffered at the receiving end for a relatively long period of time, such as ten seconds. By requiring that this much audio be accumulated in a buffer before it is played, there is an opportunity for the receiving station to request retransmission of a missing packet, and to still have time for its retransmission and receipt before it is needed.
However, while a deep receive buffer works well for one-way communication, it is not a good solution for acoustic performers collaborating in real time. The additional delay required by the receive buffer will reduce or destroy any real time effect. In order to jam effectively, musicians require a very short receive buffer, and there is typically no time for retransmission of a missing packet.
In addition to inherent unreliability of packet delivery, networks such as the Internet also have communication latencies that can vary by packet. Packets can even be delivered out of order.
To resolve these issues, selection criteria for a CODEC should emphasize an ability to continue the real time musical or vocal performance with an aesthetically tolerable handling of dropped or late packets.
In their article “A Survey of Packet-Loss Recovery Techniques,” IEEE Network Magazine, September/October 1998, Perkins et al. describe a variety of methods by which packet loss of an audio stream may be handled. In the context of wireless telephony, they discuss compensation techniques for packet loss in a voice stream as a hierarchy of increasingly sophisticated schemes:
The simplest scheme, when a packet is lost, is just to play silence. If the transmission was substantially silent just before the packet was lost, this may represent a good substitute. This is implemented exclusively by the receiving portion of the CODEC.
During a vocal or instrumental performance, however, for a significant portion of the time a note is either being held and undergoing a prolonged decay, or is being sustained. A sudden transition to silence and back again can produce a very unaesthetic pop.
Perkins, et al. point out that the physiology of human hearing actually reacts better to an interval of white noise, instead of silence, replacing a missing packet. Preferably, the noise has an amplitude similar to that of the prior packet.
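A minimal sketch of noise substitution follows. It is an illustration, not a method taken from Perkins et al.: the helper names are hypothetical, packets are assumed to be non-empty lists of float samples, and uniformly distributed random samples stand in for white noise.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def noise_fill(prior_packet, rng):
    """Return noise as long as prior_packet and with a matching expected RMS."""
    # Uniform noise on [-peak, peak] has an RMS of peak / sqrt(3).
    peak = rms(prior_packet) * math.sqrt(3.0)
    return [rng.uniform(-peak, peak) for _ in prior_packet]


prior = [0.2, -0.1, 0.15, -0.25]
substitute = noise_fill(prior, random.Random(0))
```

Scaling by the prior packet's RMS keeps the substituted interval at roughly the loudness the listener just heard, so a silent prior packet yields a silent substitute.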
Another crude but sometimes effective scheme used in telephony is to replay the previous packet. Again, during a relatively quiet portion of the transmission, this will work well. During an unformed, noisy interval, it also works well. This technique is also implemented exclusively by the receiving portion of the CODEC.
For a vocal performance or an instrumental performance having a slow or moderate tempo, repeating the prior packet may sometimes work well, but audio elements representing a fast attack like a drum beat or a guitar string pluck may sound like the performer has played a second note, which may be more disruptive than noise of a similar amplitude.
If repetition is employed to compensate for multiple consecutive lost packets, then the amplitude used should fade with each repetition. For a performance on a musical instrument such as a guitar or piano, the rate at which repeated packets are faded preferably resembles the observed decay rate of the instrument.
In their review, Perkins et al. also discuss how the transmission portion of the CODEC can help compensate for missing packets.
Interleaving is a technique in which data representative of N consecutive intervals is spread over time: its transmission is interleaved with that of additional groups of N consecutive intervals. If a single packet is lost, exactly one of the intervals from each group of N consecutive intervals is lost. This can be of value if disguising or overcoming a frequent loss of a single short interval produces a better result than an occasional loss of N consecutive intervals. Interleaving has the detrimental effect of introducing a receive buffer delay corresponding to (N*N) intervals, so that even when N=2, the intervals would need to be very short for this to be tolerable.
Forward Error Correction (FEC) is another technique the sending portion of the CODEC can use to improve handling of lost or delayed packets. All FEC techniques introduce some redundant data in each packet that can aid in the reconstruction of previously sent but subsequently lost packets. In its simplest version, each packet contains not only its own new data representative of an interval, but also fully repeats the data representing the prior packet's interval. While this introduces a 100% increase in the data that must be transmitted, it adds a receive buffer delay of only one interval.
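Repetition with fading can be sketched as follows. The function name and the fixed `fade` factor of 0.5 are assumptions for illustration; as noted above, a practical factor would preferably be chosen to resemble the decay of the instrument. A lost packet is represented by `None`.

```python
def conceal_by_repetition(packets, frame_len, fade=0.5):
    """Replace each lost packet (None) with a faded copy of the last good one."""
    output, last_good, gain = [], None, 1.0
    for packet in packets:
        if packet is not None:
            last_good, gain = list(packet), 1.0  # a good packet resets the fade
            output.append(list(packet))
        elif last_good is None:
            output.append([0.0] * frame_len)     # nothing to repeat yet: silence
        else:
            gain *= fade                         # fade further on each repetition
            output.append([s * gain for s in last_good])
    return output


stream = [[1.0, -1.0], None, None, [0.5, 0.5]]
print(conceal_by_repetition(stream, frame_len=2))
```

In this sketch two consecutive losses are filled at one half and then one quarter of the last good packet's amplitude, and the next packet that arrives is played unaltered.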
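One way to realize the interleaving described above is a simple N-by-N block interleaver, sketched here. The function names and the layout, in which packet j carries interval j of every group, are assumptions chosen for illustration; a lost packet is represented by `None`.

```python
def interleave(intervals, n):
    """Spread n groups of n consecutive intervals across n packets."""
    assert len(intervals) == n * n
    # Packet j carries interval j from each of the n groups.
    return [[intervals[g * n + j] for g in range(n)] for j in range(n)]


def deinterleave(packets, n):
    """Restore the original order; intervals from lost packets stay None."""
    restored = [None] * (n * n)
    for j, packet in enumerate(packets):
        if packet is None:
            continue  # this packet was lost in transit
        for g, interval in enumerate(packet):
            restored[g * n + j] = interval
    return restored


packets = interleave(list(range(9)), n=3)  # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
packets[1] = None                          # simulate the loss of one packet
print(deinterleave(packets, n=3))
```

The receiver cannot emit the first group until all N packets have arrived, which is where the (N*N)-interval buffering delay comes from. In exchange, losing one packet costs each group a single isolated interval rather than costing one group N consecutive intervals.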
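The simplest FEC variant described above can be sketched as follows. The packet layout (a dictionary with `seq`, `data`, and `prev` fields) and the function names are hypothetical; actual CODECs define their own packet formats.

```python
def fec_encode(intervals):
    """Attach to each packet a full copy of the prior interval's data."""
    packets, prev = [], None
    for seq, data in enumerate(intervals):
        packets.append({"seq": seq, "data": data, "prev": prev})
        prev = data
    return packets


def fec_decode(received, count):
    """Rebuild the stream, recovering a lost interval from its successor."""
    restored = [None] * count
    for packet in received:
        seq = packet["seq"]
        restored[seq] = packet["data"]
        if seq > 0 and restored[seq - 1] is None:
            restored[seq - 1] = packet["prev"]  # redundant copy fills the gap
    return restored


sent = fec_encode(["a", "b", "c", "d"])
arrived = [p for p in sent if p["seq"] != 1]  # packet 1 is dropped
print(fec_decode(arrived, count=4))
```

A single lost packet is fully recovered once its successor arrives, which is the one-interval delay noted above. If two consecutive packets are lost, only the later of the two can be recovered.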
A number of CODECs intended for VoIP use are commercially available. Each has various parameters, such as sample rate, bandwidth limitations, data rates, strategies for overcoming packet loss, etc. One key parameter is frame size. Frame size is the number of data samples divided by the sample rate, and is commonly expressed in milliseconds. Large frame sizes provide more opportunity for a CODEC to achieve data compression, but unfortunately result in longer buffering times at both the transmitting and receiving ends of the connection. For real time musical performance, short frame sizes (e.g., 10 ms) are preferred, and known. Some commercially available, short frame size CODECs even support an audio bandwidth exceeding that of an ordinary telephone connection (e.g., >8 k samples/sec). An example is iPCM-wb™ by Global IP Sound of Stockholm, Sweden, which can operate with a 10 ms frame size and 16 k samples/sec. Between this improvement in bandwidth over a POTS call, and anticipated improvements in transport latencies for VoIP, such a connection would be preferable to a simple POTS connection. However, it still suffers from the musicians' mutual perception of always being late with respect to the beat.
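As a worked illustration of frame size, a frame's duration is its sample count divided by the sample rate. The helper names below are hypothetical; the figures use the 10 ms frame and 16 k samples/sec mentioned above, with an 8 k samples/sec rate shown for comparison.

```python
def frame_duration_ms(samples_per_frame, sample_rate_hz):
    """Duration of one frame in milliseconds: samples divided by sample rate."""
    return 1000 * samples_per_frame / sample_rate_hz


def samples_per_frame(frame_ms, sample_rate_hz):
    """Number of samples carried by a frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000


wideband = samples_per_frame(10, 16000)    # samples in a 10 ms frame at 16 kHz
narrowband = samples_per_frame(10, 8000)   # samples in a 10 ms frame at 8 kHz
print(wideband, narrowband, frame_duration_ms(wideband, 16000))
```

A 10 ms frame therefore holds 160 samples at 16 k samples/sec and 80 samples at 8 k samples/sec. The frame cannot be sent until its last sample has been captured, so the frame duration sets a floor on the buffering delay at the transmitting end.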
There remains a need for a way to permit multiple remote acoustic performers to collaborate in real time and over useful distances, such as across neighborhoods, cities, states, continents, and even across the globe.
There is a further need to enable them to record those collaborations.
Because of the delays inherent in communication over significant distances, a technique is needed which does not compound that delay.
Further, there needs to be a way of limiting the adverse effects of excessive delay, and to allow each station to achieve an acceptable level of responsiveness.
The present invention satisfies these and other needs and provides further related advantages.