Editing and manipulating audio signals presents a unique challenge. Whereas it is relatively simple to outline an object in a photograph, or even in a video stream, doing so in an audio track is not so straightforward, particularly when mixtures of sounds are involved. For example, recorded audio data of music or other real-world sources often contain a superposition of multiple sounds that occur simultaneously.
Makers of audio processing software have spent significant resources on developing techniques for visualizing audio data in forms that help a user understand and manipulate it. The most widespread representation for audio is the trace of the actual air pressure across time, which is often referred to as the waveform.
While the waveform representation provides an accurate visualization of sound, it unfortunately conveys only a small amount of information. An experienced user might be able to deduce some basic information from this representation, but for most sound mixtures there is very little information to be found.
Another approach to visualizing audio data is to use time-frequency visualizations (often referred to as frequency or spectral representations). Time-frequency decompositions are a family of numerical transforms that allow one to display any time series (such as sound) in terms of how its energy is distributed across frequency as time progresses. The most common of these representations is the spectrogram, which one can readily find in many modern audio editors. More exotic time-frequency transforms, such as wavelets, warped spectrograms, and sinusoidal decompositions, have also been used, but they effectively communicate the same information to a user. Common to all these visualizations is the ability to show how much acoustic energy exists at a specific point in time and frequency. Since different sounds tend to have different energy distributions across time and frequency, it is often possible to visually distinguish mixed sounds using such visualizations.
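As a minimal sketch of the idea described above, the following Python code computes a spectrogram of a synthetic two-tone mixture using a short-time Fourier transform built from NumPy. The sample rate, frame length, hop size, tone frequencies, and the helper names `make_mixture` and `spectrogram` are all illustrative assumptions and do not come from the text; the sketch only shows how energy at a given time and frequency can be measured, so that two overlapping sounds appear as separate ridges.

```python
import numpy as np

# All constants below are illustrative assumptions, not values from the text.
SAMPLE_RATE = 8000   # samples per second
FRAME_LEN = 256      # samples per analysis frame
HOP = 128            # samples between the starts of successive frames


def make_mixture(duration=0.5):
    """Return a mixture of two sinusoids (hypothetical test signal).

    A 500 Hz tone lasts for the whole signal; a 1500 Hz tone is added
    only during the second half, so the two sounds overlap in time.
    """
    t = np.arange(int(duration * SAMPLE_RATE)) / SAMPLE_RATE
    tone_a = np.sin(2 * np.pi * 500.0 * t)
    tone_b = np.sin(2 * np.pi * 1500.0 * t) * (t >= duration / 2)
    return tone_a + tone_b


def spectrogram(signal, frame_len=FRAME_LEN, hop=HOP):
    """Return the power spectrogram with shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Squared magnitude of the real-input FFT gives energy per frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


mixture = make_mixture()
power = spectrogram(mixture)
freqs = np.fft.rfftfreq(FRAME_LEN, d=1.0 / SAMPLE_RATE)  # bin centers in Hz

# Strongest frequency in the first frame (only the 500 Hz tone is present).
first_peak = freqs[np.argmax(power[0])]
# Two strongest frequencies in the last frame (both tones are present).
last_top2 = sorted(freqs[np.argsort(power[-1])[-2:]])
```

In the waveform of `mixture` the two tones are simply summed sample by sample, whereas in `power` they occupy different frequency bins, which is why a spectrogram display can make mixed sounds visually distinguishable. Plotting `power.T` as an image (time on the horizontal axis, frequency on the vertical axis) would give the familiar spectrogram view.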
Although such representations may occasionally be informative for expert users, they do not facilitate object-based interaction with audio, such as allowing a user to select, modify, or otherwise interact with particular sounds within a sound mixture.