Humans are sensitive to repetition in sounds, dialogue, and music. A single recorded footstep, repeated without variation, quickly becomes annoying to a listener. A single recorded piece of dialogue, repeated multiple times by a game, a toy, or a simulation, becomes distracting and destroys the illusion of live interactivity. The same piece of music, repeated many times over the lifespan of a product or service, becomes tedious. The auditory warnings of alarm systems and health monitoring devices come to be ignored because they are so repetitive. Virtual reality environments, digital simulations, and entertainment products lose their realism when ostensibly natural sounds are repeated exactly.
A typical modern piece of interactive media can contain thousands of sound effects, lines of dialogue, and pieces of music. These sounds are commonly recorded, edited, and mixed by audio engineers using highly specialized knowledge and tools. This process is labor-intensive: many modern pieces of digital media require multiple engineers and many hours of work to craft all the sounds needed to achieve a particular artistic or technical effect.
Music recordings, in particular, go through a mastering process, in which engineers select specific sound cues. These sound cues are then fixed in time, with mixing and other effects, to generate a fixed, linear score. By contrast, in a live artistic performance, musicians and singers make choices in the moment of performance. These choices imbue the sound with an improvisational, “of the moment” character that is often not adequately captured in a linear recording. Moreover, multiple takes or recordings of a particular song, performance, or experience may exist; however, the engineer selects which take a listener will ultimately hear. The listener is therefore deprived of the unique experience that would have resulted from hearing the performance live.
Similarly, narrators and actors delivering lines as part of a performance impart each bit of dialogue with a unique character, tone, and/or emphasis. As a result, no two recordings of spoken, live vocals will be precisely the same. And again, while multiple takes of a performer's voice can be, and often are, recorded, a sound engineer will select a single take to proceed with and ultimately share with a listener. Moreover, once all takes have been heard, there is no novelty in listening to previously recorded takes.
Furthermore, the process of merely making a specific sound span a particular length of time is cumbersome. For example, changing a recording of laughter from two seconds in length to one second in length requires a sound editor to dissect and reassemble the individual elements of the sound using specialized programs and tools. While automated processes exist for changing the lengths of recordings arbitrarily, these processes often distort the sound unnaturally. For instance, the sound may end up with a “chipmunk” effect if it is being compressed, or a smearing, “stretching” effect may be introduced if it is being lengthened; this is particularly common with vocoders and related technology.
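The “chipmunk” artifact described above can be sketched in a few lines. The following is a minimal illustration, not any particular product's implementation; the 16 kHz sample rate, function names, and the crude zero-crossing pitch estimate are all assumptions made for demonstration. Naive resampling changes duration and pitch together, which is exactly the distortion the text describes.

```python
import math

SAMPLE_RATE = 16_000  # Hz; assumed here purely for illustration


def naive_time_stretch(samples, factor):
    """Resample by linear interpolation. factor < 1 shortens the sound,
    but because the samples are simply replayed at a different rate,
    the pitch rises with it (the "chipmunk" effect)."""
    n_out = int(len(samples) * factor)
    out = []
    for i in range(n_out):
        pos = i * (len(samples) - 1) / max(n_out - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def zero_crossing_freq(samples):
    """Crude pitch estimate from the average spacing of zero crossings."""
    crossings = [i for i in range(1, len(samples))
                 if (samples[i - 1] < 0) != (samples[i] < 0)]
    spans = [b - a for a, b in zip(crossings, crossings[1:])]
    return SAMPLE_RATE / (2 * sum(spans) / len(spans))


# One second of a 220 Hz test tone, compressed to half a second.
tone = [math.sin(2 * math.pi * 220 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE)]
halved = naive_time_stretch(tone, 0.5)

print(round(zero_crossing_freq(tone)))    # ≈ 220
print(round(zero_crossing_freq(halved)))  # ≈ 440: the pitch has doubled
```

Avoiding this coupling between duration and pitch is what requires the vocoder-style processing mentioned above, which introduces its own smearing artifacts.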
Engineers have made several prior attempts to address these problems; however, each of the previous approaches has significant limitations. One well-known approach is to individually record, edit, and modify all sounds manually. However, recording and editing novel sounds in this individual manner is a time- and labor-intensive process. Additionally, this approach generally requires the efforts of multiple audio engineers and recording artists, all of whom need to use professionally oriented software and tools.
Another well-known approach used by application developers involves creating, storing, and playing back multiple linear recordings, or takes, of a sound effect. However, this approach involves the use of additional production resources to record and implement the takes. At program execution time (that is, at playback time), a particular take is chosen and played for the user. Although this method somewhat reduces listener fatigue resulting from repetition, in time the sounds still become repetitive, as the listener's ear is fatigued from hearing only the predetermined sounds. As a result, it is not uncommon for consumers to simply disable all sounds from, for example, a video game, rather than endure the repetitive sounds. Moreover, in practice, the multiple takes are stored in Random Access Memory (RAM). The RAM of consumer-grade computers is generally limited; as a result, maintaining these pre-recorded variations in a ready-to-play state consumes significant RAM in an interactive application and may force the purchase of additional RAM to store the sounds. For the developer of an interactive application, this additional resource consumption is undesirable, as is the time needed to manually record and implement each take.
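The multiple-take approach can be sketched as follows. The take filenames and the `pick_take` helper are hypothetical and shown only to illustrate the technique; note that every take must remain resident, and with only a handful of takes the listener exhausts the set quickly, which is the limitation noted above.

```python
import random

# Hypothetical take table for illustration: each sound effect maps to a
# small, fixed set of pre-recorded takes, all of which must be kept in
# RAM in a ready-to-play state.
FOOTSTEP_TAKES = ["footstep_01.wav", "footstep_02.wav", "footstep_03.wav"]


def pick_take(takes, last=None):
    """Choose a take at random at playback time, avoiding an immediate
    repeat of the previous take. The variety is still bounded by the
    number of recordings the engineers produced by hand."""
    candidates = [t for t in takes if t != last] or takes
    return random.choice(candidates)


last = None
for _ in range(5):
    last = pick_take(FOOTSTEP_TAKES, last)
    print(last)  # a (small) rotation of the same three recordings
```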
One approach to random variation of pre-recorded sounds involves randomly varying the pitch or volume of linearly recorded sound effects. However, this sort of variation significantly reduces the quality of the final sound. Pitch-randomized sounds take on a perceptible “chipmunk” quality, as they tend to be shifted higher in pitch. Similarly, volume-randomized sounds have a perceptibly different character from the original, static recording. As a result, the overall effect of randomization purely via pitch and volume variation is not convincingly natural. Additionally, the given takes are still quickly exhausted, causing ear fatigue.
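A minimal sketch of this pitch/volume randomization follows; the jitter ranges and function name are assumptions chosen for illustration. Because the pitch variation is applied by naive resampling, the duration shifts along with the pitch, and the gain change alters the recording's character, reproducing the artifacts described above.

```python
import random


def randomize_take(samples, pitch_jitter=0.1, volume_jitter=0.2):
    """Illustrative pitch/volume randomization of one linear recording.

    Pitch is varied by resampling, so the output duration changes with
    it, and upward shifts produce the "chipmunk" artifact; gain changes
    give the sound a perceptibly different character."""
    pitch = 1.0 + random.uniform(-pitch_jitter, pitch_jitter)
    gain = 1.0 + random.uniform(-volume_jitter, volume_jitter)
    n_out = int(len(samples) / pitch)
    out = []
    for i in range(n_out):
        src = min(int(i * pitch), len(samples) - 1)  # nearest-sample pick
        out.append(samples[src] * gain)
    return out


original = [0.5] * 1000          # a flat 1000-sample test signal
variant = randomize_take(original)
```

Each playback yields a slightly different signal, but the variation is mechanical rather than natural, and the underlying take is still the same recording.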
Another approach for randomizing playback of sounds is music stitching and related algorithms. An expert designer or artist uses specialized software to divide input music or sounds into segments. The designer then describes to a software program how to stitch the segments back together using a predetermined or stochastic process. However, this process requires that the designer both understand and be able to express the high-level structure of the sound or music in question using specialized tools and programming methods. Further, since each individual segment must be chosen, ordered, and assigned a probability by a human designer, the music stitching approach requires hours or days to implement even a single variable sound. In the real world, many natural and musical sounds involve complex grammars made up of thousands or millions of elements, and the corresponding syntax of a sound grows exponentially more complex as the sound gets longer or more complicated. As a result, it is at best extremely expensive, and at worst impossible, to accurately model sounds with stitching while also preserving high quality and variety in the randomized output.
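The stitching process above can be sketched as a hand-authored transition table walked stochastically at playback time. The segment names, weights, and `stitch` function are hypothetical; the point is that a human designer must enumerate every segment and every allowed successor with its probability, which is the manual effort criticized above and which grows with the sound's grammar.

```python
import random

# Hypothetical, hand-authored stitching graph: each segment lists its
# allowed successors and their selection weights. Every entry here is
# the product of manual design work.
TRANSITIONS = {
    "intro":  [("verse", 1.0)],
    "verse":  [("chorus", 0.7), ("verse", 0.3)],
    "chorus": [("verse", 0.5), ("outro", 0.5)],
    "outro":  [],  # terminal segment
}


def stitch(start="intro", max_segments=16):
    """Walk the transition table stochastically, yielding one randomized
    segment ordering per call. The variety is limited to whatever the
    designer explicitly encoded in TRANSITIONS."""
    order = [start]
    while TRANSITIONS[order[-1]] and len(order) < max_segments:
        names, weights = zip(*TRANSITIONS[order[-1]])
        order.append(random.choices(names, weights=weights)[0])
    return order


print(stitch())  # e.g. ['intro', 'verse', 'chorus', 'outro']
```

Even this toy four-segment grammar required explicit enumeration; a natural sound with thousands or millions of elements would make the table intractable to author by hand.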
Yet another approach for randomizing playback of sounds involves corpus-based concatenative synthesis (CBCS). CBCS systems often require many hundreds or thousands of sound variations to describe a sound. This requirement is impractical for sound designers, who generally maintain approximately a dozen (or fewer) takes of a particular sound as source material. Moreover, CBCS systems are not designed to accurately model sounds with non-trivial grammars, such as human speech and music. In fact, many common one-shot sounds, such as impacts or footsteps, may not be accurately describable using CBCS. In addition, CBCS requires a significant amount of data storage to hold a database of possible sounds. This makes CBCS impractical for many modern devices, such as toys or cellular telephones.
Regardless of the approach used, existing random playback approaches require a trained sound designer or computer programmer to manually enumerate the list of possibilities using highly specialized software tools, such as scripting or a database. Such a process is tedious, error-prone, and requires special training or experience. Additionally, this authoring process must be manually repeated for each new sound, which increases both the cost and complexity of creating and experiencing dynamic sounds.
Consequently, prior techniques for authoring and rendering non-repetitive sounds have remained in the hands of technical specialists. Representing and performing these variations requires significant computer resources and does not produce output of high enough quality to avoid ear fatigue. Additionally, authoring new, non-repetitive sounds is a time-consuming process that requires a trained expert to implement each sound individually. Accordingly, a need exists for high-quality, randomized, dynamically generated sounds. Such synthesized output with random variations would increase the range of expressiveness of simulations, games, toys, motion rides, theatrical presentations, movies, appliances, and many other media and digital devices.