1. Technical Field
The invention is related to music synthesis, and in particular, to automatic synthesis of music from a database of musical notes and an input musical score by concatenating an optimal sequence of candidate notes selected from the database.
2. Related Art
Techniques for synthesizing music sound are most commonly split into one of two categories, including “model-based synthesis” techniques and techniques based on “concatenative synthesis.”
In general, “model-based synthesis” techniques use a “recipe” for creating sound from scratch, wherein new waveforms are generated with different qualities by modifying the parameters of the recipe. For example, one conventional model-based synthesis technique generates expressive performances of melodies from a model derived from examples of human performances. A related technique synthesizes instrumental music, such as a trumpet performance, by using a performance model that generates a sequence of amplitudes and frequencies from a music score in combination with an instrument model that is used to model the sound timbre of the desired instrument.
In contrast, concatenative synthesis is an idea that has typically been used in the field of speech generation, but has recently been applied to the field of music generation. In the context of speech generation, concatenative synthesis generally operates by using actual snippets or samples of recorded speech that are cut from recordings and stored in a database. Elementary “units” (i.e., speech segments or samples) are, for example, “phones” (a vowel or a consonant), or phone-to-phone transitions (“diphones”) that encompass the second half of one phone plus the first half of the next phone (e.g., a vowel-to-consonant transition). Some concatenative synthesizers also use other more complex transitional structures. Concatenative speech synthesis then concatenates units selected from the voice database then outputs the resulting speech signal. Because concatenative speech synthesis systems use actual samples of recorded speech, they have the potential for sounding “natural.”
In the context of musical sound synthesis, some concatenative synthesis schemes operate by using a database of existing sound, divided into “units,” or “samples” with an output waveform being generated by placing these units or samples into a new sequence. For example, one conventional sound synthesis scheme uses concatenative synthesis to generate sound that represents a new realization of a musical score, played using sound samples drawn from a large database. In general, this scheme relies on a very large database of recordings to construct a great number of “sound events” in many different contexts, with a large emphasis being placed on an analysis of each sound event for extraction of features that are used in evaluating and selecting samples having the best fit transitions. Natural sounding transitions are then synthesized for a music score by selecting sound units containing transitions in a desired target context relative to the music score. Another conventional sound synthesis scheme provides a “musical mosaicing” approach that uses concatenative synthesis to automatically sequence snippets or samples of existing music from a large database to match a target waveform.
With any of the aforementioned concatenative synthesis based music generation techniques, score alignment is an important consideration. Consequently, one technique uses a dynamic time warping to find the best global alignment of a score and a waveform, while a related technique uses a hidden Markov model to segment a waveform into regions corresponding to the notes of a score.