The worldwide music industry generated $33.1 billion in revenue in 2001 according to the RIAA. The American music industry alone generated approximately $14 billion in 2001 (RIAA). Over 250,000 new songs are registered with ASCAP each year in the United States. According to Studiofinder.com, approximately 10,000 recording studios are active in the domestic US market. As an example of the size of publisher back catalogs, EMI Music Publishing has over one million songs in its back catalog.
The revenue of the music industry depends on the protection of musical intellectual property. Digital music files, however, are relatively easy to copy or plagiarize. This represents a well-publicized threat to the ability of the music industry to generate revenue from the sale of music.
Various methods for representing music are known. The most common methods are “standard notation”, MIDI data, and digital waveform visualizations.
Standard musical notation originated in the 11th century, and was optimized for the symphony orchestra approximately 200 years ago. The discrete events of standard notation are individual notes.
Another method is known as “MIDI”, which stands for Musical Instrument Digital Interface. MIDI is the communication standard that electronic musical instruments use to reproduce musical performances. MIDI, developed in 1983, is well known to people who are skilled in the art. Applications that are able to visualize MIDI data include known software utilities such as MIDI sequencing programs, notation programs, and digital audio workstation software.
The discrete events of MIDI are MIDI events. Digital waveforms are a visual representation of digital audio data. CD audio data can be represented at a time resolution of up to 1/44,100 of a second. The discrete events of digital waveforms are individual samples.
Compositional infringement of music occurs when the compositional intent of a song (melody or accompanying parts) is plagiarized from another composition. The scope of infringement may be as small as one measure of music, or may consist of the complete copying of the entire piece. Mechanical infringement occurs when a portion of another recorded song is incorporated into a new song without permission. The technology required for mechanical infringement, such as samplers or computer audio workstations, is widespread because of legitimate uses. Depending on the length of the recording, the infringing party may also be liable for compositional infringement.
Intellectual property protection in regard to musical works and performances exists by virtue of the creation thereof in most jurisdictions. Registration of copyright or rights in a sound recording represents a means for improving the ability of rights holders to enforce their rights in regard to their musical intellectual property.
It is also common to mail a musical work to oneself via registered mail as a means to prove date of authorship and fixation of a particular musical work.
Also, many songwriter associations offer a registration and mailing service for musical works. However, proving that infringement of musical intellectual property has occurred is a relatively complicated and expensive process, as outlined below. This represents a significant barrier to enforcement of musical intellectual property, which in turn means that violation of musical intellectual property rights is relatively common.
In musical infringement, it is first generally determined whether the plaintiff owns a valid copyright or performance right in the material allegedly infringed. This is generally established by reference to the two layers of music/lyrics of a musical work or a sound recording. If the plaintiff owns a valid copyright or performance right, the next step is generally to establish whether the defendant has infringed the work or performance. This is usually decided on the basis of “substantial similarity”.
FIG. 1 shows a comparative analysis of two scored melodies by an expert witness musicologist.
In the United States, it is generally a jury who decides the issue of mechanical substantial similarity. The jury listens to the sample and the alleged source material and determines if the sample is substantially similar to the source.
Many shortfalls in individual music representations exist, such as the lack of representation in the analysis layer of music (motif, phrase, and sentence). There is generally no standardized method for a song to communicate its elements. Standard notation generally cannot accurately communicate all elements of electronic and recorded music. The following table illustrates a few of the shortfalls of standard notation relative to electronic/recorded music.
Musical Expression                          Standard Notation             Electronic/Recorded Music
Rhythm
  Positional divisions/beat                 64 divisions/beat             1000 divisions/beat
  Durational quantize                       64 divisions/beat             1000 divisions/beat
Pitch
  Coarse pitch range                        12 semitones/octave           12 semitones/octave
  # of discrete tunings between semitones   0                             100
  Pitch variance within a note              1 pitch per note event        Pitch variance can be communicated 1000 times/beat
Articulation                                Legato, Staccato, accent      Articulation envelopes can be modulated in real time
Dynamics                                    12 (subjective) divisions     127 discrete points
                                            ppppp - ffffff
Stereo panning                              None                          64 points left, 64 points right
Instrument specific control                 None                          Electronic instruments support performance automation of any parameter
In a MIDI file, mechanical data and compositional data are indistinguishable from each other. Metric context is not inherently associated with the stream of events, as MIDI timing is communicated as delta ticks between MIDI events.
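As a minimal sketch of the point about delta ticks, the following Python fragment accumulates delta times into absolute tick positions and only then derives a bar/beat/tick position. The resolution of 480 ticks per beat, the fixed four-beat bar, and the function names are illustrative assumptions, not part of any particular prior art system.

```python
from itertools import accumulate


def absolute_ticks(delta_ticks):
    """Convert per-event delta ticks into absolute tick positions."""
    return list(accumulate(delta_ticks))


def metric_position(abs_tick, ticks_per_beat=480, beats_per_bar=4):
    """Return a zero-based (bar, beat, tick) tuple for an absolute tick.

    Assumes a constant meter; the event stream itself carries no such context.
    """
    ticks_per_bar = ticks_per_beat * beats_per_bar
    bar, remainder = divmod(abs_tick, ticks_per_bar)
    beat, tick = divmod(remainder, ticks_per_beat)
    return bar, beat, tick


# Hypothetical delta times between four consecutive MIDI events.
deltas = [0, 480, 240, 1200]
positions = [metric_position(t) for t in absolute_ticks(deltas)]
```

The metric position has to be reconstructed from outside knowledge (resolution and meter), which is the shortfall the paragraph above describes.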
The digital waveform display lacks musical significance. Musical data (such as pitch, meter, and polyphony) is undetectable to the human eye in a waveform display.
Prior art representations of music therefore pose a number of shortfalls. One such shortfall arises from the linearity of music, since all musical representations are based on a stream of data. There is nothing to identify one point in musical time from another. Prior art music environments are generally optimized for the linear recording and playback of a musician's performance, not for the analysis of discrete musical elements.
Another shortfall arises from absolute pitch. Absolute pitch is somewhat ineffective for the visual and auditory comparison of music in disparate keys. Western music has twelve tonal centers or keys. In order for a melody to be performed by a person or a musical device, the melody must be resolved to one of the twelve keys. The difficulty that this poses in a comparison exercise is that a single relative melody can have any of twelve visualizations, in standard notation, or twelve numeric offsets in MIDI note numbers. In order for melodies to be effectively compared (a necessary exercise in determining copyright infringement), the melodies need to be rendered to the same tonal center. FIG. 2 shows a single melody expressed in a variety of keys.
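The comparison problem can be illustrated with a short Python sketch that renders two melodies to the same tonal center by expressing each MIDI note number as a semitone offset from its tonic. The function name and the example melodies are hypothetical, and folding every note into one octave (modulo 12) is a simplifying assumption that discards octave and contour information.

```python
def relative_pitch_classes(midi_notes, tonic_pitch_class):
    """Express MIDI note numbers as semitone offsets (0-11) above the tonic.

    tonic_pitch_class is 0 for C, 1 for C#, 2 for D, and so on up to 11 for B.
    """
    return [(note - tonic_pitch_class) % 12 for note in midi_notes]


# The same five-note figure performed in C major and in D major.
melody_in_c = [60, 62, 64, 65, 67]  # C D E F G
melody_in_d = [62, 64, 66, 67, 69]  # D E F# G A

same_melody = (
    relative_pitch_classes(melody_in_c, 0)
    == relative_pitch_classes(melody_in_d, 2)
)
```

As absolute MIDI note numbers the two lists differ in every element, yet their relative forms are identical.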
A number of limitations to current musical representations arise from their use in the context of enforcement of musical intellectual property. Few universally recognized standards exist for testing substantial similarity or fair use in the music industry. There is also usually no standardized basis for establishing remuneration for sampled content. The test for infringement is generally auditory; the original content owner must have auditory access to an infringing song, and be able to recognize the infringed content in the new recording. Finally, the U.S. Copyright Office, for example, does not compare deposited works for similarities, advise on possible copyright infringement, or consult on prosecution of copyright violations.
There is a need therefore for a musical representation system that relies on a relative pitch system rather than an absolute pitch system, in order to assist in the comparison of melodies. There is also a need for a musical representation system that enables the capture and comparison of most mechanical nuances of a recorded or electronic performance, as required for determining mechanical infringement.
There is a further need for a musical representation system that is capable of separating the compositional (theoretical) layer from the mechanical (performed) layer in order to determine compositional and/or mechanical infringement. This representation would need to identify what characteristics of the musical unit change from instance to instance, and what characteristics are shared across instances. Communicating tick accuracy and context within the entire meter would be useful to outline the metric framework of a song.
Preparation of Multi-track Audio for Analysis
Prior art technology allows for the effective conversion of an audio signal into various control signals that can be converted into an intermediate file. There are a number of third-party applications that can provide this functionality.
MIDI (referred to earlier) is best understood as a protocol designed for recording and playing back music on digital synthesizers that is supported by many makes of personal computer sound cards. Originally intended to control one keyboard from another, it was quickly adopted for use on a personal computer. Rather than representing musical sound directly, it transmits information about how music is produced. The command set includes note-on's, note-off's, key velocity, pitch bend and other methods of controlling a synthesizer. (From WHATIS.COM)
The following inputs and preparation are required to perform a correct audio-to-MIDI conversion. The process begins with the digital audio multi-track. FIG. 3 illustrates a collection of instrument multi-track audio files (2). Each instrument track is digitized to a single continuous wave file of consistent length, with an audio marker at bar 0. FIG. 4 shows a representation of a click track multi-track audio file (4) aligned with the instrument multi-track audio files (2). The click track audio file is usually required to be of the same length as the instrument tracks, and its audio marker must also be positioned at bar 0. A compressed audio format (e.g., mp3) of the two-track master is then required for verification.
As a next step, all of the samples used in the multi-track recording must be disclosed in a compressed audio format (e.g., mp3). The source and time index of the sampled material are also required (see FIG. 5).
Song environment data must be compiled to continue the analysis. The following environment data is generally required:
    Track sheet to indicate the naming of the instrument tracks;
    Total number of bars in song;
    Song Structure with bar lengths. Every bar of the song must be included in a single song structure section (Verse—16 bars, Chorus—16 bars, etc.);
    Type and location of time signature changes within the song;
    Type and location of tempo changes within a song; and
    Type and location of key changes within a song.
Before audio tracks can be analyzed, the environment track must be defined. The environment track consists of the following: tempo, Microform family (time signature), key, and song structure.
The method of verifying the tempo will be to measure the “click” track supplied with the multi-track. Tempo values will carry over to subsequent bars if a new tempo value is not assigned. If tempo is out of alignment with the click track, the tempo usually can be manually compensated. FIG. 6 illustrates bar indicators (6) being aligned to a click track multi-track audio file (4). Current state-of-the-art digital audio workstations, such as Digidesign's Pro Tools, include tempo marker alignment as a standard feature.
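One way to picture the tempo check is the sketch below, which estimates beats per minute from the onset times of the clicks. It assumes that click onsets have already been detected and are given in seconds, and that there is one click per beat; the function name and the use of the median interval are illustrative choices, not a description of any particular workstation.

```python
from statistics import median


def tempo_from_clicks(click_times_seconds):
    """Estimate tempo in beats per minute from click onset times.

    Assumes one click per beat; the median interval is used so that a single
    badly detected click does not skew the estimate.
    """
    intervals = [
        later - earlier
        for earlier, later in zip(click_times_seconds, click_times_seconds[1:])
    ]
    if not intervals:
        raise ValueError("at least two clicks are needed to measure a tempo")
    return 60.0 / median(intervals)


# Hypothetical click onsets spaced half a second apart.
measured_bpm = tempo_from_clicks([0.0, 0.5, 1.0, 1.5])
```

Comparing the measured value with the tempo assigned to a bar indicates whether manual compensation is needed.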
Time signature changes are customarily supplied by the artist, and are manually entered for every bar where a change in time signature has occurred. All time signatures are notated as to the number of 8th notes in a bar. For example, 4/4 will be represented as 8/8. Time signature values will carry over to subsequent bars if a new time signature value is not assigned.
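The eighth-note convention can be sketched as a small conversion. The function name is hypothetical, and rejecting meters that do not contain a whole number of eighth notes (such as 5/16) is an assumption of this sketch rather than something the text specifies.

```python
from fractions import Fraction


def eighths_per_bar(numerator, denominator):
    """Return the number of 8th notes in one bar of numerator/denominator time."""
    eighths = Fraction(numerator * 8, denominator)
    if eighths.denominator != 1:
        # Assumption: meters that do not divide into whole 8th notes are rejected.
        raise ValueError(f"{numerator}/{denominator} is not a whole number of 8th notes")
    return int(eighths)


common_time = eighths_per_bar(4, 4)   # 4/4 is notated as 8/8
waltz_time = eighths_per_bar(3, 4)    # 3/4 is notated as 6/8
```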
Key changes are supplied by the artist, and are manually entered for every bar where a change in key has occurred. In case there is a lack of tonal data to measure the key by, the default key shall be C. Key values will carry over to subsequent bars if a new key value is not assigned.
Song structure tags define both section name and section length. Song structure markers are supplied by the artist and are manually entered at every bar where a structure change has occurred. A structure marker carries over for the number of bars assigned in the section length. All musical bars of a song must belong to a song structure section.
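The requirement that every bar belong to a song structure section can be checked mechanically. The sketch below assumes sections are given as hypothetical (name, start bar, length) tuples with zero-based bar numbers, and it reports both unassigned bars and bars claimed by more than one section.

```python
def check_song_structure(sections, total_bars):
    """Return (missing_bars, overlapping_bars) for a list of sections.

    sections is a list of (name, start_bar, length) tuples; bars are zero-based.
    Bars that fall outside 0..total_bars-1 are ignored by this sketch.
    """
    coverage = [0] * total_bars
    for _name, start_bar, length in sections:
        for bar in range(start_bar, start_bar + length):
            if 0 <= bar < total_bars:
                coverage[bar] += 1
    missing = [bar for bar, count in enumerate(coverage) if count == 0]
    overlapping = [bar for bar, count in enumerate(coverage) if count > 1]
    return missing, overlapping


# Hypothetical 32-bar song: a 16-bar verse followed by a 16-bar chorus.
structure = [("Verse", 0, 16), ("Chorus", 16, 16)]
missing_bars, overlapping_bars = check_song_structure(structure, 32)
```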
At the end of the environment track definition, every environment bar will indicate tempo, key, time signature and, ultimately, belong to a song structure. FIG. 7 shows the final result of a song section as defined in the Environment Track.
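The carry-over rule for tempo, key, and time signature can be sketched as a fill-forward over the bars of the song. The dictionary layout, field names, and example values are assumptions made for illustration; only the default key of C and the carry-over behavior come from the description above.

```python
def fill_environment(total_bars, changes, defaults):
    """Build one environment record per bar, carrying values forward.

    changes maps a zero-based bar number to the fields assigned at that bar;
    a field keeps its previous value until a new one is assigned.
    """
    current = dict(defaults)
    bars = []
    for bar in range(total_bars):
        current.update(changes.get(bar, {}))
        bars.append(dict(current))  # copy so later changes do not alter earlier bars
    return bars


# Hypothetical song: tempo drops at bar 8 and the key changes at bar 12.
environment = fill_environment(
    total_bars=16,
    changes={
        0: {"tempo": 120, "time_signature": "8/8"},
        8: {"tempo": 96},
        12: {"key": "G"},
    },
    defaults={"key": "C"},  # default key when no tonal data is available
)
```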
After the environment track is defined, each track must be classified to determine the proper analysis process. The instrument tracks can be classified as follows:
    A) Monophonic (pitched), which includes single-voice instruments, such as a trumpet.
    B) Monophonic (pitched vocal), which includes single vocals, such as a solo vocal.
    C) Polyphonic (pitched), which includes multi-voice instruments, such as a piano, guitar, or chords.
    D) Polyphonic (pitched vocal), which includes multiple vocals singing different harmonic lines.
    E) Non-pitched (percussion), such as “simple” drum loops, where no pitch information is available, and individual percussion instruments.
    F) Complex, such as full program loops and sound effects.
FIG. 8 illustrates the process to generate (7) MIDI data (8) from an audio file (2), resulting in MIDI note data (10), and MIDI controller data (12).
The classifications A) through F) listed above are discussed in the following section, and are visualized in FIGS. 9-14.
The following data can be extracted from Audio-to-Control Signal Conversion: coarse pitch, duration, pitch bend data, volume, brightness, and note position.
Analysis Results for Various Track Classifications
                  Monophonic    Polyphonic    Percussion    Complex wave
                  Analysis      Analysis      Analysis      Analysis
Coarse Pitch      x
Pitch bend data   x
Note Position     x             x             x             x
Volume            x             x             x             x
Brightness        x             x             x             x
Duration          x             x             x             x
Monophonic Audio-to-MIDI Analysis includes:
    pitch bend data, duration, volume, brightness, coarse pitch, and note position.
Polyphonic Audio-to-MIDI Analysis includes:
    volume, duration, brightness, and note position.
Percussion-to-MIDI Analysis includes:
    volume, duration, brightness, and note position.
Complex Audio-to-MIDI Analysis includes:
    volume, duration, brightness, and note position.
Generated events and user input data are combined in various track classifications.
A. Monophonic—Pitched
FIG. 9 illustrates the process to generate (7) MIDI data (8) from an audio file (2). The user enters input metadata (12) that is specific to the Monophonic Pitched track classification.
Generated Events: Monophonic Audio-to-MIDI Analysis Data
User input events:
    Timbre: Significant timbral changes can be noted with MIDI text event
B. Monophonic—Pitched Vocal
FIG. 10 illustrates the process to generate (7) MIDI data (8) from an audio file (2). The user enters input metadata (12) that is specific to the Monophonic Pitched Vocal track classification.
Generated Events: Monophonic Audio-to-MIDI Analysis Data
User input events:
    Lyric: Lyric Syllables can be attached to Note events with MIDI text event
C. Polyphonic Pitched
FIG. 11 illustrates the process to generate (7) MIDI data (8) from an audio file (2). The user enters input metadata (12) that is specific to the Polyphonic Pitched track classification.
Generated Events: Polyphonic Audio-to-MIDI Analysis Data
User input events:
    Coarse Pitch: User enters coarse pitch for simultaneous notes
    Timbre: Significant timbral changes can be noted with MIDI text event
D. Polyphonic Pitched—Vocal
FIG. 12 illustrates the process to generate (7) MIDI data (8) from an audio file (2). The user enters input metadata (12) that is specific to the Polyphonic Pitched Vocal track classification.
Generated Events: Polyphonic Audio-to-MIDI Analysis Data
User input events:
    Coarse Pitch: User enters coarse pitch for simultaneous notes
    Lyric: Lyric Syllables can be attached to Note events with MIDI text event
E. Non-Pitched, Percussion
FIG. 13 illustrates the process to generate (7) MIDI data (8) from an audio file (2). The user enters input metadata (12) that is specific to the Non-Pitched Percussion track classification.
Generated Events: Percussion, Non Pitched Audio-to-MIDI Analysis
User input events:
    Timbre: User assigns timbres per note on. Generic percussion timbres can be mapped to reserved MIDI note on ranges
F. Complex Wave
FIG. 14 illustrates the process to generate (7) MIDI data (8) from an audio file (2). The user enters input metadata (12) that is specific to the Complex Wave track classification.
Generated Events: Complex Audio-to-MIDI Analysis
User input events:
    Sample ID: Reference to source and time index can be noted with text event
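The six track classifications and the data associated with each can be summarized as a lookup table. The sketch below is only an illustrative data structure: the dictionary layout, key names, and function name are assumptions, while the field lists follow the classifications A through F described above.

```python
# Data produced by the monophonic analysis versus the other three analyses.
MONOPHONIC_DATA = (
    "coarse pitch", "pitch bend data", "note position",
    "volume", "brightness", "duration",
)
REDUCED_DATA = ("note position", "volume", "brightness", "duration")

TRACK_CLASSES = {
    "A": {"name": "Monophonic, pitched", "generated": MONOPHONIC_DATA, "user_input": ("timbre",)},
    "B": {"name": "Monophonic, pitched vocal", "generated": MONOPHONIC_DATA, "user_input": ("lyric",)},
    "C": {"name": "Polyphonic, pitched", "generated": REDUCED_DATA, "user_input": ("coarse pitch", "timbre")},
    "D": {"name": "Polyphonic, pitched vocal", "generated": REDUCED_DATA, "user_input": ("coarse pitch", "lyric")},
    "E": {"name": "Non-pitched, percussion", "generated": REDUCED_DATA, "user_input": ("timbre",)},
    "F": {"name": "Complex wave", "generated": REDUCED_DATA, "user_input": ("sample id",)},
}


def fields_for(label):
    """Return all data fields (generated, then user-entered) for a class label."""
    entry = TRACK_CLASSES[label.upper()]
    return entry["generated"] + entry["user_input"]
```

For the polyphonic classes, coarse pitch appears only among the user-entered fields, which reflects that the analysis does not generate it for simultaneous notes.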
There are generally two audio conversion workflows. The first is the local processing workflow. The second is the remote processing workflow.
FIG. 15 illustrates the local processing workflow. The local processing workflow consists of multi-track audio (2) loaded (21) into a conversion workstation (20) by an upload technician (18). The conversion workstation is generally a known computer device including a microprocessor, such as for example a personal computer. Next, MIDI performance data (8) is generated (7) from the multi-track audio files (2). After the content owner (16) has entered (23) the input metadata (14) for all of the multi-track audio files (2), the input metadata (14) is combined (25) with the generated MIDI data (8) to form a resulting MIDI file (26).
FIG. 16 illustrates the remote processing workflow. The remote processing workflow consists of multi-track audio (2) loaded (21) into the conversion workstation (20) by the upload technician (18). The upload technician (18) then generally forwards (27) a particular multi-track audio file (2) to an analysis specialist (24). Next, MIDI performance data (8) is generated (7) from the multi-track audio file (2) on the remote conversion workstation (20). At this point, the analysis specialist (24) enters (23) the input metadata (14) into the user input facility of the remote conversion workstation (20). After the analysis specialist (24) has entered (23) the input metadata (14) for the multi-track audio file (2), the input metadata (14) is combined (25) with the generated MIDI data (8) to form a resulting partial MIDI file (28). The partial MIDI file (28) is then combined (29) with the original MIDI file (26) from the local processing workflow.
In order to MIDI encode the environment track, tempo, key, and time signature are all encoded with their respective MIDI Meta Events. Song structure markers are encoded as MIDI Marker Events. Track name and classification are encoded as MIDI Text events. MIDI encoding for control streams and user data from tracks is illustrated in the following table.
Table of MIDI Translations
Coarse Pitch           MIDI Note Number
Pitch Bend             Pitch Wheel Control
Volume                 Volume Control 7
Brightness             Sound Brightness Control 74
Duration and timing    Note On + Note Off
Lyric and Timbre       MIDI Text
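To make the translations concrete, the following sketch builds the raw bytes of the corresponding MIDI messages. The status bytes, controller numbers 7 and 74, and the meta event types follow the MIDI and Standard MIDI File specifications; the function names are hypothetical, range checking is omitted, and the sketch is not taken from any particular conversion tool.

```python
def note_on(channel, note, velocity):
    """Note On: coarse pitch as the note number; starts the duration."""
    return bytes([0x90 | channel, note, velocity])


def note_off(channel, note):
    """Note Off: ends the duration started by the matching Note On."""
    return bytes([0x80 | channel, note, 0])


def control_change(channel, controller, value):
    """Control Change: controller 7 is volume, controller 74 is brightness."""
    return bytes([0xB0 | channel, controller, value])


def pitch_bend(channel, value):
    """Pitch wheel: 14-bit value (0-16383, center 8192), sent LSB first."""
    return bytes([0xE0 | channel, value & 0x7F, (value >> 7) & 0x7F])


def tempo_meta(bpm):
    """Set Tempo meta event: microseconds per quarter note in three bytes."""
    microseconds = round(60_000_000 / bpm)
    return bytes([0xFF, 0x51, 0x03]) + microseconds.to_bytes(3, "big")


def text_meta(text, meta_type=0x01):
    """Text (0x01) or Marker (0x06) meta event for lyric, timbre, or structure."""
    data = text.encode("ascii")
    if len(data) > 127:
        # Assumption: longer text would need a multi-byte variable-length size.
        raise ValueError("text too long for a one-byte length in this sketch")
    return bytes([0xFF, meta_type, len(data)]) + data


volume_message = control_change(0, 7, 100)
brightness_message = control_change(0, 74, 64)
verse_marker = text_meta("Verse", meta_type=0x06)
```

In a MIDI file, each of these messages would additionally be preceded by its delta time.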
FIG. 17 illustrates the package that is delivered to the server (in a particular implementation of this type of prior art system where the conversion workstation (20) is linked to a server) for analysis. The analysis package consists of the following:
    formatted MIDI file;
    mp3 of 2 track master;
    mp3 of isolated sample files, with sources and time indexes;
    Artist particulars, song title, creation date etc.; and
    Upload studio particulars and ID from machine used in upload.