1. Field of the Invention
This invention relates to a data interchange format for voice sequence data, a music sound and voice reproducing apparatus, and a server apparatus for a music data file containing voice sequence data.
2. Description of Prior Art
The standard MIDI file format (SMF) and the synthetic music mobile application format (SMAF) are already known as data interchange formats for distributing or mutually exchanging data that represents music to be applied to a sound generator. SMAF is a data format specification for representing multimedia contents in a portable terminal or the like (See non-patent literature 1).
SMAF will now be described with reference to FIG. 15.
This diagram shows an SMAF file 100, whose basic structure is built from data blocks referred to as chunks. A chunk comprises a fixed-length (8-byte) header and a variable-length body. The header is further divided into a 4-byte Chunk ID and a 4-byte Chunk Size. The Chunk ID identifies the chunk, and the Chunk Size indicates the length of the body. The SMAF file itself has a chunk structure, and each item of data included in the SMAF file also has a chunk structure.
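The chunk layout described above (a 4-byte Chunk ID, a 4-byte Chunk Size, then a body of that length) can be illustrated with a minimal sketch. The big-endian byte order, the function names `make_chunk` and `read_chunks`, and the chunk IDs `FILE` and `TRK1` are assumptions made for illustration only; they are not taken from the SMAF specification.

```python
import struct

# Assumed header layout: 4-byte Chunk ID followed by a 4-byte unsigned
# Chunk Size. Big-endian byte order is an assumption of this sketch.
HEADER = struct.Struct(">4sI")  # 8 bytes in total


def make_chunk(chunk_id, body):
    """Build one chunk: 8-byte header followed by the body."""
    return HEADER.pack(chunk_id, len(body)) + body


def read_chunks(buf):
    """Split a byte string into a list of (chunk_id, body) pairs."""
    chunks = []
    offset = 0
    while offset + HEADER.size <= len(buf):
        chunk_id, size = HEADER.unpack_from(buf, offset)
        body_start = offset + HEADER.size
        body_end = body_start + size
        if body_end > len(buf):
            raise ValueError("chunk body runs past the end of the data")
        chunks.append((chunk_id, buf[body_start:body_end]))
        offset = body_end  # the next chunk starts right after this body
    return chunks


# Hypothetical chunk IDs, used only to show the nesting.
inner = make_chunk(b"TRK1", b"\x01\x02\x03")
outer = make_chunk(b"FILE", inner)
```

Because the body of a chunk may itself hold chunks, calling `read_chunks` on the body returned for `outer` yields the nested `TRK1` chunk, which mirrors the statement that both the file and the data inside it share the chunk structure.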
As shown in the drawing, the SMAF file 100 comprises a contents info chunk 101 containing management information and one or more track chunks 102 to 108 containing sequence data to be fed to an output device. Sequence data is a data representation in which controls applied to the output device are defined in chronological order. All sequence data included in a single SMAF file 100 is defined to start reproduction simultaneously at time 0. Consequently, all of the multimedia sequence data is reproduced in synchronization.
Sequence data is represented by a combination of events and durations. An event is a data representation of the content of a control applied to the output device corresponding to the media type of the sequence data. A duration is data representing the time that elapses between a preceding event and the succeeding event. Although the processing time required for an event is not actually zero, it is assumed to be zero in the SMAF data representation, and all passage of time is represented by durations. The timing for executing an event can therefore be uniquely determined by accumulating the durations from the beginning of the sequence data. In principle, the processing time consumed by an event does not affect the start time of the next event, since the processing time is very short compared to the duration. Accordingly, consecutive events with a duration of 0 between them are interpreted as being executed simultaneously.
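The timing rule above, in which the execution time of an event is the sum of all durations from the start of the sequence, can be sketched as follows. The pair layout (duration since the previous event, then the event), the tick unit, and the event names are assumptions made for illustration; the actual SMAF encoding is not reproduced here.

```python
from itertools import accumulate


def schedule(sequence):
    """Convert (duration, event) pairs into (absolute_time, event) pairs.

    Each duration is assumed to be the time elapsed since the previous
    event; the first duration is measured from time 0. Event processing
    time is treated as zero, as in the description above.
    """
    durations = [duration for duration, _ in sequence]
    events = [event for _, event in sequence]
    return list(zip(accumulate(durations), events))


# Hypothetical sequence: durations are in arbitrary ticks.
sequence = [(0, "note_on_C"), (0, "note_on_E"), (48, "note_off_C"), (24, "note_off_E")]
timeline = schedule(sequence)
```

In this sketch the first two events both land at time 0, which corresponds to the rule that events separated by a duration of 0 are executed simultaneously.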
SMAF defines the following output devices: a sound generator device 111 for generating sounds from control data equivalent to the musical instrument digital interface (MIDI), a PCM sound generator device (PCM decoder) 112 for acoustically reproducing PCM data, and a display device 113, such as an LCD, for displaying text or images.
The track chunks include score track chunks 102 to 105, a PCM audio track chunk 106, a graphics track chunk 107, and a master track chunk 108, in correspondence with the respective output devices. The track chunks other than the master track chunk, namely the score track chunks, the PCM audio track chunk, and the graphics track chunk, can each be described up to a maximum of 256 tracks.
In the example shown, the score track chunks 102 to 105 contain music sequence data for driving the sound generator device 111; the PCM audio track chunk 106 contains, in event sequence format, wave data such as ADPCM, MP3, and TwinVQ to be reproduced by the PCM sound generator device 112; and the graphics track chunk 107 contains a background image, an inserted still image, text data, and sequence data for reproducing them on the display device 113. The master track chunk 108 contains sequence data for controlling the SMAF sequencer itself.
On the other hand, known techniques for speech synthesis include filter synthesis such as LPC, composite sinusoid speech synthesis, and other waveform synthesis methods. In the composite sinusoid speech synthesis method (CSM method), a speech signal is modeled as a sum of a plurality of sine waves. It is a simple synthesis method and yet offers high-quality speech synthesis (See non-patent literature 2).
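The sum-of-sine-waves model mentioned above can be illustrated with a short sketch. It shows only the modeling idea, not the CSM analysis procedure of non-patent literature 2; the function name `sum_of_sines`, the (frequency, amplitude) pair layout, and the numeric values are assumptions chosen for illustration.

```python
import math


def sum_of_sines(components, sample_rate, n_samples):
    """Return n_samples of a signal built as a sum of sine waves.

    components is a list of (frequency_hz, amplitude) pairs. How such
    pairs would be derived from real speech is outside this sketch.
    """
    samples = []
    for n in range(n_samples):
        t = n / sample_rate  # time of sample n in seconds
        value = sum(amp * math.sin(2.0 * math.pi * freq * t)
                    for freq, amp in components)
        samples.append(value)
    return samples


# Hypothetical frame: two sine components at an 8 kHz sampling rate.
frame = sum_of_sines([(1000.0, 1.0), (2000.0, 0.5)], 8000, 8)
```

A synthesizer following this model would update the set of frequencies and amplitudes over time; the sketch holds them fixed for a single short frame.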
In addition, a voice synthesizer has been proposed that generates a singing voice by synthesizing voices with a sound generator (See patent literature 1).
Non-patent literature 1 is the SMAF Specification, Ver. 3.06, Yamaha Corporation, [Searched for on Oct. 18, 2002], Internet <URL: http://smaf.yamaha.co.jp>.
Non-patent literature 2 is Shigeki Sagayama and Fumitada Itakura, “Some Investigation of Composite Sinusoid Speech Synthesis and Prototype Hardware Realization,” ASJ Trans. of the Com. on Speech Res., S80-12, pp. 93-100, May 1980.
Patent literature 1 is Japanese Unexamined Patent Publication (Kokai) No. 9-50827.
As set forth hereinabove, SMAF includes MIDI-equivalent data (music data), PCM audio data, text or image display data, and various other sequence data, and the entire multimedia sequence can be reproduced synchronously on a common time base.
In SMF and SMAF, however, no representation of a voice (human voice) is defined. One conceivable approach is to extend a MIDI event in SMF or the like so that voices can be synthesized. In that case, however, data processing becomes complicated when the voice part alone is to be extracted at one time and the voices synthesized.