The present invention relates to automatically transcribing music (vocal music, vocal humming, and sounds of musical instruments) into a musical score.
In such an automatic music transcription system, it is necessary to detect the basic items of information in musical scores: sound lengths, musical intervals, keys, times, and tempos.
Generally, acoustic signals consist of continuous repetitions of fundamental waveforms, so the above-mentioned items of information cannot be obtained from them directly.
Therefore, the present applicants have already proposed an automatic music transcription system as disclosed, for example, in Unexamined Patent Application No. 62-178409.
This automatic music transcription system is shown in FIG. 1. Hummed vocal sound signals 11 are converted into digital signals by analog/digital (A/D) converter 12; the digitized sound is called vocal sound data 13. Autocorrelation analyzing means 14 then extracts pitch information and sound power information 15 from the vocal sound data 13. Segmenting means 16 divides the input song or hummed sounds into a plurality of segments on the basis of the sound power information. Musical interval identifying means 17 identifies the musical interval of each segment established by the segmenting means on the basis of the pitch information. Key determining means 18 determines the key of the input song or hummed vocal sounds on the basis of the musical intervals identified by the musical interval identifying means. Tempo and time determining means determines the tempo and time of the input song or hummed vocal sounds on the basis of the segments established by the segmenting means. Musical score data compiling means 110 prepares musical score data on the basis of the outputs of the segmenting means, the musical interval identifying means, the key determining means, and the tempo and time determining means. Musical score data outputting means 111 outputs the musical score data 112 prepared by the musical score data compiling means 110.
It is to be noted in this regard that such acoustic signals as those of vocal sounds in songs, hummed voices, and musical instrument sounds consist of repetitions of fundamental waveforms. In an automatic music transcription system for transforming such acoustic signals into musical score data, it is first necessary to extract, for each analytical cycle, the repetition frequency of the fundamental waveform in the acoustic signal. This frequency is hereinafter referred to as "the pitch frequency". The corresponding cycle is called "the pitch cycle". This pitch information is then used to determine accurately such items as the musical interval and sound length in the acoustic signals.
Two pitch extraction methods, frequency analysis and autocorrelation analysis, have been developed in the fields of vocal sound synthesis and vocal sound recognition. Autocorrelation analysis has hitherto been employed because it extracts pitch without being affected by environmental noise and because it is easy to process.
In the automatic music transcription system mentioned above, the autocorrelation function is calculated for each analytical cycle after the acoustic signals have been converted into digital signals. The function can therefore be evaluated only at lags that are whole multiples of the sampling cycle.
The accuracy of pitch extraction is likewise dependent upon the sampling cycle. If the resolution of a pitch so extracted is low, then the musical interval and sound length determined by the processes described later will have a low degree of accuracy.
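The dependence of pitch resolution on the sampling cycle can be illustrated with a minimal sketch. It is not the implementation of the system described above: the function names, the 8 kHz sampling frequency, the 80 Hz to 1000 Hz search range, and the 440 Hz test tone are all assumptions chosen for illustration. The sketch picks the integer lag that maximizes the autocorrelation and reports how far the pitch moves, in cents (hundredths of a semitone), when the lag changes by one sample.

```python
import math

def autocorr_pitch(samples, fs, f_min=80.0, f_max=1000.0):
    """Return the pitch frequency (Hz) at the integer lag that maximizes
    the autocorrelation of `samples` (illustrative search range)."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag

def lag_step_in_cents(fs, freq):
    """Pitch change, in cents, caused by a one-sample change of lag
    near the pitch cycle of `freq`."""
    lag = round(fs / freq)
    return 1200.0 * math.log2((lag + 1) / lag)

fs = 8000  # assumed sampling frequency in Hz
tone = [math.sin(2 * math.pi * 440.0 * n / fs) for n in range(1024)]
estimate = autocorr_pitch(tone, fs)

# The true pitch cycle is 8000 / 440 = 18.18 samples, but only whole lags
# can be tested, so the estimate is quantized to fs / 18.
print(round(estimate, 2), round(lag_step_in_cents(fs, 440.0), 1))
```

Under these assumptions the estimate lands on a whole-sample lag rather than on the true pitch cycle, and the step between neighboring lags is close to a full semitone, which is coarse for identifying musical intervals.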
It is conceivable to use a higher sampling frequency, but such an approach is liable to leave the system unable to perform real-time processing and to make the automatic music transcription apparatus larger and more expensive. These disadvantages are a consequence of the increase in the amount of data processed in arithmetic operations such as the calculation of the autocorrelation function.
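A rough operation count shows why the processing load grows quickly. The sketch below assumes a direct (non-FFT) autocorrelation over a fixed-duration analysis frame; the 30 ms frame, the 80 Hz lowest pitch, and the function name are hypothetical values chosen for illustration.

```python
def autocorr_multiplies(fs, frame_ms=30, f_min=80):
    """Count the multiplications of a direct autocorrelation over one frame.

    Both the frame length and the largest lag are fixed in time, so both
    grow in proportion to the sampling frequency `fs` (Hz).
    """
    frame_len = fs * frame_ms // 1000   # samples per analysis frame
    max_lag = fs // f_min               # longest pitch cycle in samples
    return sum(frame_len - lag for lag in range(1, max_lag + 1))

low = autocorr_multiplies(8000)
high = autocorr_multiplies(16000)
print(low, high, high / low)
```

Because both the frame length and the lag range scale with the sampling frequency, doubling the sampling frequency roughly quadruples the multiplications per frame in this direct formulation.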
Acoustic signals have the characteristic feature that their power increases immediately after a change in sound. This feature is utilized in the segmentation of sound on the basis of power information.
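Power-based segmentation of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the segmenting means of the system described above: the frame-wise mean-square power, the rise ratio of 2.0, the power floor, and both function names are hypothetical.

```python
def frame_power(samples, frame_len):
    """Mean-square power of consecutive, non-overlapping frames."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def onset_frames(power, rise_ratio=2.0, floor=1e-4):
    """Indices of frames whose power rises sharply over the previous frame.

    A frame starts a new segment when its power exceeds `floor` and is more
    than `rise_ratio` times the previous frame's power (clamped to `floor`).
    """
    onsets = []
    for i in range(1, len(power)):
        previous = max(power[i - 1], floor)
        if power[i] > floor and power[i] > rise_ratio * previous:
            onsets.append(i)
    return onsets

# Illustrative power contour: silence, a decaying note, then a second note.
power = [0.0, 0.5, 0.4, 0.3, 0.05, 0.6, 0.5]
print(onset_frames(power))
```

A fixed threshold of this kind reacts to any sudden rise in power, so a short burst of outside noise would also be marked as the start of a segment.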
Unfortunately, acoustic signals, particularly those of songs sung by a person, do not necessarily follow any specific pattern in the change of their power information, and the pattern of change itself fluctuates. In addition, the sound to be transcribed often contains abrupt sounds, such as outside noises. In these circumstances, a simple segmentation of sound that looks only at changes in the power information has not necessarily divided the individual sounds well.
In this regard, it is noted that acoustic signals generated by a person are not stable over the length of a sound, either. That is, such signals fluctuate considerably in pitch. This has been an obstacle to good segmentation based on pitch information.
Thus, because of the fluctuations in pitch information, conventional systems often treat two or more sounds as a single segment.
With existing transcription equipment, even sounds generated by musical instruments do not readily lend themselves to segmentation based on pitch information. This shortcoming is due to ambient noise intruding into the pitch information after capture by the acoustic signal input apparatus that converts acoustic signals into electrical signals.
When musical intervals, times, tempos, etc. are determined on the basis of sound segments (sound length), the process of segmentation becomes a very important factor in the preparation of musical score data. A low accuracy of segmentation reduces the accuracy of the ultimately developed musical score data. A high initial accuracy of segmentation is therefore desired when final segmentation utilizes the results of the power information. A high initial accuracy is also desired when final segmentation utilizes the results of both pitch information segmentation and the results of power information segmentation.
Acoustic signals, particularly those uttered by a person, are not stable in their musical interval. These signals fluctuate considerably in pitch even when the same pitch (one tone) is intended. Accordingly, it is very difficult to identify musical intervals in such signals.
When a transition occurs from one sound to another, it often happens that a smooth transition is not made to the pitch of the following sound. Pitch fluctuations occur before and after the transition. Consequently, the segments on either side are often mistaken for another sound segment. The result is that sound segments with pitch transitions are often identified as belonging to a different pitch level in the identification of a musical interval.
To explain this in specific terms, consider methods that keep the arithmetic operations simple in the automatic music transcription system mentioned above. For example, a given sound can be identified with the pitch closest on the absolute musical interval axis to the average value of the pitch information within the segment. The sound can also be identified with the pitch closest on the absolute musical interval axis to the median value of the pitch information of the segment.
With a method like this, it is possible to identify the musical interval well when the interval difference between two adjacent sounds is a whole tone, for example do and re on the C-major scale. But if the difference between two adjacent sounds is a semitone, for example mi and fa on the C-major scale, the identification of the musical interval may sometimes be inaccurate. For example, a sound intended to be mi on the C-major scale can be identified as fa.
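The average-based and median-based identification, and the mi/fa confusion it can produce, can be sketched as follows. Several choices here are assumptions for illustration only: the semitone numbering (MIDI note numbers with A4 = 440 Hz as note 69), averaging on a logarithmic semitone scale rather than in hertz, and the per-frame pitch values of a hypothetical singer who intends mi (E4, note 64) but sings sharp.

```python
import math
from statistics import mean, median

A4_HZ = 440.0  # assumed tuning reference; A4 is note number 69

def hz_to_note(freq):
    """Position of `freq` on an equal-tempered semitone axis (fractional)."""
    return 69.0 + 12.0 * math.log2(freq / A4_HZ)

def note_to_hz(note):
    """Frequency in Hz of a (possibly fractional) note number."""
    return A4_HZ * 2.0 ** ((note - 69.0) / 12.0)

def identify(pitches_hz, reducer=mean):
    """Nearest semitone to the average (or median) pitch of one segment."""
    value = reducer([hz_to_note(f) for f in pitches_hz])
    return math.floor(value + 0.5)

# Hypothetical segment: mi (E4 = 64) intended, sung more than a quarter tone sharp.
sharp_mi = [note_to_hz(n) for n in (64.6, 64.55, 64.7, 64.4)]
# Hypothetical segment: mi sung only slightly sharp.
near_mi = [note_to_hz(n) for n in (64.1, 64.2, 64.0, 64.15)]

print(identify(sharp_mi), identify(sharp_mi, median), identify(near_mi))
```

With these values the sharp segment is assigned to note 65 (fa) by both the average and the median, while the slightly sharp segment is assigned to note 64 (mi). The margin for error is only half a semitone in either direction, which is why adjacent scale degrees a semitone apart are the ones most easily confused.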
Like sound length, the musical interval is a fundamental element of musical score data. It is therefore necessary to identify the interval accurately. If it cannot be identified accurately, the accuracy of the resulting musical score data will be low.
The key, on the other hand, is not merely an element of musical score data. The key gives an important clue to the determination of a musical interval. A key has a certain relationship to a musical interval and to the frequency of occurrence of a musical interval. In improving the accuracy of the musical interval, it is desirable to determine the key and to review the identified musical interval.
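One way to picture how a key could be used to review identified musical intervals is the sketch below. It is a hypothetical illustration, not the key determining means of the system described above: it considers only major keys, scores each candidate by how many identified notes fall on its scale, and flags off-scale notes as candidates for review. The note numbering (C4 = 60) and the example melody are assumptions.

```python
MAJOR_STEPS = (0, 2, 4, 5, 7, 9, 11)  # semitone offsets of a major scale

def best_major_key(notes):
    """Tonic pitch class (0 = C) of the major key covering the most notes."""
    counts = [0] * 12
    for note in notes:
        counts[note % 12] += 1

    def score(tonic):
        return sum(counts[(tonic + step) % 12] for step in MAJOR_STEPS)

    return max(range(12), key=score)

def out_of_key(notes, tonic):
    """Indices of notes that do not lie on the major scale of `tonic`."""
    scale = {(tonic + step) % 12 for step in MAJOR_STEPS}
    return [i for i, note in enumerate(notes) if note % 12 not in scale]

# Hypothetical identified notes: a C-major melody with one stray F sharp (66).
melody = [60, 62, 64, 65, 67, 69, 71, 72, 65, 64, 62, 60, 66]
tonic = best_major_key(melody)
print(tonic, out_of_key(melody, tonic))
```

In this example the frequency of occurrence of the notes points to C major, and the single off-scale note is singled out for review. A fuller treatment would also need to handle minor keys and ties between candidate keys.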
Furthermore, as mentioned above, the musical intervals of acoustic signals, particularly those of vocal music, deviate from the absolute musical interval. The greater the deviation, the more inaccurate the musical interval identified on the musical interval axis. The deviation of the musical intervals in vocal music heretofore has resulted in lower accuracy in music transcription.
In summary, the automatic music transcription system and apparatus disclosed in the present applicants' published patent application No. 62-178409 may generate musical score data with low accuracy. It has therefore not found widespread practical use.