Naturalists and others often identify animal species in the field, initially, on the basis of their observable vocalizations.
Experienced bird watchers and naturalists have identified specific species of birds by their unique song vocalizations for centuries. Several hundred such songs have been documented in North America alone, and several books, tapes, and CDs have been published on the topic.
Amateur bird watchers, naturalists, students, and other interested parties may wish to identify birds by their songs in the field, but may not have the training, skill and experience necessary to do so effectively.
In addition, bird watchers may wish to know if particular species of birds have been present in an area without monitoring that area themselves for extended periods of time.
Several attempts to recognize bird species from their vocalizations have been reported in the literature. Several of those prior attempts are now described.
One method of classifying birds on the basis of their vocalizations is described by Anderson et al., Automatic Recognition and Analysis of Birdsong Syllables from Continuous Recordings, Mar. 8, 1995, Department of Organismal Biology and Anatomy, University of Chicago. Anderson et al. teach a method including digitizing the signal into 15-bit samples at 20,000 samples per second, performing a series of 256-point Fast Fourier Transforms (FFTs) to convert to the frequency domain, detecting notes and silence intervals between notes, and comparing the FFT series corresponding to each note with templates of known note classes using a Dynamic Time Warping technique.
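By way of illustration only, the template-comparison step may be sketched as follows. The sketch assumes that each note has already been reduced to a sequence of equal-length spectral frames. The function names dtw_distance and classify_note, the Euclidean frame distance, and the toy templates are hypothetical and are not taken from Anderson et al.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Accumulated dynamic time warping cost between two frame sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the minimal cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local distance: Euclidean distance between two frames.
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a frame of seq_b
                                 cost[i][j - 1],      # repeat a frame of seq_a
                                 cost[i - 1][j - 1])  # advance both sequences
    return cost[n][m]


def classify_note(note, templates):
    """Return the label of the template with the smallest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(note, templates[label]))


# Hypothetical two-bin "spectra": energy moves from the low bin to the high bin.
templates = {
    "rising": [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    "flat": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
}
# The same rising note sung more slowly (first and last frames repeated).
slow_rising = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]]
best_label = classify_note(slow_rising, templates)
```

In this toy example the time-stretched note aligns with the "rising" template at zero cost, which is the property that makes time warping attractive for notes of varying duration. The template itself remains a single fixed exemplar.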
The approach of Anderson et al. has several shortcomings. The templates are insufficiently flexible to account for the substantial variability of vocalizations among individuals of some species. The approach produces an unacceptable rate of false positives among species with overlapping note classes. Furthermore, template matching is highly susceptible to noise. Finally, the approach of Anderson et al. emphasizes the wrong parts of the frequency spectrum because the bioacoustic perception of frequency is not linear as this method assumes.
Another published method is described by Kunkel, G., The Birdsong Project, 1996-2004, Western Catskills of New York State, http://ourworld.compuserve.com/homepages/G_Kunkel/project/Project.htm. Kunkel teaches a method for monitoring and classifying birds from their vocalizations including receiving a signal from a microphone for ten seconds, digitizing the signal into 8-bit samples at 14,925 samples per second, performing a series of 256-point FFTs to convert to the frequency domain, detecting notes and silence intervals between notes, and extracting parameters for each note including the frequency of the note at its highest amplitude, the frequency modulation of the note as a series of up to three discrete upward or downward rates of change representing up to two inflection points, the duration of the note, and the duration of the silence period following the note. The parameters corresponding to notes of known bird songs are compiled into a matrix filter, and the matrix filter is applied to recordings of unknown bird songs to determine if the known bird song may be present in the sample.
The approach described by Kunkel has several different shortcomings. For example, birdsongs from many different species often contain individual notes that would be classified as the same, making those species difficult or impossible to distinguish. Like Anderson et al., the approach taken by Kunkel fails to take into consideration the context of notes, resulting in false positives among species with overlapping note classes. Also, since some birds have broadband vocalizations, consistent determination of frequency modulation parameters may be difficult. Indeed, even a noisy background would make the consistent determination of frequency modulation parameters difficult. Finally, like Anderson et al., Kunkel measures frequency on a linear scale, which fails to properly account for the bioacoustic perception of frequency.
In McIlraith et al., Birdsong Recognition Using Backpropagation and Multivariate Statistics, IEEE Transactions on Signal Processing Vol. 45 No. 11, November 1997, two different methods for classifying birds from their vocalizations are taught.
In the first method, birdsong is digitized into 8-bit samples at 11,025 samples per second, automatic gain control is used to normalize signal levels to peak amplitude, a series of 512-point FFTs is performed to convert to the frequency domain, 16 time-domain coefficients are generated for a 15th order LPC filter for each frame, a 16-point FFT is performed on the filter coefficients to produce 9 unique spectral magnitudes, and the 9 spectral magnitudes are combined with a song length to produce a series of 10 parameters per frame. These parameters are used in a back-propagating neural network, and the per-frame classifications are tallied over the length of the song.
In the second method, birdsong is digitized into 8-bit samples at 11,025 samples per second, a leaky integrator function is used to parse the song into notes, the number of notes is counted, the mean and standard deviation of both the duration of notes and the duration of silent periods between the notes are determined, and a series of 16-point FFTs is used to determine the mean and standard deviation of normalized power occurring in each of 9 frequency bands, resulting in a total of 23 possible parameters. Only 8 of these parameters were found to be statistically significant and were used in classification tests with either a back-propagating neural network or with other statistical methods.
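By way of illustration only, parsing a song into notes with a leaky integrator and summarizing the note and silence durations may be sketched as follows. The particular integrator form (a first-order smoothing of the rectified signal), the leak and threshold values, and the function names segment_notes and note_statistics are assumptions made for this sketch; the cited reference is not quoted for any of them.

```python
import statistics


def segment_notes(samples, leak=0.5, threshold=0.3):
    """Return (start, end) sample index pairs where the envelope is above threshold."""
    notes = []
    envelope = 0.0
    start = None
    for i, x in enumerate(samples):
        # Leaky integrator: old energy decays while new rectified input is added.
        envelope = leak * envelope + (1.0 - leak) * abs(x)
        if envelope >= threshold and start is None:
            start = i                      # note onset
        elif envelope < threshold and start is not None:
            notes.append((start, i))       # note offset
            start = None
    if start is not None:                  # signal ended during a note
        notes.append((start, len(samples)))
    return notes


def note_statistics(notes):
    """Count notes and summarize note durations and inter-note silences (in samples)."""
    durations = [end - start for start, end in notes]
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(notes, notes[1:])]
    return {
        "count": len(notes),
        "mean_duration": statistics.mean(durations) if durations else 0.0,
        "std_duration": statistics.pstdev(durations) if durations else 0.0,
        "mean_gap": statistics.mean(gaps) if gaps else 0.0,
        "std_gap": statistics.pstdev(gaps) if gaps else 0.0,
    }


# Synthetic signal: silence, a 100-sample note, silence, a 50-sample note, silence.
signal = [0.0] * 50 + [1.0] * 100 + [0.0] * 100 + [1.0] * 50 + [0.0] * 100
notes = segment_notes(signal)
stats = note_statistics(notes)
```

The five values returned by note_statistics correspond to the note count and the four duration statistics described above. The 18 band-power statistics that make up the remainder of the 23 parameters are not computed in this sketch.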
These methods aggregate spectral information on a frame-by-frame basis, without regard for any finer structure of the birdsong. Because many bird species have overlapping spectral properties, these methods cannot achieve high recognition rates across a large number of individual species. Furthermore, in both of these methods, both power and frequency are measured on a linear scale, disregarding that the bioacoustic perception of sound is not linear in either power or frequency.
In Kogan et al., Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: A comparative study, Aug. 4, 1997, Department of Organismal Biology and Anatomy, University of Chicago, yet another method of classifying birds from their vocalizations is taught. This method includes digitizing the bird song, extracting a time series of Mel Frequency Cepstral Coefficients (MFCC), and computing the probabilities that HMMs representing a known birdsong produced the observation sequence represented by the sequence of coefficients. The HMMs have a fixed number of states for each note with a simple bi-grammar structure to recognize occurrences of note pairs.
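By way of illustration only, computing the probability that a given HMM produced an observation sequence may be sketched with the forward algorithm. As a simplifying assumption, the sketch uses discrete observation symbols in place of MFCC vectors, and the two toy models, their parameters, and the function name forward_likelihood are hypothetical rather than taken from Kogan et al.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete-emission HMM (forward algorithm).

    start[s]    : probability of beginning in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol o
    """
    n_states = len(start)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)


# Two hypothetical left-to-right models with a fixed number of states (two).
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
low_then_high = [[0.9, 0.1], [0.1, 0.9]]   # state 0 favors symbol 0, state 1 favors symbol 1
high_then_low = [[0.1, 0.9], [0.9, 0.1]]   # the reverse pattern

observed = [0, 0, 1, 1]                    # quantized "low, low, high, high"
p_a = forward_likelihood(observed, start, trans, low_then_high)
p_b = forward_likelihood(observed, start, trans, high_then_low)
best_model = "low_then_high" if p_a > p_b else "high_then_low"
```

The model with the higher likelihood is taken as the recognized class. For long observation sequences a practical implementation would work with scaled or logarithmic probabilities to avoid numerical underflow, which this sketch omits.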
Under this method, individual notes need to be manually classified before training the HMMs. This is a difficult, labor-intensive and time-consuming process. Furthermore, HMMs of a fixed number of states do not discriminate well among notes of variable durations. The simple bi-grammar of Kogan et al. correlates too coarsely to the structure of birdsongs to be able to distinguish among a large number of diverse species. Also, Kogan et al. teach using a diagonal co-variance matrix that does not provide adequate discrimination across models with respect to the relative weighting of spectral and temporal properties. Finally, the Mel-scale used by Kogan et al. is ill-suited to the analysis of birdsongs with higher-frequency vocalizations.
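By way of illustration only, the compression of higher frequencies on the Mel scale may be seen from a commonly used Hz-to-Mel conversion formula. Whether Kogan et al. used this exact variant of the formula is an assumption, and the function name hz_to_mel and the chosen frequencies are hypothetical.

```python
import math


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to Mels using a commonly cited formula (assumed variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# Width, in Mels, of two 1,000 Hz wide bands: one low and one high.
low_band_mels = hz_to_mel(2000.0) - hz_to_mel(1000.0)
high_band_mels = hz_to_mel(8000.0) - hz_to_mel(7000.0)
```

Under this formula the band from 7,000 Hz to 8,000 Hz spans far fewer Mels than the band from 1,000 Hz to 2,000 Hz, so equal-width Mel filters resolve high-frequency detail more coarsely.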
Yet another method is described by Härmä, Aki, “Automatic Identification of Bird Species based on Sinusoidal Modeling of Syllables”, 2003, Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology. Härmä teaches a method of classifying birds from their vocalizations comprising the steps of digitizing the vocalization at a rate of 44,100 samples per second, performing a series of 256-point FFTs to convert to the frequency domain over the duration of the sample to create a spectrogram, extracting individual sinusoids from the spectrogram, determining the frequency and log power trajectory of each sinusoid through time, and using these parameters to compare against those of known birdsong.
One disadvantage of the Härmä approach is that it may be difficult to extract individual sinusoids from a broadband vocalization. Another disadvantage of the Härmä approach is that it may not consistently extract sinusoids in vocalizations with harmonic components. Yet another disadvantage of the Härmä approach is that it does not consider how sinusoids might be combined into notes or how notes might be combined into phrases. Yet another disadvantage of the Härmä approach is that frequency is represented on a linear scale whereas the bioacoustic perception of animal vocalizations is on a logarithmic scale.