Neural networks can be used to process an optically captured image of a music score and produce a digital representation of the corresponding musical notes, from which audio can be synthesized. Such processing provides an “OCR” (optical character recognition) function for sheet music. See, for example, U.S. patent application Ser. No. 11/303,812 entitled “System and Method for Music Score Capture and Synthesized Audio Performance with Synchronized Presentation” to Robert D. Taub, filed Dec. 15, 2005.
A neural network receives input and processes it to make a decision or draw a conclusion about the input data. The input data includes samples or constituent parts relating to the decision or conclusion. A set of neural network nodes, which may be organized in multiple layers, receives and performs computation on the input. The output of the neural network comprises a set of values that specify the network conclusion with respect to the input. For example, the neural network layers may break down input comprising an optical image of a music score into its constituent musical parts, so that the musical parts correspond to the input samples and the output values may specify decisions about the input, such as whether a given input sample was a whole note or half note, and the like. Thus, input samples are received, each sample is processed by the neural network (NN), and the NN produces a decision or conclusion about each sample. A trained NN incorporates data structures associated with labels assigned by the NN that correspond to the output values.
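As an illustration of how output values can specify a decision about an input sample, the following minimal sketch maps a vector of output values to a label. It assumes one output value per candidate label and picks the largest. The label list and the function name `decide` are hypothetical; the text above names only whole notes and half notes as examples.

```python
# Hypothetical label set, for illustration only.
LABELS = ["whole_note", "half_note", "quarter_note"]


def decide(output_values):
    """Return the label whose output value is largest.

    Assumes one output value per entry in LABELS.
    """
    if len(output_values) != len(LABELS):
        raise ValueError("expected one output value per label")
    # Index of the largest output value.
    best = max(range(len(output_values)), key=lambda i: output_values[i])
    return LABELS[best]


if __name__ == "__main__":
    print(decide([0.1, 0.8, 0.1]))
```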
To use any NN in processing an input sample, it is first necessary to train the network so it can provide useful output upon receiving the input sample. Each item of training data must comprise an input sample and the corresponding desired output values. Selecting an appropriate collection of training data and preparing the collection for input to the neural network can be a lengthy and tedious process. Carrying out the training process is important in order to provide a network that can produce an accurate OCR rendition of the captured music score image. A data set that is acceptable for training a neural network can be split into two distinct data sets: (1) a data set for training, and (2) a data set for testing the quality of training achieved. The remainder of this document refers to a training data set with the understanding that a testing data set could be created using equivalent methods.
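The split into a training set and a testing set can be sketched as follows. This is a minimal example under stated assumptions: each item is an (input sample, desired output) pair, the split is random, and the function name `split_dataset`, the `test_fraction` parameter, and its default of 0.2 are hypothetical choices rather than values given in the text.

```python
import random


def split_dataset(items, test_fraction=0.2, seed=0):
    """Split items into (training, testing) lists with no overlap.

    items: sequence of (input_sample, desired_output) pairs.
    test_fraction: assumed share of items held out for testing.
    """
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    data = [("sample_%d" % i, "half_note") for i in range(10)]
    train, test = split_dataset(data)
    print(len(train), len(test))
```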
To generate training data for training a music OCR neural network, images of multiple music scores must be processed, and for each score image, a corresponding data description must be produced with correct data parameters according to the neural network configuration and the desired output. A single score image to be processed has potentially hundreds of notes and symbols that must be associated with the correct label or output value. Many scores must be processed to create a training data set. With a smaller set of training data, training will be more quickly concluded, though in general output accuracy and quality will be compromised. With a larger set of training data, the neural network will be better trained and should thus provide higher quality output, though preparing the training data will take more time and effort.
What is needed is a means of efficiently collecting and preparing data for use in training the neural network. The data collection technique must produce data sets that are based on music score images and that have correct data values (i.e., desired output) for the parameters that describe the music score images in terms of the neural network data structures, such that the parameters are converted during processing in accordance with the network data structures. The required processing may consist of image manipulation, artifact location mapping, and file format conversion.
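One possible reading of artifact location mapping is sketched below: each annotated region of a score image is cropped and paired with its label to form a training item. The representation is an assumption made for illustration. The image is a list of pixel rows, each annotation is a hypothetical (left, top, width, height, label) tuple, and `extract_samples` is not a name taken from the text.

```python
def extract_samples(image, annotations):
    """Crop each annotated region and pair it with its label.

    image: list of rows, each row a list of pixel values.
    annotations: list of (left, top, width, height, label) tuples.
    Returns a list of (cropped_region, label) pairs.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    samples = []
    for left, top, width, height, label in annotations:
        # Reject regions that fall outside the image.
        if left < 0 or top < 0 or left + width > cols or top + height > rows:
            raise ValueError("annotation lies outside the image")
        crop = [row[left:left + width] for row in image[top:top + height]]
        samples.append((crop, label))
    return samples


if __name__ == "__main__":
    page = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    print(extract_samples(page, [(1, 0, 2, 2, "half_note")]))
```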
To ensure high quality output, the collected data of the training data set must be of a high quality. That is, a high degree of consistency should exist between the labels and the image samples to which they are intended to correspond. Also, image characteristics should conform to the criteria that the system is intended to process successfully. Thus, the images should reflect “real world” artifacts that the system is expected to handle, for example by embodying typical optical distortion while remaining legible to a human.
Finally, the training data should be presented to the NN in a manner that conforms to the specifications of the data set format and content and to the NN configuration. In general, to successfully train a NN, a deliberate scheme is required for presenting data set elements, where the scheme may influence data element order, frequency, content manipulation, and the like.
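A presentation scheme that influences element order and frequency might look like the following minimal sketch. It assumes that samples are (input sample, label) pairs, that the order is reshuffled on every pass over the data, and that some labels may be presented more often than others. The function name `presentation_order` and its `passes` and `repeat` parameters are hypothetical.

```python
import random


def presentation_order(samples, passes=2, repeat=None, seed=0):
    """Yield samples pass by pass, reshuffling the order on each pass.

    samples: list of (input_sample, label) pairs.
    repeat: optional dict mapping a label to the number of times each
            sample with that label is presented per pass (default 1).
    """
    rng = random.Random(seed)
    repeat = repeat or {}
    for _ in range(passes):
        # Frequency: duplicate samples according to their label.
        current = []
        for sample in samples:
            current.extend([sample] * repeat.get(sample[1], 1))
        # Order: shuffle so each pass presents elements differently.
        rng.shuffle(current)
        for sample in current:
            yield sample


if __name__ == "__main__":
    data = [("img_a", "whole_note"), ("img_b", "half_note")]
    for item in presentation_order(data, passes=1, repeat={"whole_note": 2}):
        print(item)
```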