Technical Field
The present invention relates generally to music and, in particular, to music modeling using clock Long Short-Term Memory (LSTM).
Description of the Related Art
Modeling time-series data such as music, speech, or sensor data is an important area of machine learning. Elements of time-series data of naturally occurring phenomena often follow patterns, so an element can be predicted from sequences of elements in the same or similar data. These predictive cues are referred to as “context”.
Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) are machine learning models which can use context. They have hidden layers, and learn sequences by using recurrent inputs.
LSTM is an extension of RNN whose memory for storing context makes it possible to handle long-term data. The memory can be written and reset with reference to its own contents. Unless it is reset, the contents of the memory persist indefinitely. Writing to and resetting the memory depend on the inputs and the context. These dependencies are learned by neural networks from training data in the same way as in an RNN. The neuron layers for writing and resetting are referred to as the input gate and the forget gate, respectively.
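The gated memory update described above can be illustrated with a minimal sketch. It is not the claimed method: it assumes a single scalar memory cell, omits the output gate found in common LSTM formulations, and uses hypothetical names (memory_step, the weight keys, and the hand-picked weight values). It also assumes the common convention in which a forget-gate output near 1 keeps the memory and an output near 0 clears it.

```python
import math


def sigmoid(z):
    """Logistic function used by both gates."""
    return 1.0 / (1.0 + math.exp(-z))


def memory_step(c_prev, x, h_prev, w):
    """One update of a scalar memory cell (illustrative, hypothetical helper).

    c_prev: previous memory contents
    x:      current input
    h_prev: previous hidden output (the recurrent input)
    w:      dict of weights and biases (hand-set here; learned in practice)
    """
    # Input gate: how much of the new candidate value is written to memory.
    i = sigmoid(w["wi_x"] * x + w["wi_h"] * h_prev + w["bi"])
    # Forget gate (assumed convention): near 1 keeps the memory, near 0 resets it.
    f = sigmoid(w["wf_x"] * x + w["wf_h"] * h_prev + w["bf"])
    # Candidate value computed from the input and the recurrent input.
    g = math.tanh(w["wg_x"] * x + w["wg_h"] * h_prev + w["bg"])
    c = f * c_prev + i * g
    h = math.tanh(c)
    return c, h


# Hand-picked weights: writing is disabled, only the forget-gate bias differs.
keep = dict(wi_x=0.0, wi_h=0.0, bi=-20.0,
            wf_x=0.0, wf_h=0.0, bf=20.0,
            wg_x=1.0, wg_h=0.0, bg=0.0)
reset = dict(keep, bf=-20.0)

c_kept, _ = memory_step(0.7, 1.0, 0.0, keep)    # memory is preserved
c_reset, _ = memory_step(0.7, 1.0, 0.0, reset)  # memory is cleared
print(c_kept, c_reset)
```

With the forget-gate bias strongly positive the stored value passes through essentially unchanged, and with it strongly negative the stored value is wiped, which is the reset behavior associated with a context transition.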
One problem with RNN and LSTM is that they take more time to learn the transition of context than a Hidden Markov Model (HMM) does. LSTM and RNN use Stochastic Gradient Descent (SGD) to update their weights, i.e., their parameters. In the case of LSTM, the transition of context (also called context transition) means the resetting of the memory, which is realized by the firing of the forget gate. The firing of the forget gate is determined by a sigmoid function, and it takes a long time for the norm of the weights to grow large enough for the gate output to alternate between values near 0 and values near 1. This is because both the derivative of the sigmoid function and the learning rate are too small to reach such a norm quickly. The derivative of the sigmoid function has a maximum value of 0.25 and becomes smaller as the weights become larger. The time needed to learn weights of a given norm is inversely proportional to the learning rate, but empirically the learning rate is set smaller than 1. This is why LSTM takes a long time to learn the transition of context. In the case of HMM, such problems do not arise because the parameters have an analytic solution, so SGD is not necessary.
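The slow growth of the gate weights can be made concrete with a small numerical sketch. It is a toy setting, not the training procedure of any particular LSTM: it assumes a single weight w feeding a sigmoid, a constant input of 1, a target output of 1, and a squared-error loss, so that the derivative of the sigmoid appears in the gradient. The function names and the 0.99 threshold are hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def steps_to_saturate(lr, target=0.99, max_steps=1_000_000):
    """Count gradient steps until sigmoid(w) >= target, starting from w = 0.

    Assumed toy loss: 0.5 * (sigmoid(w) - 1)**2, so the gradient with
    respect to w is (sigmoid(w) - 1) * sigmoid'(w).
    """
    w = 0.0
    for step in range(1, max_steps + 1):
        s = sigmoid(w)
        grad = (s - 1.0) * s * (1.0 - s)
        w -= lr * grad
        if sigmoid(w) >= target:
            return step
    return max_steps  # cap reached without saturating


# The derivative peaks at 0.25 and shrinks as the pre-activation grows.
for z in (0.0, 2.0, 5.0):
    print(z, sigmoid_grad(z))

# A smaller learning rate needs proportionally more steps to saturate the gate.
steps_fast = steps_to_saturate(lr=1.0)
steps_slow = steps_to_saturate(lr=0.1)
print(steps_fast, steps_slow)
```

In this toy setting each update is scaled by both the learning rate and a sigmoid derivative that keeps shrinking as the weight grows, so the number of steps rises sharply as the gate approaches saturation.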
Thus, there is a need for an LSTM-based method for music modeling that can exploit context.