Visual recognition systems take images or video frames as inputs, and output labels indicating semantic categories of the input images. Such systems have wide applications, including but not limited to face recognition, object recognition, scene classification, and hand-written recognition. Many state-of-the-art visual recognition systems use hierarchical feed-forward neural networks, also referred to as deep neural networks, whose parameters are trained based on training examples (e.g., images, video frames) labeled by human labelers. Usually a neural network for visual recognition is a very complex system, containing an input layer, multiple hidden layers, and an output layer, often with tens of thousands, or even millions, of parameters to be determined through a training procedure.
In practice, the amount of training data is usually limited and expensive to obtain due to labeling costs. Consequently, it is very difficult to train such a complex learning system, thus, recognition accuracy is often far from satisfactory.
A traditional way to determine the parameters is by supervised training based on provided labeled examples. To guarantee a good performance, this method requires a large number of training examples that are usually expensive to obtain. Moreover, this method does not really improve the recognition accuracy by using human domain knowledge and unlabeled data.
A pre-training strategy has been suggested which puts the neural network as an encoder part of an encoder-decoder network and which enforces the hidden features of the neural network that can be used to recover the input data by going through a decoder procedure. After this pre-training phase, the encoder part is further trained, usually by a supervised training procedure. Since the pre-training phase does not require labeled data, the approach makes use of unlabeled data to pre-train the network, and thus offers a good initialization of the parameters. However, the method needs to train an additional decoder network, which is used by the recognition system in the end. Furthermore, the approach does not use the human domain knowledge.
Accordingly, a method and system is needed which uses domain knowledge and unlabeled data to train a neural network of a visual recognition system.