In countries such as India, many transactions involving government, banking, real estate, and similar sectors still take place on paper. There has been a strong recent initiative to reduce paper-based transactions; however, digitization of archival data remains a major challenge for achieving this goal. Robust character segmentation is difficult for many Indic scripts, and hence the accuracy of Optical Character Recognition (OCR) remains poor.
OCR engines fail on Indic scripts mainly because character segmentation is non-trivial. Segmenting words from these scripts is relatively easier, so creating a word-level dataset provides a viable alternative. Such data can support applications such as indexing, transcription, and OCR.
Feature-based word clustering is an alternative that is employed for word recognition. Furthermore, randomly initialized deep networks work well for object recognition. However, randomly initialized deep networks are not fine-tuned for shape feature extraction.
Although supervised feature-based word clustering, the method currently employed for word clustering, is available, it requires a large amount of training data and computing resources and takes a long time to train.