Extracting data (e.g., text, numbers, symbols) from images (e.g., filled forms, drawings, digital documents) and building meaningful information from the extracted data is a complex and time-consuming task, because many different pieces of text, numbers, and symbols must be identified and correlated. Typically, such data extraction and information building is done manually and is prone to human error. More recently, computer-based systems have been employed to automatically extract data and build meaningful information from digital images. Many such systems employ optical character recognition (OCR) techniques to extract data from the digital images.
Existing OCR techniques are built on the pre-defined symbols, numbers, and text on which they have been trained. However, because the digital images, and the training data (text, numbers, symbols, etc.) available in them, are very limited, training a machine learning algorithm for OCR to identify the data with a high level of accuracy is challenging. Further, once an OCR technique has been trained on, or has learnt, a set of symbols (e.g., in a specific domain), it is difficult to apply it to a new set of images that may be similar to the previous set yet contain many new symbols that the OCR technique does not recognize. Additionally, in many situations the data in the digital images varies due to multiple factors. For example, the data available in the digital images is highly inconsistent and depends on factors such as image resolution, noise effects, variation in font size and type, and so forth. Moreover, in the digital images, the information is split across various places and needs to be associated correctly.
Existing OCR techniques are therefore unable to perform with good accuracy across multiple digital images. Further, pre-defined OCR techniques are not only ineffective but may also produce erroneous results. It is therefore desirable to provide an effective technique to extract and identify the various symbols, numbers, and text in the digital images and to correlate them so as to build appropriate, complete, and meaningful information.