In an increasingly digital world, a huge number of electronic documents are generated on a day-to-day basis by word processing applications (e.g., MS Word), by imaging (e.g., scanning) hard copies of documents, or by other such means. Many times, these electronic documents are originally generated in, or otherwise converted into, a more universally accessible image format such as a portable document format (.pdf), a JPEG format (.jpg or .jpeg), etc.
Typically, these electronic documents include important terms or sections in a different text style (e.g., font, height, width, intensity, etc.) so as to facilitate ease of review and use. For example, business documents such as a statement of work (SOW), a master service agreement (MSA), etc. may include important terms or sections such as the company name, contract date, contract termination date, and important clauses in bold text. Further, in some documents such as white papers or research papers, titles, section headers, table headers, and figure names may be in bold text.
Often, there may be a need to identify and/or extract these important terms or sections from such documents. For example, for morphological analysis (semantic analysis) of documents, bold text plays a very important role in section segmentation and in the extraction of important information. Further, if a table of contents is not provided for a document, then the user has to manually traverse the entire document to identify the required information, which is a tedious process. Again, bold text plays an important role in facilitating such reviews by helping generate the table of contents.
Current techniques to extract entities from an image format of a document based on text style are inefficient and cumbersome, as the text style varies across the document(s) at multiple levels. For example, the text style in a document may vary in terms of intensity, resolution, skew, rotation, and so forth. Additionally, the current multilevel entity extraction techniques are highly time-consuming. Further, machine learning based techniques require a lot of training data and training time. Machine learning based techniques are also not suited for resource-constrained computing devices such as mobile devices.