PDF documents are widely used as brochures, manuals, white papers, financial documents, and presentations, and have largely replaced earlier formats for these purposes. Examples of the earlier formats include, but are not limited to, text files, Hyper Text Markup Language (HTML) pages, MS Word documents, MS Excel spreadsheets, and Comma Separated Values (CSV) files. However, extracting information from PDF documents is more difficult than extracting it from these other formats. One reason is that PDF documents are of different types and may be generated in a variety of formats.
Conventional techniques may extract text, images, and tables from PDF documents. However, the conventional techniques fail to extract vector graphic images from PDF documents while retaining vital information. Vector graphic images are mostly included in technical manuals, troubleshooting manuals, or enterprise articles that are in the PDF format. With conventional techniques, a vector graphic image is not extracted as a single image; instead, it is scattered into small segments after extraction. As a result, conventional image extraction techniques are not accurate for vector graphic images.