Fixed-format structured documents, sometimes called fixed-layout flat documents, are documents in which content is specified by its properties and its location on the page rather than by a document structure (i.e., a hierarchical tree consisting of paragraphs, lists, tables, etc.). PDF documents, which are ubiquitous, are one example of fixed-format structured documents and are considered the de facto standard for document exchange, collaboration, and archival. However, the lack of explicit structure information gives rise to some non-trivial issues when working with these types of documents. In particular, PDF structure information is needed in many typical workflows related to PDF processing, such as: making a given PDF accessible (for screen reading, for example); executing a PDF reflow and other PDF content interactivity provided by current and future generation PDF tools; executing a PDF editing feature; and executing a PDF compare feature, to name a few examples.
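By way of illustration only, the distinction drawn above between position-based content and a hierarchical document structure can be sketched as follows. This is a minimal sketch and not an actual PDF data model: the names `PositionedText`, `StructureNode`, and `roles`, the coordinate conventions, and the sample values are all hypothetical and chosen solely for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PositionedText:
    """A text run as a fixed-format page might record it: content plus placement.

    Hypothetical type for illustration; it carries no structural role.
    """
    text: str
    x: float          # left edge of the run, in page units (assumed convention)
    y: float          # vertical position, measured from the top (assumed convention)
    font_size: float


@dataclass
class StructureNode:
    """A node in a hierarchical document structure (table, row, cell, ...).

    Hypothetical type for illustration.
    """
    role: str
    text: str = ""
    children: List["StructureNode"] = field(default_factory=list)


# Fixed-format view: four runs with positions and properties. Nothing here
# states that they form a two-by-two table; that must be inferred.
fixed_format_page = [
    PositionedText("Item", 72.0, 100.0, 11.0),
    PositionedText("Price", 300.0, 100.0, 11.0),
    PositionedText("Widget", 72.0, 116.0, 11.0),
    PositionedText("4.00", 300.0, 116.0, 11.0),
]

# Structured view of the same content: the table, its rows, and its cells
# are explicit nodes in a tree.
structured_page = StructureNode("table", children=[
    StructureNode("row", children=[
        StructureNode("cell", "Item"),
        StructureNode("cell", "Price"),
    ]),
    StructureNode("row", children=[
        StructureNode("cell", "Widget"),
        StructureNode("cell", "4.00"),
    ]),
])


def roles(node: StructureNode) -> List[str]:
    """Collect the role names of a structure tree in depth-first order."""
    collected = [node.role]
    for child in node.children:
        collected.extend(roles(child))
    return collected
```

In this sketch, a screen reader or reflow tool given `structured_page` could walk the tree directly, whereas a tool given only `fixed_format_page` would first have to recover the rows and cells from the coordinates.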
One solution to address such issues involves PDF accessibility tags, which circumvent this problem to some extent. However, the number of untagged PDFs (or poorly or otherwise inadequately tagged PDFs) present in the real world is so large that correctly “detecting” the structure of a PDF document remains a critical problem. Known techniques exist for automatic structure detection and creation of tags in untagged PDFs. The bounding box algorithm is a backend engine used for structure recognition for the PDF editing feature. However, while existing solutions for PDF structure recognition are effective on simple content types like paragraphs, lists, and bordered tables, they perform poorly in detecting complex types, such as borderless tables (also referred to interchangeably hereafter as ‘open tables’). To this end, the present disclosure relates to techniques for solving the challenging aspect of borderless table detection within fixed-format structured documents, such as a PDF document.