Document conversion is a popular way for users to leverage information across media platforms. For example, to use web page information in a report or presentation, a user may convert the web page to another type of document (e.g., a PDF document). The user is then left with an electronic document that includes the web page information and can be used for more versatile purposes. For instance, the user may use the converted electronic document to extract text, mine data, and otherwise leverage the information it contains.
Several problems exist, however, with conventional document conversion techniques. For example, conventional document conversion techniques are lossy and typically do not carry forward much of the structural and other semantic information incorporated into the web page. To illustrate, a typical web browser displays a web page based on underlying HyperText Markup Language (“HTML”). The underlying HTML includes semantic information that dictates how text and other elements are displayed (e.g., display positions, font size, pixel color). For example, as illustrated in FIG. 1A, the web page 102 includes elements such as a heading 104a, text 106a, and a bulleted list 108a. As shown in FIG. 1A, the heading 104a is bold with a large font size, while the text 106a has a smaller font size and different margins, and the bulleted list 108a has a different font and margins. These semantic attributes are generally conveyed in HTML as part of the HTML tags associated with each displayed element.
Additionally, the underlying HTML associated with a web page also organizes the semantic information associated with display elements into a structural hierarchy that dictates how the display elements relate to each other. For example, as shown in FIG. 1A, the HTML tags defining the heading 104a and the text 106a may be nested within a first style definition tag, while the bulleted list 108a may be nested within a second style definition tag. This structural hierarchy carries important semantic information about the associated web page 102.
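The structural hierarchy described above can be made concrete with a minimal sketch. The markup in `SAMPLE_HTML` is hypothetical and only loosely modeled on the web page 102 of FIG. 1A (two `div` containers stand in for the first and second style definition tags), and the names `OutlineParser` and `build_outline` are illustrative assumptions rather than part of any described system. The sketch uses Python's standard `html.parser` module to record each opening tag together with its nesting depth.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modeled on the web page 102 of FIG. 1A.
SAMPLE_HTML = """
<div class="intro">
  <h1>Heading</h1>
  <p>Body text.</p>
</div>
<div class="details">
  <ul><li>First item</li><li>Second item</li></ul>
</div>
"""


class OutlineParser(HTMLParser):
    """Records every start tag along with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) tuples in document order

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def build_outline(html):
    """Return the (depth, tag) outline for well-formed, fully closed markup."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


if __name__ == "__main__":
    # Print the hierarchy as an indented tree.
    for depth, tag in build_outline(SAMPLE_HTML):
        print("  " * depth + tag)
```

In this sketch the heading and paragraph share one parent container while the list sits under another, which is the kind of relationship the hierarchy conveys. The depth counter assumes every element has an explicit closing tag; void elements such as `br` would need separate handling.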
Due to the nature of HTML, however, conventional document conversion techniques do not carry through the semantic information conveyed in HTML tags or the associated structural hierarchy to a document converted from HTML. Thus, while the resulting converted document may include the text and other display elements from the web page, the relationships between the various text and other display elements are lost. For example, as shown in FIG. 1B, an example PDF document, PDF 110a, that results from the typical conversion of the web page 102 (e.g., as shown in FIG. 1A) includes the elements from the web page 102, but the structural hierarchy indicated within the tag hierarchy 112a is empty. This is because the structural relationships between the tags rendered into the PDF 110a have been lost to conventional document conversion techniques.
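As a rough analogy for this loss, and not a description of how any particular converter is implemented, the following sketch keeps only the text of an HTML fragment and discards every tag. The names `TextOnlyParser` and `flatten` and the sample markup are hypothetical. The output still contains all of the words, but nothing in it records which string was a heading, a paragraph, or a list item.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collects text content and ignores all tags (a deliberately lossy pass)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def flatten(html):
    """Return the text runs of the markup in document order, without structure."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    # Hypothetical markup: a heading, a paragraph, and a one-item list.
    sample = "<h1>Heading</h1><p>Body text.</p><ul><li>First item</li></ul>"
    print(flatten(sample))
```

The flat list of strings plays the role of the empty tag hierarchy 112a in FIG. 1B: the content survives, but the relationships between its parts do not.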
Accordingly, due to this loss of semantic information, the converted document fails to convey how groups of text and display elements relate to each other (e.g., due to loss of headings, rearranging of paragraphs and other text groupings), as well as the order in which the document should be read (e.g., due to loss of structural information that defined columns, paragraphs, margins, indents). As shown in FIG. 1B, because the semantic information of the original web page is lost, the resultant PDF is inaccurate and does not identify tables, paragraphs, lists, and similar elements. Furthermore, the resultant PDF does not indicate whether given text is part of a paragraph or an image caption in the web page. The lack of tagged content often leads to a poor user experience because the resultant PDF is not easily readable on screens of varying form factors, such as those of smartphones and smartwatches. It is also difficult to discern the author's intent and intended user experience from the resultant PDF.
Some conventional HTML-to-PDF generators have web capture capabilities. Such generators, however, are typically coded to a specific web browser or rendering engine and require recoding whenever that web browser or rendering engine is updated. Furthermore, conventional HTML-to-PDF generators typically have only limited tagging capabilities or require manual tagging of the resulting PDF.