With the increase in internet usage and applications, users now routinely search for and access information online. Information on the web is typically represented through electronic documents created using markup languages, which are easily accessible to users through a typical web browser. A typical markup language document is made of different types of content, for example, textual content, images, videos, etc., and carries syntax information that instructs a browser how to render each type of content to a user. The syntax information comprises a set of markup language tags that are interpreted by the browser. Furthermore, rendering of a document on a browser can be controlled, for example, by using cascading style sheets (CSS) that describe the formatting of a document written in a markup language. A CSS document is typically attached to, embedded in, or linked to a markup language document. The CSS defines how each element in the markup language document appears on the browser, for example, the font size of text, the color of a background or text, the position and alignment of content elements, etc.
Conventional markup language documents are typically displayed as continuous running documents without any page breaks. These continuous running documents are not print-friendly. A typical markup language document can accommodate a large amount of content, whereas a standard print-ready page has fixed dimensions, for example, 8.5″×11″, and margins that further reduce the space available for content during a print operation. The content has to be broken at two levels, that is, a horizontal level or page width and a vertical level or page height. The page width relates to a line break, and the page height relates to a page break. Content rendered on a browser can have loose lines, and spaces are often distributed in ways that make a page appear to have rivers of blanks flowing through the page. How the browser renders this content has to be understood in order to meaningfully interpret the content subsequently. Line breaks rendered by the browser can be discerned as belonging to four different types, namely, word space breaks (wsbr), soft hyphen breaks (wshbr), hard breaks (wbr), and para breaks (wsp). Word space breaks are discerned by finding which spaces are collapsed to a zero width. Such a zero-width space is then interpreted as the end of a line, that is, as a line break. Similarly, for manually introduced soft hyphens, if a line breaks at a soft hyphen, then the soft hyphen attains a non-zero width, which is also interpreted as the end of the line or as a line break. A hard line break can be discerned when an offset decrease is encountered. Any markup language content that falls outside the printing area therefore needs to be resized and repositioned so that no data is lost when a print operation is performed.
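The three line break rules described above can be illustrated with a minimal sketch. The Token record, its fields, and the function name are hypothetical and introduced only for illustration; the widths and offsets are assumed to have been measured from the rendered page beforehand. Because the first token of every new line also shows an offset decrease, the sketch assumes that a decrease counts as a hard break only when the preceding token was not already identified as a word space break or a soft hyphen break. Para breaks (wsp) are omitted because no detection rule for them is given here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    kind: str      # "word", "space", or "shy" (soft hyphen); hypothetical labels
    width: float   # rendered width, assumed to be measured beforehand
    offset: float  # left offset of the token within its rendered line


def classify_line_breaks(tokens: List[Token]) -> List[Tuple[int, str]]:
    """Return (index of the last token on a line, break type) pairs."""
    breaks: List[Tuple[int, str]] = []
    for i, tok in enumerate(tokens):
        # An offset decrease means this token starts a new line.
        if i > 0 and tok.offset < tokens[i - 1].offset:
            # Assumption: treat it as a hard break only if the previous
            # token was not already recorded as ending its line.
            if not breaks or breaks[-1][0] != i - 1:
                breaks.append((i - 1, "wbr"))
        if tok.kind == "space" and tok.width == 0:
            breaks.append((i, "wsbr"))    # word space collapsed to zero width
        elif tok.kind == "shy" and tok.width > 0:
            breaks.append((i, "wshbr"))   # soft hyphen became visible
    return breaks


# Illustrative measurements only; not taken from any real rendering.
sample_tokens = [
    Token("word", 30, 0), Token("space", 4, 30), Token("word", 30, 34),
    Token("space", 0, 64),                       # line 1 ends at a space
    Token("word", 20, 0), Token("shy", 3, 20),   # line 2 ends at a soft hyphen
    Token("word", 30, 0), Token("space", 4, 30), Token("word", 20, 34),
    Token("word", 25, 0),                        # offset drops with no space or hyphen
]
print(classify_line_breaks(sample_tokens))
# prints [(3, 'wsbr'), (5, 'wshbr'), (8, 'wbr')]
```

A limitation of the offset rule as sketched is that a hard break following a line whose only token sits at the left edge produces no offset decrease and would go undetected.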
One method for printing continuous running pages involves introducing page breaks at a vertical height equal to the height of a page of the printing media upon which the content is to be printed. The problem with introducing page breaks based only on the vertical height is that text lines and other content can be cut at the page boundary and are then printed in that split state. There are additional problems, for example, page numbers that are forced rather than based on the content, page layout issues on print media and on handheld devices, etc. Floats such as images and tables can split and spill across pages, and trying to avoid these splits can result in large vertical gaps, making the presentation undesirable.
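The splitting problem can be made concrete with a short sketch. It assumes that the content has already been measured as a sequence of block heights in whole points, and it uses a hypothetical printable height of 720 points (an 11″ page at 72 points per inch, less assumed 0.5″ top and bottom margins). The function name is illustrative.

```python
from typing import List


def straddling_blocks(heights: List[int], page_height: int) -> List[int]:
    """Return indices of blocks that a fixed-height page boundary cuts through."""
    cut: List[int] = []
    top = 0
    for i, height in enumerate(heights):
        bottom = top + height              # block occupies points [top, bottom)
        first_page = top // page_height
        last_page = (bottom - 1) // page_height if height > 0 else first_page
        if last_page > first_page:
            cut.append(i)                  # block starts on one page, ends on another
        top = bottom
    return cut


# Assumed printable height: 11 in * 72 pt/in - 2 * 36 pt margins = 720 pt.
PAGE_HEIGHT_PT = 720
block_heights = [300, 300, 200, 500, 200]  # illustrative values only
print(straddling_blocks(block_heights, PAGE_HEIGHT_PT))
# prints [2, 4]
```

In this example the third and fifth blocks each begin on one page and end on the next, which is the disruption described above whenever such a block is a text line, an image, or a table.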
Content in a document can be easily read by a computer when the content is marked up. In markup language documents, for example, hypertext markup language (HTML) documents, word spaces and line breaks are not explicitly tagged. The word spaces and the line breaks remain anonymous, for example, as generic word spaces and line breaks, and hence are difficult to read and interpret for accurate printing. With the advent of handheld devices, for example, smartphones, tablets, etc., there is a need for an optimized rendering of markup language documents, and hence the concept of a fluid page originated. The problems of non-print-friendly documents, page numbering, and page layout still exist in fluid pages. There is a need to bridge fluid web content and fixed-page typesetting for content that originates as fluid HTML, without a reference printer at the destination.
Markup language documents are typically interactive and dynamic in nature, whereas print is essentially static in nature. For example, hypertext markup language (HTML) documents contain free flowing or reflowing content. Images, paragraphs, videos, and other similar content are arranged in an HTML document as tags. HTML documents are adaptable to different devices. That is, if an HTML document is viewed in a web browser, then the HTML document adapts to the web browser and displays its content as per the specifications of the web browser. If this HTML document is viewed on a mobile browser of a mobile device, then the HTML document adapts to the specifications of the mobile browser. However, the HTML content is not suitable for printing. Since the HTML content is not fixed, a printer would interpret specific elements of the HTML content inaccurately and therefore print the HTML content inaccurately. While there are many transformation techniques and file formats, these file formats are not reversible and do not restore fluidity of the transformed markup language documents. One of the main reasons that the fluidity cannot be restored is that the page output in non-reversible file formats is defined graphically as a set of printer instructions at a glyph level, which loses structural information at a character level and a content level.
Markup language content and associated content elements are interpreted and defined using markup language tags on any standard web browser. The tags included in a markup language document are typically executed on a server or on a web browser. Scripts or tags that run directly on a web browser have lower latency than tags executed on the server side. Moreover, a server side execution of tags requires an active network connection, whereas a client side execution of web browser compatible tags runs without an active network connection. Most textual markup language documents are rendered in a client-server architecture, where there are delays and additional communication costs between a server and a user's client device for presenting and printing markup language documents. Pagination of a hypertext markup language (HTML) document involves partitioning content of the HTML document and presenting the partitioned content on individual pages. Conventional solutions paginate HTML documents based either on cut-off markers or on the number of items to be displayed per page. These solutions are typically implemented using server side technologies. There is a need for a client side implementation, and there have been a few attempts at client side pagination because of the improved performance that client side pagination can yield.
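The two conventional partitioning strategies mentioned above can be sketched as follows. This is a minimal illustration, not a description of any particular server side product; the function names and the marker string are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def paginate_by_count(items: Sequence[T], per_page: int) -> List[List[T]]:
    """Split items into pages holding at most per_page items each."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    return [list(items[i:i + per_page]) for i in range(0, len(items), per_page)]


def paginate_by_marker(items: Sequence[str], marker: str) -> List[List[str]]:
    """Start a new page wherever the cut-off marker appears; markers are dropped."""
    pages: List[List[str]] = [[]]
    for item in items:
        if item == marker:
            pages.append([])
        else:
            pages[-1].append(item)
    return pages


PAGE_MARKER = "<!--pagebreak-->"  # hypothetical cut-off marker
print(paginate_by_count(list(range(7)), 3))
# prints [[0, 1, 2], [3, 4, 5], [6]]
print(paginate_by_marker(["a", "b", PAGE_MARKER, "c"], PAGE_MARKER))
# prints [['a', 'b'], ['c']]
```

Neither strategy consults the rendered height of the content, so a page produced this way is not guaranteed to fit a fixed print area.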
U.S. Pat. No. 7,647,553 B2 provides a hypertext markup language (HTML) view template that allows the content of an HTML document to flow into a series of containers. This is performed by identifying the layout of the HTML document using view templates. In this method, a bottomless continuous running HTML page is taken and its content is positioned in a series of predefined containers within the display media. The content is flowed into the predefined containers. This method does not position footnotes on the same page as their respective footnote citations, which makes it difficult for a user to move between a citation and its footnote. This method also does not place floats proximate to their corresponding citations, which makes it difficult for the user to access the floats corresponding to the citations. Furthermore, this method does not address header and footer conversion issues.
U.S. Pat. No. 6,789,229 B1 addresses issues with pagination that involves processor intensive tasks. This method uses pagination techniques that involve determining reproducible pages followed by numbering individual pages based on hard breaks. This method requires a predetermined list of the hard breaks occurring in the document being processed, and building this list requires a lot of processing time before page numbers can be displayed. Therefore, there is a need for a faster and more efficient technique to process page numbers.
A publication by Hewlett-Packard Laboratories titled “Automatic Pagination of HTML Documents in a Web Browser” discloses automatic pagination of hypertext markup language (HTML) documents on the client side. The methods disclosed in this publication utilize a built-in library of JavaScript® functions in a browser and size attributes to format an HTML page. Pagination is performed through extensible stylesheet language transformation (XSLT). These pagination techniques render page numbers in tabs, which occupy more space as the number of pages grows. These methods do not handle page numbers when a print operation is initiated. Moreover, these methods do not position floats and footnotes on the same page where their respective citations reside. These methods transform a regular HTML page into individual pages with paginated tabs, but do not efficiently handle a journal or a novel style HTML page, which translates to hundreds or even thousands of individual pages.
The portable document format (PDF) of Adobe Systems Incorporated and the electronic publication (ePub®) format of Open eBook Forum DBA are two conventional file formats typically used in documentation. The portable document format is based on a fixed layout and does not support a fluid layout. Page numbers in the portable document format are forced and not based on the content. The ePub file format is designed with reflowable content, which can optimize text and graphics according to a display device. However, the ePub file format does not support headers and footers at a conversion stage, places floats at random locations, and does not proxy floats, for example, videos and long tables, to a linked source, thereby hindering the user experience.
Hence, there is a long felt but unresolved need for a computer implemented method and a file format transformation system deployed on a client device that transforms marked-up content in a first file format, for example, a hypertext markup language (HTML) format to a reversible second file format that can be stored offline, executed with less latency and without an active network connection on any browser on any operating system, and can be restored to a continuous page. Moreover, there is a need for a computer implemented method and a file format transformation system that implements document tagging of all content including spaces and line breaks to transform fluid pages to fixed pages that are print-friendly and provide a fixed page view that captures document elements, for example, line breaks, floats, footnotes or end notes, page numbers, headers and footers, captions, etc., which are expressed relationally and assigned page appropriate placement. Furthermore, there is a need for a computer implemented method and a file format transformation system that position floats and footnotes on the same page where their respective citations reside, support headers and footers at a conversion stage, place floats at appropriate locations, and proxy floats, for example, videos and long tables to a linked source, thereby enhancing the user experience.