Individuals, schools, and small and large companies all produce a tremendous amount of documentation, whether for personal view or public dissemination. Companies may have product manuals that accompany their products or employee handbooks for their employees. Schools may have course catalogs directed to students or graduate-level theses for publication to the public at large. Historically, these documents were maintained in a hard copy format stored in one or a number of locations for people to review when needed. However, with the growth of local and wide area networking, many companies recognized the value of converting paper documents into electronic documents. Electronic document systems were developed that managed large numbers of electronic documents that were converted from the hard copies.
Many of these documents may originally have been created with a word processor application, such as MICROSOFT CORPORATION's MS WORD™, COREL CORPORATION's WORDPERFECT™, or the like. When placed onto the electronic system, the documents may be in the original word processing format, such as MS WORD™ DOC, WORDPERFECT™ WPD, Rich Text Format (RTF), or the like, or may have been converted into a graphic-represented format, such as ADOBE SYSTEMS INC.'s Portable Document Format (PDF), MACROMEDIA, INC.'s FLASHPAPER™, or the like. Graphic-represented format documents are generally more universally accessible because they typically require only a viewer or player application, such as ADOBE SYSTEMS INC.'s ACROBAT™ READER, MACROMEDIA, INC.'s MACROMEDIA FLASH™ PLAYER, or the like. Thus, the graphic-represented document is viewable across any number of different platforms, as long as the platform is equipped with the appropriate viewer or player. Parties with access to a company's or school's local or wide area network were then able to view the documents on a computer screen without needing to have the hard copy or be at a location near the entity.
As the capabilities and reach of the Internet began to increase, it provided a more widely available delivery mechanism for such electronic documentation. Instead of needing access directly to the entities' networks, parties, whether employees, students, or simply the general public, may access virtually any entity's available information from almost any Internet access point. Entities now maintain intranet and Internet locations for parties to gain access to entity documentation using standard Web browsers either while directly connected to the entity's network or via an entity-sponsored Web server. While some of the legacy documentation imported to Internet-accessible locations from the early electronic online document systems remains in a graphic-represented format, some of the legacy documentation is also being converted directly into hypertext markup language (HTML) documents that may be viewed on standard Web browsers without requiring additional format-specific viewers or players. Accessing users are then able to browse through the documentation using the familiar Web browsing navigation paradigm.
Considering all of the available legacy documentation that an entity may wish to repurpose for use with an Internet-accessible electronic document system, applications may be used for converting the legacy documentation into HTML. Converting legacy documents, whether in word processing format or graphic-represented format, into HTML is a relatively simple task that may be automated by software logic. However, converting a 100-page manual into HTML will generally produce a single HTML document in which the user would have to use the scroll bars to access all 100 pages. While the information in these 100 pages is all there and available to the user, the user may have difficulty traversing the manual to find the information that he or she is seeking.
In order to address this undesirable trait, developers generally break up the converted legacy documents manually into a collection of separate HTML pages. Thus, users may navigate between the collection of Web pages that make up the entire legacy document, instead of scrolling through one very long Web page. However, the process of manually breaking such documents into separate HTML pages is very time consuming. Developers typically go through each legacy document and mark where the document should be broken up. While this process may not take particularly long for a short document, it is extremely tedious for large documents having hundreds or even thousands of document pages. Automated systems may insert a break in a document at specified points that correspond to single HTML pages. However, this systematic approach often breaks documents illogically (e.g., breaking just after the beginning of a new section or in the middle of a section as opposed to breaking on a major heading or sub-heading).
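The illogical breaks produced by purely size-based division can be illustrated with a minimal sketch. The function name `split_every_n`, the `HEADING:` marker, and the sample content are hypothetical and chosen only for illustration; they do not describe any particular conversion product.

```python
def split_every_n(paragraphs: list[str], n: int) -> list[list[str]]:
    """Divide a document into pages of at most n paragraphs each,
    ignoring what the paragraphs contain."""
    return [paragraphs[i:i + n] for i in range(0, len(paragraphs), n)]


# A toy document; the "HEADING:" prefix is only an illustrative marker.
doc = [
    "HEADING: Setup",
    "Unpack the unit.",
    "Connect the power cable.",
    "HEADING: Usage",
    "Press the start button.",
    "Press the stop button.",
]

pages = split_every_n(doc, 4)

# The first page ends with the "Usage" heading, while the text that the
# heading introduces begins the second page.
for number, page in enumerate(pages, start=1):
    print(f"page {number}: {page}")
```

In this sketch the break falls just after a heading, so the heading is separated from the section it introduces.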
Well-formatted legacy documents may be processed by a conversion application that automatically reads and analyzes the formatting to determine the more logical points in the document at which to break, e.g., before a major heading as opposed to just after a major heading. A legacy document may be well-formatted if it was created using standard styles from the native word processing application. However, in practice many legacy documents were created using ad hoc inline styling without regard to creating a well-formatted document. For example, instead of selecting to apply a Heading 1 style, the author would select a large font, bold the text, make the text all capital letters, and perhaps center it on the page. The result is a document that may have a well-formatted appearance, but which attained that formatting through individual, inline styling assigned by the author. Therefore, conversion applications that rely on well-formatted documents will fail to identify appropriate or logical break points because no defined style marks the headings.
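The failure described above may be sketched as follows. The `Paragraph` record, its fields, and the function `break_points_by_style` are hypothetical simplifications of a word processing document model, assumed here only to show why a style-name lookup finds nothing in an ad hoc formatted document.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    style_name: str = "Normal"  # named style, if the author applied one
    font_size: float = 12.0     # inline formatting chosen by the author
    bold: bool = False


def break_points_by_style(paragraphs, heading_style="Heading 1"):
    """Return the indices of paragraphs whose named style marks a major heading."""
    return [i for i, p in enumerate(paragraphs) if p.style_name == heading_style]


# Headings created with the standard Heading 1 style.
well_formatted = [
    Paragraph("Introduction", style_name="Heading 1"),
    Paragraph("Some body text."),
    Paragraph("Installation", style_name="Heading 1"),
    Paragraph("More body text."),
]

# The same headings created with a large, bold, all-capitals font instead.
ad_hoc = [
    Paragraph("INTRODUCTION", font_size=20.0, bold=True),
    Paragraph("Some body text."),
    Paragraph("INSTALLATION", font_size=20.0, bold=True),
    Paragraph("More body text."),
]

print(break_points_by_style(well_formatted))  # indices of the two headings
print(break_points_by_style(ad_hoc))          # empty: no named style to match
```

Both documents would look alike on the page, yet only the first exposes its headings to logic that relies on named styles.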
Another method that may be used to overcome this problem is to apply style sheet formatting, such as Cascading Style Sheets (CSS), to an HTML document and automatically break the document according to a particular rule or grouping of CSS style rules. For example, a conversion application may examine and analyze a CSS file applied to a particular HTML document and provide that HTML page breaks should occur before major headings, which may be stylized as a Heading 1 in HTML. Thus, when the page-break logic encounters a major heading, it will break the HTML document into a new HTML page.
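A rule-driven break of this kind may be sketched as follows, assuming for simplicity that the heading tag is passed in directly rather than derived from a CSS file and that the HTML is a simple fragment suitable for pattern matching. The function name `split_before_headings` is hypothetical.

```python
import re


def split_before_headings(html: str, tag: str = "h1") -> list[str]:
    """Split an HTML fragment into pages, starting a new page before each <tag>."""
    # A zero-width lookahead splits *before* the opening tag, so each
    # heading stays at the top of the page it introduces.
    pattern = re.compile(rf"(?=<{tag}\b)", re.IGNORECASE)
    return [part for part in pattern.split(html) if part.strip()]


html = "<p>Preface</p><h1>One</h1><p>First.</p><h1>Two</h1><p>Second.</p>"

for page in split_before_headings(html):
    print(page)
```

A production converter would more likely use an HTML parser than a regular expression; the pattern above is only meant to show the break being placed before, rather than after, each major heading.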
Using such style sheets for page division works only when a style sheet exists for the HTML document. If one does not already exist, a developer may convert a word processing document into an HTML document and then create a style sheet document to apply to the converted HTML. However, such a development process is typically very time intensive. Therefore, if no style sheets exist to leverage or the legacy document is not well-formatted, a developer may manually divide the legacy document, divide it by page size only, or use some combination of the two, none of which is a desirable process.