1. Field of the Invention
The present invention concerns generating a hypertext markup language (HTML) file based on an input bitmap image, and is particularly directed to automatic generation of an HTML file, based on a scanned-in document image, with the HTML file in turn being used to generate a Web page that accurately reproduces the layout of the original input bitmap image.
2. Description of the Related Art
In recent years, the popularity of the Internet has grown dramatically. One reason for such growth has been the widespread adoption of HTML (HyperText Markup Language), which is a language for describing document appearance, document layout and hyperlink specifications. It defines the syntax that describes the structure and the content of a document, including text, images, and other supported media. The language also provides connections among documents and other Internet resources through hypertext links and other hyperlinks. Using HTML, a Web page can be created which contains, in addition to bitmap images, graphic images, and text of various styles and sizes, hyperlinks which permit a viewer of the Web page to easily jump to another point within the page or to a completely different Web page, even one that is provided by a different server.
Once an HTML file is made available on the World Wide Web via a server, any client connected to the World Wide Web can access the page merely by typing the page address in the appropriate field of the client's browser. After the address has been entered, the browser requests that the server send the HTML file, which can contain text, references to graphic and bitmap image files, and formatting and hyperlink information for the entire page. Upon receipt of the HTML file, the browser automatically requests the graphic and bitmap image files referenced in the HTML file from the identified source.
To display the HTML file and the downloaded image files, the browser relies on HTML commands embedded in the HTML file. These commands are referred to as "tags". The tags indicate features or elements of a page and cause the browser to perform various functions, such as a particular type of formatting. HTML tags can be identified in HTML files by their syntax. That is, the tags are surrounded by left and right angle brackets, such as "<P>". In this case, "<" indicates the start of the HTML tag, "P" is the tag itself (here a tag indicating a new text paragraph), and ">" indicates the end of the tag. Often, tags come in pairs so as to indicate the start and end of a special function. The beginning tag initiates a feature (such as heading, bold, and so on), and the ending tag turns it off. Ending tags typically consist of the initiating tag name preceded by a forward slash (/). For example, <strong> and </strong>, surrounding text, will display the surrounded text more strongly than other text. Any additional words in a tag are attributes, sometimes with an associated value after an equal sign (=), which further define or modify the tag's actions.
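The tag syntax described above can be illustrated with a minimal sketch that lists the start tags, end tags, attributes and text found in an HTML fragment. It uses the `HTMLParser` class from Python's standard library; the class name `TagLister`, the function `list_tags` and the sample markup (including the file name `logo.gif`) are hypothetical and serve only as illustration. They are not part of any system described here.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Records each start tag, end tag and text run in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, {}))

    def handle_data(self, data):
        # Skip runs that contain only whitespace.
        if data.strip():
            self.events.append(("text", data.strip(), {}))


def list_tags(html):
    """Return the list of (kind, name_or_text, attributes) events."""
    parser = TagLister()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Hypothetical sample: a paired tag, a nested pair, and a tag with attributes.
    sample = '<P>Plain and <strong>strong</strong> text.</P> <IMG SRC="logo.gif" ALIGN=left>'
    for event in list_tags(sample):
        print(event)
```

In this sketch the paired tags appear as a "start" event followed later by a matching "end" event, while the additional words inside the IMG tag are reported as attribute names with their associated values.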
HTML 3.0 is presently the de-facto World Wide Web standard that defines permissible tags and nesting of tags. Approximately 100 different tags are permitted and defined.
Because of the complexity of HTML 3.0, as well as its cumbersome usage requirements, considerable effort is expended by the Web designer when authoring visually appealing and useful Web pages. For example, assume that an organization had good existing written marketing materials which it wanted to reproduce identically on a Web page. Even this seemingly simple task has typically required that a specialist spend a significant amount of time authoring HTML instructions by hand in an attempt to reproduce the layout and appearance of the written materials.
Several systems have been proposed that would automate this job of authoring HTML files from written documents. Xerox TextBridge Pro and Caere OmniPage Pro are examples of systems which scan in written documents and generate HTML files based on the scanned-in document image. These systems fail, however, to produce HTML files that accurately represent the layout, tables and images of the original written document. In particular, a major problem has been automatically generating HTML instructions for the case where the written document is arranged in columns, or, more generally, when regions in the original document are horizontally adjacent. The term "horizontally adjacent," when used with respect to two image blocks, means a situation where the vertical extents of the two blocks overlap, or, equivalently, where a horizontal line can be found which will intersect both blocks. Similarly, the term "vertically adjacent," when used with respect to two image blocks, means a situation where the horizontal extents of the two blocks overlap, or, equivalently, where a vertical line can be found which will intersect both blocks.
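The two adjacency definitions above reduce to interval-overlap tests on block bounding boxes. The following is a minimal sketch under stated assumptions: each block is an axis-aligned rectangle with inclusive pixel coordinates, the vertical coordinate increases down the page, and blocks that merely touch at an edge count as overlapping. The `Block` type, the function names and the sample coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Axis-aligned bounding box of an image block (inclusive coordinates)."""
    left: int
    top: int
    right: int
    bottom: int


def horizontally_adjacent(a: Block, b: Block) -> bool:
    # The vertical extents [top, bottom] overlap, so some horizontal
    # line intersects both blocks.
    return a.top <= b.bottom and b.top <= a.bottom


def vertically_adjacent(a: Block, b: Block) -> bool:
    # The horizontal extents [left, right] overlap, so some vertical
    # line intersects both blocks.
    return a.left <= b.right and b.left <= a.right


if __name__ == "__main__":
    # Hypothetical layout: a title at upper left, a subtitle at upper
    # right, and a text column below the title.
    title = Block(left=0, top=0, right=200, bottom=40)
    subtitle = Block(left=300, top=10, right=500, bottom=40)
    column = Block(left=0, top=60, right=150, bottom=400)

    print(horizontally_adjacent(title, subtitle))  # side by side
    print(vertically_adjacent(title, column))      # stacked
```

Under these assumptions, a title and subtitle placed side by side are horizontally adjacent but not vertically adjacent, while a title and the column beneath it are vertically adjacent but not horizontally adjacent.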
A typical example of the problems associated with such systems is illustrated by reference to FIGS. 1 and 2. FIG. 1 depicts an original printed document 10 to be converted into HTML format. As shown in FIG. 1, the original document has, among other features, title 1 in the upper left corner, subtitle 2 in the upper right corner, text columns 4, 5 and 6, picture 7 in the lower left corner, and footer 9 in the lower right corner.
FIG. 2 illustrates how a Web page 20 would be displayed on display 23 by a Web browser based on the HTML file generated by an existing system for converting bitmap images into HTML. Elements in FIG. 2 corresponding to those in FIG. 1 are numbered similarly to those in FIG. 1. Thus, after processing by the existing system, title 1 is reproduced as title 11. However, subtitle 2, rather than being reproduced in the upper right corner, is instead reproduced as element 12 in the upper left corner, just below title 11. Similarly, the entire text column structure of the original document has been eliminated, and picture 17, rather than being in the lower left corner, occupies the entire width of the page and is interposed between lines of text. Finally, footer 19 is reproduced in the lower left corner, instead of the lower right corner.
The above comparison shows the failure of commercially available systems to capture many layout and stylistic elements of the original document. Other known systems also too often miss important layout features. Accordingly, the problems caused by the complex and cumbersome nature of HTML are not adequately addressed by commercially available systems.