The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
In recent years, there have been growing efforts to digitize large quantities of printed content such as books and periodical issues and effectively distribute such content over the Internet.
This objective seems a reasonable one, since printed matter represents pure information and the Internet is an efficient means of distributing such information.
Even long before the creation of the Internet, attempts to digitize and electronically distribute large quantities of printed content were widespread. For decades, Project Gutenberg has been digitizing many thousands of classic books into text form and making them available for free download from major university computer sites. Database services such as Lexis-Nexis have digitized large portions of the archives of major newspapers and periodicals and made the articles available in searchable form to paying customers, originally through specialized computer terminals and more recently also through a subscription web site.
More recently, in late 2003, Amazon.com released a free web-based system containing over 100,000 readable, searchable books in electronic form, and Google and Yahoo have subsequently also announced plans to provide large numbers of books in digitized form. Several magazines have made their archives available over the Internet in a variety of forms, sometimes for free and sometimes on a subscription basis.
Yet despite this seemingly natural fit between the digitization of printed content and its distribution over the Internet, the general adoption and use of these systems has usually proven much less successful than originally expected. For example, the original announcement of Amazon's 100,000 searchable digitized books in late 2003 generated enormous media coverage, but subsequent attention has been quite scanty, seemingly indicating that the actual effective use of the system is considerably lower than was originally envisioned. Various magazines have also privately indicated that the use of their digitized archives is considerably below their original hopes and expectations.
One weakness of these existing digitization systems for printed content may center upon the inherent trade-offs required in the two different forms such digitization schemes usually take, namely the “text-based” and the “image-based”.
Under a text-based digitization system such as that of Lexis-Nexis or Project Gutenberg, the printed content of a book, magazine article, or newspaper story is converted into a stored file of digital characters, for display as HTML on a web page or in some other form. Character storage formats such as ASCII are used.
This type of digitization has the advantage of providing the content in a light-weight format, and hence is very convenient for use over the Internet, even via a non-broadband connection. Also, the text displayed is exact, searchable, and can be copied-and-pasted from the browser window into any other form.
However, this text-based form of digitization also has serious disadvantages. First, producing the text requires scanning the original printed content and then applying Optical Character Recognition (OCR) software to the scanned images. Although automatic OCR has steadily improved in quality, it still produces a noticeable rate of error, requiring subsequent manual correction of the text and therefore dramatically increasing the cost of the digitization process.
Also, the printed content of books and periodicals is frequently laid out on the page in a non-trivial and significant manner, and this layout is lost if the material is converted to pure text; furthermore, any colors, drawings, tables, or photographs are obviously lost as well.
In addition, such text-based content is seldom divided by the original pages. Instead, it is usually provided either as large blocks of text representing complete articles or chapters, or else divided in a somewhat arbitrary manner, with neither of these choices being ideal.
Finally, the 2001 ruling of the U.S. Supreme Court in New York Times Co. v. Tasini appears to prohibit newspapers or magazines from permitting their freelance articles to be republished in a different (e.g. text-based) format without the prohibitively difficult requirement of securing authorization from each and every individual writer, unless the newspapers or magazines had previously obtained such authorization by contract. This was one of the factors recently cited by the New Yorker magazine in preventing its own archives from being digitized into a text-based format.
By contrast, the other, increasingly popular form of digitization is based on the presentation of the exact, scanned images of the printed content, generally as binary image files in JPEG, TIFF, web-optimized PDF, or some other such format.
Although these binary image files require considerably more storage than pure text, most of the systems used allow the user to automatically retrieve only the page or two of material being examined rather than the complete contents of the entire book or periodical. Thus, instead of having to transmit the entire multi-megabyte PDF file of a book over the Internet, only a couple of pages are sent at a time, allowing even large books to be conveniently readable over a non-broadband connection.
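The page-at-a-time retrieval described above can be illustrated with a minimal sketch. The function names `pages_to_send` and `transfer_size`, the default window of two pages, and the sample byte sizes are all assumptions made for illustration and do not describe any particular existing system.

```python
def pages_to_send(total_pages: int, current_page: int, window: int = 2) -> list[int]:
    """Return the 1-based page numbers to transmit for the current view.

    Only `window` pages starting at `current_page` are selected, instead of
    every page of the publication (hypothetical policy for illustration).
    """
    if total_pages < 1 or window < 1:
        raise ValueError("total_pages and window must be positive")
    if not 1 <= current_page <= total_pages:
        raise ValueError("current_page is outside the publication")
    last_page = min(current_page + window - 1, total_pages)
    return list(range(current_page, last_page + 1))


def transfer_size(page_sizes: list[int], pages: list[int]) -> int:
    """Sum the byte sizes of the selected pages (page numbers are 1-based)."""
    return sum(page_sizes[page - 1] for page in pages)


# Assumed example: a 300-page book whose page images are 150,000 bytes each.
sizes = [150_000] * 300
selected = pages_to_send(total_pages=300, current_page=10)
partial_bytes = transfer_size(sizes, selected)   # bytes sent for the two pages viewed
full_bytes = sum(sizes)                          # bytes needed to send the whole book
```

Under these assumed figures, the two selected pages account for a small fraction of the bytes that the whole book would require, which is the reason such systems remain usable over a slow connection.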
Because the files are scanned binary images, the entire content of the original material can be preserved, including colors, layouts, drawings, and photographs. If the format used is text-embedded PDF, the binary images are also text-searchable, and software options may be selected to allow the user to extract any portions of the actual text through standard copy-and-paste operations.
Finally, presentation of the exact scanned images of all the pages of a publication, especially if constituted as a single PDF file, seemingly falls within the permissible bounds of the Tasini decision, and therefore may be authorized at the sole discretion of the original publisher.
Despite these major advantages to the use of image files, considerable difficulties remain. First, despite recent technological advances, binary image files are still considerably larger than regular HTML web pages, and many web users are reluctant to add links to them for fear of inconveniencing individuals who are limited to slow Internet connections. Second, the insertion of hyperlinks into the body of binary image files is either impossible or, in the case of PDF files, rather laborious, even though the latter format was actually developed partly to provide this exact capability. And once such hyperlinks are added to a PDF file, changing or modifying them in any way is almost as difficult. Probably for this reason, only a negligible fraction of the digitized printed content on the Internet based on binary images makes use of internal hyperlinks. And since the use of hyperlinks represents one of the most powerful and universal features of the Internet, largely sacrificing that capability is a huge weakness.
Furthermore, binary image files are static and fixed in their structure, and generally quite difficult to modify or manipulate. By contrast, the ubiquitous HTML web pages which dominate the Internet are flexible and easy to manipulate, and an unlimited number of such HTML pages can easily be generated from a single template file written in a web application language such as PHP or ColdFusion, with the dynamically derived web pages being determined by the particular Uniform Resource Locator (URL) selected and perhaps the changing values of a server database.
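As a rough illustration of how many HTML pages can be derived from a single template, the following sketch selects a record according to the requested URL and fills one template with it. Python is used here in place of PHP or ColdFusion, and the template, the `ARTICLES` dictionary standing in for a server database, the `id` query parameter, and the example URLs are all hypothetical.

```python
from html import escape
from string import Template
from urllib.parse import parse_qs, urlparse

# A single template from which every page is generated.
PAGE_TEMPLATE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><p>$body</p></body></html>"
)

# Hypothetical stand-in for a server database keyed by article id.
ARTICLES = {
    "1": {"title": "First Article", "body": "Text of the first article."},
    "2": {"title": "Second Article", "body": "Text of the second article."},
}


def render_page(url: str) -> str:
    """Build an HTML page from the template, driven by the URL's `id` value."""
    query = parse_qs(urlparse(url).query)
    article_id = query.get("id", [""])[0]
    record = ARTICLES.get(article_id)
    if record is None:
        record = {"title": "Not Found", "body": "No such article."}
    # Escape the stored values so they are safe to embed in HTML.
    return PAGE_TEMPLATE.substitute(
        title=escape(record["title"]),
        body=escape(record["body"]),
    )


first_page = render_page("http://example.com/article?id=1")
second_page = render_page("http://example.com/article?id=2")
```

Adding a record to the assumed `ARTICLES` store yields a new page without touching the template, which is the flexibility that a static binary image file lacks.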
The enormous contrast between the easy linking and flexibility of HTML web pages and the difficulty of applying such techniques to large binary image files, including electronic documents in Adobe Portable Document Format (PDF), probably helps account for the huge current dominance of the former throughout the Internet, and the relatively small amount of digitized printed content based on the latter.