The present invention relates to a system and a method for the automatic preparation and searching of microfilm-type materials, particularly for newspapers and magazines stored on microfilm or microfiche, the conversion of those documents to a digital format and storage of the information contained therein in searchable repositories.
As the Internet grows, many different types of Web sites are becoming connected and therefore are available to users. These Web sites may contain information which is of interest to users, such as news for example. Indeed, many Internet users today obtain at least a portion of their news information from Web sites which publish such information.
Traditional newspapers and other sources of news have therefore been forced to embrace the new medium represented by Web pages. Currently, many traditional (print) newspapers have Web sites which contain at least a portion of the news and information which is available through the print version of the newspaper. However, archived newspaper and magazine material, which is currently stored on microfilm, is not so readily accessible for publication through the Internet or any other type of network. Newspaper publishers, libraries and other repositories have huge amounts of information stored on microfilm. Such microfilm documents represent a huge asset, which cannot currently be properly used. The advantage of microfilm is that it preserves the appearance of the newspaper, magazine or other paper document, as well as the data contained therein. The disadvantage, of course, is that searching through microfilm archives for the information of interest is tedious and difficult. Furthermore, microfilm can only be read at one physical location, since the data cannot be transmitted over a network, for example. Thus, microfilm has a number of significant problems.
Attempts to provide a solution unfortunately have a number of drawbacks. For example, scanning the microfilm documents in order to be able to provide the data through a computer results in a number of errors during the process of OCR (optical character recognition). This process is required for the textual data to be electronically searchable; however, the resultant errors cause the final text to be difficult to search accurately. Correcting these errors manually is a tedious and expensive process, yet currently if these errors are not corrected, the resultant text may not be searchable.
A further attempt to provide searches for text with errors is the “fuzzy search” process, in which a requested keyword and variations on that keyword are all searched simultaneously. Unfortunately, this search method is ineffective for large databases, since too many irrelevant hits are retrieved.
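By way of non-limiting illustration only, the fuzzy search process described above may be sketched as follows. The sketch assumes that "variations" means every string one character edit away from the keyword; the function names and the small word index are hypothetical and are not taken from any particular product.

```python
import string

def variants(keyword):
    """Return the keyword plus every string one edit away (delete, substitute, insert)."""
    letters = string.ascii_lowercase
    splits = [(keyword[:i], keyword[i:]) for i in range(len(keyword) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return {keyword} | deletes | substitutes | inserts

def fuzzy_search(keyword, indexed_words):
    """Return every indexed word that equals the keyword or one of its variants."""
    return sorted(variants(keyword) & set(indexed_words))

# Hypothetical index: "maii" stands in for an OCR misreading of "mail".
index = ["mail", "maii", "nail", "main", "maid", "rail", "email", "market"]
hits = fuzzy_search("mail", index)
```

In this small index the search recovers the misread word "maii", but it also returns "nail", "main", "maid", "rail" and "email", none of which were requested. With a large database the proportion of such irrelevant hits grows accordingly.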
A more useful and efficient system for the automatic preparation and searching of scanned documents is disclosed in PCT Application No. IL01/00797m, by the present inventors and incorporated by reference as if fully set forth herein. In the disclosed system the probability of errors occurring during the preparation of the scanned documents is incorporated into the searching process.
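By way of non-limiting illustration only, one conceivable way of incorporating error probabilities into a search is sketched below. This is not the method of the cited application, whose details are not reproduced here; the confusion table, the probability values, the threshold and the function names are all assumptions made for the example.

```python
# Hypothetical probabilities that OCR outputs `seen` when the page shows `true`.
CONFUSION = {("l", "i"): 0.05, ("l", "1"): 0.04, ("m", "n"): 0.02}
CORRECT = 0.9     # assumed probability that a character is read correctly
UNLIKELY = 0.001  # assumed probability of any confusion not listed above

def match_probability(query, ocr_word):
    """Estimate how likely it is that `ocr_word` is an OCR reading of `query`."""
    if len(query) != len(ocr_word):
        return 0.0  # this sketch ignores inserted or dropped characters
    p = 1.0
    for true, seen in zip(query, ocr_word):
        if true == seen:
            p *= CORRECT
        else:
            p *= CONFUSION.get((true, seen), UNLIKELY)
    return p

def ranked_search(query, indexed_words, threshold=0.02):
    """Return indexed words whose match probability meets the threshold, best first."""
    scored = [(match_probability(query, w), w) for w in indexed_words]
    return [w for p, w in sorted(scored, reverse=True) if p >= threshold]

index = ["mail", "maii", "nail", "main", "maid", "rail"]
results = ranked_search("mail", index)
```

Under these assumed values, only the exact word and the variant produced by a likely OCR confusion pass the threshold, while the other one-letter variations are rejected.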
An even more useful solution would provide a complete system for the automatic preparation of a repository of searchable files from archived material. Furthermore such a solution should also be cost effective, operate at least semi-automatically, and also permit access to archived material, and in particular microfilm documents, through an electronic interface. Unfortunately, such a solution is not currently available.
The background art does not teach or suggest a system or a method for automating the conversion of microfilm data to a digital format, and the creation of searchable data repositories from the converted digital data. The background art also does not teach or suggest a system and method for enabling users to access the data repositories through a network such as the Internet. The background art also does not teach or suggest a cost effective, at least semi-automatic method for converting microfilm data to a form which can be readily accessed through an electronic interface.
The present invention overcomes these deficiencies of the background art by providing a system and a method for automatically converting microfilm data into repositories of data in a digital format which may be easily accessed by a user across a network such as the Internet. First, preferably a planning phase is performed, in which the production parameters are set depending on a number of conditions such as the nature of the material and the requirements of the customer. Next, preferably data from scanned microfilm reels goes through a preparation phase in which the scanned reels are subdivided. For example, a microfilm reel of a newspaper would be subdivided into one or more issues, each of which would be saved in a separate data file. Once the files are extracted from the reel, a profile is preferably prepared and jobs are generated. Each file is preferably assigned its own job. The “Automatic Processing” phase executes the generated jobs. As a result, every file optionally and preferably undergoes the following automatic processing stages: combining files; analyzing image layout; segmentation; OCR; optional segmentation improvement; and output to XML. In the last stage, the data contained in the files is preferably extracted and then more preferably transmitted to the relevant repository unit.
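By way of non-limiting illustration only, the job generation and automatic processing phases described above may be sketched as follows. The stage names and their order are taken from the text; the class name, function names, file names and the flag controlling the optional stage are hypothetical, and the stage bodies are placeholders that merely record which stages were run.

```python
from dataclasses import dataclass, field

# Stage names in the order given in the text.
STAGES = [
    "combining files",
    "analyzing image layout",
    "segmentation",
    "OCR",
    "segmentation improvement",  # described as optional
    "output to XML",
]

@dataclass
class Job:
    """One job per file extracted from a scanned reel."""
    file_name: str
    completed: list = field(default_factory=list)

def generate_jobs(issue_files):
    """Assign each extracted file its own job."""
    return [Job(name) for name in issue_files]

def run_job(job, improve_segmentation=True):
    """Run the stages in order, skipping the optional one if it is not requested."""
    for stage in STAGES:
        if stage == "segmentation improvement" and not improve_segmentation:
            continue
        job.completed.append(stage)  # placeholder for the real stage work
    return job

# Hypothetical file names for two issues extracted from one reel.
jobs = generate_jobs(["issue_001.tif", "issue_002.tif"])
finished = [run_job(job, improve_segmentation=False) for job in jobs]
```

Keeping the stages in an ordered list allows the optional stage to be switched on or off per job, for example according to parameters set in the planning phase.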
According to more preferred embodiments of the present invention, the system is capable of managing more than one conversion project at any one time, with each project containing one or more publications. Each publication is preferably divided into one or more collections, and a search index is produced for each collection in order to make archived issues accessible through these search indexes.
Hereinafter, the term “network” refers to a connection between any two or more computational devices which permits the transmission of data.
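By way of non-limiting illustration only, the arrangement of projects, publications and collections, with one search index per collection, may be sketched as follows. The class names, the simple word-to-issue index and the sample data are assumptions made for the example and do not describe the actual index format.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Collection:
    """A group of issues with its own search index."""
    name: str
    issues: dict = field(default_factory=dict)  # issue id -> extracted text
    index: dict = field(default_factory=lambda: defaultdict(set))  # word -> issue ids

    def add_issue(self, issue_id, text):
        self.issues[issue_id] = text
        for word in text.lower().split():
            self.index[word].add(issue_id)

    def search(self, word):
        """Return the ids of the issues in this collection that contain the word."""
        return sorted(self.index.get(word.lower(), set()))

@dataclass
class Publication:
    title: str
    collections: dict = field(default_factory=dict)  # collection name -> Collection

@dataclass
class Project:
    name: str
    publications: dict = field(default_factory=dict)  # title -> Publication

# Hypothetical project with one publication and one collection.
project = Project("demo project")
paper = Publication("Daily Example")
project.publications[paper.title] = paper
decade = Collection("1950-1959")
paper.collections[decade.name] = decade
decade.add_issue("1950-01-01", "Harbour bridge opens")
decade.add_issue("1950-01-02", "Bridge toll announced")
result = decade.search("bridge")
```

Because each collection carries its own index, a search can be limited to one collection without consulting the others.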
Hereinafter, the term “computational device” includes, but is not limited to, any type of computer operating according to any type of hardware and/or operating systems; or any device, including but not limited to: laptops, hand-held computers, PDA (personal digital assistant) devices, cellular telephones, any type of WAP (wireless application protocol) enabled device, wearable computers of any sort, or any other device which has an operating system.
For the present invention, a software application could be written in substantially any suitable programming language, which could easily be selected by one of ordinary skill in the art. The programming language chosen should be compatible with the computational device according to which the software application is executed. Examples of suitable programming languages include, but are not limited to, C, C++ and Java.
In addition, the present invention could be implemented as software, firmware or hardware, or as a combination thereof. For any of these implementations, the functional steps performed by the method could be described as a plurality of instructions performed by a data processor.
Hereinafter, the term “Web browser” refers to any software program which can display text, graphics, or both, from Web pages on World Wide Web sites. Hereinafter, the term “Web server” refers to a server capable of transmitting a Web page to the Web browser upon request.
Hereinafter, the term “Web page” refers to any document written in a mark-up language including, but not limited to, HTML (hypertext mark-up language) or VRML (virtual reality modeling language), dynamic HTML, XML (extensible mark-up language) or XSL (extensible stylesheet language), or related computer languages thereof, as well as to any collection of such documents reachable through one specific Internet address or at one specific World Wide Web site, or any document obtainable through a particular URL (Uniform Resource Locator). Hereinafter, the term “Web site” refers to at least one Web page, and preferably a plurality of Web pages, virtually connected to form a coherent group.
Hereinafter, the phrase “display a Web page” includes all actions necessary to render at least a portion of the information on the Web page available to the computer user. As such, the phrase includes, but is not limited to, the static visual display of static graphical information, the audible production of audio information, the animated visual display of animation and the visual display of video stream data.
Hereinafter, the term “microfilm-type material” includes, but is not limited to, microfilm and microfiche.