1. Field of the Invention
The present invention relates generally to correcting the text found within documents, and more specifically to correcting Uniform Resource Identifiers (URIs) within noisy documents.
2. Description of the Related Art
Optical scanners and optical character recognition (OCR) systems have become widely used for transferring written textual material into an electronic format. Optical scanners are electromechanical devices that systematically gather information about the relative brightness of light reflected from small defined areas of a page of paper. The data gathered by the optical scanner about the page are typically transferred to a computer in the form of a digital-graphical document representing the information gathered from the page. The OCR system calculates which alphanumeric characters are most likely represented by the digital-graphical data and creates a digital-textual document which represents the original document. In digital-textual documents, the text can be easily edited, and the formatting of the document can be easily manipulated.
OCR systems are not perfect, and the digital-textual document typically contains errors when compared to the original document. This is particularly true for original documents containing Uniform Resource Identifiers (URIs). A URI is a standardized string of characters that provides resource identification and location information for resource discovery and access in a computer network environment, particularly on the internet. A URI is frequently far longer than most words, often uses words not found in a dictionary for its textual components, and uses numerous punctuation marks and specialized characters within its structure. These features make a URI particularly susceptible to errors in OCR systems.
URI errors are becoming more common because the increasing popularity of the internet means that URIs appear more often in documents. It is also becoming more common for computer applications to automatically recognize URIs within digital-textual documents and to turn those URIs into hyperlinks. Hyperlinks are locations within a digital document that, when selected by the user of the computer displaying the digital document, automatically retrieve the resource found at a specific address on a computer or a computer network. When a URI has been turned into a hyperlink, selecting the hyperlink retrieves the resource at the address described by the URI.
A URI must be error free in order to function properly as a hyperlink. The applications available today assume that the URI they are provided is complete and accurate. Furthermore, if errors do occur in the URI, they can be easily missed when the digital document is reviewed. Often such URI errors are not discovered until someone tries unsuccessfully to retrieve the resource defined by the URI.
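By way of illustration only, the behavior of such applications can be sketched as follows. The sketch is not taken from any particular application; the pattern `URI_PATTERN`, the function `linkify`, and the sample addresses are hypothetical, and the pattern is a deliberately simplified assumption rather than the full URI syntax. It shows a recognizer that turns whatever matches into a hyperlink without checking that the text is the address originally printed.

```python
import re
from html import escape

# Simplified, assumed pattern: a scheme, "://", then non-whitespace characters.
# Real URI syntax is broader; this is only enough to illustrate the behavior.
URI_PATTERN = re.compile(r"\b(?:https?|ftp)://[^\s<>\"]+")

def linkify(text: str) -> str:
    """Wrap each recognized URI in an HTML anchor element.

    The text is taken as-is: nothing here verifies that the URI is
    complete or that it matches the original printed document.
    """
    def to_anchor(match: re.Match) -> str:
        uri = match.group(0)
        return f'<a href="{escape(uri, quote=True)}">{escape(uri)}</a>'
    return URI_PATTERN.sub(to_anchor, text)

# Clean text: the URI is recognized and linked.
clean = "See http://www.example.com/docs for details"

# Hypothetical OCR output in which each "m" was misread as "rn":
# the damaged URI is still linked, but to the wrong address.
misread = "See http://www.exarnple.corn/docs for details"

# Hypothetical OCR output with a spurious space inside "//":
# the pattern no longer matches, so no hyperlink is created at all.
split = "See http:/ /www.example.com for details"

for sample in (clean, misread, split):
    print(linkify(sample))
```

In this sketch the second sample produces a well-formed hyperlink that points to an address the original document never contained, and the third produces no hyperlink. Neither failure is visible from the structure of the output alone, which is consistent with such errors going unnoticed until retrieval is attempted.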
From the foregoing it will be apparent that there is a need for a way to test and correct URIs found within the digital-textual documents created by OCR systems or other systems that reproduce text less than 100 percent accurately.