1. Field of the Invention
The present invention generally relates to a method and system for automated extraction of information from human readable sources, and more particularly to a method and system for handwritten text recognition by unknown writers from documents carrying such text. In an exemplary embodiment, the present invention relates to a method and system for automatically extracting and recognizing handwritten information from legal instruments (e.g., checks).
2. Description of the Related Art
Typically, a check is made by a payer (Pa(i)) to a payee, or a recipient (Re(j)). The check is made on an account that the payer has at a bank (Ban(Pa(i))). This means that the check is drawn on the bank (Ban(Pa(i))).
Checks that arrive at a business named as the recipient are usually stamped on the back by that business (Bus(k)(=Re(j))). The business will then deposit these checks at its bank (Ban(Bus(k))). The business may use several different banks, so the checks may be deposited in several different banks.
The business' bank (Ban(Bus(k))) regularly (e.g., in most countries, every working/business day) bundles together all of the checks that it receives and that are drawn on each individual bank. Then, the business' bank (Ban(Bus(k))) sends to the payer's bank (Ban(Pa(i))) all of the checks drawn from accounts on that bank. Therefore, the payer's bank receives the checks from a particular payee in batches or strings of checks.
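The bundling step described above can be illustrated with a minimal sketch. The record type `DepositedCheck`, the function `bundle_by_payer_bank`, and the sample identifiers are hypothetical names introduced only for illustration; they are not part of any bank's actual system. The sketch assumes each deposited check already carries an identifier for the bank on which it is drawn.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DepositedCheck:
    """Hypothetical record for one check deposited by a business."""
    payer: str       # Pa(i)
    payer_bank: str  # Ban(Pa(i)), the bank the check is drawn on
    payee: str       # Re(j), here equal to Bus(k)


def bundle_by_payer_bank(checks):
    """Group the checks received in one business day by the bank they are drawn on."""
    bundles = defaultdict(list)
    for check in checks:
        bundles[check.payer_bank].append(check)
    return dict(bundles)


# Illustrative data only: three checks deposited by the same business.
deposits = [
    DepositedCheck("payer_1", "bank_A", "business_1"),
    DepositedCheck("payer_2", "bank_B", "business_1"),
    DepositedCheck("payer_3", "bank_A", "business_1"),
]
bundles = bundle_by_payer_bank(deposits)
```

Each value in `bundles` corresponds to one batch that the business' bank would send to a single payer's bank, which is why the payer's bank sees checks arrive in batches.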
The payer's bank (Ban(Pa(i))) may want to capture some information from these checks. Such data capture is difficult to perform quickly because most data added by payers on checks, such as the payee's name, date, amount, comments, etc., is handwritten. Generally, it is difficult for a bank to capture handwritten information automatically from a check. Some payers use stamps to add payee data to a check. However, even stamps are often obscured by superimposed stamps or writing, and are often not placed in a systematic way.
Most banks convert received checks from their analog form to a digital form, in particular to allow data to flow and to be stored, retrieved, etc., using electronic means of storage, search, communication, and other aspects of check handling. The information that the payer's bank or other entities may wish to obtain can be extracted from the checks, either when they are handled in paper form, or when they are transformed into an image.
Checks are very familiar objects to most adults in a country like the United States, where they are still commonly used. The following description will be directed to checks from the United States. However, most, if not all, of what is described applies equally to checks from most countries. FIG. 1 illustrates a front view of a standard American check and FIG. 2 illustrates a rear view of a standard American check. There are several distinctive fields on the check, which are described below.
Referring to FIG. 1, the MICR line (X) 101 is a relatively long number usually located on the bottom left of the front of the check. The MICR line 101 consists of the branch number, the account number, and the check number for that account. The check number 102 itself is repeated, usually on the upper right corner of the front of the check 100. The name and address 103 of the account owner (e.g., an individual or a company) is usually on the upper left of the front of the check 100. The name and address field 103 may also include a telephone number, and/or some other identifying numbers in the case of a corporation.
The check 100 also includes a number of different fields for writing or stamping additional information that is particular to the check being written. The fields for inputting information include the date that the check is written 104, the payee's name (individual or business) 105, the numerical amount (or courtesy amount) 106, and the written amount (or official amount) 107. Additionally, the front of the check 100 includes a signature field 108 where the payer signs the check 100. Also, the front of the check 100 includes a memo line 111, which is a field for the payer to write what the check is being used in payment for or to include any other pertinent information, such as an account number.
The front of the check 100 also provides information describing the payer's bank. Specifically, the front of the check 100 includes the name and address of the bank 109 and an identifying logo 110 of the bank. The check 100 may also include a notice 112 that the check is equipped with counterfeiting adverse features. Specific details of the features will be defined on the back of the check.
Referring to FIG. 2, the back of the check includes an area 113 for the payee to endorse the check. Also, the back of the check may include the specific details of the counterfeiting adverse features 114, as indicated on the front of the check (see 112), which include instructions to reject the check if some of these features are compromised.
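The fields described with reference to FIGS. 1 and 2 can be summarized as a simple lookup structure. The following is a minimal sketch only: the type `CheckField`, the list `CHECK_FIELDS`, the helper `fields_entered_by`, and the `entered_by` labels are hypothetical and reflect one reading of the description above (fields filled in by the payer or payee versus fields that are preprinted), not a definitive data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckField:
    """Hypothetical descriptor for one field of the check 100."""
    numeral: int     # reference numeral used in FIGS. 1 and 2
    name: str
    side: str        # "front" or "back"
    entered_by: str  # "printed", "payer", or "payee" (assumed labels)


CHECK_FIELDS = [
    CheckField(101, "MICR line", "front", "printed"),
    CheckField(102, "check number", "front", "printed"),
    CheckField(103, "account owner name and address", "front", "printed"),
    CheckField(104, "date", "front", "payer"),
    CheckField(105, "payee name", "front", "payer"),
    CheckField(106, "numerical (courtesy) amount", "front", "payer"),
    CheckField(107, "written (official) amount", "front", "payer"),
    CheckField(108, "signature", "front", "payer"),
    CheckField(109, "bank name and address", "front", "printed"),
    CheckField(110, "bank logo", "front", "printed"),
    CheckField(111, "memo line", "front", "payer"),
    CheckField(112, "counterfeiting features notice", "front", "printed"),
    CheckField(113, "endorsement area", "back", "payee"),
    CheckField(114, "counterfeiting features details", "back", "printed"),
]


def fields_entered_by(who):
    """Return the reference numerals of the fields filled in by the given party."""
    return [f.numeral for f in CHECK_FIELDS if f.entered_by == who]


payer_fields = fields_entered_by("payer")
payee_fields = fields_entered_by("payee")
```

Under these assumed labels, the payer-entered fields are the ones most likely to contain handwritten or stamped information.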
While most of the world is moving away from checks (although at a rather slow pace, about a 4% decrease per year in England, for instance), the use of checks in the United States remains extremely high. In fact, even in countries where overall check traffic has significantly decreased, there are businesses that still handle an increasing number of checks. For example, in the United States in 1993, checks represented 80% of the non-cash transaction volume for only 13% of the transaction value, with an average value per transaction of $1,150. Hence, while the use of checks has been declining in some countries, it is still increasing in others.
Checks have been chosen as one example of documents that carry information that can be used for purposes other than the intended use of the document carrying the information. Some of the potentially useful information written on a check (taken as an example of a document) is handwritten (or poorly printed) by a person whose handwriting is unknown, in the sense that automated recognition has not been trained on it. Handwriting on a check is typically so poor that current image recognition machines cannot decipher the content, nor is it expected that the next few generations of machines will be able to do so.
As shown in FIGS. 1 and 2, at least a portion of the handwritten or stamped information provided on the check refers to the payee of the check. It is often important for the payer bank to track to whom its customers are writing checks.
Currently, the conventional processes for information extraction typically use either manual, human extraction methods or pure image recognition procedures. The manual, human extraction methods involve actual human, visual review of the checks, which is extremely slow and inefficient. The pure image recognition procedures, while automated, are extremely inaccurate. Indeed, some of these procedures provide accuracy rates of less than approximately thirty percent. Furthermore, there are no conventional information extraction processes that are directed to extracting payee information from a check.
Furthermore, the average volume of incoming checks to a bank cannot be managed by the conventional methods. That is, the average volume of incoming checks for a bank is in the range of millions of checks per day. The inefficient, often human-based, conventional methods cannot manage a volume this large.
Additionally, because of the large number of checks that need to be analyzed each day, a bank using one of the conventional methods will typically have a large number of employees processing the checks. Having a large number of employees handle the checks reduces the privacy provided to the bank's customers and increases security risks.