1. Field of the Invention
The present invention generally relates to a method and system for the automated extraction of information from human-readable sources, and more particularly to a method and system for discovering and delineating, within a collection of documents generated or customized by unknown sources, subsets that share common semantic features when those features are unknown prior to examining the documents. In an exemplary embodiment, the present invention finds such subsets within a plurality of documents in cases where the documents may partially or fully include human-created analog indicia (e.g., handwritten, spoken, etc.) and where standard automatic recognition techniques are inadequate.
2. Description of the Related Art
Typically, a check is made by a payer (Pa(i)) to a payee, or a recipient (Re(j)). The check is made on an account that the payer has at a bank (Ban(Pa(i))). This means that the check is drawn on the bank (Ban(Pa(i))).
Checks that arrive at a business named as the recipient are usually stamped on the back by that business (Bus(k)(=Re(j))). The business then deposits these checks at its bank (Ban(Bus(k))). A business may use several different banks, so its checks may be deposited in several different banks.
The business' bank (Ban(Bus(k))) regularly (e.g., in most countries, every working/business day) sorts the checks that it receives into bundles according to the bank on which each check is drawn. Then, the business' bank (Ban(Bus(k))) sends to each payer's bank (Ban(Pa(i))) all of the checks drawn on accounts at that bank. Therefore, the payer's bank receives the checks from a particular payee in batches or strings of checks.
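The bundling step described above can be illustrated with a minimal sketch. The names `DepositedCheck` and `bundle_by_drawee_bank` are hypothetical and chosen only for illustration; the sketch assumes that the bank on which each check is drawn is already known (for example, from the machine-readable line on the check) and is not meant to represent any particular bank's actual procedure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DepositedCheck:
    """Hypothetical record for one check deposited at Ban(Bus(k))."""
    payer: str         # Pa(i)
    payer_bank: str    # Ban(Pa(i)), the bank on which the check is drawn
    payee: str         # Re(j), here the depositing business Bus(k)
    amount_cents: int


def bundle_by_drawee_bank(deposits):
    """Group one day's deposits by the bank on which each check is drawn.

    Deposit order is preserved inside each bundle, which is why checks
    made out to the same payee tend to reach the payer's bank in strings.
    """
    bundles = defaultdict(list)
    for check in deposits:
        bundles[check.payer_bank].append(check)
    return dict(bundles)
```

Each value in the returned dictionary corresponds to one bundle that the business' bank would send to a single payer's bank.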
The payer's bank (Ban(Pa(i))) may want to capture some information from these checks. Such data capture is difficult to perform quickly because most data added by payers on checks, such as the payee's name, date, amount, comments, etc., is handwritten. Generally, it is difficult for a bank to capture handwritten information automatically from a check. Some payers use stamps to add payee data to a check. However, even stamps are often obscured by superimposed stamps or writing, and are often placed in ways that are not systematic.
Most banks convert received checks from their analog form to a digital form, in particular to allow data to flow and to be stored, retrieved, etc., using electronic means of storage, search, communication, and other aspects of check handling. The information that the payer's bank or other entities may wish to obtain can be extracted from the checks, either when they are handled in paper form, or when they are transformed into an image.
Checks are very familiar objects to most adults in modernized countries like the United States where they are still commonly used. The following description will be directed to checks from the United States. However, most if not all of what is described applies equally to checks from most countries. FIG. 1 illustrates a front view of a standard American check, and FIG. 2 illustrates a rear view of a standard American check. There are several distinctive fields on the check, which are described below.
Referring to FIG. 1, the MICR line (X) 101 is a relatively long number usually located on the bottom left of the front of the check. The MICR line 101 includes the branch number, the account number, and the check number for that account. The check number 102 itself is repeated, usually on the upper right corner of the front of the check 100. The name and address 103 of the account owner (e.g., an individual or a company) is usually on the upper left of the front of the check 100. The name and address field 103 may also include a telephone number, and/or some other identifying numbers in the case of a corporation.
The check 100 also includes a number of different fields for writing or stamping additional information that is particular to the check being written. The fields for inputting information include the date that the check is written 104, the payee's name (individual or business) 105, the numerical amount (or courtesy amount) 106, and the written amount (or official amount) 107. Additionally, the front of the check 100 includes a signature field 108 where the payer signs the check 100. Also, the front of the check 100 includes a memo line 111, which is a field for the payer to write the purpose of the payment or to include any other pertinent information, such as an account number.
The front of the check 100 also provides information describing the payer's bank. Specifically, the front of the check 100 includes the name and address of the bank 109 and an identifying logo 110 of the bank. The check 100 may also include a notice 112 that the check is equipped with counterfeiting adverse features. Specific details of the features will be defined on the back of the check.
Referring to FIG. 2, the back of the check includes an area 113 for the payee to endorse the check. Also, the back of the check may include the specific details of the counterfeiting adverse features 114, as indicated on the front of the check (see 112), which include instructions to reject the check if some of these features are compromised.
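For illustration only, the fields described with reference to FIGS. 1 and 2 can be gathered into a simple record. The class and attribute names below are hypothetical; the comments tie each attribute to the corresponding reference numeral. The sketch assumes that pre-printed fields are available as text, while fields filled in by the payer or payee may be unread (`None`) when automatic recognition fails.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MicrLine:
    """Contents of the MICR line 101, as described above."""
    branch_number: str
    account_number: str
    check_number: str


@dataclass
class CheckRecord:
    # Pre-printed fields, assumed to be readable.
    micr: MicrLine                          # 101
    check_number: str                       # 102
    owner_name_address: str                 # 103
    bank_name_address: str                  # 109
    # Fields written or stamped by a person; None means "not recognized".
    date: Optional[str] = None              # 104
    payee_name: Optional[str] = None        # 105
    courtesy_amount: Optional[str] = None   # 106
    written_amount: Optional[str] = None    # 107
    memo: Optional[str] = None              # 111
    endorsement: Optional[str] = None       # 113

    def check_number_matches_micr(self) -> bool:
        """Field 102 repeats the check number carried in the MICR line 101."""
        return self.check_number.lstrip("0") == self.micr.check_number.lstrip("0")

    def unread_fields(self) -> List[str]:
        """Names of the human-entered fields that could not be captured."""
        names = ("date", "payee_name", "courtesy_amount",
                 "written_amount", "memo", "endorsement")
        return [name for name in names if getattr(self, name) is None]
```

A record whose `unread_fields()` list contains `payee_name` corresponds to the situation discussed in this section, in which the payee cannot be determined from the check by automatic recognition alone.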
While most of the world is moving away from checks (although at a rather slow pace; about a 4% decrease per year in England, for instance), the use of checks in the United States remains extremely high. In fact, even in countries where overall check traffic has decreased significantly, there are businesses that still handle an increasing number of checks. For example, in the United States in 1993, checks represented 80% of the non-cash transaction volume but only 13% of the transaction value, with an average value per transaction of $1,150. Hence, while the use of checks has been declining in some countries, it is still increasing among some businesses.
Checks have been chosen as one example of documents that carry information that can be used for purposes other than the intended use of the document carrying the information. Some of the potentially useful information written on a check (taken as an example of a document) is handwritten (or poorly printed) by a person whose handwriting is unknown, in the sense that automated recognition has not been trained on it. The handwriting on a check is typically so poor that current image recognition machines cannot decipher the content, nor is it expected that the next few generations of machines will be able to decipher the content.
There is a need for a process that allows a bank, or other document handling institution, to discover significant subsets of documents in a collection of documents where the common distinguishing features shared by the documents in the significant subset are not known prior to discovering the significant subset. For example, there is a need for a process that will allow a bank to find a large number of checks written to a specific payee where the payee, and any information regarding the payee, is not known prior to discovering the subset of checks written to the payee. Currently, there are no methods or systems in existence that allow a document handler to discover such subsets of documents in a collection of documents.