To make a loan to a borrower, the mortgage banking industry faces the daunting task of organizing, inputting and accessing a vast number and variety of documents, and of manually entering several hundred fields of information from a subset of these documents. Many attempts have been made to streamline the process, most recently by the Mortgage Bankers Association (MBA), which established standards for representing information in a mortgage transaction. However, the problem of identifying and capturing information from paper documents, image files, native PDF files, and other electronic files in the loan origination process has yet to be solved, and until it is, these standards cannot be used to full advantage. In the United States alone, mortgage bankers are faced with the idiosyncratic documents of at least fifty states, where some mortgage documents differ from state to state and may have further individual variations within each state.
In addition, once the loan is made to the borrower, there is a huge secondary market for mortgages, where existing mortgage loans are bundled and sold to large investment firms. These investment entities, in order to pursue a rational risk management policy presentable to their owners and/or shareholders, must organize and analyze these mortgage documents for asset risk and for compliance with local, state and federal laws. Values necessary to compare and analyze these loans must be extracted from paper documents or images of the documents and then tabulated and analyzed, and the resultant data and documents must be made readily available for informed decision-making to occur.
In January 2000, the MBA formed the Mortgage Industry Standards Maintenance Organization (MISMO). This group has driven the development of industry specifications that allow seamless data exchange using standard electronic mortgage documents called SMART Docs™.
The SMART Doc XML specification is the foundation of the eMortgage efforts of lenders, vendors, and investors, as it provides for the electronic versions of key mortgage documents. This specification enables electronic mortgage loan package creation by providing a standard for creating and processing uniform electronic transactions for use in electronic mortgage commerce.
Nor is this dilemma restricted to the mortgage industry. In other industries, including the finance industry, the hospitality industry, the health care field and the insurance industry, there is a constant need to collate documents into logically related groups and to capture key information to enable information exchange. These documents must be further collated in order to identify and store multiple revisions of the same type of document. Data and inferred information must also be extracted from the documents, and the resultant transaction data and underlying documents must be made available in an electronically accessible manner.
Unfortunately, the manual organization and collation of paper documents, and the manual extraction of information from them, is very time-consuming and slows the process of making business decisions. Additionally, there is an increased possibility of error due to manual processing. Validation of these decisions is very difficult since the paper documents are stored separately from the electronic databases maintained by the processing organizations. Thus, there is a clear need for process automation and for well-organized and easily searchable electronic storage of the documents, as well as for extraction of relevant information contained within the documents.
In other methods or processes known in the art, automated document identification or classification methods fall into one of three categories: (1) they are completely dependent on image-based techniques for classification; (2) they use simple keyword search techniques, Bayesian and/or Support Vector Machine (“SVM”) algorithms for text classification; or (3) they rely on document boundary detection methods using image-based and text-based classification techniques. These methods are inadequate to deal with the wide variation in documents typically seen in the business environment. They are also not capable of separating multiple revisions of the same document type so that information can be captured from the most current version of the document, which limits the utility of such systems.
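The limitation of category (2) with respect to document revisions can be illustrated with a minimal sketch. The keyword sets, labels and sample texts below are hypothetical and are not taken from any system discussed here; the sketch only assumes a classifier that scores a document by counting keyword hits.

```python
# Hypothetical keyword lists for two document types (illustrative only).
KEYWORDS = {
    "promissory_note": {"promise", "pay", "note", "principal", "interest"},
    "bank_statement": {"statement", "balance", "deposit", "withdrawal"},
}


def classify_by_keywords(text):
    """Return the label whose keyword set overlaps most with the text."""
    tokens = set(text.lower().split())
    scores = {label: len(tokens & words) for label, words in KEYWORDS.items()}
    return max(scores, key=scores.get)


# Two revisions of the same document type that differ only in their date.
draft = "note dated january borrower will promise to pay principal and interest"
final = "note dated march borrower will promise to pay principal and interest"

# Both revisions receive the same label; the result carries no information
# about which revision is the most current one.
print(classify_by_keywords(draft), classify_by_keywords(final))
```

Under these assumptions the classifier identifies the document type but cannot tell the two revisions apart, so a later step would still have to decide which version to extract data from.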
Although it is known in the art to view paper documents by conversion into simpler electronic forms such as PDF files, these files, in general, do not allow information to be extracted beyond what Optical Character Recognition (OCR) provides. OCR quality is highly dependent on image quality, and the extracted text is frequently of very poor quality. Finally, these methods or apparatuses do not offer a complete solution to the dilemma of analyzing and manipulating large paper document sets. Thus, the automated systems currently available generally have at least the following problems:
(1) such systems are limited to document boundary detection, document classification and text extraction and do not offer advanced document collation with separation of very similar documents, and domain-sensitive scrubbing of extracted information into usable data;
(2) techniques based on the current methods of out-of-context extraction and keyword-based classification cannot offer the consistent extraction of information from documents for automated decision making, or formation of Business Objects such as SMART Docs™ for information exchange between two organizations using industry standard taxonomy;
(3) similarity among documents may lead to misclassification when using pattern-based classification, especially in cases where the optical character recognition quality of the document is poor;
(4) extraction processes that handle structured data using a template-based matching generally fail even with a slight shifting of images, and those with rules-based templates can return false results if there are significant variations of the document;
(5) such systems cannot handle both structured and unstructured documents equally efficiently and reliably to serve an entire business process;
(6) such systems frequently are wed to the strengths and weaknesses of a particular algorithm and are thus not able to handle wide variations in analyzed documents with acceptable accuracy without manual rule creation;
(7) such systems cannot locate a given item of information across different documents and their variations;
(8) such systems do not provide a complete solution to a business problem; and
(9) such systems do not have intelligent scrubbing of extracted information to enable the creation of electronic transaction sets such as MISMO SMART Doc™ XML files.
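Problem (4) above can be made concrete with a small sketch. It assumes, purely for illustration, that a page is modeled as a list of text rows and that a template stores a fixed row and column range for each field; the template, field name and sample page are hypothetical.

```python
# Hypothetical template: field name -> (row, first column, end column).
TEMPLATE = {"loan_amount": (1, 13, 24)}


def extract_field(page, field):
    """Read a field from the fixed position given by the template."""
    row, col_start, col_end = TEMPLATE[field]
    if row >= len(page):
        return ""
    return page[row][col_start:col_end].strip()


aligned = [
    "LOAN SUMMARY",
    "Loan amount: 250000.00",
    "Borrower:    J. Doe",
]

# The same page after a scan that introduces one blank row at the top.
shifted = [""] + aligned

print(repr(extract_field(aligned, "loan_amount")))  # value found at the expected position
print(repr(extract_field(shifted, "loan_amount")))  # fixed position now points at the wrong row
```

In this sketch a shift of a single row is enough for the fixed-position lookup to miss the value, which is the kind of failure that item (4) attributes to template-based matching.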
To analyze complicated documents, workers in several industries, for example, mortgage banking, currently analyze documents using a manual collation process; a manual stacking process; a wide variety of manual classification methods; and manual extraction methods, in particular a manual search and transcription. These methods suffer from the disadvantages of requiring substantial investment of human capital and not being automated sufficiently to handle bulk processing of documents and the information contained in those documents.
The number and kind of documents accompanying a mortgage loan are very specific to the mortgage loan industry and, as mentioned above, vary from state to state, and may vary among the jurisdictions within a particular state. However, the documents related to a given loan for the purchase of a property or properties in any jurisdiction may be assembled into electronic images by scanning (or direct entry, if already in an electronic form) before, during and after funding of the loan to form a partially, or preferably, complete document set, referred to herein as the “Dox Package.” These documents originate from a number of sources, including banks and/or credit unions. Moreover, the order in which these documents are assembled and filed depends very much on the individuals involved, their timeliness and their preferences, and their organization, or disorganization, in sorting the various forms and other documents containing the required information. Further, even though some standardization of documents has occurred, such as Form 1003 published by FNMA, certain data essential for further analysis may still be found at disparate locations in idiosyncratic documents. For example, each bank and credit union formats an individual's bank statement in a different manner, yet the data from each format must be extracted for income verification. Additionally, depending on the stage of loan processing, not all of the documents may be present in a Dox Package at a given point in time.
As mentioned above, following the funding of the loan, loans are frequently bundled with many other similar loans and sold on the secondary market. At this stage, entire lots of mortgage-secured loans are bundled and sold with minimal quality control. In current usage in the secondary mortgage market, a randomly selected ten percent sample of the mortgage documents (Dox Packages) in a lot is analyzed in detail, largely by manual means, and taken as representative of the lot. If more loans, or substantially all the loans in a bundle, could be evaluated, better decisions could be made regarding the marketing of mortgage-backed loans on the secondary market, and the pricing of these loans in the market would be more efficient. Thus, there is a clear need for the automated analysis, collation of documents, and extraction of information in the mortgage loan industry, as well as in other industries with no automated or standardized data input in place.
The following patents and applications may also be relevant in describing the background of the instant invention: U.S. Patent Application No. 2005 0134935 and U.S. Pat. Nos.: 6,754,389, 6,751,614, 6,742,003, 6,735,347, 6,732,090, 6,728,690, 6,728,689, 6,718,333, 6,704,449, 6,701,305, 6,691,108, 6,675,159, 6,668,256, 6,658,151, 6,647,534, 6,640,224, 6,625,312, 6,622,134, 6,618,717, 6,611,825, 6,606,623, 6,606,620, 6,604,875, 6,604,099, 6,592,627, 6,585,163, 6,556,987, 6,556,982, 6,553,365, 6,553,358, 6,542,635, 6,519,362, 6,512,850, 6,505,195, 6,502,081, 6,499,665, 6,487,545, 6,477,528, 6,473,757, 6,473,730, 6,470,362, 6,470,307, 6,470,095, 6,460,034, 6,457,026, 6,442,555, 6,411,974, 6,397,215, 6,362,837, 6,353,827, 6,298,351, 6,289,337, 6,266,656, 6,259,812, 6,243,723, 6,233,575, 6,216,134, 6,212,517, 6,199,034, 6,185,576, 6,185,550, 6,178,417, 6,175,844, 6,157,738, 6,128,613, 6,125,362, 6,101,515, 6,094,653, 6,088,692, 6,061,675, 6,055,540, 6,044,375, 6,038,560, 5,999,893, 5,999,647, 5,995,659, 5,991,709, 5,983,246, 5,966,652, 5,960,383, 5,943,669, 5,940,821, 5,937,084, 5,930,788, 5,918,236, 5,909,510, 5,907,821, 5,905,991, 5,873,056, 5,867,799, 5,854,855, 5,848,186, 5,835,638, 5,832,470, 5,819,295, 5,812,995, 5,794,236, 5,768,580, 5,717,913, 5,706,497, 5,696,841, 5,694,523, 5,689,342, 5,598,557, 5,588,149, 5,579,519, 5,574,802, 5,568,640, 5,535,382, 5,519,608, 5,500,796, 5,479,574, 5,463,773, 5,428,778, 5,426,700, 5,423,032, 5,418,946, 5,414,781, 5,323,311, 5,297,042, 5,287,278, 5,204,812, 5,181,259, 5,168,565, 5,159,667, 5,107,419, 5,091,964, 5,051,891, 5,048,099, 5,021,989, 5,020,019, 4,899,299, 4,860,203, and 4,856,074.