Various processes may be used to extract information from structured documents such as contracts and web pages. Many types of features can be extracted from these documents, including text data, table data, and drawings, provided that the document is formatted such that the desired information is not obscured, unstructured, or structured in a manner that is inconsistent with, or unexpected by, the logic used to extract information from documents.
Information extraction processes may utilize optical character recognition (OCR), as well as other content extraction methods, to convert documents from a non-editable form into a machine-readable and editable form. For example, a collection of scanned images may be transformed into an editable word processing document. Unfortunately, these OCR processes may introduce distortions or errors into the documents. Furthermore, extracting features such as tables or drawings requires specific schemas or logic that can be used to inspect documents for layouts or other patterns that are indicative of these features. When these layouts or patterns are not present, extraction becomes difficult, if not impossible. Moreover, if a feature is presented within the document in an unexpected format, the extraction process may be unable to locate the feature.
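The dependence on expected layouts can be illustrated with a minimal sketch. The pipe-delimited row rule, the function name find_table_rows, and the sample strings below are all hypothetical and chosen only for illustration; they do not represent any particular extraction product or the logic of any specific process.

```python
import re

# Hypothetical layout rule (an assumption for illustration): a table row
# consists of at least three cells separated by the "|" character.
ROW_PATTERN = re.compile(r"^\s*[^|]+(\|[^|]+){2,}\s*$")


def find_table_rows(text):
    """Return the lines of `text` that match the pipe-delimited row rule."""
    return [line for line in text.splitlines() if ROW_PATTERN.match(line)]


# The same information presented three ways (illustrative data only).
expected_layout = "Item | Qty | Price\nWidget | 2 | 4.00\n"
unexpected_layout = "Item: Widget\nQty: 2\nPrice: 4.00\n"
# Simulated OCR distortion: each "|" misread as the letter "l".
ocr_distorted = "Item l Qty l Price\nWidget l 2 l 4.00\n"

for name, sample in [
    ("expected", expected_layout),
    ("unexpected", unexpected_layout),
    ("ocr_distorted", ocr_distorted),
]:
    print(name, len(find_table_rows(sample)))
```

Under this rule, only the first sample yields table rows. The second presents the same data in a format the rule does not anticipate, and the third loses its delimiters to a simulated recognition error, so the rule locates no table in either.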