Forms and other documents are widely used to collect information for a variety of purposes. Medical, commercial, educational and governmental organizations use documents in various formats for collecting information and for record keeping. With the advent of computers and communication networks, many documents have moved online so that people no longer have to fill out forms on paper. In addition, digitized records, including electronic and scanned copies of paper documents, are now generated using computers. These electronic documents are shared over communication networks, thereby saving time and resources that would otherwise be required for generating and exchanging paper documents.
These documents may contain data in structured and unstructured formats. A structured document can have embedded code that enables the information to be arranged in a specified format. Unstructured documents include free-form arrangements, wherein the structure, style and content of information in the original documents may not be preserved. It is not uncommon for record-keeping entities to create and store large unstructured electronic documents that may include content from multiple sources.
Often, enterprise systems need to use information from electronic documents to perform operations. It is relatively easy to programmatically extract information from structured documents that have a well-defined or organized data model, such as extracting data from fields in a form where the fields are at known locations in the form (e.g., data in a tabular arrangement). However, when the electronic documents include large unstructured documents of the type discussed above, it is technically difficult to extract the information that may be needed to perform operations of enterprise systems or other types of systems. Such unstructured documents often do not have a well-defined data model, making it difficult to reliably parse them programmatically and extract the needed information.
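By way of illustration only, the contrast described above can be sketched in Python. The sample record, the free-form notes, the field names and the function names below are all hypothetical and are not taken from any particular system. The sketch assumes a structured record with named fields and a naive pattern that expects one specific phrasing in free-form text.

```python
import csv
import io
import re

# Hypothetical structured record: fields sit at known, named positions.
STRUCTURED = "patient_id,date_of_birth,diagnosis\nP-001,1980-04-12,hypertension\n"

# Hypothetical unstructured notes: the same facts, with no fixed layout.
NOTE_A = "Patient P-001, born 1980-04-12, presents with hypertension."
NOTE_B = "Hypertension noted. DOB for P-001 is April 12, 1980."


def extract_structured(text):
    """Read the first record by field name; the data model is known up front."""
    row = next(csv.DictReader(io.StringIO(text)))
    return row["date_of_birth"]


# A naive pattern that assumes one particular phrasing and date format.
DOB_PATTERN = re.compile(r"born (\d{4}-\d{2}-\d{2})")


def extract_unstructured(text):
    """Return the date of birth if the assumed phrasing appears, else None."""
    match = DOB_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_structured(STRUCTURED))  # 1980-04-12
print(extract_unstructured(NOTE_A))    # 1980-04-12
print(extract_unstructured(NOTE_B))    # None
```

In this sketch the structured lookup succeeds because the field name is known in advance, whereas the pattern applied to free-form text misses the second note, which states the same date in a different phrasing and format.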