This invention relates generally to extracting textual content from an image and more specifically to extracting textual content from a filled form document.
In daily life, we often have to provide official information to an institution, for example when declaring taxes or filling in an administrative report. This information is often captured in a filled form document. Such documents can be stored in their original digital format, or they can be archived or transmitted as raster images, which makes the information difficult to read by computer-based techniques. Analyzing those forms manually is time consuming and becomes infeasible for large volumes of documents.
A filled form document, fillable form, or simply form is a document with fields, also referred to as placeholders, in which to write an answer or select one of the proposed options. Forms are therefore a specific type of document whose content can be separated into two categories: a group of structural, standardized static content, including questions created by the data collector, and a group of variable answers that are entered by a data provider in predefined placeholders or fields. A form can be seen as a template in which a user fills the placeholders with personal data or with appropriate answers to any questions contained in the form. Most of the semantic content is therefore preserved from one filled form to another. In theory, the only changes are the answers written in the fields.
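The separation between static template content and variable answers described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the class names `FormField`, `FormTemplate` and `FilledForm`, the `bbox` attribute, and the example field values are hypothetical and are not part of the described invention.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FormField:
    """A placeholder in the template (hypothetical structure)."""
    field_id: str
    question: str                    # static text created by the data collector
    bbox: tuple[int, int, int, int]  # assumed (x, y, width, height) on the page


@dataclass(frozen=True)
class FormTemplate:
    """Static, standardized content shared by every filled copy of a form."""
    name: str
    fields: tuple[FormField, ...]


@dataclass
class FilledForm:
    """Variable content: the answers entered by one data provider."""
    template: FormTemplate
    answers: dict[str, str] = field(default_factory=dict)

    def unanswered(self) -> list[str]:
        """Return the ids of template fields that have no answer."""
        return [f.field_id for f in self.template.fields
                if f.field_id not in self.answers]


# Two filled forms share one template and differ only in their answers.
template = FormTemplate(
    name="example_report",
    fields=(
        FormField("patient_name", "Patient name", (40, 120, 300, 24)),
        FormField("drug_name", "Suspected drug", (40, 160, 300, 24)),
    ),
)
form_a = FilledForm(template, {"patient_name": "A. Example", "drug_name": "DrugX"})
form_b = FilledForm(template, {"patient_name": "B. Example"})
```

In this sketch the template is immutable and shared, while each `FilledForm` carries only its own answers, which mirrors the observation that most of the content is preserved from one filled form to another.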
However, depending on the duration of the data collection process, the predefined set of questions or even the layout of the fillable form document might be modified. A modification may occur because the data collection entity discovers at some point that data providers require more space to answer a specific question, or that it would be beneficial to rephrase or reorganize questions in order to improve data collection. This is especially true for use cases that require constant monitoring or when data collection is integrated into a decision-making workflow.
Programmatic forms can be easily generated using annotation tools. A fillable PDF form is one in which the user can enter the answers directly in the form. All the answers are then automatically saved in an external file. With access to the generated file, one already has all the information contained in the form, as well as the metadata describing how answers relate to questions.
However, when the original fillable form is scanned and transformed into an image, all this information is lost and a complete pipeline has to be designed to retrieve the data.
Scanned medical forms are one example. Several entities, from hospitals to doctors, are frequently involved in treating a patient. The forms are therefore often scanned and sent by email or fax from one entity to another. To analyze the information contained in those scanned documents at a large scale, a tool must be built to extract the content and then understand it.
One of the cases in which continuous data collection is required is the field of pharmacovigilance or drug safety. Adverse Events Reports (AER) are a standardized way of collecting information about potential health threats related to the use of a drug or other pharmaceutical product. In this case, the data collector can be a pharmaceutical company or public health administration which collects reports obtained from a variety of sources, including health professionals, patients and pharmacists, who fill in and transfer one of a set of predefined templates made available for this purpose.
Extracting information from AER documents can be challenging for various reasons. On the one hand, the document is often transferred to the data collector by printing a copy of the filled report on paper and faxing it or emailing a scanned version of it. These processes introduce loss and noise in the version finally obtained by the data collector. On the other hand, given the variety of sources involved, it is hard to guarantee the use of a common template that all parties agree upon. Furthermore, even if the template is fixed, modifications might be introduced over time (template versioning).
There is therefore a need for an automatic tool able to analyze scanned filled form documents.