A fast-growing trend is to allow customers to upload form document images (rather than editable documents) as a way to input data into software products, thereby eliminating manual data entry. For a machine learning algorithm to process and understand the content of form document images, a large amount of high-quality labeled data for known form images is needed to train the software. Acquiring such labeled data for form images is expensive because it requires human involvement for verification and, owing to the sensitive nature of some fields, manual field-level redaction.
To date, research and development related to synthetic data generation has focused mostly on the character or word level. Previous work on synthetic document data generation cannot synthesize numerical-valued data, which constitutes more than 50% of the field values in some form documents, such as tax forms, invoices, receipts, or other complex forms. Little work exists on synthetic data generation geared towards form document images. Specifically (and importantly), it is desirable to have synthetic form image generation that accounts for the dependencies among different fields and provides the form field labels required for information extraction from forms.
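
By way of illustration only, the notion of dependent, labeled numerical fields can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular method: the invoice-style fields (`subtotal`, `tax`, `total`), the fixed tax rate, the pixel layout, and the names `FieldLabel` and `generate_invoice_fields` are all hypothetical and chosen only to show one field being derived from another while each value carries its own label.

```python
import random
from dataclasses import dataclass


@dataclass
class FieldLabel:
    name: str    # ground-truth field label
    value: str   # text that would be rendered onto the form image
    box: tuple   # (x, y, width, height) in pixels; hypothetical layout


# Hypothetical field positions on an invoice-style form.
LAYOUT = {
    "subtotal": (400, 100, 120, 20),
    "tax": (400, 130, 120, 20),
    "total": (400, 160, 120, 20),
}

TAX_RATE_PERCENT = 8  # assumed rate, for illustration only


def generate_invoice_fields(rng):
    """Sample one independent field and derive the dependent ones from it."""
    # Work in integer cents to avoid floating-point rounding drift.
    subtotal = rng.randint(1_00, 5_000_00)
    tax = (subtotal * TAX_RATE_PERCENT + 50) // 100  # rounds half up
    total = subtotal + tax                           # dependent on both
    cents = {"subtotal": subtotal, "tax": tax, "total": total}
    return [
        FieldLabel(name, f"{v // 100}.{v % 100:02d}", LAYOUT[name])
        for name, v in cents.items()
    ]


if __name__ == "__main__":
    for label in generate_invoice_fields(random.Random(0)):
        print(label)
```

Because only the subtotal is sampled and the remaining values are computed from it, every generated record is internally consistent, and the label and bounding box attached to each value are available without any manual annotation.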