Invoices are among the most important financial documents a business handles. Unfortunately, keeping a good digital record of a large volume of invoices by manually entering their metadata is difficult and time-consuming. An automatic invoice parsing system that detects the invoice fields and parses out the corresponding field values could therefore significantly reduce the time spent on manual data entry and avoid the mistakes introduced by human input.
In a given invoice, the typical fields include vendor name, invoice number, account number, purchase order number, invoice date, due date, total amount, and invoice payment terms. The challenges facing an invoice parsing system lie in the following three aspects:
(1) Input variation and noise: Different capture devices and methods produce inputs of differing quality. The size and resolution of the input may vary significantly from one device to another. Beyond device differences, many other factors affect the quality of the input. The original invoice may be crumpled or incomplete. Capture conditions such as lighting and skew may blur or distort the image when it is captured by a mobile device. Dirt on the scanner mirror may introduce noise. A robust method is therefore needed that works well on noisy input.
(2) OCR/DCE errors: Optical Character Recognition (OCR) and Digital Character Extraction (DCE) are designed to extract characters from images or PDFs. The invoice field detection and parsing techniques described below rely primarily on those extracted characters. Unfortunately, no matter how good the quality of the input image or PDF is, OCR/DCE can still produce noisy output, which makes field detection and parsing difficult. A simple keyword matching method cannot overcome these difficulties, so an improved method is desired.
(3) Invoice format variations: There is no unified standard format for invoices in the business world. Thousands of different invoice formats are used in day-to-day transactions. Some invoice fields are present in some invoices but absent from others. The names of the invoice fields also vary widely. A simple template matching method therefore cannot handle these variations.
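To make challenges (2) and (3) concrete, the following minimal Python sketch shows how a simple keyword match fails on text with typical OCR character confusions, while an approximate string comparison still recognizes the label. It is an illustration only and is not the detection method described in this document. The label variants, the noisy sample strings, the function names, and the 0.75 similarity threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher

# Hypothetical label variants for a single field (assumed for illustration).
INVOICE_NUMBER_LABELS = ["invoice number", "invoice no", "invoice #", "inv no"]


def exact_match(text, labels):
    """Simple keyword matching: a label must appear verbatim in the text."""
    lowered = text.lower()
    return any(label in lowered for label in labels)


def fuzzy_match(text, labels, threshold=0.75):
    """Approximate matching: compare each label with word windows of equal length."""
    # Lowercase and drop colons that usually trail a field label.
    words = [w.strip(":") for w in text.lower().split()]
    for label in labels:
        n = len(label.split())
        for i in range(len(words) - n + 1):
            window = " ".join(words[i:i + n])
            # ratio() returns a similarity score between 0.0 and 1.0.
            if SequenceMatcher(None, label, window).ratio() >= threshold:
                return True
    return False


clean = "Invoice Number: 12345"
noisy = "lnvoice Nurnber: 12345"  # assumed OCR confusions: "I" read as "l", "m" as "rn"

print(exact_match(clean, INVOICE_NUMBER_LABELS))  # keyword found verbatim
print(exact_match(noisy, INVOICE_NUMBER_LABELS))  # keyword missed because of OCR noise
print(fuzzy_match(noisy, INVOICE_NUMBER_LABELS))  # label still recognized approximately
```

Even the approximate version depends on a hand-maintained list of label variants, which hints at why neither keyword matching nor fixed templates scale to thousands of invoice formats.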