Determining the type of a digital document, sometimes also called “classification” of the digital document, is the process of assigning one of a number of predefined document types, or “classes”, to an unknown document. Typical prior art solutions for determining digital document types are based either on pattern recognition techniques or on machine learning algorithms (e.g., supervised machine learning algorithms, semi-supervised machine learning algorithms, and the like).
As is known in the art, a machine learning algorithm (or “MLA” for short) is “trained” using a labelled training data set. In order to train the MLA to determine the document type of a digital document, the MLA is provided, during the training phase, with a large number of labelled training objects, each training object containing a digital document with an assigned label indicative of the correct document type. Within supervised or semi-supervised implementations of MLAs, the assigned label is typically created by “assessors”, i.e., individuals who manually review training digital documents and assign labels thereto using their professional judgement.
During the training phase, the MLA identifies certain document features of each of the training documents (exact features depend on the execution of the MLA and/or the type of the training documents) and correlates the so-identified document features to the assigned label. By observing a large number of such training objects, the MLA “learns” patterns/hidden relationships between the identified document features and the document type.
The kinds of document features identified during training of the MLA (and, thus, the kinds of document features used by the MLA, once trained, for determining the document type of an unknown document) vary greatly. Some examples of document features that may be identified include (in an example of a digital document containing text): word frequency features, layout features, run-length histograms, and the like.
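By way of illustration only, two of the document features named above can be sketched in Python as follows. The function names `word_frequency_features` and `run_length_histogram`, the tokenization rule, and the `max_run` bin limit are assumptions made for this sketch and are not taken from any particular prior art system.

```python
from collections import Counter
from itertools import groupby
import re


def word_frequency_features(text):
    """Return the relative frequency of each lower-cased word in the text."""
    # Assumption: a "word" is a run of ASCII letters; real systems may differ.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def run_length_histogram(pixels, max_run=4):
    """Count runs of black (1) pixels in one row of a binarized image.

    Bin i holds the number of runs of length i + 1; runs longer than
    max_run are counted in the last bin.
    """
    histogram = [0] * max_run
    for value, run in groupby(pixels):
        if value == 1:
            length = len(list(run))
            histogram[min(length, max_run) - 1] += 1
    return histogram


# Hypothetical inputs, for illustration only.
text_features = word_frequency_features("Invoice total: invoice number")
row_features = run_length_histogram([1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1])
```

The word frequency feature requires the text of the document to be available (for example, via OCR), whereas the run-length histogram is computed directly from image pixels.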
Once the MLA is trained (and validated using a validation subset of training objects), the MLA is used for classifying an unknown document. By analyzing the unknown document's document features, the MLA uses its trained MLA formula to identify the document type of the unknown document.
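The train-then-classify flow described above can be sketched, under simplifying assumptions, with a nearest-centroid classifier over word frequency features. This is a deliberately simple stand-in chosen for illustration, not the MLA of any particular prior art solution; the names `features`, `train`, `cosine` and `classify`, as well as the sample documents and labels, are hypothetical.

```python
from collections import Counter, defaultdict
import math
import re


def features(text):
    """Word frequency features: relative frequency of each word."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values()) or 1
    return {word: count / total for word, count in counts.items()}


def train(labelled_docs):
    """Average the feature vectors per label, giving one centroid per type."""
    sums = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_docs:
        for word, value in features(text).items():
            sums[label][word] += value
        doc_counts[label] += 1
    return {
        label: {word: value / doc_counts[label] for word, value in vec.items()}
        for label, vec in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(model, text):
    """Return the label whose centroid is most similar to the document."""
    doc = features(text)
    return max(model, key=lambda label: cosine(doc, model[label]))


# Hypothetical labelled training objects (document text, assigned label).
training_set = [
    ("invoice number total amount due", "invoice"),
    ("invoice payment total due date", "invoice"),
    ("dear sir thank you for your letter", "letter"),
    ("dear madam sincerely yours", "letter"),
]
model = train(training_set)
predicted = classify(model, "total amount due on this invoice")
```

In practice, a validation subset of labelled documents would be held out of `training_set` and passed through `classify` to estimate accuracy before the model is applied to unknown documents.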
It is generally known in the art that there exists a trade-off between the “cost” of extracting a given document feature and its accuracy in determining the document type of the digital document. Within the technical field of document processing, the “cost” of feature extraction can include computational costs (i.e., the processing resources required to extract and/or process such document features), the time required to extract and/or process such document features, or monetary costs (such as license fees or the like for Optical Character Recognition (OCR) software or other processing software).
OCR, for example, which can be used to identify the words in a sample (such as an unknown document to be processed) to enable computation of word frequency or other textual features, can be both computationally and financially costly. The computational cost of performing OCR on a single document page can range from a few hundred milliseconds to a few seconds, depending on the number of words/characters on the page, as well as on the quality of the document. Thus, for a system that processes numerous documents, the toll on processing resources increases significantly as the number of documents increases.
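To give a sense of scale, the following back-of-the-envelope sketch computes the total sequential OCR time for a batch of pages. The per-page figures of 0.3 seconds and 3 seconds are assumed stand-ins for “a few hundred milliseconds” and “a few seconds”, the batch size of one million pages is likewise an assumption, and the helper name `total_ocr_hours` is hypothetical.

```python
def total_ocr_hours(num_pages, seconds_per_page):
    """Total sequential OCR time, in hours, for a batch of pages."""
    return num_pages * seconds_per_page / 3600


# Assumed illustrative figures, not measurements.
pages = 1_000_000
low_estimate = total_ocr_hours(pages, 0.3)   # roughly 83 hours
high_estimate = total_ocr_hours(pages, 3.0)  # roughly 833 hours
```

Under these assumptions, the cost grows linearly with the number of pages, so a tenfold increase in document volume implies a tenfold increase in OCR processing time on the same hardware.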