A fundamental step in automatic document management applications is to break each document down into its basic constituents, so that a reader can effectively index, search and disseminate the document. For example, in a scientific paper, metadata such as the names of authors, affiliations, title and electronic mail identifiers (email IDs) play a fundamental role in consolidating the knowledge of the reader. However, the majority of documents today are in unstructured formats and lack metadata, because authors typically focus on creating the document content rather than the metadata. Unfortunately, automatic document management applications cannot digest unstructured information without substantial human intervention, which means that the majority of business information cannot be economically employed in automated business processes or in business intelligence. Manually annotating the documents with metadata may not be practical, because the number of documents to be edited can be very large and the work is labor intensive, time consuming, and expensive. Furthermore, manual editing may be prone to errors.
Therefore, it is important and useful to extract such metadata automatically in an efficient and accurate manner. Automatic extraction of metadata, however, may be difficult. Firstly, the layout of the documents may vary significantly, thereby making it difficult to extract the metadata according to predefined layouts. Secondly, the format of the documents may also vary significantly, requiring them to be transformed into some standard document format from which the metadata may be easily extracted. Thirdly, such transformation into a standard document format may introduce errors and may result in unformatted content. For example, if plain text is adopted as the standard document format and a portable document format (PDF) document is converted to plain text, it is common for a single line of text to be divided into multiple text lines, or for a Unicode symbol to be decoded into garbled characters. This is particularly true for documents produced using older versions of the PDF format.
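The two conversion errors mentioned above can be illustrated with a minimal sketch. The title, author name and the choice of codecs (UTF-8 bytes misread as Latin-1) are assumptions made purely for illustration; the helper `first_line` is hypothetical and does not correspond to any particular PDF converter.

```python
# Hypothetical title as it appears on a single line in the source document.
original_title = "Automatic Extraction of Metadata from Scientific Papers"

# Error 1: the converter divides the single line into two text lines.
converted_text = "Automatic Extraction of Metadata\nfrom Scientific Papers"


def first_line(text: str) -> str:
    """Return the first text line, as a naive 'title is line 1' rule would."""
    return text.split("\n", 1)[0]


# The naive rule now recovers only part of the title.
truncated_title = first_line(converted_text)

# Error 2: a Unicode symbol is decoded with the wrong codec.
# Assumption: the bytes are UTF-8 but are read as Latin-1.
author = "Jos\u00e9 Garc\u00eda"
garbled_author = author.encode("utf-8").decode("latin-1")

print(repr(truncated_title))
print(repr(garbled_author))
```

In this sketch each accented character becomes two unrelated characters, so any later pattern or dictionary lookup against the author name would fail unless the error is detected and reversed.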
To address the above-described problems, one category of approaches may automatically extract metadata from documents with fixed layouts and well-defined, consistently formatted text (similarly formatted documents), for example, research papers from certain journals or proceedings, by matching the text with specific patterns. However, this type of automatic metadata extraction can handle only a limited range of documents and typically may not be robust to errors in the text introduced by the document conversion process, such as the errors described above.
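A minimal sketch of such pattern-based extraction is given below, assuming a hypothetical fixed layout in which the title is the first non-empty line and the comma-separated authors are the second. The function name `extract_fixed_layout`, the sample text and the simplified email pattern are illustrative assumptions, not the rules of any specific journal or system.

```python
import re

# Simplified email pattern; an assumption for illustration, not a full
# implementation of the email address grammar.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_fixed_layout(text: str) -> dict:
    """Extract metadata assuming line 1 is the title and line 2 the authors."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    title = lines[0] if lines else ""
    authors = [a.strip() for a in lines[1].split(",")] if len(lines) > 1 else []
    return {
        "title": title,
        "authors": authors,
        "emails": EMAIL_PATTERN.findall(text),
    }


# Well-formed text that matches the assumed layout.
clean = (
    "A Study of Metadata Extraction\n"
    "Jane Doe, John Roe\n"
    "jane@example.org, john@example.org\n"
)

# The same text after a conversion step has split the title line in two.
broken = (
    "A Study of Metadata\n"
    "Extraction\n"
    "Jane Doe, John Roe\n"
    "jane@example.org, john@example.org\n"
)

clean_result = extract_fixed_layout(clean)
broken_result = extract_fixed_layout(broken)
print(clean_result)
print(broken_result)
```

On the broken input, the positional rules truncate the title and treat the remainder of the title as an author, while the email pattern, which does not depend on line position, still matches. This illustrates why layout-dependent patterns may not be robust to conversion errors.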
A second category of approaches may use various supervised machine learning techniques to automatically extract metadata from documents. One method uses image processing, another uses text classification, and yet another uses sequence labeling. Typically, all of these methods may require preparing a training data set, collecting and labeling training samples, defining a set of features, learning a model, and applying the learned model to testing samples. However, the performance of these methods may depend heavily on the distribution of the training samples, the selected features, and the capacity of the model.
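The supervised workflow described above (labeled samples, a feature set, a learned model, application to unseen samples) can be sketched with a small line classifier. The multiclass perceptron, the three hand-picked features, the label set and the three training lines below are all assumptions chosen for brevity; they are not the model or features of any particular prior method.

```python
from collections import defaultdict

LABELS = ["title", "author", "email"]


def features(line: str, index: int) -> dict:
    """Map a text line and its position to a small hand-defined feature set."""
    feats = {"bias": 1.0}
    if index == 0:
        feats["is_first_line"] = 1.0
    if "@" in line:
        feats["has_at"] = 1.0
    if "," in line:
        feats["has_comma"] = 1.0
    return feats


def score(weights: dict, label: str, feats: dict) -> float:
    """Linear score of one label for the given features."""
    return sum(weights[label][name] * value for name, value in feats.items())


def predict(weights: dict, feats: dict) -> str:
    """Return the highest-scoring label (ties go to the earlier label)."""
    return max(LABELS, key=lambda label: score(weights, label, feats))


def train(samples: list, epochs: int = 10) -> dict:
    """Learn perceptron weights from (line, index, label) training samples."""
    weights = {label: defaultdict(float) for label in LABELS}
    for _ in range(epochs):
        for line, index, gold in samples:
            feats = features(line, index)
            guess = predict(weights, feats)
            if guess != gold:
                # Standard perceptron update: reward gold, penalize the guess.
                for name, value in feats.items():
                    weights[gold][name] += value
                    weights[guess][name] -= value
    return weights


# Tiny hand-labeled training set (hypothetical).
training_samples = [
    ("A Study of Metadata Extraction", 0, "title"),
    ("Jane Doe, John Roe", 1, "author"),
    ("jane@example.org, john@example.org", 2, "email"),
]

model = train(training_samples)

# Apply the learned model to lines that were not seen during training.
email_guess = predict(model, features("alice@example.com", 2))
author_guess = predict(model, features("Alice Smith, Bob Jones", 1))
print(email_guess, author_guess)
```

Because the model sees only three samples and three features, its behavior is determined almost entirely by those choices, which mirrors the dependence on training distribution and feature selection noted above.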
The drawings described herein are for illustration purposes only and are not intended to limit the scope of the present disclosure in any way.