The field of the present invention is document processing and, in particular, document section identification and categorization.
Documents and reports are typically organized into sections, both for quick reference and as a matter of common practice. These sections provide form and substance by giving a document a logical pattern, grouping together similar information within the document, and identifying the location of specific information within the document. Section headings serve to label sections and categorize information for later retrieval and use.
The rapid location of document sections and of the information included in a specific section is essential in certain modern marketplaces, such as hospitals, doctors' offices, and law offices. In the medical field it has been found that there is a lack of consistency in document section headings, so that not every hospital, technician, or doctor records the same document section under the same document section heading in every instance. For example, a hospital technician may use ‘Prescribed Medications’ as the heading for a particular section of a medical report while a doctor's dictated medical report refers to the same section as ‘Prescription Drugs’.
Previous attempts at processing documents with structured section headings and organized information have identified this issue of different but equivalent section headings. Systems have attempted to address the issue primarily by using filters and pre-processors. For example, filters have analyzed a document and identified headings for processing. The identified headings have then been replaced with normalized section headings acceptable to the particular system for recognition and categorization.
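By way of illustration only, the filter-and-replace approach described above can be sketched as follows. This is a minimal sketch and not the implementation of any particular prior system. The lookup table, the normalized label ‘MEDICATIONS’, the function name normalize_headings, and the rule that a heading is a short line ending in a colon are all assumptions introduced here for the example.

```python
import re

# Hypothetical, hand-maintained table mapping variant headings (lowercased)
# to a normalized heading. A real site would need many more entries.
HEADING_MAP = {
    "prescribed medications": "MEDICATIONS",
    "prescription drugs": "MEDICATIONS",
}

# Assumption: a heading is a line made of letters and spaces ending in a colon.
HEADING_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*$")


def normalize_headings(text):
    """Replace known variant headings with their normalized form."""
    out = []
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            key = match.group(1).lower()
            # Headings missing from the table pass through unchanged.
            if key in HEADING_MAP:
                line = HEADING_MAP[key] + ":"
        out.append(line)
    return "\n".join(out)
```

Under these assumptions, a report headed ‘Prescribed Medications:’ and one headed ‘Prescription Drugs:’ both emerge with the same normalized heading, while any heading absent from the table is left as written and must be added to the table by hand.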
Unfortunately, these previous systems have difficulties and drawbacks. For example, previous systems essentially perform the filter and pre-processing procedure using handcrafted programs to address a collection of documents and the various section headings contained therein. These handcrafted programs are extremely labor-intensive and complex to create and they require a great deal of experience in programming and knowledge of the relevant headings. This results in long start-up times and high costs before document sections can be efficiently retrieved and used.
Another drawback is the site-specific or document collection-specific nature of the handcrafted programs of the previous systems. The handcrafted programs have not transferred efficiently from site to site, and a program designed for one hospital or medical department is rarely adaptable for another.