The present invention relates generally to document analysis and, more particularly, to estimation of document structure.
Identification of document structure, including chapters, sections, paragraphs, middle dots, ordered lists, etc., in unstructured documents is important because a large amount of information is stored in unstructured data formats, such as office documents and web content. For example, in natural language processing (NLP), unneeded text, such as numbered references, must be removed before the text is processed. Similarly, in order to develop software that compares provisions between contract documents, the range of each provision must be identified.
However, unstructured documents do not share any common structural definition, and the only information commonly available in unstructured documents is the text itself. Because document structure can vary with the document's purpose, the author's style, etc., the definitions of the document structure may differ even if the document formats are identical.
In relation to identification of the document structure, International Publication WO2014/005610 discloses a multi-level list detection engine. The multi-level list detection engine identifies list elements in a fixed-format text based on the presence of a list identifier. The list elements are grouped into lists based on the properties of each list element relative to other list elements. List elements are then assigned to a list level based on the relative properties of the list elements within a list. Finally, list level assignments are verified and corrected, the levels are merged as necessary, and the lists are consistently formatted as appropriate to create a final well-formed dynamic multi-level list object.
However, conventional techniques for estimating the document structure often make mistakes. For example, an element that does not constitute any list, such as a numbered reference, is often detected incorrectly as a list element. An element that should be recognized separately from a certain in-line list, because it appears in a different sentence from the in-line list, is often mingled with the elements of that in-line list. Conversely, an element that should be recognized together with a certain in-line list, because it appears in the same sentence as the other in-line list elements, is often omitted.
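The first kind of mistake can be illustrated with a minimal sketch. It assumes a hypothetical, deliberately naive detector that treats any line beginning with a numeric identifier as a list element; the pattern `LIST_ID`, the function `naive_list_items`, and the sample text are illustrative assumptions and do not reproduce the engine of WO2014/005610 or any other specific technique.

```python
import re

# Hypothetical naive rule (an assumption for illustration only):
# a line is a list element if it starts with "1.", "2)", "(3)" or "[4]".
LIST_ID = re.compile(r"^\s*(?:\(?\d+[.)]|\[\d+\])\s+")


def naive_list_items(lines):
    """Return every line whose leading token looks like a list identifier."""
    return [line for line in lines if LIST_ID.match(line)]


# Made-up sample document: a genuine ordered list followed by numbered references.
doc = [
    "The procedure is as follows.",
    "1. Open the file.",
    "2. Parse the text.",
    "References",
    "[1] A. Author, Some Title, 2010.",
    "[2] B. Author, Another Title, 2012.",
]

detected = naive_list_items(doc)
for line in detected:
    print(line)
```

Under these assumptions, the detector returns the two genuine list elements together with the two numbered references, because the identifier alone does not distinguish a reference entry from a list element.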
Accordingly, what is needed are a method, an associated computer system, and a computer program product capable of estimating document structure from an unstructured document with good accuracy, based on the text information included in the document, while preventing mistakes as far as possible.