1. Field of the Invention
The present invention relates in general to the field of computers and similar technologies, and in particular to software utilized in this field. Still more particularly, it relates to a method, system and computer-usable medium for preserving conceptual distance within unstructured documents.
2. Description of the Related Art
Many unstructured documents have an inherent hierarchy that implicitly transfers information throughout the document. However, processing such documents as a flat file typically loses conceptual distance and information inherited from sections that are higher in the hierarchy. For example, the textual alignment of a document with a title and three consecutive sections would result in textual distance values of ‘1’, ‘2’ and ‘3’ between the title and the first, second and third sections, respectively. However, the conceptual distance value between the title and each of the three sections would be ‘1’, as each of the sections is conceptually related to the title. Additionally, textual alignment fails to take into account conceptual similarities of passages linked to other documents, or to sections of other documents, which would likewise have a conceptual distance value of ‘1’ from the title of the original document. While such passages may be linked through a hyperlink, this data is typically lost or stripped out by most parsers.
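The difference between the two distance measures in the title-and-three-sections example can be illustrated with a minimal Python sketch. The node names, the `flat_order` list, the `hierarchy` mapping, and both helper functions are hypothetical and serve only to illustrate the example; textual distance is assumed here to be the offset between positions in the flat reading order, and conceptual distance is assumed to be the number of edges between two nodes in the document hierarchy.

```python
from collections import deque

# Hypothetical document: a title followed by three consecutive sections,
# listed in the order a flat-file parser would encounter them.
flat_order = ["title", "section_1", "section_2", "section_3"]

# Hypothetical hierarchy: every section is a direct child of the title.
hierarchy = {
    "title": ["section_1", "section_2", "section_3"],
    "section_1": [],
    "section_2": [],
    "section_3": [],
}


def textual_distance(order, a, b):
    """Distance between two passages measured by position in the flat text."""
    return abs(order.index(a) - order.index(b))


def conceptual_distance(tree, a, b):
    """Number of parent/child edges between two passages in the hierarchy."""
    # Treat each parent/child link as an undirected edge.
    adjacency = {node: set(children) for node, children in tree.items()}
    for parent, children in tree.items():
        for child in children:
            adjacency.setdefault(child, set()).add(parent)

    # Breadth-first search yields the shortest path length.
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # the two passages are not connected


sections = flat_order[1:]
textual = [textual_distance(flat_order, "title", s) for s in sections]
conceptual = [conceptual_distance(hierarchy, "title", s) for s in sections]

print(textual)     # [1, 2, 3]
print(conceptual)  # [1, 1, 1]
```

Because the search runs over a general graph rather than a strict tree, the same sketch would also accept an extra edge representing a hyperlink to a passage in another document, which would then sit at a conceptual distance of ‘1’ from the linking node.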
Furthermore, many passage similarity metrics currently use textual distance instead of conceptual distance between passages, which incorrectly scores passages at the end of a document as less relevant than those near the beginning. Moreover, a user who attempts to use a passage is generally confined to seeing only what is in the passage itself, because conceptual information inherited from parent sections and headers is unavailable, which reduces relevance and informativeness. Known approaches to this issue include table of contents (TOC) generators, which can parse the markup of a single annotated document into a tree structure. However, such approaches do not allow for conceptual cycles, the maintenance of inter-document relationships, or the implementation of more sophisticated partitioning algorithms.
Another issue related to conceptual distance is determining how to split a training corpus into different entities, such as terms, documents, concepts, and so forth. Current distributional semantic methods, such as latent semantic analysis (LSA) and random indexing, use a static definition of what each of these entities should be. The definitions are then used to generate a matrix, which in turn is converted into a vector-space model using techniques such as singular value decomposition (SVD). Certain methods, such as LSA, use a term-document matrix in which documents are collections of text and terms are some subset of that text. This methodology finds inherent similarities between terms from their contexts within documents. However, the definitions these methods use can affect the relevance and usefulness of the generated model. Furthermore, many vector-space models suffer from improper document length, which provides too little or too much information, and from words that match or fail to match the appropriate values in the model. These models also treat documents as a collection of words and therefore lose sentence context information.
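The term-document matrix described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an implementation of LSA: the toy corpus, the whitespace tokenization, and the helper names `term_vector` and `cosine` are hypothetical, and the SVD step is omitted, so similarities are computed directly on raw count vectors.

```python
import math

# Hypothetical toy corpus; each string plays the role of one "document".
documents = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks fell on the market",
]

# Static definition of a "term": any whitespace-separated token.
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Term-document matrix: one row per term, one column per document,
# each cell holding the number of times the term occurs in the document.
matrix = [
    [doc.split().count(term) for doc in documents]
    for term in vocabulary
]


def term_vector(term):
    """Row of the term-document matrix for the given term."""
    return matrix[vocabulary.index(term)]


def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Terms that occur in the same documents receive identical vectors.
print(cosine(term_vector("cat"), term_vector("mat")))  # 1.0
# Terms that never share a document are scored as unrelated.
print(cosine(term_vector("cat"), term_vector("dog")))  # 0.0
```

The sketch makes the dependence on entity definitions visible: the scores are determined entirely by where the document boundaries are drawn and what counts as a term. Because each document is reduced to word counts, word order within a sentence has no effect on the matrix.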