This application relates to information retrieval, specifically retrieval of textual information that is stored in digital form using one or more externally managed terminologies or nomenclatures.
Knowledge workers in all fields rely on bodies of literature to support decision-making. Historically, these bodies of literature existed in printed form, were housed in libraries or other public or private collections, and comprised a variety of books, book series, encyclopedic references, periodical publications (e.g., topical journals, magazines, trade journals, newspapers), or documents. Library collections were typically either general or topical collections of works that were shelved using specific organizational methods (e.g., Dewey Decimal System) that became “industry standards” over time. In addition to information about the physical location of a work, ancillary information (metadata) about the nature of the work (e.g., author(s), publisher, year, synopsis of work, keywords) was typically maintained by specialists, cross-indexed and arranged into taxonomies so that librarians and readers could locate specific information more readily. Typically, this information was created, maintained and searched manually. Similar approaches were developed for the efficient storage and cataloging of large document collections (e.g., patent literature), where the parties responsible for maintaining such information devised methods for storage and retrieval of documents based on similarity of subject matter. Sophisticated coding systems evolved over time that provided users of these materials with the necessary information about where to locate specific collections of related topical information. Once the appropriate reference work or document collection was located, it was largely up to the knowledge worker to sift through subsets of the references or documents to determine which, if any, were relevant to their needs.
Contemporary computerized systems for indexing, searching and retrieving information are an area of continuing innovation, and are challenged primarily by the volume of material that is now available in digital form. Despite all of the developments in the field, knowledge workers are still confronted with the challenge of locating the correct information to meet their needs. The problem is simply that the volume of material that must be searched and reviewed continues to grow at a superlinear rate, as new information becomes available online and older material is digitized. This imposes increasing demands on contemporary knowledge workers because their depth of knowledge must be correspondingly greater, spanning not only contemporary developments in their field of expertise but also historically relevant developments. In many areas this task is confounded by domain-specific terminologies, nomenclatures, taxonomies or ontologies that have strong temporal components as well as significant overlap, resulting in instances of synonymy (multiple terms or names for the same object/concept), polysemy (multiple objects/concepts with the same term or name) or both.
New challenges also exist, because much of the information that is available (which is referred to interchangeably within this application as “content”) is now available mainly or solely in digital form and is no longer retrieved from a physical location. Rather, it is retrieved from one or more computer servers that are located on private or public networks. Queries of such systems must map not only to the correct item of digital content, but also to the correct server where that content is stored and for which the knowledge worker has access privileges.
Indexing and abstracting schemes for digital content have also undergone many changes in recent years. Advancement in this area has been tremendously stimulated by research into “search” on the Internet. In addition to the use of keywords and other indexing schemes (e.g., US and ECLA patent classification schemes), considerable progress has been made in the area of natural language processing, which essentially entails programming a general purpose computer system to “read” a document and to group that document or article, syntactically and semantically, with others that share some degree of similarity with it. Ideally, one would like to have a computer understand human language, but that goal remains out of reach thus far. Among other challenges, natural language is inherently ambiguous.
Various approaches have been developed to address this problem. Those approaches that have shown the most promise to date fall into the broad category of vector space models (VSMs). Unlike approaches that rely on a lexicon or thesaurus to extract meaning from a document on a word-by-word or phrase-by-phrase basis, VSMs automatically extract meaning from documents by measuring the similarity in word usage between documents or among all documents in a collection (a “corpus”). VSMs are based on the distributional hypothesis, which posits that words occurring in similar contexts tend to have similar meanings. In general, VSMs provide a vector of word frequencies, which can then be used for comparative purposes in deducing similar meaning via classification, clustering, ordination, or projection methods. Alternatively, the vectors can be used as selective data filters. Event frequency and pattern matching approaches are in widespread use in the social sciences and natural sciences and have proven very useful in grouping together similar individuals or objects based on predefined characteristics.
VSM and non-VSM methods of text analysis generally follow a similar route, beginning with one or more preprocessing steps. The input document or corpus is generally a plain text representation of a document in digital form, with special formatting features removed. The input file(s) are parsed and tokenized to identify each word or word-like string based on a predetermined set of grammatical rules or regular expressions that define word boundaries (e.g., white-space, punctuation). Additional preprocessing steps (normalization and annotation) often follow including identification of parts of speech, lemmatization, word-use patterns or other special features of interest. The input file may be physically modified by the insertion of “tags” or other forms of annotation that mark features of interest. Alternatively, information about the occurrence and location of each word of interest may be stored independently. In some instances, typographical formatting features may convey additional meaning (e.g., inflected characters, altered type face, superscripting or subscripting).
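The tokenization and normalization steps described above can be illustrated with a minimal sketch. It is not the method of any particular system: the regular expression, the function name `tokenize`, and the sample sentence are assumptions chosen for the example, and lowercasing stands in for fuller normalization such as lemmatization. The position of each token is stored independently of the text rather than by inserting tags.

```python
import re

# Assumed word-boundary rule: a token is a run of letters, digits or
# apostrophes, so white-space and punctuation act as separators.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9']+")


def tokenize(text):
    """Return (normalized_token, character_offset) pairs for each word-like string."""
    return [(m.group().lower(), m.start()) for m in TOKEN_PATTERN.finditer(text)]


# Hypothetical input text, used only for illustration.
sample = "Bacillus subtilis grows. Bacillus spores persist."
tokens = tokenize(sample)

for token, offset in tokens:
    print(offset, token)
```

Keeping the offsets separate from the text leaves the input file unmodified, which is one of the two annotation options mentioned above.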
Pre-processed files are then re-parsed to extract individual tokens from each document or corpus for analysis. In VSM, the occurrence and frequency of occurrence are extracted in the form of a list (a vector). The vector may be further associated with other information regarding the location of each token within the document or corpus (e.g., the file name and byte offset), the frequency of occurrence, adjacent words, concordance of words, etc. Vectors that share common terms may be combined into matrices or higher-order tensors for further analysis.
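A term-frequency vector with associated token locations, and the combination of several such vectors into a matrix, might be sketched as follows. The function names `term_vector` and `combine`, the file names, and the sample texts are hypothetical. Offsets here are character offsets, which coincide with byte offsets only for single-byte encodings such as ASCII.

```python
import re
from collections import Counter, defaultdict


def term_vector(file_name, text):
    """Return (frequencies, locations) for one document.

    frequencies maps each token to its count; locations maps each token
    to a list of (file_name, character_offset) pairs.
    """
    frequencies = Counter()
    locations = defaultdict(list)
    for match in re.finditer(r"[a-z0-9']+", text.lower()):
        frequencies[match.group()] += 1
        locations[match.group()].append((file_name, match.start()))
    return frequencies, dict(locations)


def combine(vectors):
    """Stack per-document frequency vectors into a matrix over a shared vocabulary."""
    vocabulary = sorted(set().union(*vectors))
    matrix = [[vector.get(term, 0) for term in vocabulary] for vector in vectors]
    return vocabulary, matrix


# Hypothetical two-document corpus.
freq_a, loc_a = term_vector("a.txt", "the cell divides and the cell grows")
freq_b, loc_b = term_vector("b.txt", "the cell wall")
vocabulary, matrix = combine([freq_a, freq_b])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row of the resulting matrix is one document and each column is one term of the shared vocabulary, so most entries are zero once the vocabulary grows large.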
VSMs as applied to semantic analysis fall into three broad approaches: term-document models, typified by the Apache Lucene search engine; word-context models, typified by the Latent Semantic Vectors package in Lucene; and pair-pattern analyses used in latent relationship analysis, as typified by the S-Space package, also running in the Lucene environment.
In the term-document model, words are treated as dependent variables that occur in documents. Vectors are produced by tallying the frequency of occurrence of each word. Word order, however, is not determined (or at least is not used in subsequent analyses). Term-document models perform well in capturing information about a document, especially when “stop words” (frequently occurring, non-informative words such as definite and indefinite articles) are left out. The model is closely aligned with the “bag of words” hypothesis and the distributional hypothesis, and performs well because the choice of words used by an author is probabilistically influenced by the topic on which the author is writing.
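The following minimal sketch illustrates the two properties noted above: stop words are omitted, and word order plays no part in the resulting vector. The stop-word list is deliberately tiny and, like the function name `bag_of_words` and the sample sentences, is an assumption made for the example.

```python
from collections import Counter

# Assumed, deliberately small stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "and"}


def bag_of_words(text):
    """Tally word frequencies, ignoring stop words and word order."""
    return Counter(word for word in text.lower().split() if word not in STOP_WORDS)


# Two hypothetical documents that use the same words in a different order.
doc1 = bag_of_words("the enzyme binds the substrate")
doc2 = bag_of_words("the substrate binds the enzyme")

print(doc1)
print(doc1 == doc2)
```

Because the two documents differ only in word order, they produce identical vectors, which shows both the economy of the model and the information it discards.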
In the word-context model, the semantic similarity of words is established by examining their co-occurrence in a document or corpus. While typically used to establish the contextual meaning of words, co-occurrence vectors can also be mapped to individual documents within a corpus to identify similarity in contextual usage of a term and can be used for identifying subsets of documents within a corpus that share similarity of meaning.
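A word-context vector can be sketched by counting the words that fall within a fixed window around each occurrence of a term. The window size of two tokens, the function name `cooccurrence`, and the sample sentence are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def cooccurrence(tokens, window=2):
    """Count, for each word, the words appearing within `window` positions of it."""
    contexts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                contexts[word][tokens[j]] += 1
    return contexts


# Hypothetical sentence in which two different words share similar contexts.
tokens = "doctors treat patients while physicians treat patients".split()
contexts = cooccurrence(tokens)

shared = set(contexts["doctors"]) & set(contexts["physicians"])
print(sorted(shared))
```

Words whose context vectors overlap heavily, as "doctors" and "physicians" do here, are candidates for semantic similarity under the distributional hypothesis.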
Pair-pattern analysis (also known as latent relationship analysis) examines the frequency of co-occurrence of word-pairs in documents or corpora. This approach is typically coupled with a thesaurus to expand the definition of word-pairs and to offset the difficulties and computational burden associated with dealing with sparse matrices in the analyses that typically follow.
Once semantic vectors have been extracted from documents or corpora, a variety of well-known mathematical approaches can be applied to the resulting vectors and tensors to establish similarity, dissimilarity or semantic proximity between or among the documents of interest. The process typically begins with application of an algorithm that transforms the term-frequency vectors or matrices into a distance, similarity or dissimilarity matrix. A wide variety of algorithms are available and well known in the art to derive such values including, but not limited to, geometric measures of distance (Euclidean distance, Manhattan distance, cosine similarity) and measures commonly used in information theory (Hellinger, Bhattacharyya, Kullback-Leibler). Selection of the optimal measure is often determined empirically and tied, at least in part, to the desired method of conveying this information to end-users or readers. Various smoothing and weighting functions may be applied at this step to minimize the effect of outliers and to maximize the amount of information that is actually available within the data set.
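The measures named above can be sketched in a few lines each. This is an illustrative sketch only: the function names and the sample vectors are assumptions, the information-theoretic measures are applied to frequency vectors normalized to sum to one, and the Kullback-Leibler function assumes the second distribution is nonzero wherever the first is nonzero (no smoothing is applied).

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def normalize(v):
    """Scale a frequency vector so that its entries sum to one."""
    total = sum(v)
    return [x / total for x in v]


def bhattacharyya_coefficient(p, q):
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def bhattacharyya_distance(p, q):
    return -math.log(bhattacharyya_coefficient(p, q))


def hellinger(p, q):
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - bhattacharyya_coefficient(p, q)))


def kl_divergence(p, q):
    # Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute zero.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Hypothetical term-frequency vectors for two documents.
x = [2, 1, 0, 1]
y = [1, 1, 1, 1]
p, q = normalize(x), normalize(y)

print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y))
print(hellinger(p, q), bhattacharyya_distance(p, q), kl_divergence(p, q))
```

Note that the Kullback-Leibler divergence is not symmetric, so it yields a dissimilarity matrix that generally differs from its transpose, which is one reason the choice of measure is made empirically.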
The final step in the process is the interpretation of the mathematical analysis results. A variety of approaches that are commonly used in exploratory data analysis and machine learning are available and applicable. Ideally, the method used should minimize information loss. Typically, results of such analyses are summarized in both graphical and tabular form. Documents that are most similar to one another typically plot closest to each other and appear closest to each other in sorted lists. The end-user/reader must then determine which results make the most sense, using the output of the analysis to support their decision.
VSM for semantic searching of large corpora has become an important approach to information retrieval. It is especially useful within collections of highly structured documents, particularly in technical areas including scientific, technical, medical and legal literature. VSM methods also underpin some of the commercial and publicly available search systems used in analysis of the patent literature.
VSM-based indexing and information retrieval provide a number of distinct advantages over simple query methods based on single terms or lists of terms that are composed using Boolean methods. Deerwester (U.S. Pat. No. 5,788,362) teaches that data and documents can be indexed, filtered and retrieved using vector and matrix operations, and has provided much of the foundational work on which current VSMs are based.
In Liddy et al. (U.S. Pat. No. 5,873,056), VSMs are applied in a system that uses natural language processing to generate a subject vector that is representative of the source text. The subject vector was composed of subject codes, which were in turn derived from a lexical database that was used to categorize each word in the source text. This embodiment provided access to meanings and word senses in a method designed to disambiguate each word to arrive at an accurate meaning. The subject codes were used to produce weighted, fixed-length vector representations of the semantic content of documents within a corpus.
The major problem encountered with this approach was polysemy, where the system could not automatically assign the appropriate subject code to a given case. The method and system are further limited by the highly restricted number of subject codes available (124), although finer-grained codes were available as part of a hierarchical coding scheme to improve classification. Such a low level of dimensionality is unlikely to provide meaningful filters of large and disparate corpora. In addition, subject codes appeared to perform only marginally better than keyword indexing.
Liddy et al. (U.S. Pat. No. 5,963,940) describe a method and system for indexing documents based on natural language processing in which up to 680 subject field codes were used in a VSM. The field codes were arranged hierarchically into categories and subcategories. The system also tagged each word and indexed each at the syntactic, lexical, morphological, semantic, discourse, and pragmatic levels. Each term was assigned values for seven fields, and subject code data was stored in a separate database that was used to index the processed documents. The system also provided a special treatment of proper nouns and noun groups and included an expansion to include subordinate members and resolution of synonyms, meronyms and hyponyms (i.e., words or phrases that are included within the meaning of another word or phrase; e.g., “is-a” relationships). Similarity matching was not based on a semantic vector; rather, it was based on a vector of combined scores arising from a complex analysis of multiple linguistic features. Extensive training sets were required, especially in the cases where proper nouns and restricted knowledge domains were involved.
In U.S. Pat. No. 6,185,550 Snow et al. describe a method and an apparatus for classifying documents within a class hierarchy using a VSM. In this embodiment, documents are classified using specific term vectors that are part of a separately maintained hierarchy that was developed to dictate directory naming and file storage locations on a computer system.
In U.S. Pat. No. 7,133,860 Iizuka et al. describe a device and a method for automatically classifying documents using vector analysis. A relational matrix is described in which distances between words and distances between documents are estimated in a manner analogous to R- and Q-analyses used in numerical taxonomy. In an R-analysis, the estimate of similarity, dissimilarity, correlation or geometric distance is based on correlations among the attributes (dependent variables) that appear within the objects that are included in the classification. An R-analysis is useful in determining the natural weighting that occurs in a dataset. In a Q-analysis, the estimate is of similarity, dissimilarity, correlation or geometric distance among the objects, based on the attributes that are used in the classification. When applied to document analysis, an R-analysis would be useful in determining which words would have the greatest impact on the classification (including those words that occur frequently but have little information content). This is described using an alternative approach in the '860 patent, applying an algorithm that measures “force” of different keywords that were part of a method for determining relationships among clients and commodity products.
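The distinction between R- and Q-analyses can be illustrated on a small document-by-term matrix. This sketch shows the general numerical-taxonomy idea only and is not the algorithm of the '860 patent. The matrix values, the term labels, and the choice of Pearson correlation for the R-analysis and Euclidean distance for the Q-analysis are assumptions, and the correlation function assumes that no term has the same count in every document.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pairwise(vectors, measure):
    """Apply `measure` to every pair of vectors, giving a square matrix."""
    return [[measure(u, v) for v in vectors] for u in vectors]


# Hypothetical data: rows are documents (objects), columns are terms (attributes).
terms = ["cell", "wall", "patent"]
matrix = [
    [3, 2, 0],
    [2, 1, 0],
    [0, 0, 4],
]

# Q-analysis: compare the objects (rows), here by Euclidean distance.
q_result = pairwise(matrix, euclidean)

# R-analysis: compare the attributes (columns), here by correlation.
columns = [list(column) for column in zip(*matrix)]
r_result = pairwise(columns, pearson)

print(q_result)
print(r_result)
```

In this example the Q-analysis places the first two documents close together, while the R-analysis shows that "cell" and "wall" vary together across documents and that "patent" varies inversely with both.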
In U.S. Pat. No. 7,299,247 Colisti-Yeh et al. described an apparatus and method for producing a semantic representation of information in a semantic space. The advantages of the method described therein were that information in documents was represented at a semantic level that could be adjusted to meet the user's needs, that documents could be clustered, searched and classified based on semantics, and that the system and method were trainable. The method does, however, require considerable user interaction and intervention.
Outside the narrow but active field of VSM application in semantics, other relevant developments have taken place that have bearing on this application.
In U.S. Pat. No. 6,834,290 Ausputz describes a semiotic model for querying a computer. Limitations of semantic and syntactic querying systems are laid out as problems that can only be addressed by a system and method that fully satisfy the Peircean reduction theorem. The system employs a semiotic describer that provides additional information about query terms that include semiotic signifiers. The describers provide information about parts of a query, in the form of a piece of media (digital content) that is associated with one or more signifiers. The system is distinct from other prior art as the '290 patent makes no reference to VSMs and none appears to be used. The '290 patent is also silent about the extent of training that is required to achieve acceptable results.
In U.S. Pat. No. 7,925,444, Garrity et al. (which is incorporated herein by reference in its entirety) describe a system and method for resolving ambiguity between names and entities. This invention discloses a method for resolving synonymies and homonymies that exist in biological nomenclature using a semiotic model that satisfies the Peircean reduction theorem through the use of redirection, mediated through actionable, globally unique, persistent identifiers (PIDs). Biological nomenclature (specifically those names applying to Bacteria and Archaea) represents one of a number of naming systems and terminologies that have specific application in a particular field of science, technology, medicine or law. Each such system is typically “managed” in that it follows a specific set of rules for creation, maintenance, change and application of those names/terms and the concepts and entities to which they apply. The invention fully supports accessing any form of digital information via names, taxonomic concepts or exemplars (a metadata representation of a physical entity). The invention also supports accurate resolution of names or terms and linking together all elements into one or more taxonomies based on published information. The use of actionable PIDs of any type or class provides a mechanism whereby instances of biological names could be linked directly to semantic/semiotic information about the name and its application.
In US 2010/0198841 Parker et al. (which is incorporated herein by reference in its entirety) build on the '444 patent. They describe systems and methods for automatically identifying and tagging biological names and name-like strings in digital resources and providing semantic resolution services via PIDs. They also describe a method for extending the list of names and name-like strings and automatically tracking the frequency of occurrence and location of each name in digital resources.
In U.S. Pat. No. 8,036,997 Garrity et al. describe a method of uncovering and correcting annotation errors using a self-organizing, self-correcting algorithm. In that application, they also demonstrate how the output of large-scale classifications could be visualized using re-ordered heatmaps.