1. Field of Invention
This invention relates to a method for discovering knowledge from databases; specifically, it applies corpus linguistic analysis techniques to unstructured text columns within the database, focused by partitioning the many contexts described in both structured and unstructured columns, to organize language within context into knowledge.
The present widespread use of database technologies has resulted in a large volume of unstructured text columns as well as structured columns. There is a strong economic motive for enterprises with such databases to extract information from the unstructured columns. So far, there is little support for this requirement. There are data mining and text mining approaches that extract information according to formal rules, but these approaches are based on word and phrase forms, not on an understanding of linguistic usage and meaning. Corpus linguistics techniques, which are described below, provide a way to apply linguistic technologies to extract meaning from texts. But there are no methods that describe how to use corpus analysis methods on the unstructured columns, integrated with formal database columns, as a guide to linguistic interpretation.
Practically every modern enterprise uses a database management system to store operational information. One example is the Veterans Health Information System and Technology Architecture (VISTA), which is a suite of programs and databases used at the Veterans Administration (VA) hospitals and clinics on a daily basis. The VA placed VISTA into the open source community so that other hospitals and health-care enterprises could apply electronic medical records technology to their businesses at low cost. At the time of this filing, there are thousands of hospitals and clinics that use VISTA software. Over 50 million visits per year are recorded in the VISTA databases.
VISTA has over 1,940 files and over 44,960 data fields. VISTA does not use relational databases at present, but efforts are being conducted to map VISTA into relational database form for knowledge extraction. Third party software companies are already installing relational versions of VISTA in health care enterprises at the time of this filing. Thus relational databases with tens of thousands of columns of information are presently being used, but extraction of operational knowledge from these databases is at best an expensive custom programming project.
As in most databases, some of the VISTA information is in unstructured text. Columns related to radiology, patient history, physical information and discharge notes, among many others, can only be extracted as narrative text, and require additional software for extracting knowledge, formatting it into structured columns, for subsequent analysis.
The large national Health Data Repository (HDR) is being developed from the databases produced by the VISTA medical information system at selected hospitals. The team will integrate the various representations in both structured and unstructured columns so that health care data can be studied by researchers, and so that the next versions of VISTA can provide interchangeability among medical health records. Prior to this planned effort the databases are not compatible. The team will expend significant effort in correcting the various databases to use the same terminology.
Computational linguistic competence is an important technology for the future. This capability is needed in such applications as question answering, information retrieval, robotics, interactive appliances, interactive vehicles, speech recognition, speech understanding, data mining and text mining. It is a required enabler of highly competent speech recognition. A highly competent linguistic capability could make the ambiguities inherent in human language tractable, and so provide feedback to speech recognizers and situation understanders.
Even small text columns in databases raise data cleansing issues. Various sites use different words for the same concepts. For example, there are 3,396 columns that hold a Yes/No value. These were recorded in 30 different conventions throughout the VISTA installations, such as Yes=1 & No=2, Yes=1 & No=0, etc. Organizing this diversity into a single representation will be essential in organizing the operational data for later research purposes. Data mining tools are best applied when a universal representation exists for Boolean columns, numeric columns and other structured columns. So reorganizing the HDR for future data mining efforts will require substantial restructuring, and there is no tool available, prior to this application, that can provide highly productive ways of performing that reorganization.
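The kind of Boolean unification described above can be illustrated with a minimal sketch. It assumes that each installation's convention is already known and recorded in a lookup table. The site names and the third convention are hypothetical; only the Yes=1 & No=2 and Yes=1 & No=0 encodings come from the text.

```python
from typing import Optional

# Hypothetical per-site conventions (site names are illustrative only).
# The first two encodings are the ones named in the text.
SITE_CONVENTIONS = {
    "site_a": {"1": True, "2": False},   # Yes=1 & No=2
    "site_b": {"1": True, "0": False},   # Yes=1 & No=0
    "site_c": {"y": True, "n": False},   # assumed letter-based convention
}


def normalize_yes_no(site: str, raw: object) -> Optional[bool]:
    """Map a site-specific Yes/No value to one Boolean representation.

    Returns None when the value is missing, the site is unknown, or the
    value is not covered by that site's convention.
    """
    if raw is None:
        return None
    key = str(raw).strip().lower()
    return SITE_CONVENTIONS.get(site, {}).get(key)


print(normalize_yes_no("site_a", 2))      # False under Yes=1 & No=2
print(normalize_yes_no("site_b", "1"))    # True under Yes=1 & No=0
```

Returning None for uncovered values, rather than guessing, keeps unrecognized encodings visible for later review.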
VISTA uses the National Drug Code (NDC) to encode medications in unique ways. Nevertheless, Propranolol 10 Mg Tablets are recorded in the various VISTA installations using hundreds of distinct NDC codes. This is a common practice in database technology that has to be corrected when multiple databases are integrated. Other examples include the nomenclature for medications, treatments, and many other items that will have to be unified in the integrated HDR database. These linguistic obstacles to integration will be very expensive to overcome.
Presently, XML text data interchange among business partners is common practice, used widely in N-tier database systems with web services. Most large companies which use the internet for data exchange with partners have begun using XML descriptions. For example, title companies often send lenders an XML copy of a property's title as unstructured text columns within the XML message. The title description can indicate surveyors' comments about the property, its shape, size and other features which affect its value. Lenders and title companies require a method for checking the property description to ensure that it is consistent with other documents, such as appraisals, city or county records, and so on. A database representation that supports knowledge discovery of these title descriptions would provide consistent and low risk decisions about loan applications which reference this title.
In many enterprise architectures, in a wide variety of industries, XML interchange among business partners normally incorporates some columns with unstructured text, and there is at present no linguistic tool available that solves the issues of finding common methods of descriptions contained in these unstructured text columns. A tool is needed that can be applied to manage the context of unstructured columns by using the methods of corpus linguistics.
2. Description of Related Art
People use language within contexts. Yet the best present parser technology is based on context free grammars. There is at present no tool capable of organizing the many layers of context in which language carries meaning within application database contexts. Many context free methods have been developed for parsing language samples, but no effective, practical methods have been developed for relating context to language samples in a way that supports the necessary extraction of linguistic content based on application context. For example, noun phrases are commonly parsed so that the rightmost noun is chosen to be the “head” noun. Yet people use context to identify the head noun of a phrase, and it is often not the rightmost noun.
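The rightmost-noun convention mentioned above can be sketched in a few lines. The example assumes noun phrases that have already been tagged with Penn Treebank style part-of-speech labels; the tags and the two sample phrases are hand-assigned here for illustration and do not come from any particular parser.

```python
from typing import List, Optional, Tuple

TaggedToken = Tuple[str, str]  # (word, part-of-speech tag)


def rightmost_noun_head(phrase: List[TaggedToken]) -> Optional[str]:
    """Context-free heuristic: pick the rightmost noun as the head."""
    for word, tag in reversed(phrase):
        if tag.startswith("NN"):  # NN, NNS, NNP, NNPS
            return word
    return None


# Hand-tagged sample phrases (assumed tags, for illustration only).
blood_pressure = [("high", "JJ"), ("blood", "NN"), ("pressure", "NN")]
history = [("history", "NN"), ("of", "IN"), ("chest", "NN"), ("pain", "NN")]

print(rightmost_noun_head(blood_pressure))  # pressure
print(rightmost_noun_head(history))         # pain
```

For "high blood pressure" the heuristic selects "pressure", which matches the usual reading. For "history of chest pain" it selects "pain", although a reader would normally take "history" as the head, which is the kind of case the paragraph above describes.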
Corpus linguistics is a fairly recent subfield of linguistics developed to work with natural language texts. Corpora (also called “Corpuses”) are collected as a body of texts, utterances, or other specimens considered representative of language usage. Corpora are available that have millions of words, and are annotated, or tagged, to add information about the meaning of the items in the corpus. Corpus analysis tools, including annotators, lexical resources, part of speech recognizers, parsers, conceptual graphs, semantic processors, logic interpreters and other linguistic tools have been developed to assist linguists in studying the ways in which language is actually used. Presently, most annotation is added by human observers. This practice is subject to the diversity of opinion among a team of annotators, and accuracy varies even with a single annotator. Simple annotations, such as part-of-speech labeling, have been accomplished by automated functions, but no functions have been able to provide even modest linguistic competence at this task. There is a need for automated contextual information that has not been available to corpus linguists.
For example, the Linguistic Data Consortium (LDC) is an organization at the University of Pennsylvania that has constructed a large body of corpora used in research, and made available to computational linguists for academic research or commercial application of corpus linguistics methods. Various annotations have been used to designate parts of speech, syntactic categories, pragmatics and discourse structures within the corpus. There is a need for methods that can provide contextual partitioning of linguistic entities so that corpus linguists can focus language processing functions into a narrower scope of understanding. Discourse analysis, the extraction of knowledge from lengthy text, is missing a context modeling capability that could organize the knowledge in each utterance and relate that knowledge to other utterances within the same discourse.
Tools have evolved due to the research and development focus on corpus linguistics. Extraction of vocabularies from texts, calculation of word and phrase histograms as used within texts, the separation of vocabularies for information retrieval used by Google, Yahoo, X1 and other search engines, and the application of lexical resources such as WordNet, SNOMED and others have expanded the corpus linguistics advances into commercial and consumer products and services. These tools have not been applied to the unstructured columns of databases because no method has been disclosed that can relate each portion of the text to knowledge extraction of other portions. Text has been treated as a homogeneous collection of lexical, syntactic and semantic atoms, not as an organized, context sensitive discourse about related subjects. As a result, for lack of an effective method, the costs of knowledge extraction from databases vastly exceed the benefits.
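As an illustration of the word histogram calculation mentioned above, the following minimal sketch counts word forms across the values of one unstructured text column. The sample notes and the simple tokenization rule (lowercased runs of letters and digits) are assumptions made for the example, not part of any disclosed method.

```python
import re
from collections import Counter
from typing import Iterable


def word_histogram(texts: Iterable[str]) -> Counter:
    """Count word forms over a collection of free-text column values."""
    counts: Counter = Counter()
    for text in texts:
        # Assumed tokenization: lowercase runs of letters and digits.
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    return counts


# Hypothetical values from a narrative text column.
notes = [
    "No acute fracture. Mild swelling noted.",
    "Mild swelling, no fracture seen.",
]

hist = word_histogram(notes)
print(hist["fracture"])  # 2
print(hist["acute"])     # 1
```

A histogram of this kind treats the column as a flat collection of word forms, which is the limitation the paragraph above attributes to existing tools.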
Present corpus linguistics technology is focused on the most general usage of language without context of enterprises and situations that are highly specialized. There is no widely agreed upon method for representing changing context within a corpus. In database applications, the structured columns provide context only within query statements, and a flexibly controlled context is not available to corpus linguists at present. Corpus linguistics technologies are also not presently focused on exploiting the context of a situation to assist in understanding linguistic constructions. At present there has been no disclosure of methods that can apply the methods of corpus linguistics in the contexts of databases.
Database data models are sometimes referred to as ontologies, although ontologies have constraints and inferential rules in addition to entity relationships. An ontology is one way of structuring conceptual models into classes, objects and relationships among the objects. An upper level ontology of concepts has been suggested among the community of philosophers, logicians and software engineers as an initial ontology for applications. The IEEE Standard Upper Ontology (SUO) working group applied some of the world's best logicians, philosophers and linguists to the problem of choosing at least a small “universal” ontology based on a conceptual framework. This is one approach to narrowing context, but is not specifically linguistically based. Ontologies are presently being studied in many areas, but have not made much commercial progress, perhaps with Cyc as the best known example. And even after years of searching, the SUO group was unable to agree on a suitable top level ontology. One conclusion many SUO members reached was that no universal ontology exists because the meanings of classes, objects and relationships constitute subjective experiences on the part of the sending and receiving agents, and do not represent abstract properties of reality.
This is certainly the case in present day applications of database technologies. Each database represents a very large number of situations recorded in highly structured ways, but with unstructured text information embedded into the database as well. Ontologies as actually used by people are empirically developed through individual experience, rather than abstractions describing reality in some objective way. Database applications represent specific shared experiences that have been recorded by people and functions, and hence are as empirical as a corpus, but with surrounding context added. There is a need for methods of linguistic classification, theorization, validation and interchange among ontologies which can represent empirical database ontologies in a form that can be used to analyze unstructured text columns in the databases.
At present, linguists assign a limited number—about twenty—of themes or roles to phrases and use corpus linguistics methods to manage some degree of linguistic competence. These limited roles and themes are known to linguists and recognized as methods for representing knowledge about linguistics. But this very limited number of roles and themes confines linguistic analysis, and is insufficient for the much larger numbers of object classes that represent empirical databases. What is needed for linguists to make substantial progress is a greater degree of contextual focus, through an unlimited set of objects, themes and roles organized in a way that represents a linguistic construction within a context. There is a need for representing themes and roles in a deep unlimited context so that alternative solutions can be deeply focused on the situations that linguistic expressions describe. There is a need for an open-ended set of concepts that can represent the themes and roles, and therefore expand contextual knowledge.
There is a strong need for the integration of database technology and linguistic methods, both to extract linguistic knowledge in databases, and to support computational linguists in extracting context and then knowledge from unstructured texts. This invention discloses a method that is applicable both to the computational linguist and to the database engineer.