This invention relates generally to the field of computerized natural language processing and, more specifically, to creating a semantic knowledge base.
In many disciplines, documents can be readily grouped together in a corpus, or document collection; in medicine, for example, one can aggregate radiology reports, electroencephalogram reports, discharge summaries, etc. These free text documents contain a great deal of knowledge in an unencoded form. However, while the value of coded information for decision support, quality assurance, and text mining is readily understood, satisfactory methods are not available for building a comprehensive semantic knowledge base efficiently and inexpensively, which can provide a means to code these free text documents. Especially vexing is the problem of representing the semantic knowledge of entire sentences in a codeable format, which then could be easily manipulated within a relational database management system.
Natural language processing (NLP) can facilitate data exchange and data mining by extracting and codifying the semantics of free-text records. However, even after a sizable investment over many years by different companies, the technology remains too immature to be used in commercial coding applications except against relatively small code sets. There are several reasons why NLP has fallen short:
Understanding human language is extremely knowledge intensive and discourse specific. Existing NLP systems do not have enough domain knowledge to correctly interpret the entire semantics of knowledge domains like radiology.
The development of a comprehensive text-mining system requires a large semantic knowledge base, which mirrors the underlying content the expert wishes to analyze. The tools and knowledge representation methods for creating this kind of knowledge are limited.
The syntactic-semantic parsing approach, which relies mainly on grammatical and lexical rules, is too rigid and cannot reliably determine the semantic equivalence of different sentence expressions for even a moderately sized knowledge domain. Furthermore, such parsing is not scalable or easily adapted as new domain knowledge becomes available.
Many systems have been proposed to semantically search free text, but they are impossible to evaluate without more precisely defining the criteria for their operation. Our invention adheres to four principles for coding free text:
Reject all sentences that are semantically invalid.
Use one proposition/symbol/code/logical form to represent the meaning of a simple sentence.
Use the same propositions/symbols/codes/logical forms to represent the meaning of all semantically equivalent sentences.
Reveal all propositions to the end user; if the proposition is “published,” the system can index text whose meaning is represented by that proposition.
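The four principles can be illustrated with a minimal sketch. It assumes a hand-built lookup table; the proposition codes (P001, P002), the example sentences, and the function names are hypothetical and are not taken from any system described here.

```python
# Hypothetical illustration of the four coding principles.
# All codes, sentences, and names below are assumptions made for this sketch.

# Principle 4: the propositions are published, so a user can see what is codeable.
PROPOSITIONS = {
    "P001": "The lungs are clear.",
    "P002": "The heart is normal.",
}

# Principles 2 and 3: one code per simple sentence, and the same code
# for every semantically equivalent sentence.
SENTENCE_TO_CODE = {
    "the lungs are clear": "P001",
    "lungs clear": "P001",
    "clear lungs bilaterally": "P001",
    "the heart is normal": "P002",
    "normal heart": "P002",
}

# Principle 1: sentences judged semantically invalid are rejected, not coded.
INVALID = {"lungs and heart are clear"}


def normalize(sentence: str) -> str:
    """Lower-case, trim surrounding spaces and periods, collapse whitespace."""
    return " ".join(sentence.lower().strip(" .").split())


def code_sentence(sentence: str) -> str:
    """Return a proposition code, 'REJECTED', or 'UNKNOWN'."""
    key = normalize(sentence)
    if key in INVALID:
        return "REJECTED"
    return SENTENCE_TO_CODE.get(key, "UNKNOWN")
```

In this sketch, "Lungs clear." and "Clear lungs bilaterally." both return P001, while "Lungs and heart are clear." is rejected rather than coded.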
Prior art systems fail criterion one. For example, consider the following sentence from an actual radiology report: “Lungs and heart are clear.” Semantic parsers such as MedLee can structure and code the sentence as Clear: Lungs, Clear: Heart, but does it make sense? The writer probably meant “The heart is normal and the lungs are clear.” Unfortunately, most computer systems do not possess common sense knowledge and cannot correct or make valid inferences about a writer's intention. Many computational linguists would agree with criterion two, but it requires detailed domain knowledge that is often lacking. Most systems struggle with criterion three. When an NLP system semantically analyzes a sentence using predicate calculus and employs a compositional lexicon, divergent codes may represent the same idea. Data mining and decision support are impaired in direct proportion to divergent coding. Finally, with respect to criterion four, if propositions are not revealed up front, users will not have faith that the NLP system can actually extract that meaning from free text. Users will also be unable to assess the semantic granularity or depth of the natural language processing system.
Although the state of the art in information-retrieval (IR) technology makes it possible to return sentences by relevance ranking, it cannot provide answers over a corpus to basic semantic queries such as “How many patients had right lower lobe pneumonia?” or “Which patients had a diffuse pulmonary infiltrate?” IR algorithms retrieve documents based on keywords but cannot anticipate the hundreds of different possible ways the same knowledge can be embedded within a sentence. The design objective of the present invention is to represent all semantically equivalent sentences with the same proposition(s) (see example in FIG. 1).
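The difference between a keyword search and a query over coded propositions can be sketched as follows. The sketch assumes that each sentence has already been annotated with a proposition code; the table layout, the code P100, and the sample sentences are hypothetical and are used only for illustration.

```python
import sqlite3

# Hypothetical annotated sentences: (patient_id, sentence, proposition_code).
# P100 is assumed to mean "There is right lower lobe pneumonia."
ROWS = [
    (1, "Right lower lobe pneumonia.", "P100"),
    (2, "Pneumonia is present in the RLL.", "P100"),
    (3, "The lungs are clear.", "P001"),
    (1, "The heart is normal.", "P002"),
]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding the annotated sentences."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation ("
        "patient_id INTEGER, sentence TEXT, proposition_code TEXT)"
    )
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", ROWS)
    return conn


def count_patients_with_proposition(conn: sqlite3.Connection, code: str) -> int:
    """Count distinct patients having at least one sentence coded to `code`."""
    cur = conn.execute(
        "SELECT COUNT(DISTINCT patient_id) FROM annotation "
        "WHERE proposition_code = ?",
        (code,),
    )
    return cur.fetchone()[0]


def count_patients_with_keyword(conn: sqlite3.Connection, phrase: str) -> int:
    """Count distinct patients whose sentence text contains `phrase`."""
    cur = conn.execute(
        "SELECT COUNT(DISTINCT patient_id) FROM annotation "
        "WHERE sentence LIKE ?",
        (f"%{phrase}%",),
    )
    return cur.fetchone()[0]


conn = build_demo_db()
coded_count = count_patients_with_proposition(conn, "P100")
keyword_count = count_patients_with_keyword(conn, "right lower lobe pneumonia")
```

With this sample data the query by proposition code finds both patients, whereas the keyword query finds only the patient whose report happens to use the exact phrase.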
Richards, et al. (2003), in U.S. Pat. No. 6,507,829, point out that the possible combinations of words in a language number in the billions, even when the length of the combination is restricted to five words or fewer, thus making it seemingly impossible to use strong methods such as string matching to identify equivalent semantics. The classification approach they proposed used weighted N-grams to automatically code text fragments, such as adverse drug events. However, only two-thirds of these text fragments could be correctly classified. Further, all the relevant semantics must be identified prior to classification. This makes the approach unsuitable for classifying the semantics of a free text document collection where the classification vectors are not known in advance.
Abir, in U.S. patent applications 20030061025, 20030083860, and 20030093261, proposed an approach for associating phrases of similar meaning in various languages to translate documents from one language into another. His approach returned an association between sets of strings, but did not produce a semantic knowledge base or a means to associate sentences with discrete semantic propositions.
Paik et al. (2001), in U.S. Pat. No. 6,263,335, teach a domain independent system to automatically extract meaning from a corpus and build a subject knowledge base. The semantics from the corpus are extracted as concept-relation-concept (CRC) triples and stored in a database. However, there are several limitations to this approach. The semantic extraction scheme was not designed to capture the meaning of entire sentences, only those parts (principally phrases) and relations (connectors between phrases) that can be analyzed by the syntactic/semantic rule base. Semantic predicates are defined by the grammar, not the domain expert. In many cases the system cannot accurately assign linguistically diverse but semantically equivalent sentences to the same CRCs, because there are potentially hundreds of different ways writers can express the same sentence meaning, and the rules for CRC extraction cannot anticipate all these ways. Additionally, domain experts play no part in constructing and designing the semantic knowledge base. Thus they cannot choose the preferred terms in CRCs, the predicates which relate concepts, or arrange them in a knowledge hierarchy. Finally, Paik's automatic extraction system lacks domain knowledge which humans routinely use in sentence processing. For example, in radiology, the sentence “No evidence of acute infiltrates or failure” can be logically represented as two propositions: a. There are no acute pulmonary infiltrates, and b. There is no heart failure. Notice that the word “heart” is missing in the sentence, but a physician has no trouble interpreting the correct meaning, because of the associated context of “acute infiltrates”. However, it is exceedingly difficult for a computer using automated extraction rules to make this inference. Paik does not use domain experts to review semantic assignment except in ambiguous cases.
There are well known problems in the art with syntactically parsing sentences that computational linguists have yet to solve. Experts in the field of natural language processing believe only 30% of English sentences can be structured as logical forms using automated methods [Rebholz-Schuhmann D, Kirsch H, Couto F (2005) Facts from text—Is text mining ready to deliver? PLoS Biol 3(2)]. Syntactic analysis has not solved the problem of modifier attachment. English sentences often include prepositional phrases with ambiguous modifier attachments. For example, in the sentence, “The patient has stool and gas scattered throughout the colon and rectum without evidence of free air or obstruction”, it is immediately obvious to a physician that the phrase “without evidence of free air or obstruction” refers to the colon and not the rectum. Physicians disambiguate these types of sentences easily, because they have clinical knowledge which computers do not.
Additionally, a great deal of English uses non-grammatical expressions which make it difficult for natural language processing approaches which rely on syntactic analysis. For example, the ‘sentences’, “No intracranial hemorrhage or mass effect”, “Status post median sternotomy in the interval”, and “Kidneys, no hydronephrosis”, are typical examples of writing in medical reports where the verb is omitted, and the subject is implied.
Most NLP systems include grammatical transformations that precede semantic analysis. However, if the parser cannot syntactically parse the sentence correctly, there is little chance that the semantic assignment will be accurate.
Cao (PGPUB 2008/0221874) provides for the semantic representation of parts of sentences. Cao's approach is consistent with the work of other computational linguists who use grammatical and semantic parsing, but she supplements this with human annotation. However, her approach cannot represent the meaning of the entire sentence. Its purpose is to link words and phrases in a parse tree. For example, the sentence “I want to fly from New York to Boston” is annotated with “tags” and “labels”. The “tags” are “Pron-Sub”, “Intend”, “Intendo”, “Verb”, “From”, “City”, “To”, and “City”. The “labels” are “Subject”, “Intend”, “Verb”, “From Location”, “To Location”, and “S”. Each word receives a “tag”. Each “tag” is linked to a “label”. There is no “label” that defines the meaning of the entire sentence, so the semantic annotation does not cover the meaning of an entire sentence.
Wical (U.S. Pat. No. 5,694,523) teaches a content processing system that classifies the content of input discourse using a “knowledge catalog”. Wical defines “knowledge catalog” as a plurality of independent and parallel static ontologies supplemented by dynamic ontologies that represent the broad coverage of concepts that define knowledge. However, Wical was unable to represent the semantics of sentences at a granular level of detail, nor does he try to represent the complex inter-relationships between words in a sentence.
Most natural language processing (NLP) systems focus on extraction heuristics over knowledge representation issues. One cannot tell if the semantics of an entire sentence can be captured, because these systems only attempt to identify a limited number of predefined concepts, rather than extracting the entire meaning of the sentence. An NLP system which extracts the entire meaning must capture and represent subtle nuances of language. It would be desirable to inspect the fidelity of annotated sentences to knowledge base entries, and to quantify the percentage of free text sentences represented in the knowledge base over a large collection of documents. Both capabilities are missing from the prior art.
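The percentage of sentences represented in a knowledge base can be computed with a simple ratio. The following is a minimal sketch, assuming the knowledge base is available as a set of normalized sentences; the function names and the sample data are hypothetical.

```python
from typing import Iterable, Set


def normalize(sentence: str) -> str:
    """Lower-case, trim surrounding spaces and periods, collapse whitespace."""
    return " ".join(sentence.lower().strip(" .").split())


def coverage(sentences: Iterable[str], knowledge_base: Set[str]) -> float:
    """Return the fraction of sentences that have a knowledge base entry.

    `knowledge_base` is assumed to hold normalized sentences that have
    already been annotated to a proposition.
    """
    total = 0
    matched = 0
    for sentence in sentences:
        total += 1
        if normalize(sentence) in knowledge_base:
            matched += 1
    # An empty corpus is reported as zero coverage rather than an error.
    return matched / total if total else 0.0


# Hypothetical data for illustration only.
kb = {"the lungs are clear", "the heart is normal", "no pneumothorax"}
corpus = [
    "The lungs are clear.",
    "The heart is normal.",
    "No pneumothorax.",
    "Status post median sternotomy in the interval.",
]
percent_covered = 100.0 * coverage(corpus, kb)
```

Here three of the four sample sentences have an entry, so the sketch reports 75 percent coverage.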
The dominant method in computational linguistics for representing the meaning of sentences is the use of logical forms. Logical forms use a formalism that is similar to first order predicate calculus (FOPC). Logical forms include predicates that describe relations or properties, and terms which are constant expressions. For example, the logical form Visualized (cervical-spine) is a one argument predicate that in this case takes the term cervical-spine. However, predicates are not limited to a single term. One could have a logical form like Degeneration (cervical-spine, severe). This logical form describes not only what is degenerated but the severity of the degeneration. FOPC also includes more complex symbols such as variables, predicate operators, and modal operators. Defining a complete set of predicates (especially multi-valued predicates) and terms for even a moderately sized knowledge domain is an unsolved problem in computational linguistics.
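The structure of a logical form can be sketched as a predicate with an ordered list of terms. The class below is a hypothetical illustration, not part of any cited system. The third logical form is an assumed alternative decomposition, added only to show how two logical forms for the same idea can fail to compare as equal.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LogicalForm:
    """A predicate applied to one or more constant terms."""
    predicate: str
    terms: Tuple[str, ...]

    def __str__(self) -> str:
        return f"{self.predicate}({', '.join(self.terms)})"


# The two examples given in the text.
visualized = LogicalForm("Visualized", ("cervical-spine",))
degeneration = LogicalForm("Degeneration", ("cervical-spine", "severe"))

# Hypothetical alternative decomposition of the same finding.
alternative = LogicalForm("Severe", ("degeneration", "cervical-spine"))

# The two encodings of one idea are different objects, so a database
# lookup keyed on one would not retrieve sentences coded with the other.
same_code = degeneration == alternative
```

Because the predicate and term order differ, `same_code` is false even though both forms are intended to describe severe degeneration of the cervical spine.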
The present invention does not use FOPC logical forms and thus represents a departure from the prior art. Rather the basic unit of semantic knowledge is the sentential proposition (see definition table) which can represent the meaning of an entire simple or complex sentence using a single logical symbol easily stored in a relational database. While computational linguists prefer first order predicate calculus (FOPC) because it is more expressive than sentential logic, there are several reasons sentential propositions are better suited for use in semi-automated knowledge base construction as part of the present invention:
Sentential propositions often mirror statements in natural language, making it easy for domain experts, versus computer scientists, to write them.
Sentential propositions are expressive and can capture all the significant modifiers and concepts contained in a sentence using a single logical symbol. This makes it easy to store and compare using a relational database management system.
Sentential propositions allow a semantic knowledge base to be easily organized to make semantic knowledge accessible for properly classifying unknown sentences by human experts.
While translating a FOPC logical form to a sentential proposition is possible, it cannot be easily performed by a domain expert, an important design consideration of the current invention.
Other approaches to capture meaning from free text include methods from lexical semantics. For example, the medical informatics community has invested tremendous resources in creating lexicons and terminologies to index and code medical documents. A major achievement was the Unified Medical Language System [Burgun A, Bodenreider O. Mapping the UMLS Semantic Network into general ontologies. Proceedings of the American Medical Informatics Association Symposium 2001:81-5] or UMLS, which is a metathesaurus of many large-scale vocabulary systems such as ICD-9 and SNOMED. Unfortunately, even these vocabularies have limited coverage for many concepts used in medical documents. Langlotz showed that only about 45% of radiology terms were covered by UMLS. [Langlotz C., Caldwell S. The Completeness of Existing Lexicons for Representing Radiology Report Information. J. Digital Imaging 15(1):201-205, 2002.]
Additionally, lexical coding schemes do not include all the relevant words such as noun modifiers that describe medical concepts, and often do not code for all the concepts in a medical document collection. No lexicons provide the means to code the semantics of complete sentences.
Corpus based approaches for creating a semantic knowledge base of invariant symbols that represent the meaning of semantically equivalent sentences have not been proposed. The formal methods for deriving a semantic knowledge base by accurately analyzing entire sentences in a corpus require tools and methods missing in the prior art.
The current art also fails to support the role domain experts play in constructing such a knowledge base and semantically annotating sentences. A few simple examples show why domain experts are needed. Consider the following sentence from a radiology corpus, ‘Supratentorial and infratentorial brain pattern is normal.’ This sentence should be annotated to the proposition, ‘The brain is normal.’ Yet, only someone with domain knowledge can make this inference. Or consider another sentence, ‘The patient is hyper-expanded.’ This should be annotated to the proposition, ‘The lungs are hyper-expanded’. Domain expertise is again required to understand that in the context of a radiology report ‘patient’ is being substituted for ‘lungs’.
Yet, by themselves, domain experts would face immense obstacles to building a knowledge base and performing semantic annotation for even a moderately sized domain. Both integrated tools and methods must support domain experts in building the required knowledge base of propositions and semantic mapping table.
Currently, there are no integrated tools and methods that would enable a domain expert to create sentential proposition(s) that reflect the underlying meaning of entire sentences within a free text record in a consistent manner. There are no inexpensive, reliable means for a domain expert to create a semantic knowledge base of codeable entries that represent in an invariant manner semantically equivalent sentences in a collection of related free text documents.
There are many applications that can be built using the knowledge base of the present invention. One such application could improve document workflow by recognizing free text sentences and graphically displaying the extracted knowledge. Another would be to build text mining engines that could search on the semantic meaning of sentences rather than just key words. Most significantly, data mining and decision support applications could take free text input and transform it into codeable entries, making it accessible for relational database analysis.