The present invention relates generally to language processing and, more particularly, to extracting meaning components from text for intelligent processing. The present invention also relates to the retrieval of or pointing to text relating to given meanings.
Most current methods of intelligent text processing and/or retrieval, in order to deal with the conceptual content of the text, focus on the following: a) developing advanced dictionaries and thesauri covering as much of the vocabulary of the language as possible, with extended lists of links between synonyms; b) developing a formal grammar for the language; and c) developing semantic networks, maps, or other ways to describe the relationships between concepts represented in language. To meet the requirements of today's textual data processing, such methods require huge computer systems and resources. For example, the National Library of Medicine (National Institutes of Health) conducts a long-term project, the Unified Medical Language System (UMLS) (U.S. Department of Health and Human Services; 10th Edition, January 1999). Its Thesaurus includes 1,358,891 phrases and its semantic network works with 626,893 concepts. And this is only in the health care domain. However, even such a large language knowledge base, built as a union of about 50 different standard controlled vocabularies, does not satisfy the practical need for efficient medical text processing and retrieval. The inefficiencies of this system and similar systems are shown either by the large variety and number of different responses, many of which are irrelevant, to input phrases that have essentially identical meanings, or by no response at all to many of them.
In the present invention, “language” is defined broadly as any form of human communication that can be represented in a form suitable for processing by a computing system. Examples of language include (but are not limited to): the many human spoken languages, sign/picture language, signals, electronic transmissions, pictures, etc.
Prior art systems have limitations when searching, finding meanings, or pointing to locations in data being analyzed for “objects” or groupings of concepts. In the present invention, two types of objects are defined. The first, referred to as “text objects,” are formed from pieces of text of widely varying size, beginning with short phrases and ending with whole archives, databases, and libraries substantially stored as text. For example, a database itself may be an “object” for the purposes of the present invention. “Text” in this paragraph means written samples of a human language. The second type are objects, also of widely varying size, that are stored in computer memory in any “non-text” format, such as computer language files, tables, non-text databases, images, slides, movies, sound recordings, web pages, etc. In the present invention, objects of either type are handled in the same fashion, and the retrieval of any object included in a Semantic Index (see below) is independent of the size and complexity of the object and is based on the conceptual description of this object. Hereinafter, “object” means either type of object defined above, except where the context indicates otherwise.
Hereinafter, “text” is defined broadly as any code representation of language, as defined above, that is suitable for processing by a computing system. Examples of text include (but are not limited to): ASCII code representations (of letters, numbers, symbols/signs and control codes), phonetic symbols (phonemes, triphonemes), hexadecimal, octal, binary, graphic symbols, etc.
Known prior art methods of extracting meanings from text (and the reverse) rely on literal matching, which is very sensitive to the wording, phrase structure, and punctuation of the text. In such systems, small changes in the input can result in unpredictable changes in the quantity and quality of the resulting response. An illustrative example may be found by using the common search engines on the Internet. A sample search for “safe pregnancy” on Yahoo resulted in seven hits, but “safe pregnancies” resulted in more than six thousand hits.
It is an object of the present invention to relate meanings and text to each other.
It is another object of the present invention to provide efficient retrieval of meanings with substantially no limitations on syntax in general, including wording, phrase structure and punctuation.
It is still another object of the present invention to search text or other data for any conceptually predefined objects, where said objects are defined by a set of meanings irrespective of syntax, wording, phrase structure and punctuation.
The objects of the present invention are met in a hardware and software system that forms a combinatorial computer system that extracts meanings from phrases or sentences expressed in a language. The language is formed into a text that is input into a computing system for processing. The present invention is based on four main parts. First, a set of universal basic or primary concepts, called Semantic Factors, is formed. These Semantic Factors are independent of any language. Second, a set of morpheme-type elements of the language being processed, called S-Morphs, is formed. The S-Morphs are compiled into a dictionary relating them to the Semantic Factors. The third part comprises algorithms and rules for splitting the words of phrases into S-Morphs and for using the S-Morph dictionary to relate the S-Morphs to the concepts and thereby to meanings. The fourth part comprises Semantic Indexes for the objects to be retrieved and algorithms that use these Semantic Indexes for pointing to the objects. In the Semantic Index, the objects are described using the same set of Semantic Factors that is used to describe the input to the system as a query. As a result, the system is capable of accepting queries in plain English, for example, “Find or point to any text associated with high blood pressure.” Another aspect of the present invention is that if an unknown S-Morph is found, the system simply passes over it and continues.
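The second and third parts described above can be illustrated with a minimal sketch. It assumes a greedy longest-match rule for splitting words, which is one plausible choice and is not stated in this description; the dictionary entries, the factor names (`HIGH`, `PRESSURE`, `BLOOD`, `LACK`) and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical S-Morph dictionary: morpheme-like fragment -> Semantic Factors.
# Entries and factor names are illustrative assumptions, not taken from the source.
S_MORPH_DICT = {
    "hyper": {"HIGH"},
    "high": {"HIGH"},
    "tens": {"PRESSURE"},
    "pressure": {"PRESSURE"},
    "blood": {"BLOOD"},
    "an": {"LACK"},
    "emia": {"BLOOD"},
}


def split_into_s_morphs(word):
    """Split a word into known S-Morphs using greedy longest match (an assumption).

    Characters that do not start any known S-Morph are passed over,
    mirroring the behaviour described for unknown S-Morphs.
    """
    word = word.lower()
    morphs = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in S_MORPH_DICT:
                morphs.append(word[i:j])
                i = j
                break
        else:
            i += 1  # nothing matched here: skip one character and continue
    return morphs


def factors_for_text(text):
    """Collect the Semantic Factors of every S-Morph found in the text."""
    factors = set()
    for word in text.split():
        for morph in split_into_s_morphs(word):
            factors |= S_MORPH_DICT[morph]
    return factors
```

With these toy entries, "hypertension" splits into "hyper" and "tens", and the unknown remainder "ion" is skipped. A dictionary this small also matches fragments by accident (for example "an" inside unrelated words), so a practical rule set would need additional constraints that this sketch does not attempt to model.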
Herein, as discussed before, concepts are independent of the lexicon, morphology and syntax of any given language. This allows the Semantic Factor base to be significantly smaller than the knowledge bases of prior art techniques, which involve large complex grammars, etc.
Further, the present invention provides modifiers for the Semantic Factors that allow comparative and/or quantification features to be associated with the Semantic Factors. With this addition, complex meanings can be derived from text with modest computing systems.
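One way to picture a modifier is as an optional tag attached to a Semantic Factor. The sketch below is an assumption made for illustration only: the `ModifiedFactor` type, the modifier values `MORE` and `LESS`, and the two word lists are hypothetical, and the sketch works on whole words rather than S-Morphs to stay short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModifiedFactor:
    """A Semantic Factor with an optional comparative/quantifying modifier."""
    factor: str
    modifier: Optional[str] = None  # hypothetical values: "MORE", "LESS", or None


# Hypothetical word lists, for illustration only.
MODIFIER_WORDS = {"high": "MORE", "elevated": "MORE", "low": "LESS", "reduced": "LESS"}
FACTOR_WORDS = {"pressure": "PRESSURE", "temperature": "TEMPERATURE"}


def parse(text):
    """Attach the most recent modifier word to the next factor word."""
    pending = None
    result = set()
    for word in text.lower().split():
        if word in MODIFIER_WORDS:
            pending = MODIFIER_WORDS[word]
        elif word in FACTOR_WORDS:
            result.add(ModifiedFactor(FACTOR_WORDS[word], pending))
            pending = None
        # Any other word is passed over; a pending modifier is kept.
    return result
```

Under these assumptions, "high blood pressure" and "elevated pressure" produce the same modified factor, while "low pressure" produces a different one, so differently worded phrases with the same quantified meaning compare as equal.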
In a preferred embodiment, a grouping of Semantic Factors, referred to as a group, can be compiled using the S-Morph dictionary and specified as the description of the meaning of an input query. Similarly, for all output objects, groups can be specified describing their meanings. Input text can then be processed looking for such objects. When matching objects are found, by comparison of the query group with the output object groups, the relevant objects can be output. It is important to notice that the Semantic Factors describe concepts and not actual words, so that, for example, the concept “blood” could be found in the text from the word “anemia.” The reverse could also occur. It is also important to notice that the Semantic Factors, as described elsewhere, are independent of the specific language, so the same Semantic Factors can be used with many languages.
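The comparison of a query group with object groups can be sketched as follows. This is a simplified illustration under stated assumptions: the matching criterion (every factor of the query group must appear in the object's group), the ranking by group size, the word-to-factor table and all names are hypothetical and are not specified in this description.

```python
# Hypothetical word -> Semantic Factors table standing in for the S-Morph dictionary.
WORD_FACTORS = {
    "anemia": {"LACK", "BLOOD"},
    "blood": {"BLOOD"},
    "hypertension": {"HIGH", "PRESSURE", "BLOOD"},
    "high": {"HIGH"},
    "pressure": {"PRESSURE"},
}


def group_for(text):
    """Build the group (set of Semantic Factors) describing a piece of text."""
    group = set()
    for word in text.lower().split():
        # Unknown words contribute nothing and are passed over.
        group |= WORD_FACTORS.get(word.strip(".,?!"), set())
    return frozenset(group)


def build_semantic_index(objects):
    """Map each object identifier to the group describing its meaning."""
    return {obj_id: group_for(description) for obj_id, description in objects.items()}


def retrieve(query, index):
    """Return objects whose group contains every factor of the query group.

    Assumed ranking: objects with smaller groups (closer matches) come first.
    """
    query_group = group_for(query)
    if not query_group:
        return []
    hits = [(len(group), obj_id) for obj_id, group in index.items() if query_group <= group]
    return [obj_id for _, obj_id in sorted(hits)]
```

For example, with an index built from the descriptions "Treatment of anemia" and "Managing hypertension", the query "blood" points to both objects even though neither description contains that word, which is the behaviour described above for “blood” and “anemia.”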