1. Field of the Invention
The invention relates generally to a system for and method of efficiently searching data corpora, and more particularly, to grouping data corpora comprising multiple information elements represented as identifiable concepts and storing the grouped data in a database from which information may be efficiently extracted in response to a query.
2. Related Art
With the increased usage of the Internet and other electronic networks, there has been an exponential growth in the volume of data that is available to be collected, stored, and analyzed. There are vast volumes of communications throughout the world that may be transmitted wirelessly from satellites and base stations and may also pass through underground and undersea cables of international, foreign, and domestic networks and thus are susceptible to being intercepted, deciphered, and analyzed. These volumes of data may include financial information, stock transactions, business deals, foreign military and diplomatic secrets, legal documents, as well as more mundane, personal data trails, such as credit card transactions, private e-mails, cell phone calls, Google searches, and other digital data that may pass over wireless networks.
Aside from the Internet, there is also the deep web or deepnet—data beyond the reach of the public, all of which may be highly encrypted. This includes password-protected data, U.S. and foreign government encrypted communications, and noncommercial secured file-sharing between known and trusted peers. Various intelligence communities throughout the world have an incentive to collect and store this data where possible, and once these communications are captured, stored, and decrypted, the “data-mining” may begin. This may include searching for target addresses, locations, countries, and phone numbers, as well as watch-listed names, keywords, and phrases, etc., in e-mails. Any communications that arouse suspicion, for example, those to or from targeted entities, may be automatically copied or recorded and collected for further, more thorough searching and analysis.
Most of this data is in the form of machine-readable natural language text, that is, language spoken by people and understood by them in their respective languages, e.g., English, French, Chinese, etc., as opposed to artificial programming languages, such as C++, Java, Visual Basic, etc. Natural language processing (NLP) is that subcategory of Artificial Intelligence (AI) that relates to programming computers to “read” and understand natural language text in the same manner that humans read and understand human language. NLP includes approaches related to information retrieval, machine translation, and language analysis, with the last comprising semantics, parsing, and parts-of-speech tagging.
Conventional searches of large data corpora are generally text-based, i.e., key-word searches. An example of a text-based search is a Boolean search, that is, a term is included in a document or it is not. The use of key-words in combination with Boolean operators or connectors such as AND, OR, NOT, and NEAR may be used to search for documents that contain multiple terms and to exclude documents with certain terms in order to limit, widen, or further define a search. Using basic Boolean operators, a Web searcher can improve his search results, but generally may also retrieve multiple results, many of which are imprecise and may not conceptually match the search topic(s).
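The Boolean matching described above can be sketched as follows. This is a minimal illustration only; the corpus, helper names, and query structure are assumptions for the example and do not reflect any particular search engine:

```python
# Hypothetical sketch of a Boolean key-word search over a small corpus.
# AND is modeled by required terms, NOT by excluded terms.

def terms(doc):
    """Lower-case the document and split it into a set of terms."""
    return set(doc.lower().split())

def matches(doc, required=(), excluded=()):
    """True if the document contains every required term (AND)
    and none of the excluded terms (NOT)."""
    t = terms(doc)
    return all(w in t for w in required) and not any(w in t for w in excluded)

docs = [
    "the plant produced chemicals",
    "the plant grew by the river",
    "stock transactions at the exchange",
]

# Query: plant AND chemicals NOT river
hits = [d for d in docs if matches(d, required=("plant", "chemicals"),
                                   excluded=("river",))]
```

Note that such a match is purely textual: the first two documents use “plant” in entirely different senses, yet the operators alone cannot distinguish them.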
Boolean searches may be improved through the use of various mathematical operations or calculations to improve search results. These calculations and operations may, in general, be either on-page or off-page. Examples of the former include determining the frequency of the search terms in the searched documents and also the location of the term in a document, e.g., its title, a description of the document, its content, etc. Examples of off-page techniques include frequency counts of terms used in prior document searches and ranking documents based on citations by other documents in large databases. Many of these techniques require complex mathematical iterations that are constantly being modified and improved. However, these techniques are primarily suited for search engines that search large databases of documents such as are found on the World Wide Web, and are still essentially text-based searches that do not deal with the problem of determining the meaning of words (i.e., word-sense disambiguation).
Another approach to improving search results has been to use synonym-based searching, that is, finding synonyms of the key-word(s) being searched, i.e., words of the same part of speech (nouns, verbs, adjectives, and adverbs) having the same meaning, i.e., words that are interchangeable with the key-word(s) being searched, and forming a synset with these words. For example, {car; auto; automobile; motorcar} may form a synset because these four words can be used to refer to the same concept. In applications focused on information retrieval, searching using synonyms may give more and better results, and distinguishing the different senses of the keywords is not necessarily required so long as the sense of the keyword in the search query is the same as the sense of the keyword in the retrieved document.
Each of the synsets may be related to other synsets by semantic conceptual relationships. For example, a more general concept (or hypernymy synset) of the {car} synset may be: {motor vehicle; automotive vehicle}, more specific concepts (or hyponymy synsets) may be: {squad car; patrol car; police car} and {taxi; cab; taxicab}, and parts of the whole (or meronymy synset) may be: {gasoline engine; car door; car window; car seat}. By means of these and other semantic/conceptual relations, all word meanings related to a single concept can be interconnected; however, this results in a huge hierarchical network or wordnet.
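The {car} example above can be represented in a small data structure. The structure below is an illustrative assumption for this sketch, not the format of any particular lexical database:

```python
# Illustrative synsets linked by the semantic relations named in the text:
# hypernymy (more general), hyponymy (more specific), meronymy (part-whole).

synsets = {
    "car": {"car", "auto", "automobile", "motorcar"},
    "motor_vehicle": {"motor vehicle", "automotive vehicle"},
    "police_car": {"squad car", "patrol car", "police car"},
    "taxi": {"taxi", "cab", "taxicab"},
}

hypernyms = {"car": "motor_vehicle"}            # more general concept
hyponyms = {"car": ["police_car", "taxi"]}      # more specific concepts
meronyms = {"car": ["gasoline engine", "car door",
                    "car window", "car seat"]}  # parts of the whole

def expand_query(word):
    """Expand a key-word with the members of its synset,
    as a synonym-based search would before matching documents."""
    for members in synsets.values():
        if word in members:
            return members
    return {word}
```

A query for “cab” would thus also retrieve documents mentioning “taxi” or “taxicab”, and following the hypernym link would further generalize the search to motor vehicles.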
Examples of lexical databases that may be used in synonym-based searching are WordNet®, which is a large lexical database of the English language, and EuroWordNet, which is a multilingual lexical database with wordnets for several European languages. Generally, in WordNet® each synset is limited to one particular part of speech, while in other lexical databases different parts of speech may be found in a single synset.
As for word-sense disambiguation (WSD), this refers to the process of identifying which sense of a word (i.e., its meaning) is used in a sentence, when the word is polysemous, i.e., the word has multiple meanings. The importance of WSD is that words that are ambiguous must be given their correct meaning based on the context in which they occur, e.g., their placement in a sentence. For example, the word “plant” may refer to an industrial, chemical, or electrical plant, or a living organism. A third meaning may be the verb “plant,” whose meanings include setting seeds or plants into the ground. This is ambiguity at the lexical level. Ambiguity may also be exhibited at the semantic level (e.g., the headline: stolen painting found by tree) or the pragmatic level (e.g., can you repair the car?).
In order to understand the true nature of a query represented by keywords, the search engine must be able to select the correct sense of a word in a given context where that word is polysemous. In machine translation applications, choosing the wrong meaning of a polysemous word results in wrong translations, and in the case of search engines, the wrong information will be retrieved. Additionally, search engines, to operate more efficiently and accurately, should retrieve documents including related terms, such as “flora” when searching “plant” having the meaning of a living organism.
The process of WSD generally requires two things: a dictionary that specifies the senses that are to be disambiguated (the most commonly-used such dictionary is WordNet®) and a corpus of natural language data that is to be disambiguated. There are numerous techniques from the fields of NLP and machine learning (ML) that may be employed to help improve the accuracy of search engines and translation machines when dealing with WSD. Almost all of these approaches normally work by defining a window of n content words (for example, n=10) around each word to be disambiguated in the corpus, and statistically analyzing those n surrounding words using a lexical database such as WordNet®. A variation on this window, rather than counting words, is to select a syntactic span, such as a sentence or a phrase. In order to provide better search and translation results, NLP and ML techniques in general attempt to identify the parts of a sentence, convert verbs to various tenses, find relationships between words in a given sentence, disambiguate between synonyms and near-synonyms, and extract meaning from context. An NLP search engine would in theory find targeted answers to user questions (as opposed to a keyword search).
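Extracting the window of n content words described above can be sketched as follows. The stop-word list and default window size are illustrative assumptions:

```python
# Minimal sketch: collect the n content words nearest to a target token,
# skipping function words, as the first step of a statistical WSD approach.

STOP_WORDS = {"the", "a", "an", "of", "in", "by", "to", "is"}

def context_window(tokens, index, n=10):
    """Return up to n content words surrounding tokens[index],
    ordered from nearest to farthest."""
    pairs = [(abs(i - index), w) for i, w in enumerate(tokens)
             if i != index and w not in STOP_WORDS]
    pairs.sort(key=lambda p: p[0])
    return [w for _, w in pairs[:n]]

tokens = "the plant by the river is an old chemical plant".split()
window = context_window(tokens, 1, n=3)  # context of the first "plant"
```

The words in the resulting window would then be compared, statistically or via a lexical database, against the candidate senses of the target word.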
Conventional approaches to WSD that include dictionary- and knowledge-based methods may rely primarily on dictionaries, thesauri, and other lexical knowledge bases without using any corpus evidence, or may use a secondary source of knowledge such as a small annotated corpus as seed data in a bootstrapping process. Unsupervised methods are those that do not rely on external information and may work directly from raw unannotated corpora. Unsupervised methods are also known under the name of word sense discrimination.
In general, dictionary-based and knowledge-based approaches require algorithms that find similarities between multiple definitions and a current context, such as the word's position in a semantic network. An example of such an algorithm is the Lesk algorithm, which is based on the hypothesis that words used together in text are related to each other and that the relation can be observed in the definitions of the words and their senses. Two (or more) words are disambiguated by finding the pair of dictionary senses with the greatest word overlap in their dictionary definitions, as may be found in the Oxford English Dictionary, the Merriam-Webster 7th New Collegiate Dictionary, or WordNet®. For example, when disambiguating the words in “pine cone,” the definitions of the two words that both include the words “evergreen” and “tree” (at least in one machine-readable dictionary) are most likely the appropriate senses. Thus, such approaches require algorithms to process multiple definitions in dictionaries for each word in a sentence and then compare all those definitions for the same words appearing in them.
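The overlap computation at the heart of the Lesk algorithm can be sketched as follows. The tiny gloss dictionary is a hypothetical stand-in for a machine-readable dictionary such as WordNet®, with glosses paraphrased for the example:

```python
# Simplified Lesk: score each pair of candidate senses by the number of
# words their dictionary definitions (glosses) share, and pick the best.

GLOSSES = {
    "pine": {
        "tree": "kinds of evergreen tree with needle-shaped leaves",
        "waste": "waste away through sorrow or illness",
    },
    "cone": {
        "solid": "solid body which narrows to a point",
        "fruit": "fruit of certain evergreen trees",
    },
}

def overlap(gloss_a, gloss_b):
    """Number of distinct words shared by two glosses."""
    return len(set(gloss_a.split()) & set(gloss_b.split()))

def lesk(word_a, word_b):
    """Return the pair of senses whose glosses overlap the most."""
    best_score, best_pair = -1, None
    for sense_a, gloss_a in GLOSSES[word_a].items():
        for sense_b, gloss_b in GLOSSES[word_b].items():
            score = overlap(gloss_a, gloss_b)
            if score > best_score:
                best_score, best_pair = score, (sense_a, sense_b)
    return best_pair
```

For “pine cone,” the tree sense of “pine” and the fruit sense of “cone” share “evergreen” (and “of”) in their glosses, so that pair wins, mirroring the example in the text. The quadratic comparison of every sense pair for every word is precisely the laborious processing the passage describes.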
All of these methods of WSD generally require voluminous dictionaries and complicated algorithms and thus are complex and laborious and cannot be improved without considerable effort. Moreover, when certain elements of a method are modified to improve results, the method may then be less accurate when applied to other data corpora. In view of the foregoing, there is an ongoing need for providing systems and methods of creating and maintaining searchable databases that when queried by a user produce results that are precisely and accurately responsive to the user's query.