The present invention relates generally to the field of computer software and, more particularly, to a natural language information retrieval system employing a hash table technique to reduce memory requirements, a proxy process module to improve processing speed on multi-processor computing platforms, and a debugging module that is not shipped along with the natural language information retrieval system.
The number and size of electronic documents increase continually. Any computer user with access to the Internet can search a vast universe of documents addressing every conceivable topic. Computer users may also search many other sources of electronic documents, such as dial-in databases, CD-ROM libraries, files stored on hard drives, files stored on tape drives, files stored on resources connected through an intranet, and the like. Although the available universe of documents may contain a wealth of information on a wide variety of subjects, searching through this universe to identify a small subset of documents that are relevant to a specific inquiry can be a daunting task. In fact, finding a large supply of searchable electronic documents may often be a far easier task than searching the individual documents to find information that is germane to a particular inquiry.
As a result, computer users have a continuing need for effective tools for searching the large and increasing supply of electronic documents. For example, key-word text search engines allow a computer user to identify documents that contain selected key words. More advanced search engines allow the user to further refine search requests by limiting the number of words between key words, automatically searching for variations of key words, specifying searches using Boolean logical operations, and so forth. These conventional key-word text search engines have limited utility, however, because simply searching for the presence of key words using Boolean logical operations often identifies a large number of candidate documents. The user must then examine each candidate document to identify those that are actually germane to the user's inquiry. This type of document-by-document examination can be tedious and time consuming.
Natural language information retrieval (NLIR) systems have been developed to improve over Boolean-logic key-word search engines. Rather than requiring a Boolean key-word search definition, an NLIR system accepts a natural language or "plain English" question. The NLIR system automatically identifies key words in the question and important semantic relationships between the key words. For example, the NLIR system may analyze the question and identify semantic relationships within the question, such as a verb and the subject and/or object of that verb. The NLIR system then searches the universe of documents to identify those documents in which the same key words appear in the same semantic relationships.
These semantic relationships are typically identified by breaking sentences down into semantic relationships, such as logical-form triples (LFTs). An LFT includes two words from a sentence and a qualifier representing the semantic relationship between the words. For example, a user may enter the natural language question, "Do elephants have tusks?" For this question, the noun "elephant" is in a deep subject relationship (qualifier "Dsub") with the verb "have," and the noun "tusks" is in a deep object relationship (qualifier "Dobj") with the verb "have." Thus, the question "Do elephants have tusks?" can be broken down into two LFTs, "elephant-Dsub-have" and "tusk-Dobj-have."
The NLIR system then searches the universe of documents for files containing the same LFTs. For example, the sentence, "African elephants, which have been hunted for decades, have large tusks," also includes the LFTs, "elephant-Dsub-have" and "tusk-Dobj-have." Thus, the NLIR system would identify a document containing this sentence as a document having a high likelihood of containing an answer to the natural language question, "Do elephants have tusks?" This type of semantic-qualified searching can greatly increase the quality of information retrieval. In other words, NLIR techniques can greatly increase the likelihood that a search engine will identify documents that contain an answer to a specific inquiry. NLIR systems that accept natural language rather than Boolean search requests are also easier to use in many situations because computer users are often more familiar with stating inquiries in plain English, as opposed to formulating inquiries in a Boolean-logic format.
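The LFT matching described above can be illustrated with a minimal Python sketch. The `LFT` class and its field names are hypothetical and are not taken from the system described here. The triples are written by hand, whereas an actual NLIR system would obtain them from a natural language parser, and only the two triples named in the text are listed for the example sentence.

```python
from typing import NamedTuple


class LFT(NamedTuple):
    """A logical-form triple: two words joined by a relationship qualifier."""
    word: str
    qualifier: str  # e.g. "Dsub" (deep subject) or "Dobj" (deep object)
    head: str

    def key(self) -> str:
        # Textual form used in the description, e.g. "elephant-Dsub-have".
        return f"{self.word}-{self.qualifier}-{self.head}"


# Hand-written triples for the question "Do elephants have tusks?"
question_lfts = {
    LFT("elephant", "Dsub", "have"),
    LFT("tusk", "Dobj", "have"),
}

# Triples named in the text for the sentence "African elephants, which
# have been hunted for decades, have large tusks." A full parse would
# likely yield additional triples that are omitted here.
sentence_lfts = {
    LFT("elephant", "Dsub", "have"),
    LFT("tusk", "Dobj", "have"),
}

# A document is a candidate answer when it shares LFTs with the question.
shared = question_lfts & sentence_lfts
```

In this sketch both question LFTs appear in the sentence, so `shared` contains both triples.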
Conventional NLIR systems encounter drawbacks, however, because each document in the universe of searchable documents must be analyzed to identify the LFTs present in the document. Performing LFT analysis "on the fly" for a large universe of searchable documents would be prohibitively time consuming. Moreover, the same LFT processing would have to be performed multiple times for the same document. That is, LFTs would have to be identified for the same document for each natural language question processed in connection with that document. For this reason, LFT processing is typically performed only once for a particular document, and the LFTs present in the document are stored in association with the document. Preprocessing a document to identify LFTs and thus make the document amenable to subsequent NLIR analysis is sometimes referred to as "indexing" the document.
Indexing a large number of documents, such as all of the documents present on an electronic database or network, can be very time consuming. Fortunately, powerful techniques have been developed for handling such large-scale data processing tasks. These techniques include, among others, using multi-processor computer systems and multi-tasking operating systems that perform background processing. But conventional NLIR systems are not presently configured to take full advantage of these techniques because conventional NLIR systems rely heavily on global variables that prevent the NLIR system from running multiple processing threads simultaneously. The inability to simultaneously run multiple processing threads typically prevents the NLIR system from operating on more than one processor simultaneously, which undermines a major advantage of conducting the processing on a multi-processor computer system.
In addition, storing a complete set of LFTs for each document for a large number of documents can require a large amount of data storage space. In fact, it is not unusual for a complete set of LFTs to require as much storage space as the document itself. Thus, storing a complete set of LFTs for a large number of indexed documents may require a prohibitively large memory allocation for a storage-space limited program module, such as an electronic encyclopedia sold on CD-ROM. For example, the designers of an electronic encyclopedia program module may not be willing to reduce the number of documents by one-half in order to make the remaining documents amenable to NLIR processing.
In addition, compressing the LFT data to reduce the memory requirement may result in prohibitively slow processing, as each LFT file would have to be uncompressed during question processing.
As a result, the desire to implement NLIR systems in connection with storage-space limited program modules presents a familiar conundrum in software development, in which acceptable processing speed cannot be achieved given acceptable memory requirements. Those techniques presently available for improving processing speed do so at the cost of increased memory requirements, and those techniques available for decreasing memory requirements do so at the cost of decreased processing speed (i.e., increased processing overhead). There is no solution presently available to provide the combination of acceptable processing speed and acceptable memory requirements for certain storage-space limited program modules, such as electronic encyclopedias and the like. For this reason, NLIR processing is not currently feasible in connection with these storage-space limited program modules, which include an important class of applications sold on CD-ROM. In addition, NLIR processing is not presently feasible in relatively large-scale distributed computing environments, such as search engines used in connection with local-area networks, wide-area networks, intranets, the Internet, and so forth.
Thus, there is a need for an NLIR system exhibiting the combination of acceptable processing speed and acceptable memory requirements when implemented in connection with storage-limited program modules, such as a CD-ROM title. More specifically, there is a need for an NLIR system that does not require on-the-fly LFT processing or storage of a complete LFT listing for each document in a universe of searchable documents. There is also a need for an NLIR system for searching relatively large-scale distributed computing environments, such as search engines used in connection with local-area networks, wide-area networks, intranets, the Internet, and so forth. In addition, there is a need for an NLIR system that takes full advantage of powerful processing techniques, including multi-processor computer systems and multi-tasking operating systems.
The present invention meets the needs described above in an NLIR utility that stores LFTs using a hash-table technique that relies on a quasi-random hash value computed for each LFT. During index processing, the NLIR utility computes hash values for each LFT present in a document. The hash value is parsed into an address hash and a signature hash, and each LFT is represented by its signature hash stored in an array at a memory location based on the associated address hash. The NLIR utility uses this technique to create a hash-table fingerprint for each document in a group of searchable documents. Each fingerprint, which includes a representation of the LFTs in the corresponding document, is stored in a relatively small hash-table array.
During question processing, the NLIR utility obtains LFTs for a natural language question on the fly, and computes hash values for the question LFTs using the same formula that was used during index processing. The NLIR utility then compares the hash values for the question LFTs to the hash-table fingerprints stored for each document in the group of searchable documents. A match between a hash value for a question LFT and a hash value found in a searched fingerprint indicates a very high likelihood that the corresponding document contains an LFT matching the question LFT. The NLIR utility assigns a predefined score to each matching LFT based on the type of LFT, and sums the scores to produce a document score for each document. The NLIR utility returns the document scores to a search engine, which displays the documents in a priority order based on the document scores returned by the NLIR utility.
Thus, during index processing, the NLIR utility preprocesses the group of searchable documents to create a hash-table fingerprint for each document without having to store the actual LFTs for each document. Then, during question processing, the NLIR utility performs LFT comparisons directly on the hash-table fingerprints without having to generate the actual LFTs for the searched documents. This use of hash tables gives the NLIR utility the combination of acceptable processing speed and acceptable memory requirements when implemented in connection with a storage-limited program module, such as a CD-ROM title. That is, the NLIR utility does not require on-the-fly LFT processing or storage of a complete LFT listing for each document searched. The resulting NLIR utility may also be used in connection with engines for searching relatively large-scale distributed computing environments, such as search engines used in connection with local-area networks, wide-area networks, intranets, the Internet, and so forth.
To obtain LFTs for a document or for a natural language question, one or more sentences defining LFT queries are passed to a conventional natural language processing (NLP) core code module, which is also referred to as the "base technology." The invention may also include a proxy process module that creates a new process for each client thread that calls the NLIR utility except the first such active thread. In other words, the proxy process module creates a new process for each thread that calls the NLIR utility while the NLP core code module is already processing an active thread. These new processes take single sentences as input and pass them one at a time to the NLP core code module for LFT generation. Because each concurrent LFT query occurs in the context of a separate process, the "non-thread safe" base technology code can run on multiple processors simultaneously.
The invention may also include a debugging module that typically is not shipped to customers on the same CD-ROM as the NLIR module. Although they are not typically shipped together, the debugging module can activate and deactivate a trace document function that resides within the NLIR module. When the trace document function is active, the NLIR module produces a diagnostic report known as a "document trace" for each document processed by the NLIR system. The debugging module can also produce a diagnostic report known as a "question trace" for a particular question after it has been analyzed by the NLIR module. The document trace lists the LFTs created for a corresponding document, and the question trace lists the LFTs created for a corresponding question. Programmers can therefore use the debugging module to inspect the LFT contents of questions and documents without having to allocate space for the debugging module on the CD-ROM containing the NLIR module. The debugging module may assist programmers in analyzing and debugging the NLIR module and the base technology code.
Generally described, the invention includes a client program module, such as a natural language information retrieval module. The invention also includes a utility module, such as a natural language processing core code module, that is configured to provide service functions in response to commands from the client program module. The invention also includes a proxy process module configured for receiving the commands from one or more active client threads associated with the client program module, creating processes for one or more of the active client threads, and passing the command received from each active client thread to the utility module in the context of an associated process. For example, the proxy process module may be configured to receive the commands from one or more active client threads other than the first active client thread, create a process for each client thread other than the first active client thread, and pass the commands received from each active client thread other than the first active client thread to the utility module in the context of an associated process.
The client program module may be stored on a first discrete storage medium, and the invention may include a debugging program module stored on a second discrete storage medium. The debugging program module may include a first interface method for activating a first diagnostic function that, when active, causes the client program to produce a first diagnostic report. The debugging program module may also include a second diagnostic function that, when active, causes the client program to produce a second diagnostic report.
More specifically, the invention provides an NLIR utility configured to implement a method for creating a group of searchable documents, which is also referred to as "index processing." For each document, the NLIR utility receives text defining the document and parses the text into a plurality of text portions, such as sentences. The NLIR utility obtains one or more logical form relationships corresponding to each text portion, typically by passing the text portion to a conventional NLP core code module. Once logical form relationships have been obtained for the entire document, the NLIR utility defines an array having a size corresponding to the number of logical form relationships for the document. The NLIR utility then creates a hash-table fingerprint for the document by computing a hash value for each logical form relationship. For each hash value, the NLIR utility obtains an address hash and a signature hash based on the corresponding hash value and stores the signature hash in the array at a memory location corresponding to the address hash.
The NLIR utility may parse each hash value to obtain the corresponding address hash and signature hash. The NLIR utility may also identify an array index for an array entry point corresponding to the address hash. If the array entry point is empty, the NLIR utility may store the signature hash at the array entry point. Alternatively, if the array entry point is not empty, the NLIR utility may increment the array index of the array entry point until an empty memory location is found and store the signature hash at the empty memory location.
More specifically, the NLIR utility may set the array index for the array entry point to the remainder of the address hash divided by the size of the array. In addition, the NLIR utility may set the size of the array to a predetermined percentage larger than the number of logical form relationships for the document. For example, the predetermined percentage may be 110%, the hash value may be a 32-bit value, the address hash may be the upper 16 bits of the hash value, and the signature hash may be the lower 19 bits of the hash value.
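The index-processing steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation described here: CRC-32 stands in for the unspecified hash formula, the array size is taken to be 110% of the LFT count rounded up, and the index is assumed to wrap to the start of the array when it passes the end. The function names are hypothetical. Note that with a 32-bit hash value, the upper 16 bits and the lower 19 bits overlap by three bits.

```python
import zlib
from typing import List, Optional, Tuple

EMPTY = None  # marker for an unused array slot


def lft_hash(lft: str) -> int:
    """Stand-in 32-bit hash (CRC-32); the actual formula is not specified."""
    return zlib.crc32(lft.encode("utf-8")) & 0xFFFFFFFF


def split_hash(hash_value: int) -> Tuple[int, int]:
    """Return (address hash, signature hash) for a 32-bit hash value."""
    address = (hash_value >> 16) & 0xFFFF  # upper 16 bits
    signature = hash_value & 0x7FFFF       # lower 19 bits
    return address, signature


def build_fingerprint(lfts: List[str]) -> List[Optional[int]]:
    """Create a hash-table fingerprint holding one signature per LFT."""
    # Assumed sizing: 110% of the LFT count, rounded up, so that the
    # array always keeps at least one empty slot.
    size = max(1, (len(lfts) * 110 + 99) // 100)
    table: List[Optional[int]] = [EMPTY] * size
    for lft in lfts:
        address, signature = split_hash(lft_hash(lft))
        index = address % size  # array entry point
        # Increment the index until an empty location is found
        # (wrap-around at the end of the array is an assumption).
        while table[index] is not EMPTY:
            index = (index + 1) % size
        table[index] = signature
    return table


fingerprint = build_fingerprint(["elephant-Dsub-have", "tusk-Dobj-have"])
```

For the two LFTs of the elephant example, the sketch produces a three-slot array in which two slots hold signature hashes and one remains empty.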
The NLIR utility is also configured to respond to a natural language question, which is also referred to as "question processing." During question processing, the NLIR utility receives a natural language question and obtains one or more logical form relationships for the question, typically by passing the question to the NLP core code module. Upon obtaining the question logical form relationships, the NLIR utility computes a hash value corresponding to each logical form relationship for the question. Then, for one or more documents in the group of searchable documents, the NLIR utility compares the hash values corresponding to the logical form relationships for the question to the hash-table fingerprint for the document, and identifies one or more matching hash values.
The NLIR utility may also obtain a score for each matching hash value and, in response, sum the scores to compute a document score for each document connoting the likelihood that the document contains an answer to the natural language question. The NLIR utility may then pass the document scores to a search engine that ranks the documents in order of their respective document scores. The search engine can display a list of highest-ranking documents as a suggestion list of documents that likely contain an answer to the natural language question.
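The scoring and ranking steps can be sketched as follows. The score values, the default score, and the function names are hypothetical, since the text states only that each matching LFT receives a predefined score based on its type. The sketch assumes that the question LFTs found to match each document have already been determined, and that the qualifier of a matching LFT is known from the question side.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (word, qualifier, word)

# Hypothetical predefined scores per LFT type; the actual values are
# not given in the text.
TYPE_SCORES: Dict[str, int] = {"Dsub": 3, "Dobj": 2}
DEFAULT_SCORE = 1


def document_score(matching_lfts: List[Triple]) -> int:
    """Sum the predefined score of every matching question LFT."""
    return sum(TYPE_SCORES.get(qualifier, DEFAULT_SCORE)
               for _, qualifier, _ in matching_lfts)


def rank_documents(matches: Dict[str, List[Triple]]) -> List[Tuple[str, int]]:
    """Order documents from highest to lowest document score."""
    scores = {doc: document_score(lfts) for doc, lfts in matches.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_documents({
    "doc_a": [("elephant", "Dsub", "have"), ("tusk", "Dobj", "have")],
    "doc_b": [("tusk", "Dobj", "have")],
    "doc_c": [],
})
```

With these assumed scores, `doc_a` ranks first with a score of 5, followed by `doc_b` with 2 and `doc_c` with 0.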
During question processing, the NLIR utility may parse a current hash value into a current address hash and a current signature hash. Parsing the hash value means that the NLIR utility may utilize a first subset of the hash value as the address hash and a second subset of the hash value as the signature hash. These subsets may or may not overlap, and may or may not contain all of the digits of the hash value. The NLIR utility may then identify an array entry point in the array corresponding to the current address hash. If the array entry point is not empty, the NLIR utility may identify one or more consecutively-addressed data-containing memory locations beginning with the array entry point.
The NLIR utility then compares the current signature hash to the data value stored at each of the consecutively-addressed data-containing memory locations. If the current signature hash matches the data value stored in any of the consecutively-addressed data-containing memory locations, the NLIR utility identifies the current hash value as a matching hash value. Alternatively, if the array entry point is empty, the NLIR utility may identify the current hash value as a non-matching hash value. In addition, if the current signature hash does not match the data value stored at any of the consecutively-addressed data-containing memory locations, the NLIR utility may identify the current hash value as a non-matching hash value.
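The lookup described above can be sketched in Python as follows. As before, this is a sketch under assumptions rather than the described implementation: CRC-32 stands in for the unspecified hash formula, the bit split follows the example of the upper 16 and lower 19 bits of a 32-bit value, probing is assumed to wrap around at the end of the array, and all function names are hypothetical. A small `build_fingerprint` helper is included only so that the example is self-contained.

```python
import zlib
from typing import List, Optional, Tuple

EMPTY = None  # marker for an unused array slot


def lft_hash(lft: str) -> int:
    """Stand-in 32-bit hash (CRC-32); the actual formula is not specified."""
    return zlib.crc32(lft.encode("utf-8")) & 0xFFFFFFFF


def split_hash(hash_value: int) -> Tuple[int, int]:
    """Upper 16 bits as address hash, lower 19 bits as signature hash."""
    return (hash_value >> 16) & 0xFFFF, hash_value & 0x7FFFF


def build_fingerprint(lfts: List[str]) -> List[Optional[int]]:
    """Index-side helper: store each signature at or after its entry point."""
    size = max(1, (len(lfts) * 110 + 99) // 100)
    table: List[Optional[int]] = [EMPTY] * size
    for lft in lfts:
        address, signature = split_hash(lft_hash(lft))
        index = address % size
        while table[index] is not EMPTY:
            index = (index + 1) % size
        table[index] = signature
    return table


def is_matching(table: List[Optional[int]], hash_value: int) -> bool:
    """Report whether a question hash value matches the fingerprint."""
    size = len(table)
    address, signature = split_hash(hash_value)
    index = address % size  # array entry point
    # Walk the consecutive data-containing locations from the entry point.
    for _ in range(size):   # bounded in case the array has no empty slot
        slot = table[index]
        if slot is EMPTY:
            return False    # empty location reached: non-matching
        if slot == signature:
            return True     # signature found: matching
        index = (index + 1) % size
    return False


fingerprint = build_fingerprint(["elephant-Dsub-have", "tusk-Dobj-have"])
found = is_matching(fingerprint, lft_hash("tusk-Dobj-have"))
```

Because only the signature hash is stored, a match indicates a very high likelihood, rather than a certainty, that the document contains the question LFT.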
The invention also provides an NLIR system that includes an NLIR module configured for creating a group of searchable documents. For each document, the NLIR module receives text defining the document from a search engine and returns a hash-table fingerprint including a representation of logical form relationships for the document to the search engine. In addition, for each document, the NLIR module receives a natural language question and the hash-table fingerprint for the document from the search engine. In response, the NLIR module returns a document score to the search engine connoting the likelihood that the document contains an answer to the natural language question. The NLIR system may also include a search engine configured for ranking the documents in order of their respective document scores. The search engine may also display a list of highest-ranking documents as a suggestion list of documents containing an answer to the natural language question.
According to an aspect of the invention, the NLIR module defines an interface including a first interface method for receiving the text documents from the search engine and returning the hash-table fingerprints to the search engine. The interface defined by the NLIR module also includes a second interface method for receiving a current natural language question and a hash-table fingerprint for a current document from the search engine, and returning a document score to the search engine connoting the likelihood that the current document contains an answer to the natural language question. The interface defined by the NLIR module may also include a third interface method for initiating processing of the natural language question, and a fourth interface method for terminating processing of the natural language question.
According to another aspect of the invention, the NLIR module parses each document into a plurality of sentences and passes each sentence to the NLP core code module. For threads other than the first active thread to pass a sentence to the NLP core code module, the NLIR module passes the sentence to the NLP core code module by way of a proxy process module. This proxy process module creates a process for each NLIR client thread except the first such thread. The proxy process module passes one sentence at a time to the NLP core code module, which identifies one or more logical form relationships corresponding to each sentence and returns the logical form relationships to the NLIR module.
According to yet another aspect of the invention, the NLIR system includes a debugging module that defines an interface that includes a first interface method for activating and deactivating a trace document function that, when active, causes the NLIR system to identify the logical form relationships identified for document text processed by the NLIR system. The interface defined by the debugging module also includes a second interface method for obtaining a diagnostic question trace for an individual question after the question has been processed by the NLIR system.
That the invention improves over the drawbacks of prior natural language information retrieval systems and how it accomplishes the advantages described above will become apparent from the following detailed description of the exemplary embodiments and the appended drawings and claims.