This invention generally relates to database search engines for computer systems. More particularly, this invention relates to concept searching using a Boolean or keyword search engine.
Database search engines permit users to perform queries on a set of documents by submitting search terms. Users must typically submit one or more search terms to the search engine in a format specified by the search engine. Most search engines specify that search terms should be submitted as a Boolean or keyword search query (e.g., "red OR green" or "blue AND black"). Boolean or keyword search queries can become extremely complex as the user adds more search terms and Boolean operators. Moreover, most search engines have complex syntax rules regarding how a Boolean or keyword search query must be constructed. For users to get accurate search results, therefore, they must remember the appropriate syntax rules and apply them in an effective manner. This process can be difficult for many users and, unless mastered, may result in searches which return irrelevant documents.
"Natural language" search engines have been developed which permit users to submit a natural language query to the search engine rather than just keywords. For instance, a user may input the simple natural language sentence "How do I fix my car?" instead of the more complex Boolean search query "how AND to AND fix AND car?" Instead of searching for just the keywords contained in the search query, a typical natural language search engine will extract the concepts implied by the query and search the database for documents referencing the concepts. A natural language search engine will therefore return documents from its database which contain the concepts contained in the search query even if the documents do not contain the exact words in the search query. A natural language search query may be submitted to a Boolean or keyword search engine. However, these types of search engines will only return documents containing the exact words in the search query.
Although natural language search engines provide the benefits of easy-to-understand natural language search queries and concept searching, they are not without their drawbacks. For example, natural language search engines are considerably more expensive to develop than Boolean or keyword search engines. Moreover, natural language search engines can be difficult and expensive to implement, especially where they are used to replace existing Boolean or keyword search engines.
Therefore, there is a need for a method and apparatus for database searching which (1) permits effective searching using a Boolean or keyword search engine with natural language search queries, (2) permits concept searching using a Boolean or keyword search engine, and (3) may be implemented without any modification to the Boolean or keyword search engine.
The present invention satisfies the above-described needs by providing a method and apparatus for concept searching using a Boolean or keyword search engine. Using the method and apparatus of the exemplary embodiment, documents are preprocessed before being passed to the search engine for inclusion in the search engine's database. Search queries are also preprocessed before being passed to the search engine.
With regard to the preprocessing of documents, each document is scanned on a word-by-word basis to identify the "word tokens" contained in the document. Word tokens are actual words or word-like strings such as dates, numbers, etc. Once the word tokens in a document have been extracted, each word token is located in a "concept database" that maps word tokens to concept identifiers. Each word token may map to zero or more concept identifiers.
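The token extraction and concept lookup described above might be sketched as follows. This is an illustration only: the regular expression, the in-memory dictionary standing in for the concept database, and the names `CONCEPT_DB`, `extract_word_tokens`, and `lookup_concepts` are all assumptions and are not taken from the exemplary embodiment.

```python
import re

# Hypothetical stand-in for the concept database: each word token maps to
# zero or more integer concept identifiers. The entries are made up.
CONCEPT_DB = {
    "red": [17],
    "green": [17],
    "car": [42, 43],
    "fix": [58],
}

def extract_word_tokens(text):
    """Return words and word-like strings (numbers, dates) in document order.

    Assumption: tokens are lowercased, and runs of letters/digits joined by
    '/', '-', '.', or an apostrophe (e.g. 12/31/1999) count as one token.
    """
    return re.findall(r"[A-Za-z0-9]+(?:[/\-.'][A-Za-z0-9]+)*", text.lower())

def lookup_concepts(tokens, concept_db):
    """Map each distinct token to its (possibly empty) list of concept identifiers."""
    return {token: concept_db.get(token, []) for token in tokens}

tokens = extract_word_tokens("How do I fix my red car on 12/31/1999?")
concepts = lookup_concepts(tokens, CONCEPT_DB)
```

A token that is absent from the dictionary maps to an empty list, which mirrors the statement that a word token may map to zero concept identifiers.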
Once the concept identifiers associated with each word token have been extracted from the concept database, a consolidated list of concept identifiers is created. Each of the concept identifiers in the list is then converted into a unique non-word concept token which identifies the concept. A concept token is a non-word character string which identifies and is mapped to a concept. For instance, the concept token "Q1A5" may map to the concept of "color." These concept tokens are then arranged into a list.
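One possible way to consolidate concept identifiers and convert them to concept tokens is sketched below. The encoding used here (the prefix "QX" followed by four hexadecimal digits) is purely an assumption chosen so that the result is unlikely to collide with a real word; the text above does not specify how tokens such as "Q1A5" are derived, and the function names are hypothetical.

```python
def consolidate(concept_ids_per_token):
    """Merge per-token identifier lists into one list, dropping duplicates
    while keeping first-seen order."""
    consolidated = []
    for ids in concept_ids_per_token:
        for concept_id in ids:
            if concept_id not in consolidated:
                consolidated.append(concept_id)
    return consolidated

def to_concept_token(concept_id):
    """Encode an integer concept identifier as a non-word string.

    Hypothetical scheme: 'QX' + 4 uppercase hex digits. Each identifier
    below 65536 yields a distinct token.
    """
    return f"QX{concept_id:04X}"

# Identifier lists as they might come back from the concept database
# for four word tokens (the values are made up).
per_token_ids = [[17], [17], [42, 43], []]
concept_ids = consolidate(per_token_ids)
concept_tokens = [to_concept_token(cid) for cid in concept_ids]
```

Because the mapping from identifier to token is deterministic, the same function can be applied to documents and to queries, so that a shared concept always produces the same searchable string.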
Once the list of concept tokens has been created, the tokens are inserted into the document. In an exemplary embodiment, a hypertext markup language ("HTML") META tag is used to insert the concept tokens into the document. Using the HTML META tag, the concept tokens are treated as ordinary text by the search engine and therefore may be searched, but are invisible to the user. The document is then transferred to the server monitored by the search engine. All documents indexed by the search engine are preprocessed in this manner.
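The insertion of concept tokens through an HTML META tag could look roughly like the following sketch. The META tag name `concepts`, the function name `insert_concept_tokens`, and the choice to place the tag immediately after the opening `<head>` tag are assumptions made for illustration; the token values are made up.

```python
import re
from html import escape

def insert_concept_tokens(document, concept_tokens, meta_name="concepts"):
    """Return a copy of an HTML document with the concept tokens added
    as the space-separated content of a META tag."""
    content = escape(" ".join(concept_tokens), quote=True)
    meta_tag = f'<meta name="{meta_name}" content="{content}">'
    # Find the opening <head> tag (with or without attributes), but not <header>.
    match = re.search(r"<head(\s[^>]*)?>", document, flags=re.IGNORECASE)
    if match is None:
        # No <head> element found: fall back to prepending the tag.
        return meta_tag + "\n" + document
    return document[:match.end()] + "\n" + meta_tag + document[match.end():]

page = "<html><head><title>Paint</title></head><body>red car</body></html>"
tagged = insert_concept_tokens(page, ["QX0011", "QX002A"])
```

The visible body of the page is left untouched; only the head gains a tag whose content an engine that indexes META tags can match against.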
With regard to the preprocessing of search queries, an additional component is interposed between the query submitted by the user and the search engine. This component preprocesses the query in much the same way as document preprocessing described above, and then sends a modified query to the search engine.
Queries are preprocessed by first breaking the search terms into word tokens. The word tokens are then referenced in the concept database (the same database used for document preprocessing) and any associated concept identifiers are retrieved. The concept identifiers are then converted to unique concept tokens as described above and are combined into a string with separating spaces. Text is prepended to the string to instruct the search engine to search the contents of all documents' META tags for the tokens. This string constitutes the preprocessed query which is then sent to the search engine.
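The query preprocessing steps above can be sketched end to end as follows. Everything specific here is an assumption: the dictionary standing in for the concept database, the "QX" plus hexadecimal token encoding, and in particular the prefix `@meta_concepts `, which is a placeholder for whatever engine-specific text directs a given search engine to look inside META tags.

```python
import re

# Hypothetical concept database shared with document preprocessing.
CONCEPT_DB = {
    "fix": [58],
    "car": [42, 43],
    "red": [17],
}

# Placeholder for the engine-specific text that restricts the search to
# META tag contents; the real syntax depends on the search engine used.
META_PREFIX = "@meta_concepts "

def preprocess_query(query, concept_db, prefix=META_PREFIX):
    """Turn a natural language query into a concept-token query string."""
    word_tokens = re.findall(r"[a-z0-9]+", query.lower())
    concept_ids = []
    for token in word_tokens:
        for concept_id in concept_db.get(token, []):
            if concept_id not in concept_ids:  # keep each concept once
                concept_ids.append(concept_id)
    # Same hypothetical encoding as assumed for documents: 'QX' + 4 hex digits.
    concept_tokens = [f"QX{cid:04X}" for cid in concept_ids]
    return prefix + " ".join(concept_tokens)

modified_query = preprocess_query("How do I fix my car?", CONCEPT_DB)
```

Words with no entry in the concept database, such as "how" or "my" in this made-up dictionary, contribute nothing to the modified query.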
The unmodified Boolean or keyword search engine then finds all of the documents whose concept tokens most closely match the concept tokens in the modified query. The preprocessing of both documents and queries is transparent to the search engine. Therefore, the exemplary embodiment of the present invention described herein solves all of the above-described problems by using the built-in functionality of the Boolean or keyword search engine, without modifying the engine itself, to search for concepts rather than keywords.
Therefore, it is an object of the present invention to provide a method and apparatus for database searching which permits effective searching using a Boolean or keyword search engine with natural language search queries.
It is also an object of the present invention to provide a method and apparatus for database searching which permits concept searching using a Boolean or keyword search engine.
It is a further object of the present invention to provide a method and apparatus for natural language and concept searching using a Boolean or keyword search engine which may be implemented without any modification to the Boolean or keyword search engine.
That the present invention and the exemplary embodiments thereof overcome the problems and drawbacks set forth above and accomplish the objects of the invention set forth herein will become apparent from the detailed description of exemplary embodiments which follows.