I. Technical Field
The present invention generally relates to the field of computerized systems and document searching. More particularly, the invention relates to computer-based systems and methods that provide flexible text based searching capabilities.
II. Background Information
Due to the increasing volume of information that is stored in computer systems, it is more difficult than ever before to locate relevant information among information that is not relevant. A search for information that is stored in a computer system may be conducted using an information retrieval system that is typically referred to as a search engine. One example of a search engine is a Web search engine, which can search for information on the World Wide Web.
To quickly locate desired information, search engines make use of a search index. A search index provides a shorthand for locating documents when processing a search query. A search index also optimizes speed and performance in finding relevant documents for a search query. The term indexing refers to the process of collecting, parsing, and storing data for a search index. Without an index, the search engine would scan every document during a search, requiring considerable time and computing power. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are the costs paid for the time saved during a search.
The process of creating a search index generally includes tokenization. Tokenization refers to the process of identifying sequences of characters that represent words and other elements, such as punctuation, which are represented by numeric codes. Document tokenization first breaks a stream of characters into keywords based on specifically defined segmentation points such as XML mark-up, spaces, or certain punctuation characters including commas, colons, semicolons, question marks, and periods. The resulting segments from the tokenization process are considered to be keywords. The tokenization process also determines the position of a keyword within a document. The output from the tokenization process is a series of keywords listed with keyword identification information, such as document identifiers and sequentially numbered keyword positions.
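The tokenization step described above can be illustrated with a minimal sketch. The segmentation pattern, the lowercasing of keywords, and the names SEGMENTATION and tokenize are assumptions made for illustration only; they are not taken from any particular search system.

```python
import re

# Assumed segmentation points: XML-style tags, whitespace, and the punctuation
# characters named in the text (commas, colons, semicolons, question marks, periods).
SEGMENTATION = re.compile(r"<[^>]*>|[\s,:;?.]+")


def tokenize(doc_id, text):
    """Return (keyword, doc_id, position) tuples with sequential 1-based positions."""
    # Splitting can yield empty strings next to adjacent delimiters; drop them.
    segments = [s for s in SEGMENTATION.split(text) if s]
    # Lowercasing is an illustrative choice, not something the text requires.
    return [(seg.lower(), doc_id, pos) for pos, seg in enumerate(segments, start=1)]


tokens = tokenize(1, "<p>The right to privacy, as stated.</p>")
for token in tokens:
    print(token)
```

Each output tuple pairs a keyword with a document identifier and a sequentially numbered keyword position, which is the kind of identification information the tokenization process is described as producing.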
After all documents are tokenized, the tokens are sorted by keyword value. The sorting results in a list of keywords, along with the keyword position information for each occurrence of the keyword. The sorted keywords are combined to form a single list of keywords and accompanying identification information for the document collection. This process of combining the tokens from all of the documents in a collection is referred to as merging.
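The sorting and merging steps can be sketched as follows. The token tuples, the function name merge, and the dictionary layout of the merged index are hypothetical and chosen only to make the process concrete.

```python
from collections import defaultdict

# Hypothetical tokenizer output: (keyword, doc_id, position) tuples per document.
doc1_tokens = [("right", 1, 1), ("to", 1, 2), ("privacy", 1, 3)]
doc2_tokens = [("privacy", 2, 1), ("right", 2, 2)]


def merge(*token_lists):
    """Sort all tokens by keyword value and combine them into a single index."""
    # Tuples sort by keyword first, then by document identifier and position.
    all_tokens = sorted(tok for tokens in token_lists for tok in tokens)
    index = defaultdict(list)
    for keyword, doc_id, position in all_tokens:
        index[keyword].append((doc_id, position))
    return dict(index)


merged_index = merge(doc1_tokens, doc2_tokens)
for keyword, occurrences in merged_index.items():
    print(keyword, occurrences)
```

The result is one list of keywords for the whole collection, each accompanied by the document identifier and position of every occurrence.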
Once a merged index is created, a search query must also be tokenized in a manner similar to tokenizing the documents. When a search query is entered into a search system, the search system searches the merged index to complete the search request. A completed search request outputs a list of all the documents, and the respective locations within them, that meet the search request criteria.
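A phrase search against a merged index might look like the following sketch. The sample index, the whitespace-based query tokenization, and the function name phrase_search are assumptions for illustration; a production system would use the same tokenizer for queries and documents.

```python
# Hypothetical merged index: keyword -> list of (doc_id, position) occurrences.
merged_index = {
    "privacy": [(1, 3), (2, 1)],
    "right": [(1, 1), (2, 2)],
    "to": [(1, 2)],
}


def phrase_search(index, query):
    """Return (doc_id, start_position) for each place the query phrase occurs."""
    # Simplified query tokenization: lowercase and split on whitespace.
    keywords = query.lower().split()
    if not keywords:
        return []
    hits = []
    for doc_id, start in index.get(keywords[0], []):
        # Every later keyword must appear at the next consecutive position.
        if all((doc_id, start + offset) in index.get(keyword, [])
               for offset, keyword in enumerate(keywords)):
            hits.append((doc_id, start))
    return hits


results = phrase_search(merged_index, "right to privacy")
print(results)
```

In this sample, only document 1 contains the three keywords at consecutive positions, so it is the only document reported for the phrase.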
Historically, search systems, such as search engines, have categorized certain words that appear in a document as noise words or stop words. Noise words typically occur in most documents. Examples of noise words include “the”, “an”, “of”, “to”, etc. Due to the frequency at which these words occur in documents, noise words may offer limited value in search processing while introducing high processing costs. As a result, conventional text retrieval systems do not index these words. Omitting these noise words in the keyword index reduces the storage and processing requirements for the searching and indexing of a document collection. However, when noise words are not included in a keyword index, a search system is unable to search for certain phrases that include noise words when desired. Accordingly, ignoring noise words may not produce desired results.
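The trade-off described above can be made concrete with a small sketch that builds an index with and without noise words. The noise word list uses the examples given in the text; the sample documents and the names build_index and posting_count are hypothetical.

```python
# Noise words taken from the examples in the text.
NOISE_WORDS = {"the", "an", "of", "to"}


def build_index(docs, skip_noise):
    """Build keyword -> [(doc_id, position)], optionally omitting noise words."""
    index = {}
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split(), start=1):
            if skip_noise and word in NOISE_WORDS:
                continue  # Conventional behavior: noise words are not indexed.
            index.setdefault(word, []).append((doc_id, position))
    return index


def posting_count(index):
    """Count the total number of keyword occurrences stored in the index."""
    return sum(len(occurrences) for occurrences in index.values())


# Hypothetical two-document collection.
docs = {1: "the right to privacy", 2: "an invasion of the home"}
full = build_index(docs, skip_noise=False)
reduced = build_index(docs, skip_noise=True)
print(posting_count(full), posting_count(reduced))
print("to" in full, "to" in reduced)
```

The reduced index stores fewer occurrences, but the keyword "to" is no longer present, so a phrase that depends on it cannot be matched word for word.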
Words that are generally considered noise words might be important to the effectiveness of one phrase search, while having little value in another phrase search. For example, “me” and “too” are typically defined as noise words. In a conventional search system, the phrase “me too hungry” would not be searchable. Instead, the query would find all occurrences of the word “hungry.” Conversely, a search index that indexes all noise words might also lead to unintended search results. For example, the phrase “right to privacy” may be used interchangeably with the phrase “right of privacy.” In this example, indexing the noise word “of” would prevent a search for “right to privacy” from returning documents that contain “right of privacy.” As is evident from the foregoing, there is a need for improved systems and methods for searching that take noise words into consideration in order to provide more effective search results.
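The two failure modes discussed above can be shown side by side in a simplified sketch. It compares word sequences directly rather than using an index, and the noise word list, sample documents, and function names are assumptions for illustration only.

```python
# Assumed noise word list for illustration.
NOISE_WORDS = {"me", "too", "to", "of"}


def words(text):
    """Simplified tokenization: lowercase and split on whitespace."""
    return text.lower().split()


def contains_sequence(haystack, needle):
    """Return True if needle occurs as a contiguous run inside haystack."""
    return any(haystack[i:i + len(needle)] == needle
               for i in range(len(haystack) - len(needle) + 1))


def matches_ignoring_noise(query, document):
    """Conventional behavior: noise words are dropped from query and document."""
    q = [w for w in words(query) if w not in NOISE_WORDS]
    d = [w for w in words(document) if w not in NOISE_WORDS]
    return contains_sequence(d, q)


def matches_exactly(query, document):
    """All words indexed: every word of the phrase must match in sequence."""
    return contains_sequence(words(document), words(query))


# Dropping noise words reduces "me too hungry" to "hungry".
print(matches_ignoring_noise("me too hungry", "a hungry cat"))
# Indexing every word keeps "right to privacy" from matching "right of privacy".
print(matches_exactly("right to privacy", "a right of privacy"))
```

Neither fixed policy gives the desired result for both phrases, which is the gap the final sentence of the passage points to.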