A search engine is a computer program or a set of programs used to index information and search for indexed information. In a conventional search engine, a tokenizer converts an input string (such as a file name) into discrete tokens that are indexed by the search engine. A tokenizer is a software tool programmed to perform tokenization. Tokenization refers to the process of breaking up a string of text into words, phrases, symbols, or other meaningful elements called tokens. A token, in this context, refers to a word or an atomic data element that cannot be parsed any further. Creating such tokens is not an exact science and relies on heuristics. The tokens are then used in a search index for full-text searching.
For example, if Adam names a document “my-text-doc”, a tokenizer would typically convert the input string “my-text-doc” into three tokens (also referred to as data elements in this disclosure): “my”, “text”, and “doc”. The document “my-text-doc” is then indexed by the search engine using these tokens. A search string can be tokenized in a similar way. Thus, when Bob searches for the document “my-text-doc” using the search string “my text doc”, the search string is tokenized into the same three tokens: “my”, “text”, and “doc”. The search engine searches its file name index using these tokens, locates the document “my-text-doc”, which is indexed under the same tokens, and returns the correct document.
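By way of illustration only, the indexing and search flow described above can be sketched as follows. The sketch assumes a tokenizer that splits on any run of non-alphanumeric characters and lowercases each token, and a search that returns the file names containing every query token. The function names `tokenize`, `build_index`, and `search`, and the second file name `budget-2020`, are hypothetical and do not describe any particular search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split on runs of non-alphanumeric characters and lowercase each token.

    Assumption: this delimiter rule is one common heuristic, not the only one.
    """
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def build_index(file_names):
    """Map each token to the set of file names that produced it."""
    index = defaultdict(set)
    for name in file_names:
        for token in tokenize(name):
            index[token].add(name)
    return index


def search(index, query):
    """Return the file names indexed under every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical file names used only for this illustration.
index = build_index(["my-text-doc", "budget-2020"])
print(tokenize("my-text-doc"))       # ['my', 'text', 'doc']
print(search(index, "my text doc"))  # {'my-text-doc'}
```

Because the file name and the search string reduce to the same three tokens, the lookup succeeds even though the two strings use different separators.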
However, this tokenizer approach cannot handle many of the variations that may occur in a string. For example, if Adam names the document “MYtext.doc”, it is not clear whether there should be one token “mytext” or two tokens “my” and “text”. If the document is tokenized one way and Bob searches the other way, the search may fail.
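The ambiguity can be made concrete with two equally plausible heuristics. In the following sketch, which is an assumption-laden illustration rather than the behavior of any specific product, `tokenize_on_delimiters` splits only on non-alphanumeric characters, while `tokenize_on_case_change` additionally splits where a run of upper-case letters meets a run of lower-case letters. Both function names are hypothetical.

```python
import re


def tokenize_on_delimiters(text):
    """Heuristic A (assumed): split only on non-alphanumeric characters."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def tokenize_on_case_change(text):
    """Heuristic B (assumed): also split between upper-case, lower-case,
    and digit runs within each delimiter-separated part."""
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", text):
        tokens.extend(re.findall(r"[A-Z]+|[a-z]+|[0-9]+", part))
    return [t.lower() for t in tokens]


file_name = "MYtext.doc"
print(tokenize_on_delimiters(file_name))   # ['mytext', 'doc']
print(tokenize_on_case_change(file_name))  # ['my', 'text', 'doc']

# Index with heuristic A, then search for "my text": the tokens do not match.
indexed = set(tokenize_on_delimiters(file_name))
query_tokens = tokenize_on_delimiters("my text")
print(all(t in indexed for t in query_tokens))  # False
```

Neither heuristic is wrong in itself; the search fails only because the index and the searcher disagree about where the token boundaries lie.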
Further, the tokenizer approach produces only a single tokenized representation of a file name and may also strip characters that a user might include in a search. Thus, if Adam names the document “MY-text-doc”, and the tokenizer outputs three tokens for indexing (“my”, “text”, and “doc”), then a search for “my-text” or “-text” may fail because neither search string yields the same three tokens.
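A minimal sketch of this failure mode follows. It assumes the same delimiter-based tokenizer as a stand-in for a conventional one, and it assumes a deliberately strict matching policy in which a query matches only when its token list equals the indexed token list. Real engines may match more loosely, so the policy and the names `tokenize` and `matches_exactly` are hypothetical.

```python
import re


def tokenize(text):
    """Assumed heuristic: split on non-alphanumeric runs and lowercase."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


# The single tokenized representation stored for the file name.
indexed_tokens = tokenize("MY-text-doc")
print(indexed_tokens)  # ['my', 'text', 'doc']

# The hyphen does not survive tokenization, so it cannot be searched for.
print("-" in "".join(indexed_tokens))  # False


def matches_exactly(query):
    """Assumed strict policy: query tokens must equal the indexed tokens."""
    return tokenize(query) == indexed_tokens


for query in ("my text doc", "my-text", "-text"):
    print(query, tokenize(query), matches_exactly(query))
```

Under these assumptions only “my text doc” matches. The queries “my-text” and “-text” reduce to the token lists `['my', 'text']` and `['text']`, and the leading hyphen in “-text” is discarded before the index is ever consulted.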
Given the deficiencies in conventional search engines, there is room for innovations and improvements.