Although there are a large number of websites on the Internet or World Wide Web (WWW), users are often interested only in information on specific web pages from some websites. Given the sheer size of the Internet, it has become increasingly important to tailor searches and product recommendations to the user's personal interests/preferences. Asking the user to specify his/her personal interests/preferences is burdensome for both the user and the entity seeking the information. In addition, those interests/preferences may change over time. It is therefore generally considered more convenient to automatically discover user interests/preferences from the web pages the user visits. To enable such a tool, information contained in web pages visited by the user needs to be identified and extracted.
This extraction may involve identifying movie titles, entity names (e.g., sports teams), or other information in web or text documents. On the surface, the problem of matching movie titles (or other information) appears to be a string matching problem. In practice, however, it has special characteristics, and the standard string matching algorithms do not work well.
There are several well-known algorithms for string matching. The Knuth-Morris-Pratt and Boyer-Moore algorithms match a single string pattern against an input string. The Aho-Corasick and Set-wise Boyer-Moore algorithms match multiple patterns. When applied to human language text, these algorithms usually work at the character level. That is, the basic elements that form the patterns and the input text are considered to be characters. This means that the alphabet (the set of elements) is relatively small. In the case of English, there are twenty-six letters, plus a few other characters (e.g., space and apostrophe). Some of the algorithms take advantage of the small size of the English alphabet by pre-computing the actions for each element. For example, the Boyer-Moore algorithm pre-computes, for each element in the alphabet, the number of characters in the input text that can be skipped on a mismatch and stores these skip values in a table. However, these algorithms are limited to use with a specific language. In addition, a large number of comparisons is needed to effectively extract the desired information, and the algorithms are therefore generally considered to be inefficient for these purposes.
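To make the pre-computed skip table concrete, the following is a minimal sketch of character-level matching. It is an assumption-laden illustration rather than any particular implementation: it uses the Horspool simplification of Boyer-Moore (only the bad-character rule, keyed on the last character of the current window), and the names `build_skip_table` and `horspool_search` are hypothetical.

```python
def build_skip_table(pattern):
    """Pre-compute, per character, how far the search window may shift.

    Horspool simplification (an assumption of this sketch): only the
    bad-character rule is used. Characters absent from the table get
    the default shift of len(pattern) in the search loop.
    """
    m = len(pattern)
    # The last pattern character is excluded; for repeated characters the
    # rightmost occurrence wins, which yields the smallest (safe) shift.
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def horspool_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    skip = build_skip_table(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the skip value of the character aligned with the end
        # of the window, or by the full pattern length if it is unknown.
        pos += skip.get(text[pos + m - 1], m)
    return hits


if __name__ == "__main__":
    print(build_skip_table("matrix"))
    print(horspool_search("the matrix and the matrix reloaded", "matrix"))
```

Note that the table is keyed by individual characters, so its size is bounded by the alphabet of the text being searched. This is the property that the character-level algorithms above rely on.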