As used herein, the term “delimiter” refers to one or more characters that are used to specify a boundary between separate, independent words occurring in a string of characters. In some character strings, no delimiters are used to specify a boundary between words occurring within the string. Such non-delimited character strings are commonly used in Internet domain names and computer filenames. An example of an Internet domain name that includes a non-delimited character string is “www.digitalcamerareview.com”. In this domain name, the non-delimited character string “digitalcamerareview” includes the separate, independent words “digital,” “camera,” and “review.” An example of a computer filename that includes a non-delimited character string is “catinthehat.gif”. In this filename, the non-delimited character string “catinthehat” includes the separate, independent words “cat,” “in,” “the,” and “hat.” Each word identified within a non-delimited character string may have independent meaning. Furthermore, identified words taken together may have meaning, in which case they form a phrase.
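By way of illustration only, the identification of words within the two example strings above can be sketched in Python as follows. The sketch assumes a small hypothetical vocabulary (VOCABULARY) and a hypothetical function name (segment), neither of which appears in this description. It uses a simple dictionary lookup with dynamic programming that prefers the segmentation with the fewest words, and it is not intended to represent the system or method referred to herein.

```python
from typing import List, Optional, Set

# Hypothetical toy vocabulary, assumed for illustration only.
# A practical system would need a far larger lexicon.
VOCABULARY: Set[str] = {"digital", "camera", "review", "cat", "in", "the", "hat"}


def segment(text: str, vocabulary: Set[str] = VOCABULARY) -> Optional[List[str]]:
    """Split a non-delimited string into vocabulary words.

    Returns the segmentation with the fewest words, or None when the
    string cannot be covered entirely by words in the vocabulary.
    """
    # best[i] holds the shortest word list covering text[:i], or None.
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is None or word not in vocabulary:
                continue
            candidate = prefix + [word]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[len(text)]


if __name__ == "__main__":
    print(segment("digitalcamerareview"))  # ['digital', 'camera', 'review']
    print(segment("catinthehat"))          # ['cat', 'in', 'the', 'hat']
```

A string containing any run of characters that is absent from the assumed vocabulary yields None in this sketch, which illustrates why accurate identification of words depends heavily on the word list that is available.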
A non-delimited character string that forms a part of an Internet domain name may include words or phrases that provide valuable clues about accessible subject matter within the corresponding Internet domain. If such words and phrases could be accurately identified, they could be used to improve the performance of Internet search engines or other systems that match keywords or other information submitted by a user to domains on the World Wide Web. Likewise, a non-delimited character string that forms a part of a computer filename may include words or phrases that provide valuable clues about the information contained in or represented by a file identified by the filename. If such words and phrases could be accurately identified, they could be used to improve the performance of search engines, desktop search tools, or other systems that match keywords or other information submitted by a user to computer files.
What is needed, then, is a system and method for tokenizing character strings, including but not limited to non-delimited character strings of the type commonly used in Internet domain names and computer filenames, to accurately identify words and phrases occurring therein.