The present invention relates generally to transactions over a network or internetwork such as the Internet, or a private network (an intranet) and more particularly to a method for resolving an incorrectly entered uniform resource locator (URL).
It is said that the World Wide Web (the “Web”) provides access via the Internet to a large number—some say on the order of 10^10—of Web sites and documents. Whatever the actual number, it is large and increasing. In the Web environment, client machines effect transactions to documents, e.g., Web sites on Web servers, using the Hypertext Transfer Protocol (HTTP), which is a known application protocol providing users access to files, e.g., text, graphics, images, sound, video, etc., using a standard page description language known as Hypertext Markup Language (HTML) or the later Extensible Markup Language (XML). HTML and XML provide basic document formatting and allow the developer to specify “links” to other servers and files. In the Internet paradigm, a network path to a document or file on a server is identified by a so-called Uniform Resource Locator (URL) having a special syntax for defining a network location. In this description, the term URL includes just the address of a computer in the network, e.g., a domain name. Use of an HTML/XML-compatible browser (e.g., Opera Browser, Netscape Navigator, or Microsoft Internet Explorer) at a client machine involves specification of a link via the URL. In response, the client makes a request to the server where the Web site identified in the link resides (or a duplicate thereof stored elsewhere) and, if the URL is correct, i.e., the request is uniquely resolved, receives in return a document or other object in a display format specified in the HTML or XML of the document specified by the URL.
Typically, a user specifies a given URL manually by typing the desired character string in an address field of the browser. Existing browsers provide some assistance in this regard. For example, modern browsers store URLs that have been previously accessed from the browser during a given time period. When the user begins entering a URL, the browser performs a “type-ahead” function while the various characters comprising the string are being entered. For example, if the given URL is “http://www.inventek.com” (and that URL is present in the URL list), the browser parses the initial keystrokes against the stored URL list and provides a visual indication to the user of a “candidate” URL that the browser considers to be a “match.” Thus, as the user is entering the URL he or she desires to access, the browser may “look ahead” and pull a matching candidate URL from the stored list. If the candidate URL is a match, the user need not complete entry of the fully resolved URL; rather, he or she simply actuates the “enter” key and the browser is launched to the site.
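The type-ahead behavior described above can be sketched as a simple prefix match against a stored history list. The history entries and typed prefix below are illustrative assumptions, not the behavior of any particular browser product.

```python
def type_ahead(prefix, history):
    """Return the first stored URL that begins with the typed prefix, or None."""
    for url in history:
        if url.startswith(prefix):
            return url
    return None

# Hypothetical stored URL list accumulated from prior browsing.
history = ["http://www.inventek.com", "http://www.example.org/docs"]

print(type_ahead("http://www.inv", history))      # candidate found in the list
print(type_ahead("http://www.unknown", history))  # no candidate; user must finish typing
```

As the passage notes, this approach fails outright when the target URL is absent from the saved list, since a prefix match can only ever return a previously stored string.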
URL resolution through this “look ahead” approach has provided some benefits, but the technique is unsatisfactory because the target URL may not be on the saved list.
Alternatively, a portion of the target URL (e.g., the second level domain name) may be saved in the list but the typing error may be in a particular directory or file name toward the end of the long string of characters. In either case, the user is forced to enter a long character string, only to find that the string cannot be meaningfully resolved (by a network naming service or a particular Web server, as the case may be). If the URL includes an error, a “server not found” error message or the like is returned to the user.
The resolution of the URL occurs at the routers and name servers that are at various locations in the Internet (including the user's own network)—or at various locations in the private network in the case of a private network—and that maintain tables of Web addresses. A router is a device that forwards data packets along networks based on addresses in the packet header. A Domain Name Server (DNS) is a program that translates domain names, which typically are part of a typed URL, into IP addresses. Routers and DNSs maintain tables of addresses that provide for resolving a URL.
By a source URL we mean an entered URL, e.g., a possibly incorrectly typed URL. By a valid URL we mean a URL that exists in the network. By the target URL we mean the valid URL of the source URL when correctly entered.
Note that the term URL as used herein includes part of a complete URL specifying a file on a server. Thus, for example, the phrase “a possibly incorrectly entered URL” may mean “a possibly incorrectly entered domain name.”
Some techniques have been invented for resolving an incorrectly entered URL. U.S. Pat. No. 6,092,100 to Berstis, et al., titled “METHOD FOR INTELLIGENTLY RESOLVING ENTRY OF AN INCORRECT UNIFORM RESOURCE LOCATOR (URL)” describes a method wherein if a given URL is entered incorrectly at a Web client, a fuzzy URL detection scheme automatically performs a fuzzy search that returns a list of URLs that most closely match what was originally entered into the browser address field. If the fuzzy search does not reveal a match, the browser may contact a server dedicated to performing a broader fuzzy search. In another alternative, the browser contacts a Web server and the fuzzy search is implemented at the Web server in order to return a particular file. The fuzzy search of the unresolved URL is performed against entries of a lexicon stored as an address table that includes candidate URLs, each URL indexed by a set of N adjacent letters that appear in the URL, together with a ranking of how frequently the N adjacent letters appear. N=2 is provided as an example. For each pair of letters, the entry includes a set of at least one of the URLs in the lexicon having the given character pair. The lexicon is based on a history of recently encountered URLs. The search method considers a typed URL or portion thereof and, within it, sets of N adjacent letters, e.g., pairs of adjacent letters, and generates a frequency table of how often each set of letters appears in the typed URL. That table is compared (ANDed) with the lexicon table generated from the history. The results are ranked to provide a list of likely URLs.
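The N=2 (bigram) comparison described above can be sketched roughly as follows. This is a simplified illustration of the general technique of ranking candidates by shared adjacent-letter pairs, not the actual Berstis, et al. implementation; the lexicon and the typed string are invented for the example.

```python
from collections import Counter

def bigrams(s):
    """Frequency table of adjacent-letter pairs (N = 2) in a string."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def rank_candidates(typed, lexicon):
    """Rank lexicon URLs by the number of bigrams shared with the typed string.

    The Counter intersection (&) plays the role of the "AND" of the typed
    frequency table with the lexicon's table described in the passage.
    """
    typed_bg = bigrams(typed)
    scored = sorted(
        ((sum((typed_bg & bigrams(url)).values()), url) for url in lexicon),
        reverse=True,
    )
    return [url for score, url in scored]

# Hypothetical lexicon of recently encountered URLs.
lexicon = ["inventek.com", "invent.org", "example.com"]
print(rank_candidates("inventec.com", lexicon))  # best match listed first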
U.S. Pat. No. 6,092,100 to Berstis, et al. is incorporated herein by reference.
The Berstis, et al. method illustrates some problems that exist with much of the prior art. First, the fuzzy search works on letter combinations. There are some typing errors that would never be caught this way. Consider for example a URL devoted to the mathematician Tschebyscheff, who is famous for Tschebyscheff polynomials. This name is commonly also spelled as Chebychev, Chebyshev, Chebysheff, and so forth. Similarly, consider for example a Web site devoted to the Russian composer Tschaikovsky. This name also is commonly spelled many different ways, and all these different spellings refer to the same object, but have different letter combinations. Similarly, consider the popular donut Krispy Kreme®. There is a Web site http://www.krispykreme.com/ dedicated to this brand. A search based on the way the URL sounds is needed to resolve such a URL. For other URLs, e.g., those involving numbers, the numerical closeness of the number rather than letter combinations is likely to lead to the correct answer.
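One well-known sound-based measure is the classic Soundex code, sketched below. It unifies some of the variant spellings above, but—as the example also shows—it fails when the variants begin with different letters (Tschebyscheff vs. Chebyshev), which is exactly why more elaborate phonetic measures are called for. The words used are taken from the passage; the code is a textbook Soundex, not anything from the cited patent.

```python
def soundex(word):
    """Classic four-character Soundex code, e.g. soundex("Chebyshev") == "C121"."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # 'h'/'w' do not separate letters with the same code
            prev = code
    return (first + "".join(digits) + "000")[:4]

# Variant spellings with the same initial letter collapse to one code...
print(soundex("Chebyshev"), soundex("Chebychev"), soundex("Chebysheff"))
# ...but a variant starting with a different letter gets a different code.
print(soundex("Tschebyscheff"))
```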
Similarly, some URLs may include sets of “glyphs” that are actually images instead of pure symbols.
Thus, what is needed is a search method that uses different measures of closeness of URLs adapted to different types of URLs and different parts of URLs. Such methods should be able to resolve URLs or URL parts that sound the same, URLs or URL parts that are misspelled based on letter transpositions, as is common in spelling mistakes, URLs or URL parts that are misspelled based on numerical closeness, e.g., URL parts that include numbers, and so forth.
Another problem with the Berstis, et al. method and much of the prior art is that, to be practical, the prior art methods need to search some relatively finite index or table of possible URLs. A search typically involves forming a signature of the typed URL or part, such as a hash of the URL or part, and then searching a table of hashes of all URLs. A fuzzy search leads to inexact matches, and this in turn involves some concept of closeness or ranking of closeness. Known measures of closeness, e.g., the Berstis, et al. measure of numbers of matching sets of consecutive letters, and other distance measures for closeness of typed strings, are typically discrete, e.g., integer-valued measures. Using such measures, it is only practical to carry out a small number of comparisons/closeness determinations. Moreover, hashing typically destroys any “closeness relationship” between strings or numbers, so it is not typically usable for fuzzy searches where closeness of strings is important. Hashing can typically be used only for exact matching of a hashed string or a number. Using exact matching on a hierarchy of substrings may require a prohibitively long time, e.g., a time that varies exponentially with the lengths of the strings that are compared. Thus the Berstis, et al. method considers only tables of recently accessed Web sites. The inventors assert that it is not practical in real time to conduct such a fuzzy search against all possible URLs. There are said to be on the order of 10^10 URLs in existence. Whatever the actual number, it is clear that it is large and likely to increase as more and more pages are accessible over the Web.
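The two points above—that hashing destroys closeness, while string-distance measures are integer-valued—can be illustrated as follows. The strings are invented for the example; the edit distance shown is the standard Levenshtein measure, used here only as a representative discrete measure.

```python
import hashlib

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (an integer-valued measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

a, b = "inventek.com", "inventec.com"  # differ by a single character
# The hashes share no structure, so hash lookup can only find exact matches...
print(hashlib.md5(a.encode()).hexdigest())
print(hashlib.md5(b.encode()).hexdigest())
# ...while the edit distance reflects how close the strings are, but only in
# coarse integer steps.
print(levenshtein(a, b))  # -> 1
```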
Thus, there also is a need in the art for a practical method for determining an appropriate signature of a typed URL and fuzzily searching such a signature against a very large number of URLs or their signatures.
When string comparisons are used, closeness typically is measured by some integer measure of closeness. Integer measures of closeness do not lend themselves well to many mathematical techniques that have evolved over the years to make fuzzy searches more practical. Thus there is a need in the art for a fuzzy search method for finding a valid URL—or a valid URL part—based on measures of closeness that are not necessarily integer-valued, e.g., that can be computed using floating point arithmetic.
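A minimal example of a non-integer closeness measure is the floating-point similarity ratio provided by Python's standard difflib module, shown below. This is offered only as an illustration of a real-valued measure in [0.0, 1.0]; it is not the measure proposed by the present invention, and the URLs compared are invented.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Floating-point closeness in [0.0, 1.0]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

# A one-character typo yields a similarity close to, but below, 1.0,
# while an unrelated URL scores much lower.
print(similarity("inventek.com", "inventek.com"))
print(similarity("inventek.com", "inventec.com"))
print(similarity("inventek.com", "example.org"))
```

Because such a measure varies continuously rather than in integer steps, it can be fed into numerical and floating-point techniques of the kind the passage alludes to.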