There are multiple services available for searching registered and unregistered domain names. Among these services is a protocol called “WhoIs,” which has existed since the formation of the Internet. The WhoIs protocol queries databases for registrant or assignee information associated with registered Internet domain names in a top level domain (TLD). Initially, the Internet contained only a few TLDs such as .com, .net, .org, .gov, and .edu. As the Internet has expanded, however, many new TLDs have been added, including .cc, .tv, .jobs, and many others. TLDs are organized into registries, such as a registry containing the .com and .net TLDs, and another registry containing the .cc, .tv, and .jobs TLDs. These registries are used by WhoIs search services and utilities.
Current WhoIs searching services, however, are substantially restricted. For example, current services only search a single registry per search request. One reason for this restriction is that information from multiple registries is not traditionally accessible in combination. Registry data file sizes are also restrictive. Some registries contain an extremely large amount of data, and current WhoIs search techniques would take too long to index and search the data of multiple registries. The .com/.net registry alone, among the largest of the domain name registries, contains over 120 million domains. This results in over 700 million entities in the registry database, comprising domains, nameservers, and their associated attributes and relationships, all of which must be processed and indexed for searching. Furthermore, to keep registry databases current, changes such as additions or deletions are made to the registries constantly, owing to the dynamic nature of the Internet. Such modifications must also be indexed in near real-time to keep search results accurate and up-to-date. Full re-indexing is periodically required due to data corruption, failures, design changes, deployment changes, and unforeseen scenarios. As such, full indexing must be attainable within a few hours so that the index can catch up with the incremental changes that accumulate while the full indexing is completed. This helps maintain the search utility's accuracy and efficiency. Due to the sheer volume of data processing and indexing required to manage even a single registry, and the highly dynamic nature of the domain name data, searching across multiple registries simultaneously using current indexing methods is not practical.
Generally, indexing is utilized for documents which are more static in nature, and longer periods of time are allowed to fully index the information or updates. In traditional indexing methods, content is extracted from documents and tokenized into words based primarily on whitespace recognition, stop words are removed, and stemming is sometimes carried out before the data is added to the index. Matching arbitrary substrings within terms is typically not of much importance. Search results are ordered by decreasing relevance scores based on term frequency (occurrences of the term in the document) and inverse document frequency (rarity of the term across the set of documents being searched). Indexing such large quantities of data is difficult when the data is updated almost continuously. These traditional methods are not practical for indexing domain names.
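The traditional document pipeline described above can be sketched as follows. This is a minimal illustration, not any particular search engine's implementation: the stop-word list, the function names `tokenize` and `tf_idf_scores`, and the plain `tf * log(N / df)` weighting are all assumptions chosen for brevity, and stemming is omitted.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, drop stop words."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def tf_idf_scores(term, documents):
    """Return (document index, score) pairs ordered by decreasing relevance."""
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    if doc_freq == 0:
        return []
    # Inverse document frequency: rarer terms receive a higher weight.
    idf = math.log(len(tokenized) / doc_freq)
    scores = []
    for index, tokens in enumerate(tokenized):
        tf = Counter(tokens)[term]  # occurrences of the term in this document
        if tf:
            scores.append((index, tf * idf))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = [
    "The domain registry stores domain records.",
    "A registry of names.",
    "Search the domain index.",
]
ranking = tf_idf_scores("domain", docs)
```

Here the first document ranks above the third because the term occurs in it twice. The approach depends on whitespace-delimited words and on repeated occurrences of a term, neither of which domain names reliably provide.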
In contrast to documents, domain names are sets of concatenated words and numbers that are only rarely delimited by dashes. As such, indexing domain names is more complex than the mere whitespace recognition performed in document indexing. Recognition of the separate words in domain names usually requires tokenization. However, tokenization of every domain name into words for indexing relies on computation-intensive dynamic programming algorithms combined with statistical techniques. This is prone to a certain degree of inaccuracy, partly because it is based on probabilistic techniques, and partly because of the inherent ambiguity present in human languages. Tokenization success depends on the use of large language corpora. This method of indexing domain name data may take several days for the .com/.net registry alone, unless heavily parallelized across many computers. Even then, tokenization may actually end up decreasing the accuracy of arbitrary substring searches.
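To illustrate the kind of dynamic programming tokenization referred to above, the following is a minimal sketch of unigram-based word segmentation. The function name `segment`, the `max_word_len` parameter, and the tiny probability tables are hypothetical; a real system would draw its probabilities from a large language corpus.

```python
import math

# Hypothetical unigram probabilities standing in for a large corpus.
WORD_PROBS = {"choose": 0.004, "chooses": 0.001, "spain": 0.002, "pain": 0.003}


def segment(name, word_probs, max_word_len=20):
    """Return the most probable split of `name` into known words, or None."""
    n = len(name)
    # best[i] holds (log-probability, words) for the best split of name[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = name[start:end]
            prev_score, prev_words = best[start]
            if prev_words is None or word not in word_probs:
                continue
            score = prev_score + math.log(word_probs[word])
            if score > best[end][0]:
                best[end] = (score, prev_words + [word])
    return best[n][1]


first = segment("choosespain", WORD_PROBS)

# A different (equally hypothetical) corpus flips the answer.
OTHER_PROBS = {"choose": 0.001, "chooses": 0.004, "spain": 0.001, "pain": 0.003}
second = segment("choosespain", OTHER_PROBS)
```

With the first table the name splits as "choose" + "spain"; with the second it splits as "chooses" + "pain". The result depends entirely on the statistics supplied, which reflects the ambiguity noted above, and the nested loop must run for every one of the millions of names in a registry.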
Furthermore, domain names are short compared to documents like web pages or word processing files. A meaningful text search for domain names must be able to support arbitrary substring matches. To be useful, however, a search utility must also order those results by relevance. Unlike in document searches, term frequency and inverse document frequency are less useful relevance indicators in domain name searching, because many domain names would typically have only one or no occurrences of a given search term.
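The weakness of term frequency as a relevance signal for domain names can be seen in a small sketch. The function name `substring_term_frequencies` and the sample names are illustrative assumptions; the count is taken over the leftmost label only, so the TLD is ignored.

```python
def substring_term_frequencies(term, domain_names):
    """Count occurrences of `term` as a substring of each name's leftmost label."""
    return {name: name.split(".")[0].count(term) for name in domain_names}


names = ["tom.com", "atom.net", "tomato.org", "example.org"]
frequencies = substring_term_frequencies("tom", names)
```

Every matching name in this sample has a frequency of exactly one, so term frequency alone cannot distinguish "tom.com" from "atom.net" or "tomato.org" when ordering results.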
As a result of the restrictions discussed above, current WhoIs services offer few interactive search features for users. Search capabilities are less comprehensive due in part to the lack of interactive search features. Traditional search utilities only search for domain names that exactly match the search query, such as a search for “tom” returning either “tom.com” or “no results.” Normally, results of WhoIs searches show general substring matches, leading-only matches, or hyphen-separated regex matches. Wildcard searches denoted by “*,” such as “*tom,” may return additional results such as “atom.com,” but the results are not listed in any order of relevancy or ranked by TLD.
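The difference between exact-match and wildcard behavior can be sketched with Python's standard `fnmatch` module. The helper names `exact_match` and `wildcard_match` and the sample list are assumptions made for illustration; matching is applied to the leftmost label of each name.

```python
from fnmatch import fnmatchcase


def exact_match(query, domain_names):
    """Return names whose leftmost label equals the query exactly."""
    return [n for n in domain_names if n.split(".")[0] == query]


def wildcard_match(pattern, domain_names):
    """Return names whose leftmost label matches a '*' wildcard pattern."""
    return [n for n in domain_names if fnmatchcase(n.split(".")[0], pattern)]


names = ["tomato.com", "atom.com", "tom.com", "tom.net"]
exact = exact_match("tom", names)
wild = wildcard_match("*tom", names)
```

The wildcard results come back in whatever order the underlying list happens to hold them, with no relevance ranking and no grouping by TLD, which is the limitation described above.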
In view of the above, a better way is needed to index domain names to support a more usable WhoIs searching utility that can search multiple TLD registries simultaneously. A unified interactive interface can be a powerful adjunct that will allow focused searches as well as cross-registry visibility.
The disclosed embodiments are directed to overcoming one or more of the problems set forth above, and to providing improved WhoIs search techniques.