A network is a collection of links and nodes (e.g., multiple computers and/or other devices connected together) arranged so that information may be passed from one part of the network to another over multiple links and through various nodes. Examples of networks include the Internet, the public switched telephone network, the global Telex network, computer networks (e.g., an intranet, an extranet, a local-area network, or a wide-area network), wired networks, and wireless networks.
The Internet is a worldwide network of computers and computer networks arranged to allow the easy and robust exchange of information between computer users. Hundreds of millions of people around the world have access to computers connected to the Internet via Internet Service Providers (ISPs). Content providers place multimedia information (e.g., text, graphics, audio, video, animation, and other forms of data) at specific locations on the Internet referred to as web pages. Websites comprise a collection of connected, or otherwise related, web pages. The combination of all the websites and their corresponding web pages on the Internet is generally known as the World Wide Web (WWW) or simply the Web.
Some Internet users, typically those that are larger and more sophisticated, may provide their own hardware, software, and connections to the Internet. But many Internet users either do not have the resources available or do not want to create and maintain the infrastructure necessary to host their own websites. To assist such individuals (or entities), hosting companies exist that offer website hosting services. These hosting service providers typically provide the hardware, software, and electronic communication means necessary to connect multiple websites to the Internet. A single hosting service provider may literally host thousands of websites on one or more hosting servers.
Browsers are able to locate specific websites because each website, resource, and computer on the Internet has a unique Internet Protocol (IP) address. Presently, there are two standards for IP addresses. The older IP address standard, often called IP Version 4 (IPv4), is a 32-bit binary number, which is typically shown in dotted decimal notation, where four 8-bit bytes are separated by a dot from each other (e.g., 64.202.167.32). The notation is used to improve human readability. The newer IP address standard, often called IP Version 6 (IPv6) or Next Generation Internet Protocol (IPng), is a 128-bit binary number. The standard human readable notation for IPv6 addresses presents the address as eight 16-bit hexadecimal words, each separated by a colon (e.g., 2EDC:BA98:0332:0000:CF8A:000C:2154:7313).
IP addresses, however, even in human readable notation, are difficult for people to remember and use. A Uniform Resource Locator (URL) is much easier to remember and may be used to point to any computer, directory, or file on the Internet. A browser is able to access a website on the Internet through the use of a URL. The URL may include a Hypertext Transfer Protocol (HTTP) request combined with the website's Internet address, also known as the website's domain name. An example of a URL with a HTTP request and domain name is: http://www.companyname.com. In this example, the “http” identifies the URL as a HTTP request and the “companyname.com” is the domain name.
Domain names are easier to remember and use than their corresponding IP addresses. The Internet Corporation for Assigned Names and Numbers (ICANN) approves some Generic Top-Level Domains (gTLD) and delegates the responsibility to a particular organization (a “registry”) for maintaining an authoritative source for the registered domain names within a TLD and their corresponding IP addresses. For certain TLDs (e.g., .biz, .info, .name, and .org) the registry is also the authoritative source for contact information related to the domain name and is referred to as a “thick” registry. For other TLDs (e.g., .com and .net) only the domain name, registrar identification, and name server information is stored within the registry, and a registrar is the authoritative source for the contact information related to the domain name. Such registries are referred to as “thin” registries. Most gTLDs are organized through a central domain name Shared Registration System (SRS) based on their TLD.
The process for registering a domain name with .com, .net, .org, and some other TLDs allows an Internet user to use an ICANN-accredited registrar to register their domain name. For example, if an Internet user, John Doe, wishes to register the domain name “mycompany.com,” John Doe may initially determine whether the desired domain name is available by contacting a domain name registrar. The Internet user may make this contact using the registrar's webpage and typing the desired domain name into a field on the registrar's webpage created for this purpose. Upon receiving the request from the Internet user, the registrar may ascertain whether “mycompany.com” has already been registered by checking the SRS database associated with the TLD of the domain name. The results of the search then may be displayed on the webpage to thereby notify the Internet user of the availability of the domain name. If the domain name is available, the Internet user may proceed with the registration process. If the domain name is not available for registration, the Internet user may keep selecting alternative domain names until an available domain name is found.
For Internet users and businesses alike, the Internet continues to be increasingly valuable. More people use the Web for everyday tasks, from social networking, shopping, banking, and paying bills to consuming media and entertainment. E-commerce is growing, with businesses delivering more services and content across the Internet, communicating and collaborating online, and inventing new ways to connect with each other. Frequently, e-commerce websites are hosted at domain names containing one or more words that are relevant to the services offered on the website, such as “autoinsurancedeals.com” for automobile insurance services. It would be advantageous to automatically identify one or more categories of a business based on words within a domain name at which the business's website is hosted, in order to offer website, e-commerce, and other business services that are relevant to the business.
It would be further advantageous to categorize all or a subset of the websites on the Internet according to a standardized categorization scheme. Categorizing the websites on the Internet generates data that can be analyzed to determine usage trends of the Internet across substantially any industry and any geographic region. One known method of categorizing websites is directed to a particular set of TLDs and uses the North American Industry Classification System (NAICS) to classify business websites according to the type of economic activity. However, this method only classifies websites to the 3-digit code level of specificity. The 3-digit NAICS code specifies the largest business sector and subsector of the business, but leaves out the remaining 1-3 digits of the NAICS code that include the industry group and NAICS and national industry identifiers. Furthermore, the method requires manual entry of the NAICS code or identifiable business sector information for each website. Due to the large and increasing number of websites, currently over 100 million, manual categorization of even ten percent of existing websites could take several years to complete. An automated and more specific categorization method is needed.