Generally speaking, a global computer network, e.g., the Internet, is formed of a plurality of computers coupled to a communication line for communicating with each other. Each computer is referred to as a network node. Some nodes serve as information bearing sites while other nodes provide connectivity between end users and the information bearing sites.
The explosive growth of the Internet makes it an essential component of every business, organization and institution strategy, and leads to massive amounts of information being placed in the public domain for people to read and explore. The type of information available ranges from information about companies and their products, services, activities, people and partners, to information about conferences, seminars, and exhibitions, to news sites, to information about universities, schools, colleges, museums and hospitals, to information about government organizations, their purpose, activities and people. The Internet has become the venue of choice for every organization for providing pertinent, detailed and timely information about themselves, their cause, services and activities.
The Internet essentially is the network infrastructure that connects geographically dispersed computer systems. Every such computer system may contain publicly available (shareable) data that are available to users connected to this network. However, until the early 1990s there was no uniform way or standard conventions for accessing this data. The users had to use a variety of techniques to connect to remote computers (e.g., telnet, ftp, etc.) using passwords that were usually site-specific, and they had to know the exact directory and file name that contained the information they were looking for.
The World Wide Web (WWW or simply Web) was created in an effort to simplify and facilitate access to publicly available information from computer systems connected to the Internet. A set of conventions and standards were developed that enabled users to access every Web site (computer system connected to the Web) in the same uniform way, without the need to use special passwords or techniques. In addition, Web browsers became available that let users navigate easily through Web sites by simply clicking hyperlinks (words or sentences connected to some Web resource).
Today the Web contains more than one billion pages that are interconnected with each other and reside in computers all over the world (thus the term "World Wide Web"). The sheer size and explosive growth of the Web have created the need for tools and methods that can automatically search, index, access, extract and recombine information and knowledge that is publicly available from Web resources.
As used herein, the following terms have the indicated definitions.
Web Domain
Web domain is an Internet address that provides connection to a Web server (a computer system connected to the Internet that allows remote access to some of its contents).
URL
URL stands for Uniform Resource Locator. Generally, URLs have three parts: the first part describes the protocol used to access the content pointed to by the URL, the second contains the domain and directory in which the content is located, and the third contains the file that stores the content:
<protocol>:<domain><directory><file>
For example:
http://www.corex.com/bios.html
http://www.cardscan.com/index.html
http://fn.cnn.com/archives/may99/pr37.html
ftp://shiva.lin.com/soft/words.zip
Commonly, the <protocol> part may be missing. In that case, modern Web browsers access the URL as if the http:// prefix was used. In addition, the <file> part may be missing. In that case, the convention calls for the file "index.html" to be fetched.
For example, the following are legal variations of the previous example URLs:
www.corex.com/bios.html
www.cardscan.com
fn.cnn.com/archives/may99/pr37.html
ftp://shiva.lin.com/soft/words.zip
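By way of illustration only, the two defaulting conventions described above (a missing protocol is treated as http://, and a missing file is treated as index.html) may be sketched as follows. The function name normalize_url and its parameters are hypothetical and are not part of the disclosure; the sketch also assumes that a path without a trailing slash names a file, which real servers do not guarantee.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url, default_scheme="http", default_file="index.html"):
    """Fill in a missing protocol and a missing file part of a URL."""
    # A URL written without "<protocol>://" is treated as if http:// was used.
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    parts = urlsplit(url)
    path = parts.path
    # Assumption: an empty path, or one ending in "/", has no file part.
    if path == "" or path.endswith("/"):
        path = (path or "/") + default_file
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    for u in ("www.corex.com/bios.html", "www.cardscan.com",
              "ftp://shiva.lin.com/soft/words.zip"):
        print(u, "->", normalize_url(u))
```

Under these assumptions, www.cardscan.com expands to http://www.cardscan.com/index.html, while a URL that already carries both a protocol and a file is returned unchanged.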
Web Page
Web page is the content associated with a URL. In its simplest form, this content is static text, which is stored in a text file indicated by the URL. However, very often the content contains multi-media elements (e.g., images, audio, video, etc.) as well as non-static text or other elements (e.g., news tickers, frames, scripts, streaming graphics, etc.). Very often, more than one file forms a Web page; however, there is only one file that is associated with the URL and which initiates or guides the Web page generation.
Web Browser
Web browser is a software program that allows users to access the content stored in Web sites. Modern Web browsers can also create content "on the fly", according to instructions received from a Web site. This concept is commonly referred to as "dynamic page generation". In addition, browsers can commonly send information back to the Web site, thus enabling two-way communication between the user and the Web site.
There are many different types of Web sites, based on the type of content they publish, their purpose, or the type of owner (e.g., company, government, educational institution, etc.). Identifying the type of a Web site is important for computer programs that traverse, index, or extract information from Web sites (e.g., search engines, Web data mining applications, etc.). When the site type is known, these programs can selectively visit only the "useful" parts of the site while skipping other parts, or even the whole site (e.g., Internet robots that search for company or people information may skip porn sites completely). In addition, the type of Web site is needed for estimating how frequently its content changes: news sites may change their content daily, organization sites less frequently, and personal sites (owned by individuals) even less frequently. Internet robots can implement appropriate schedules for visiting a site based on this estimate.
Furthermore, identifying the site type is very helpful in deducing the structure of the site. Broad categories of sites share the same meta-structure. For example, company sites usually have the following sections:
"About" section, with general information and description of the company
"Contact" section, with contact information
"Products/Services" section
"News" section, with press releases and news articles relevant to the company
"Employment opportunities" section, with a list of current job openings in the company
whereas news sites usually include the following sections:
Current news
Local news
World news
Archives (archived news)
Business section (with business news)
Technology section (with technology news)
When the site type is identified, then this general meta-structure provides the blueprint for the expected actual site structure. This blueprint is a significant aid to Web software robots and data extraction tools that visit and extract information from Web sites.
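As a minimal sketch of how such a blueprint might be consulted, assuming it is held as a simple lookup table, the two example meta-structures above can be encoded as shown below. The names META_STRUCTURE and sections_to_visit are hypothetical, and only the Company and News types listed above are filled in.

```python
# Hypothetical blueprint table: site type -> sections the site is expected to have.
META_STRUCTURE = {
    "Company": ["About", "Contact", "Products/Services", "News",
                "Employment opportunities"],
    "News": ["Current news", "Local news", "World news", "Archives",
             "Business", "Technology"],
}


def sections_to_visit(site_type, wanted):
    """Return the expected sections of site_type that the robot wants,
    in blueprint order. Unknown site types yield an empty list."""
    expected = META_STRUCTURE.get(site_type, [])
    return [section for section in expected if section in wanted]


if __name__ == "__main__":
    # A robot gathering company and people information might want only these.
    print(sections_to_visit("Company", {"About", "Contact"}))
```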
The purpose of this invention is to automatically classify a Web site into an appropriate type. The potential types may vary, depending on the purpose of the classification. For example, when the purpose of classification is to determine visiting frequency for an Internet robot, then the set of potential types will be based on how frequently the site changes its contents, and may be the following:
{Daily, Weekly, Monthly, Bimonthly, Quarterly, Semiannually, Annually}
On the other hand, if the purpose of classification is to guide Internet robots into visiting certain sections of the site while avoiding others, then the set of potential site types may include the following:
{Company, News, Portal, Government, Hospital, University, Military, Personal}
This invention describes the general mechanism for classifying among any given set of potential types.
Examples of applications that benefit directly from automatic Web site classification are Inventions 5 and 6 as disclosed in the related Provisional Application No. 60/221,750 filed on Jul. 31, 2000 for a "Computer Database Method and Apparatus".
A preferred embodiment is a software program formed of a preparation phase, a training phase and a classification phase. During the preparation phase, the user defines the set of Web site types that the invention must recognize, and prepares tests that provide evidence about one or more of these types. During the training phase, the user runs all the tests on a set of Web sites with known site types. Then, the results of the tests are used to calculate statistical conditional probabilities of the form P(Test result|Hypothesis), i.e., the probability that a particular test result will appear for a particular test, given a particular hypothesis. The resulting table of probabilities can then be used for classification. During the classification phase, the invention program runs the tests prepared in the preparation phase on a subject Web site with unknown site type and collects the test results. Then, the invention software combines the test results using the probabilities from the training phase and calculates a confidence level for each of the potential site types identified during the preparation phase. Finally, the meta-structure of the site is derived based on the most probable site type.
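The training and classification phases may be sketched as follows. The text above does not specify how the test results are combined, so this sketch assumes a naive Bayes combination (tests treated as conditionally independent given the site type) with additive smoothing; both choices, and the names train and classify, are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train(samples, smoothing=1.0):
    """samples: list of (site_type, {test_name: result}) for sites of known type.
    Returns (priors, cond), where cond[(test, site_type)][result] estimates
    P(Test result | Hypothesis). smoothing must be positive."""
    type_counts = Counter(site_type for site_type, _ in samples)
    counts = defaultdict(Counter)   # (test, site_type) -> result frequencies
    values = defaultdict(set)       # test -> results observed in training
    for site_type, results in samples:
        for test, result in results.items():
            counts[(test, site_type)][result] += 1
            values[test].add(result)
    total = sum(type_counts.values())
    priors = {t: n / total for t, n in type_counts.items()}
    cond = {}
    for test, vals in values.items():
        for t in type_counts:
            c = counts[(test, t)]
            denom = sum(c.values()) + smoothing * len(vals)
            cond[(test, t)] = {v: (c[v] + smoothing) / denom for v in vals}
    return priors, cond


def classify(results, priors, cond):
    """Return {site_type: confidence}; the confidences sum to 1."""
    log_scores = {}
    for t, prior in priors.items():
        score = math.log(prior)
        for test, result in results.items():
            p = cond.get((test, t), {}).get(result)
            if p is not None:       # ignore tests or results unseen in training
                score += math.log(p)
        log_scores[t] = score
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    weights = {t: math.exp(s - top) for t, s in log_scores.items()}
    z = sum(weights.values())
    return {t: w / z for t, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical training data with two made-up boolean tests.
    training = [
        ("News", {"has_archive_link": True, "has_jobs_link": False}),
        ("News", {"has_archive_link": True, "has_jobs_link": False}),
        ("Company", {"has_archive_link": False, "has_jobs_link": True}),
        ("Company", {"has_archive_link": False, "has_jobs_link": True}),
    ]
    priors, cond = train(training)
    print(classify({"has_archive_link": True, "has_jobs_link": False},
                   priors, cond))
```

The site type with the highest confidence would then select the meta-structure. Working in log space avoids underflow when many tests are combined.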
In the preferred embodiment, potential site types include
News provider (e.g. on-line News, magazine, newspaper, newsletter, etc)
Specialized information provider (e.g. weather, traffic, movies, etc)
Company, for-profit organization
Educational institution (e.g. School, University, College, etc)
Medical organization (e.g. Hospital, Clinic, Health center, etc)
Law firm
Religious organization, church
Non-profit organization
Professional association
Political organization
City level local government
State level government
Government organization
Military
Retail, catalog
Portal, directory, search
Fan club of sports, music stars, movie stars
Sport team
Conference, symposium, workshop
Travel agency, airline
Sex
ISP (Internet Service Provider)
Gaming, sports, outdoors
Personal
Hotel, resort
Entertainment (theater, restaurant, bar, club, etc)
On-line entertainment (puzzles, jokes, chat rooms, on-line games, etc)
Reference (dictionaries, thesaurus, yellow pages, places, quotes, etc)
Job listings, classifieds
Event (festival, celebration, etc)
The tests employed in the preferred embodiment examine one or more of the following:
Text in the site's hyperlinks
Keywords in the site's URLs
Keywords in page titles
Keywords provided through the HTML <META> tag at the home page
Number of external links
Number of internal links
Distribution of internal and external links among pages
Vocabulary used in different parts of the site
Morphology of the site "tree" (number of levels, number of pages on each level, etc.)
Morphology of the site's text content (number of headers, paragraphs, lists, tables, sentence length, format, etc.)
Distribution of multimedia elements in the site (pictures, audio, video, graphics, etc)
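A few of the listed tests (hyperlink text, page title, <META> keywords, and counts of internal and external links) could draw on raw measurements gathered from a single page, as in the following sketch. The class name PageFeatures and the rule that a link is internal when its host equals the page's host are assumptions for illustration; the tests of the preferred embodiment are not specified at this level of detail.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class PageFeatures(HTMLParser):
    """Collects raw measurements from one page that tests could examine."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlsplit(page_url).netloc.lower()
        self.internal_links = 0
        self.external_links = 0
        self.link_text = []
        self.meta_keywords = []
        self.title = ""
        self._in_a = False
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)   # tag and attribute names arrive lowercased
        if tag == "a" and attrs.get("href"):
            target = urlsplit(urljoin(self.page_url, attrs["href"]))
            # Only http(s) targets are counted; mailto: and similar are ignored.
            if target.scheme in ("http", "https"):
                if target.netloc.lower() == self.host:
                    self.internal_links += 1
                else:
                    self.external_links += 1
            self._in_a = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.meta_keywords += [k.strip().lower()
                                   for k in content.split(",") if k.strip()]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_a and data.strip():
            self.link_text.append(data.strip())
        elif self._in_title:
            self.title += data


if __name__ == "__main__":
    # Hypothetical page content and URL, for demonstration only.
    html = ('<html><head><title>Acme Corp</title>'
            '<META NAME="keywords" CONTENT="Widgets, Gears"></head>'
            '<body><a href="/about.html">About us</a> '
            '<a href="http://www.example.org/">Partner</a></body></html>')
    page = PageFeatures("http://www.acme.example/index.html")
    page.feed(html)
    print(page.title, page.meta_keywords, page.link_text,
          page.internal_links, page.external_links)
```

Each measurement would still have to be turned into a discrete test result (for example, whether the hyperlink text contains a given keyword) before conditional probabilities can be estimated for it.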