Generally speaking, a global computer network, e.g., the Internet, is formed of a plurality of computers coupled to a communication line for communicating with each other. Each computer is referred to as a network node. Some nodes serve as information bearing sites while other nodes provide connectivity between end users and the information bearing sites.
The explosive growth of the Internet has made it an essential component of the strategy of every business, organization and institution, and has led to massive amounts of information being placed in the public domain for people to read and explore. The type of information available ranges from information about companies and their products, services, activities, people and partners, to information about conferences, seminars, and exhibitions, to news sites, to information about universities, schools, colleges, museums and hospitals, to information about government organizations, their purpose, activities and people. The Internet has become the venue of choice for every organization for providing pertinent, detailed and timely information about itself, its cause, services and activities.
The Internet is essentially nothing more than the network infrastructure that connects geographically dispersed computer systems. Every such computer system may contain publicly available (shareable) data that are available to users connected to this network. However, until the early 1990s there was no uniform way and no standard convention for accessing this data. Users had to use a variety of techniques to connect to remote computers (e.g., telnet, ftp, etc.) using passwords that were usually site-specific, and they had to know the exact directory and file name that contained the information they were looking for.
The World Wide Web (WWW or simply Web) was created in an effort to simplify and facilitate access to publicly available information from computer systems connected to the Internet. A set of conventions and standards was developed that enabled users to access every Web site (computer system connected to the Web) in the same uniform way, without the need to use special passwords or techniques. In addition, Web browsers became available that let users navigate easily through Web sites by simply clicking hyperlinks (words or sentences connected to some Web resource).
Today the Web contains more than one billion pages that are interconnected with each other and reside in computers all over the world (thus the term “World Wide Web”). The sheer size and explosive growth of the Web have created the need for tools and methods that can automatically search, index, access, extract and recombine information and knowledge that is publicly available from Web resources.
The following definitions of commonly used terms are used herein.
Web Domain
Web domain is an Internet address that provides connection to a Web server (a computer system connected to the Internet that allows remote access to some of its contents).
URL
URL stands for Uniform Resource Locator. Generally, URLs have three parts: the first part describes the protocol used to access the content pointed to by the URL, the second contains the domain and the directory in which the content is located, and the third contains the file that stores the content:
<protocol>://<domain><directory><file>
For example:
    http://www.corex.com/bios.html
    http://www.cardscan.com/index.html
    http://fn.cnn.com/archives/may99/pr37.html
    ftp://shiva.lin.com/soft/words.zip
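The decomposition of a URL into its parts can be illustrated with a short sketch. This is only an illustration under stated assumptions: the helper name split_url is hypothetical, it relies on Python's standard urllib.parse module, and it expects the protocol part to be present in the URL.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into (protocol, domain, directory, file).

    Hypothetical helper for illustration only; it assumes the URL
    includes its protocol part (e.g. "http://").
    """
    parts = urlsplit(url)
    # Everything up to and including the last "/" of the path is taken
    # as the directory; whatever follows it is taken as the file name.
    directory, _, filename = parts.path.rpartition("/")
    return parts.scheme, parts.netloc, directory + "/", filename


if __name__ == "__main__":
    for example in (
        "http://www.corex.com/bios.html",
        "http://fn.cnn.com/archives/may99/pr37.html",
        "ftp://shiva.lin.com/soft/words.zip",
    ):
        print(split_url(example))
```

Keeping the trailing slash on the directory means that concatenating the domain, directory and file parts reproduces the original address after the protocol prefix.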
Commonly, the <protocol> part may be missing. In that case, modern Web browsers access the URL as if the http:// prefix were used. In addition, the <file> part may be missing. In that case, the convention calls for the file “index.html” to be fetched.
For example, the following are legal variations of the previous example URLs:
    www.corex.com/bios.html
    www.cardscan.com
    fn.cnn.com/archives/may99/pr37.html
    ftp://shiva.lin.com/soft/words.zip
Web Page
Web page is the content associated with a URL. In its simplest form, this content is static text, which is stored in a text file indicated by the URL. However, very often the content contains multi-media elements (e.g., images, audio, video, etc.) as well as non-static text or other elements (e.g., news tickers, frames, scripts, streaming graphics, etc.). Very often, more than one file forms a Web page; however, only one file is associated with the URL, and that file initiates or guides the Web page generation.
Web Browser
Web browser is a software program that allows users to access the content stored in Web sites. Modern Web browsers can also create content “on the fly”, according to instructions received from a Web site. This concept is commonly referred to as “dynamic page generation”. In addition, browsers can commonly send information back to the Web site, thus enabling two-way communication between the user and the Web site.
Every Web site publishes its content packaged in one or more Web pages. Typically, a Web page contains a combination of text and multimedia elements (audio, video, pictures, graphics, etc.) and has a relatively small and finite size. There are of course exceptions, most notably pages that contain streaming media, which may appear to have “infinite” size, and dynamic pages that are produced “on the fly”. However, even in those cases, there is some basic HTML code that forms the infrastructure of the page, and which may dynamically download or produce its contents on the fly.
In general, it is more useful to identify the contents of “static” pages, which are less likely to change over time, and which can be downloaded into local storage for further processing. When the contents of a page are known, special data extraction tools can be used to detect and extract relevant pieces of information. For example, a page identified as containing contact information may be passed to an address extraction tool; pages that contain press releases may be given to search engines that index news; and so on. Furthermore, automatically identifying the content type may be useful in “filtering” applications, which filter out unwanted pages (e.g., porn filters). Simple filters used today work mostly on the basis of keyword searches. The current invention, however, uses a much more sophisticated and generic technique, which combines several test outcomes and their statistical probabilities to produce a list of potential content types, each one assigned a specific confidence level.
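One way to picture how several test outcomes and their probabilities might be combined into a ranked list of content types is the sketch below. It is not the technique of the invention, which is not detailed in this section; it substitutes a simple naive Bayes style combination as a stand-in. The test names, content types and all probability values are invented for illustration.

```python
import math

# Assumed prior probability of each content type (illustrative values).
PRIORS = {"contact": 0.2, "press_release": 0.3, "other": 0.5}

# Assumed probability that each test fires on a page of a given type
# (illustrative values; the test names are hypothetical).
LIKELIHOODS = {
    "has_phone_number": {"contact": 0.9, "press_release": 0.4, "other": 0.1},
    "has_dateline": {"contact": 0.05, "press_release": 0.8, "other": 0.1},
}


def rank_content_types(outcomes):
    """Return (content_type, confidence) pairs, most likely type first.

    `outcomes` maps a test name to True (test fired) or False.
    Tests are treated as independent given the content type.
    """
    log_scores = {}
    for ctype, prior in PRIORS.items():
        log_p = math.log(prior)
        for test, fired in outcomes.items():
            p = LIKELIHOODS[test][ctype]
            log_p += math.log(p if fired else 1.0 - p)
        log_scores[ctype] = log_p
    # Normalize so that the confidence levels sum to 1.
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(weights.values())
    ranked = [(c, w / total) for c, w in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    page_tests = {"has_phone_number": True, "has_dateline": False}
    for ctype, confidence in rank_content_types(page_tests):
        print(f"{ctype}: {confidence:.3f}")
```

Unlike a keyword filter, which gives a single yes or no answer, a combination of this kind yields every candidate content type together with a confidence level, so a downstream tool can choose its own threshold.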
There are several applications that can significantly benefit from automatic Web page content identification; for example, see Inventions 4, 5 and 6 as disclosed in the related Provisional Application No. 60/221,750 filed on Jul. 31, 2000 for a “Computer Database Method and Apparatus”.