Generally speaking a global computer network, e.g., the Internet, is formed of a plurality of computers coupled to a communication line for communicating with each other. Each computer is referred to as a network node. Some nodes serve as information bearing sites while other nodes provide connectivity between end users and the information bearing sites.
The explosive growth of the Internet makes it an essential component of every business, organization and institution strategy, and leads to massive amounts of information being placed in the public domain for people to read and explore. The type of information available ranges from information about companies and their products, services, activities, people and partners, to information about conferences, seminars, and exhibitions, to news sites, to information about universities, schools, colleges, museums and hospitals, to information about government organizations, their purpose, activities and people. The Internet became the venue of choice for every organization for providing pertinent, detailed and timely information about themselves, their cause, services and activities.
The Internet essentially is nothing more than the network infrastructure that connects geographically dispersed computer systems. Every such computer system may contain publicly available (shareable) data that are available to users connected to this network. However, until the early 1990's there was no uniform way or standard conventions for accessing this data. The users had to use a variety of techniques to connect to remote computers (e.g. telnet, ftp, etc) using passwords that were usually site-specific, and they had to know the exact directory and file name that contained the information they were looking for.
The World Wide Web (WWW or simply Web) was created in an effort to simplify and facilitate access to publicly available information from computer systems connected to the Internet. A set of conventions and standards were developed that enabled users to access every Web site (computer system connected to the Web) in the same uniform way, without the need to use special passwords or techniques. In addition, Web browsers became available that let users navigate easily through Web sites by simply clicking hyperlinks (words or sentences connected to some Web resource).
Today the Web contains more than one billion pages that are interconnected with each other and reside in computers all over the world (thus the term “World Wide Web”). The sheer size and explosive growth of the Web has created the need for tools and methods that can automatically search, index, access, extract and recombine information and knowledge that is publicly available from Web resources.
The following definitions are used herein.
Web Domain
Web domain is an Internet address that provides connection to a Web server (a computer system connected to the Internet that allows remote access to some of its contents).
URL
URL stands for Uniform Resource Locator. Generally, URLs have three parts: the first part describes the protocol used to access the content pointed to by the URL, the second contains the directory in which the content is located, and the third contains the file that stores the content:
<protocol>: <domain> <directory> <file>
where “protocol” may be the type http, “domain” is a domain name of the directory in which a file so named is located.
Commonly, the <protocol> part may be missing. In that case, modern Web browsers access the URL as if the http:// prefix was used. In addition, the <file> part may be missing. In that case, the convention calls for the file “index.html” to be fetched.
For example, the following are legal variations of URLs:                www.corex.com/bios.html        www.cardscan.com        fn.cnn.com/archives/may99/pr37.htmlWeb Page        
Web page is the content associated with a URL. In its simplest form, this content is static text, which is stored into a text file indicated by the URL. However, very often the content contains multi-media elements (e.g. images, audio, video, etc) as well as non-static text or other elements (e.g. news tickers, frames, scripts, streaming graphics, etc). Very often, more than one files form a Web page, however, there is only one file that is associated with the URL and which initiates or guides the Web page generation.
Web Browser
Web browser is a software program that allows users to access the content stored in Web sites. Modern Web browsers can also create content “on the fly”, according to instructions received from a Web site. This concept is commonly referred to as “dynamic page generation”. In addition, browsers can commonly send information back to the Web site, thus enabling two-way communication of the user and the Web site.
Hyperlink
Hyperlink, or simply link, is an element in a Web page that links to another part of the same Web page or to an entirely different Web page. When a Web page is viewed through a Web browser, links on that page can be typically activated by clicking on them, in which case the Web browser opens the page that the link points to. Usually every link has two components, a visual component, which is what the user sees in the browser window, and a hidden component, which is the target URL. The visual component can be text (often colored and underlined) or it can be a graphic (a small image). In the latter case, there is optionally some hidden text associated with the link, which appears on the browser window if the user positions the mouse pointer on the link for more than a few seconds. In this invention, the text associated with a link (hidden or not) will be referred to as “link text”, whereas the target URL associated with a link will be referred to as “link URL”.
As our society's infrastructure becomes increasingly dependent on computers and information systems, electronic media and computer networks progressively replace traditional means of storing and disseminating information. There are several reasons for this trend, including cost of physical vs. computer storage, relatively easy protection of digital information from natural disasters and wear, almost instantaneous transmission of digital data to multiple recipients, and, perhaps most importantly, unprecedented capabilities for indexing, search and retrieval of digital information with very little human intervention.
Decades of active research in the Computer Science field of Information Retrieval have yielded several algorithms and techniques for efficiently searching and retrieving information from structured databases. However, the world's largest information repository, the Web, contains mostly unstructured information, in the form of Web pages, text documents, or multimedia files. There are no standards on the content, format, or style of information published in the Web, except perhaps, the requirement that it should be understandable by human readers. Therefore the power of structured database queries that can readily connect, combine and filter information to present exactly what the user wants is not available in the Web.
Trying to alleviate this situation, search engines that index millions of Web pages based on keywords have been developed. Some of these search engines have a user-friendly front end that accepts natural languages queries. In general, these queries are analyzed to extract the keywords the user is possibly looking for, and then a simple keyword-based search is performed through the engine's indexes. However, this essentially corresponds to querying one field only in a database and it lacks the multi-field queries that are typical on any database system. The result is that Web queries cannot become very specific; therefore they tend to return thousands of results of which only a few may be relevant. Furthermore, the “results” returned are not specific data, similar to what database queries typically return; instead, they are lists of Web pages, which may or may not contain the requested answer.
In order to leverage the information retrieval power and search sophistication of database systems, the information needs to be structured, so that it can be stored in database format. Since the Web contains mostly unstructured information, methods and techniques are needed to extract data and discover patterns in the Web in order to transform the unstructured information into structured data.
Examples of some well-known search engines today are Yahoo, Excite, Lycos, Northern Light, Alta Vista, Google, etc. Examples of inventions that attempt to extract structured data from the Web are disclosed in sections 5, 6, and 7 of the related U.S. Provisional Application No. 60/221,750 filed on Jul. 31, 2000 for a “Computer Database Method and Apparatus”. These two separate groups of applications (search engines and data extractors) have different approaches to the problem of Web information retrieval; however, they both share a common need: they need a tool to “feed” them with pages from the Web so that they can either index those pages, or extract data. This tool is usually an automated program (or, “software robot”) that visits and traverses lists of Web sites and is commonly referred to as a “Web crawler”. Every search engine or Web data extraction tool uses one or more Web crawlers that are often specialized in finding and returning pages with specific features or content. Furthermore, these software robots are “smart” enough to optimize their traversal of Web sites so that they spend the minimum possible time in a Web site but return the maximum number of relevant Web pages.
The Web is a vast repository of information and data that grows continuously. Information traditionally published in other media (e.g. manuals, brochures, magazines, books, newspapers, etc.) is now increasingly published either exclusively on the Web, or in two versions, one of which is distributed through the Web. In addition, older information and content from traditional media is now routinely transferred into electronic format to be made available in the Web, e.g. old books from libraries, journals from professional associations, etc. As a result, the Web becomes gradually the primary source of information in our society, with other sources (e.g. books, journals, etc) assuming a secondary role.
As the Web becomes the world's largest information repository, many types of public information about people become accessible through the Web. For example, club and association memberships, employment information, even biographical information can be found in organization Web sites, company Web sites, or news Web sites. Furthermore, many individuals create personal Web sites where they publish themselves all kinds of personal information not available from any other source (e.g. resume, hobbies, interests, “personal news”, etc).
In addition, people often use public forums to exchange e-mails, participate in discussions, ask questions, or provide answers. E-mail discussions from these forums are routinely stored in archives that are publicly available through the Web; these archives are great sources of information about people's interests, expertise, hobbies, professional affiliations, etc.
Employment and biographical information is an invaluable asset for employment agencies and hiring managers who constantly search for qualified professionals to fill job openings. Data about people's interests, hobbies and shopping preferences are priceless for market research and target advertisement campaigns. Finally, any current information about people (e.g. current employment, contact information, etc) is of great interest to individuals who want to search for or reestablish contact with old friends, acquaintances or colleagues.
As organizations increase their Web presence through their own Web sites or press releases that are published on-line, most public information about organizations become accessible through the Web. Any type of organization information that a few years ago would only be published in brochures, news articles, trade show presentations, or direct mail to customers and consumers, now is also routinely published to the organization's Web site where it is readily accessible by anyone with an Internet connection and a Web browser. The information that organizations typically publish in their Web sites include the following:                Organization name        Organization description        Products        Management team        Contact information        Organization press releases        Product reviews, awards, etc        Organization location(s)        . . . etc . . .        