In recent years, there has been a tremendous proliferation of computers connected to a global network known as the Internet. A "client" computer connected to the Internet can download digital information from "server" computers connected to the Internet. Client application software executing on client computers typically accepts commands from a user and obtains data and services by sending requests to server applications running on server computers connected to the Internet. A number of protocols are used to exchange commands and data between computers connected to the Internet. The protocols include the File Transfer Protocol (FTP), the Hypertext Transfer Protocol (HTTP), the Simple Mail Transfer Protocol (SMTP), and the "Gopher" document protocol.
The HTTP protocol is used to access data on the World Wide Web, often referred to as "the Web." The World Wide Web is an information service on the Internet providing documents and links between documents. The World Wide Web is made up of numerous Web sites around the world that maintain and distribute Web documents. A Web site may use one or more Web server computers that store and distribute documents in one of a number of formats including the Hypertext Markup Language (HTML).
An HTML document contains text and tags. HTML documents may also contain metadata and metatags. Metadata is data about data, and metatags define the metadata. Examples of metatags that identify metadata are "author," "language," and "character set." HTML documents may also include tags that contain embedded "links" or "hyperlinks" that reference other data or documents located on the same or another Web server computer. The HTML documents and the documents referenced by the hyperlinks may include text, graphics, audio, or video in various formats.
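By way of illustration only, the metatags and hyperlinks described above can be read out of an HTML document with a few lines of code. The following is a minimal sketch using the Python standard library's html.parser module. The class name MetaAndLinkParser, the sample document, and the example.com URL are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class MetaAndLinkParser(HTMLParser):
    """Collects <meta name=... content=...> pairs and <a href=...> targets."""

    def __init__(self):
        super().__init__()
        self.metadata = {}  # metatag name -> content value
        self.links = []     # hyperlink targets, in document order

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("name")
        if tag == "meta" and name:
            # A metatag such as "author" or "language" names one piece of metadata.
            self.metadata[name.lower()] = attributes.get("content") or ""
        elif tag == "a" and attributes.get("href"):
            # An anchor tag carries an embedded hyperlink to another document.
            self.links.append(attributes["href"])


# Hypothetical sample document, used only for illustration.
SAMPLE_HTML = """
<html>
  <head>
    <meta name="author" content="J. Smith">
    <meta name="language" content="en">
    <title>Sample</title>
  </head>
  <body>
    <p>See the <a href="http://www.example.com/other.html">other document</a>.</p>
  </body>
</html>
"""

parser = MetaAndLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.metadata)
print(parser.links)
```

In this sketch the metadata is kept separate from the document text, which mirrors the distinction drawn above between the text of a document and the data about it.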
A Web browser is a client application that communicates with server computers via HTTP, FTP, and Gopher protocols. Web browsers receive Web documents from the network and present them to a user. Internet Explorer, available from Microsoft Corporation, Redmond, Wash., is an example of a popular Web browser application.
An intranet is a local area network containing Web servers and client computers operating in a manner similar to that of the World Wide Web described above. Typically, all of the computers on an intranet are contained within a company or organization.
Web crawlers are computer programs that automatically retrieve numerous Web documents from one or more Web sites. A Web crawler processes the received data, preparing the data to be subsequently processed by other programs. For example, a Web crawler may use the retrieved data to create an index of documents available over the Internet or an intranet. A "search engine" can later use the index to locate Web documents that satisfy specified search criteria.
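The relationship between retrieved data, an index, and a search can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any particular crawler or search engine: the index is a plain in-memory mapping from words to document addresses, a search matches only documents containing every query word, and the function names build_index and search and the example.com URLs are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document URLs containing it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical crawl output: document URL -> text retrieved by the crawler.
crawled = {
    "http://www.example.com/a.html": "Web crawlers retrieve Web documents.",
    "http://www.example.com/b.html": "A search engine uses an index of documents.",
}

index = build_index(crawled)
print(sorted(search(index, "documents")))
print(sorted(search(index, "search index")))
```

Here the index is built once from the crawled data and consulted later by the search, so the documents themselves need not be retrieved again at query time.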
It is desirable to have a mechanism in the crawler that allows the crawler to feed to client applications, such as an indexing engine, a stream of data not directly present in the "crawled" documents. Preferably, such a mechanism would have the ability to modify data retrieved from Web documents with active components in order to allow the retrieved data to be processed more efficiently and accurately by the client application. Such a mechanism would also preferably have the ability to exclude a document from being indexed based on its content and properties. The present invention is directed to providing such a mechanism.