This invention relates generally to information retrieval in a distributed computer environment. More particularly, it relates to a thorough search of plain text and compressed documents in a distributed database such as an Internet environment.
It is well known to connect a plurality of computer systems into a network of computer systems. In this way, the collective resources available within the network may be shared among users, thus allowing each connected user to enjoy resources which would not be economically feasible to provide to each user individually. With the growth of the Internet, sharing of computer resources has been brought to a much wider audience; it has become a cultural fixture in today's society for both information and entertainment. Government agencies employ Internet sites for a variety of informational purposes. For many companies, their Internet sites are an integral part of their business; they are frequently mentioned in the companies' television, radio and print advertising.
The World Wide Web, or simply "the Web", is the Internet's multimedia information retrieval system. It is the most commonly used method of transferring data in the Internet environment. Other methods exist, such as the File Transfer Protocol (FTP) and Gopher, but they have not achieved the popularity of the Web. Client machines accomplish transactions with Web servers using the Hypertext Transfer Protocol (HTTP), which is a known application protocol providing users access to files, e.g., text, graphics, images, sound, video, using a standard page description language known as the Hypertext Markup Language (HTML). HTML provides basic document formatting and allows the developer to specify "links" to other servers and files. In the Internet paradigm, a network path to a server is identified by a Uniform Resource Locator (URL) having a special syntax for defining a network connection.
Retrieval of information is generally achieved by the use of an HTML-compatible "browser", e.g., Netscape Navigator, at a client machine. When the user of the browser specifies a link via a URL, the client issues a request to a naming service to map a hostname in the URL to a particular network IP address at which the server is located. The naming service returns a list of one or more IP addresses that can respond to the request. Using one of the IP addresses, the browser establishes a connection to a server. If the server is available, it returns a document or other object formatted according to HTML.
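By way of illustration only, the name-resolution step described above may be sketched in Python as follows. The function name resolve_url, the injectable resolver parameter and the default port choice are assumptions made for this sketch and do not appear in the original description; the standard library call socket.getaddrinfo stands in for the naming service.

```python
import socket
from urllib.parse import urlsplit


def resolve_url(url, resolver=socket.getaddrinfo):
    """Map the hostname in a URL to a list of candidate IP addresses.

    `resolver` is a hypothetical hook (not part of the original text) so the
    naming service can be swapped out; by default the operating system's
    resolver is used through socket.getaddrinfo.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        raise ValueError("URL does not contain a hostname: %r" % url)

    # Assumption for this sketch: fall back to the conventional HTTP/HTTPS port.
    port = parts.port or (443 if parts.scheme == "https" else 80)

    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address as a string.
    infos = resolver(host, port, type=socket.SOCK_STREAM)

    addresses = []
    for info in infos:
        ip = info[4][0]
        if ip not in addresses:  # keep the first occurrence, preserve order
            addresses.append(ip)
    return host, port, addresses
```

A browser would then attempt a connection to one of the returned addresses, for example with socket.create_connection((addresses[0], port)), and request the document over HTTP.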
One of the frustrations of the Web is that, although a cornucopia of information is stored in its various documents, that information is often very difficult to locate. A variety of search engines exist: general-purpose engines such as Alta Vista, HotBot and Excite, which search a plurality of different Web sites, as well as the search engines provided on the Web sites themselves. The results from these search engines are unpredictable and vary from search engine to search engine.
One of the problems is that the search engines ignore the many zip files, PDF files or otherwise compressed files on the server. At most, the name of the file, or a one-line description of the file in an index, is searched. This is usually insufficient. Compression of the files is important to reduce the load on the Internet and to provide a better chance that a user will receive a file without transmission errors in a reasonable amount of time. However, many times the information for which the user is searching is located in one of the compressed files. Due to the nature of prior art search engines, the user is forced to manually decompress and search the files. This is unacceptable.
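For illustration, the manual decompress-and-search step that the user is otherwise forced to perform may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the method of the invention: the function name search_zip is hypothetical, only zip archives are handled, and the archive members are assumed to be UTF-8 plain text.

```python
import io
import zipfile


def search_zip(archive_bytes, term):
    """Return (member name, line number, line) for each line containing `term`.

    Assumptions for this sketch: the archive is a zip file held in memory,
    its members are plain text, and matching is case-insensitive.
    """
    needle = term.lower()
    hits = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            # Decompress the member and decode it, replacing undecodable bytes.
            text = archive.read(name).decode("utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if needle in line.lower():
                    hits.append((name, number, line.strip()))
    return hits
```

The sketch shows why a search limited to the file name or a one-line index description falls short: a match inside a member becomes visible only after that member has been decompressed and read.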
The present invention provides a solution to this problem.