1. Field of the Invention
The present invention relates to resource discovery on the Internet, especially to the functions currently performed by search engines and directories.
2. Description of Related Art
Search engines play a vital role in finding things in cyberspace. The first of these search engines, Archie, maintained a database of approximately 1,500 host computers, which housed files accessible through the file transfer protocol (FTP) space of the Internet. FTP sites were feasible in the early days of the Internet when the number of users and host computers was relatively small. The next advancement in search engine technology produced a more user-friendly system called Gopher, a subject-based, menu-driven guide for finding information on the Internet. Gopher searched all the files located on a particular host; however, it was impossible to know whether all information relevant to a search resided on one host. Visiting all 6,000 Gopher servers would be an incredibly time-consuming task. Therefore, the search tool Veronica was developed to search Gopher space.
With the advent of the World Wide Web, it became necessary to create a new tool for cataloging information found on this portion of the Internet. Unlike a real-world landscape, the contours of the Web constantly shift as sites come and go, necessitating constant ‘reconnaissance’ in order to provide users with an accurate map of its offerings. This “map”, which is really a catalog, is at the core of the search engine. When a user inputs a query into a dialog box, the user is not searching the Web per se. Rather, the user is searching an index of the Web created and continually updated by the search engine.
The first Web search engine was World Wide Web Wanderer. It used automatic search agents called ‘robots’ to track the Web's growth. It was followed by a search engine called ALIWEB, which did not use a search agent. Instead, it asked people to write descriptions of their Web services and register them at ALIWEB. ALIWEB then periodically retrieved information from the registered Web servers and combined it into a search database. Soon robot-based search engines and editor-based directories became the popular tools for Internet search. A search agent works like a chain reaction: it starts from one Web page, follows the out-links to other Web pages, and then repeats the same pattern on each Web page it finds.
Early search engines focused on ‘breadth first’ and ‘depth first’ search agents. Referred to colloquially as ‘robots’ or ‘spiders’, these agents were set loose onto the Web to index HyperText Mark-up Language (HTML) files residing on the myriad of servers connected to the Internet.
The ‘breadth first’ approach works by ‘hydroplaning’ over the expanse of the Web, taking note of any hyperlink references found in a file, but deferring any deeper inquiry in favor of moving on to cover as much territory as possible. In contrast, a ‘depth first’ approach works by homing in on a site, dropping anchor and thoroughly exploring every pointer leading from the file, drilling down until the search agent finds a file with no outbound links.
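The difference between the two traversal orders described above can be illustrated with a minimal sketch. The link graph `LINKS` and the function names are hypothetical and chosen only for illustration; a real search agent would fetch pages over the network rather than read an in-memory table.

```python
from collections import deque

# Hypothetical link graph (an assumption for illustration):
# each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}

def crawl_breadth_first(start, links):
    """Visit pages level by level, deferring deeper links until later."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

def crawl_depth_first(start, links):
    """Follow each chain of links to its end before backtracking."""
    order, seen, stack = [], set(), [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        # Push in reverse so the first out-link is explored first.
        stack.extend(reversed(links.get(page, [])))
    return order

print(crawl_breadth_first("A", LINKS))  # ['A', 'B', 'C', 'D', 'E']
print(crawl_depth_first("A", LINKS))    # ['A', 'B', 'D', 'C', 'E']
```

In this sketch the breadth first agent reaches both B and C before any page they link to, whereas the depth first agent drills from B down to D before returning to C.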
Search engines operate in a three-step procedure. First, specialized types of software, referred to as ‘robots’, ‘spiders’ or ‘crawlers’, go out and retrieve information about Web sites. Either the search engine has found these Web sites itself, or the Web sites (through their Webmasters, those in charge of Web site management) have asked to be indexed. Some search engines “index” (record by word) all of the terms on a given Web site. Others may index only the terms within the first few sentences, the Web site title, or the site's metatag, which is not viewable on the actual page and contains a short summary description provided by the site designer. Search engines must re-sample the Web sites periodically to detect any change since the last indexing.
The second step is to store the indexes in a database with a ranking for each Web page. The rank reflects the relevance of the Web page to certain keywords. A proprietary algorithm is used to evaluate the index. Since different search engines employ different algorithms, the ranking results are not consistent across search engines. In the third step, when an Internet user types one or more keywords as a search query, the search engine retrieves from the database the indexed information that matches the query. This last step completes the service of the search engine.
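The index, rank and query steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the page contents, the example URLs and the function names are hypothetical, and ranking by simple word counts is a stand-in, since the actual ranking algorithms are proprietary and differ between search engines.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to {url: number of occurrences on that page}."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][url] = count
    return index

def search(index, query):
    """Return URLs containing every query word, highest word count first."""
    words = tokenize(query)
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    matches = set(postings[0]).intersection(*postings[1:])
    scores = {url: sum(p[url] for p in postings) for url in matches}
    # Assumed ranking rule: summed term counts, ties broken by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical crawled pages (not taken from any real site).
PAGES = {
    "http://example.com/a": "Gopher servers and gopher menus",
    "http://example.com/b": "Web search engines index the Web",
    "http://example.com/c": "Search the gopher space",
}

index = build_index(PAGES)
print(search(index, "gopher"))         # ['http://example.com/a', 'http://example.com/c']
print(search(index, "search gopher"))  # ['http://example.com/c']
```

Note that the query is answered entirely from the stored index; no page is fetched at query time.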
Internet directories operate on a different principle. They require human editors to view an individual Web site and determine its placement in a subject classification scheme or taxonomy. Once a site has been classified, keywords associated with it can be used to search the directory's database for Web sites of interest, or people can follow the structure of the directory to find the information located under it.
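The two access paths a directory offers, browsing the taxonomy and searching editor-assigned keywords, can be sketched as below. The categories, URLs and keywords are hypothetical placeholders, and the flat table keyed by category path is only one possible representation.

```python
# Hypothetical directory (an assumption for illustration): each category
# path maps to the (url, keywords) entries an editor has filed under it.
DIRECTORY = {
    ("Science", "Astronomy"): [("http://example.org/stars", {"telescope", "stars"})],
    ("Science", "Biology"): [("http://example.org/cells", {"cells", "microscope"})],
    ("Recreation", "Astronomy"): [("http://example.org/stargazing", {"stars", "hobby"})],
}

def browse(directory, *path):
    """Return URLs filed under a category path or any of its subcategories."""
    return sorted(
        url
        for category, entries in directory.items()
        if category[:len(path)] == path
        for url, _keywords in entries
    )

def keyword_search(directory, keyword):
    """Return URLs whose editor-assigned keywords include the given keyword."""
    return sorted(
        url
        for entries in directory.values()
        for url, keywords in entries
        if keyword in keywords
    )

print(browse(DIRECTORY, "Science"))
print(keyword_search(DIRECTORY, "stars"))
```

Every entry in such a table exists only because an editor placed it there, which is why coverage and timeliness depend on editorial capacity.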
Coverage by search engines and directories significantly affects Internet usage. People rely heavily on search engines and directories to use the Internet. According to a recent study, two-thirds to three-quarters of all users cite finding information as one of their primary uses of the Internet, and more than 98% of active Web users rely on the Internet to find reference material, 30% on a daily basis and a further 40% on a weekly basis. The major Internet search engines (HotBot, Northern Light and AltaVista) individually catalog at most 16% of the Internet's sites. As the number of Web pages has increased, coverage has declined over the past two years. Even when the results from all search engines are combined, total Internet coverage is only about 42%. Lack of coverage is also a problem for Internet directories, owing to the cost and time of individually assigning sites to categories and to the editorial policies of directory companies.
Although some search engine companies (Google and Inktomi) now claim coverage of over 1 billion Web pages, there is more content than current search engine companies can cover. More than one million new pages appear every day. More content is being created in formats other than HTML text, e.g. Adobe's portable document format (PDF), other formatted files, and multimedia files. There is also much non-crawlable content, such as sites that have no links pointing to them, sites screened by a login, corporate intranets, sites that use robots.txt files to bar search robots, and deep content. Studies show that the “invisible Web” of deep content contains nearly 550 billion Web pages, most of which are open to the public but never touched by search engines. It is estimated that more than 100,000 deep Web sites exist. Another reason that search engines have difficulty finding all information on the Web is the structure of the Web, which has a bow tie shape according to a recent study. A large cluster of the Web consists of Web pages that cannot be reached by links.
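One of the barriers listed above, exclusion by robots.txt, can be illustrated with Python's standard `urllib.robotparser` module. The robots.txt content, the agent name `ExampleBot` and the URLs are hypothetical; the sketch only shows how a well-behaved search robot would decide whether it may fetch a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a site might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(robots_txt, agent, url):
    """Return True if the given robot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/public/index.html"))  # True
print(allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/private/page.html"))  # False
```

Pages under the disallowed path remain publicly reachable by a browser, yet a compliant robot never indexes them.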
Some researchers see the coverage problem as damaging to the purpose of the Internet as a public good. The Internet as a public community embodies the ideals of a liberal democratic society. It is a rich array of commercial, political, academic, and artistic activities that fosters associations and communications among people around the world and provides a virtually endless supply of information. As technology progresses, it is certain that there will be more Internet applications. If trends in Internet directories and search engines lead to a narrowing of options, the Internet as the kind of public good that many people envisioned will be seriously undermined.
The information retrieved from search engines is often not very relevant. The indexing methods used by current search engines often misrepresent the contents of the indexed Web sites. Web site builders have little control over what Web users learn about their Web sites. To increase a Web site's chance of being indexed correctly and of being placed higher in the “found sites” list, Web designers need to spend extra effort making the Web site suit the search engines. This is always a confusing job because each search engine uses a different algorithm for indexing, and many keep it a secret.
Because relevance is poor, Web users who conduct searches with search engines suffer so-called “information overload”, i.e. too much irrelevant information and little efficiency. In the worst cases, submitting broad query terms to search engines can identify literally hundreds of thousands of potential Web pages. Users also often find the same Web site and/or pages appearing repeatedly in the results. To find what they want, users usually need to visit several search service Web sites.
Recency is also poor. Fifty percent of Internet users cite searches that turn up broken links as one of their typical search problems. The bigger the search engine service, the higher its percentage of dead links. There appears to be a trade-off between comprehensiveness and recency. Reducing the time between re-samplings is a big challenge for search engines, and doing so also unreasonably increases the load on the visited Web site's server. There is a considerable backlog at the directory services as well; for example, it can take six months for Yahoo! to place a site in its directory, if the editor decides the Web site is suitable for inclusion. Recency will therefore become a serious problem as the Web continues to grow rapidly.
With current search engines and directories, Web users have little say in obtaining a better search service. Because the Internet grows so rapidly, a self-improving search service is necessary for Web users.
Metadata, i.e. structured data about data, has been proposed by the Dublin Core Metadata Initiative since 1995 as a way to improve Internet searching. The World Wide Web Consortium (W3C), under the leadership of Tim Berners-Lee, has also proposed the Resource Description Framework (RDF) for broader Internet applications, including resource discovery. However, these metadata standards must be recognized by search engines to be useful. Currently, without support from any of the major search engines, Web site builders have no reason to put such metadata into their Web pages, and the Internet community cannot benefit from it.
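To illustrate what recognizing such metadata would involve, the following minimal sketch reads `<meta>` elements from an HTML page using Python's standard `html.parser` module. The sample page, its values and the `DC.`-prefixed element names are assumptions made for illustration, following the common convention for embedding Dublin Core elements in HTML; this is not a description of any existing search engine.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.metadata[name] = content

def extract_metadata(html):
    """Return a dict of metadata declared in the page's <meta> elements."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.metadata

# Hypothetical page carrying author-supplied metadata.
SAMPLE_PAGE = """
<html><head>
<title>Example</title>
<meta name="DC.title" content="Example Catalog">
<meta name="DC.creator" content="Example Author">
<meta name="description" content="A sample page.">
</head><body>Hello</body></html>
"""

print(extract_metadata(SAMPLE_PAGE))
```

The site builder, rather than an indexing heuristic, decides what these fields say, which is the control the metadata proposals aim to provide.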
The related art includes articles that call for improvements in searching on the Internet. “Searching the Web: General and Scientific Information Access” and “Accessibility of Information on the Web” by Steve Lawrence et al., and “Defining the Web: The Politics of Search Engines” by Lucas Introna et al., discuss limitations of current search engines.
Inventions of interest, as depicted in patents, include U.S. Pat. No. 5,283,731, issued on Feb. 1, 1994 to James E. Lalonde et al., which describes a computer based classified ad system and method.
U.S. Pat. No. 5,319,542, issued on Jun. 7, 1994 to John E. King, Jr. et al., describes an electronic catalog ordering process and system.
U.S. Pat. No. 5,649,186, issued on Jul. 15, 1997 to Gregory J. Ferguson, describes a system and computer based method providing a dynamic information clipping service.
U.S. Pat. No. 5,721,910, issued on Feb. 24, 1998 to Sandra S. Unger et al., describes a database system and a method of producing a database which can be used to assign scientific or technical documents, such as patents and/or technical or scientific publications and/or abstracts of these patents or publications, to one or more scientific or technical categories within a multidimensional hierarchical system which reflects the business, scientific or technical interests of a business, scientific or technical entity or specialty.
U.S. Pat. No. 5,727,156, issued on Mar. 10, 1998 to Dirk Herr-Hoyman et al., describes a method and apparatus for posting hypertext documents to a hypertext server so as to make the hypertext documents accessible to users of the hypertext document system while securing against unauthorized modification of the posted hypertext documents.
U.S. Pat. No. 5,745,882, issued on Apr. 28, 1998 to Matthew J. Bixler et al., describes an interface for an electronic classified advertising system that includes the capability for the user to enter search criteria for an item of interest, to save the search criteria and to be notified by the system when an item matching the search criteria is entered into the system.
U.S. Pat. No. 5,794,236, issued on Aug. 11, 1998 to Joseph P. Mehrle, describes a computer based system that will classify a legal document into a location within a legal hierarchy.
U.S. Pat. No. 5,799,284, issued on Aug. 25, 1998 to Roy E. Bourquin, describes a computer system that utilizes client/server software to allow users of the client software to log into a server and publish information about a product or service.
U.S. Pat. No. 5,855,013, issued on Dec. 29, 1998 to Dave C. Fisk, describes a method and apparatus for creating and maintaining a computer database using a virtual index system.
U.S. Pat. No. 5,870,717, issued on Feb. 9, 1999 to Charles F. Wiecha, describes a system for ordering items over a computer network using an electronic catalog.
U.S. Pat. No. 5,963,951, issued on Oct. 5, 1999 to Gregg Collins, describes a computerized on-line dating service that provides user-controlled perusal of search results.
U.S. Pat. No. 5,974,409, issued on Oct. 26, 1999 to Sankrant Sanu et al., describes an enhanced find system and method for locating offerings within an interactive on-line network.
U.S. Pat. No. 6,009,410, issued on Dec. 28, 1999 to Suzanne L. LeMole et al., describes a method and system for presenting customized advertising to a user on the World Wide Web.
International Patent document WO 98/19417, published on May 7, 1998, describes an integrated computer-implemented corporate information delivery system.
None of the above inventions and patents, taken either singly or in combination, is seen to describe the instant invention as claimed.