The present invention relates to methods for searching for information in a plurality of information sources connected to a computer information network and specifically to searching databases on the Internet.
The ARPANET, a predecessor to what is now called the Internet, was started in the late 1960's under J. C. R. Licklider at the Defense Department as a way for a government research funding agency to save costs, and to let its users share information, by having its researchers share computers rather than giving each institution its own. Using hardware and software protocols developed for this purpose, users could sit at their own terminals but access a computer anywhere on the network as if it were in their own location. The targeted use was remote terminal access, but other applications (such as electronic mail) quickly became popular. In those early days the number of computers was small and many of the researchers knew each other, because the only computers allowed on the network were those belonging to institutions funded by the Advanced Research Projects Agency of the United States Department of Defense (“ARPA”, later called “DARPA”). Over time many more computers (and users) were added, and access to the network became more widespread as the National Science Foundation allowed its researchers and others affiliated with its initiatives to connect to the DARPA network (which later became known as the Internet).
In the 1990's two major developments occurred that helped lead to the explosive growth of the Internet. The first was that commercial enterprises were allowed to connect their computers to the Internet without the prior requirement of having a government-funded research project. The second was that the World-Wide-Web (the “Web”) protocols and software were created.
From the earliest days a few commercial enterprises were allowed to connect to the ARPANET. However, this was carefully controlled by DARPA, which allowed only those at the cutting edge of computer research (such as Xerox's Palo Alto Research Center and the computer laboratories of the Massachusetts Institute of Technology) to connect. Later other companies' internal networks were interconnected (such as IBM's BITNET) as the value of being able to communicate rapidly among companies, government agencies, and educational institutions became clear. But there was little incentive for most other companies and institutions to connect.
The ideas underlying the Web had been germinating for some time. Tim Berners-Lee managed the research group at CERN that introduced the Web protocols and software in 1990 and that is credited with making these concepts practical and accessible to millions of people. This was done at just the right time to capture the attention of most of the computing world, including the information technology industry and the media. The Web embodied, simultaneously, easy-to-use software and the promise of universal access to a large variety of information. Building on years of Internet protocol development, the creation of freely accessible content, the evolution of low-cost computer networking, and a growing desire for standardized access independent of operating system and hardware, the Web quickly became a dominant computing phenomenon of the second half of the 1990's.
These developments have resulted in explosive growth of the Internet. The number of computers attached to the Internet is estimated to have increased from fewer than one million in January 1993 to over 40 million six years later. It is also estimated that the number of users with access to the Internet will increase from 20 million in 1996 to 140 million in 2002.
Along with the explosive growth in computers and users came an even more explosive growth in the information available to those users. Each computer connected to the Internet is potentially a source of information (although many are accessible only to people within a security perimeter, because they are inside a corporation or other institution). Each computer may contain thousands or millions of documents and information files. This is in contrast to the early days of the network, when only a limited number of computers contained documents and files relevant to a limited set of research subjects. Compounding the issue is the great diversity of information available. Many of the information nuggets available do not fit neatly into the world's standard information classification schemes (e.g., the Dewey Decimal System or the Standard Industrial Classification codes for companies).
The Internet is a great advance for the communications ability of individuals and organizations, because many individuals and most organizations have the financial means to connect to the Internet. Furthermore, a great deal of the world's explicit information (that is, information that is written, graphic, audio or visual) is also available on the Internet. But this very success has caused a major problem that is limiting the usefulness of the Internet itself. The problem is the difficulty of locating relevant information in answer to any particular query.
Current technologies for search-and-retrieval all suffer from problems that cause retrievals to contain irrelevant references, references to documents that no longer exist, and out-of-date references, and additionally to contain so many references that the retrievals overwhelm the capacity of a person to find the particular information sought.
Prior art information retrieval processes typically use the measures of “Recall” and “Precision” to assess the efficacy of an approach. Today, the immense size and dynamic nature of the Internet, which has become the database of choice for most people and the one searched most frequently, require the additional evaluation measures of Ranking and Timeliness.
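For illustration only, the two prior-art measures named above can be sketched in a few lines of Python. The text does not define them, so this sketch assumes the conventional information-retrieval definitions: Precision is the fraction of retrieved references that are relevant, and Recall is the fraction of all relevant references that were retrieved. The function name `precision_recall` and the sample document identifiers are hypothetical, and Ranking and Timeliness are not shown because no definition of them is given here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    Assumes the conventional definitions (not taken from the text above):
      precision = |retrieved AND relevant| / |retrieved|
      recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set

    # Guard against empty sets so the function never divides by zero.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: a search returns four references, two of which
# are among the three references that are actually relevant.
retrieved_docs = ["doc_a", "doc_b", "doc_c", "doc_d"]
relevant_docs = ["doc_a", "doc_c", "doc_e"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under these assumed definitions, a retrieval that returns very many references can score well on Recall while scoring poorly on Precision, which corresponds to the overload problem described earlier.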