Computer users increasingly find it difficult to navigate document collections because of the growing size of such collections. For example, the World Wide Web on the Internet includes millions of individual pages dealing with varied content. Moreover, the internal intranets of large companies often include repositories filled with many thousands of documents, an example of “local” content.
It is frequently true that documents (content) on the Web and in local content repositories are not very well indexed. Consequently, finding desired information in such a large collection, unless the identity, location, or characteristics of a specific document are well known, can be much like looking for a needle in a haystack.
The World Wide Web is a loosely interlinked collection of documents (mostly text and images, collectively known as content) located on servers distributed over the Internet. Generally speaking, each document has an address, or Uniform Resource Locator (URL), in the exemplary form “http://www.server.net/directory/file.html.” In that notation, the “http:” specifies the protocol by which the document is to be delivered, in this case the “Hypertext Transfer Protocol.” The “www.server.net” specifies the name of a computer, or server, on which the document resides; “directory” refers to a directory or folder on the server in which the document resides; and “file.html” specifies the name of the file.
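By way of illustration only, the decomposition of the exemplary URL described above might be sketched as follows. The sketch uses the standard Python urllib.parse module; the function name split_url and the dictionary keys are assumptions introduced here for illustration and are not part of any described system.

```python
from urllib.parse import urlsplit
import posixpath


def split_url(url):
    """Break a URL into the parts named in the text (hypothetical helper)."""
    parts = urlsplit(url)
    # The path "/directory/file.html" splits into a folder and a file name.
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,            # "http"
        "server": parts.hostname,            # "www.server.net"
        "directory": directory.lstrip("/"),  # "directory"
        "file": filename,                    # "file.html"
    }


example = split_url("http://www.server.net/directory/file.html")
print(example)
```

The address alone identifies only where a document resides; it carries no information about the subject of the document.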
Most documents on the Web are in HTML (Hypertext Markup Language) format, which allows for formatting to be applied to the document, external content (such as images and other multimedia types) to be introduced within the document, and “hotlinks” or “links” to other documents to be placed within the document, among other things. Although this provides some capability of embedding one form of information into another, hotlinking is a static process that does not involve content collaboration in any significant degree.
In particular, content collaboration might be thought of as a resource pool that contains a collection of information that all relates to the same subject or might be defined as belonging to a particular interest category. All of the various locations of concerts being given by a popular musical group such as Pink Floyd might be representative of such a resource pool. Conventional web pages might contain information about a single concert location, e.g., a New York concert, but might not be able to give a user full information on all concert locations throughout the world.
Additionally, content-specific web pages that might present a listing of certain restaurants in a particular geographic locale are often incomplete in many respects, since they are a collection established and maintained by a particular content source. A user is therefore limited to the restaurants collected by that particular content source. Since content sources typically represent information belonging to the same category (such as music or restaurants, for example) using different content formats, it is extremely difficult for content sources to exchange collaborative information. Consequently, when a user desires to find information on the Internet (or other large network), the user will frequently turn to a “search engine” to locate the information.
The real utility of the search engine will be understood when it is realized that the Web is much like an extremely large library, in that there are literally millions of information objects in existence, and if one knows the URL, one is able to access them. Since the Web has multiple listings of books, movies, restaurants, and the like, the number of things that a user is able to look up typically includes all of the contents of a library, in addition to the contents of a video store, and might even be extended to include the contents of a typical Yellow Pages.
The difficulty with finding information on the Web is that very little of the information contained therein is referenced to metadata. Accordingly, most searching is done using brute-force techniques, conventionally supplied by the various Web robots of search engines such as AltaVista, Infoseek and Excite. Sites of this type perform the equivalent of reading every book in a library and allowing a user to look things up based on the words in the text. Not surprisingly, Web search results are often poorly presented and have very little relation to what a user was searching for. Additionally, search results are only presented on a page-by-page or object-by-object basis. With the exception of embedded links and the like, similar material from disparate sites has never been collected and presented in a single document.
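The word-based lookup described above might be sketched, under stated assumptions, as a simple index from words to pages. This is a minimal illustration and not a description of how any named search engine operates; the page addresses, page texts, and the function names build_word_index and search are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical pages: address -> page text.
PAGES = {
    "http://www.server.net/music/concert.html": "Pink Floyd concert in New York",
    "http://www.server.net/home/paint.html": "How to choose a pink paint for a nursery",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_index(pages):
    """Read every page and record which pages contain each word."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_word_index(PAGES)
print(sorted(search(index, "pink")))
```

In this sketch, a query for the single word "pink" returns both pages, although only one of them concerns the musical group. Matching on the words of the text alone gives no indication of what a page is about.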
Those who have considered these issues generally agree that the Web urgently requires metadata as a means of simplifying information search and retrieval procedures. Given universal metadata, a set of lookup fields, such as author, title, date, subject and the like, might be appended to all forms of textual information such that information relating to a given author, for example, might be easily extracted. Additionally, search engine details, such as how a Web site might package and interchange metadata, would also need to be standardized or unified, such that all metadata-using facilities would be simply and easily accessible regardless of minor perturbations in structure, form and format.
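As a minimal sketch of the lookup fields mentioned above, the following assumes that each document carries a record with author, title, date and subject fields. The record layout, the sample records, and the names Record, build_index and lookup are all hypothetical and are offered only to make the idea concrete.

```python
from collections import defaultdict
from dataclasses import dataclass

LOOKUP_FIELDS = ("author", "title", "date", "subject")


@dataclass(frozen=True)
class Record:
    """Hypothetical metadata record appended to one document."""
    url: str
    author: str
    title: str
    date: str
    subject: str


# Illustrative records only; the authors and documents are invented.
RECORDS = [
    Record("http://www.server.net/docs/a.html", "J. Smith", "Concert Guide", "1999", "music"),
    Record("http://www.server.net/docs/b.html", "J. Smith", "Dining Out", "1998", "restaurants"),
    Record("http://www.server.net/docs/c.html", "A. Jones", "Tour Dates", "1999", "music"),
]


def build_index(records):
    """Map each (field, value) pair to the documents that carry it."""
    index = defaultdict(list)
    for record in records:
        for field in LOOKUP_FIELDS:
            key = (field, getattr(record, field).lower())
            index[key].append(record.url)
    return index


def lookup(index, field, value):
    """Return the documents whose given field matches the value."""
    return index.get((field, value.lower()), [])


index = build_index(RECORDS)
print(lookup(index, "author", "J. Smith"))
```

With such fields in place, extracting everything by a given author becomes a direct lookup rather than a scan of page text. The sketch presumes that every source uses the same field names, which is the standardization the passage above calls for.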
Accordingly, there is a need for both systems and methodologies by which a unitary set of lookup keys, values and software may be developed such that there exists some form of organizing directorate for content.