Search engines allow an application to make its content searchable. A known approach for making data searchable is to use seedlists, also known as sitemaps. In this approach, an application provides the search engine with a list of the data items that need to be made searchable: the seedlist. Crawling happens in two phases. First, the search engine crawls the seedlist, which contains metadata about the data items that need to be crawled. Then, in a second phase, the pieces of content themselves are retrieved from the application, and the data is made searchable. Note that this second phase is optional; there are cases in which only the metadata in the seedlist is retrieved and made searchable.
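The two-phase crawl described above can be illustrated with a minimal Python sketch. The seedlist entry shape (a dictionary with "url", "title", and "metadata" keys), the function names crawl and search, and the fetch_content callback are all hypothetical and chosen for illustration only; they do not correspond to any particular search engine's interface.

```python
def crawl(seedlist, fetch_content=None):
    """Build a tiny in-memory index from a seedlist (illustrative only).

    Phase 1 indexes the metadata carried by the seedlist itself.
    Phase 2 is optional: if fetch_content is given, each item's
    content is retrieved and added to the index as well.
    """
    index = {}
    # Phase 1: read the seedlist and record the metadata of every item.
    for entry in seedlist:
        doc = {"title": entry["title"]}
        doc.update(entry.get("metadata", {}))
        index[entry["url"]] = doc
    # Phase 2 (optional): retrieve the content itself from the application.
    if fetch_content is not None:
        for entry in seedlist:
            index[entry["url"]]["content"] = fetch_content(entry["url"])
    return index


def search(index, term):
    """Return the URLs whose indexed fields contain the term."""
    term = term.lower()
    return sorted(
        url
        for url, doc in index.items()
        if any(term in str(value).lower() for value in doc.values())
    )
```

With only phase 1, a query can match seedlist metadata such as a title or author; a match on the body text of an item becomes possible only once phase 2 has retrieved that content.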
One application of this approach is the Google Sitemap Protocol (Google is a trade mark of Google, Inc.), which allows webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a sitemap is an XML (Extensible Markup Language) file that lists URLs (Uniform Resource Locators) for a site along with additional metadata about each URL (for example, when it was last updated, how often it usually changes, and how important it is relative to other URLs in the site) so that search engines can crawl the site more intelligently.
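As an illustration of the sitemap format just described, the following Python sketch parses a small sitemap and extracts the per-URL metadata. The element names (urlset, url, loc, lastmod, changefreq, priority) and the namespace are those of the sitemap protocol; the sample URLs, the sample values, and the function name parse_sitemap are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# A made-up two-entry sitemap; the second entry omits optional fields.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/catalog</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> element; missing optional fields are None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        priority = url.findtext("sm:priority", namespaces=SITEMAP_NS)
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=SITEMAP_NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=SITEMAP_NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=SITEMAP_NS),
            "priority": float(priority) if priority is not None else None,
        })
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(SAMPLE_SITEMAP):
        print(entry)
```

A crawler that supports sitemaps could use fields such as lastmod and changefreq to decide which URLs to revisit first; that scheduling policy is outside the scope of this sketch.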
Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support sitemaps to pick up all URLs in the sitemap and learn about those URLs using the associated metadata.
Seedlists are a generalization of sitemaps: sitemaps contain only URLs and very little metadata about them, while seedlists contain extensive metadata about the data items, including, for example, security information. The term seedlist as used herein should be interpreted as including sitemaps.
In some types of application, content is not created in the application itself; instead, existing content from other sources is commented upon, annotated, tagged, or classified. For example, applications that aggregate content and annotations already exist on the web, such as http://answers.shopping.com. This application aggregates product data, in particular prices, from various web sites. However, the content of such applications is obtained through traditional web crawling and content extraction or by using web service APIs (Application Programming Interfaces). Additionally, this aggregation is usually not used to improve the searchability of the items, only to add more metadata (comments, ratings, etc.), and it is done explicitly, with knowledge of the domain and of the metadata expected.
Another example of an application that references and adds metadata to existing content from other sources is an application for social bookmarking, such as Dogear (a trade mark of International Business Machines Corporation) or del.icio.us (a trade mark of Yahoo! Inc.). Social bookmarking lets users centrally store, categorize, and share a set of personal web bookmarks with others. Tags are stored in relation to content from other sources on the web or in an intranet. Thus, bookmarks are easier to find, since they can be retrieved using the tags with which they have been associated.
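The tag-based retrieval described above can be sketched as a simple mapping from tags to bookmarked URLs. This is a minimal illustration only; the class name BookmarkStore and its methods are hypothetical and do not represent the interface of Dogear, del.icio.us, or any other product.

```python
from collections import defaultdict


class BookmarkStore:
    """Stores tags against external URLs; the content itself lives elsewhere."""

    def __init__(self):
        self._urls_by_tag = defaultdict(set)
        self._tags_by_url = defaultdict(set)

    def add(self, url, tags):
        """Associate each tag (case-insensitively) with the bookmarked URL."""
        for tag in tags:
            normalized = tag.lower()
            self._urls_by_tag[normalized].add(url)
            self._tags_by_url[url].add(normalized)

    def find(self, tag):
        """Return the bookmarked URLs carrying the given tag, sorted."""
        return sorted(self._urls_by_tag.get(tag.lower(), set()))

    def tags_for(self, url):
        """Return the tags attached to a URL, sorted."""
        return sorted(self._tags_by_url.get(url, set()))
```

Note that the store holds only the annotations (the tags) and a reference to the external content, which is the characteristic that distinguishes this kind of application from one that creates its own content.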
These applications, which provide metadata in the form of annotations to content from external sources, are referred to as annotating applications. Currently, each such annotating application manages the searchability of its annotations independently of the searchability of the external content itself.
Enterprise search engines aim to make all the content or data available in an enterprise searchable. The problem arises as to how to manage content and annotations applied in an independent application in the context of an enterprise search engine.