Enterprise search and discovery systems typically interact with complex and highly diverse information sources and entities (e.g., people, paper documents, static and dynamic web pages, files, emails, multimedia files, and the like). An enterprise knowledge and document search system reliably discovers, combines, and ranks for relevance structured (e.g., relational or geographic database), semi-structured (e.g., web, email, other XML files), and unstructured (e.g., flat text documents) information. Moreover, the search system can employ context and scope to help disambiguate search queries, and can support enterprise requirements such as fine-grained access control for security and multi-language support.
For example, to maximize the likelihood of locating relevant information amongst an abundance of data, search engines are often employed to search the entire world-wide web or a distinguished subset of sites on the web. In some instances, a user is aware of the name of a site, server, or URL of the site that the user desires to access. In such situations, the user can access the site by simply entering the URL in an address bar of a browser and connecting to the site. However, in most instances, the user does not know the URL or the name of the site that hosts the desired content/information. To locate a site or corresponding URL of interest, users often employ a search engine to facilitate locating and accessing sites based on user-entered keywords and operators.
A search engine is a tool that facilitates web navigation based on entry of a search query comprising one or more keywords. Upon receipt of a query, the search engine retrieves a list of website resources matching the keywords, typically ranked based on relevance to the query. To enable this functionality, the search engine must typically generate and maintain a supporting infrastructure. Agents for such search engines (e.g., spiders or crawlers) navigate websites in a methodical manner and retrieve information stored on the sites visited. For example, a crawler can make a copy of all or a portion of a website and related information. The search engine subsequently analyzes the content captured by one or more crawlers to determine how a page or document will be indexed. Indexing transforms website data into a form, the index, which can be employed at search time to facilitate identification of content. Some engines index all text on a website's resources, while others index only terms associated with particular components (e.g., title, header, or meta-tag). Crawlers must also periodically revisit web pages to detect and capture changes made since the last indexing.
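The indexing step described above can be sketched as a small inverted index. This is a minimal illustration only, assuming crawled pages are already available as dictionaries of text fields; the function names (tokenize, build_index), the field names, and the example URLs are hypothetical and are not taken from any particular search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages, fields=("title", "body")):
    """Map each term to the set of URLs whose selected fields contain it."""
    index = defaultdict(set)
    for url, page in pages.items():
        for field in fields:
            for term in tokenize(page.get(field, "")):
                index[term].add(url)
    return index


# Hypothetical crawled pages; the URLs and text are made up for illustration.
pages = {
    "http://example.com/a": {
        "title": "Enterprise search",
        "body": "Ranking documents by relevance",
    },
    "http://example.com/b": {
        "title": "Crawler basics",
        "body": "A crawler revisits pages to detect changes",
    },
}

full_index = build_index(pages)                      # index all text
title_index = build_index(pages, fields=("title",))  # index titles only
```

Restricting the fields argument mirrors the distinction between engines that index all text and engines that index only particular components such as the title.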
Upon entry of one or more keywords as a search query, the search engine retrieves information that matches the query from the index, ranks the resources that match the query, generates a snippet of text associated with each matching site, and displays the results to a user. Furthermore, advertisements relating to the search terms can be displayed together with the results. The user can thereafter scroll through a plurality of returned resources, ads, and the like in an attempt to identify information of interest. However, this can be an extremely time-consuming and frustrating process, as search engines can return a substantial number of resources. More often than not, the user is forced to narrow the search iteratively by altering and/or adding keywords and operators to identify websites that include relevant information. Web pages themselves have become dynamic and more complex over time and have challenged even the smartest of the search crawlers. Employment of scripting and other automated means has generally left the average search crawler misinterpreting and/or missing entirely the information on some web pages. A search crawler typically looks at textual data and associated resource data to index.
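The query-time flow described above (match, rank, generate a snippet) can be sketched as follows. This is a simplified illustration under stated assumptions: documents are plain strings held in memory, a document matches only if it contains every query term, and the score is a raw count of query-term occurrences. The function name search, the snippet_width parameter, and the example documents are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, docs, snippet_width=40):
    """Return (url, score, snippet) tuples for docs containing every query term."""
    terms = tokenize(query)
    results = []
    for url, text in docs.items():
        tokens = tokenize(text)
        if not terms or not all(t in tokens for t in terms):
            continue  # keep only documents that contain all query terms
        # Score: total number of occurrences of the query terms.
        score = sum(tokens.count(t) for t in terms)
        # Snippet: a window of text around the first occurrence of the
        # first query term (a plain substring match, for simplicity).
        pos = text.lower().find(terms[0])
        start = max(0, pos - snippet_width // 2)
        snippet = text[start:start + snippet_width].strip()
        results.append((url, score, snippet))
    # Highest score first; ties are broken by URL for a stable order.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results


# Hypothetical documents, made up for illustration.
docs = {
    "http://example.com/a": "Search engines rank pages. A search index speeds up search.",
    "http://example.com/b": "An index maps terms to pages.",
    "http://example.com/c": "Crawlers copy pages for the search index.",
}

results = search("search index", docs)
```

Even in this toy form, every document that contains the keywords is returned, which hints at why a user facing a large result list often has to refine the query iteratively.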
Likewise, enterprise search solutions rely to a large extent on traditional Information Retrieval (IR) paradigms based on matching query and document keywords and/or categories using formal or informal taxonomies. In general, such an approach focuses on text-based keyword tokens that are matched using variations of Boolean, vector space, or probabilistic models, augmented by additional document- or context-derived metadata, complex heuristics, or classification schemes.
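As one concrete instance of the vector space model mentioned above, the following sketch ranks documents by cosine similarity between TF-IDF vectors. It is an illustration only: the smoothed IDF formula is one common variant chosen here as an assumption, and the function names (tfidf_vectors, cosine, rank) and example documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return per-document TF-IDF vectors and the IDF table used to build them."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    # Smoothed IDF (an assumed variant) so every term keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = {
        name: {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for name, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Rank documents by cosine similarity to the query, best first."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scores = {name: cosine(q, vec) for name, vec in vectors.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical enterprise documents, made up for illustration.
docs = {
    "memo": "quarterly budget memo for the finance team",
    "wiki": "search index design notes for the search team",
    "mail": "lunch menu for friday",
}

ranked = rank("search index", docs)
```

The score depends only on keyword tokens shared between the query and each document, which is the keyword-centric behavior this paragraph attributes to traditional IR approaches.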
Such solutions typically fail to address additional explicit and implicit metadata (e.g., user, community, or automated tags, entity semantic structure, and the like). In addition, the opinions and experiences of other users (e.g., experts, communities, informal roles, trustworthiness, and the like) who have performed similar searches are not efficiently employed in these solutions.