The present invention relates generally to generating query results, as well as to information management systems that can be used with a heterogeneous enterprise environment, which can include structured data in a relational database as well as unstructured data stored in document images and document management applications. Embodiments also relate to applying methods for ranking query results, such as support vector machine (SVM) methods and query relaxation methods, to searches of secure data repositories that contain documents or other data items belonging to numerous heterogeneous enterprise environments. Still further embodiments relate to adjusting ranking functions using machine learning methods that automatically train the ranking functions to obtain improved results for a query, such as in an enterprise search system able to crawl and search heterogeneous enterprise content.
Typically, in an enterprise relational database, query generators are used to construct database queries which are then sent to a database for execution. A user constructs a query by selecting items from a drop-down list of items displayed on the screen. The items may represent data or documents which are to be obtained from a database or a URL, or alternatively the items may represent operations that are to be performed on these data. Once the items have been selected, the query generator then generates a query, usually in Structured Query Language (SQL), for execution by the database. Often, the query consists of keyword searches for documents, or simply of selections from the drop-down list for structured data. However, when enterprise corpora consist of heterogeneous applications, or where the same document has differing attributes emphasized in heterogeneous applications, keyword searches and drop-down list searches of structured data will not meet the needs of a sophisticated enterprise search end user.
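By way of a non-limiting illustration, the behavior of such a query generator can be sketched as follows. The function name `build_query`, the table `documents`, and its columns are hypothetical and are chosen only for this example; the sketch assumes that the drop-down selections arrive as a list of column names and a mapping of filter values, and it uses the `sqlite3` module of the Python standard library as a stand-in for an enterprise database.

```python
import sqlite3


def build_query(table, columns, filters=None):
    """Turn drop-down selections into a parameterized SQL SELECT statement."""
    filters = filters or {}
    if not columns:
        raise ValueError("at least one column must be selected")
    # Identifiers cannot be bound as parameters, so reject anything unusual.
    for name in [table, *columns, *filters]:
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    params = []
    if filters:
        clauses = []
        for column, value in filters.items():
            clauses.append(f"{column} = ?")  # the value is bound, not inlined
            params.append(value)
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Hypothetical selections made by a user from the drop-down lists.
sql, params = build_query("documents", ["title", "author"], {"department": "legal"})

# Execute the generated query against a small in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (title TEXT, author TEXT, department TEXT)")
conn.execute("INSERT INTO documents VALUES ('Lease', 'Kim', 'legal')")
conn.execute("INSERT INTO documents VALUES ('Budget', 'Lee', 'finance')")
rows = conn.execute(sql, params).fetchall()
```

The sketch shows why such a generator is limited to the attributes exposed in the lists: every selectable item maps to a fixed column or filter, and nothing in the generated statement orders the rows by relevance.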
An end user in an enterprise environment will also frequently search huge databases, sometimes external to the heterogeneous enterprise corpora environment. For example, Internet search engines are frequently used to search the entire World Wide Web. Information retrieval systems are traditionally judged by their precision and recall. Large databases of documents, especially the World Wide Web, contain many “low quality” documents whose relevance to the desired search term is extremely low or non-existent. As a result, searches typically return hundreds of irrelevant or unwanted documents that camouflage the few relevant documents that meet the personalized needs of an end user. In order to improve the selectivity of the results, common techniques allow an end user to modify the search, or to provide additional search terms. These techniques are most effective in cases where the database is homogeneous and already classified into subsets, or in cases where the user is searching for well-known and specific information. In other cases, however, these techniques are often not effective.
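For illustration only, precision (the fraction of returned documents that are relevant) and recall (the fraction of relevant documents that are returned) can be computed as sketched below. The function name and the document identifiers are hypothetical, and the sketch assumes that relevance judgments are available as a simple set of identifiers.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments for one query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved, relevant)
# Two of the four returned documents are relevant (precision 0.5),
# and two of the three relevant documents were returned (recall 2/3).
```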
A typical enterprise has a large number of sources of data and a large number of different types of data. In addition, some data may be connected to proprietary data networks, while other data sources may be connected to and accessible from public data networks, such as the Internet. More particularly, information within a single enterprise can be spread across Web pages, databases, mail servers or other collaboration software, document repositories, file servers, and desktops. In enterprise search, different system deployments or different corpora may require different ranking algorithms to return a customized listing of hits to an end user. Providing a simple and intuitive way for customers to customize the ranking of search results in heterogeneous enterprise environments is critical to improving user flexibility and personalization.
One approach to searching heterogeneous enterprise corpora utilizes a secure enterprise search (SES) system, such as may be found in the Oracle® Secure Enterprise Search product from Oracle Corporation of Redwood Shores, Calif., a standalone product or integrated component that provides a simple yet powerful way to search data across an enterprise. An SES system can crawl and index any content and return relevant results in a way that is familiar to users, such as is returned for typical Internet-based search results. An SES system also can provide a query service API, for example, that can easily be plugged into various components in order to obtain a search service for those components.
An SES system can utilize the text index of a database. In one embodiment, a database application accepts documents and generates the lists and other elements useful for text searching. An application programming interface (API) allows a user to submit queries, such as text queries, to search documents based on, for example, keywords.
A query layer can be configured to receive queries from users, applications, entities, etc. These can be any appropriate queries, such as simple text queries entered through a search box or advanced queries. The query layer can convert a user query into the appropriate text queries, making sure security, authorization, authentication, and other aspects are addressed, such that the results are returned to the user based on what the user is allowed to access across the enterprise.
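As a minimal sketch of this behavior, and not a description of any particular product, a query layer can be modeled as a keyword match followed by an access check. The `Document` class, the `allowed_groups` field, and the `secure_search` function are assumptions introduced for this example; a production system would typically push the access check into the index lookup rather than filter afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read this document


def secure_search(query, user_groups, documents):
    """Return ids of documents that match every query term and that the
    user is allowed to access."""
    terms = query.lower().split()
    if not terms:
        return []
    results = []
    for doc in documents:
        words = set(doc.text.lower().split())
        if not all(term in words for term in terms):
            continue  # text does not match the query
        if doc.allowed_groups.isdisjoint(user_groups):
            continue  # user shares no group with the document
        results.append(doc.doc_id)
    return results


# Hypothetical corpus spanning several access groups.
corpus = [
    Document("d1", "annual salary report", frozenset({"hr"})),
    Document("d2", "salary report for engineering", frozenset({"eng", "hr"})),
    Document("d3", "cafeteria menu", frozenset({"everyone"})),
]

eng_hits = secure_search("salary report", {"eng"}, corpus)  # ["d2"]
hr_hits = secure_search("salary report", {"hr"}, corpus)    # ["d1", "d2"]
```

The same query therefore yields different hit lists for different users, which is the property the query layer is meant to guarantee.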
In other applications, search engines typically provide a source of indexed documents (from the Internet or an intranet) that can be rapidly scanned in response to a search query submitted by a user. As the number of documents accessible via an enterprise intranet or the Internet grows, the number of documents that match a particular query becomes unmanageable. Previous approaches to prioritizing search results have relied on keyword priorities and pairs of keywords, which have yielded only limited improvement. However, not every document matching the query is likely to be equally important from the user's perspective. As a result, a user may still be overwhelmed by an enormous number of documents returned by a search engine, unless the documents are ordered based on their relevance to the user's specific query and not merely limited to keywords or pairing of keywords.
Another problem is that differing deployments in a heterogeneous enterprise environment may want to emphasize different document attributes, creating a difficult task for a user attempting to retrieve a particular document. Often, the desired document hit appears only after several pages of results.
One way to order documents is to use a page rank algorithm. Many search engines also provide a relevance ranking, which is a relative numerical estimate of the statistical likelihood that the material at a given URL will be of interest in comparison to other documents. Relevance rankings are often based on the number of times a keyword or search phrase appears in a document, its placement in the document, and the size of the document. However, in the context of differing attributes for the same document in a heterogeneous enterprise environment, such relevance ranking tools do not offer an end user the desired level of configurability and customization.
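A relevance ranking of this kind can be illustrated with the following sketch, which combines keyword frequency (normalized by document size) with the position of the first occurrence. The function name and the weights 0.7 and 0.3 are arbitrary assumptions made for the example and are not taken from any particular search engine; tokenization is a plain whitespace split with no handling of punctuation.

```python
def relevance_score(keyword, text, frequency_weight=0.7, placement_weight=0.3):
    """Score a document for one keyword using frequency, placement, and size."""
    words = text.lower().split()
    if not words:
        return 0.0
    target = keyword.lower()
    positions = [i for i, word in enumerate(words) if word == target]
    if not positions:
        return 0.0
    # Dividing by the document length accounts for the size of the document.
    frequency = len(positions) / len(words)
    # An occurrence near the start of the document scores close to 1.0.
    placement = 1.0 - positions[0] / len(words)
    return frequency_weight * frequency + placement_weight * placement


# Hypothetical documents scored for the keyword "budget".
early = relevance_score("budget", "budget report for the budget committee")
late = relevance_score("budget", "the committee met to discuss the annual budget")
```

Because the weights are fixed inside the scoring function, every user and every deployment receives the same ordering, which is the lack of configurability noted above.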
Ranking functions that rank documents according to their relevance to a given search query are known, and while useful in some settings, these functions do not allow a user in a heterogeneous enterprise environment to personalize ranking results consistently based on the end user's set of preferences, either globally or for a single instance. Therefore, efforts continue in the art to develop ranking functions that provide better search results for a given search query compared to search results generated by search engines using known ranking functions. The problem of allowing an enterprise end user to change ranking functions, so as to customize the ranking of query results returned in a heterogeneous enterprise environment and to obtain personalized rankings of content for a single instance within the enterprise, has remained unsolved.
Another way to improve query results is to utilize an applied ranking support vector machine (SVM) to change ranking functions. In applications related to Internet Web search, for example, ranking functions need to be changed frequently to handle search requirements. In enterprise search, different deployments of search systems or different types of search corpora may require different ranking functions. A query relaxation process can improve query efficiency and generate a hit list with higher relevancy if two conditions hold: forming the feature vector for an early ranking function is cheaper, and the early feature vector and ranking function in combination actually generate more relevant hits. Where both of these conditions are met, providing a query relaxation technique coupled with a machine learning ranking SVM method yields improved query results to an end user in a heterogeneous enterprise environment.
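The interplay of the two techniques can be sketched as follows, purely as an illustration under stated assumptions. The sketch trains a linear ranking function by subgradient descent on the pairwise hinge loss, which is the loss underlying ranking SVM methods, and then applies it in a two-stage relaxation in which the strict first stage uses inexpensive title-only features. All function names, features, training pairs, and documents are hypothetical, and the fixed weight used for the second stage is a placeholder rather than a learned value.

```python
def dot(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise_ranker(pairs, dim, epochs=50, learning_rate=0.1, reg=0.01):
    """Fit weights so each preferred feature vector outscores its partner."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, other in pairs:
            diff = [p - o for p, o in zip(preferred, other)]
            violated = dot(weights, diff) < 1.0  # hinge margin not yet met
            for i in range(dim):
                gradient = reg * weights[i] - (diff[i] if violated else 0.0)
                weights[i] -= learning_rate * gradient
    return weights


def phrase_in_title(query, doc):  # strict stage: exact phrase in the title
    return query in doc["title"]


def title_features(query, doc):  # cheap features: only the title is read
    length = max(len(doc["title"].split()), 1)
    return [1.0 / length, 1.0 if doc["title"].startswith(query) else 0.0]


def all_terms_anywhere(query, doc):  # relaxed stage: terms in title or body
    words = set((doc["title"] + " " + doc["body"]).split())
    return set(query.split()) <= words


def body_features(query, doc):  # costlier features: the body is scanned
    body_words = doc["body"].split()
    return [float(sum(body_words.count(term) for term in query.split()))]


def relaxed_search(query, documents, stages, wanted):
    """Run stages from strict to relaxed; stop once enough hits are found."""
    hits, seen = [], set()
    for matches, features, weights in stages:
        stage_hits = [d for d in documents
                      if d["id"] not in seen and matches(query, d)]
        stage_hits.sort(key=lambda d: dot(weights, features(query, d)),
                        reverse=True)
        for doc in stage_hits:
            hits.append(doc["id"])
            seen.add(doc["id"])
        if len(hits) >= wanted:
            break  # later, more expensive stages are never evaluated
    return hits[:wanted]


# Hypothetical preference pairs: (features of preferred hit, features of other).
title_pairs = [([0.5, 1.0], [0.25, 0.0]), ([0.5, 0.0], [0.2, 0.0])]
title_weights = train_pairwise_ranker(title_pairs, dim=2)

stages = [
    (phrase_in_title, title_features, title_weights),
    (all_terms_anywhere, body_features, [1.0]),  # placeholder weight
]
docs = [
    {"id": "a", "title": "expense report policy", "body": "how to file an expense report"},
    {"id": "b", "title": "travel policy", "body": "expense limits and report deadlines for travel"},
    {"id": "c", "title": "holiday calendar", "body": "office closures"},
    {"id": "d", "title": "expense report", "body": "summary of spending"},
]
top_two = relaxed_search("expense report", docs, stages, wanted=2)
top_three = relaxed_search("expense report", docs, stages, wanted=3)
```

When two hits are requested, the first stage alone satisfies the request and the body text is never scanned; when three are requested, the search relaxes to the second stage and appends its hits after those of the first.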
Therefore, a simple, intuitive, and heuristic method to allow an end user to apply ranking SVM in query relaxation to meet global or single instance search requirements in a heterogeneous enterprise environment query is needed.