Embodiments in accordance with the present invention relate generally to electronic searching of documents and data, and more particularly relate to automatically determining acronym and synonym pairs useful for obtaining more accurate query results.
An end user in an enterprise or web environment frequently searches huge databases. For example, Internet search engines are frequently used to search the entire World Wide Web. Information retrieval systems are traditionally judged by their precision and recall. Large databases of documents, especially the World Wide Web, contain many low-quality documents whose relevance to the desired search term is extremely low or non-existent. As a result, searches typically return hundreds of irrelevant or unwanted documents, which camouflage the few relevant ones that meet the personalized needs of an end user. In order to improve the selectivity of the results, common techniques allow an end user to modify the search, or to provide different or additional search terms. These techniques are most effective in cases where the database being searched is homogeneous or structured and already classified into subsets, or in cases where the user is searching for well-known and specific information. In other cases, however, these techniques are often not effective.
When attempting to locate information such as electronic documents, it is common for a user to enter search terms into a search engine interface, whereby the engine can utilize those terms to search for documents that have matching keywords, text, titles, etc. One problem with such an approach is that there might be multiple ways to express a given term, such that a relevant document might not match the particular form the user entered. For example, a user interested in “real application clusters” might search using a common industry acronym such as “RAC,” which would result in finding only documents that use that particular acronym and not documents that use the full term “real application clusters.” Given a corpus of documents, then, it can be desirable to utilize acronym and synonym pairs to build a thesaurus, whereby relationships between terms can be used by applications such as text mining applications, search engines, etc.
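The mismatch described above, and the way a thesaurus of acronym pairs can address it, can be illustrated with a minimal sketch. The sketch is not taken from any particular embodiment: the names `THESAURUS`, `DOCS`, `expand_query`, and `search`, the sample documents, and the whole-phrase matching rule are all assumptions chosen only for illustration.

```python
import re

# Hypothetical thesaurus: each acronym maps to its full expansion.
THESAURUS = {"rac": "real application clusters"}

# Hypothetical three-document corpus.
DOCS = {
    "doc1": "Configuring RAC for high availability.",
    "doc2": "Real application clusters allow several instances to share one database.",
    "doc3": "Backup and recovery basics.",
}


def expand_query(query, thesaurus):
    """Return the query plus any equivalent form listed in the thesaurus."""
    q = query.lower().strip()
    variants = {q}
    for acronym, expansion in thesaurus.items():
        if q == acronym:
            variants.add(expansion)
        elif q == expansion:
            variants.add(acronym)
    return variants


def search(query, documents, thesaurus=None):
    """Return sorted ids of documents containing any variant as a whole phrase."""
    if thesaurus:
        variants = expand_query(query, thesaurus)
    else:
        variants = {query.lower().strip()}
    hits = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        # Word boundaries keep "rac" from matching inside longer words.
        if any(re.search(r"\b" + re.escape(v) + r"\b", lowered) for v in variants):
            hits.append(doc_id)
    return sorted(hits)
```

Under these assumptions, `search("RAC", DOCS)` matches only the document that uses the acronym, whereas `search("RAC", DOCS, THESAURUS)` also matches the document that uses the full term.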
In enterprise searching, for example, different system deployments or different corpora may define the same terms differently, thus making it difficult to return a customized listing of hits to an end user. Providing a simple and intuitive way to allow customers to improve search results in heterogeneous enterprise environments is critical to improving user flexibility and personalization. One way to improve search results in such an environment is to define and maintain a list of acronym and synonym pairs from disparate sources of data. However, this task is complicated where the context of a term may be different in heterogeneous applications, and where there may be numerous such terms. A customized thesaurus could be manually built for a given corpus of focus, but such efforts would be time consuming and expensive.
Therefore, it is desirable to provide a simple, intuitive, and heuristic method that allows an end user to automatically define and find acronym and synonym pairs to meet global or single-instance requirements for queries in a heterogeneous enterprise environment.