1. The Field of the Invention
This invention relates to systems, methods, and computer program products for improving computerized search functions by synchronizing document metadata using directory services.
2. Background and Relevant Art
Computerized environments have increased the efficiency with which people perform a wide variety of tasks. For example, computers and computer networks have vastly improved the speed and capabilities with which people communicate ideas to each other. Computerized systems also provide people with enhanced tools for fixing thoughts of varying complexity into an easily accessible medium, and these tools provide far more options than typewriters, pens, pencils, and notepads. Thus, computerized systems greatly enhance information access and authoring power. In these regards, the advantages of computerized systems are well known.
With regard to information creation, one can author (or create) information simply by typing one or two basic text paragraphs in a document that the author may wish to send to another person over electronic mail (E-Mail). In other cases, one or many authors may generate thousands of pages in a word processing document, where the word processing document may include several spoken languages, may contain several graphics and other multi-media content, and may comprise a wide variety of electronic formats. In any case, common electronic tools such as a word processor, a web page creator, a text form, and so on help authors affix huge amounts of information into a wide variety of accessible electronic media.
Computerized systems have also enhanced the speed and ability to locate and access this information created by others. Information is accessed and distributed using any one of a number of different techniques and applications, including electronic mail; distributed networks including the Internet and corporate intranets; database storage and access systems; and the like. However, the overwhelming amount of data and information that is accessible has given rise to problems. In particular, the ability to specifically locate a relevant piece (or pieces) of information, such as a document, from a large and distributed database of information, such as the Internet or a large corporate intranet, has proved to be increasingly difficult.
To address this problem, various types of search tools, sometimes referred to as “search engines,” have been developed. While any one of a number of different techniques and search algorithms may be used, in general users of a search engine enter one or more search terms, and search results corresponding to those terms are returned by the search engine. In a typical implementation, a person may visit an Internet or local webpage that employs a functional text-input box. The person enters one or more search terms into the text box, and the search engine may return to the person one or more related documents, depending on the specificity and nature of the search request.
Search engine implementations vary in complexity and capability. A very simple search engine, for example, may only search for exact spellings of certain words within an opened document. Thus, if a user were to type the misspelled word “medixcal” into a search box, the simple search engine will not likely return any results, or point the user to any meaningful point in the document, unless that exact misspelling has been made within the document. A more complex search engine, however, may allow a user to search millions upon millions of documents based on a wide variety of criteria, even allowing the user to add detailed restrictions, all the while compensating for misspellings. For example, a complex search engine may allow a user to search millions of documents on a local or wide area network for the terms “Fyre Engine”+“fire pole”+“Dennis Finch”, with the restriction that all results must be in English, and that the resultant web page be created after the year 1998. In some cases, the search engine may even correct the spelling of “Fyre Engine” to “Fire Engine”, prior to executing the search.
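The contrast between the two kinds of engine can be sketched in a few lines of Python. This is an illustration only, not a description of how any particular search engine works: the sample text and the function names are hypothetical, and the standard-library `difflib.get_close_matches` merely stands in for whatever spelling compensation a real engine might use.

```python
import difflib

# Hypothetical document text, made up for illustration.
DOCUMENT = "the medical report was filed on monday"

def exact_search(text, term):
    """Simple engine: position of the exact spelling, or -1 if absent."""
    return text.find(term)

def corrected_search(text, term):
    """More forgiving engine: snap the term to the closest word in the text."""
    vocabulary = sorted(set(text.split()))
    candidates = difflib.get_close_matches(term, vocabulary, n=1)
    if not candidates:
        return None, -1
    best = candidates[0]
    return best, text.find(best)

exact_position = exact_search(DOCUMENT, "medixcal")
corrected_term, corrected_position = corrected_search(DOCUMENT, "medixcal")
print(exact_position)
print(corrected_term, corrected_position)
```

The exact search fails because the misspelling never occurs in the text, while the corrected search first maps the misspelled term onto a word that does occur.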
FIG. 1 illustrates a prior art depiction of one example of an implementation of a search engine. In this example, the search engine algorithm first obtains several documents (or any analogous discrete unit of information) 105, 110 into a database such as index service 100. In one approach, a user may select and enter documents 105 and 110 manually into the index service 100 to be processed. Alternatively, the index service (or related search service) may have a function that automatically locates and obtains documents. This function is sometimes referred to as “crawling,” where the service continually “crawls” across multiple documents on the network by following document reference links within certain documents, and then processing each document as found. The index service 100 processes the documents by identifying key words or general text in the documents 105, 110, and then creating an inverted list 120 (more generally, an “index”).
An exemplary inverted list can be one or more electronic reference documents having a column containing a list of key words, a column identifying one or more documents containing each key word, a column for the number of occurrences of that key word in the respective document, and a column with an address for each associated document. For the purposes of illustration, however, a simpler inverted list 120 is shown having a column of words (A, B, C, etc.), and a column indicating in which document those words can be found.
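An inverted list of the kind just described can be sketched as follows. This is a minimal illustration under stated assumptions: the document addresses and contents are invented, `build_inverted_list` is a hypothetical helper, and key words are taken to be whitespace-separated tokens, which is far cruder than what an actual index service would do.

```python
from collections import Counter, defaultdict

# Hypothetical documents keyed by address; both are made up for illustration.
DOCUMENTS = {
    "//server/docs/doc1.txt": "A B A C",
    "//server/docs/doc2.txt": "B C C X",
}

def build_inverted_list(documents):
    """Map each key word to {document address: number of occurrences}."""
    inverted = defaultdict(dict)
    for address, text in documents.items():
        for word, count in Counter(text.split()).items():
            inverted[word][address] = count
    return dict(inverted)

inverted_list = build_inverted_list(DOCUMENTS)
for word in sorted(inverted_list):
    print(word, inverted_list[word])
```

Each entry combines the columns described above: the key word, the documents containing it, the occurrence count per document, and the document address (used here as the document identifier).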
When a user enters one or more search terms (e.g., “Request for A” 132), a typical search engine 130 will employ an algorithm that first finds the one or more terms among the key words in the inverted list 120, and then weighs the resultant documents associated with any found words in the list. The search engine can then return one or more of the associated document references as results 136 to the user, depending on how the search algorithm is configured (e.g., favoring documents having the most occurrences of the word), or depending on any restrictions the user places on the search (e.g., requiring an exact phrase match). Consequently, search engines can be quite useful for locating and accessing information contained within, for example, a distributed network environment.
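The lookup-then-weigh step described above might look like the following sketch. The inverted list contents, the `search` function, and the scoring rule (sum of occurrence counts across the query terms) are all assumptions chosen for illustration; real engines use considerably more elaborate weighting.

```python
# Hypothetical inverted list: key word -> {document: occurrence count}.
INVERTED_LIST = {
    "A": {"doc1": 2},
    "B": {"doc1": 1, "doc2": 1},
    "C": {"doc1": 1, "doc2": 2},
}

def search(inverted_list, terms):
    """Return documents ranked by total occurrences of the query terms."""
    scores = {}
    for term in terms:
        # Terms missing from the inverted list contribute nothing.
        for document, count in inverted_list.get(term, {}).items():
            scores[document] = scores.get(document, 0) + count
    # Highest score first; ties are broken by document name for stability.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

results_for_a = search(INVERTED_LIST, ["A"])
results_for_c = search(INVERTED_LIST, ["C"])
results_for_unknown = search(INVERTED_LIST, ["Z"])
print(results_for_a, results_for_c, results_for_unknown)
```

A request for "A" returns only the document that contains it, while a request for "C" returns both documents with the heavier user of the word ranked first.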
Search engines such as the foregoing, however, tend to have certain limitations. Since a typical search engine of this kind relies on a generated index to locate documents, the relevance of a given search result is highly dependent on the document content that is used to ultimately construct the index. For example, a document containing only the words “whale,” “fish,” “ocean,” and “ferry” would not be found by some search engines if a user entered the terms “orca,” “tuna,” “sea,” and “transport.” This is because, in general, search engines of the type described do not generate alternate word relationships when building an inverted list. While this type of search engine may provide automatic spell-checking of search terms, it does not automatically search word variants, synonyms, and homonyms, unless the user specifically enters them.
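The limitation can be reproduced with a toy index. In this sketch the document identifier `doc1` and the helper names are hypothetical, and the index deliberately stores only the literal words of the document, as the passage above describes.

```python
# Hypothetical document containing only the four words from the example.
DOCUMENT_WORDS = {"doc1": ["whale", "fish", "ocean", "ferry"]}

def build_index(documents):
    """Index only the literal words; no synonyms or variants are generated."""
    index = {}
    for doc_id, words in documents.items():
        for word in words:
            index.setdefault(word, set()).add(doc_id)
    return index

def search(index, terms):
    """Return every document that contains at least one of the terms."""
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return sorted(hits)

index = build_index(DOCUMENT_WORDS)
# Related terms that never appear literally in the document find nothing.
missed = search(index, ["orca", "tuna", "sea", "transport"])
# The document is found only when the user supplies the literal word.
found = search(index, ["orca", "whale"])
print(missed, found)
```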
In addition, there are other problems that can complicate the amount and quality of data that a search engine can return to a user seeking information. For example, a large organization may have thousands upon thousands of internal documents on various topics posted on various servers on the local or wide area network. While each posted document may contain different metadata corresponding to metadata concepts (i.e., document identifiers) such as author, date created, size, title, etc., each document may be created with different programs that identify metadata properties differently, or describe the underlying data differently. For example, one document might include author metadata as: “Author=‘Heather F. Pettingill’” while another document's metadata might designate the author as: “By=‘H. Pettingill’” while yet another document might contain no author metadata and merely include the phrase “H. F. Pettingill” centered at the top of the first page. Thus, if an index were created that includes author metadata, a subsequent search may not locate some of these documents if a search were performed for the author “H. F. Pettingill.”
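The author example can be made concrete with a short sketch. The document identifiers, the dictionary layout, and the `build_author_index` helper are assumptions for illustration; the point is only that an index keyed on one metadata property name and one spelling of the value misses the other two documents.

```python
# Hypothetical metadata for three documents by the same person.
DOCUMENTS = {
    "doc1": {"Author": "Heather F. Pettingill"},
    "doc2": {"By": "H. Pettingill"},
    "doc3": {},  # no author metadata; the name appears only in the page text
}

def build_author_index(documents):
    """Index documents by the value of their 'Author' property only."""
    index = {}
    for doc_id, metadata in documents.items():
        author = metadata.get("Author")
        if author is not None:
            index.setdefault(author, []).append(doc_id)
    return index

author_index = build_author_index(DOCUMENTS)
# An exact-match author search for this spelling finds none of the three.
hits = author_index.get("H. F. Pettingill", [])
print(author_index, hits)
```

Only `doc1` is indexed at all, and even that document is missed because its stored value is spelled differently from the query.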
Even if the metadata format is standardized within the organization, the underlying data values that employees may use to classify documents within a general concept in the organization can often undergo several changes. For example, employees may refer to several documents under the classification of “Product Design” one year, and then “Manufacturing Policies” the next year when referring to the same general concept or classification. Similarly, a person's name or contact information may change several times over the course of their employment (e.g., due to name changes, email alias changes, new email domain name, new preferences, new office, new workgroup, etc.).
As such, this can degrade the effectiveness of searches, for example, for all documents authored by “Heather Pettingill,” or for all documents discussing product release policies over the last three to five years. For example, with specific reference to FIG. 1, if there were no direct correlation made between the values A and X on inverted list 120 such that A=X (e.g., “Heather Pettingill”=“Heather Martin” due to a marital name change), a normal search for A or X would only return “Doc” or “Doc2” (but not both) as a result 136. Typically, the only way the search engine might return both documents is if the user searched for both terms based on prior knowledge of the term correlation. Of course, this approach is limited by the fact that the user may not realize the correlation, though the user wishes to have all documents authored by the person in question.
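The A and X scenario can be sketched as follows. The two-entry inverted list mirrors the values named in the passage above, while the `search` helper is a hypothetical stand-in for the search engine; nothing here correlates the two terms, so the user must already know about both.

```python
# Hypothetical inverted list with no recorded relationship between A and X.
INVERTED_LIST = {
    "A": ["Doc"],   # e.g., documents authored under "Heather Pettingill"
    "X": ["Doc2"],  # e.g., documents authored under "Heather Martin"
}

def search(inverted_list, terms):
    """Return documents listed under any of the given terms, in term order."""
    results = []
    for term in terms:
        for document in inverted_list.get(term, []):
            if document not in results:
                results.append(document)
    return results

only_a = search(INVERTED_LIST, ["A"])
only_x = search(INVERTED_LIST, ["X"])
# Both documents come back only if the user knows to enter both terms.
both = search(INVERTED_LIST, ["A", "X"])
print(only_a, only_x, both)
```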
Accordingly, there is a need for more robust systems, methods, and computer program products that relate the types of information available to a search engine so that more accurate search results can be obtained, without requiring a user to iteratively search several variations of the same terms and phrases. In addition, there is a need for robust systems, methods, and computer program products that allow users to search returned results for additional relationships, such as by metadata concepts, or classification data.