1. Field of the Invention
The present invention relates generally to data searching methods and systems, and more particularly, to enterprise searching.
2. Discussion of the Related Art
As computer systems track and store more and more data in databases or other digital formats, search technology for searching through and finding items within massive quantities of stored data has become essential for data-driven systems. Enterprise searching, as commonly known in the art, is the practice of identifying and enabling specific content (files) across multiple enterprise-type sources, such as databases and intranets, to be indexed, searched, and displayed to authorized users.
Stored enterprise-type files may include many different file formats, such as HTML, PDF, XLS, DOC, PPT, TXT, JPG, PNG, TIF, etc. Microfiche is also still in use but has mostly been converted to other formats such as JPG and PDF. Each file may contain information of potential interest and needs to be searchable regardless of format. Many of these file formats are not readily searchable in their native format.
A currently common practice for storing data on a computer system is to scan a hardcopy document into PDF form using a photocopier or other scanner. Using this method, the text on the document pages is captured as an image. The text can then no longer be retrieved without running the imaged document through an optical character recognition (OCR) converter.
Databases may be very large: for example, a single year's worth of data in an exemplary database may exceed 100,000 files. Over many years, millions of files may accumulate that require search capability.
Data is stored in three basic types: structured, semi-structured and unstructured. Structured data includes most data found in fields of a typical Structured Query Language (SQL) database. SQL has been the primary database technology of the last 30+ years. For instance, a field in an SQL database might be called zipcode and another field might be called paydate. The zipcode field will only contain zipcodes. The paydate field will only contain dates. This guarantee of field contents gives structure to the data, and makes it possible to run unambiguous SQL commands against the data with a high degree of certainty as to the meaning of the results.
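The guarantee of field contents described above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name payments, the sample rows, and the helper paydates_for_zip are hypothetical and chosen only for illustration; the field names zipcode and paydate come from the example above.

```python
import sqlite3

# In-memory database. The table name and sample rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (zipcode TEXT, paydate TEXT)")
conn.executemany(
    "INSERT INTO payments (zipcode, paydate) VALUES (?, ?)",
    [("30301", "2020-01-15"), ("30301", "2020-02-15"), ("94105", "2020-01-20")],
)

def paydates_for_zip(zipcode):
    """Return every pay date recorded for one zip code, in date order."""
    rows = conn.execute(
        "SELECT paydate FROM payments WHERE zipcode = ? ORDER BY paydate",
        (zipcode,),
    )
    return [row[0] for row in rows]

print(paydates_for_zip("30301"))
```

Because each field holds only one kind of value, the query has a single unambiguous meaning and needs no interpretation of free text.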
Text is one example of unstructured data. Text can contain many different types of information, and is ambiguous in the following sense: there are many different word combinations that can express the same information. Whenever a database has a text field (like a comment field), or has a document attachment, the text contained therein is considered unstructured. Unstructured text is difficult to process into information as compared to structured data.
Semi-structured data is a combination of structured and unstructured data. An example of semi-structured data is when an organization attempts to structure text passages by embedding metacodes within the text. Metacodes can allow the text to be searched more easily, assuming the metacodes are accurately chosen and placed in the text. Metacoding can be very tedious to implement.
There are five levels of technology that can be applied to the general search problem. Some are easy and commonly applied. Others are somewhat difficult, and others are so difficult or expensive that they are rarely found in applications.
The first level of search technology is databases. When databases came into common usage in the 1970s, they were a great boon to business: it became possible to have electronic invoicing, payroll, etc. These databases were, at first, highly structured. As long as the database was highly structured, relevant data could be retrieved with straightforward search commands and no ambiguity in the results.
Later, databases were used to store text comments, descriptions, etc. Most of these text fields were printed on forms, or perhaps used by online customer service employees. The need to search through this information was minimal.
In the 1980s and 1990s, document control systems were developed to help manage the fast-growing quantity of documents, reports, publications, etc. These databases stored pointers to electronic versions of written documents. This started the wave of unstructured data. Searching this large quantity of unstructured text proved to be difficult.
Metacoding may be used to provide some structure to the unstructured text, but it is a time-consuming and tedious procedure to implement, as a human must read each document and code it with keywords, which are then put into a database to facilitate searching.
The second level of search technology is text indexing. Text indexing allows users to rapidly find, for instance, all documents that have a given word. To form the index, a computer goes through all the text in the documents of the corpus and creates an index for each found word. For example, an index allows the computer to rapidly return all the documents that contain the exact words “valve” and “failure”. This is a big step forward, as it reduces the pile of potential documents by 90% to 99% on average: a user does not have to read through each document looking for the search terms.
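A minimal sketch of such a text index follows, assuming a small in-memory corpus. The function names tokenize, build_index, and search_all and the sample documents are hypothetical; a production index would also handle storage, updates, and ranking.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word found in the corpus to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search_all(index, terms):
    """Return the ids of documents that contain every exact search term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()

# Hypothetical sample corpus.
docs = {
    "d1": "Valve failure reported at pump station",
    "d2": "The valve was replaced; no failure occurred",
    "d3": "Valves failed during inspection",
    "d4": "Payroll summary for March",
}
index = build_index(docs)
print(sorted(search_all(index, ["valve", "failure"])))
```

In this sample, document d3 is not returned because it contains only "valves" and "failed", not the exact search words.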
Text indexing is the current, normal ‘state of the art’ for most search operations. It is a common stopping place, technologically speaking: when you see a search box, say, on a web site, you are likely using a search technology at this level.
Text indexing has very serious limits: it will miss many relevant results and, even more seriously, it will include many completely irrelevant results.
The third level of search technology is variation indexing, i.e. including variations of a word. A variation of a word is called a ‘stem’ of that word. For instance, a stem of the word ‘valve’ is ‘valves’. Stems of the word ‘failure’ include ‘fails’, ‘failing’, ‘failed’, ‘failures’, etc. If you have a text index to a corpus, including stems of search terms before retrieval will return files that would be missed with the simpler text indexing.
However, the number of irrelevant results will also increase. For example, if a user is searching for a document regarding failure of a valve, searching for documents including stems of “valve” and “failure” may return many files including those variations but not actually including information relevant to valve failure.
Although word stem search is relatively easy to implement, it is not seen often except in connection with higher search technologies, as variation search increases the number of results returned, but also increases the number of irrelevant search results.
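The effect of including variations before retrieval can be sketched as follows. The VARIATIONS table is a hypothetical, hand-written lookup that reuses the example words above; an actual system would more likely derive variations with a stemmer or lexicon. The function names and sample documents are likewise assumptions.

```python
import re
from collections import defaultdict

# Hypothetical hand-written variation table, for illustration only.
VARIATIONS = {
    "valve": {"valve", "valves"},
    "failure": {"failure", "failures", "fails", "failing", "failed"},
}

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search_with_variations(index, terms):
    """Return documents containing at least one variation of every term."""
    result = None
    for term in terms:
        matches = set()
        for variant in VARIATIONS.get(term, {term}):
            matches |= index.get(variant, set())
        result = matches if result is None else result & matches
    return result or set()

# Hypothetical sample corpus.
docs = {
    "d1": "Valve failure reported at pump station",
    "d2": "Valves failed during inspection",
    "d3": "The failed audit did not involve any valves",
    "d4": "Payroll summary for March",
}
index = build_index(docs)
print(sorted(search_with_variations(index, ["valve", "failure"])))
```

In this sample, d2 is now found even though it lacks the exact search words, but d3 is also returned although it is not about valve failure, which illustrates the increase in irrelevant results.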
The fourth level of search technology is word frequency indexing, which analyzes how often words occur. For example, if a text document uses the word ‘nuclear’ several times, it is likely that at least a portion of the document has something to do with nuclear substance, and if the document also includes certain words multiple times like ‘plant’, ‘engineering’, ‘energy’, and/or ‘reactor’, then the document can be classified to a high degree of specificity. Implementation of word frequency indexing is more difficult, because some words like ‘the’, ‘and’, ‘a’ are used frequently, so a great deal of statistical work is required to make this level of technology function properly.
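One way to discount very common words is term frequency multiplied by inverse document frequency (TF-IDF). The text above does not name a specific statistic, so the choice of TF-IDF here is an assumption, as are the function names and sample documents; the sketch only shows why raw counts alone are insufficient.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(docs):
    """Return {doc_id: {word: weight}} using term frequency times inverse document frequency."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())
    weights = {}
    for doc_id, word_counts in counts.items():
        total = sum(word_counts.values())
        weights[doc_id] = {
            word: (n / total) * math.log(n_docs / doc_freq[word])
            for word, n in word_counts.items()
        }
    return weights

# Hypothetical sample corpus.
docs = {
    "d1": "the nuclear reactor at the plant and the nuclear energy program",
    "d2": "the payroll report and the invoice for the office",
    "d3": "the valve and the pump at the station",
}
weights = tf_idf(docs)
top_word = max(weights["d1"], key=weights["d1"].get)
print(top_word, weights["d1"]["the"])
```

In this sample, ‘the’ is the most frequent word in d1 but receives zero weight because it appears in every document, while ‘nuclear’ receives the highest weight.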
In fact, this level of search technology is difficult enough that often users are encouraged/required to metacode their documents so that a lower level of search technology can still produce good results.
The fifth level of search technology is co-occurrence indexing. If words occur in proximity to each other, i.e., exhibit co-occurrence, then they provide context and meaning to each other. This requires a large amount of processing power, as for each document not only are all the words analyzed for frequency, but their location must be analyzed relative to all the other words in the document. It is difficult to pre-compute co-occurrences (as pre-computing co-occurrences would require an index for all co-occurring word pairs), so the co-occurrence indexing must be computed at the time of the search.
Latent Semantic Analysis, or LSA, has been used for analyzing ‘co-occurred’ terms. LSA works well for windowing (e.g. excluding documents that include “valve” and “failure” but are not about “valve failure”). LSA also works well for different words that are synonymous in a given context (e.g. “failure” and “leaking”). LSA works by performing mathematical processing on a word set under the general idea that a word is modified or defined by the words surrounding it, i.e. its context. LSA requires a large amount of computer processing capability and also data scientists to create, develop, manage and deploy LSA solutions. LSA does not address the context problem of words with different meanings in context (e.g. “ladder of success” vs. “ladder accident”).
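A minimal sketch of a proximity check computed at search time follows. The window size of three words, the function names, and the sample documents are assumptions chosen for illustration; the check compares every position of one term against every position of the other, which hints at the processing cost noted above.

```python
import re

def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def cooccurs(text, term_a, term_b, window=3):
    """Return True if term_a and term_b appear within `window` words of each other."""
    tokens = tokenize(text)
    positions_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    positions_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    return any(abs(i - j) <= window for i in positions_a for j in positions_b)

def search_cooccurrence(docs, term_a, term_b, window=3):
    """Scan every document at search time and keep those where the terms co-occur."""
    return {
        doc_id for doc_id, text in docs.items()
        if cooccurs(text, term_a, term_b, window)
    }

# Hypothetical sample corpus.
docs = {
    "d1": "The valve failure caused a shutdown",
    "d2": "The valve was inspected in March and later an unrelated power failure delayed the audit",
}
print(search_cooccurrence(docs, "valve", "failure"))
```

Both sample documents contain the two search words, but only d1 has them in proximity, so d2 is excluded.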
Thus, a novel solution is needed that addresses all high-level search requirements without requiring high computing power or high human involvement.