Enterprises continue to store and manage their data in a variety of disparate ways. One such way is within relational databases using relational database management systems (RDBMSs). The tabular, normalized data stored in such RDBMSs is commonly referred to as structured data. For example, an enterprise may format, cleanse, conform, and store its sales records and customer information as structured data within an RDBMS. A variety of well-known tools have been developed in the art for intelligently accessing such structured data, typically based on standardized data languages such as the Structured Query Language (SQL).
However, it is commonly estimated that such tabular structured data represents only a tiny fraction of the totality of an enterprise's stored data. The remainder typically consists of unstructured data whose storage is usually spread among a variety of different file systems and storage means within the enterprise. An explosion of unstructured objects and documents has left many enterprises with a serious case of “information overload”. Intelligent and unified access to all of this structured and unstructured data has posed a difficult challenge. Contributing to this difficulty is the fact that, in many enterprises, storage of unstructured data is managed separately from the databases, often by different organizational units. A major challenge that many organizations face is to efficiently and effectively integrate the structured data in their relational databases with this relatively unorganized mass of unstructured data, including blobs. Structured data can provide answers to relatively straightforward questions like “what?”, “where?”, “when?”, and “who?”; with text analytics, unstructured data can answer more complex questions like “why?”.
FIG. 1 illustrates this problem. In many enterprises, there is very little organization as to where documents are located among a number of different servers spread throughout the enterprise. For example, the storage space 102 within which an enterprise stores its data may be spread among separate components such as a Document Management System A 104, a Network File Server B 106, and an Application Server C 108. To access and locate desired documents within this storage space, a user 100 will likely be forced to use a different tool for each of the different components (e.g., using a custom application to access system 104, using a software product such as Windows Explorer to access server 106, and using a custom Application Programming Interface (API) to access server 108). To conduct a search for data on the Internet 110, still another tool would likely be used (e.g., a web search tool such as Google). With such a jumble of document locations and access means, the user must not only know where within the storage space 102 the documents of interest are located but also be proficient in working with a number of different tools for accessing the disparate components 104, 106 and 108. Further still, with enterprise search capabilities like those depicted in FIG. 1, the user cannot directly access and correlate his or her searches with other enterprise data that is stored in relational databases.
When a user's search includes some form of full-text search, the software that supports such full-text querying will often take a relatively long time to complete, particularly when the query requires scanning the entire bodies of many large documents. This slowness is due, in part, to inherent constraints on the performance of general purpose processors (GPPs) when executing traditional software. Current indexing techniques also have important limitations in delivering “find-ability”. Although indexing can be somewhat helpful in locating relevant documents, searching for mis-spellings, alternate spelling variations, regular expressions, or a large number of terms is not easily or quickly accomplished with current indexing solutions, and the time needed to create an effective index often becomes intractable. To state it differently, in order to build an effective index to help find something, it must be known beforehand what one is trying to find. For example, conventional systems offer no easy or standard way to search for mis-spellings. These problems are compounded in situations where the data are dynamic or constantly changing.
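The indexing limitation described above can be illustrated with a minimal sketch. The document set, function names, and similarity threshold below are hypothetical and chosen only for illustration; the approximate matcher uses Python's standard `difflib` module as a stand-in and is not presented as the technique of any particular search product. The sketch shows that an exact-term index returns only documents containing the precise token that was indexed, whereas finding a mis-spelled variant requires scanning and comparing every token of every document.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical documents; document 2 contains mis-spelled terms.
documents = {
    1: "quarterly revenue report for the northeast region",
    2: "quartely revenu summary prepared by the sales team",
    3: "customer complaint about a delayed shipment",
}

def build_index(docs):
    """Build a simple exact-term inverted index: token -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index

def index_lookup(index, term):
    """Return ids of documents whose indexed tokens exactly equal `term`."""
    return sorted(index.get(term, set()))

def approximate_scan(docs, term, threshold=0.8):
    """Scan every token of every document and keep near matches.

    The threshold is an illustrative assumption, not a recommended value.
    """
    hits = []
    for doc_id, text in docs.items():
        if any(SequenceMatcher(None, term, token).ratio() >= threshold
               for token in text.split()):
            hits.append(doc_id)
    return sorted(hits)

index = build_index(documents)
exact_hits = index_lookup(index, "revenue")        # index sees exact tokens only
fuzzy_hits = approximate_scan(documents, "revenue")  # full scan finds the variant

print("exact index lookup:", exact_hits)
print("approximate full scan:", fuzzy_hits)
```

In this sketch the index lookup is fast but misses document 2, while the approximate scan finds it at the cost of comparing the query against every token in the collection, which is the kind of work that becomes slow on large document bodies.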
With respect to structured data, SQL has enjoyed widespread deployment within industry because of its ability to provide a standardized, consistent programming interface to many relational databases. However, the inventors herein recognize that current attempts to standardize the integration of SQL for structured data with full-text search capabilities (or other processing capabilities such as text analytics and text mining) on unstructured data have shown a need for improvement. The implementations of these attempts often evidence performance bottlenecks. Several efforts have arisen to extend standard SQL to integrate structured, tabular data and various forms of unstructured data. Examples include SQL/XML for relational access to semi-structured XML data, SQL/MM for unstructured multimedia data, SQL/MED for unstructured external data, and XQuery 1.0 and XPath 2.0 Full-Text 1.0 for searching XML data using regular expressions, wildcards, stemming, thesaurus and boolean operations. The inventors herein believe that these SQL extensions' abilities to deal with unstructured data largely represent an inconsistent and mixed jumble of dialects, which has hindered their widespread adoption in the IT industry. In the inventors' opinion, it is likely that serious performance issues have often slowed these standardization efforts.
The widespread adoption of SQL has also led to the development of a number of business intelligence (BI) reporting tools. The inventors believe that these reporting tools' functionality for supporting unstructured text analysis is relatively limited and that a need exists in the art for improvements in this area. Most of these software tools have relatively modest abilities to perform full-text searches on unstructured data and other advanced text mining and analytics. The inventors reiterate their belief that the tools' performance has not been particularly efficient.