Recently, the ability of devices to collect large amounts of data has increased dramatically. This includes consumer devices such as electronic cameras, video recorders and cellular telephones, as well as devices embedded within assemblies and sensors used for tracking product and asset movement. Additionally, the amount of publicly available information has increased geometrically, with the Internet representing the largest distributed collection of data currently available. In addition to the sheer quantity of information, much information now has temporal and spatial attributes, such that the value of the information is far greater when placed in context with other information. Examples include the use of instant location sensing for advertising purposes on cellular telephone networks, and the ability to mark electronic photographs with the time and location of their creation.
While the data collection and retrieval capabilities of devices and the Internet have grown tremendously, the ability to perform data manipulation and retrieval operations on that data has been slow to catch up. In particular, ad-hoc queries of unstructured information are resource-intensive and require large capital investment in centralized, massively parallel database systems with specialized software. The present invention creates a means by which structured and unstructured data may be stored, retrieved and searched using a highly expandable architecture that retains high performance even as it grows to large sizes. The architecture also provides for managing data sets using resources that may not be constantly or reliably available, or that for security reasons must be duplicated to prevent loss of data. Typical approaches for solving scalability, availability and cost issues include:
Using multiple copies of a database to allow parallel searches on separate systems and thus increase performance and reduce the potential for loss of data. This approach is used by the popular open-source database “MySQL”, wherein a master database server uses a series of read-only “slave” servers to distribute query loads. FIG. 8 is a block diagram of this arrangement. SQL commands 810 are received from a client application 801 and transmitted through a network 802 to a master SQL server 803. The master SQL server 803 dispatches the received SQL commands 811 to one of the slave servers 805-807, normally attempting to find the slave with the lowest overall load factor. Updates received by the master SQL server 803 are applied in parallel to all slave SQL servers 805-807.
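The dispatch behavior described for FIG. 8 can be illustrated with a minimal sketch. The class names `MasterServer` and `ReplicaServer`, the in-memory dictionaries, and the use of a served-query count as the load factor are all assumptions made for illustration; they are not taken from MySQL or from the figure.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaServer:
    """Hypothetical read-only slave server holding a full copy of the data."""
    name: str
    load: int = 0                          # queries served so far (assumed load factor)
    rows: dict = field(default_factory=dict)

    def query(self, key):
        self.load += 1
        return self.rows.get(key)


class MasterServer:
    """Hypothetical master that replicates updates and dispatches reads."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def update(self, key, value):
        # Every update is applied to all replicas so that each holds a full copy.
        for replica in self.replicas:
            replica.rows[key] = value

    def query(self, key):
        # A read is dispatched to the replica with the lowest current load.
        target = min(self.replicas, key=lambda r: r.load)
        return target.name, target.query(key)


replicas = [ReplicaServer("slave-805"), ReplicaServer("slave-806"), ReplicaServer("slave-807")]
master = MasterServer(replicas)
master.update("asset-1", "warehouse A")

# Three consecutive reads are spread across the three replicas.
served_by = [master.query("asset-1")[0] for _ in range(3)]
```

Because every replica must hold the complete database, this arrangement improves read throughput and redundancy but does not reduce the storage required on any single server.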
Distributing known portions of a database across many systems, using a central index to determine the system on which the portions of the database required are located. This approach is used by many Internet search engines, most notably Google, AltaVista and FAST. FAST in particular presents a highly parallel system in which each individual server contains a series of proprietary embedded processors with private memory, each searching a small fragment of the total database. Google uses a massive number of low-cost servers, each of which also routes data packets to reduce network infrastructure loads when switching points in the network must be used. In all cases, a central metadata index is retained which directs applications requesting data to the servers containing that data. FIG. 9 is a block diagram of the functional elements in a typical Internet search engine 900. On the searching side, a web server 901 generates requests from an HTML form which are then handled by query processors 902-904. The query processors 902-904 generate search requests which are then dispatched in parallel to a series of database servers 905-910, each of which contains a subset of the total collection of documents or data being searched. Each database server 905-910 subsequently returns a small set of matching documents; all sets are then merged by the query processors 902-904 and returned to the web server 901 for presentation. On the data collection side, a web crawler 914 collects the text from web pages scattered across the Internet, forwarding these to a series of parsers 911-913, which reduce each page to a set of unique words. These unique words are then stored by the database servers 905-910 for later searches.
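The search and data collection paths described for FIG. 9 can be sketched as follows. This is a simplified illustration only: the `DatabaseServer` class, the modulo rule for assigning documents to servers, and whitespace tokenization are assumptions, and a sequential loop stands in for the parallel dispatch performed by the query processors.

```python
from collections import defaultdict


class DatabaseServer:
    """Hypothetical server holding an inverted index over a subset of documents."""

    def __init__(self):
        self.index = defaultdict(set)      # word -> ids of documents containing it

    def store(self, doc_id, words):
        for word in words:
            self.index[word].add(doc_id)

    def search(self, word):
        return self.index.get(word, set())


def parse(text):
    """Reduce a page to its set of unique lowercase words."""
    return set(text.lower().split())


def ingest(servers, doc_id, text):
    # Assumption: a document is assigned to a server by id modulo server count.
    servers[doc_id % len(servers)].store(doc_id, parse(text))


def query(servers, word):
    # Send the request to every server, then merge the partial result sets.
    merged = set()
    for server in servers:
        merged |= server.search(word.lower())
    return sorted(merged)


servers = [DatabaseServer() for _ in range(3)]
pages = {
    1: "Distributed database search",
    2: "Parallel search engine",
    3: "Relational database",
}
for doc_id, text in pages.items():
    ingest(servers, doc_id, text)

search_hits = query(servers, "search")       # documents 1 and 2
database_hits = query(servers, "database")   # documents 1 and 3
```

Each server stores only the words of its own documents, so the index grows by adding servers, while every query still has to touch all of them.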
Distributing subsets of a database across many systems. This is done in large SQL databases such as Oracle, where a server or set of servers contains a single column or index associated with the rows in a relational database. FIG. 10 is a block diagram of the functional elements within a distributed relational database 1000. In this arrangement, an SQL interpreter and query processor 1001 forwards queries to an index database server 1002, which returns the row numbers for all matching rows in a table. The SQL query processor 1001 subsequently retrieves the columns for the matching rows from the database servers 1003-1006 that contain the individual columns of each table.
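The two-step lookup described for FIG. 10 can be illustrated with a short sketch. The `IndexServer` and `ColumnServer` classes, the `select_where` helper, and the sample table are hypothetical and exist only to show the flow of row numbers from the index server to the column servers.

```python
class IndexServer:
    """Hypothetical index server mapping an indexed value to row numbers."""

    def __init__(self):
        self.entries = {}

    def add(self, value, row):
        self.entries.setdefault(value, []).append(row)

    def lookup(self, value):
        return self.entries.get(value, [])


class ColumnServer:
    """Hypothetical database server storing a single column of a table."""

    def __init__(self, name, values):
        self.name = name
        self.values = values

    def fetch(self, row):
        return self.values[row]


def select_where(index, columns, value):
    # Step 1: the index server returns the row numbers of all matching rows.
    rows = index.lookup(value)
    # Step 2: each matching row is assembled from the separate column servers.
    return [{c.name: c.fetch(r) for c in columns} for r in rows]


cities = ["Austin", "Boston", "Austin"]
columns = [ColumnServer("name", ["Ann", "Bob", "Cy"]), ColumnServer("city", cities)]

index = IndexServer()
for row, city in enumerate(cities):
    index.add(city, row)

result = select_where(index, columns, "Austin")
```

In this sketch a query on a non-indexed column would require scanning a column server directly, which is one reason ad-hoc queries are costly in such an arrangement.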
In addition to the difficulties presented by managing large data sets, present database engines are not typically designed to allow flexible data retrieval using unstructured source data, such as word processing documents, images, audio files, and other types of information that do not fall within the traditional model of a relational database. In an unstructured database system, objects with a series of attributes are stored, and the attributes are made searchable for later retrieval.
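The notion of storing objects together with searchable attributes can be sketched as follows. The `AttributeStore` class, its `put` and `find` methods, and the sample attribute names are assumptions chosen for illustration and do not describe any particular database product.

```python
from collections import defaultdict


class AttributeStore:
    """Hypothetical store of opaque objects with searchable attributes."""

    def __init__(self):
        self.objects = {}                       # object id -> raw payload
        self.by_attribute = defaultdict(set)    # (name, value) -> object ids

    def put(self, object_id, payload, **attributes):
        self.objects[object_id] = payload
        for name, value in attributes.items():
            self.by_attribute[(name, value)].add(object_id)

    def find(self, **criteria):
        # An object matches only if it carries every requested attribute value.
        matches = None
        for pair in criteria.items():
            ids = self.by_attribute.get(pair, set())
            matches = ids if matches is None else matches & ids
        return sorted(matches or [])


store = AttributeStore()
store.put("photo-1", b"jpeg bytes", kind="image", year=2004)
store.put("memo-7", b"memo bytes", kind="document", year=2004)
store.put("photo-2", b"jpeg bytes", kind="image", year=2005)

images = store.find(kind="image")
images_2004 = store.find(kind="image", year=2004)
```

The payload itself is never interpreted; only the attributes supplied at storage time are searchable, which is what distinguishes this model from a fixed relational schema.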
Internet search engines represent a special case of the unstructured search engine, where an inverted index of the content of web pages and word processing documents is created. However, each web page has more attributes associated with it than the words within the page and the frequency of occurrence of those words. There are also attributes such as the date and time of creation, the total size of the page, implied and implicit concepts in the content, abstracts, the physical and logical location of the page, and other information. In most systems, this information is not stored.
Another difficulty experienced by current unstructured databases is the lack of a temporal or physical capability; that is, current systems do not identify the physical location of a document, or the content of the document through successive revisions. For example, Google provides a searchable index of the World Wide Web as of the last indexing operation; historical content is not maintained. Even if such information were maintained by a search engine provider such as Google, the methods employed by Google for data storage and retrieval would not be capable of managing the ever-expanding storage requirements.
There is a need for a scalable, robust and flexible database architecture which provides a means of storing large quantities of unstructured data.