1. Field of the Invention
The invention relates to computer-based searching, information retrieval, indexing, and storage.
2. Description of Related Art
Decades ago, large amounts of data were stored in a variety of different formats, depending on the application programs that were intended to access the data, the data types, and the preferences of the programmers who created the programs. In 1970, E. F. Codd, working at the IBM Research Laboratory, wrote a seminal paper, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM 13:6 (June 1970), proposing a new, relational way of storing large amounts of data. (That paper is incorporated by reference in its entirety.) Codd suggested that the formats in which data were stored should be independent of particular application programs and consistent across different types of programs. The relational database was born.
In a relational database, data is stored in so-called “tables,” with certain fields in each table acting as searchable index fields and allowing a searcher to “relate” the information in one table with the information in another table. For example, assume that one table, “USERS” identifies users of a shared computing system and contains fields including first name, last name, gender, age, and identification number. Another table in the same database, “USAGE,” contains information on users' use of a resource by identification number, including the fields identification number and usage amount. In that case, the “identification number” field links and provides a relationship between the two tables, such that an interested person could, for example, easily query the database for the names of all users whose usage exceeds a desired threshold, in which case the set of results would include selectively concatenated information from both tables.
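The USERS/USAGE example above can be sketched in a few lines using Python's built-in sqlite3 module. This is only an illustrative sketch: the column names (id, first_name, last_name, usage_amount), the sample rows, and the helper name users_over_threshold are assumptions made for the example and do not come from any particular system.

```python
import sqlite3

def users_over_threshold(threshold):
    """Return (first_name, last_name, usage_amount) for users above a threshold."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    # Two tables that share an identification-number column (hypothetical schema).
    conn.execute(
        'CREATE TABLE "USERS" (id INTEGER PRIMARY KEY, first_name TEXT, '
        "last_name TEXT, gender TEXT, age INTEGER)"
    )
    conn.execute('CREATE TABLE "USAGE" (id INTEGER, usage_amount REAL)')
    # Made-up sample data for illustration only.
    conn.executemany(
        'INSERT INTO "USERS" VALUES (?, ?, ?, ?, ?)',
        [(1, "Ada", "Lovelace", "F", 36),
         (2, "Alan", "Turing", "M", 41),
         (3, "Grace", "Hopper", "F", 85)],
    )
    conn.executemany(
        'INSERT INTO "USAGE" VALUES (?, ?)',
        [(1, 120.0), (2, 45.5), (3, 300.0)],
    )
    # The shared id column relates the two tables; the result combines
    # name fields from USERS with the usage amount from USAGE.
    rows = conn.execute(
        'SELECT u.first_name, u.last_name, g.usage_amount '
        'FROM "USERS" AS u JOIN "USAGE" AS g ON u.id = g.id '
        "WHERE g.usage_amount > ? ORDER BY u.id",
        (threshold,),
    ).fetchall()
    conn.close()
    return rows
```

Calling users_over_threshold(100) on the sample data returns the rows for the two users whose usage exceeds 100, each row drawing fields from both tables.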
Over the years, the use of relational databases in all sectors of industry exploded. Structured Query Language (SQL) evolved to allow database users to make very sophisticated queries of relational databases; essentially, SQL acts as an interface to most modern relational databases. Oracle Corporation was one of the first and most prominent purveyors of enterprise-grade relational database systems, although many competitors emerged. As the Internet age dawned in the 1990s, relational databases became ubiquitous, and open-source (i.e., user-developed, readily shared, and typically low-cost) relational database programs, like MySQL, emerged alongside offerings by major corporations, with SQL itself becoming more and more standardized. Ultimately, SQL databases have been used to handle the back-end processing for most major websites, and continue to be popular solutions.
The advantages of relational databases in general and SQL databases in particular are well documented in the literature. As Codd described, they are independent of the particular application programs that create and access them. The structure of the databases typically provides for relatively fast searching, and their ubiquity makes them easier to create and maintain and provides a variety of software options in the marketplace. Moreover, with recent relational database software, the tables defined in relational databases can often store not only textual data, but other forms of data, including various image, audio, and video files.
As the Internet has grown to maturity, the amount of data stored in and processed by computer systems has increased to the point where a single dataset may involve terabytes or even petabytes of information. Google, Inc., the Internet search company, has been one of the leaders in the science and mechanics of processing large data sets. Google's fundamental innovation in World Wide Web searching was to decide which pages were most relevant or authoritative by measuring how many other pages “linked” to them. By that algorithm, pages that were linked to more frequently were considered to be more authoritative and were presented earlier in the list of search results under most circumstances.
In 2004, two Google engineers, Jeffrey Dean and Sanjay Ghemawat, published a paper entitled “MapReduce: Simplified Data Processing on Large Clusters,” describing a generalized, two-step method for processing a large dataset. That paper is incorporated by reference in its entirety. In a first step, a “map” function parses a dataset to obtain a set of intermediate key/value pairs; in a second step, a “reduce” function merges the intermediate values associated with each key to output a final value or set of values. As one example given in the paper, the map-reduce method may be used to count uniform resource locator (URL) access frequency associated with an Internet site. In that case, the “map” function would process a log of web page requests and output <URL, 1> for each request for a given URL found in the log. The corresponding “reduce” function would add together the values output by the “map” function for the same URL and output the data <URL, total count>. The MapReduce paper provides significant guidance in how to distribute map and reduce operations across a number of networked machines to successfully parse very large datasets.
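The URL access-count example above can be illustrated with a single-machine sketch in Python. This is not the distributed implementation described in the MapReduce paper; the log line format (a request method followed by a URL) and the names map_fn, reduce_fn, and map_reduce are assumptions made for illustration.

```python
from collections import defaultdict

def map_fn(log_line):
    """Emit a (URL, 1) pair for one web request log line.

    Assumes a hypothetical log format of "<METHOD> <URL>".
    """
    url = log_line.split()[1]
    yield (url, 1)

def reduce_fn(url, counts):
    """Add together all counts emitted for the same URL."""
    return (url, sum(counts))

def map_reduce(log_lines):
    """Run map, group the intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for line in log_lines:
        for url, count in map_fn(line):
            groups[url].append(count)
    return dict(reduce_fn(url, counts) for url, counts in groups.items())
```

For example, map_reduce(["GET /index.html", "GET /about.html", "GET /index.html"]) yields a total of 2 for /index.html and 1 for /about.html. In a distributed setting, the map calls and the per-key reduce calls would run on different machines, with the grouping step performed across the network.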
So-called “NoSQL” or “unstructured” databases have developed in parallel with MapReduce and other large dataset processing techniques. These databases deviate from traditional relational databases that use SQL as an interface, either by using an interface other than SQL (e.g., JavaScript or XML) or by not storing data in tables and thus deviating entirely from the relational database model. These databases may be particularly suited for handling large datasets and for facilitating particular large-scale MapReduce operations on stored data. However, their feature sets may not be as robust or as standardized as those of SQL-based relational databases.
The tools for processing large datasets have improved, and techniques for distributing processing tasks over large numbers of networked computers are now well described and commonly used. Nevertheless, current information processing techniques still do little to facilitate a deeper understanding of the information that is processed. For example, they are poorly suited to automatically making connections not only between related points or pages in a dataset, but also between related concepts reflected in the dataset.