As computing devices continue to become less expensive and more powerful, and as the capacity of data storage devices continues to increase rapidly, more and more data is being generated and stored, oftentimes as structured or semi-structured datasets. A dataset is a collection of data that conforms either to a formal schema (in the case of conventional relational databases) or to an informal conceptual model of the contents (in the case of NoSQL databases, including loose-schemata, semi-formal-schemata, and schema-free conceptual models), wherein the formal schema and/or conceptual model is conventionally defined by the producer or maintainer of the dataset. As used herein, the term “schema” is intended to encompass both a formal schema and an informal conceptual model of the contents of a dataset. As will be understood by one skilled in the art of dataset generation/maintenance, a schema defines the structure and content of the dataset.
As datasets are continually developed, the data community has moved towards making them available online; e.g., providing users with access to datasets by way of the World Wide Web. Currently, many such datasets are made available or reside in the deep Web, which is not indexed by standard search engines. Accordingly, these datasets remain unavailable to most users. The United States government, however, has issued a mandate to allow the general population access to government-related datasets. Further, the academic world is gradually transferring large datasets (such as those produced by genome projects that aim to determine a complete genome sequence of an organism) to publicly accessible computer clouds and university clusters. Still further, private industry has developed new business models that make datasets available to others for certain fees or in exchange for other services. Moreover, technologies have been developed that facilitate sharing data with numerous applications or users that can process such data. An example of such technology is the Open Data Protocol (OData), which supports consumer querying of a dataset over HTTP and the provision of results of the query to a consumer in a variety of formats.
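By way of illustration only, a consumer query of the kind supported by OData can be expressed as an HTTP URL that carries system query options such as `$filter`, `$top`, and `$format`. The following is a minimal sketch and not a definitive implementation; the service root `https://example.org/odata`, the entity set `Datasets`, the property `Year`, and the helper `build_odata_query` are all hypothetical names assumed for the example and do not refer to any actual service.

```python
from urllib.parse import quote, urlencode


def build_odata_query(service_root, entity_set, filter_expr=None, top=None, fmt=None):
    """Build an OData-style query URL (hypothetical helper, for illustration)."""
    params = {}
    if filter_expr is not None:
        params["$filter"] = filter_expr  # restricts which entities are returned
    if top is not None:
        params["$top"] = str(top)        # limits the number of results
    if fmt is not None:
        params["$format"] = fmt          # requested representation of the results
    url = service_root.rstrip("/") + "/" + entity_set
    if params:
        # Keep "$" literal in option names and encode spaces as %20.
        url += "?" + urlencode(params, quote_via=quote, safe="$")
    return url


# Hypothetical service root, entity set, and property name.
query_url = build_odata_query(
    "https://example.org/odata", "Datasets",
    filter_expr="Year gt 2010", top=5, fmt="json",
)
print(query_url)
```

A consumer would issue an HTTP GET request for the resulting URL; the sketch only constructs the URL and does not contact any server.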
While allowing the general population access to various datasets can be beneficial, doing so exacerbates the “drowning in data” problem that users already face. That is, there is currently no effective mechanism that allows users to quickly locate datasets that are of interest to them.