1. Field of the Invention
Exemplary embodiments of the present invention are directed to enhancing interfaces for accessing hidden databases, such as mobile-access interfaces to forms.
2. Description of the Related Art
Today a file system with billions of files, millions of directories and petabytes of storage is no longer an exception [32]. As file systems grow, users and administrators are increasingly keen to perform complex queries [40], [50], such as “How many files have been updated since ten days ago?” and “Which are the top five largest files that belong to John?”. The first is an example of an aggregate query, which provides a high-level summary of all or part of the file system, while the second is a top-k query, which locates the k files and/or directories that have the highest score according to a scoring function. Fast processing of aggregate and top-k queries is often needed by applications that require just-in-time analytics over large file systems, such as data management, archiving, etc. The just-in-time requirement is defined by two properties: (1) file-system analytics must be completed with a small access cost, i.e., after accessing only a small percentage of directories/files in the system (in order to ensure efficiency), and (2) the analyzer holds no prior knowledge (e.g., pre-processing results) of the file system being analyzed. For example, in order for a librarian to determine how to build an image archive from an external storage medium (e.g., a Blu-ray disc), he/she may have to first estimate the total size of picture files stored on the external medium. The librarian thus needs to complete data analytics quickly, over an alien file system that has never been seen before.
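To make the two example queries concrete, the following is a minimal Python sketch that answers them by brute force. It is an illustration only and is not a technique described in this document: the function names (scan_files, count_updated_since, top_k_largest) and the owner_uid parameter are hypothetical, and file size is assumed as the scoring function for the top-k query. Because the sketch visits every file, it does not satisfy property (1) of the just-in-time requirement.

```python
import heapq
import os
import time


def scan_files(root):
    """Yield (path, stat_result) for every non-directory entry under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                yield path, os.stat(path)
            except OSError:
                continue  # entry vanished or is unreadable; skip it


def count_updated_since(root, days):
    """Aggregate query: number of files modified within the last `days` days."""
    cutoff = time.time() - days * 86400
    return sum(1 for _path, st in scan_files(root) if st.st_mtime >= cutoff)


def top_k_largest(root, k, owner_uid=None):
    """Top-k query: the k largest files, optionally restricted to one owner.

    Size is the (assumed) scoring function; owner_uid is a numeric user id.
    """
    candidates = (
        (st.st_size, path)
        for path, st in scan_files(root)
        if owner_uid is None or st.st_uid == owner_uid
    )
    return heapq.nlargest(k, candidates)
```

Under these assumptions, count_updated_since(root, 10) corresponds to the first example query and top_k_largest(root, 5, owner_uid=...) to the second, with the numeric user id standing in for "John".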
Unfortunately, hierarchical file systems (e.g., ext3 and NTFS) are not well equipped for the task of just-in-time analytics [46]. The deficiency is in general due to the lack of a global view (i.e., high-level statistics) of metadata information (e.g., size, creation, access and modification time). For efficiency reasons, a hierarchical file system is usually designed to limit the update of metadata information to individual files and/or the immediately preceding directories, leading to localized views. For example, while the last modification time of an individual file is easily retrievable, the last modification time of files that belong to user John is difficult to obtain because such metadata information is not available at the global level.
Currently, there are two approaches for generating high-level statistics from a hierarchical file system, and thereby answering aggregate and top-k queries: (1) The first approach is to scan the file system upon the arrival of each query, e.g., with the find command in Linux, which is inefficient for large file systems. While storage capacity increases at approximately 60% per year, storage throughput and latency have improved much more slowly. Thus the amount of time required to scan an off-the-shelf hard drive or external storage medium has increased significantly over time, to the point of being infeasible for just-in-time analytics. The above-mentioned image-archiving application is a typical example, as it is usually impossible to completely scan an alien Blu-ray disc efficiently. (2) The second approach is to utilize prebuilt indexes which are regularly updated [3], [7], [27], [35], [39], [43]. Many desktop search products belong to this category, e.g., Google Desktop [24] and Beagle [5].
While this approach is capable of fast query processing once the (slow) index building process is complete, it may not be suitable or applicable to many just-in-time applications. For instance, index building can be unrealistic for many applications that require just-in-time analytics over an alien file system. Even if an index can be built up-front, its significant cost may not be justifiable if the index is not frequently used afterwards. Unfortunately, this is common for some large file systems: for example, storage archives or scratch data for scientific applications scarcely require the global search function offered by the index, and may only need analytical queries to be answered infrequently (e.g., once every few days). In this case, building and updating an index is often overkill given the high amortized cost.
There are also other limitations of maintaining an index. For example, prior work [49] has shown that even after a file has been completely removed (from both the file system and the index), the (former) existence of this file can still be inferred from the index structure. Thus, a file system owner may choose to avoid building an index for privacy concerns.
Similar to a large file system, a hidden database with billions of tuples, each defined by thousands of attributes, is now common. As hidden databases grow, users and administrators are increasingly keen to perform complex queries, such as aggregate and top-k queries, which provide a high-level summary of all or part of the database. Fast processing of aggregate and top-k queries is often needed by applications that require just-in-time analytics over large hidden databases. Just-in-time analytics, therefore, are desirable for hidden databases as well as large file systems.
Crawling and Sampling from Hidden Databases:
There has been prior work on crawling as well as sampling hidden databases using their public search interfaces. Several papers have dealt with the problem of crawling and downloading information present in hidden text-based databases [1, 8, 23]. [2, 20, 25] deal with extracting data from structured hidden databases. [11] and [24] use query-based sampling methods to generate content summaries with relative and absolute word frequencies, while [17, 18] use a two-phase sampling method on text-based interfaces. [10, 12] discuss top-k processing which considers sampling or distribution estimation over hidden sources. In [13, 14], techniques have been developed for random sampling from structured hidden databases, leading to the HIDDEN-DB-SAMPLER algorithm. Techniques to thwart such sampling attempts have been developed in [15].
Sampling and Size Estimation for Search Engine's Corpus:
The increase in popularity of search engines has motivated the research community to develop techniques to discover their contents. [21, 28] studied the use of the capture-recapture method to estimate the index size of a search engine. [7] employed Monte Carlo methods to generate a near-uniform sample from the search engine's corpus, while taking into consideration the degrees of documents and cardinalities of queries. With approximate document degrees, techniques for measuring search engine metrics were proposed in [5]. Sampling of online suggestion text databases was discussed in [6] to significantly improve the service quality of search engines and to study users' search patterns.
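For readers unfamiliar with capture-recapture, the following is a minimal Python sketch of the textbook Lincoln-Petersen form of the estimator. It is an illustration only: the cited works [21, 28] may use different variants, the function name lincoln_petersen is hypothetical, and the sketch assumes that the two samples are drawn independently and roughly uniformly from the corpus. If n1 and n2 documents are sampled and m documents appear in both samples, the corpus size is estimated as n1 * n2 / m.

```python
def lincoln_petersen(sample_a, sample_b):
    """Estimate population size from two independent samples of item ids.

    Assumes both samples are drawn (approximately) uniformly at random
    from the same population; duplicates within a sample are ignored.
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        # With no recaptured items the estimator is undefined.
        raise ValueError("samples share no items; estimate is undefined")
    return len(a) * len(b) / overlap
```

The fewer items the two samples share, the larger the estimate, which is why the estimate becomes unstable when the overlap is very small.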
Information Integration and Extraction for Hidden Databases:
A significant body of research has been done on information integration and extraction over deep web data sources such as hidden databases (see tutorials in [29, 32]). Nonetheless, to the best of our knowledge, the only prior work which directly tackles the attribute domain discovery problem is [25]. In particular, it proposes a crawling-based technique, the disadvantage of which has been extensively discussed in subsection 3.2.3. Much other work is related but orthogonal to attribute domain discovery. Since there is no space to enumerate all related papers, we only list a few examples closely related to this section. Parsing and understanding web query interfaces has been extensively studied (e.g., [33, 40]). The mapping of attributes across different web interfaces has also been addressed (e.g., [36]). Also related is the work on integrating query interfaces for multiple web databases in the same topic area (e.g., [34, 35]). This section provides results orthogonal to these existing techniques, as it represents the first formal study on attribute domain discovery over hidden databases.
Data Analytics Over Hidden Databases:
There has been prior work on crawling, sampling, and aggregate estimation over the hidden web, specifically over text [6, 8] and structured [25] hidden databases and search engines [5, 21, 28]. In particular, sampling-based methods were used for generating content summaries [11, 18, 37], processing top-k queries [10], etc. Prior work (see [30] and references therein) considered sampling and aggregate estimation over structured hidden databases. A key difference between these techniques and this section is that the prior techniques assume full knowledge of all attribute domains, while this section aims to integrate domain discovery with aggregate estimation. As we demonstrated in subsection 3.6, our integrated approach significantly outperforms the direct application of previous techniques [30] after domain discovery.
Enhancement of Web Interfaces:
It is quite common for web databases to provide proprietary form-based interfaces, which may include control elements (e.g., textboxes, drop-down boxes, etc.) that allow users to enter data. For example, the NSF FastLane Award database provides a search form (available at http://www.nsf.gov/awardsearch/tab.do?dispatch=4) having twenty-two control elements, including six drop-down boxes and nine textboxes. Although such forms are typically easy to complete using a conventional computer, such as a desktop or laptop, they are much more difficult to complete using smaller mobile devices (e.g., personal digital assistants (PDAs), smart phones, etc.), due to limitations such as smaller screens, limited keyboard size, touch-screen keyboards, and the like.
Although attempts have been made to address those limitations of mobile devices, those attempts have not adequately addressed all of the problems associated with accessing hidden databases via mobile devices. For example, FIGS. 1A and 1B illustrate two conventional techniques for form field input on a mobile device, which in the example is an Apple iPod Touch®. FIG. 1A illustrates the use of an enlarged spinning-wheel rendered to ease finger scrolling for the drop-down box element “PI State.” Although the spinning-wheel provides for easier viewing and selection of the options from the drop-down box compared to the original web page, it also limits the number of options displayable at any given time. For example, in the NSF FastLane form, there are seventy-four selectable values for the “PI State” drop-down box, which could require a mobile device user to perform a large number of screen swipes to select a value that is lower in the alphabet, such as “Tennessee” or “Texas.” Furthermore, a close investigation of the database reveals that the popularities of selectable values may differ significantly. For instance, a search on a value such as “US Minor Islands” or “Palau” returns no tuples, whereas a search on a value such as “Pennsylvania” or “Texas” returns over 2000 tuples.
FIG. 1B is a screen shot of an auto-complete suggestion technique for a textbox. As illustrated in FIG. 1B, the dictionary-based auto-complete suggestion for the letters “warc” is “Warcraft”, which is not an appropriate suggestion for the form field “Program Manager”. Instead, the form field is requesting the entry of a person's name.