Database systems may be configured to store data that is organized hierarchically. Examples of such hierarchically-organized data include file systems or repositories where files are organized in a folder tree, and XML data where XML nodes are organized as parent and child nodes, etc. The elements of hierarchically-organized data are herein referred to as “resources”, non-limiting examples of which include files, folders, and XML nodes. Resources that refer to other resources, or that are parents of other resources, are referred to herein as “container resources”, or simply “containers”. A particular resource may be associated with an identifier that uniquely identifies the resource from among a group or collection of resources that includes the particular resource.
Hierarchical data within a database system may be exposed to queries in any number of ways. For example, the Oracle XML DB exposes hierarchical data using predefined public views, called RESOURCE_VIEW and PATH_VIEW. These public views are described in more detail in the Oracle XML DB Developer's Guide, 10g Release 2, Part Number B14259-02, Chapter 22, accessed on Jul. 9, 2009, at http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14259/xdb18res.htm#sthref2107, the contents of which are incorporated by reference in their entirety for all purposes as if fully set forth herein.
The following example query selects resources from a public view, which exposes hierarchical data to queries in a manner that may be similar to RESOURCE_VIEW described above. The example query selects resources that are authored by SCOTT and that are in a subtree rooted at “/public”.
select extractValue(v.res, '/Resource/DisplayName')
from view v
where under_path(v.res, '/public') = 1
  and extractValue(v.res, '/Resource/Author') = 'SCOTT';
If the database system managing the data for this example query includes an index on the ‘Author’ property, then the query optimizer of the database system may choose one of two possible ways to execute the example query. First, the database system may perform an index scan on the ‘Author’ property, and then determine whether each of the resulting resources, i.e., resources authored by SCOTT, falls under the subtree rooted at ‘/public’. Alternatively, the database system may first enumerate the resources in the given subtree, and then determine which of the resources in the subtree are authored by SCOTT, according to the respective ‘Author’ property for each resource.
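The two strategies can be illustrated with a minimal Python sketch over an in-memory toy collection. The resource paths, the `author_index` dictionary, and the helper names are hypothetical and merely stand in for the storage and index structures of a database system. The sketch only shows that both plans return the same rows while examining different sets of candidate resources.

```python
# Toy collection: path -> author. All names and data here are illustrative.
RESOURCES = {
    "/public/a.txt": "SCOTT",
    "/public/docs/b.txt": "ADAMS",
    "/public/docs/c.txt": "SCOTT",
    "/public/e.txt": "ADAMS",
    "/private/d.txt": "SCOTT",
}

# Hypothetical secondary index on the Author property: author -> paths.
author_index = {}
for path, author in RESOURCES.items():
    author_index.setdefault(author, []).append(path)


def under_path(path, root):
    """Return True if path lies strictly below root in the hierarchy."""
    return path.startswith(root.rstrip("/") + "/")


def plan_index_first(author, root):
    """Plan 1: index scan on Author, then filter by subtree membership."""
    candidates = author_index.get(author, [])
    rows = sorted(p for p in candidates if under_path(p, root))
    return rows, len(candidates)


def plan_subtree_first(author, root):
    """Plan 2: enumerate the subtree, then filter on the Author property."""
    candidates = [p for p in RESOURCES if under_path(p, root)]
    rows = sorted(p for p in candidates if RESOURCES[p] == author)
    return rows, len(candidates)


# Each call returns the matching rows and the number of candidates examined.
rows1, examined1 = plan_index_first("SCOTT", "/public")
rows2, examined2 = plan_subtree_first("SCOTT", "/public")
```

In this toy data the index-first plan examines three candidates and the subtree-first plan examines four. With different data the comparison could go the other way, which is why the choice depends on statistics about both predicates.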
To determine which plan to choose, the query optimizer generally requires the cost and selectivity of each of the two predicates of the example query, i.e., the ‘under_path’ predicate and the ‘Author’ property predicate. For the predicate on the ‘Author’ property, the cost and selectivity can be determined efficiently using existing relational statistics.
However, because the under_path predicate determines which resources are included in a subtree rooted at a given path in a hierarchical collection, the cost and selectivity determinations for this predicate are based on statistical information about resources in the hierarchical collection. Examples of statistical information that may be used by a query optimizer to determine the most efficient means of accessing hierarchical data include the number of non-container resources under a container resource, the total number of container resources under a container resource, the total number of resources in a subtree, the number of data blocks occupied by a subtree, the average length of resource names in a subtree, etc.
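As a rough illustration of several of the statistics listed above, the following Python sketch walks a small in-memory tree, visiting every resource, and computes per-subtree counts. It then derives a selectivity estimate for an under_path-style predicate. The nested-dictionary representation, the function names, and the selectivity formula (subtree size divided by collection size) are assumptions made for illustration, not the method of any particular database system.

```python
# A container is a dict mapping child names to children; a non-container
# resource is represented by None. This representation is illustrative only.
TREE = {
    "public": {
        "a.txt": None,
        "docs": {"b.txt": None, "c.txt": None},
    },
    "private": {"d.txt": None},
}


def subtree_stats(container):
    """Count resources strictly below a container by visiting every one."""
    stats = {"containers": 0, "non_containers": 0, "name_chars": 0}
    for name, child in container.items():
        stats["name_chars"] += len(name)
        if isinstance(child, dict):
            stats["containers"] += 1
            below = subtree_stats(child)
            for key in stats:
                stats[key] += below[key]
        else:
            stats["non_containers"] += 1
    return stats


def summarize(container):
    """Derive the total count and average name length from the raw counts."""
    s = subtree_stats(container)
    total = s["containers"] + s["non_containers"]
    avg_name = s["name_chars"] / total if total else 0.0
    return {**s, "total": total, "avg_name_length": avg_name}


def under_path_selectivity(root, subtree):
    """Assumed estimate: fraction of all resources that fall in the subtree."""
    all_total = summarize(root)["total"]
    return summarize(subtree)["total"] / all_total if all_total else 0.0


public = summarize(TREE["public"])
selectivity = under_path_selectivity(TREE, TREE["public"])
```

Here the subtree under `public` contains one container and three non-container resources, so the assumed selectivity estimate is 4 of the 7 resources in the whole tree. Because `subtree_stats` touches every resource, its running time grows with the size of the collection.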
Traditionally, a database system inspects every resource of a hierarchical collection to gather statistical information for the collection. Resources in such a collection are changed, i.e., added, removed, and renamed, on a regular basis. Therefore, statistics for the collection should also be gathered regularly, e.g., daily, to ensure that the query optimizer has current statistical information for the collection, which allows the query optimizer to effectively choose optimal query plans for data in the collection.
However, collecting such statistical information for a collection consumes system resources and time, which can interfere with other processing on the collection. This problem is exacerbated in hierarchical collections containing large amounts of data, since the amount of system resources and time attributable to collecting statistical information usually increases as the amount of data in a collection increases.
Statistical information is often gathered during scheduled system “down times” or during times of minimal user activity to reduce interference with client processing. However, the task of collecting statistical information that is initiated during a down time cannot always be completed during the allocated time period. When the collection of statistics cannot be completed during the allocated time period, either the collection task is allowed to continue until the task is completed, or the collection task is prematurely terminated. Allowing the collection of statistics to continue beyond scheduled down times can interfere with other time-critical processing. On the other hand, prematurely terminating the collection of statistical information can be problematic because collecting statistics generally cannot be stopped and restarted at a later time since the data for which statistics are being collected may change before the next scheduled down time. As a result, a prematurely terminated collection of statistical information is completely re-executed at a later time.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.