The present invention relates generally to automating data analysis tasks and, more particularly, to analysis tasks that require navigation between dynamic data that has dissimilar structures.
The present invention provides for systems and methods for automatically navigating between dynamic data that has dissimilar structures. The term "dynamic" as used herein refers to frequent change with respect to data, a characteristic that affects the efficiency of navigation techniques. The term "dissimilar structure" as used herein refers to a data structure containing information that is not present in another data structure. Thus, it is said that the first data structure is dissimilar with respect to the second data structure. A problem in the management of distributed systems is described below to illustrate the prior art background. However, it is to be appreciated that the invention has broader applications.
Rapid improvements in both hardware and software have dramatically changed the cost structure of information systems. Today, hardware and software account for a small fraction of these costs, typically less than 20 percent (and declining). The remaining costs relate to the management of information systems, such as software distribution, providing help desk support, and managing quality of service (QoS).
Decision support is critical to the management of information systems. For example, in software distribution, we need to know: (i) which machines require software upgrades; (ii) what are the constraints on scheduling upgrades; and (iii) the progress of upgrades once installation has begun. In QoS management, decision support detects QoS degradations, identifies resource bottlenecks, and plans hardware and software acquisitions to meet future QoS requirements.
Accomplishing these tasks requires a variety of information, such as, for example, QoS measurements, resource measurements (e.g., network utilizations), inventory information, and topology specifications. Collectively, we refer to these information sources as data. Much of this data is dynamic. Indeed, measurement data changes with each collection interval. Further, in large networks, topology and inventory information change frequently due to device failures and changes made by network administrators.
We use the term "dataset" to describe a collection of data within the same structure. For example, a dataset might be organized as a relational table that is structured so that each row has the same columns. Here the data is structured into rows such that each row has a value for every column. A dataset contains multiple "data elements" (hereinafter, just elements), which are instances of data structured in the manner prescribed by the dataset (e.g., a row in a relational table). A group of elements within the dataset is called an "element collection." An element collection is specified by a "collection descriptor" (e.g., an SQL where-clause for a relational table or line numbers for a sequential file). A collection descriptor consists of zero or more "constraints" that describe an element collection. A constraint consists of an "attribute" (e.g., a column name in a relational table or a field in a variable-length record), a relational operator (e.g., =, <, >), and a value.
Due to the diversity of software tools, administrative requirements, and other factors, data is typically grouped into multiple datasets. Thus, decision support often requires navigating from an element collection in one dataset to one or more element collections in other datasets. We refer to these as the "source element collection," "source dataset," "target element collections" and "target datasets," respectively.
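The terminology above can be illustrated with a minimal sketch. It is not taken from the specification: the names `Constraint`, `OPS`, and `select`, the in-memory list-of-dictionaries representation, and the sample rows are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
import operator

# Relational operators a constraint may use (an illustrative subset).
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


@dataclass(frozen=True)
class Constraint:
    attribute: str  # e.g., a column name
    op: str         # one of the keys of OPS
    value: object


def select(dataset, descriptor):
    """Return the element collection: elements satisfying every constraint."""
    return [
        row for row in dataset
        if all(OPS[c.op](row[c.attribute], c.value) for c in descriptor)
    ]


# A tiny hypothetical dataset: each element (row) has a value for every column.
response_times = [
    {"shift": 1, "hour": 8, "user": "ABC", "response_time": 9.5},
    {"shift": 1, "hour": 9, "user": "ABC", "response_time": 1.2},
    {"shift": 2, "hour": 17, "user": "XYZ", "response_time": 0.8},
]

# A collection descriptor is a list of zero or more constraints.
descriptor = [Constraint("shift", "=", 1), Constraint("response_time", ">", 5.0)]
slow_first_shift = select(response_times, descriptor)
everything = select(response_times, [])  # zero constraints select all elements
```

In this sketch a descriptor with zero constraints selects the whole dataset, which matches the "zero or more constraints" wording above.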
With this background, we state one of the problems addressed by the present invention. We are given a source element collection and multiple target datasets. The objective is to find the target element collection that "best matches" the source element collection. By best matches, it is meant that the structure and content of the source element collection is the most similar to that of the target element collection.
To illustrate the problem addressed, we describe a scenario in QoS management. Considered is a situation in which end-users experience poor quality of service as quantified by long response times. The objective is to characterize the cause of long response times by: (i) when they occur; (ii) who is affected; (iii) which configuration elements are involved; and (iv) what components of the configuration element account for most of the delays.
The analyst starts with a dataset containing end-to-end response times. The dataset is structured into the following columns: shift, hour, subnet, host, user's division, user's department, user name, transaction issued, and response time. The analyst proceeds as follows:
Step 1. The analyst isolates the performance problem. This may be done in any conventional manner, such as, for example, as described in R. F. Berry and J. L. Hellerstein, "A Flexible and Scalable Approach to Navigating Measurement Data in Performance Management Applications," Second IEEE Conference on Systems Management, Toronto, Canada, June, 1996. In the example, isolation determines that poor response times are localized to the element collection described by the constraints: shift=1, hour=8, subnet=9.2.15, division=25, department=MVXD, user=ABC, and transaction=_XX. At this point, the analyst has characterized when the problem occurs, who is affected, and which configuration elements are involved.
Step 2. To determine what components of the configuration element account for most of the delays, the analyst must examine one or more other datasets. After some deliberation and investigation by the analyst, the analyst selects a dataset of operating system (OS) measurements that are structured as follows: hour, minute, shift, subnet, division, department, efficiency, waiting time, CPU waits, I/O waits, page waits, and CPU execution times.
Step 3. The analyst selects the subset of the OS data that best corresponds to the response time data. Doing so requires dealing with two issues: (i) the source and target datasets are structured somewhat differently in that the first has transaction information (which the second does not), and the second reports time in minutes (which the first does not); and (ii) the second dataset does not have records for user ABC, the user for which a problem was isolated. To resolve the first problem, the analyst decides to use only the information common to both datasets. So, transaction information and minutes are ignored when navigating from the response time data to the OS data. The second problem is resolved by assuming that users within the same department are doing similar kinds of work, so the constraint on the user is dropped. Thus, the target element collection is described by the constraints: shift=1, hour=8, subnet=9.2.15, division=25, and department=MVXD.
Step 4. The analyst uses the OS data to characterize the host component that contributes the most to response time problems. This characterization reveals that paging delays account for a large fraction of end-to-end response times.
Steps 1 and 4 employ similar problem isolation logic. Indeed, automation exists for these steps. Unfortunately, in the prior art, steps 2 and 3 are performed manually. As such, these steps impede the automation, accuracy and comprehensiveness of problem isolation. This, in turn, significantly increases management costs. The challenges raised by steps 2 and 3 above are modest if there are a small number of measurement sources. Unfortunately, the number of measurement sources is large and growing.
Dissimilarities in the structure of datasets typically arise because measurements are specific to the measured entity. Hence, heterogeneous equipment means heterogeneous measurement sources. Heterogeneity includes the different types of devices (e.g., routers versus file servers), different vendors, and different versions of the same product from the same vendor. With rapid changes in information technology, acquisition cycles are now much longer than technology cycles. For example, depreciation times for personal computers are typically 3 to 5 years, but the technology changes every 9 to 12 months. Further, customers typically upgrade hardware and software in an incremental fashion. The combination of these factors means that customers have an increasingly diverse collection of hardware and software. As such, the diversity of measurement sources is rapidly increasing.
One proposed way to attempt to avoid the problem of heterogeneous measurement sources is to build a data warehouse that integrates data with dissimilar structure, as described in R. Kimball, "The Data Warehouse Toolkit," John Wiley and Sons, 1996. This is accomplished by employing tools that translate data formats and semantics into a common structure. Such an approach works well for fairly static data that is analyzed frequently in that the cost of building and maintaining the data warehouse is amortized over a long time window and a large number of data accesses. However, in systems management applications, the data is dynamic, such as QoS measurements that change every minute. Also, the detailed data used for solving QoS problems is only needed when a problem arises. Thus, the cost of building and maintaining the data warehouse far outweighs the benefits provided.
Existing art for navigating between datasets is specific to the manner in which the data is organized, that is, the conceptual model employed. We consider three such conceptual models: relational data (with variations), multidimensional databases (MDDB), and text documents. While other organizations may exist (e.g., graphical structures, such as hyper linked documents), similar issues arise in these organizations as well.
Considered first is data organized as relational tables. Here, a dataset is a table, an element is a row in a table, and a collection descriptor is an SQL where-clause. Navigation between datasets is accomplished through SQL queries that use the join operation. A join requires specifying the table to navigate to (e.g., in the from clause of SQL queries) and the join attributes used in the where clause. However, situations often arise in which neither the table to navigate to nor the choice of join attributes is known. Thus, while it may be useful to employ join operations in an attempt to achieve automated navigation, the join operation itself does not solve the existing problems.
Considered next is data organized as MDDB, as described in R. F. Berry and J. L. Hellerstein, "A Flexible and Scalable Approach to Navigating Measurement Data in Performance Management Applications," Second IEEE Conference on Systems Management, Toronto, Canada, June, 1996. Conceptually, such an organization can be viewed as a layer on top of the relational model. The MDDB structures attributes into dimensions. Within a dimension, attributes may be further structured into a directed acyclic graph (DAG). Here, a dataset is a cube (an MDDB schema along with its base data), an element is a cell within a cube, and a collection descriptor is a where clause that abides by the hierarchical structure imposed by the MDDB. In the example above, there might be dimensions for Time, Configuration Element, Workload and Metric. In the source dataset, the Workload dimension may contain the attributes division, department, user, and transaction, ordered in this manner. Thus, the coordinate for this dimension would be division=25, department=MVXD, user=ABC, and transaction=_XX.
Navigation within a multidimensional database is accomplished by drill-down and drill-up operations. Drill-down adds constraints in one or more dimensions such that the attribute of the constraint is at the next lower level in the dimension's DAG. Drill-up is the inverse operation. It removes constraints in one or more dimensions. The attributes of the constraints removed are at the next higher level in the dimension hierarchies. For example, consider the constraints for the Workload dimension: division=25, department=MVXD, user=ABC, and transaction=_XX. Drill-up yields the constraints division=25, department=MVXD, and user=ABC.
Navigation between cubes can be accomplished in many ways. One approach is to use SQL queries, as described in R. Kimball, "The Data Warehouse Toolkit," John Wiley and Sons, 1996. However, this suffers from the same drawbacks as described above.
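The drill-up and drill-down operations just described can be sketched as follows. This is only an illustration under simplifying assumptions: the dimension is modeled as a linear hierarchy (a list ordered from highest to lowest level) rather than a general DAG, constraints are equality constraints held in a dictionary, and the function names are hypothetical.

```python
# Workload dimension from the example, ordered from highest to lowest level.
WORKLOAD = ["division", "department", "user", "transaction"]


def _lowest_constrained_level(constraints, hierarchy):
    """Index of the lowest hierarchy level that has a constraint, or -1."""
    levels = [i for i, attr in enumerate(hierarchy) if attr in constraints]
    return levels[-1] if levels else -1


def drill_up(constraints, hierarchy):
    """Remove the constraint at the lowest constrained level."""
    level = _lowest_constrained_level(constraints, hierarchy)
    if level < 0:
        return dict(constraints)  # nothing to remove
    return {a: v for a, v in constraints.items() if a != hierarchy[level]}


def drill_down(constraints, hierarchy, value):
    """Add a constraint on the next lower level of the hierarchy."""
    level = _lowest_constrained_level(constraints, hierarchy) + 1
    if level >= len(hierarchy):
        raise ValueError("already at the lowest level of the dimension")
    result = dict(constraints)
    result[hierarchy[level]] = value
    return result


coordinate = {"division": 25, "department": "MVXD",
              "user": "ABC", "transaction": "_XX"}
after_drill_up = drill_up(coordinate, WORKLOAD)
restored = drill_down(after_drill_up, WORKLOAD, "_XX")
```

Here `after_drill_up` holds the constraints division=25, department=MVXD, and user=ABC, as in the example, and drilling back down with the value `_XX` restores the original coordinate.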
Another approach to automated navigation between cubes is to employ a drill-through operation. With drill-through, a source cell is associated with one or more target cells. This is either done by specifying an explicit association or by specifying a program that computes the association dynamically. The former is poorly suited to dynamic environments in which new cubes are added and others are deleted or their structure is modified. The latter provides a mechanism for dealing with these dynamics, but in and of itself, providing programmatic control does not solve the problems associated with determining which target data to navigate to.
A third way of organizing data is as text documents. Here, the navigation is from keywords (e.g., as specified in an Internet search engine) or whole documents to other documents. This is accomplished by preprocessing the documents to extract keywords (and keyword sequences) to provide an index. Such an approach works well for fairly static data since the index structures are computed infrequently. However, it works poorly for dynamic data.
Further, it is known that data structured as comma-separated-values (or other separators) can readily be treated as relationally structured data. This is accomplished by: (a) describing each column in terms of the distribution of its element values, and (b) using a similarity metric to find comparable columns in other datasets.
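One way to picture steps (a) and (b) is the sketch below. The text does not name a particular similarity metric, so the choice here (histogram intersection over relative value frequencies) is an assumption, as are the function names and the sample columns. Columns are assumed to be non-empty.

```python
from collections import Counter


def distribution(values):
    """Describe a column by the relative frequency of each of its values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def similarity(p, q):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint."""
    return sum(min(p.get(v, 0.0), q.get(v, 0.0)) for v in set(p) | set(q))


def most_similar_column(column, candidates):
    """Name of the candidate column whose distribution is closest to `column`."""
    p = distribution(column)
    return max(candidates,
               key=lambda name: similarity(p, distribution(candidates[name])))


# Hypothetical columns: one from a source dataset, two from another dataset.
source_shift = [1, 1, 2, 3]
other_dataset = {
    "col_a": [1, 2, 2, 3],
    "col_b": ["MVXD", "MVXD", "ABCD", "ABCD"],
}
match = most_similar_column(source_shift, other_dataset)
```

With these sample values, `col_a` shares all of its values with the source column while `col_b` shares none, so `col_a` is selected.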
Still further, existing art on federated and multidatabase systems, as described in M. T. Ozsu and P. Valduriez, "Principles of Distributed Database Systems," Prentice Hall, 1991, as well as schema integration, as described in J. A. Larson, S. B. Navathe, R. Elmasri, "A Theory of Attribute Equivalence in Databases with Application to Schema Integration," IEEE Transactions on Software Engineering, vol. 15, no. 4, April 1989, teaches how to address problems with heterogeneous names and semantics of columns in relational databases. These approaches allow for performing SQL queries in which the tables referenced in the from clause may have different relational schema. However, these approaches do not teach how to automate the selection of relational tables to use in the from clause, nor do they address how to determine the target element collection that is closest to the source element collection.
The present invention provides systems and methods that aid in decision support applications by automatically selecting data relevant to an analysis. This is accomplished by using the structure of the source dataset in combination with the content of the source element collection to identify the closest element collections within one or more target datasets.
Particularly, the invention is implemented in a form which includes certain functional components. These components, as will be explained, may be implemented as one or more software modules on one or more computer systems. A first component is referred to as an inter-dataset navigation engine (IDNE). The IDNE is invoked by analysis applications to automate the selection of related data. The IDNE makes use of another component referred to as dataset access services. The dataset access services component knows the accessible datasets and their structures, creates and manipulates collection descriptors, and provides access to elements within a dataset that are specified by a collection descriptor.
In one embodiment, automated navigation according to the invention may be accomplished in the following manner. First, the IDNE iterates across all target datasets to do the following: (a) use the structure of the source and target datasets to transform the source collection descriptor into a preliminary collection descriptor for the subset of the target dataset that is closest to the source element collection; (b) construct the final collection descriptor by transforming the preliminary collection descriptor until it specifies a non-null subset of the target dataset; and (c) compute a distance metric representing how close the source element collection (or collection descriptor) is to the target element collection (or collection descriptor). The IDNE then returns a list of triples including a name of the target dataset, a target collection descriptor, and a value of the distance metric for each target dataset.
The list may be presented to an end-user who then selects the preferred target dataset. Alternatively, the list may be provided to a program that does additional processing. The list may be sorted by ascending value of the distance metric, so that the closest target element collections appear first, thereby providing a ranking of the target datasets and their target element collections.
It is to be appreciated that the systems and methodology of the present invention advantageously eliminate the drawbacks (e.g., accuracy, comprehensiveness, etc.) that exist in the manual navigation approach and other prior art approaches described above. To illustrate the operation of this methodology, we apply it in the context of the previously presented exemplary scenario. Recall that the collection descriptor of the elements in the source dataset is: shift=1, hour=8, subnet=9.2.15, division=25, department=MVXD, user=ABC, and transaction=_XX. We also use the operating system (OS) data previously introduced. This data is structured into the following columns: shift, hour, minute, subnet, division, department, efficiency, waiting time, CPU waits, I/O waits, page waits, and CPU execution times.
The invention performs steps (a) through (c) above as follows. A preliminary collection descriptor is constructed for the OS data by transforming the source collection descriptor (step (a)). In particular, the constraints such as transaction=_XX that have an attribute that is not present in the target dataset are addressed. One approach to resolving this is to remove such constraints. Doing so yields: shift=1, hour=8, subnet=9.2.15, division=25, department=MVXD, and user=ABC.
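Step (a), in the constraint-removal form just described, can be sketched as a simple filter. The function name and the dictionary representation of equality constraints are assumptions. Note also that the attribute set below includes `user`, which the result stated above implies for the OS data even though the listed OS columns do not mention it; that inclusion is an assumption of this sketch.

```python
def preliminary_descriptor(source_descriptor, target_attributes):
    """Keep only the constraints whose attribute exists in the target dataset."""
    return {attr: value for attr, value in source_descriptor.items()
            if attr in target_attributes}


source_descriptor = {
    "shift": 1, "hour": 8, "subnet": "9.2.15", "division": 25,
    "department": "MVXD", "user": "ABC", "transaction": "_XX",
}

# Assumed attributes of the OS dataset that can carry constraints.
os_attributes = {"shift", "hour", "minute", "subnet",
                 "division", "department", "user"}

preliminary = preliminary_descriptor(source_descriptor, os_attributes)
```

Only transaction=_XX is dropped, because `transaction` is the only constrained attribute absent from the assumed target attribute set.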
Next, the final collection descriptor in the target dataset is constructed (step (b)). This can be achieved by doing the following. First, the element collection specified by the preliminary collection descriptor is retrieved. If this collection is empty, one or more constraints from the collection descriptor are removed. This is repeated until a non-null element collection is obtained. In the exemplary scenario, there is no data for user ABC. So, the constraint user=ABC is removed. Thus, the final target collection descriptor is shift=1, hour=8, subnet=9.2.15, division=25, and department=MVXD.
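The relaxation loop of step (b) might look like the following sketch. The text does not say which constraint to remove first; removing the most recently listed constraint is an assumption made here, as are the function names and the two sample OS rows.

```python
def matches(row, descriptor):
    """True if the row satisfies every equality constraint in the descriptor."""
    return all(row.get(attr) == value for attr, value in descriptor.items())


def final_descriptor(preliminary, rows):
    """Drop constraints until the descriptor selects at least one row."""
    descriptor = dict(preliminary)
    while descriptor and not any(matches(row, descriptor) for row in rows):
        descriptor.popitem()  # assumption: remove the last-listed constraint
    return descriptor


# Hypothetical OS rows; there is no data for user ABC.
os_rows = [
    {"shift": 1, "hour": 8, "minute": 5, "subnet": "9.2.15",
     "division": 25, "department": "MVXD", "user": "DEF", "page waits": 0.7},
    {"shift": 2, "hour": 17, "minute": 30, "subnet": "9.2.16",
     "division": 30, "department": "QRST", "user": "XYZ", "page waits": 0.1},
]

preliminary = {"shift": 1, "hour": 8, "subnet": "9.2.15",
               "division": 25, "department": "MVXD", "user": "ABC"}
final = final_descriptor(preliminary, os_rows)
```

Since no row matches user=ABC, that constraint is removed, and the remaining five constraints select the first row. If every constraint is removed, the empty descriptor selects the whole dataset.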
Lastly, the metric for the distance between the source and target element collections is computed (step (c)). Note that in the above construction, the target collection descriptor always has a subset of the constraints in the source collection descriptor. Thus, a convenient distance metric is the difference between the number of constraints in the source and target collection descriptors. In the exemplary scenario, this value is two.
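The count-based distance and the resulting list of triples can be sketched as below. The dataset names and the second target descriptor are hypothetical, and ordering the triples so that the smallest distance comes first is a choice made for this illustration.

```python
def distance(source_descriptor, target_descriptor):
    """Number of source constraints that are absent from the target descriptor.

    Assumes the target descriptor holds a subset of the source constraints.
    """
    return len(source_descriptor) - len(target_descriptor)


source = {"shift": 1, "hour": 8, "subnet": "9.2.15", "division": 25,
          "department": "MVXD", "user": "ABC", "transaction": "_XX"}

# Final target descriptors per target dataset (names are hypothetical).
targets = {
    "network_measurements": {"shift": 1, "hour": 8, "subnet": "9.2.15"},
    "os_measurements": {"shift": 1, "hour": 8, "subnet": "9.2.15",
                        "division": 25, "department": "MVXD"},
}

# (dataset name, target collection descriptor, distance), closest first.
triples = sorted(
    ((name, desc, distance(source, desc)) for name, desc in targets.items()),
    key=lambda triple: triple[2],
)
```

For the OS data the source has seven constraints and the target has five, giving the distance of two stated above.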
Accordingly, the present invention provides automation for selecting datasets relevant to analysis tasks. Such automation is crucial to improving the productivity of decision support in systems management applications. The automation enabled by the invention provides value in many ways. For example, the invention makes the novice analyst more expert by providing a list of target datasets and collection descriptors that are closest to an element collection at hand (i.e., the source element collection). As a result, the novice focuses on the datasets that are most likely to be of interest in the analysis task. By way of further example, the invention makes expert analysis more productive. This is achieved by providing the target collection descriptor for each target dataset thereby enabling the construction of a system in which analysts need only click on a target dataset (or collection descriptor) in order to navigate to its associated element collection.
The techniques employed today for dataset selection (e.g., drill-through in MDDBs) embed fixed associations between datasets or require special purpose programs that must be maintained if datasets and/or attributes are added or deleted. In contrast, the invention uses the structure of the data itself to select relevant data. Such an approach adapts automatically to changes in the structure and content of the data being analyzed.
In another embodiment of the invention, the set of attributes considered when transforming the source collection descriptor into the target collection descriptor may be constrained. Indeed, the exemplary scenario does not consider the attributes response time, efficiency, waiting time, CPU waits, I/O waits, page waits, and CPU execution time.
In yet another embodiment of the invention, different levels of importance may be assigned to attributes. For example, a match on the attribute subnet may be considered more important than a match on the attribute division. This may be implemented by changing the manner in which the distance metric is computed so that it includes weights assigned. In this way, differences in the values of more important attributes result in larger distances than do differences in less important attributes.
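A weighted variant of the distance can be sketched as follows. The weight values, the default weight of 1.0, and the function name are assumptions for illustration; the text only states that weights are included in the computation.

```python
def weighted_distance(source_descriptor, target_descriptor, weights, default=1.0):
    """Sum the weights of the source constraints missing from the target."""
    dropped = set(source_descriptor) - set(target_descriptor)
    return sum(weights.get(attr, default) for attr in dropped)


source = {"shift": 1, "hour": 8, "subnet": "9.2.15", "division": 25}

# Hypothetical weights: a match on subnet matters more than one on division.
weights = {"subnet": 5.0, "division": 1.0}

without_subnet = {"shift": 1, "hour": 8, "division": 25}
without_division = {"shift": 1, "hour": 8, "subnet": "9.2.15"}

d_subnet = weighted_distance(source, without_subnet, weights)
d_division = weighted_distance(source, without_division, weights)
```

Losing the subnet constraint costs 5.0 while losing the division constraint costs 1.0, so a target that preserves the subnet match ranks as closer. With an empty weight table every dropped constraint counts as 1.0, which reduces to the unweighted count difference.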
Automated navigation according to the invention can be applied in many domains. For example, in analysis of manufacturing lines, measurement datasets exist for machines in the manufacturing line as well as for the interconnection of these machines. Automated navigation according to the invention can aid with decision support for scheduling and planning based on this data. By way of a further example, in transportation systems, datasets exist for measurements taken by road sensors and traffic reports. Automated navigation according to the invention can aid in planning highway capacity over the entire network of roadways. It is to be understood that the above applications are merely exemplary in nature and not intended to limit the applicability of the invention. Furthermore, it is to be appreciated that automated navigation according to the invention can be accomplished centrally at a server or in a distributed manner amongst several smaller server machines.