Technical Field
The following disclosure relates to the field of information technology, and in particular, to methods, systems, and apparatuses for identifying data tables and determining a degree of association between data tables.
Description of the Related Art
Industry has identified the “3V” characteristics of big data, namely volume, velocity, and variety. Owing to increased attention in recent years, the storage and computing capabilities of big data hardware and software have reached acceptable levels in many scenarios, but the variety of big data remains the most pressing issue in big data applications.
One solution for meeting the requirements of varied big data is data exchange. In general, data exchange may be carried out between different companies or between different business departments within the same company. Specifically, data exchange takes the form of mutual access between different data tables in a data warehouse or cloud computing environment. In daily business, to meet the requirements of various services for varied big data, the composition of one resulting data table may depend on data tables of multiple business departments and even on data tables made available by different companies. However, in data exchange and mutual access, different data tables may have different levels of importance for the resulting data table that meets the service requirements. Identifying the data tables of high importance, so that they can be given priority in operation and maintenance, has therefore become an important task in the big data era. Because the importance of a data table is determined mainly through its degree of association with other data tables, determining the degree of association between the data provided by departments and companies and the resulting data table that meets the service requirements is key to measuring the value of data exchange.
Data tables are usually stored in a data warehouse. A data warehouse often stores thousands of data tables, and each data table contains dozens or even hundreds of fields. In some scenarios, to meet analysis requirements, the dependency relationships between multiple data tables are represented by a complex directed graph, as shown in FIG. 1.
FIG. 1 illustrates a schematic diagram of a directed acyclic graph having data tables A, B, C, D, and E as nodes. In FIG. 1, a circle represents a data table, and the letter in the circle represents the name of the data table, for example, data table A, data table B, etc. Letters in an annotation box beside a circle represent field names in the data table; for example, data table A contains fields a1, a2, a3, and a4. A directed line segment between two circles represents a mapping/dependency relationship between the two data tables. For example, the arrow from data table A to data table C represents that data table A contributes two fields (a1 and a2) to data table C. That is, generation of data table C depends on fields a1 and a2 of data table A.
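As a minimal illustration, the portion of FIG. 1 described in this section could be held in a plain mapping from directed edges to contributed fields. The names `TABLE_FIELDS`, `EDGES`, and `upstream_tables` are hypothetical and are introduced here only for illustration. Only the relationships explicitly stated in this description are encoded; the remaining edges and fields of FIG. 1 are not reproduced.

```python
# Hypothetical in-memory form of the part of FIG. 1 described in the text.
# Fields of each data table (only data table A's fields are stated).
TABLE_FIELDS = {"A": ["a1", "a2", "a3", "a4"]}

# (source table, dependent table) -> fields the source contributes.
EDGES = {
    ("A", "C"): ["a1", "a2"],
    ("B", "C"): ["b1"],
}


def upstream_tables(edges, target):
    """Return the tables that `target` directly depends on, sorted by name."""
    return sorted(src for (src, dst) in edges if dst == target)
```

Under these assumptions, `upstream_tables(EDGES, "C")` lists data tables A and B as the direct dependencies of data table C.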
In the prior art, a degree of association between two data tables is calculated in two cases: in one case, two data tables have a direct dependency relationship (e.g., data table A and data table C in FIG. 1) and in the other case two data tables have an indirect dependency relationship (e.g., data table A and data table E in FIG. 1).
For data tables having a direct dependency relationship (e.g., data tables A and C in FIG. 1), a degree of association is calculated using current techniques according to a proportion of contributed fields. For example, in FIG. 1, when calculating a degree of association between data table A and data table C, first, it is determined that the data tables on which data table C depends include data table A and data table B, where data table A contributes two fields (a1, a2) to data table C, while data table B contributes only one field (b1) to data table C. Thus, the ratio between degrees of association of data table A and data table B with data table C is 2:1. That is, the degree of association of data table A with data table C is 2/3, while the degree of association of the data table B with the data table C is 1/3.
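The proportional calculation described above can be sketched as follows. This is an illustrative reading of the prior-art technique, not a definitive implementation; the function name `direct_association` and the mapping `contributions_to_c` are hypothetical.

```python
from fractions import Fraction


def direct_association(contributions, source):
    """Share of all contributed fields that come from `source`.

    `contributions` maps each directly depended-on table to the list of
    fields it contributes to one dependent table.
    """
    total = sum(len(fields) for fields in contributions.values())
    if total == 0:
        raise ValueError("the dependent table receives no fields")
    return Fraction(len(contributions.get(source, [])), total)


# Fields contributed to data table C, as stated for FIG. 1.
contributions_to_c = {"A": ["a1", "a2"], "B": ["b1"]}
```

With these inputs, `direct_association(contributions_to_c, "A")` evaluates to 2/3 and `direct_association(contributions_to_c, "B")` to 1/3, matching the 2:1 ratio above.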
For data tables not having a direct dependency relationship (e.g., data tables A and E in FIG. 1), current techniques calculate a degree of association by converting the indirect dependency relationship into a chain of direct relationships through an intermediate data table. For example, to obtain the degree of association of data table A with data table E in FIG. 1, the degree of association of data table A with data table C and the degree of association of data table C with data table E need to be calculated first. Since the degree of association of data table A with data table C is 2/3 (as discussed previously), and the degree of association of data table C with data table E is 1/4 (calculated using the process discussed previously), the degree of association of data table A with data table E is 2/3 × 1/4 = 1/6.
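The chained calculation for an indirect dependency can be sketched as a product over the direct degrees of association along a path. The function name `indirect_association` and the mapping `direct` are hypothetical; the two direct values are those stated above for FIG. 1.

```python
from fractions import Fraction
from math import prod


def indirect_association(direct, path):
    """Multiply the direct degrees of association along `path`.

    `direct` maps (source, dependent) pairs to their direct degree of
    association; `path` lists the tables from source to final dependent.
    """
    if len(path) < 2:
        raise ValueError("a path needs at least two tables")
    hops = zip(path, path[1:])
    return prod((direct[hop] for hop in hops), start=Fraction(1))


# Direct degrees of association stated in the text for FIG. 1.
direct = {("A", "C"): Fraction(2, 3), ("C", "E"): Fraction(1, 4)}
```

Here `indirect_association(direct, ["A", "C", "E"])` evaluates to 1/6, the value obtained in the example above.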
However, the above-mentioned current techniques have several shortcomings. First, the degree of association they calculate is accurate only to the granularity of data tables and not to the granularity of the fields of a data table. In practice, different data fields in one data table differ greatly in importance, and current techniques cannot reflect such differences.
Secondly, for parent and child tables having a direct dependency relationship, current techniques simply take the proportion of fields contributed by one child table to a parent table as the degree of association. This single factor is too simplistic to reflect, completely and precisely, the differences that arise in actual implementations.
Thirdly, current techniques calculate the degree of association between parent and child tables having only an indirect dependency relationship as the product of the degrees of association between the directly dependent data tables along the path. As a result, the degree of association between data tables separated by even one or two layers decreases exponentially; it attenuates too rapidly to reflect the real contribution between the data tables. Therefore, the result of identifying the importance of data tables according to the prior art is not accurate.
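The attenuation described above can be made concrete with a small sketch. The per-hop value of 1/3 is an assumed, purely illustrative figure and is not taken from FIG. 1; the function name `attenuation_by_depth` is hypothetical.

```python
from fractions import Fraction


def attenuation_by_depth(per_hop, max_hops):
    """Product-based degree of association after 1..max_hops equal hops."""
    return [per_hop ** hops for hops in range(1, max_hops + 1)]


# Assumed per-hop degree of association, for illustration only.
values = attenuation_by_depth(Fraction(1, 3), 3)
```

Under this assumption, `values` holds 1/3, 1/9, and 1/27, so a table only two intermediate layers away already retains less than 4% of the association under the product rule.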