Most financial and information technology companies now rely on various database management systems (DBMSs) to store and manipulate “big data,” that is, large and complex collections of data, in order to conduct business. For example, companies may use their DBMSs to create databases consisting of structured sets of these data collections. The DBMSs may use database applications to operate on the databases and perform many complex calculations for the companies' customers. These applications often use programming languages such as SQL to create, manage, and use the databases. For these companies, accuracy is often an important factor in the operation of a DBMS, because inaccurate calculations performed by the applications may lead to various negative business and legal outcomes. Conversely, accurate, efficient, and fast calculations may lead to positive outcomes.
Thus, many financial companies seek to improve the operation of their DBMSs by identifying and quickly resolving data defects related to calculations performed by database applications. Often, this involves determining the “data lineage” of calculations of interest. Determining data lineage includes identifying the hierarchy, discovering the location, and monitoring changes of the data elements within a database that are used in a calculation. However, DBMSs themselves are often unable to provide this data lineage information; thus, accurate and reliable solutions for obtaining it are needed.
To address this issue, third parties have provided diagnostic applications for “parsing” SQL files, that is, dividing an SQL file into smaller portions by following a set of rules in order to understand the structure and “grammar” of the SQL code. These software applications, such as ZQL™, JSqlParser™, or General SQL Parser™, work by externally analyzing the SQL code in SQL files and providing an output. Once an SQL file is parsed, these applications attempt to identify data lineage information based on the parsed SQL.
However, these third-party applications are often inadequate for determining data lineage for many SQL files. For example, they cannot determine the data lineage for complex SQL files embodying advanced but common SQL concepts, such as “select all” statements, orphaned columns, column aliases, and multiple dependent queries.
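By way of illustration only, the following minimal Python sketch shows how a purely textual parse can recover lineage for a simple aliased query yet fail on the concepts listed above. The function name `naive_lineage`, the pattern `SIMPLE_SELECT`, and the table and column names are hypothetical and chosen for this example. The sketch does not represent how ZQL, JSqlParser, or General SQL Parser operate internally.

```python
import re

# Hypothetical, deliberately naive pattern: one SELECT list and one source table.
SIMPLE_SELECT = re.compile(
    r"^\s*SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def naive_lineage(sql):
    """Map each output column name to a (source_table, source_expression) pair."""
    match = SIMPLE_SELECT.match(sql)
    if match is None:
        raise ValueError("statement is outside what this sketch can parse")
    table = match.group("table")
    lineage = {}
    for item in match.group("cols").split(","):
        item = item.strip()
        # Split "expr AS alias" into its two halves; with no alias, the
        # output name is simply the expression itself.
        parts = re.split(r"\s+AS\s+", item, flags=re.IGNORECASE)
        lineage[parts[-1].strip()] = (table, parts[0].strip())
    return lineage


# A column alias is resolved back to its source column.
aliased = naive_lineage("SELECT balance AS bal, rate FROM accounts")

# A "select all" statement yields only "*": the actual column names are
# unknown without the table's schema, so the lineage is incomplete.
star = naive_lineage("SELECT * FROM accounts")
```

In this sketch, a dependent query such as `SELECT a FROM (SELECT x AS a FROM t) AS s` is rejected outright, because text matching alone cannot follow the nesting between the inner and outer statements.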
In view of these and other shortcomings and problems with database management systems and third-party software applications, improved systems and techniques for lineage detection and code parsing are desirable.