RDBMSs (Relational Database Management Systems) form a critical part of the infrastructure of business enterprises today. From day-to-day operations to analytic forecasting of the business, RDBMSs are at the heart of most of today's enterprises. An important feature of RDBMSs is that a single database can be spread across multiple tables that are related to one another. This differs from flat-file databases, in which each database is self-contained in a single table. In RDBMSs, relationships between tables can be specified at the time of creating the tables.
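As a minimal sketch of what specifying a relationship at table-creation time can look like, the following uses Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their columns are hypothetical names chosen purely for illustration; they do not come from any particular system.

```python
import sqlite3

# In-memory database; table and column names below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled on the connection.
conn.execute("PRAGMA foreign_keys = ON")

# The relationship between the two tables is declared when the tables are created.
conn.executescript("""
CREATE TABLE customers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount      REAL NOT NULL
);
""")

conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Acme')")
conn.execute("INSERT INTO orders (customer_id, amount) VALUES (1, 250.0)")

# A join follows the declared relationship across the two tables.
rows = conn.execute(
    "SELECT c.name, o.amount "
    "FROM orders AS o JOIN customers AS c ON c.id = o.customer_id"
).fetchall()

# An order that points at a non-existent customer violates the relationship.
try:
    conn.execute("INSERT INTO orders (customer_id, amount) VALUES (99, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

In a flat-file layout, the customer name would instead be repeated on every order row, and nothing would prevent an order from referring to a customer that does not exist.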
SQL is the primary interface to RDBMSs, and the SQL queries logged by these databases provide a wealth of information about data access patterns in an organization. Insights gleaned from analyzing these logs can be valuable. For example, the marketing department in an organization might be interested in how a product is used and might query the product usage data stored in a database. The engineering department in the same organization might be issuing similar queries against the database. In such a situation, it might be possible to generate a single pre-made report identifying usage patterns and distribute it to both departments, thereby saving valuable computation time in the database. Understanding patterns in SQL queries can also aid in developing database optimizations, such as indexes or materialized views, that are tailored to a specific set of SQL queries.
For these and several other reasons, identifying similar SQL queries in SQL logs can be of great interest to database architects. However, analyzing SQL logs poses several challenges. For example, the number of queries executed by a modern enterprise database can easily run into the tens of millions or more. When faced with such a large number of SQL queries, it can be very difficult for a database architect to answer questions such as: “how many queries are exact duplicates of each other?”, “are all these queries unique?”, or “how many queries are similar, even though they might not be exact duplicates?”.
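To make these three questions concrete, the sketch below answers them for a tiny, made-up query log. It assumes one simple notion of "similar": two queries are treated as similar when they are identical after masking literals and ignoring case, whitespace, and a trailing semicolon. That definition, the `normalize` helper, and the sample log are illustrative assumptions only, not a method described here, and a regex-based approach like this one would likely be too crude for real SQL dialects.

```python
import re
from collections import Counter

def normalize(query: str) -> str:
    """Reduce a query to a rough template (hypothetical helper for illustration)."""
    q = query.strip().rstrip(";")
    q = re.sub(r"'(?:[^']|'')*'", "?", q)      # mask single-quoted string literals
    q = re.sub(r"\b\d+(?:\.\d+)?\b", "?", q)   # mask standalone numeric literals
    q = re.sub(r"\s+", " ", q)                 # collapse runs of whitespace
    return q.lower()

# Made-up sample log; a real log could hold tens of millions of entries.
log = [
    "SELECT name FROM users WHERE id = 1",
    "SELECT name FROM users WHERE id = 1",
    "select name from users where id = 42;",
    "SELECT name  FROM users WHERE id = 7",
    "SELECT * FROM orders WHERE status = 'open'",
    "SELECT * FROM orders WHERE status = 'closed'",
    "DELETE FROM sessions WHERE expires < 1700000000",
]

# "How many queries are exact duplicates of each other?"
exact = Counter(log)
exact_duplicates = sum(count - 1 for count in exact.values())

# "Are all these queries unique?"
all_unique = len(exact) == len(log)

# "How many queries are similar, even though they might not be exact duplicates?"
templates = Counter(normalize(q) for q in log)

for template, count in templates.most_common():
    print(f"{count:>3}  {template}")
```

Grouping by template rather than by raw text is what surfaces the near-duplicates: in this sample, only one query is an exact repeat, yet four of the seven share the same lookup-by-id shape.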