The present invention relates generally to information processing environments and, more particularly, to methods for eliminating duplicate rows or tuples in a tuple stream occurring in a data processing system, such as a Database Management System (DBMS).
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as "records" having "fields" of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added to or removed from data files, retrieved from or updated in such files, and so forth, all without user knowledge of underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of a database management system is known in the art. See, e.g., Date, C., An Introduction to Database Systems, Volume I and II, Addison Wesley, 1990; the disclosure of which is hereby incorporated by reference.
DBMS systems have long since moved from a centralized mainframe environment to a de-centralized or distributed environment. One or more PC "client" systems, for instance, may be connected via a network to one or more server-based database systems (SQL database server). Commercial examples of these "client/server" systems include Powersoft™ clients connected to one or more Sybase SQL Server™ database servers. Both Powersoft™ and Sybase SQL Server™ are available from Sybase, Inc. of Emeryville, Calif.
Today, there exists great interest in optimizing system performance in database servers, for instance, by increasing the speed at which query processing occurs. In routine database use, one is often faced with sets of duplicate records, either in one's underlying tables or in result tables (i.e., tables generated on-the-fly by joining tables together). Although relational database systems have been designed with the premise that tables would comprise unique records or rows, one generally finds that users of such systems have tables (either system-created and/or user-created) which contain duplicate rows. Nevertheless, often a user will desire to impose a mask on the data so that the system returns only one row or record for each duplicate. Moreover, users want this task performed efficiently and expeditiously.
Present-day techniques for eliminating duplicates from a tuple stream or table resort to sorting operations. Even if duplicates are received in order (e.g., from a clustered-indexed table having a non-unique index), a sort approach entails creation of a result table in which to store the results of the sort (for eliminating duplicates). Since the approach leads to substantial I/O (input/output) overhead, it is not performance-optimized.
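By way of illustration only, the sort-based approach described above may be sketched as follows. The sketch is written in Python and is not taken from any particular DBMS; the function name `distinct_by_sort` and the in-memory list standing in for the result table are assumptions made for the example. It shows why the whole stream must be materialized before a single row can be returned.

```python
from typing import Iterable, List, Tuple

Row = Tuple  # a tuple of comparable column values (assumed for this sketch)


def distinct_by_sort(rows: Iterable[Row]) -> List[Row]:
    """Eliminate duplicates by sorting, then dropping adjacent repeats.

    Hypothetical helper: the sorted list plays the role of the result
    table that a sort-based approach must create and fill.
    """
    # The entire input is consumed and stored before any row is emitted.
    work_table = sorted(rows)

    result: List[Row] = []
    for row in work_table:
        # After sorting, duplicates are adjacent, so one comparison
        # against the last emitted row is enough.
        if not result or row != result[-1]:
            result.append(row)
    return result


rows = [(2, "b"), (1, "a"), (2, "b"), (1, "a"), (3, "c")]
print(distinct_by_sort(rows))
```

In a real server the work table would typically reside on disk, which is the source of the I/O overhead noted above; the sketch keeps everything in memory purely for brevity.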
Another approach to eliminating duplicates in a tuple stream is to store a history list or structure. For instance, one could create a hashtable for storing key values already encountered in a tuple stream. To determine whether a new tuple is unique, the system need only index into ("hash") the appropriate hashtable entry to determine whether the key value has already been encountered. Although that approach will work, it is far too slow to be practical. Quite simply, the overhead of such a lookup mechanism (which requires extensive I/O activity) would impede system performance to the point where it would be unacceptable to users.
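The history-structure approach described above may likewise be sketched, again in Python and again only as an illustration. The name `distinct_by_history` and the optional `key` parameter are assumptions introduced for the example, and Python's built-in `set` stands in for the hashtable. The sketch makes plain that every incoming tuple costs one lookup and that the history grows with the number of distinct keys seen.

```python
from typing import Callable, Hashable, Iterable, Iterator, Tuple

Row = Tuple  # a tuple of hashable column values (assumed for this sketch)


def distinct_by_history(
    rows: Iterable[Row],
    key: Callable[[Row], Hashable] = lambda row: row,
) -> Iterator[Row]:
    """Yield each row whose key has not been seen earlier in the stream.

    Hypothetical helper: `seen` is the history structure, holding every
    distinct key value encountered so far.
    """
    seen = set()
    for row in rows:
        k = key(row)
        # One hash lookup per incoming tuple, whether or not it is a duplicate.
        if k not in seen:
            seen.add(k)
            yield row


rows = [(2, "b"), (1, "a"), (2, "b"), (1, "a"), (3, "c")]
print(list(distinct_by_history(rows)))
```

The sketch holds its history in memory; the I/O cost discussed above would arise where the history is too large for memory and must be kept on a storage device, a case the sketch does not model.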
What is desired is an approach which allows the system to "throw away" duplicates as they occur in a tuple stream. At the same time, however, such an approach should not incur costly lookup or other list-processing operations. The present invention fulfills this and other needs.