The present invention relates to computer software, and in particular, to profiling data in a parallel processing environment such as a massively parallel processing (MPP) environment.
Large sets of data may initially be organized into rows, each associated with a certain subject such as a customer, employee, vendor, contributor, investor, taxpayer, item of real or personal property, etc., and columns, each associated with a certain attribute of the subjects. When a company wishes to know the salary and start date of a certain employee, the particular employee identification number, or employeeID, can be entered, and this and other information will be available in the row associated with that employeeID. The data itself may be stored in files of, for example, ten employees each, so that a first file includes information relating to employees 1-10, a second file includes information relating to employees 11-20, a third file includes information relating to employees 21-30, etc. If the specific employeeID is 22, then the third file, which contains the relevant information about the employee having employeeID 22, will be retrieved. Because the company will often wish to find specific information in this way, organizing the data according to subjects in rows and attributes in columns is efficient for such lookups.
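By way of illustration only, the row-oriented file layout described above may be sketched as follows. The sketch assumes ten employees per file and consecutive employeeIDs starting at 1, as in the example; the names `ROWS_PER_FILE`, `file_number_for`, `build_files`, and `lookup`, and the sample salary values, are hypothetical and are not taken from any particular system.

```python
ROWS_PER_FILE = 10  # assumption: ten employees per file, as in the example


def file_number_for(employee_id: int) -> int:
    """Return the 1-based number of the file holding the given employeeID."""
    return (employee_id - 1) // ROWS_PER_FILE + 1


def build_files(rows):
    """Group rows into files keyed by file number (in-memory stand-in for files)."""
    files = {}
    for row in rows:
        files.setdefault(file_number_for(row["employeeID"]), []).append(row)
    return files


def lookup(files, employee_id):
    """Retrieve only the one file that contains the requested row."""
    number = file_number_for(employee_id)
    for row in files[number]:
        if row["employeeID"] == employee_id:
            return number, row
    raise KeyError(employee_id)


# Hypothetical sample data: 30 employees with made-up salaries and start years.
rows = [
    {"employeeID": i, "salary": 40000 + 1000 * i, "startYear": 2000 + i % 20}
    for i in range(1, 31)
]
files = build_files(rows)
number, row = lookup(files, 22)
print(number, row)
```

In this sketch, a lookup for employeeID 22 touches only the third file, which is why the row-oriented layout suits retrieval of a single subject.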
However, the efficiency of searching through data files organized with subjects as rows and attributes as columns breaks down when it is desired to profile data. In a profiling process, it is typically desired to search the data in a particular column to check whether it matches one or more input criteria. For example, the subjects may be employees, and it may be desired to identify employees making over a certain salary, to find the minimum, maximum or average salary of a group of employees, or to determine which employees were hired after a certain date, among many other potential queries. If the data is organized into the first, second and third files holding the first ten, second ten and third ten employeeIDs, then all of the files, including all of the data about every employee, would have to be retrieved just to obtain the data in one column.
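By way of illustration only, the cost of profiling a single column over row-oriented files may be sketched as follows. The sketch assumes three files of ten employees each with four attributes per row; the names `make_row_files` and `profile_salary`, the attribute names, and the sample values are hypothetical.

```python
def make_row_files(num_employees=30, rows_per_file=10):
    """Build row-oriented files: each file is a list of complete employee rows."""
    files = []
    for start in range(1, num_employees + 1, rows_per_file):
        stop = min(start + rows_per_file, num_employees + 1)
        files.append([
            {
                "employeeID": i,
                "name": f"emp{i}",            # made-up attribute values
                "salary": 40000 + 1000 * i,
                "startYear": 2000 + i % 20,
            }
            for i in range(start, stop)
        ])
    return files


def profile_salary(files):
    """Compute min/max/mean of the salary column, counting how much data is read."""
    files_read = 0
    fields_read = 0
    salaries = []
    for rows in files:
        files_read += 1                 # every file must be retrieved
        for row in rows:
            fields_read += len(row)     # every attribute of every row is read
            salaries.append(row["salary"])
    return {
        "min": min(salaries),
        "max": max(salaries),
        "mean": sum(salaries) / len(salaries),
        "files_read": files_read,
        "fields_read": fields_read,
        "fields_needed": len(salaries),
    }


result = profile_salary(make_row_files())
print(result)
```

Under these assumptions, all three files and 120 fields are read in order to use only the 30 salary values, which illustrates the inefficiency described above.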
Thus, there is a need for a more effective and efficient way to profile data. The present invention solves this problem by providing an efficient and effective method of profiling in an MPP environment.