The present invention relates generally to computer database mining, and more particularly to sequential pattern mining.
The volume of data stored in electronic format has increased dramatically over the past two decades. The increase in use of electronic data gathering devices such as point-of-sale or remote sensing devices has contributed to this explosion of available data. Data storage is becoming easier and more attractive to the business community as large amounts of computing power and data storage resources become available at increasingly reduced costs.
With so much attention focused on the accumulation of data, there has arisen a complementary need to focus on how this valuable resource can be utilized. Businesses have recognized that valuable insights can be gleaned by decision-makers who make effective use of the stored data. By using data mining tools that are effective to obtain meaningful data from millions of bar code sales transactions, or sales data from catalog companies, it is possible to gain valuable information about customer buying behavior. The derived information might be used, for example, by retailers in deciding which items to shelve in a supermarket, or for designing a well-targeted marketing program, among others. Numerous meaningful insights can be unearthed from the data utilizing proper analysis techniques.
One analysis technique involves discovering frequent sequential patterns from a large database of sequences. A major problem users experience when attempting to use this technique is the lack of user-controlled focus in the pattern mining process. Typically, the interaction of the user in a pattern mining technique is limited to specifying a lower bound on the desired support for the extracted patterns. An appropriate mining algorithm typically returns a very large number of sequential patterns, only some of which may be of actual interest to the user. Despite its conceptual simplicity, this "unfocused" approach to sequential pattern mining suffers from two major drawbacks.
The first major drawback is a disproportionate computational cost for selective users. Given a database of sequences and a fixed value for the minimum support threshold, the computational cost of the pattern mining process is the same for every user, regardless of how narrow that user's interest is. The problem here is that, despite the development of efficient algorithms, pattern mining remains a computation-intensive task that typically takes hours to complete. Thus, ignoring user focus can be extremely unfair to a highly selective user who is only interested in patterns of a very specific form.
The second major drawback is the overwhelming volume of potentially useless results. The lack of tools to express user focus during the pattern mining process means that selective users will typically be swamped with a huge number of frequent patterns, most of which are useless for their purposes. Sorting through this morass of data to find specific pattern forms can be a daunting task, even for the most experienced user.
Thus, a need has been recognized in conjunction with database mining for improving upon the shortcomings of previous efforts in the field, including those discussed above.
The present invention broadly contemplates a system and method for mining frequent sequential patterns under structural constraints on the interesting patterns. The novel pattern mining techniques of the present invention enable the incorporation of user-controlled focus in the mining process. To achieve this, two subsidiary problems are addressed. First, there is a need for a flexible constraint specification language that allows users to express the specific family of sequential patterns in which they are interested. Second, there is a need for novel pattern mining algorithms that can exploit user focus by pushing user-specified constraints deep inside the mining process. The present invention exploits pattern constraints to prune the computational cost and ensure system performance that is commensurate with the level of user focus (i.e., constraint selectivity), as selective users should not be penalized for results that they did not ask for.
In accordance with the present invention, a Regular Expression (RE) is used for identifying the family of interesting frequent patterns. A family of methods is utilized that enforces the RE constraint to different degrees in the generation and pruning of candidate patterns during the mining process. This is accomplished by employing different relaxations of the RE constraint in the mining loop. Those sequences which satisfy the given constraint are thus identified most expeditiously. Experimental results demonstrate that speedups of more than an order of magnitude are possible when Regular Expression constraints are pushed deep inside the mining process in accordance with the present invention.
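By way of illustration only, the following minimal Python sketch shows one possible way a regular expression constraint could be enforced inside the mining loop rather than applied afterward. It is not asserted to be any particular method of the present invention. The sketch assumes that each item is a single character, that the constraint has already been compiled into a deterministic finite automaton supplied as a transition table, and that support is the number of database sequences containing the candidate as a subsequence. The names mine_with_dfa, transitions, start, accepting and max_len are hypothetical. Candidates are grown only along automaton paths that leave the start state, which is one relaxation of the full constraint, and a candidate is dropped as soon as its support falls below the threshold.

```python
from typing import Dict, List, Set, Tuple


def is_subsequence(pattern: str, sequence: str) -> bool:
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern: str, database: List[str]) -> int:
    """Number of database sequences that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def mine_with_dfa(
    database: List[str],
    min_support: int,
    transitions: Dict[int, Dict[str, int]],  # assumed DFA: state -> item -> state
    start: int,
    accepting: Set[int],
    max_len: int,
) -> List[str]:
    """Level-wise search that only extends candidates along DFA transitions."""
    results: List[str] = []
    frontier: List[Tuple[str, int]] = [("", start)]
    for _ in range(max_len):
        next_frontier: List[Tuple[str, int]] = []
        for prefix, state in frontier:
            for item, target in sorted(transitions.get(state, {}).items()):
                candidate = prefix + item
                # Support cannot grow when a pattern is extended, so an
                # infrequent candidate can be discarded together with all
                # of its extensions.
                if support(candidate, database) < min_support:
                    continue
                next_frontier.append((candidate, target))
                if target in accepting:
                    results.append(candidate)
        frontier = next_frontier
    return results


# Hypothetical example: the constraint a(b|c)*d expressed as a three-state DFA.
example_db = ["abd", "acd", "abcd", "bd"]
example_dfa = {0: {"a": 1}, 1: {"b": 1, "c": 1, "d": 2}}
frequent = mine_with_dfa(example_db, 2, example_dfa, start=0, accepting={2}, max_len=4)
print(frequent)
```

In this sketch no support is ever counted for a candidate that could not begin a sequence accepted by the automaton, so the amount of work shrinks as the constraint becomes more selective. An unfocused miner would instead count support for every frequent candidate and discard the non-matching ones only at the end.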
Method steps of the present invention can appropriately and advantageously be carried out using a suitably programmed general purpose computer. Moreover, these steps may also be implemented on an Integrated Circuit or part of an Integrated Circuit. Thus, it is to be understood that the invention may be implemented in hardware, software, or a combination of both. Accordingly, the present invention includes a program storage device readable by machine to perform any of the method steps herein described for sequential pattern mining with regular expression constraints. Again, it is to be emphasized that any of the method steps, in any combination, can be encoded and be tangibly embodied on a program storage device in accordance with the present invention.