1. Field of the Invention
The present invention relates to discovering trends in text databases. More particularly, the invention concerns the analysis of databases to find user-specified trends in document text by employing phrase identification using sequential patterns and trend identification using shape queries.
2. Description of the Related Art
Database technology has been used with great success in traditional business data processing. However, there is an increasing desire to use this technology in new application domains. For example, one such application domain that has acquired considerable significance is that of database text analysis (sometimes referred to as "mining").
Several approaches to different database content analysis techniques have been proposed, as discussed in Feldman et al., "Knowledge Discovery in Textual Databases (KDT)", Proc. of the 1st Int'l. Conf. on Knowledge Discovery in Databases and Data Mining, 1995; Feldman et al., "Mining Associations in Text in the Presence of Background Knowledge", Proc. of the 2nd Int'l. Conf. on Knowledge Discovery on Databases and Data Mining, 1996; Renouf, A., "Making Sense of Text: Automated Approaches to Meaning Extraction", 17th Int'l. On-Line Information Meeting Proceedings, 1993a; and Srikant et al., "Mining Sequential Patterns: Generalizations and Performance Improvements", Proc. of the 5th Int'l. Conf. on Extending Database Technology (EDBT), 1996. As new database content analysis techniques are discovered, an increasing number of organizations are creating ultra-large databases (measured in gigabytes and even terabytes) of business data, such as consumer data, transactional histories, sales records, and historical documents. For example, U.S. patents dating from 1970 may now be found in a computer database, which forms a potential gold mine of valuable business information.
A few suggestions have been made by database content analysis practitioners concerning the discovery of interesting patterns and the analysis of trends in text documents. For example, trend analysis that compares concept distributions computed from old data with distributions computed from new data has been suggested in Feldman, 1995, supra. Feldman, 1996, supra, describes finding associations between the key words or concepts labeling documents by using background knowledge about relationships among the key words. The knowledge base is used to supply unary or binary relations amongst the key words labeling the documents.
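The comparison of old and new concept distributions can be illustrated with a minimal sketch. It assumes each document is already labeled with a list of concepts and measures change as a simple difference in relative frequency; the function names, the sample labels, and the choice of measure are hypothetical and are not taken from the cited references.

```python
from collections import Counter


def concept_distribution(labeled_docs):
    """Return the relative frequency of each concept label across the documents."""
    counts = Counter(concept for labels in labeled_docs for concept in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {concept: n / total for concept, n in counts.items()}


def distribution_shift(old_docs, new_docs):
    """Per-concept change in relative frequency from the old set to the new set."""
    old = concept_distribution(old_docs)
    new = concept_distribution(new_docs)
    concepts = sorted(set(old) | set(new))
    return {c: new.get(c, 0.0) - old.get(c, 0.0) for c in concepts}


# Hypothetical concept labels for an older and a newer batch of documents.
old_docs = [["tape", "storage"], ["tape"], ["storage", "disk"]]
new_docs = [["disk", "storage"], ["disk", "network"], ["network"]]

shift = distribution_shift(old_docs, new_docs)
rising = max(shift, key=shift.get)   # concept whose share grew the most
falling = min(shift, key=shift.get)  # concept whose share shrank the most
print(rising, falling)
```

A positive value in `shift` marks a concept that takes up a larger share of the labels in the newer documents, and a negative value marks one that is fading.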
More specifically, the use of words and phrases to describe themes and concepts in text documents is now being studied by the information retrieval community. For example, mathematical models that treat word associations as weighted vectors representing "concepts" found within documents have been proposed. This "vector" approach allows a query to identify and retrieve a document even when the query and the document share no words but do share a similar concept. The technique is referred to as Latent Semantic Indexing (LSI) and is discussed in Deerwester et al., "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, 41(6):391-407, 1990. However, one problem with the LSI model is the amount of time it takes to "build" the model.
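The "vector" approach can be illustrated with a toy sketch. It assumes raw word counts as weights and obtains a rank-2 decomposition of the term-document matrix by power iteration with deflation; the corpus, the function names, and these simplifications are hypothetical and do not reproduce the procedure of Deerwester et al. The decomposition is the "build" step, and on a large collection it dominates the cost.

```python
import math


def tokenize(text):
    return text.lower().split()


def term_document_matrix(docs):
    """Rows are vocabulary terms, columns are documents, entries are raw counts."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    row = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(docs) for _ in vocab]
    for j, d in enumerate(docs):
        for w in tokenize(d):
            matrix[row[w]][j] += 1.0
    return vocab, matrix


def top_singular_triplets(a, k, iterations=200):
    """Approximate the k largest singular triplets (sigma, u, v) of matrix a."""
    n_terms, n_docs = len(a), len(a[0])
    gram = [[sum(a[t][i] * a[t][j] for t in range(n_terms))
             for j in range(n_docs)] for i in range(n_docs)]
    triplets = []
    for c in range(k):
        # A different start vector per component, so that the second search is
        # not orthogonal to the direction that remains after deflation.
        v = [1.0 / (j + c + 1) for j in range(n_docs)]
        for _ in range(iterations):
            v = [sum(gram[i][j] * v[j] for j in range(n_docs)) for i in range(n_docs)]
            norm = math.sqrt(sum(x * x for x in v))
            v = [x / norm for x in v]
        eigenvalue = sum(v[i] * sum(gram[i][j] * v[j] for j in range(n_docs))
                         for i in range(n_docs))
        sigma = math.sqrt(eigenvalue)
        u = [sum(a[t][j] * v[j] for j in range(n_docs)) / sigma for t in range(n_terms)]
        triplets.append((sigma, u, v))
        # Deflate: remove the component just found before the next search.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(n_docs)]
                for i in range(n_docs)]
    return triplets


def fold_query(query, vocab, triplets):
    """Project a query's term counts into the latent concept space."""
    words = tokenize(query)
    q = [float(words.count(w)) for w in vocab]
    return [sum(qi * ui for qi, ui in zip(q, u)) / sigma for sigma, u, _ in triplets]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


# Hypothetical four-document corpus with two underlying concepts.
docs = ["car engine", "automobile engine", "flower garden", "flower petal"]
query = "car"

vocab, matrix = term_document_matrix(docs)
triplets = top_singular_triplets(matrix, k=2)
q_hat = fold_query(query, vocab, triplets)
doc_vectors = [[v[j] for _, _, v in triplets] for j in range(len(docs))]
sims = [cosine(q_hat, dv) for dv in doc_vectors]
print([round(s, 3) for s in sims])
```

In this sketch the query "car" scores highly against "automobile engine" although the two share no words, because both co-occur with "engine" and therefore fall along the same latent dimension, while the two "flower" documents score near zero.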
The use of words and phrases to build more advanced queries to discover trends in databases is of recent advent. Various techniques have been proposed, such as identifying phrases as concepts and as relationships between concepts, where the quality of text categorization is improved by using word clusters and phrases. However, one problem with such phrase-based database content analysis techniques is the difficulty of implementing them in existing databases. The database systems of today offer little functionality to support such "mining" applications, and machine learning techniques perform poorly when applied to very large databases. The difficulty in implementing a phrase-based analysis method is one reason why the discovery of trends in text databases has not evolved as quickly as might be expected.
Although these trend-finding methods constitute a significant advance and in some instances enjoy commercial success today, the assignee of the present application has continually sought to improve the performance and efficiency of these data analysis systems. The problem with presently known methods is that trends in databases may not be easily and efficiently discovered using current techniques.