The present invention relates to a method and a system for detecting important patterns from a set of data with complex structure, in particular, for rapidly and automatically detecting said patterns while considering the data structure.
Data that has structure, but no clearly fixed structure, has been increasing recently; examples include data with complex links on the World Wide Web (WWW) and collections of documents (text databases) written in natural languages. Such data is called semi-structured data. Because the structure of semi-structured data is itself unknown, a fixed view cannot be given for detecting structural patterns. In general, data that has complex structure can be represented by graphs comprising vertices and edges. For instance, a sentence written in a natural language can be expressed, through parsing and semantic analysis, as a network among concepts. Likewise, the organization and employees of a company can be represented as a graph combining related entities. When a set of such graphs is given as a database, detecting the subgraphs that appear frequently in it as patterns makes it possible to extract important concepts contained in the set of data. Nevertheless, no algorithm is known for solving this problem at high speed.
Data mining is known as a method for detecting patterns of data structure from a set of data. Association rules are a well-known kind of pattern that can be detected by data mining. Conventionally, when detecting association rules, a database is regarded as a set of transactions, and each transaction is regarded as a set of items. Sets of items that frequently appear together (co-occur) in the transactions of the database are acquired, and association rules are derived therefrom. In this case, the subject data are sets of simple items, and the derived patterns are also sets of simple items. Complex data structure can, moreover, be simplified by having a user give a view. Conventionally, when data mining has been applied to complex data, it has in fact been applied to data simplified by giving such a fixed view. This approach is limited, however, to cases where a fixed view can be given, and such simplification is not easy when it is not known in advance which part of the complex structure deserves attention. Indeed, because the purposes of data mining are the detection of unknown knowledge and the analysis of data, it is inherently difficult to give a view fixed in advance. In addition, because the data structure is decomposed and converted into a flat format in advance in order to detect patterns under a fixed view, patterns cannot be detected by effectively using the data structure. Moreover, as a matter of course, the portions to which attention was not paid are not covered by detection.
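The conventional counting of co-occurring item sets described above can be illustrated by a minimal sketch. The function name `frequent_itemsets`, the `min_support` threshold, and the sample transactions are assumptions introduced for illustration only; they do not appear in the specification.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, size):
    """Return item sets of the given size found in at least min_support transactions.

    Hypothetical helper: each transaction is treated as a flat set of items,
    which is the simplified data format the conventional approach relies on.
    """
    counts = Counter()
    for transaction in transactions:
        # Sort so that the same item set always yields the same tuple key.
        items = sorted(set(transaction))
        for itemset in combinations(items, size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Illustrative data only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread"},
]
pairs = frequent_itemsets(transactions, min_support=2, size=2)
print(pairs)
```

Because every transaction is flattened into a set, any structure among the items is lost before counting begins, which is the limitation noted above.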
Accordingly, an object of the present invention is to provide a method and a system for detecting important patterns contained in a set of data.
Another object is to provide a method and a system for detecting important patterns contained in a set of data at high speed and in a short time.
A further object is to provide a method and a system for detecting important patterns contained in a set of data by effectively using the data structure.
A still further object is to provide a method and a system for mining a set of data without giving a fixed view.
To solve the above problems, the present invention is a system for detecting frequent association patterns from databases of tree-structured data by using candidate patterns for counting, comprising:
(1) means for counting, from the databases, patterns matching the candidate patterns;
(2) means for detecting frequent patterns from the result of the counting; and
(3) means for generating candidate patterns for the next counting from the frequent patterns detected.
FIG. 1 shows an overview of the present invention. To detect frequent association patterns from databases of tree-structured data, the system consists of block 110 for counting, from the databases, patterns matching the candidate patterns, block 120 for detecting frequent patterns from the result of the counting, and block 130 for generating candidate patterns for the next counting from the frequent patterns detected.
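The cycle of counting, detecting, and generating can be sketched as a level-wise loop. This is only an illustration under strong assumptions: trees are represented as `(label, [children])` tuples, and a pattern is restricted to a downward path of labels, which is far simpler than the association patterns contemplated here. All function names and the sample data are hypothetical.

```python
from collections import Counter

def paths_of_length(tree, k):
    """Return the set of downward label paths with exactly k nodes in one tree."""
    found = set()

    def walk(node, prefix):
        label, children = node
        # Keep only the last k labels on the path from the root to this node.
        current = (prefix + (label,))[-k:]
        if len(current) == k:
            found.add(current)
        for child in children:
            walk(child, current)

    walk(tree, ())
    return found

def count_matches(database, candidates, k):
    """Counting step: number of trees in which each candidate occurs."""
    counts = Counter()
    for tree in database:
        present = paths_of_length(tree, k)
        for candidate in candidates:
            if candidate in present:
                counts[candidate] += 1
    return counts

def detect_frequent(counts, min_support):
    """Detection step: keep candidates that reach the support threshold."""
    return {p: n for p, n in counts.items() if n >= min_support}

def generate_candidates(frequent):
    """Generation step: join paths p and q when p minus its first label
    equals q minus its last label, giving a path one node longer."""
    return {p + (q[-1],) for p in frequent for q in frequent if p[1:] == q[:-1]}

def mine(database, min_support):
    """Repeat the three steps until no candidate remains."""
    k = 1
    candidates = set().union(*(paths_of_length(t, 1) for t in database))
    result = {}
    while candidates:
        frequent = detect_frequent(count_matches(database, candidates, k), min_support)
        result.update(frequent)
        candidates = generate_candidates(frequent)
        k += 1
    return result

# Illustrative database of three small trees.
database = [
    ("buy", [("customer", []), ("product", [("price", [])])]),
    ("buy", [("product", [("price", [])])]),
    ("sell", [("product", [])]),
]
patterns = mine(database, min_support=2)
print(patterns)
```

Generating each new candidate only from patterns already found frequent keeps the counting step small, since a longer path cannot be frequent unless its shorter sub-paths are.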
As another form, the present invention is a text-mining system for extracting useful concepts from a large volume of text data, comprising:
(1) means for parsing sentences in the text data;
(2) means for generating structured trees based on the results of the parsing;
(3) means for creating databases comprising sets of the structured trees;
(4) means for counting, from the databases, patterns matching candidate patterns;
(5) means for detecting frequent patterns from the result of the counting; and
(6) means for generating candidate patterns for next counting from the frequent patterns detected.
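The first five means listed above can be sketched end to end as follows. The parser here is a deliberately trivial stand-in that assumes every sentence has exactly the form "subject verb object"; a real system would use a proper natural-language parser. The function names, the tree layout with the verb as root, and the sample text are all assumptions made for illustration, and candidate generation for the next counting is omitted.

```python
from collections import Counter

def toy_parse(sentence):
    """Stand-in for parsing: assumes exactly three words, 'subject verb object'."""
    subject, verb, obj = sentence.lower().split()
    return {"subject": subject, "verb": verb, "object": obj}

def to_tree(parse):
    """Build a structured tree with the verb as root and its arguments as children."""
    return (parse["verb"], [(parse["subject"], []), (parse["object"], [])])

def build_database(text):
    """Create a database as a list of structured trees, one per sentence."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [to_tree(toy_parse(s)) for s in sentences]

def count_edges(database):
    """Count, once per tree, each distinct (parent label, child label) pair."""
    counts = Counter()
    for root, children in database:
        for edge in {(root, child_label) for child_label, _ in children}:
            counts[edge] += 1
    return counts

# Illustrative text only.
text = "Customer returns printer. Customer returns monitor. Dealer repairs printer."
database = build_database(text)
counts = count_edges(database)
frequent = {edge: n for edge, n in counts.items() if n >= 2}
print(frequent)
```

With this sample text, only the relation between "returns" and "customer" recurs often enough to be kept, which hints at how recurring structures in parsed sentences can surface as candidate concepts.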