Information retrieval from databases is widely used in various fields. Information retrieval techniques may be implemented using various algorithms, such as a tree model or a vector space model. Among these methods, the vector space model has been used for retrieving, clustering, and tracking information in very large databases.
Conventionally, information retrieval based on the vector space model is applied to documents with rigidly determined keywords and a standard format for generating keyword-document vectors. Documents that use different keywords to refer to the same content or meaning tend to pose problems when forming clusters. In commercial or other sophisticated databases, keywords may be rigidly determined or selected according to proper rules when the documents are accumulated in the database. However, a database accumulating chat, mail, free postings on particular issues, or on-line discussion flows may comprise documents or information with dissimilar or different keywords, even though the keyword sets point to the same issues, topics, or items through semantically related terms, synonyms, or parts of the keywords.
For such information, typical cluster search algorithms perform poorly in forming clusters because the keywords differ even though they relate to the same items. Information tracking in such loosely controlled or uncontrolled documents likewise has difficulty identifying focused topics or items as the documents evolve over time.
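The keyword mismatch described above can be illustrated with a minimal sketch. It assumes raw term-count vectors and cosine similarity as the comparison measure. The three sample postings and the names `keyword_vector` and `cosine` are hypothetical and are not taken from the cited literature.

```python
import math
from collections import Counter

def keyword_vector(text, vocabulary):
    """Build a term-count vector for `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical postings: post_a and post_b describe the same topic
# with synonyms; post_c shares literal keywords with post_a.
docs = {
    "post_a": "car engine repair",
    "post_b": "automobile motor fix",
    "post_c": "car engine noise",
}

vocabulary = sorted({word for text in docs.values() for word in text.split()})
vectors = {name: keyword_vector(text, vocabulary) for name, text in docs.items()}

sim_ab = cosine(vectors["post_a"], vectors["post_b"])  # no shared keywords
sim_ac = cosine(vectors["post_a"], vectors["post_c"])  # two shared keywords

print(f"post_a vs post_b: {sim_ab:.3f}")
print(f"post_a vs post_c: {sim_ac:.3f}")
```

In this sketch the two postings on the same topic share no literal keyword, so their similarity is zero and a clustering step driven by this measure would not group them, whereas the postings that share literal keywords score higher.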
In addition, typical vector space model algorithms consume substantial hardware resources such as CPU time and memory, and the computation for dimension reduction can itself take long CPU time. Cluster formation based on the vector space model further requires additional algorithms for generating the clusters. Moreover, such cluster formation may not be sufficiently relevant to items that change or evolve over time.
Even in a database in which documents originating from heterogeneous sources are accumulated over time, it is useful and necessary to retrieve, search, or track focused items or matters in the documents with respect to the time-dependent accumulation of the documents. Such an information retrieval algorithm would provide some prediction of the items included in the documents as the accumulation progresses over time.
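As a rough illustration of tracking an item over the time-dependent accumulation of documents, the following sketch counts keyword occurrences per fixed-size time window. It is only an assumed, simplified approach for illustration, not the method of the present invention. The integer timestamps, the window size, the sample texts, and the name `keyword_counts_by_window` are hypothetical.

```python
from collections import Counter, defaultdict

def keyword_counts_by_window(timed_docs, window):
    """Group (timestamp, text) pairs into fixed-size time windows
    and count keyword occurrences within each window."""
    buckets = defaultdict(Counter)
    for timestamp, text in timed_docs:
        # Integer division maps each timestamp to its window index.
        buckets[timestamp // window].update(text.lower().split())
    return dict(buckets)

# Hypothetical postings with integer timestamps (e.g. days since start).
timed_docs = [
    (0, "battery recall"),
    (1, "battery price"),
    (5, "battery recall notice"),
    (6, "recall recall refund"),
]

counts = keyword_counts_by_window(timed_docs, window=5)

# Occurrences of one keyword per window, in chronological order.
recall_trend = [counts[w]["recall"] for w in sorted(counts)]
print(recall_trend)
```

A rising count for a keyword across successive windows suggests that the corresponding item is gaining focus, although literal counting of this kind is subject to the same keyword-mismatch limitation noted earlier.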
For example, such analysis may be useful for, but is not limited to, stock price prediction, product-trend prediction, market research, trend searches of academic or patent publications, or prediction of the items that will become the focus in the next stage, based on the accumulated documents and/or text transmitted between parties.
Detailed algorithms of the vector space model and their particular implementations, which provide the underlying technologies of the present invention, are reviewed in the following patent and non-patent literature: Japanese patent applications JP2001-312505, JP2002-024268, JP2002-030222, and JP2003-141160; an article by Mei Kobayashi, Masaki Aono and Michael E. Houle, entitled “Mining overlapping major and minor clusters in massive databases”, Invited Talk, Industry Day, Special Technologies Workshop #6, organized by Noel Barton, International Conference on Industrial and Applied Mathematics (ICIAM), Sydney, Australia, 2003; and an article by Mei Kobayashi and Masaki Aono, entitled “Vector space models for search and cluster mining”, in Michael Berry (ed.), “Survey of Text Mining: Clustering, Classification, and Retrieval”, Springer, N.Y., USA, 2003, pp. 101-122.