The present invention relates to detecting and tracking evolution of new events and/or classes of documents in a large database, and more particularly relates to a method, a system, and a program product for detecting and tracking the evolution of the new events and/or classes of the documents in a very large database by simultaneously taking into account a temporal parameter such as time, a date, or a year and any combinations thereof in a vector modeled document.
Recent database systems must handle increasingly large amounts of data, such as news data, client information, and stock data. Users of such databases find it difficult to search for desired information quickly, effectively, and with sufficient accuracy. Therefore, timely, accurate, and inexpensive detection of new topics and/or events from large databases may provide very valuable information for many types of businesses including, for example, stock control, futures and options trading, news agencies which need to dispatch a reporter quickly but cannot afford to post reporters worldwide, and businesses based on the Internet or other fast-paced activities which need to know major and new information about competitors in order to succeed.
Conventionally, detection and tracking of new events in enormous databases is expensive, elaborate, and time-consuming work, because a searcher of the database usually needs to hire extra persons to monitor the database.
Recent detection and tracking methods used for search engines mostly use a vector model for data in the database in order to cluster the data. These conventional methods generally construct a vector q (kwd1, kwd2, . . . kwdN) corresponding to the data in the database. The vector q is defined as the vector having a dimension equal to the number of attributes, such as kwd1, kwd2, . . . kwdN, which are attributed to the data. The most commonly used attributes are keywords, i.e., single keywords, phrases, and names of person(s) or place(s). Usually, a binary model is used to create the vector q mathematically, in which kwd1 is replaced by 0 when the data do not include kwd1, and kwd1 is replaced by 1 when the data include kwd1. Sometimes, a weight factor is combined with the binary model to improve the accuracy of the search. Such a weight factor includes, for example, the number of appearances of the keywords in the data.
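The binary and weighted vector models described above can be illustrated with a minimal sketch. The function name `keyword_vector`, the whitespace tokenization, and the three example keywords are assumptions made only for illustration and are not taken from the cited methods.

```python
from collections import Counter

def keyword_vector(text, vocabulary, weighted=False):
    """Return the vector q for `text` over the ordered keyword list `vocabulary`.

    Hypothetical helper: tokenization by whitespace is an assumption.
    """
    counts = Counter(text.lower().split())
    if weighted:
        # Weighted model: each entry is the number of appearances of the keyword.
        return [counts[kwd] for kwd in vocabulary]
    # Binary model: 1 when the data include the keyword, 0 otherwise.
    return [1 if counts[kwd] > 0 else 0 for kwd in vocabulary]

vocabulary = ["merger", "stock", "election"]  # hypothetical kwd1..kwd3
doc = "stock prices rose after the merger as stock traders reacted"

binary_q = keyword_vector(doc, vocabulary)
weighted_q = keyword_vector(doc, vocabulary, weighted=True)
print(binary_q, weighted_q)
```

In this example the keyword "stock" appears twice, so it contributes 1 to the binary vector and 2 to the weighted vector.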
In such vector model of the database, conventionally the clustering of the data in the database is first carried out based on the keywords. The procedure of the clustering mostly uses the scalar product of the vector q. In the clustering of the data, each vector corresponding to the data in the database is categorized into some clusters having a predetermined range of the scalar product. Then the clusters are further clustered using a date/time stamp attributed to the data for detecting and tracking the new event. The conventional search method uses a two-step clustering process for detecting and tracking the new events as described above, and therefore, the search procedure becomes elaborate and expensive work.
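The two-step process described above may be sketched as follows, under stated assumptions: the scalar product is normalized (cosine similarity), the keyword clustering is a simple greedy assignment against the first member of each cluster, and the second step groups by identical date strings. The name `two_step_clusters` and the threshold value are hypothetical; the conventional methods may differ in all of these details.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Normalized scalar product of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def two_step_clusters(docs, threshold=0.8):
    """docs is a list of (keyword_vector, date_string) pairs; returns index lists."""
    # Step 1: cluster on keywords only, using the scalar product.
    keyword_clusters = []
    for i, (vec, _) in enumerate(docs):
        for cluster in keyword_clusters:
            if cosine(vec, docs[cluster[0]][0]) >= threshold:
                cluster.append(i)
                break
        else:
            keyword_clusters.append([i])
    # Step 2: cluster each keyword cluster again, this time by date stamp.
    result = []
    for cluster in keyword_clusters:
        by_date = defaultdict(list)
        for i in cluster:
            by_date[docs[i][1]].append(i)
        result.extend(by_date.values())
    return result

docs = [
    ([1, 1, 0], "1999-07-01"),
    ([2, 1, 0], "1999-07-01"),
    ([1, 1, 0], "1999-08-15"),
    ([0, 0, 1], "1999-07-01"),
]
clusters = two_step_clusters(docs)
print(clusters)
```

The sketch makes the cost visible: every document is examined once for the keyword clustering and again for the date grouping.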
Therefore, there is a need for a system implementing a novel method for detecting new events and/or classes and tracking evolution of the new events in an inexpensive and automatic manner.
In "Maximizing text-mining performance", IEEE Intelligent Systems, July/August, 1999, pp. 1307-1313 by S. Weiss et al. at IBM T. J. Watson Laboratory, a method for detecting and tracking new events, which uses a combination of decision tree algorithms and adaptive sampling, is disclosed. The method disclosed by Weiss et al. may provide a method for detecting and tracking new events, but has the disadvantage of requiring training sets of sample documents to compile a dictionary.
In "Topic detection and tracking pilot study final report", Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, February, 1998, Morgan Kaufmann San Francisco, pp. 194-218, 1998, by J. Allan et al., at University of Massachusetts, Amherst, CMU, and Dragon Systems, the "Dragon Systems" approach uses a probabilistic (Hidden Markov Model) method to cluster documents based on words and sentences in articles. The "Dragon Systems" approach also has the disadvantage of requiring a training set to start the system. UMass (University of Massachusetts) uses a content based LCA (local content analysis) method, and this method is so slow that the search speed becomes unacceptable. The Carnegie-Mellon University's system is directed to searching multimedia data such as audio news and video data. It is based on probabilistic methods.
In "Intelligent Information Retrieval", IEEE Intelligent Systems, July/August, 1999, pp. 30-31 by Y. Young et al., a method which uses a group average clustering and an independent time stamp-weighting factor is disclosed. The weighting factor is also disclosed in "Clustering algorithms", pp. 419-442 in W. Frakes and R. Baeza-Yates (Editors), "Information Retrieval: data structures and algorithms", Prentice-Hall, Englewood Cliffs, N.J., 1992, and E. Rasmussen and "Recent trends in hierarchic clustering: a critical review", Information Processing and Management, Vol. 24, No. 5, pp. 577-597, 1988.
In "CMU Infomedia-KNN-based Topic Detection":
http://www.informedia.cs.cmu.edu./HDWBerk/tsld001.htm, a training index with pre-labeled topics is provided.
The detail is:
45000 broadcast News stories from 1995 to 1996,
3178 different news topics, each occurring more than 10 times
Search for top 10 related stories in training index
Lookup topics for related stories
Re-weight topics by story relevance (select top 5)
At 5 topics, Recall is reported to be 0.491 and Relevance is reported to be 0.482
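The listed steps can be read as a k-nearest-neighbor lookup against a pre-labeled training index. The following is only a sketch of that reading: the function name `knn_topics`, the cosine relevance measure, and the tiny three-story index are assumptions for illustration and are not details of the CMU system.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Normalized scalar product, used here as an assumed relevance measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_topics(story_vec, training_index, k=10, top_topics=5):
    """training_index is a list of (vector, [topic labels]) pairs."""
    # Search for the top k related stories in the training index.
    scored = sorted(
        ((cosine(story_vec, vec), topics) for vec, topics in training_index),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Look up the topics of the related stories and re-weight them by relevance.
    weights = defaultdict(float)
    for relevance, topics in scored:
        for topic in topics:
            weights[topic] += relevance
    # Select the highest-weighted topics.
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [topic for topic, _ in ranked[:top_topics]]

# Hypothetical pre-labeled training index with three stories.
training_index = [
    ([1, 0, 0], ["economy"]),
    ([1, 1, 0], ["economy", "markets"]),
    ([0, 0, 1], ["sports"]),
]
best = knn_topics([1, 0, 0], training_index, k=2, top_topics=1)
print(best)
```

The dependence on `training_index` is the point of interest here: without pre-labeled stories the lookup has nothing to return.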
In "NIST Topic Detection and Tracking Evaluation Project":
http://www.itl.nist.gov/iaui/894.01/proc/darpa98/index.htm, the U.S. National Institute of Standards and Technology (NIST) discloses the results of the evaluation conducted in 1997, as listed in Table I.
An object of the present invention is to provide a novel method for detecting new events and/or classes of the documents and tracking evolution thereof in a database.
Another object of the present invention is to provide a novel system for detecting new events and/or classes of the documents and tracking evolution thereof in a database.
Further, another object of the present invention is to provide a novel program product for detecting new events and/or classes of the documents and tracking evolution thereof in a database.
The present invention essentially utilizes a novel method for detecting and tracking the new events and/or classes of the documents in a very large database by simultaneously taking into account a time stamp parameter such as date and time in a vector modeled document.
In a first aspect of the present invention, a method for detecting new events and/or classes of documents and tracking evolution thereof in a database, said new events and/or classes of said documents being added to said database, said documents including attribute data related to a temporal parameter, said method comprises steps of:
providing vectors of said documents based on attribute data simultaneously including said temporal parameter included in said document, and
detecting said new events and/or classes of said documents and tracking evolution thereof simultaneously using said vectors.
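One possible reading of the two steps above is that the temporal parameter becomes an additional component of the document vector, so that a single similarity computation reflects both keywords and time. The sketch below assumes a particular encoding (a decaying closeness to a reference date) purely for illustration; the names `document_vector`, `time_scale_days`, and `temporal_weight` are hypothetical and the encoding is not stated in this section.

```python
import math
from datetime import date

def cosine(u, v):
    """Normalized scalar product of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def document_vector(keyword_weights, doc_date, reference_date,
                    time_scale_days=30.0, temporal_weight=1.0):
    """Append one temporal component to the keyword weights.

    Assumed encoding: the component equals temporal_weight on the reference
    date and decays as the document date moves away from it.
    """
    elapsed = abs((doc_date - reference_date).days)
    temporal = temporal_weight / (1.0 + elapsed / time_scale_days)
    return list(keyword_weights) + [temporal]

ref = date(1999, 7, 1)
a = document_vector([1, 1, 0], date(1999, 7, 1), ref)
b = document_vector([1, 1, 0], date(1999, 7, 2), ref)
c = document_vector([1, 1, 0], date(2000, 4, 26), ref)  # 300 days later

# Same keywords in all three documents; only the dates differ.
print(cosine(a, b), cosine(a, c))
```

Documents `a` and `b` score higher against each other than `a` and `c` do, even though all three share the same keywords, so no second clustering pass over date stamps is needed in this sketch.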
In the first aspect of the present invention, said attribute data may include at least one keyword, and said keyword is weighted with respect to a frequency of appearance in said document.
In the first aspect of the present invention, said detecting and tracking step may further include a step of providing a temporal window such that said detecting and tracking step is executed using said temporal window.
In the first aspect of the present invention, said temporal window may be a delta function with respect to a specific date.
In the first aspect of the present invention, said temporal window may be a symmetric Gaussian function.
In the first aspect of the present invention, said temporal window may be a step function.
In the first aspect of the present invention, said temporal window may be formed interactively by a user on a display window.
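The three window shapes named above can be written down directly. In this sketch time is measured as an integer day offset, the Gaussian width `sigma` is a free parameter, and the step function is taken to be a single-threshold (Heaviside-type) step; all of these are assumptions for illustration, and a window drawn interactively by a user would simply be another function of the same form.

```python
import math

def delta_window(t, t0):
    """Delta function with respect to a specific date t0: only t0 passes."""
    return 1.0 if t == t0 else 0.0

def gaussian_window(t, t0, sigma):
    """Symmetric Gaussian centered on t0; sigma (in days) sets the width."""
    return math.exp(-((t - t0) ** 2) / (2.0 * sigma ** 2))

def step_window(t, t0):
    """Step function: 0 before t0, 1 from t0 onward (assumed orientation)."""
    return 1.0 if t >= t0 else 0.0

# Evaluate each window over day offsets 0..10 around the specific date t0 = 5.
days = range(11)
delta_values = [delta_window(t, 5) for t in days]
gauss_values = [gaussian_window(t, 5, 2.0) for t in days]
step_values = [step_window(t, 5) for t in days]
print(delta_values)
print(step_values)
```

A rectangular window covering a bounded period could be formed as the difference of two such step functions.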
In the first aspect of the present invention, said temporal parameter may be further weighted with respect to time elapse about a specific date, and a weight of said temporal parameter may be less than the total weight of said keywords.
In the first aspect of the present invention, said temporal window may be normalized before dimensional reduction for said vectors is carried out if the number of said keywords in each document is relatively constant in said database and the same temporal window is used for all of the documents.
In the first aspect of the present invention, several different temporal windows may be provided so that the relative weights between said keywords and said temporal parameter become relatively constant from document to document if the number of said keywords in each document varies greatly.
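The weighting conditions above may be illustrated by scaling the temporal component against the total keyword weight of each document. This is a sketch under a stated assumption: providing "several different temporal windows" is modeled here as scaling the height of one window per document, with a hypothetical parameter `ratio` fixing the temporal weight as a fraction of the total keyword weight.

```python
def temporal_component(keyword_weights, window_value, ratio=0.5):
    """Return the temporal weight for one document.

    window_value is the temporal window evaluated at the document's date,
    assumed to lie in [0, 1]. Requiring ratio < 1 keeps the temporal weight
    below the total weight of the keywords.
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must satisfy 0 <= ratio < 1")
    total_keyword_weight = sum(keyword_weights)
    return ratio * total_keyword_weight * window_value

short_doc = [1, 1]        # few keywords
long_doc = [3, 2, 4, 1]   # many keywords

short_t = temporal_component(short_doc, 1.0)
long_t = temporal_component(long_doc, 1.0)

# The temporal-to-keyword weight ratio is the same for both documents.
print(short_t / sum(short_doc), long_t / sum(long_doc))
```

When the number of keywords per document is roughly constant, the same scaling yields nearly the same window for every document, which corresponds to the case where a single normalized window suffices.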
In a second aspect of the present invention, a computer system including a database to which new events and/or classes of documents are added, said documents including data related to a temporal parameter, and detecting new events and/or classes of said documents and tracking evolution thereof being executed in said computer system comprises:
means for providing vectors of said documents based on attribute data simultaneously including said temporal parameter included in said document, and
means for detecting said new events and/or classes of said documents and tracking evolution thereof simultaneously using said vectors.
In the second aspect of the present invention, said attribute data may include at least one keyword, and said keyword is weighted with respect to a frequency of appearance in said documents.
In the second aspect of the present invention, said detecting and tracking means may further include means for providing a temporal window such that said temporal window is used by said detecting and tracking means.
In the second aspect of the present invention, said temporal window may be a delta function with respect to a specific date.
In the second aspect of the present invention, said temporal window may be a symmetric Gaussian function.
In the second aspect of the present invention, said temporal window may be a step function.
In the second aspect of the present invention, said temporal window may be formed interactively by a user on a display window.
In the second aspect of the present invention, said temporal parameter may be further weighted with respect to time elapse about a specific date, and a weight of said temporal parameter may be less than the total weight of said keywords.
In the second aspect of the present invention, said temporal window may be normalized before dimensional reduction for said vectors is carried out if the number of said keywords in each document is relatively constant in said database and the same temporal window is used for all of the documents.
In the second aspect of the present invention, several different temporal windows may be provided so that the relative weights between said keywords and said temporal parameter may become relatively constant from document to document if the number of said keywords in each document varies greatly.
In the second aspect of the present invention, said computer system may comprise a server and at least one client, and said detection and tracking may be requested from at least one client computer which transmits a request to said server and receives a result of said detection and tracking.
In a third aspect of the present invention, a program product for detecting new events and/or classes of documents and tracking evolution thereof in a database, said new events and/or classes of said documents being added to said database, said documents including attribute data related to a temporal parameter, said program product causing a computer to execute a method comprising steps of:
providing vectors of said documents based on attribute data simultaneously including said temporal parameter included therein, and
detecting said new events and/or classes of said documents and tracking evolution thereof simultaneously using said vectors.
In the third aspect of the present invention, said attribute data may include at least one keyword, and said keyword may be weighted with respect to a frequency of appearance in said documents.
In the third aspect of the present invention, said detecting and tracking step may further include a step of providing a temporal window such that said detecting and tracking step is executed using said temporal window.
In the third aspect of the present invention, said temporal window may be a delta function with respect to a specific date.
In the third aspect of the present invention, said temporal window may be a symmetric Gaussian function.
In the third aspect of the present invention, said temporal window may be a step function.
In the third aspect of the present invention, said temporal window may be formed interactively by a user on a display window.
In the third aspect of the present invention, said temporal parameter may be further weighted with respect to time elapse from a specific date, and a weight of said temporal parameter may be less than the total weight of said keywords.
In the third aspect of the present invention, said temporal window may be normalized before dimensional reduction for said vectors is carried out if the number of said keywords in each document is relatively constant in said database and the same temporal window is used for all of the documents.
In the third aspect of the present invention, several different temporal windows may be provided so that the relative weight between said keywords and said temporal parameter may become relatively constant from document to document if the number of said keywords in each document varies greatly.
The present invention will be further understood by explaining the following non-limiting embodiments of the present invention along with drawings thereof.