As the Internet grows into the major medium through which people publish and exchange information, Internet content increasingly reflects the hot social issues and focal events that people care about. Monitoring the hot issues and focal incidents reflected in Internet information has therefore become a natural demand. Both ordinary users and professionals want an automatic tool or method that helps them track, in real time, the latest and hottest issues or news in the fields they are interested in, so that they can follow the latest developments in those fields.
It is easy to observe that, in most cases, a concentrated and intensive emergence of certain keywords in Internet information corresponds to the occurrence of some hot news or focal incident; conversely, once news or an incident attracts wide attention, a large volume of text containing the related keywords appears on the Internet within a short time. A large change in the frequency of hot keywords on the Internet therefore generally reflects the occurrence or cool-down of hot social news or incidents, and Internet text reflecting such news or incidents in turn stimulates the interest and opinions of large numbers of Netizens. In other words, an exceptionally high keyword frequency is correlated with significantly hot news and incidents. For this reason, this invention does not attempt to predict issues whose keyword frequency changes little; it is concerned only with exceptionally large changes in keyword frequency. This invention is a valuable tool for Internet survey institutions, and for institutions that focus on hot social news and incidents, to trace the emergence frequency of hot words automatically.
All words involved in the method discussed here refer to keywords in Internet information.
Different words have different emergence frequencies, so the same emergence frequency on a specific day carries a different meaning for words whose historical frequencies differ. For a frequently used word, both the historical average and the historical standard deviation of its emergence frequency are large (for instance, 500 times per day and 350 times per day, respectively). If its emergence frequency on the Internet increases by 300 on some day and becomes 800, i.e., a 60% rise, this is still regarded as normal; however, if its emergence frequency becomes 1200, i.e., more than doubles, this may indicate the occurrence of corresponding hot news or incidents.
For a word used less frequently, both the historical average and the historical standard deviation of its daily emergence frequency on the Internet are low, for instance 20 times and 15 times per day, respectively. If on some day its emergence frequency increases by 30 and becomes 50, i.e., more than doubles, this is generally still regarded as normal; however, if its emergence frequency on some day increases by 300 and becomes 320, this indicates the occurrence of corresponding hot incidents or news.
In other words, the same increment of 300 occurrences is normal for a high-frequency word yet indicates an abnormal incident for a low-frequency word; that is, the criterion for judging abnormality differs between words with different emergence frequencies.
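The arithmetic behind this comparison can be illustrated with a short sketch. It expresses each daily count as a number of historical standard deviations above the historical average, using the example figures given above. The function name `deviation_in_sigmas` and the choice of this particular normalization are assumptions made for illustration only; the text so far does not state the exact criterion the invention applies, and no decision threshold is implied here.

```python
def deviation_in_sigmas(count, hist_mean, hist_std):
    """Return how many historical standard deviations `count` lies above the mean.

    Hypothetical helper for illustration; not the invention's stated criterion.
    """
    if hist_std <= 0:
        raise ValueError("hist_std must be positive")
    return (count - hist_mean) / hist_std


# Example figures taken from the text: (historical average, historical std. dev.)
high_freq = (500, 350)
low_freq = (20, 15)

cases = [
    ("high-frequency word, 800 per day", high_freq, 800),
    ("high-frequency word, 1200 per day", high_freq, 1200),
    ("low-frequency word, 50 per day", low_freq, 50),
    ("low-frequency word, 320 per day", low_freq, 320),
]

for label, (mean, std), count in cases:
    score = deviation_in_sigmas(count, mean, std)
    print(f"{label}: {score:.2f} standard deviations above average")
```

Under this normalization, the same increment of 300 occurrences is less than one standard deviation for the high-frequency word but twenty standard deviations for the low-frequency word, which is the asymmetry the paragraph above describes.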
For the low-frequency word, the increment above (300 occurrences) is called an unusually high increment of word frequency. The main aim of this invention is to monitor unusually high increments of word frequency, thereby predict the occurrence or cool-down of hot-spot information on the Internet, and send an alarm if necessary.
Khoo K. B. et al. proposed a method for tracing hot issues in 2001. They periodically collected statistics on the emergence frequency of terms in a fixed set of websites or web pages, and obtained the hot issues at a given time by calculating the weight of each term at that time with the TF-IDF formula (Khoo K. B., Mitsuru I. Emerging Topic Tracking System. Advanced Issues of E-Commerce and Web-Based Information Systems, WECWIS 2001, Third International Workshop on. Feb. 11, 2001.), hereafter referred to as existing technology 1. Its contribution is that it provides a standard formula for calculating the current weight of each term; this weight changes with time and reflects the variation of hot spots in Internet information. Its main drawback is that the historical average and the historical standard deviation of each term are not considered, so abnormal hot spots cannot be determined accurately from the historical records of high-frequency and low-frequency words; only a cross-sectional comparison among terms at a given time can be performed.
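For orientation, the following sketch shows the textbook TF-IDF weight, not necessarily the exact variant used in existing technology 1, whose formula is not reproduced in this text. The function name `tf_idf`, its parameters, and the sample counts are all hypothetical. The point it illustrates is that the weight depends only on counts from the current snapshot, so no per-term history enters the calculation.

```python
import math


def tf_idf(term_count, total_terms, num_docs, docs_with_term):
    """Textbook TF-IDF weight of one term in one snapshot (illustrative only).

    term_count:     occurrences of the term in the snapshot
    total_terms:    total number of term occurrences in the snapshot
    num_docs:       number of pages in the snapshot
    docs_with_term: number of those pages that contain the term
    """
    if total_terms <= 0 or num_docs <= 0 or docs_with_term <= 0:
        raise ValueError("counts must be positive")
    tf = term_count / total_terms
    idf = math.log(num_docs / docs_with_term)
    return tf * idf


# Hypothetical snapshot: two terms weighted against each other at one time point.
snapshot = {
    "term_a": {"term_count": 800, "docs_with_term": 400},
    "term_b": {"term_count": 320, "docs_with_term": 40},
}
TOTAL_TERMS = 100_000  # assumed size of the snapshot, in term occurrences
NUM_DOCS = 1_000       # assumed number of pages in the snapshot

for term, c in snapshot.items():
    w = tf_idf(c["term_count"], TOTAL_TERMS, NUM_DOCS, c["docs_with_term"])
    print(f"{term}: weight = {w:.5f}")
```

Because every input comes from the current snapshot, such a weight can rank terms against one another at that moment but cannot tell whether a given term is unusually frequent relative to its own past.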