1. Field
One embodiment of the invention relates to a method and apparatus suitable for, for example, collecting Web pages that match a search condition designated by a user and generating, from the collected Web pages, time-series data that is divided into clusters.
2. Description of the Related Art
An information processing apparatus such as a personal computer generally has a Web browser. A Web browser is used to browse Web pages that are made public on the Internet by a Web server. Recent information processing apparatuses are able to start a search engine from a Web browser in response to a user's operation. The search engine receives a keyword designated by the user (a search condition) and collects a set of Web pages associated with (conforming to) the keyword. The search engine simply collects Web pages on the basis of their degree of association with the user's designated keyword. In other words, the search engine can neither collect Web pages in view of the degree of temporal association between them nor arrange the Web pages that are associated with each other.
Jpn. Pat. Appln. KOKAI Publication No. 2002-297883 (referred to as document 1 hereinafter) discloses a knowledge information management apparatus for storing conversation streams of business operations which are exchanged among the traders concerned through a network. This apparatus also stores objects necessary for carrying out the business operations in a process from the occurrence of a problem to the solution of the problem. The apparatus associates any one of the stored conversation streams and any one of the objects with each other. With this association, the apparatus can output information about the conversation stream and the object associated with each other.
The knowledge information management apparatus disclosed in document 1 collects specific conversation streams. To do so, an area from which the conversation streams are output needs to be specified in advance. When a conversation stream is not explicit, the apparatus cannot collect it.
Jpn. Pat. Appln. KOKAI Publication No. 2004-139376 (referred to as document 2 hereinafter) discloses a technique of monitoring a word-of-mouth site and analyzing the frequency with which a specified word-of-mouth is used at the word-of-mouth site. The fluctuations in the frequency during a specified time period of a notable event are analyzed. However, the technique disclosed in document 2 does not make it possible to determine a degree of association that takes into account the progression of a plurality of notable events over time.
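The kind of per-period frequency analysis attributed to document 2 might be sketched as follows. This is only an illustrative sketch, not the method of document 2: the function name `daily_frequency`, the sample posts, and the choice of one day as the time unit are all assumptions introduced here.

```python
from collections import Counter
from datetime import date


def daily_frequency(posts, term):
    """Count how often `term` appears per day.

    `posts` is an iterable of (day, text) pairs; both the data layout and
    the whitespace tokenization are assumptions made for illustration.
    """
    counts = Counter()
    needle = term.lower()
    for day, text in posts:
        counts[day] += text.lower().split().count(needle)
    # Return the counts ordered by day so fluctuations can be read off.
    return dict(sorted(counts.items()))


# Hypothetical sample data.
posts = [
    (date(2004, 1, 1), "recall notice for model X"),
    (date(2004, 1, 1), "model X recall rumor"),
    (date(2004, 1, 2), "nothing new today"),
]
freq = daily_frequency(posts, "recall")
```

Such a count tracks one term in isolation, which is consistent with the limitation noted above: it says nothing about how several notable events relate to each other as they progress over time.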
Jpn. Pat. Appln. KOKAI Publication No. 2004-185572 (referred to as document 3 hereinafter) discloses a word-of-mouth information analysis apparatus for extracting user information, time information and sentence information from collected sentences for each article. This apparatus can divide the sentence information into words and combine these words with the user information and time information into data. In document 3, however, the time information is simply used as one value that characterizes an article. It is thus impossible to determine a degree of association between articles that takes into account the progression of the articles over time.
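The data layout described for document 3, in which each word of an article is combined with the user information and time information, could look roughly like the following. This is a minimal sketch under assumptions: the function name `article_to_records`, the tuple layout, and the whitespace-based word division are hypothetical and are not taken from document 3.

```python
from datetime import datetime


def article_to_records(user, posted_at, sentence):
    """Divide a sentence into words and pair each word with user and time.

    Whitespace splitting stands in for whatever word division the real
    apparatus uses; that detail is an assumption.
    """
    return [(user, posted_at, word) for word in sentence.lower().split()]


# Hypothetical article.
records = article_to_records(
    "user_01", datetime(2004, 1, 1, 9, 0), "Battery drains fast"
)
```

In such records the time stamp is just one more field attached to each word, which illustrates why it does not by itself express how articles progress over time.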
Jpn. Pat. Appln. KOKAI Publication No. 2003-242165 (referred to as document 4 hereinafter) discloses a potential target extraction apparatus. The extraction apparatus acquires a time-series pattern having effective customer characteristics in consideration of time-series customer data in the field of communication service and the like. The extraction apparatus divides a plurality of quantitative attributes, which make up time-series data (customer data), into several sets of attributes in advance. The extraction apparatus performs clustering for the sets of attributes (i.e., attribute values of elements that make up time-series data). Quantitative time-series data is thereby converted into qualitative time-series data that is characterized by the clustering. The extraction apparatus classifies the qualitative time-series data into data of subscribers to a specific service (subscriber data) and data of nonsubscribers (nonsubscriber data). The apparatus extracts a pattern having a time-series characteristic of a specific set of attributes from the subscriber data. The apparatus then extracts, from the nonsubscriber data, time-series data of nonsubscribers that is similar to the extracted pattern, and determines these nonsubscribers (customers) to be potential customers.
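The conversion of quantitative time-series data into qualitative time-series data by clustering attribute values can be illustrated with a simplified sketch. The code below assumes a single numeric attribute and a plain one-dimensional k-means with evenly spaced initial centroids; document 4 clusters sets of attributes, so this is a reduced illustration, and the names `kmeans_1d`, `to_qualitative`, and the sample values are hypothetical.

```python
def kmeans_1d(values, k, iterations=20):
    """Return k centroids for a list of numbers (simple 1-D k-means).

    Initial centroids are spaced evenly between min and max so the result
    is deterministic; this initialization is an assumption of the sketch.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Keep the old centroid when a cluster receives no values.
        centroids = [
            sum(g) / len(g) if g else c for g, c in zip(groups, centroids)
        ]
    return centroids


def to_qualitative(series, centroids):
    """Replace each numeric value with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        for v in series
    ]


# Hypothetical monthly usage values for one customer.
usage = [10, 12, 11, 95, 100, 98]
centroids = kmeans_1d(usage, k=2)
labels = to_qualitative(usage, centroids)
```

Here each numeric value is replaced by a cluster label, so one quantitative series yields one qualitative series. The approach presupposes that every record carries the same attributes in the same positions.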
As the Internet becomes widespread, numerous topics develop day by day on, for example, Web bulletin boards. Most of the topics are insignificant. Even if nobody notices such a topic on a bulletin board, this matters little to specific persons or organizations. However, some of the topics may be disadvantageous to an individual or an organization, or may cause them to miss an opportunity to make a profit.
Notification of such topics is not always given to the persons or organizations concerned. The sites where the topics appear are not limited to a specific bulletin board. The topics of concern vary from person to person and from organization to organization. Meanwhile, a large number of topics develop on a large number of bulletin boards. It is therefore very difficult to check all of the topics and determine whether they are advantageous to specific persons and organizations.
It is thus required that data items including topics notable to a user be collected from a plurality of sites scattered on the Web and that the related data items be sorted in consideration of the lapse of time. However, none of documents 1 to 3 teaches obtaining a degree of association that takes into account the progression of a plurality of notable topics (events) over time.
Document 4 discloses a technique of extracting a pattern having a time-series characteristic of a specific set of attributes from the results of clustering for customer data (i.e., time-series data made up of a plurality of quantitative attributes) in the field of communication service and the like. In document 4, clusters are generated by clustering for attribute values of elements that make up time-series data.
The type, number or location of attribute values included in Web data collected from a plurality of sites (Web sites) scattered on the Web is not fixed, unlike those of the attribute values included in customer data. Clustering as disclosed in document 4 is therefore difficult to perform for the attribute values of Web data. Moreover, in document 4, one qualitative time-series data item is generated from one quantitative time-series data item. With such generation, it is difficult to sort data items (topics) associated with data that is collected from a plurality of Web sites and includes a user's notable topics in consideration of the progression of those data items (topics) over time.