Much of the content on the Web is available through syndication channels, which must be actively monitored to maintain an up-to-date profile of their published information over time. Such monitoring is essential for the next generation of Web 2.0 applications that provide sophisticated search and discovery services over Web information channels. Because channels are dynamic, a channel's profile can change over time. Maintaining a fresh channel profile is therefore difficult, especially under the constraint of a limited monitoring budget.
The number of diverse information channels available on the Web is rapidly increasing. These channels span many knowledge domains, such as news, stock and market reports, auctions, and, more recently, data gathered from blogs or wikis. Recent advances in Web technology, such as improved access to channels and new data delivery mechanisms for disseminating channel content, have resulted in the emergence of more advanced client-side Web applications.
These applications require sophisticated manipulation of channels on the Web, including the discovery, search, and recommendation of relevant channels. They include various Web 2.0 mashups, and situational applications in general, which integrate data gathered from several different, possibly inter-related, channels. An essential task for developers of such applications is to locate the relevant channels that will maximize the benefit gained from their applications.
A crucial step toward the support of such advanced services over channels is the ability to capture the essence of each Web channel. This can be done using channel profiles. A channel profile is a compact representation of the channel content, which can be used to summarize and capture the main characteristics of the content published on this channel. Profiles can simplify the way relevant channels can be located and can be used to match application requirements against the available set of channels managed by the system.
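To make the notion of a profile concrete, the following is a minimal sketch under stated assumptions. It assumes, purely for illustration, that a profile is the set of a channel's most frequent terms with relative weights and that application requirements are matched against profiles by cosine similarity. Neither choice is taken from the text above, and the names `build_profile` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def build_profile(entries, top_k=5):
    """Summarize a channel as its top_k most frequent terms with relative weights.

    Assumption: whitespace tokenization and raw term frequency stand in for
    whatever representation a real system would use.
    """
    counts = Counter()
    for text in entries:
        counts.update(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.most_common(top_k)}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * q.get(term, 0.0) for term, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


# Two toy channels and one application requirement expressed as a term vector.
finance = build_profile(["stock market rises", "stock prices fall"])
sports = build_profile(["football match tonight"])
requirement = {"stock": 1.0}

# Rank channels by how well their profiles match the requirement.
ranked = sorted(
    [("finance", finance), ("sports", sports)],
    key=lambda item: cosine_similarity(requirement, item[1]),
    reverse=True,
)
```

Because a profile in this sketch is only a small dictionary, matching a requirement against many channels costs one sparse dot product per channel rather than a scan of each channel's full content.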
Maintaining channel profiles is challenging for several reasons. First, channel content is usually dynamic, as in the case of Web feeds, where the content changes continuously, sometimes at a daily or even hourly rate. Because the profiles of such channels may change continually over time, capturing the dynamic trends of the channel content is difficult.
Second, the majority of channels on the Web are currently accessible only via pull protocols, since most servers refrain from supporting push protocols because of scalability issues. Previous work on novelty detection in data streams and on data stream summarization assumes that the stream of updates to a channel is pushed into the system. By contrast, in a pull-based scenario each channel must be actively monitored in order to collect enough snapshots to construct a fresh and reliable profile of its content. The freshness of the maintained profiles therefore depends directly on the rate at which channels are monitored. Moreover, novel content may be published at different rates on different channels; thus the profiles of different channels may change at different, possibly irregular, rates.
Third, in the pull-based scenario, channels may be volatile, meaning that novel content has a limited lifespan during which it is available on the channel. Such volatility is very common in Web feeds, where a channel holds only a limited number of feed entries. This limit is further determined by the feed's popularity and the feed provider's update policy (e.g., an overwrite policy under which the provider retains only the most recent entries of the feed). Monitoring channel profiles in a pull setting is therefore challenging, because it is hard to predict the moments when novel content, which may cause a significant profile change, has been published on such a volatile channel and is still available for access.
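The effect of volatility on a pull-based monitor can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not taken from the text: the channel publishes exactly one entry per time step, the overwrite policy keeps a fixed number of newest entries, and the monitor polls at a fixed interval. The class `VolatileChannel` and the function `simulate` are hypothetical names.

```python
from collections import deque


class VolatileChannel:
    """A feed that retains only its `capacity` newest entries (overwrite policy)."""

    def __init__(self, capacity):
        # A bounded deque silently discards the oldest entry when it is full.
        self._entries = deque(maxlen=capacity)

    def publish(self, entry):
        self._entries.append(entry)

    def pull(self):
        """Return a snapshot of the entries currently available on the channel."""
        return list(self._entries)


def simulate(capacity, total_entries, poll_every):
    """Publish one entry per time step and pull every `poll_every` steps.

    Returns the set of entries the monitor managed to capture.
    """
    channel = VolatileChannel(capacity)
    captured = set()
    for t in range(1, total_entries + 1):
        channel.publish(t)
        if t % poll_every == 0:
            captured.update(channel.pull())
    return captured


# Polling as often as the feed turns over captures every entry.
frequent = simulate(capacity=3, total_entries=12, poll_every=3)

# Polling half as often loses the entries overwritten between two pulls.
sparse = simulate(capacity=3, total_entries=12, poll_every=6)
missed = set(range(1, 13)) - sparse
```

In this toy setting, halving the polling rate loses half of the published entries, and any profile change carried only by those entries would never be observed.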
Finally, channel monitoring can be constrained, either by limited system resources such as bandwidth, memory, or CPU (central processing unit), or by monitoring restrictions that channel providers set themselves (sometimes termed “politeness constraints”) in response to the heavy workloads imposed by many clients accessing them. The number of channels that can be monitored in parallel is therefore limited, and the allocated resources must be used efficiently to keep profiles fresh.
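One naive way to picture monitoring under a budget is a greedy baseline that, in each time slot, probes the channels expected to have accumulated the most unseen content. The sketch below is only an illustration and is not the method this document proposes. It assumes that a per-channel publication rate estimate is available and that expected unseen content grows linearly with the time since the last probe. The function `choose_channels` and its parameters are hypothetical.

```python
import heapq


def choose_channels(rates, last_probed, now, budget):
    """Pick up to `budget` channels with the highest expected unseen content.

    rates:       channel name -> estimated entries published per time unit
    last_probed: channel name -> time of the most recent probe
    Assumption: expected unseen content = rate * time elapsed since last probe.
    """

    def expected_unseen(channel):
        return rates[channel] * (now - last_probed[channel])

    return heapq.nlargest(budget, rates, key=expected_unseen)


# Illustrative values only: three channels and a budget of two probes per slot.
rates = {"news": 5.0, "blog": 0.5, "wiki": 1.0}
last_probed = {"news": 8, "blog": 0, "wiki": 4}
selected = choose_channels(rates, last_probed, now=10, budget=2)
```

With these values the expected unseen content is 10 for "news", 6 for "wiki", and 5 for "blog", so the two probes go to "news" and "wiki". A baseline like this ignores volatility and politeness constraints, which is part of why the problem described above is harder than simple rate-based scheduling.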