The Internet is a source of much useful information. However, its content is also polluted with spam and duplicated content. Many websites are legitimate sources of content such as news items, articles, comments, user posts, and the like. Social network and blog sites are also rich sources of online content. A blog is a discussion or collection of information published on the Internet, typically formatted as a series of discrete entries, called posts, usually displayed in reverse chronological order so that the most recent post appears first. Webcrawlers can obtain updated information from a blog through its Rich Site Summary (RSS) feed. An RSS feed normally includes summarized text, the publication date of each post, and the name of the author. Thus, webcrawlers can analyze RSS data to characterize, index, and otherwise process blog site content (and other website content).
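As an illustration of the RSS fields mentioned above (summarized text, publication date, and author), the following is a minimal sketch of how a crawler might read them from an RSS 2.0 feed using only the Python standard library. The sample feed, the function name `parse_rss_items`, and the output dictionary keys are assumptions made for this example; real feeds vary and may omit any of these elements.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>Second post</title>
      <description>Summary of the second post.</description>
      <pubDate>Tue, 02 Jan 2024 09:30:00 GMT</pubDate>
      <author>editor@example.com (Jane Doe)</author>
    </item>
    <item>
      <title>First post</title>
      <description>Summary of the first post.</description>
      <pubDate>Mon, 01 Jan 2024 08:00:00 GMT</pubDate>
      <author>editor@example.com (Jane Doe)</author>
    </item>
  </channel>
</rss>"""


def parse_rss_items(feed_xml):
    """Return one dict per <item>, in the order the feed lists them."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")  # RFC 822 style date, may be absent
        posts.append({
            "title": item.findtext("title"),
            "summary": item.findtext("description"),
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
            "author": item.findtext("author"),
        })
    return posts


if __name__ == "__main__":
    for post in parse_rss_items(SAMPLE_FEED):
        print(post["published"], post["author"], post["summary"])
```

The sketch keeps the feed's own ordering, which for a blog is typically newest first, and returns `None` for a missing publication date rather than guessing one.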
Marketing campaigns use information mined from the web (using, for example, a webcrawling system) to help meet the needs of customers. However, more than one-third of the content on the web is duplicated or copied content. Duplicate or near-duplicate posts are known as aggregated content, and such content is often found on aggregator websites (or, simply, aggregators). Most aggregated content is generated automatically by stealing original content from legitimate sources (original sources or legitimate “republication” sources). To provide high-quality content to end users, it is important to identify and eliminate aggregators and/or aggregated content when crawling the web.
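To make the notion of a near-duplicate post concrete, the following sketch compares two texts by word shingles and Jaccard similarity. This is a generic, well-known technique shown only for illustration; it is not presented as the methodology of this document. The function names, the shingle size `k=3`, and the `0.8` threshold are all assumptions chosen for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.8, k=3):
    """True if the shingle sets overlap at or above the assumed threshold."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog near the river bank today"
    copied = "The quick brown fox jumps over the lazy dog near the river bank yesterday"
    unrelated = "Stock markets closed higher after the central bank held interest rates steady"
    print(is_near_duplicate(original, copied))
    print(is_near_duplicate(original, unrelated))
```

Pairwise comparison of this kind scales poorly to a full crawl; it is meant only to show what "near-duplicate" can mean in measurable terms.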
Accordingly, it is desirable to have a computer-implemented methodology for detecting the presence of aggregated online content. In addition, it is desirable to provide and maintain a system that is capable of dynamically responding to aggregated content in an efficient and effective manner. Furthermore, other desirable features and characteristics will become apparent from the subsequent detailed description and the appended claims, taken in conjunction with the accompanying drawings and the foregoing technical field and background.