With the advent of the Internet, information is easily disseminated to users via websites. The content on most websites, such as news sites, product information pages, medical websites, sports websites, weblogs, podcasts, and video sites, often changes on an unpredictable schedule. Therefore, a user who wishes to stay current has to repeatedly check each website for updates. Moreover, given the innumerable websites available on the Internet today, manually checking each website for updates proves to be very tedious. This problem has spurred the adoption of data feeds that collate the content from various websites into a common, convenient web syndication format such as Really Simple Syndication (RSS) or Atom. Each of the web syndication formats follows a general structure that includes multiple items. Examples of the items include, but may not be limited to, a link, a title, and metadata such as HyperText Markup Language (HTML) content.
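The general feed structure described above can be illustrated with a minimal sketch. In RSS 2.0, each entry is an `<item>` element that carries child elements such as `<title>`, `<link>`, and `<description>`, the last of which often holds escaped HTML. The sample document, its URLs, and the helper name `parse_items` are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; the titles and URLs are illustrative only.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>https://example.com/</link>
    <description>Illustrative feed</description>
    <item>
      <title>First story</title>
      <link>https://example.com/stories/1</link>
      <description>&lt;p&gt;Body as &lt;b&gt;HTML&lt;/b&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/stories/2</link>
      <description>&lt;p&gt;Another body&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
"""


def parse_items(rss_text):
    """Return one dict per <item> with its title, link, and description."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # The XML parser unescapes the entities, so this is raw HTML text.
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_RSS):
        print(entry["title"], entry["link"])
```

The HTML carried in the description still has to be parsed and transformed separately, and that step typically accounts for much of the processing cost.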
The data feeds associated with a website enable people to keep up with the websites they follow in an automated manner, rather than manually checking and pulling content from individual websites. However, processing the data feeds is computationally intensive and requires HTML parsing and custom data transformations in accordance with an application. Moreover, processing the data feeds becomes very challenging in the presence of tens of thousands of data feeds and of records within those data feeds.
Conventional solutions for processing data feeds include scheduling the various data feeds through a scheduler. The scheduler may be a system-wide scheduler or a language-based scheduler. Examples of the system-wide scheduler include, but may not be limited to, cron and hcron. The term ‘cron’ refers to a time-based job scheduler that enables users to schedule jobs to run periodically at certain times or dates. Examples of the language-based scheduler include, but may not be limited to, a Java®-based scheduler, a Python®-based scheduler, and a Ruby®-based scheduler. Examples of the Java®-based scheduler include, but may not be limited to, Quartz®, Essiembre, and Fulcrum®. The Python®-based scheduler may be an Advanced Python® scheduler. Examples of the Ruby®-based scheduler include, but may not be limited to, Rufus and Delayed Job.
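The idea of a time-based job scheduler can be illustrated with a minimal sketch built on the `sched` module of the Python® standard library. This sketch does not reproduce cron or any of the schedulers named above. The helper `run_periodically`, the job `poll_feeds`, and the chosen interval are assumptions made for illustration only.

```python
import sched
import time


def run_periodically(job, interval, repetitions,
                     timefunc=time.monotonic, delayfunc=time.sleep):
    """Run job() every `interval` seconds, `repetitions` times in total.

    Jobs run one after another on the calling thread, which mirrors the
    sequential behavior of a simple time-based scheduler.
    """
    scheduler = sched.scheduler(timefunc, delayfunc)
    start = timefunc()
    for n in range(1, repetitions + 1):
        # Schedule each run at an absolute time relative to the start.
        scheduler.enterabs(start + n * interval, 1, job)
    scheduler.run()  # blocks until every scheduled run has completed


def poll_feeds():
    """Hypothetical job; a real job would fetch and process data feeds."""
    print("polling feeds at", round(time.monotonic(), 2))


if __name__ == "__main__":
    run_periodically(poll_feeds, interval=0.1, repetitions=3)
```

Because each run blocks the scheduler until it returns, a slow job delays every run queued behind it.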
In another conventional approach, large-scale feed processing is performed on large server farms using a Pacman framework or a Pepper framework, which runs on Java®-based Hadoop®. Hadoop® refers to a software framework that allows for the distributed processing of large data sets across clusters of computers.
However, most of the conventional approaches describe synchronous methods of processing the data feeds. In these synchronous methods, the data feeds are processed sequentially, which does not enable simultaneous processing of the data feeds. The conventional approaches cannot keep up with the processing when the burst rate of the data feeds is very high. For example, when the burst rate of the data feeds is greater than a thousand requests per minute, it becomes extremely difficult to keep up with the processing of the data feeds without causing data inconsistency, lost data feed requests, server crashes, and the like.
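The difference between sequential and simultaneous processing can be illustrated with a minimal sketch using the `asyncio` module of the Python® standard library. This sketch is not the method of any approach named above. The function `process_feed`, its simulated delay, and the `max_in_flight` limit are assumptions made for illustration only.

```python
import asyncio
import time


async def process_feed(feed_id, delay=0.05):
    """Hypothetical stand-in for fetching and transforming one data feed."""
    await asyncio.sleep(delay)  # simulates network and parsing latency
    return f"processed {feed_id}"


async def process_sequentially(feed_ids):
    """One feed at a time: total time grows with the number of feeds."""
    return [await process_feed(f) for f in feed_ids]


async def process_concurrently(feed_ids, max_in_flight=100):
    """Many feeds in flight at once, bounded by a semaphore."""
    gate = asyncio.Semaphore(max_in_flight)

    async def bounded(feed_id):
        async with gate:
            return await process_feed(feed_id)

    # gather() returns results in the same order as the input feeds.
    return await asyncio.gather(*(bounded(f) for f in feed_ids))


if __name__ == "__main__":
    ids = list(range(20))
    for worker in (process_sequentially, process_concurrently):
        started = time.perf_counter()
        asyncio.run(worker(ids))
        elapsed = time.perf_counter() - started
        print(worker.__name__, f"{elapsed:.2f}s")
```

In the sequential version the waiting times add up, whereas in the concurrent version they overlap, so a burst of requests is absorbed without a backlog proportional to its size. The semaphore caps the number of feeds in flight so that a burst does not exhaust connections or memory.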
In light of the above discussion, there is a need for a method and a system that overcome the above-stated problems. In addition, the method and system should use asynchronous methods of processing the data feeds, thereby enabling processing of a large number of data feeds.