Emerging technologies focused on user-generated content have greatly increased the amount of data transmitted across the Internet on a daily basis. The growth of services that allow users to publish streams of data has enabled expansive coverage of current events, news, topics and other data. However, the explosive influx of user-generated content presents significant problems in data analysis and aggregation. Furthermore, this influx of data raises novel issues in extracting relevant topics due to the diversity of language, culture, slang and various other factors that affect the semantics of user-generated streams of data.
User-generated data streams, when aggregated, allow for efficient discovery of hot news and trending topics. Previous efforts at aggregating user-generated data streams have focused on identifying trending keywords in the data stream. This technique does not give a full view of why users are generating the given keywords. The generated streams tend to break up a story in different ways depending on the mood of the user. For example, user-generated streams directed to the same topic may vary as follows:
Stream 1: “eBay expected to announce deal to sell Skype”
Stream 2: “Ebay will announce deal to sell Skype to a group of investors on Tuesday”
Stream 3: “Big news tonight - “BREAKING: eBay to Announce Deal to Sell Skype”
Currently, the trends in these user-generated streams are surfaced as separate keywords such as “ebay”, “announce deal” and “sell skype”, rather than as a single story, because users write about the same topic differently. The current state of the art fails to cohesively analyze user-generated streams to account for the variance in terminology used across a diverse data set. The present invention provides a solution allowing a system to intelligently parse and identify key trending topics and to store topics or stories for subsequent analysis and retrieval.
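By way of illustration only, the fragmentation produced by keyword-based trending can be reproduced with a minimal Python sketch applied to the three example streams. The sketch is not the method of the present invention, nor any particular prior system. The function names, the stop-word list, the restriction to one-word and two-word terms, and the rule that a term must appear in all three streams are assumptions chosen for the example.

```python
import re
from collections import Counter

# The three example streams from the text, with quotation marks removed.
STREAMS = [
    "eBay expected to announce deal to sell Skype",
    "Ebay will announce deal to sell Skype to a group of investors on Tuesday",
    "Big news tonight - BREAKING: eBay to Announce Deal to Sell Skype",
]

# Assumed stop-word list, kept small for the example.
STOPWORDS = {"to", "a", "of", "on", "will"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def keyword_trends(streams, min_streams):
    """Return one- and two-word terms that occur in at least min_streams streams."""
    stream_counts = Counter()
    for text in streams:
        tokens = tokenize(text)
        seen = set()
        for n in (1, 2):
            for gram in ngrams(tokens, n):
                # Skip terms that begin or end with a stop word.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                seen.add(" ".join(gram))
        # Count each term once per stream, however often it repeats.
        stream_counts.update(seen)
    trending = {t for t, c in stream_counts.items() if c >= min_streams}
    # Drop single words already covered by a trending two-word term.
    covered = {w for t in trending if " " in t for w in t.split()}
    return trending - covered


print(sorted(keyword_trends(STREAMS, min_streams=3)))
# ['announce deal', 'ebay', 'sell skype']
```

Under these assumptions the sketch surfaces three disconnected terms and gives no indication that all three streams report one and the same story.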