This application claims priority to U.S. Provisional Patent Application Ser. No. 62/272,793 filed on Dec. 30, 2015, the complete disclosure of which is incorporated by reference herein.
The subject invention relates to a system and means for generating synthetic social media data, and in particular, to a system and means for generating large volumes of synthetic data for social media, such as data from microblogging or social networking services. The system and means can generate synthetic graph structures and text features and combine them to produce large-scale, anonymized social media data that is similar to the input social media data in terms of statistical and application-level properties.
Due to advances in computer and communication technologies, social media has been growing at a fast pace, with various microblogging and social networking services generating large-scale data. Social media provides rich data feeds that can be utilized for different purposes such as marketing, advertisement, and the analysis and forecasting of social events. However, social media is still far from being fully captured and understood, due to the growing number and variety of data feeds and the complex interactions among social network actors. Therefore, social media analytics (algorithms that analyze social media data, including graph, machine learning and natural language processing algorithms) are needed to systematically extract, capture and analyze social media data.
In turn, research and development progress in social media analytics depends on the availability of social media data to model and analyze social behavior. Existing small and static data sets cannot keep up with growing social media feeds. Social media data is available through either public Application Program Interfaces (APIs) or paid data services. However, there are both rate and privacy limitations on collecting, sharing or distributing new social media data (such as Facebook or Twitter data) for controllable and repeatable test and evaluation. These limitations may slow down progress in social network research and social media analytics by granting data access only to a small group of researchers and by preventing the full data disclosure that would be necessary to verify and further improve reported research results. Therefore, it is important to generate large-scale synthetic data sets that reflect real-world data sets in terms of statistical or application properties and that can be shared with others without rate and privacy concerns. In addition, synthetic data can be used to generate and analyze the large-scale, high-fidelity behavior of networks of social bots (automated social media posting programs, such as spammers) as well as of campaigns (such as for marketing, advertisement and recruitment) in social media.