The popularity and trends of media topics may be determined in hindsight by various means. Data collected through digital media may be aggregated in a number of sophisticated ways, either to provide sociological insights about media attention in general or to produce detailed per-topic media timelines that provide insights into history. The availability of historical data sources has also grown sharply, for example social media, online news, digitized books and newspapers, and the like.
Many things can go wrong when trying to analyze a large media corpus in a uniform way. For example, the volume of archival data diminishes for earlier generations, for reasons ranging from population growth, to the economics of publishing, to the destruction of literature by natural disaster, and the like. The optical character recognition (OCR) process is inherently error-prone, and OCR quality tends to be worse for older media because of degraded microfilm and lower-quality printing press processes. In corpora where metadata is also acquired via OCR, errors in the scanned timestamp of a document can create wild variation in many aspects of the time series data. Media practices themselves have changed over time: per-edition publication size has grown, publications have shifted from weekly to daily, and the timescale on which information is synchronized between publishers has shortened. Changes in language may also pose problems when analyzing data, unless the language itself is the subject matter of interest.
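As an illustration of why uneven archival volume matters, the following minimal sketch compares raw per-year mention counts of a topic with the same counts expressed as a share of all documents available for that year. The function name `relative_frequency` and all of the counts are hypothetical and chosen only for illustration; they are not taken from any real corpus, and this is one simple normalization among many rather than a prescribed method.

```python
def relative_frequency(topic_counts, total_counts):
    """Return, per year, topic mentions divided by total documents that year.

    Years with no documents are skipped, since no share can be computed.
    """
    shares = {}
    for year, total in total_counts.items():
        if total > 0:
            shares[year] = topic_counts.get(year, 0) / total
    return shares


# Hypothetical data: raw mentions grow 80-fold, but so does the archive.
topic_counts = {1900: 5, 1950: 40, 2000: 400}
total_counts = {1900: 1_000, 1950: 10_000, 2000: 100_000}

shares = relative_frequency(topic_counts, total_counts)
for year in sorted(shares):
    print(year, topic_counts[year], f"{shares[year]:.4f}")
```

In this made-up example the raw counts suggest a steep rise in attention, while the normalized shares stay roughly flat, which is the kind of distortion that a shrinking archive for earlier periods can introduce.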