The present invention relates in general to a method and system to predict the likelihood of data topics that may occur from data sources. The likelihood of the data topics may be predicted over a dimension of time or over other dimensions.
Everyone would like to have a crystal ball: to know what to expect, to know what will happen, and to take advantage of that information. Of course, this is impossible, especially when human beings are involved. However, some reliable probabilistic regularities may hold for human behavior, especially at the group level. A number of companies and researchers, listed below, have taken a computational social science view by creating templates of behaviors, fitting human group activities observed on the ground to those templates, and determining the frequencies with which one kind of behavior follows another.
When taken from the news, these data are often called “event data,” and techniques of “sparse parsing” (e.g., U.S. Pat. No. 6,539,348 to Douglas G. Bond et al.; King, G. & Lowe, W. (2003), “An automated information extraction tool for international conflict data with performance as good as human coders: A rare events evaluation design.” International Organization, 57, 617-642; and Schrodt, P. A. (2000), “Forecasting conflict in the Balkans using Hidden Markov Models.” paper presented at the American Political Science Association, Washington, D.C., found at the time of this application at http://web.ku.edu/keds/papers.dir/KEDS.APSA00.pdf) are often used to extract data from the headlines or the body of news articles. The data extracted are usually expressed as an event in which actor1 performed some action on actor2. The actors are defined in a dictionary, as is the set of possible actions that can be performed. These dictionaries must also contain the variety of words and word strings used to express the presence of an actor (e.g., “Israel”, “Rabin” and “Tel Aviv” would all map to the actor called “Israeli Government”) or the occurrence of an event (e.g., thousands of verbs are matched to about 100 types of events, as illustrated at the time of this application at http://web.ku.edu/keds/data.dir/KEDS.WEIS.Codes.html). Once these event data have been captured, techniques can be used to determine what sequences of events tend to precede crises versus non-crises (Schrodt, 2000 and Bond et al.).
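The dictionary-driven extraction described above can be illustrated with a minimal sketch. It is not the method of any of the cited systems: the function names, the second actor, the verb list, and the event-type labels are hypothetical and chosen only for illustration, and a real coding scheme would use far larger dictionaries. Only the mapping of “Israel”, “Rabin” and “Tel Aviv” to “Israeli Government” comes from the text.

```python
import re

# Actor dictionary: surface strings -> canonical actor.
# The "Israeli Government" entries follow the example in the text;
# the "Syrian Government" entries are hypothetical.
ACTORS = {
    "israel": "Israeli Government",
    "rabin": "Israeli Government",
    "tel aviv": "Israeli Government",
    "syria": "Syrian Government",
    "damascus": "Syrian Government",
}

# Hypothetical event dictionary: surface verb phrases -> event type.
# The labels are illustrative and are not taken from any published code set.
EVENTS = {
    "met with": "CONSULT",
    "criticized": "ACCUSE",
    "denounced": "ACCUSE",
    "attacked": "FORCE",
}


def _find_all(text, dictionary):
    """Return sorted (start, end, code) tuples for each dictionary phrase in text."""
    hits = []
    for phrase, code in dictionary.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text):
            hits.append((match.start(), match.end(), code))
    return sorted(hits)


def code_event(headline):
    """Return (actor1, event_type, actor2), or None if no such pattern is found.

    actor1 is the nearest known actor before the event phrase and
    actor2 is the nearest known actor after it.
    """
    text = headline.lower()
    actors = _find_all(text, ACTORS)
    for ev_start, ev_end, ev_code in _find_all(text, EVENTS):
        sources = [a for a in actors if a[1] <= ev_start]
        targets = [a for a in actors if a[0] >= ev_end]
        if sources and targets:
            return (sources[-1][2], ev_code, targets[0][2])
    return None


if __name__ == "__main__":
    print(code_event("Rabin criticized Damascus over border delays"))
    print(code_event("Markets rallied on Tuesday"))
```

The sketch ignores everything outside its dictionaries, which is the sense in which such parsing is "sparse": a headline with no known actor pair around a known verb yields no event at all.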
Related word-based methods for predicting behavior include 1) looking for specific keywords to detect a mood or sentiment in large-scale micro-blogging sources and relating counts of these words to socio-economic data (as illustrated by Bollen, J., Pepe, A., and Mao, H. (2010) Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena, WWW2010, Raleigh, N.C.), and 2) counting all words and other features of a movie review to predict the revenue from the opening week of a movie (as illustrated by Joshi, M., Das, D., Gimpel, K., and Smith, N. A. (2010) Movie Reviews and Revenues: An Experiment in Text Regression. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, pages 293-296, Los Angeles, Calif.). The first method biases the categories (i.e., the moods) to be of a pre-defined nature. The second method provides no further understanding of the results since individual features are used independently to make a prediction.
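The first of these methods, keyword counting against pre-defined mood categories, might be sketched as follows. This is an assumption-laden illustration rather than the procedure of the cited paper: the lexicon, the category names, and the function name `mood_counts` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical mood lexicon: keyword -> pre-defined mood category.
MOOD_KEYWORDS = {
    "worried": "tension",
    "nervous": "tension",
    "anxious": "tension",
    "happy": "joy",
    "delighted": "joy",
    "furious": "anger",
}


def mood_counts(posts):
    """Count how often each pre-defined mood category is triggered across posts."""
    counts = Counter()
    for post in posts:
        for token in re.findall(r"[a-z']+", post.lower()):
            mood = MOOD_KEYWORDS.get(token)
            if mood is not None:
                counts[mood] += 1
    return counts


if __name__ == "__main__":
    sample = ["I am worried and nervous", "So happy today", "Prices are rising"]
    print(dict(mood_counts(sample)))
```

Any post whose wording falls outside the lexicon contributes nothing to the counts, which is one way to see the bias toward pre-defined categories noted above.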
Another method of predicting future behavior at a large scale is to use agent-based modeling (e.g., as illustrated at the time of this application at http://blog.wired.com/defense/2007/11/lockheed-peers-.html). This work attempts to model a population as a discrete set of agents, each with their own internal dynamics using data collected from the field and socio-cultural models.
One problem with both the text-based and the agent-based kinds of analysis is that human behavior is much more complex and dynamic than they can accommodate. These analyses tend to require large amounts of manual labor (e.g., interviewing many people in a population) or are biased and limited by what the theoretician's model can accommodate in the textual analysis. Both are also developed specifically for a given population and so may be inappropriate for another. What is needed is a method for analyzing all forms of human behavior, without theoretical constraints or biases, to determine the relationships between one behavior and another in a culturally relevant manner.
Situations exist in the art today in which users attempting to predict future events have access to a large corpus of open source documents (such as newspapers, blogs, or the like) covering an extended time period (months to years). In this situation, a user concerned with non-tactical decision-making may need to address questions of why things happened and what will happen (or, more precisely, what is likely to happen), in addition to questions of what happened and who's who. For example, consider elections in Nigeria. A user might be asked to identify the key political parties in Nigeria and the key players; to summarize what happened in the elections since Nigerian independence; to provide an assessment of why those things happened (e.g., why rioting followed one election, why another was postponed, etc.); or to assess what is likely to happen following the election of April 2007.
A user today might solve such problems by using a system like the Open Source Center (as illustrated at the time of this application at www.opensource.gov) which provides reports and translations from thousands of publications, television and radio stations, and Internet sources around the world covering many years. Current news data archives like the Open Source Center, or any number of other news data aggregators and suppliers, support keyword search, so the user could conduct a variety of searches and retrieve (perhaps very many) articles concerning elections in Nigeria or Africa more broadly. These articles would be rank-ordered in some way, for example by recency, the number of mentions of the search string, popularity or link analysis, but generally not reflecting the user's special requirements.
Given the list of articles, the user might then have to conduct various searches to narrow down the articles to those of interest; if, for example, he or she was concerned about the possibilities of violence associated with elections, searches might need to include “violence,” “riots,” “killings,” “voter intimidation,” and other related terms. Then, those articles would have to be reviewed in temporal order to extract meaningful information, since the user is not merely seeking to compile a list of interesting anecdotes.
This is how users perform information retrieval in numerous parts of the government and military, ranging from human intelligence (HUMINT) reports in a Marine Corps Intelligence Battalion, to newspapers in the Virtual Information Center of the Asia-Pacific Area Network, to TV show transcripts that PSYOP analysts use to understand the attitudes and beliefs of a population and influence them. Nevertheless, this process has several obvious drawbacks: it can take a great deal of time, since iterative searching is typically required; it can be quite inaccurate, with problems in both precision (that is, returning too many irrelevant documents, i.e., false alarms) and recall (that is, failing to find enough relevant documents, i.e., misses), since virtuoso keywording skills may be necessary; and it does not help the user detect the kinds of patterns that could be of interest, since it has no temporal pattern-detecting ability to get at the real issue, for example, what is likely to happen after a flawed election in Nigeria.
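The precision and recall problems mentioned above can be made concrete with a short sketch using the standard definitions of the two measures. The function name `precision_recall` and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a keyword search.

    precision = relevant documents retrieved / all documents retrieved
                (low precision means many false alarms)
    recall    = relevant documents retrieved / all relevant documents
                (low recall means many misses)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical document identifiers for one search.
    returned_by_search = ["doc1", "doc2", "doc3", "doc4"]
    truly_relevant = ["doc1", "doc2", "doc5"]
    print(precision_recall(returned_by_search, truly_relevant))
```

In this hypothetical search, two of the four returned documents are relevant and two of the three relevant documents are found, so adding more keywords to catch the missed document would typically also pull in more false alarms.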
These three methods for determining likely future events (word counting, agent-based systems, and user-intensive searching and understanding) may also be used to determine the financial direction of an individual company, a market sector, or the economy of a country or of the world as a whole, as measured by any number of economic indicators such as stock prices, employment, or gross domestic product (GDP). In these economic cases, textual data concerning the company, market sector, or nation might be searched or analyzed by humans or machines to produce forecasts of future market behavior.
These three methods for determining likely future events (word counting, agent-based systems, and user-intensive searching and understanding), whether those events are political, military, or economic, each have their own deficiencies as described above. Embodiments of the disclosed invention address many of these drawbacks and provide additional novel improvements to the art.