Today, in media, blogs, Internet forums and similar sources, various entities, e.g. organizations, companies, individuals, weather and crop production, are continually valued and described along semantic dimensions such as “creative”, “trustworthy”, “innovative”, “warm”, “high” and “bad”. Such descriptions are of great interest, for example when estimating opinion polls, future weather conditions, possession of a television (or basically any variable that may be estimated), based upon what is written in the media, on the Internet or in a set of questionnaire answers. Currently, information retrieved from texts is often collected through manual reading and subjective estimates of small samples of relevant texts published in media. This is done either by experts in the relevant field, e.g. by market analysts or meteorologists, or by opinion polls conducted by written or oral questionnaires.
However, such opinion polls and questionnaires introduce several problems. One is subjectivity, since people making subjective evaluative judgments of texts are known to be influenced by several variables, including prior life experiences, the type of questions asked, the setting in which the questions are asked, information provided just prior to the question and so forth.
Moreover, usually only small text samples are included, as people have limited ability or time to read or acquire information, which means that an evaluation and a subsequent conclusion may be based on insufficient information. Another problem is that few evaluative dimensions are studied, since current opinion polls are limited in the number of evaluative questions that can be asked.
Further, evaluative judgments are today often processed manually, leading to high costs for collecting this information. Other difficulties include studying changes across time, since opinion polls have to be taken at the time which is being studied. It is therefore not possible to study evaluations that occurred earlier in time, which makes it difficult to graphically track changes in evaluations.
Also, in today's globally networked society, an incredible amount of text is produced at an ever increasing rate by various distributed text sources. The semantic contents of such texts are important to various industrial processes and technical systems which in some way deal with or depend on the behavior and opinions of a group or population of human individuals. However, it is believed that an automated and efficient approach for deriving control data input to industrial processes and technical systems based upon the semantic contents of distributed text sources has hitherto not been available.
Accordingly, there is a need for a way of efficiently extracting information from a number of texts. Automated systems for measuring e.g. valence (which here refers to the degree to which a word or text is positive or negative) are known, although these systems do not allow such measurement across time.
One automated system for measuring the valence of single words first created a so-called semantic space from a text corpus. A set of positive words and a set of negative words were then collected from word norms, and the valence of a word was estimated by measuring, in the semantic space, the distance between this word and the set of positive words and the set of negative words, respectively.
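The word-level estimate described above can be illustrated with a minimal sketch. It assumes a toy semantic space given as a dictionary of hand-picked vectors and uses cosine similarity as the closeness measure; the text does not specify the metric, and the vectors, word lists and function names are hypothetical and are not taken from the system described.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_similarity(vec, words, space):
    """Average similarity between vec and each word of a reference set."""
    return sum(cosine(vec, space[w]) for w in words) / len(words)

def word_valence(word, space, positive, negative):
    """Positive result: the word lies closer to the positive set than to the negative set."""
    vec = space[word]
    return mean_similarity(vec, positive, space) - mean_similarity(vec, negative, space)

# Hypothetical 3-dimensional semantic space; a real space would be derived from a corpus.
TOY_SPACE = {
    "good": (0.9, 0.1, 0.2),
    "happy": (0.8, 0.2, 0.1),
    "bad": (0.1, 0.9, 0.2),
    "sad": (0.2, 0.8, 0.1),
    "sunshine": (0.7, 0.2, 0.3),
    "storm": (0.2, 0.7, 0.4),
}
POSITIVE = ["good", "happy"]  # stand-ins for words taken from word norms
NEGATIVE = ["bad", "sad"]

if __name__ == "__main__":
    for w in ("sunshine", "storm"):
        print(w, round(word_valence(w, TOY_SPACE, POSITIVE, NEGATIVE), 3))
```

In this sketch the sign of the result indicates the estimated direction of the valence; a distance measure such as Euclidean distance could be substituted with the sign convention reversed.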
Another automated system measures the valence of news headlines. Here semantic spaces were first created, and the headlines of newspaper articles were summarized by averaging the representations of the words in the headlines in the semantic space. Eight positive words and eight negative words were likewise summarized by averaging the representations of these words in the space. The valence of the headlines was estimated by measuring, in the semantic space, the distance between the summary of the headlines and the summaries of the positive and negative words, respectively.
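The headline-level variant can be sketched in the same spirit. The example below assumes a toy two-dimensional semantic space, whitespace tokenization, cosine similarity as the closeness measure, and two reference words per pole rather than the eight mentioned above; all names and vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(words, space):
    """Summarize a list of words by averaging their vectors; unknown words are skipped."""
    vectors = [space[w] for w in words if w in space]
    if not vectors:
        raise ValueError("none of the words occur in the semantic space")
    dims = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dims))

def headline_valence(headline, space, positive, negative):
    """Compare the headline summary with the positive and negative summaries."""
    summary = centroid(headline.lower().split(), space)
    pos_summary = centroid(positive, space)
    neg_summary = centroid(negative, space)
    return cosine(summary, pos_summary) - cosine(summary, neg_summary)

# Hypothetical 2-dimensional semantic space; a real space would be derived from a corpus.
TOY_SPACE = {
    "win": (0.9, 0.1),
    "gain": (0.8, 0.2),
    "loss": (0.1, 0.9),
    "fail": (0.2, 0.8),
    "markets": (0.5, 0.5),
    "rally": (0.8, 0.3),
    "crash": (0.2, 0.9),
}
POSITIVE = ["win", "gain"]
NEGATIVE = ["loss", "fail"]

if __name__ == "__main__":
    for h in ("Markets rally", "Markets crash"):
        print(h, round(headline_valence(h, TOY_SPACE, POSITIVE, NEGATIVE), 3))
```

Averaging discards word order, so in this sketch two headlines containing the same words receive the same estimate.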
In the patent literature, several techniques for processing and analyzing text are described for various purposes. US 2004/0059736 A1, for example, includes means for determining a concept representation for a set of text documents based upon partial order analysis and modifying this representation if it is determined to be unidentifiable. Also described are means for labeling the representation, mapping documents to it to provide a corresponding document representation, generating a number of document signatures each of a different type, and performing several data processing applications each with a different one of the document signatures of differing types.
US 2007067157 A1 describes a phrase extraction system that combines a dictionary method, a statistical/heuristic approach, and a set of pruning steps to extract frequently occurring and interesting phrases from a corpus. The system finds the “top k” phrases in a corpus, where k is an adjustable parameter. For a time-varying corpus, the system uses historical statistics to extract new and increasingly frequent phrases. The system finds interesting phrases that occur near a set of user-designated phrases, using these designated phrases as anchor phrases, and finds frequently occurring and interesting phrases in a corpus that changes over time, as in finding frequent phrases in an ongoing, long-term document feed or a continuous, regular web crawl.
Even though known techniques fulfill their respective purposes, none of them makes it possible to perform an evaluation in terms of predicting a variable of choice that is related to a given word or words in a given text corpus.