Previous research suggests methods for discovering networks of individuals consistently interacting over time through temporal analysis and event response monitoring ("TAERM Network Detection Methods"). See J. Mugan, E. McDermid, A. McGrew, and L. Hitt, Identifying Groups of Interest Through Temporal Analysis and Event Response Monitoring, In IEEE Conference on Intelligence and Security Informatics (ISI), 2013. Such methods, however, may reveal many networks for a given domain. Depending on the domain and the amount of data, the number of networks discovered may be so large that it is difficult for an analyst to identify which networks warrant further investigation.
Another area of research focuses on methods for using unstructured text to characterize entities. Extractive summarization is the problem of representing a document by its most important sentences. One method of extractive summarization finds the k most frequent words in the document (ignoring stop words). Those words are assumed to describe the document and are called important words. One can then score each sentence based on how many important words occur in that sentence (one can normalize by how many words are in the sentence, or one can give a higher score if important words occur together). The n highest-scoring sentences become the summary. More broadly, extractive summarization chooses a subset of the document's sentences as the summary. If a document has 30 sentences, an extractive summarization method may try to find, say, the two sentences that best represent the document.
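The frequency-based scoring described above can be sketched as follows. The stop-word list, the sentence-splitting regular expression, and the function name `extractive_summary` are illustrative assumptions rather than details of any cited method:

```python
import re
from collections import Counter

# A small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it",
              "that", "for", "on", "as", "with"}

def extractive_summary(document, k=10, n=2):
    """Score each sentence by how many of the document's k most frequent
    non-stop words ("important" words) it contains, normalized by sentence
    length, and return the n top-scoring sentences."""
    # Split into sentences after terminal punctuation (a crude heuristic).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document)
                 if s.strip()]
    # Find the k most frequent non-stop words in the whole document.
    words = [w for w in re.findall(r"[a-z']+", document.lower())
             if w not in STOP_WORDS]
    important = {w for w, _ in Counter(words).most_common(k)}

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(t in important for t in tokens) / max(len(tokens), 1)

    return sorted(sentences, key=score, reverse=True)[:n]
```

A sentence dense in frequent words therefore outranks a longer sentence that mentions them only in passing, because of the length normalization.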
A modern version of the extractive text summarization approach uses Latent Semantic Analysis (LSA). To extract summaries using LSA, one creates a term-sentence matrix A of size m×n, where m is the number of terms and n is the number of sentences. One then performs singular value decomposition on A so that A = TΣSᵀ. Matrix Σ is diagonal, contains the r singular values, and is of size r×r. We can think of the r values of Σ as the r most important concepts in the document, ordered by importance. Matrix T is the term matrix, and row j of T corresponds to term j of A. The first value of row j of T gives a score of how relevant the most important concept is to term j. Likewise, column i of Sᵀ corresponds to sentence i, and the first value of that column is a score of how relevant the most important concept is to sentence i. The second value of that column is a score of how relevant the second most important concept is to sentence i, and so on, up to r. This means that one can take the sentence s ∈ S whose column of Sᵀ has the highest first value as the most important sentence, as was done in Yihong Gong and Xin Liu, Generic text summarization using relevance measure and latent semantic analysis, In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 19-25, ACM, 2001.
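A minimal sketch of this LSA selection, using NumPy's `numpy.linalg.svd`. The binary term weighting and the use of absolute values (to absorb the sign ambiguity of singular vectors) are simplifying assumptions, not details from Gong and Liu:

```python
import numpy as np

def lsa_top_sentence(sentences):
    """Build a binary term-sentence matrix A, decompose A = T @ diag(sigma) @ S_t,
    and return the sentence whose entry in the first right singular vector
    (row 0 of S_t) has the largest absolute value."""
    # One column per sentence, one row per term (stop-word filtering omitted).
    tokenized = [set(s.lower().split()) for s in sentences]
    terms = sorted(set().union(*tokenized))
    A = np.array([[1.0 if t in toks else 0.0 for toks in tokenized]
                  for t in terms])
    # Row k of S_t scores how relevant concept k is to each sentence.
    T, sigma, S_t = np.linalg.svd(A, full_matrices=False)
    best = int(np.argmax(np.abs(S_t[0])))
    return sentences[best]
```

The sentence most aligned with the dominant concept, i.e. the direction of the largest singular value, is returned as the single-sentence summary.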
A variation of Latent Semantic Analysis was proposed in Josef Steinberger and Karel Jezek, Using latent semantic analysis in text summarization and summary evaluation, In Proc. ISIM'04, pages 93-100, 2004 (“Steinberger-Jezek LSA”). The variation takes the most important sentence as the sentence whose column vector in ST has the highest magnitude when weighted by the singular values in Σ.
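The Steinberger-Jezek weighting can be sketched in the same setting, scoring sentence i by the magnitude of its column of Sᵀ after each row is scaled by the corresponding singular value. The binary term-sentence matrix is again a simplifying assumption:

```python
import numpy as np

def sj_top_sentence(sentences):
    """Score sentence i as sqrt(sum_k (sigma_k * S_t[k, i])^2), the
    singular-value-weighted magnitude of its column of S_t, and return
    the highest-scoring sentence."""
    tokenized = [set(s.lower().split()) for s in sentences]
    terms = sorted(set().union(*tokenized))
    A = np.array([[1.0 if t in toks else 0.0 for toks in tokenized]
                  for t in terms])
    _, sigma, S_t = np.linalg.svd(A, full_matrices=False)
    # Weight each concept row by its singular value, then take column norms.
    scores = np.sqrt(((sigma[:, None] * S_t) ** 2).sum(axis=0))
    return sentences[int(np.argmax(scores))]
```

Unlike selecting on the first singular vector alone, this score lets a sentence that is moderately relevant to several important concepts outrank one tied to only the single dominant concept.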
Extractive summarization methods may have utility in the special case of documents that summarize their own content in a small number of sentences. Extractive summarization is inadequate, however, when a document contains no sentence or small set of sentences that alone adequately summarizes it.
Abstractive summarization is an alternative to extractive summarization. Ganesan et al. (Kavita Ganesan, ChengXiang Zhai, and Jiawei Han, Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions, In Proceedings of the 23rd international conference on computational linguistics, pages 340-348, Association for Computational Linguistics, 2010, http://kavita-ganesan.com/opinosis) disclose a symbolic abstractive method that takes all of the words in the document to be summarized and builds a graph. The method creates a node for each combination of word and part of speech, and connects two nodes with an edge if the associated words follow each other in the document. Once the graph is built, heavily traveled paths are found, and those paths become the summaries. This method is designed for situations where there is high redundancy. In fact, it is not really intended to summarize a "document" but rather to summarize a large number of reviews for a product. Because this method uses words as symbols to define nodes, it is a symbolic method, and so it requires a high density of text around relatively few topics: the text must say the same thing using the same words multiple times. Consequently, graph-based abstractive summarization and other symbolic methods may not perform well at characterizing entities based on sparse text. In addition, symbolic methods are not able to generalize, which makes them less amenable to supervised training.
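A rough sketch of the graph construction described above. The stubbed part-of-speech tag and the greedy path search are illustrative stand-ins for the actual Opinosis algorithm, which uses a real tagger and a more sophisticated search for well-formed paths:

```python
from collections import Counter, defaultdict

def build_word_graph(reviews):
    """One node per (word, part-of-speech) pair; a directed edge between
    nodes whose words are adjacent in some review, weighted by how often
    the adjacency occurs. POS tagging is stubbed out (every word tagged
    "X"); a real implementation would use an actual tagger."""
    edges = Counter()
    for review in reviews:
        tokens = [(w.lower(), "X") for w in review.split()]
        for a, b in zip(tokens, tokens[1:]):
            edges[(a, b)] += 1
    return edges

def heaviest_path(edges, start, length=3):
    """Greedily follow the heaviest outgoing edge from `start` -- a
    simplified stand-in for finding heavily traveled paths."""
    out = defaultdict(list)
    for (a, b), w in edges.items():
        out[a].append((w, b))
    path, node = [start], start
    for _ in range(length - 1):
        if not out[node]:
            break
        _, node = max(out[node])
        path.append(node)
    return [w for w, _ in path]
```

With highly redundant input such as repeated product reviews, the same adjacencies accumulate weight and the heaviest path reads out the consensus phrasing; with sparse text, no edge is traveled more than once and no such path emerges.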
Other approaches to abstractive summarization are disclosed in Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, Bing Xiang, Abstractive Summarization using Sequence-to-sequence RNNs and Beyond, arXiv preprint arXiv:1602.06023v3, 23 Apr. 2016, and in Baotian Hu, Qingcai Chen, Fangze Zhu, LCSTS: A Large Scale Chinese Short Text Summarization Dataset, In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1967-1972, Lisbon, Portugal, 17-21 Sep. 2015, Association for Computational Linguistics. The tools described in these approaches, however, are trained and tested on source datasets limited to news articles or short news and information pieces. News articles conventionally use an "inverted pyramid" format, in which the lead sentence describes the focus of the article and the most important facts are presented in the opening paragraph. Whether dictated by convention or authorial intent, the information required to accurately summarize a news article will be included in the opening sentences of the source. Sentences at the end of the article are likely to contribute very little information to a helpful summary of the article. The short news and information pieces (Weibo) described in Hu are focused and fact-dense because of the text length limitation of 140 characters. Also, such newspaper articles and short news and information pieces typically reflect a consistent viewpoint, thus providing an inherent focus to the exposition.
Such concise, focused, orderly and well-behaved source materials are significantly different in terms of organization and structure from the interaction data generated by entities such as networks of individuals that interact over time. A corpus of such interaction data (for example, a collection of tweets broadcast by Twitter®) may aggregate multiple utterances by multiple speakers speaking at different times. Because the interaction data may include utterances by multiple speakers, the utterances may reflect different, and possibly even conflicting or inconsistent viewpoints, and there is no journalist or editor to enforce the “inverted pyramid” format or to provide a concise summary of the topic of discussion. Because the interaction data may include statements, messages, or other utterances over a period of time, significant facts may be in the last utterances, as opposed to the first ones, or randomly distributed throughout the corpus. Known abstractive summarization techniques are neither trained nor tested to handle the disorganized and entropic content characteristic of interaction data generated by a network of individuals interacting over time.