1. Field of the Invention
The present disclosure relates generally to textual analytics. More specifically, the disclosure relates to analysis of sentiment, topic targeted sentiment, and semantic vector calculation explanation functions.
2. Background of the Invention
An important part of text analytics at present is the processing of text to determine the sentiment of a writer toward the topics discussed. People hold opinions about products, companies, people, and events that are often revealed, intentionally or not, by how they discuss these topics. By analyzing sentiment in aggregate, a great deal can be learned about the world.
However, much of the writing available today does not work well with traditional sentiment algorithms. Social media is a highly informal and terse medium. Improper grammar and spelling provide unique challenges to sentiment algorithms in this space. Additionally, the terseness inherent in very short form content, such as tweets and text messages, often obscures both implicit and explicit messages. In such a “noisy” environment, new techniques are necessary for exploiting all of the available information.
A single, cohesive written work regularly spans multiple topics, especially distinct but similar topics. Each topic may be treated slightly differently by the author, referencing different themes and entities and carrying different implied sentiment. For the field of Text Analytics, understanding the relationship between individual topics and calculated metadata provides a more in-depth view of what exactly is being said. Additionally, the consumer of Text Analytics analysis may be interested in only a subset of the ideas communicated in an article. For example, the consumer may be a corporation concerned only with its own business domain, or may be interested only in topics directly connected with it. Thus, an algorithm that understands the scope of topics and their interactions with other metadata can produce more focused results for the end user.
A number of techniques exist to transform a corpus of text into a matrix representation of the semantic relationships between terms and phrases in the corpus, in order to provide some degree of semantic knowledge to various text analytics processes. In general, a tf-idf variant is used to encode the relationships between terms and documents. Various operations may then be performed on the matrix, such as multiplying the matrix by its own transpose, taking the eigenvectors of the matrix, or clustering the data. The final, transformed matrix can be used to compare arbitrary pieces of text: the vectors corresponding to the content of each segment of text are found and combined with some operator to form a single vector representation of the entire text, and then a measure of vector distance, such as the cosine similarity between vectors, yields a numerical distance value between the two pieces of text.
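The tf-idf weighting and cosine-similarity comparison described above can be sketched as follows. This is a minimal, illustrative implementation, not the specific variant used by any particular engine; the function names, the simple whitespace tokenization, and the example documents are assumptions for the sake of the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf vectors over a shared vocabulary.

    Illustrative sketch: raw term counts for tf, log(N/df) for idf,
    and naive whitespace tokenization (all simplifying assumptions).
    """
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log(n / df[t]) for t in vocab}
    # One vector per document, one column per vocabulary term.
    return [[Counter(toks)[t] * idf[t] for t in vocab] for toks in tokenized]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices rose sharply",
]
vecs = tfidf_vectors(docs)
# Documents sharing vocabulary score higher than unrelated ones.
print(cosine_similarity(vecs[0], vecs[1]) > cosine_similarity(vecs[0], vecs[2]))
```

In practice, the per-term vectors for a longer text would be combined, for example by summation or averaging, before the distance measure is applied, and the matrix would typically undergo the further transformations mentioned above.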
While effective, this process can also be opaque. The engine is able to defend its decision to match pieces of text only by reference to similarities between vectors. The interpretation of what each column means may be limited, such as “both texts contained terms that appeared in a particular news story in the training data,” or, once further transformations have been applied, comprehensible only mathematically. Among other issues, this makes it significantly harder to fix misunderstandings by the engine, since it is not always clear where a misunderstanding is coming from.