Digital data has been growing at an enormous pace, and much of this growth, as much as 80%, is unstructured data, mostly text. With such large amounts of unstructured text becoming available both on the public internet and to enterprises internally, there is a significant need to analyze such data and to derive meaningful insight from it. Superior access to information is the key to superior performance in almost any field of endeavor. As a result, there is a growing demand for timely analysis of unstructured data to derive insights for the organization.
However, it is humanly impossible to analyze the vast amount of information that arrives every second at an exponentially increasing rate. To our knowledge, there have been no attempts at analyzing the meaning or impact of the text in a document on a topic of interest. Such a topic may be, for example, a company. For instance, a hedge fund manager is interested in what a document is saying about a company of interest. As another example, the treasurer of a corporation is interested in what the contents of a document imply about the creditworthiness of a customer or borrower. In yet another example, a pharmaceutical company is interested in the impact of the content of a document on a drug's efficacy. It is also important to point out that any method or system for understanding meaning should preferably not be a black box and must be capable of providing the user with an audit trail showing how the impact was derived.
A number of statistical methods are generally deployed to process unstructured text, and they can be employed to determine the meaning or impact of a document's contents on a concept. But such methods have major limitations in their ability to understand the real meaning of a document's contents. In a typical approach, a statistical model is first developed from a set of training documents, and an unclassified document is then classified into one or more categories by applying the statistical model. A variety of statistical approaches are available for this purpose, ranging from naïve Bayes classifiers to support vector machines. All statistical methods, irrespective of approach, have several limitations. First, given the large-scale nature of the problem, developing a robust model requires a large training set that is homogeneous with respect to the problem being solved. Second, statistical models are black boxes and not tractable: users are unable to understand the precise reason behind an outcome. Third, statistical methods are largely based on frequency or word patterns. Given the large number of ambiguous words in any word-based language like English, statistical methods are not able to interpret the fine-grained context in a document. An even more complex form of such ambiguity occurs in phrases that are semantically equivalent in their usage in a document but cannot be determined to be so without some external input. Such systems are unable to decipher whether a particular word is used in different contexts within different sections of the same document. Similarly, these systems are limited in identifying scenarios where two different expressions (e.g., "factory output" and "production from a unit") have substantially identical meanings in different sections of the document.
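The word-pattern limitation described above can be illustrated with a minimal sketch. It assumes a plain bag-of-words representation compared by cosine similarity, which is only one simple frequency-based technique and is not taken from any particular system. The function names and the "plant" sentences are hypothetical and chosen for illustration; the two phrases come from the example in the text.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Lowercase the text and count its word tokens (a plain frequency model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two phrases with substantially identical meaning share no tokens,
# so a word-level comparison treats them as unrelated.
equivalent_similarity = cosine_similarity(
    bag_of_words("factory output"),
    bag_of_words("production from a unit"),
)

# Hypothetical sentences in which "plant" is used in two different senses;
# the shared surface words still make them look partly similar.
ambiguous_similarity = cosine_similarity(
    bag_of_words("the plant increased output"),
    bag_of_words("the plant needs water"),
)

print(equivalent_similarity)  # 0.0
print(ambiguous_similarity)   # 0.5
```

Under these assumptions, the equivalent phrases score lower than the sentences that merely share an ambiguous word, which is the opposite of what a reader would conclude.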
Restricting the processing of a document's content to matching at the level of individual words can generate inaccuracies when interpreting the impact of the document's content. Therefore, there exists a need for a system and a method for a context-based, tractable interpretation of the meaning or impact of a document's content on a concept. The system and method should also be extensible to incorporate additional user-provided context without any additional programming.
Various natural language processing methods exist to understand the local sentiment of unstructured text. Generally, such methods use statistical tools and a set of sentiment lexicons to extract sentiments. For example, Hatzivassiloglou and McKeown, in their publication titled ‘Predicting the semantic orientation of adjectives’, published in EACL '97 Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, predict the semantic orientation of adjectives by grouping the adjectives into two clusters such that as many constraints as possible are satisfied, based on a hypothesis of how adjectives are separated. In another method, Wiebe, in the publication titled ‘Learning Subjective Adjectives from Corpora’, published in Proceedings of the 17th National Conference on Artificial Intelligence (AAAI-2000), analyzes adjectives for gradation and polarity and thereby uses statistical factors to predict the gradability of adjectives. In another method, Kim and Hovy, in their publication titled ‘Determining the sentiment of opinions’, published in Proceedings of the 20th International Conference on Computational Linguistics, rely on WordNet to generate lists of words with positive and negative orientation based on seed lists. However, these methods lack the ability to understand the full context and the interrelationships of the entire text as it impacts a concept. They do not decompose the entire contents of the document in a linguistic sense to understand the contextual meaning or impact of the contents of the document on a concept.
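The general idea of growing sentiment word lists from seed lists can be sketched as follows. This is an illustrative simplification and not the algorithm of any of the cited publications: the `SYNONYMS` table is a hypothetical toy stand-in for a lexical database such as WordNet, and `expand_seeds` and `lexicon_score` are assumed names.

```python
from collections import deque

# Hypothetical toy synonym links standing in for a lexical database
# such as WordNet; the entries are illustrative only.
SYNONYMS = {
    "good": ["fine", "strong"],
    "strong": ["robust"],
    "bad": ["poor", "weak"],
    "weak": ["fragile"],
}


def expand_seeds(seeds, synonyms, max_depth=2):
    """Breadth-first expansion of a seed list through synonym links."""
    expanded = set(seeds)
    queue = deque((word, 0) for word in seeds)
    while queue:
        word, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for neighbor in synonyms.get(word, []):
            if neighbor not in expanded:
                expanded.add(neighbor)
                queue.append((neighbor, depth + 1))
    return expanded


positive_words = expand_seeds({"good"}, SYNONYMS)
negative_words = expand_seeds({"bad"}, SYNONYMS)


def lexicon_score(text):
    """Count positive minus negative lexicon hits, ignoring all context."""
    tokens = text.lower().split()
    positives = sum(token in positive_words for token in tokens)
    negatives = sum(token in negative_words for token in tokens)
    return positives - negatives


print(sorted(positive_words))  # ['fine', 'good', 'robust', 'strong']
print(lexicon_score("strong demand hurt weak competitors"))  # 0
```

In this sketch the sample sentence scores zero because one positive and one negative word cancel out, regardless of which company the sentence favors. That is the kind of context the paragraph above says these methods do not capture.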
Further, the existing sentiment analysis methods fail to use prior knowledge of the domain to which the unstructured text pertains. Furthermore, none of these methods allows the meaning of the unstructured text to be interpreted in a contextually relevant manner. Additionally, these methods fail to provide accurate semantic analysis when the same unstructured text carries a different sentiment for two different audiences. That is to say, these methods fail to interpret words that are positive in one domain and negative in another, and words that are relevant in one domain and irrelevant in another. Such constraints create inaccuracies in the sentiment analysis. Therefore, there exists a need for a system and method for accurately interpreting unstructured text with reference to a specific concept in a tractable manner, so that the user can understand precisely how the engine interpreted the document's content and reached its conclusions.
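The point about domain-dependent polarity and relevance can be made concrete with a small sketch. It is not the system described in this document. It only assumes that each audience has its own word list, and the domain names, words, and polarity values in `DOMAIN_LEXICONS` are hypothetical.

```python
# Hypothetical per-domain polarity entries; the domain names, words, and
# values are illustrative assumptions, not data from any real lexicon.
DOMAIN_LEXICONS = {
    "equity_investor": {"buyback": 1, "dividend": 1},
    "credit_analyst": {"buyback": -1},
}


def domain_score(text, domain):
    """Sum word polarities for one domain; unlisted words count as irrelevant."""
    lexicon = DOMAIN_LEXICONS[domain]
    return sum(lexicon.get(token, 0) for token in text.lower().split())


def matched_words(text, domain):
    """Return the (word, polarity) pairs that produced the score, as a simple trace."""
    lexicon = DOMAIN_LEXICONS[domain]
    return [(t, lexicon[t]) for t in text.lower().split() if t in lexicon]


sentence = "board approves buyback and dividend"
print(domain_score(sentence, "equity_investor"))  # 2
print(domain_score(sentence, "credit_analyst"))   # -1
print(matched_words(sentence, "credit_analyst"))  # [('buyback', -1)]
```

The same sentence yields opposite signs for the two assumed audiences, and "dividend" is relevant to only one of them. A single domain-independent lexicon cannot express either distinction.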