1. Field of the Invention
The invention described herein relates to analysis of unstructured information and, in particular, relates to text analytics.
2. Related Art
More than 80% of the data gathered by users of information systems is unstructured. This data may be in the form of emails, blogs, websites, documents, spreadsheets, etc. As the number of computer and Internet users continues to grow, more and more unstructured content is created. This is evident from the growth of the Internet and of websites (about 10 million), and from the documents and emails residing on an individual's desktop. This growth of data has created a need for understanding and leveraging the data's content.
This has given rise to the emerging text analytics area. This area focuses on understanding, harnessing and linking contents of emails, websites, and text data, for example, in database records and documents (henceforth collectively called “documents”). The primary domain of expertise and knowledge comes from natural language processing (NLP), a sub-division of artificial intelligence (AI).
The advantages of text analytics are well known, and text analysis techniques are increasingly being adopted in all walks of life. Today, customer feedback submitted through a company's website is analyzed immediately and accurately for negative tones. Once such tones are identified, they are immediately used with the rest of the customer data for personalized action. Law enforcement agencies use text analysis techniques on incident reports to look for the modus operandi of suspected criminals. Such examples are plentiful. In spite of these compelling benefits, however, adoption of text analysis has been limited by several shortcomings in current processing techniques.
There are several methods/algorithms available in NLP that can be used in attempting to understand the language of a document and summarize what the document is about. The challenge has been to match the expectations of the reviewer with the output of the NLP algorithm(s).
Overcoming this challenge has been difficult, and several approaches have been proposed with varied degrees of success. Key approaches in this area include:
Use of clustering techniques to identify key themes and patterns: These techniques are very useful in identifying themes of interest and then identifying nuggets of information around these themes. Usually, these techniques make assumptions about themes of interest and build clustered word patterns around those assumptions. As the entire process is automated, these techniques often produce output that might not fit the user's interest. As a result, clustering is of limited value in situations where the user's interest is not centered on the assumed themes.
Idea extraction techniques to summarize contents based on ideas: These techniques are based on the notion that the purpose of natural language is to communicate ideas and that, therefore, capturing ideas (for example, from a document) is tantamount to capturing the essence of communication. While these techniques have applications in several business and consumer areas, they cannot be easily adapted to identify conditions that do not form the main idea or theme but might nevertheless be of interest to the user.
Semantic networks based on semantic indexing: Semantic networks are based on the notion of extracting semantic content from a document by combining like terms with the help of a thesaurus, dictionary, or other similar look-up tool. While these networks can accurately reflect contents, they tend to produce a high volume of results for even small document sets and thus are not scalable.
Semantic and syntax analysis: These techniques take into account grammatical constructs within the natural language to pick up themes and patterns in order to build a summarization. However, these techniques are computationally intensive and are not scalable.
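As a purely illustrative sketch of the "combining like terms" step attributed above to semantic indexing, the following Python fragment merges terms through a look-up table. The table contents, the function name `concept_counts`, and the sample sentence are hypothetical and are not taken from any particular prior-art system; a real look-up tool would be far larger.

```python
from collections import Counter
import re

# Hypothetical toy thesaurus: maps a surface term to a canonical concept.
THESAURUS = {
    "car": "vehicle",
    "automobile": "vehicle",
    "truck": "vehicle",
    "angry": "negative_tone",
    "upset": "negative_tone",
    "furious": "negative_tone",
}

def concept_counts(text):
    """Count concepts in text, merging like terms via the thesaurus."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Terms missing from the thesaurus are kept as their own concept.
    return Counter(THESAURUS.get(tok, tok) for tok in tokens)

counts = concept_counts(
    "The customer was upset; the car failed and he is furious."
)
print(counts.most_common())
```

Even in this small example, every term that is absent from the look-up table becomes its own entry, which hints at why such approaches can return a large volume of results for small document sets.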
What is needed, therefore, is a method and system for performing text analytics in a manner that avoids these shortcomings.
Further embodiments, features, and advantages of the present invention, as well as the operation of the various embodiments of the present invention, are described below with reference to the accompanying drawings.