Today, there are vast amounts of unstructured data on the Internet. There is a great need to be able to search and analyze this data in order to uncover useful information about particular areas of interest. This is desired not only by consumers who want to find information about people and products, but also by companies that want to know what their customers are saying about their products and services.
Traditionally, there have been two approaches to this problem. One approach is the “search approach.” With the search approach, textual data is collected and stored in a full-text index that allows for rapid searching of the data. Large public Internet portals (such as Google and Yahoo) as well as numerous commercial indexing solutions support this functionality.
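By way of illustration only, the full-text index underlying the search approach can be sketched as a small inverted index that maps each term to the set of documents containing it. The function names (tokenize, build_index, search), the sample documents, and the all-terms-must-match query semantics are assumptions made for this sketch; commercial indexing solutions are considerably more elaborate.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data, used only to exercise the sketch.
documents = {
    1: "The new space movie was thrilling",
    2: "I did not enjoy the space documentary",
    3: "A thrilling movie about the ocean",
}
index = build_index(documents)
matches = search(index, "thrilling movie")
```

No pre-configuration is involved here: any term that appears in the collected text can be queried as soon as the index has been built.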
A second approach to this problem is the “analytical approach.” The analytical approach collects items and runs them through various text mining algorithms to extract additional metadata. This additional processing may include detecting the language, extracting links to other data, or determining the sentiment of the author. The derived metadata is typically stored in a relational database, which allows for aggregate analytics such as determining which websites the data links to. Typically, these analytics are preconfigured to extract information relevant to the goal of the system.
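As a rough sketch of the analytical approach, the following example runs each collected item through two toy text mining steps (link extraction and a word-list sentiment score), stores the derived metadata in an in-memory relational database, and then answers an aggregate question about which websites are linked. The table layout, the word lists, the function names, and the sample items are all assumptions chosen for illustration; an actual system would use far more capable algorithms.

```python
import re
import sqlite3
from urllib.parse import urlparse

# Toy word lists; an assumption standing in for a real sentiment model.
POSITIVE = {"great", "love", "thrilling"}
NEGATIVE = {"boring", "hate", "awful"}


def extract_links(text):
    """Return every http(s) URL found in the text."""
    return re.findall(r"https?://\S+", text)


def score_sentiment(text):
    """Count positive words minus negative words."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def analyze(items):
    """Run the text mining step and store derived metadata relationally."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, sentiment INTEGER)")
    conn.execute("CREATE TABLE link (item_id INTEGER, host TEXT)")
    for item_id, text in items.items():
        conn.execute("INSERT INTO item VALUES (?, ?)", (item_id, score_sentiment(text)))
        for url in extract_links(text):
            conn.execute("INSERT INTO link VALUES (?, ?)", (item_id, urlparse(url).netloc))
    return conn


def top_linked_hosts(conn):
    """Aggregate query: how many links point at each website."""
    return conn.execute(
        "SELECT host, COUNT(*) AS n FROM link GROUP BY host ORDER BY n DESC, host"
    ).fetchall()


# Hypothetical sample items, used only to exercise the sketch.
items = {
    1: "Great movie, see http://example.com/review and http://example.org/trailer",
    2: "Boring and awful. Details: http://example.com/rant",
}
conn = analyze(items)
hosts = top_linked_hosts(conn)
```

The aggregate query is only possible after every item has passed through the analyze step, which is the separate processing stage that distinguishes this approach from a plain full-text search.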
The advantages of the search approach are speed and simplicity. Without any pre-configuration, a full-text index allows ad-hoc searching of the data. For instance, if someone wants to find textual data about a particular movie, they can simply search for the title of the movie. However, the search approach does not give deeper insights, such as which websites are linked or how people feel about the movie. The analytical approach can provide this type of information, but it typically requires a separate, time-consuming text mining step. As a result, the analytical approach lacks the speed and simplicity needed for “ad-hoc” analysis.
Therefore, there is a need for a solution that combines the speed of the search approach and the deep insights of the analytical approach to provide for true ad-hoc analysis. Aspects of the present invention address this need.