Technical Field
The embodiments herein generally relate to analyzing the popularity of one or more user defined topics in big data, and, more particularly, to a system and method for analyzing the popularity of one or more user defined topics by identifying correlations between grams contained in user identified topical anchor documents, which respectively describe the one or more user defined topics, and grams contained in raw documents.
Description of the Related Art
With the advent of the Internet, the wealth of data contributed by individuals, businesses, and governments is rapidly increasing. Information retrieval systems, generally called search engines, are now an essential tool for finding information in large scale, diverse, and growing corpora such as the Internet. Generally, search engines create an index that relates documents to the individual words present in each document. A document is retrieved in response to a query containing a number of query terms, typically based on having some number of query terms present in the document. The retrieved documents are then ranked according to other statistical measures, such as frequency of occurrence of the query terms, host domain, link analysis, and the like. The retrieved documents are then presented to the user, typically in their ranked order, and without any further grouping or imposed hierarchy. In some cases, a selected portion of the text of a document is presented to provide the user with a glimpse of the document's content. The data that has massively accumulated has become known as big data. Analyzing a designated subject matter in the context of this massive data is very difficult. Identifying relevant relationships between topics, relative to a given designated subject matter, has become increasingly complex simply due to the huge amount of data that is available and that must be analyzed to discern which relationships are sufficiently important to anticipate a trend away from the historical background data.
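By way of illustration only, the index-based retrieval described above can be sketched as follows. This is a minimal sketch under stated assumptions and does not represent any particular search engine: the function names build_index and search, the min_terms parameter, and the sample documents are hypothetical, and ranking uses only the frequency of occurrence of the query terms, omitting host domain and link analysis.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Relate each word to the documents containing it, with occurrence counts."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] += 1
    return index


def search(index, query, min_terms=1):
    """Return ids of documents containing at least min_terms distinct query
    terms, ranked by total frequency of the query terms (ties by id)."""
    terms = set(query.lower().split())
    matched = Counter()  # number of distinct query terms found per document
    score = Counter()    # summed occurrences of query terms per document
    for term in terms:
        for doc_id, freq in index.get(term, {}).items():
            matched[doc_id] += 1
            score[doc_id] += freq
    hits = [d for d in matched if matched[d] >= min_terms]
    return sorted(hits, key=lambda d: (-score[d], d))


# Hypothetical sample corpus, for illustration only.
docs = {
    "d1": "big data analysis of big data trends",
    "d2": "search engines index documents",
    "d3": "data trends in search",
}
index = build_index(docs)
print(search(index, "data trends"))
print(search(index, "search data", min_terms=2))
```

In this sketch the results are returned as a flat ranked list, without any further grouping or imposed hierarchy, consistent with the presentation described above.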
The difficulty in analyzing these relationships is further complicated by the sheer number of different sources of information that deal with any given topic and by the different times at which the information becomes available. The locations, the authors, and the timeliness of the information must all be considered. The volume of data that is accessible will grow some 50 times between 2010 and 2020. Science and business have taken advantage of this massive accumulation of data by pulling together structured and unstructured data into massive databases, data warehouses, and data centers. The method of this invention analyzes this massive accumulation of data and identifies relationships between the data and identified topics for the associated subject matter under investigation, and the analysis enables the identification of trends in the data relative to the topics. Historically, it has been presumed that more data will provide better insight. Unfortunately, in practice the presumption has proven naive. Simply looking at more data does not always result in greater insight. More data generally requires a more complicated algorithm with little or no gain in insight into the relevancy of the information.
A significant complexity in any analysis is that data is available in both structured and unstructured formats. Structured data is provided in tables, lists, or charts where each element represents a fixed value of similarly formatted information linked by the table's parameters. More often, however, the information is unstructured and does not clearly identify the relevant information. In addition, important information is found in the metadata, that is, information about the data such as date, author, location, source, and key words. Unstructured data includes an address for the data and the content of the information in general text form, such as individual words or series of words, numbers, locations, names, and times. Current processing techniques allow operations on this data using greater computing power, memory space, and processor time, but such operations do not necessarily provide better or more accurate analysis.
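By way of illustration only, the distinction drawn above between structured data, unstructured data, and metadata can be sketched as follows. This is a hypothetical representation: the class names Metadata and UnstructuredDocument, the field names, and all sample values are assumptions made for illustration and are not taken from any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Information about the data rather than its content."""
    date: str = ""
    author: str = ""
    location: str = ""
    source: str = ""
    keywords: list[str] = field(default_factory=list)


@dataclass
class UnstructuredDocument:
    """An address for the data plus free text whose relevant parts are not labeled."""
    address: str
    text: str
    metadata: Metadata = field(default_factory=Metadata)


# Structured data: each element is a fixed value linked by the table's parameters.
structured_row = {"region": "North", "quarter": "Q1", "units_sold": 120}

# Unstructured data: the same kind of fact is buried in general text.
doc = UnstructuredDocument(
    address="https://example.com/report",  # hypothetical address
    text="Sales rose in the North during the first quarter.",
    metadata=Metadata(author="J. Doe", keywords=["sales"]),
)

print(structured_row["units_sold"])  # direct lookup by parameter
print(doc.text.split())              # text must first be broken into words
```

In the structured row the value of interest is reached directly by its parameter, whereas the unstructured text must first be broken into words or series of words before any analysis can take place.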
Accordingly, there remains a need for an effective theory, system, and method to analyze the massive collection of data known as big data.