Various embodiments of the present invention relate generally to Business Intelligence (BI) and more specifically to a method and system for BI metrics on data collected from multiple data sources that may include unstructured data.
BI refers to technologies, applications and practices for collection, integration, analysis, and presentation of content such as business information. Current BI applications collect content from various information sources such as newspapers, articles, blogs and social media websites by using tools such as web crawlers, downloaders, and RSS readers. The collected content is manipulated or transformed in order to fit into predefined data schemes that have been developed to provide businesses with specific BI metrics. The content may be related to sales, production, operations, finance, etc. After collection and manipulation, the collected content is stored in a data warehouse or a data mart. The content is then transformed by applying information extraction techniques in order to provide the BI metrics to users. For example, an organization may utilize a BI application to obtain financial data in order to present a two-dimensional graph illustrating the stock prices of a particular organization “X” or a simple sentiment analysis providing the sentiment, i.e., positive or negative, expressed in the content. This output may then be interpreted by the organization to identify cues from the presented content and use them for business performance management. For example, an organization may provide the stock pricing graph to the public in order to inform users interested in investing about the organization's earnings and securities performance.
BI applications use Key Performance Indicators (KPIs), sometimes referred to as Key Performance Metrics (KPMs), to assess the current state of a business. For example, in an organization, inventory turnover is a KPI for managing inventory. Nowadays, organizations that face higher operational risks, such as credit card companies and financial management service providers, often make KPI-related data available on a weekly basis. In some cases, organizations may provide a daily analysis of the data for time-efficient decision making. Various information extracting techniques such as enterprise decision management, marketing analytics, predictive science, strategy science, credit risk analysis, and fraud analytics are applied to the KPIs. Conventionally, KPIs and the underlying information extracting techniques are applied to data that is already parameterized and maintained in a structured format. However, no equivalent information extraction technique exists that can be employed for various other types of data, such as unstructured data. Unstructured data includes information that either does not have a data structure or has one that is not easily usable by a computer program. Examples of unstructured data include, but are not limited to, text files such as Microsoft® Word documents, Portable Document Format (PDF) files, and e-mail records; image files such as Joint Photographic Experts Group (JPEG) files, Tagged Image File Format (TIFF) files, and Graphics Interchange Format (GIF) files; audio files such as MP3, Windows Media, and Waveform Audio (WAV) format files; and video files such as Moving Picture Experts Group (MPEG-4) files.
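By way of illustration only, the following minimal sketch shows how a conventional KPI such as inventory turnover may be computed when the underlying data is already parameterized in a structured record. The sketch assumes the commonly used definition of inventory turnover as cost of goods sold divided by average inventory; the field names and the figures in the record are hypothetical and are not taken from any particular BI application.

```python
def inventory_turnover(cost_of_goods_sold, opening_inventory, closing_inventory):
    """Inventory turnover = cost of goods sold / average inventory (assumed definition)."""
    average_inventory = (opening_inventory + closing_inventory) / 2
    if average_inventory == 0:
        raise ValueError("average inventory must be non-zero")
    return cost_of_goods_sold / average_inventory


# Hypothetical structured record: every value sits in a named, typed field,
# which is what conventional KPI calculations expect.
record = {
    "cost_of_goods_sold": 600_000.0,
    "opening_inventory": 90_000.0,
    "closing_inventory": 110_000.0,
}

turnover = inventory_turnover(**record)
print(turnover)
```

The calculation is trivial precisely because each input is available in a known field. An e-mail or a scanned report containing the same figures would offer no such fields to read from.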
Current BI applications are incapable of utilizing unstructured data for a variety of reasons. Current BI applications are designed or architected to provide specific analytics and thus expect a specific data schema or arrangement. Thus, current BI applications are not able to utilize the various metadata, either explicit or inherent, provided by unstructured data. For example, unstructured data fails to provide any explicit data structure and instead provides dimensions or identification attributes, such as tags or metadata that may describe the unstructured data. In the rare case that unstructured data has some explicit structure, it may not be compatible with any of the designed or turnkey BI applications. In addition, unstructured data may not provide a schema or other data descriptor that may be interpreted by current BI systems. Thus, current BI applications will fail to extract any base data on which analytics may be run. Finally, unstructured data is often in variable formats and structures, even within the same service area, market, type and content. Thus, implementers of current BI systems are unable to make assumptions about data, and attempts to automate the use of current BI systems on various unstructured data sources have failed.
Currently, a few methods exist which attempt to bring unstructured data into existing BI applications. Methods such as manual tagging and automated extraction attempt to address the problem of using unstructured data for BI applications, but such methods also have one or more disadvantages.
Manually tagging the unstructured data to provide a well-defined structure is disadvantageous for the following reasons. First, market research reveals that unstructured data accounted for 6 petabytes of capacity in 2008 and is expected to grow at an annual rate of 54% to 28.5 petabytes by 2010, making it difficult to cope with such a large volume of unstructured data. Second, a manual process may produce more errors as compared with an automated method of tagging. Thus, manual tagging fails to scale as the amount of unstructured data grows and may introduce a significant number of errors into the data.
Various automated information extracting techniques such as semantic web, Latent Semantic Analysis (LSA), Support Vector Machines (SVM), and search engines have been used in an attempt to provide data to BI applications. These techniques may assess content (structured and unstructured) and extract KPIs. However, these techniques also have various drawbacks.
The semantic web technology relies on publishing content in specific languages, for example, Resource Description Framework (RDF), Web Ontology Language (OWL), and Extensible Markup Language (XML). Specifically, the semantic web technology relies on RDF, which is based on setting RDF links between data items from different data sources but with the same properties, such as structure and size. Thus, there is an exponential growth in the data structure based on the link connectivity. Further, there is a limitation on the number of links that can be provided in the documents. Additionally, the semantic web technology performs unsatisfactorily for imprecise concepts present in user queries. For example, concepts such as “young” and “tall” are vague and cannot be understood without performing linguistic processing. Even for precise concepts, uncertainty exists; for example, a patient might report a set of symptoms which may correspond to a number of distinct diagnoses.
Further, in semantic web technology, various hierarchical ontologies are formed to classify data. Subsequently, ontologies from different sources are combined to form a complex ontology. Logical contradictions arise when combining ontologies from separate sources. These logical contradictions lead to inconsistencies that cannot be resolved without context- or domain-specific approaches. Moreover, these hierarchical ontologies are built from innumerable root nodes or class names with semantically duplicate terms. The semantic web technology cannot eliminate the duplicate terms present in the ontologies. Thus, while classifying the content in these ontologies, the semantic web system has to deal with large volumes of data, which causes processing speed to reduce considerably.
Another information extracting technique is LSA, which is a Natural Language Processing (NLP) technique that is used for tagging data to provide a structure to unstructured data. The technique uses a term-document matrix that provides the frequency of occurrence of a term in a document. However, using the LSA technique for tagging the unstructured data is disadvantageous in a number of ways. First, the LSA technique cannot capture the multiple meanings of a word or a phrase (i.e., polysemy). Additionally, the technique cannot differentiate between sentences that have the same grammatical structure but different interpretations. For example, “We gave the monkeys the bananas because they were hungry” versus “We gave the monkeys the bananas because they were over-ripe,” where the pronoun “they” refers to monkeys in the first sentence and bananas in the second sentence. Using the LSA technique, a string of words may be interpreted in different ways; therefore, the appropriate context of the string of words cannot be determined accurately. For example, the string “Time flies like an arrow” may be interpreted in a variety of ways. One interpretation is the common simile, “Time moves quickly just like an arrow does.” Alternatively, the phrase could be interpreted as an imperative to measure the speed of flying insects as one would measure that of an arrow, i.e., “(You should) time flies as you would (time) an arrow.”
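The term-document matrix described above, and the reason it cannot capture polysemy, may be illustrated with the following minimal sketch. The sketch builds only the raw count matrix; a full LSA implementation would additionally apply a dimensionality reduction step, which is omitted here. The function name and the two sample documents are hypothetical.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Hypothetical documents using "bank" in two unrelated senses.
docs = [
    "the bank approved the loan",
    "the river bank flooded",
]

vocabulary, matrix = term_document_matrix(docs)
bank_row = matrix[vocabulary.index("bank")]
print(vocabulary)
print(bank_row)
```

Both senses of “bank” are counted in the same row of the matrix, so the financial institution and the riverside are indistinguishable at the level of the term itself.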
Additionally, SVM is a Machine Learning (ML) technique that classifies the content in taxonomies using a set of supervised learning techniques. The taxonomies are updated automatically after predefined intervals. The major drawback of the technique is that its set of rules is used only during the classification phase. No rules for language are applied to differentiate sentences that have the same grammatical structure but different interpretations, or to identify multiple meanings of a word or a phrase (i.e., polysemy).
Most information extracting techniques rely on text search and search engines to index the data for text mining. Typically, text mining techniques facilitate the structuring of input data, the derivation of patterns from the structured data, and a final evaluation and interpretation of the output. There are various drawbacks related to using search engines in text mining. For example, search-related text mining identifies results on the basis of exact keyword and pattern matches; therefore, results that are imprecise or inferential are not identified. Further, no semantic analysis is performed during the search; thus, the context of the query is not well interpreted and understood by the system that is used for applying the information extracting techniques.
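The exact-match limitation noted above may be illustrated with the following minimal sketch of a keyword search. It is a simplified stand-in for a search engine index rather than a description of any particular product, and the function name, documents, and query are hypothetical.

```python
def keyword_search(documents, query):
    """Return indices of documents that contain every query term verbatim."""
    terms = query.lower().split()
    return [
        i
        for i, doc in enumerate(documents)
        if all(term in doc.lower().split() for term in terms)
    ]


# Hypothetical documents that all describe a similar event in different words.
docs = [
    "Company X reported a sharp drop in revenue",
    "Company X revenue fell by ten percent",
    "Company X earnings declined this quarter",
]

hits = keyword_search(docs, "revenue drop")
print(hits)
```

Only the first document is returned, because it alone contains both query terms verbatim. The second and third documents convey closely related information but are missed, since no semantic analysis relates “fell” or “declined” to “drop.”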
Information extracting techniques as described above are based on classifying content in pre-defined hierarchical ontologies that are stored in a database. These pre-defined ontologies are generally domain specific and not specific to the context of a user's query. Thus, classifying content in the pre-defined ontologies produces irrelevant results for the user's query.
While there are computational tools that perform sentiment analysis or opinion mining using NLP techniques, these tools simply grade the content as positive or negative and are insufficient in assessing the impact of the content. For example, a news item related to a particular brand may be graded as positive or negative based on a user's preference. However, the tool cannot assess the impact of the news item on the reputation of the brand. Also, these tools can only process one analytic at a time corresponding to the content that is collected. For example, the tool may only provide sentiment analysis and may not perform assessment to provide other analytics such as risk score analysis, portfolio analysis, etc., on the same content. Thus, a user has to rely on various computational tools, techniques, and research analysts' recommendations to process and receive the desired information from the content.
Various computational tools, such as IBM Intelligent Miner, are used to categorize a set of extracted documents in predefined categories. However, one of the major disadvantages of the IBM Intelligent Miner is that the documents are categorized in fixed categories, and no further categorization or refinement to next-level categories, such as organization-based and/or context-based categories, is possible. Further, the categories are generally first-level categories, which are not useable in providing numerous analytics that require multiple-level hierarchical classes or taxonomy. Also, different analytics cannot be applied to the categorized data, as this requires precise concepts (that are derived after high-level classification) and the ability of a system to apply various data/text mining techniques to different data sources.
The Extract, Transform, and Load (ETL) process, which involves extracting data and automatically (through a machine learning process) building a new structured database from the intake of the data, has been utilized to load data into structured databases. Descriptions of various embodiments of the ETL process can be found, for example, in Patent Applications US 2007/0011134 A1, titled “System And Method Of Making Unstructured Data Available To Structured Data Analysis Tools,” and US 2005/0108256 A1, titled “Visualization Of Integrated Structured And Unstructured Data,” and in U.S. Pat. No. 7,409,393 B2, titled “Data Gathering And Distribution System,” and U.S. Pat. No. 6,694,307 B2, titled “System For Collecting Specific Information From Several Sources Of Unstructured Digitized Data.” The result of the ETL process is a structured database that includes the extracted unstructured data in a known format. However, the patents/applications mentioned above and other ETL publications and applications fail to disclose automatically applying analytics to data that require various manipulations and modifications of the data during the data processing stage.
Additionally, all of the aforementioned information extracting techniques require textual or ASCII-based source content. However, with the advent of high-speed wired and wireless networks, it is evident that an enormous amount of non-text data such as image, video, and audio is being generated in the variety of business and personal transactions performed day-to-day. Thus, the current information extraction techniques are unable to utilize the various non-text data and fail to extract and utilize the large amount of potentially relevant data in their generation of KPIs.
Existing information extracting techniques described above have a number of drawbacks, including an inability to obtain data from both text and non-text data sources, a requirement for a pre-defined, pre-architected data scheme, a lack of scalability, a lack of flexibility, and a critical emphasis on the use of structured data. Therefore, in light of the foregoing discussion, there is a need for a method and system for BI analytics that can utilize textual and non-text data that is structured or unstructured, and that is scalable, efficient, flexible, and malleable in its ability to provide key BI metrics as they develop in the marketplace.