The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent the work is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Business information processing applications depend on having a technical understanding and the ability to mine unstructured data stores. These business information processing applications can include assessment of technologies, market trends, competitive products, technical systems and functions, and new and over the horizon emerging markets Unstructured data is neither homogenous in format, in technical form or structure, nor in its method of storage and retrieval. Unstructured data is, by definition, not stored, curated, formatted or made to be accessible in a standardized, machine-readable, cross-computer hardware and software platform manner.
To-date assessments by portfolio analyst concerned with information hidden in unstructured data fields and its impact on identification of risks, threats and opportunities has been done using either technical (e.g., technical functions and measures) and fundamental (e.g., semantic data, information and ontologies) analytics. However, full integration of technical and fundamental analytics to include the ability to identify and use informational signals from unstructured data, both implicit and explicit in origin, for the purpose of identification and characterization of “pre-requisite” conditions for certain outcomes to occur (e.g., risk proxies, analogies and analogies of analogies) has not been realized.
A significant challenge facing natural language processing (NLP) is that geometric increases in unstructured data create continuously changing text-streams that bring continuously changing meaning. Modern unstructured data is not amenable to “after-the-fact” processing or expert-system-dependent filtering, sifting, sorting and computing for the timely delivery of analytic results. Instead, only a system that can deliver real-time filtering, sifting, sorting, and computing on unstructured data content and that adapts in outputs as the underlying meaning of the data changes, is needed. Traditional approaches to syntactic and semantic processing, which is focused on word-, sentence-, paragraph-, document- and file-units is insufficient to the challenge because they do not address identifying the presence of hidden or implicit concepts that add risk to the purely symbolic based (i.e. dictionary) semantic interpretations. Specifically, traditional natural language processing (NLP) and computational linguistics, as represented by the disciplines of LSI/LSA (2), probabilistic and statistical data-driven models of semantic search (3), expert models and systems (4), concept graphs (5), semantic graphs (6), meta-tagging (7), and related fields, do not address the technical requirements for real-time processing of unstructured data for analog discovery.
For large data sets, similarities are usually described in the form of a symmetric matrix that contains all the pairwise relationships between the data in the collection. Unfortunately, pairwise similarity matrices do not lend themselves for numerical processing and visual inspection. A common solution to this problem is to embed the objects into a low-dimensional Euclidean space in a way that preserves the original pairwise proximities as faithfully as possible: for example, LSA, PCA and other such vector methods.
One approach, known as multidimensional scaling (MDS) or nonlinear mapping (NLM), converts the data points into a set of real-valued vectors that can subsequently be used for a variety of pattern recognition and classification tasks. Multidimensional scaling (MDS) is a statistical technique that attempts to embed a set of patterns described by means of a dissimilarity matrix into a low-dimensional plane in a way that preserves their original (semantically pairwise) interrelationships with minimum error and distortion. However, current MDS algorithms are very slow, and their use is limited to small data sets.