Humans organize things in their environment into semantically meaningful sets. Natural language offers one example of this organization. An adjective is an annotation label that can be associated with one or more nouns, and every noun X associated with adjective Y is an element of set Y. Nouns can also be sets themselves: the phrase “X is a Z” can be transformed into the logical statement “noun X is an element of set Z.” These natural language principles reflect an aspect of human cognition that has persisted across millennia. In today's computational age, this process of entity-to-set association has exploded into a universe of data.
The data warehouses that store information generated by the activities of enterprises, government agencies, social networks, medical or other types of research, sporting events, etc. may be arranged into loose, almost unstructured schemata or into complex thousand-table relational database systems. Such models can be transformed into a relatively simple schema based on assumptions, which can include 1) there are entities that are the focus of domain-specific research (e.g., people, genes, media items), 2) there are potential network connections between those entities (e.g., personal relationships, protein-protein interactions, nearest-neighbor media, hyperlinks), and 3) there are sets of entities, partitioned into set-categories. For example, San Francisco, Calif. as a set of people-entities is in the location set-category, and the University of California at San Francisco (UCSF) as a set of people-entities is in the alma mater set-category; there may also exist a different set UCSF in the employer set-category.
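The three assumptions above can be sketched as a small data structure. This is only an illustrative sketch, not the layout of any particular system: the class name `Schema`, its method names, and the example entities "alice" and "bob" are hypothetical and do not come from the source.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Schema:
    """Hypothetical container for entities, links, and categorized sets."""
    entities: set = field(default_factory=set)
    # Undirected entity-to-entity connections, each stored as a frozenset pair.
    links: set = field(default_factory=set)
    # set-category -> set name -> member entities
    sets: dict = field(default_factory=lambda: defaultdict(dict))

    def add_link(self, a, b):
        """Record a network connection between two entities."""
        self.entities.update((a, b))
        self.links.add(frozenset((a, b)))

    def add_to_set(self, category, name, entity):
        """Place an entity in the named set within a set-category."""
        self.entities.add(entity)
        self.sets[category].setdefault(name, set()).add(entity)

    def sets_of(self, entity):
        """Return sorted (category, set name) pairs that contain the entity."""
        return sorted(
            (category, name)
            for category, named in self.sets.items()
            for name, members in named.items()
            if entity in members
        )


schema = Schema()
schema.add_link("alice", "bob")
schema.add_to_set("location", "San Francisco, Calif.", "alice")
schema.add_to_set("alma mater", "UCSF", "alice")
schema.add_to_set("employer", "UCSF", "bob")
```

Keying sets by category first keeps the alma mater set UCSF and the employer set UCSF distinct even though they share a name, which mirrors the distinction drawn in the example above.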
A schema can be a simple form of a topic map that does not attempt to represent relationships between different sets/topics. For example, a topic map might explicitly model that the employer set/topic UCSF has an is-in relationship with the location set/topic San Francisco, Calif. A schema such as the Exploratory Gene Association Networks (EGAN) schema (described, for example, in “Hypergraph visualization and enrichment statistics: how the EGAN paradigm facilitates organic discovery from Big Data” by J. Paquette et al., Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 7865, 2011) can provide advantages that follow from the expectation that a human analyst, such as a domain expert, will be interpreting the information. The human analyst can supply his or her own complex mental map of how sets/topics are semantically related. Keeping the schema simple allows more metadata and sets to be included while keeping the workflow relatively simple for the user.
Entity types that can be grouped into sets (for example, as discussed above) can be monitored and researched via collection of empirical data, which can include information from a variety of sources, such as numbers, change rates, clicks, purchases, scores, votes, surveys, ratings, etc. The analytics/prediction industry is evolving along with the empirical input stream, offering algorithms for clustering, classification, and prediction, all of which can be parallelized across an array of cloud-based processors in order to find the needle in the haystack as quickly as possible. Many of these algorithms work on matrices of data, where each column in the matrix represents an entity and each row represents a variable that can be measured for each entity. However, current analytics paradigms for large empirical data sets generally confront two issues in addition to the general challenges of storage and parallelization: noise/sparseness and single needle focus.
The issue of noise/sparseness depends on the quality of the data collection process and the consistency and frequency of the variables being analyzed. All empirical data sets have some degree of noise, and even a little noise in the data can raise questions regarding whether a “best-candidate needle” (e.g., an answer to a query or the like) found by an algorithm is really the “correct” needle being sought. Substantial uncertainty can arise about the correctness of the “best-candidate needle” relative to other candidates (e.g., the second best, tenth best, etc.). Depending on the number of entities and the strength of the confidence values produced, it can be useful to consider many more candidates than just the top hit. It may be easy for an analyst to manually investigate one candidate, but extending such an analysis to the top hundred or more candidates can be challenging.
Single needle focus relates to situations in which an investigator is not interested in finding just one needle in the haystack but instead wishes to learn how the best candidates from the analysis, or a subset of entities that cluster together, are related to each other, and what those relationships indicate about the environment measured in the experiment. The hypothesis that drives this type of experiment is a systems-hypothesis: no individual entity in the environment is as important as different systems (e.g., sets) of entities. Systems-driven knowledge discovery can identify important trends such as social trends, purchasing behavior of consumers, hidden drivers of markets, communication flow in networks, and novel biological processes in disease.
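One simple way to look at a group of candidates as a system rather than one at a time is to tally which sets they share. The sketch below is a hedged illustration only: the function `set_counts`, the set names, and the entity identifiers are hypothetical, and a plain overlap count stands in for whatever statistic a real analysis would use.

```python
from collections import Counter


def set_counts(candidates, membership):
    """Tally how many candidates fall in each set.

    candidates: iterable of entity identifiers (e.g., the top hits of an analysis).
    membership: dict mapping set name -> set of member entities (hypothetical input).
    """
    chosen = set(candidates)
    counts = Counter()
    for name, members in membership.items():
        overlap = len(chosen & members)
        if overlap:  # record only sets that contain at least one candidate
            counts[name] = overlap
    return counts


# Hypothetical example data: three sets of entities and four top candidates.
membership = {
    "pathway A": {"g1", "g2", "g3"},
    "pathway B": {"g3", "g4"},
    "pathway C": {"g9"},
}
top_hits = ["g1", "g2", "g3", "g7"]
counts = set_counts(top_hits, membership)
```

Sets that contain many of the candidates are natural starting points for an analyst asking how those candidates relate to one another. A raw count does not account for set size, so it is a starting point rather than a measure of significance.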