Recent data suggests that nearly eighty-five percent of all digital data is found in unstructured files and that this data is growing annually at around sixty percent. One reason for the growth is that regulatory compliance acts, statutes, and the like (e.g., Sarbanes-Oxley, HIPAA, PCI) force companies to keep file data in an accessible state for extended periods of time. Likewise, information storehouses such as the Internet tend to keep file data accessible for extended periods of time, but meaningful connections between potentially related files or groups of files are difficult to establish. However, block level operations in computers are too low-level to apply any meaningful interpretation of this stored data beyond taking snapshots and block de-duplication. While other business intelligence products have been introduced to provide capabilities greater than block-level operations, they have been generally limited to structured database analysis and are much less meaningful when acting upon data stored in unstructured environments. In addition, the World Wide Web consists of billions, or even trillions according to some estimates, of interlinked pages, most of which have no semantic meta-data associated with the content on those pages.
Unfortunately, entities the world over have paid enormous sums of money to create and store their data, but cannot find much of it later in instances where it is haphazardly arranged or arranged less than intuitively. Not only would locating this information bring back value, but being able to observe patterns in it might also prove valuable despite its usefulness being presently unknown. However, entities cannot expend so much time and effort in finding this data that the cost outweighs its usefulness. Notwithstanding this, there are still other scenarios, such as government compliance, litigation, audits, etc., that dictate certain data/information be found and produced, regardless of its cost in time, money and effort. Thus, a clear need is identified in the art to better find, organize and identify digital data, especially data left in unstructured states.
In search engine technology, such as Internet search engines, large amounts of unrelated and unstructured digital data can be quickly gathered. However, most engines do little to organize the data other than give a hierarchical presentation. Also, when the engine finds duplicate versions of data, it offers few to no options on eliminating the replication or migrating/relocating redundancies. Thus, a further need in the art exists to overcome the drawbacks of search engines. Also, current search engine technology is hampered in the area of semantic relevance. For example, a search using the keyword “ball” yields results for spherical toys, formal dances, games, people, and/or companies. Unless the reference to the word “ball” is tagged with semantic meta-data, search engines are left with limited context on both the query and the indexed data in order to retrieve the most pertinent result set for a given query.
When it comes to large amounts of data, whether structured or not, compression techniques have been devised to preserve storage capacity, reduce bandwidth during transmission, etc. Modern compression algorithms, however, exist simply to reduce large blocks of data into smaller blocks according to their advertised compression ratios. As is known, some do so without data loss (“lossless”) while others do so with data loss (“lossy”). None, unfortunately, do so with a view toward recognizing similarities in the data itself.
From biology, it is known that highly similar species have highly similar DNA strings. In the computing context, consider two word processing files relating to stored baseball statistics. In a first file, words might appear for a baseball batter, such as “batting average,” “on base percentage,” and “slugging percentage,” while a second file might have words for a baseball pitcher, such as “strikeouts,” “walks,” and “earned runs.” Conversely, a third file wholly unrelated to baseball, statistics or sports, may have words such as “environmental protection,” “furniture,” or whatever comes to mind. It would be exceptionally useful if, during compression or upon later manipulation by an algorithm, a “mapping” could recognize the similarity in subject matter in the first two files, although they are not exact matches to one another, and provide options to a user. Appreciating that the “words” in the example files are represented in the computing context as binary bits (1's or 0's), which occurs by converting the English alphabet into a series of 1's and 0's through application of ASCII encoding techniques, it would be further useful if the compression algorithm could first recognize the similarity in subject matter of the first two files at the level of raw bit data. The reason for this is that not all files have words; some might instead represent pictures (e.g., .jpeg) or spreadsheets of numbers.
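As a minimal sketch of the ASCII conversion mentioned above, the following Python shows a word rendered as 1's and 0's, together with one possible way to score overlap between two files on raw bytes alone. The function names and the choice of byte n-grams with a Jaccard score are assumptions made purely for illustration; they are not the mapping contemplated above, and nothing here claims that such a score captures subject matter.

```python
def to_bits(text: str) -> str:
    """Return the ASCII bit string for text, one 8-bit group per character."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


def byte_ngrams(data: bytes, n: int = 3) -> set:
    """Collect every run of n consecutive bytes found in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def overlap(a: bytes, b: bytes, n: int = 3) -> float:
    """Jaccard overlap (0.0 to 1.0) of the byte n-grams in a and b.

    Hypothetical helper: it works on any bytes, so it applies equally to
    text, pictures, or spreadsheets, but it only measures shared byte runs.
    """
    grams_a, grams_b = byte_ngrams(a, n), byte_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


print(to_bits("ball"))  # 01100010 01100001 01101100 01101100
```

Because the comparison never decodes the bytes into words, the same call could be made on any pair of files, which is the property the paragraph above asks for. Whether the resulting score tracks subject matter is a separate question that this sketch does not answer.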
Appreciating that certain products already exist in the above-identified market space, clarity on the need in the art is as follows. One, present day “keyword matching” is limited to a select set of words that have been pulled from a document into an index for matching to the same exact words elsewhere. Two, “Grep” is a modern day technique that searches one or more input files for lines containing an identical match to a specified pattern. Three, “Beyond Compare,” and similar algorithms, are line-by-line comparisons of multiple documents that highlight differences between them. Four, block level data de-duplication has no application in compliance contexts, data relocation, or business intelligence.
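The limits of pattern matching and line-by-line comparison noted above can be illustrated with a short Python sketch. The helper names `grep` and `line_diff` are hypothetical, and the sketch relies only on the standard `re` and `difflib` modules; it is not the implementation of the Grep utility or of Beyond Compare.

```python
import difflib
import re


def grep(pattern: str, lines: list) -> list:
    """Return (line_number, line) pairs whose text matches pattern."""
    regex = re.compile(pattern)
    return [(i, line) for i, line in enumerate(lines, start=1)
            if regex.search(line)]


def line_diff(old: list, new: list) -> list:
    """Return only the removed ('- ') and added ('+ ') lines between documents."""
    return [d for d in difflib.ndiff(old, new) if d.startswith(("- ", "+ "))]


# Pattern matching finds the characters "ball" with no notion of meaning.
hits = grep(r"ball", ["a formal ball", "baseball stats", "furniture"])
print(hits)  # [(1, 'a formal ball'), (2, 'baseball stats')]

# Line comparison reports that the lines differ, not that both concern pitching.
changes = line_diff(["batting average", "walks"],
                    ["batting average", "strikeouts"])
```

Both helpers operate on exact characters or exact lines, so neither can report that two documents with different wording share a subject.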
In modern day “relevancy” systems, most, if not all, utilize explicit user interaction to harvest relevancy data. For example:
Amazon.com: Users purchase books or other products and Amazon suggests other books and products that the user might find interesting based on books/products that were purchased by others who bought the same exact book/product. However, Amazon's relevancy engine would not work if people did not buy books/products at the Amazon website. Also, the more that people make purchases, especially of books at the same time, the more data points Amazon has to find relevant data. Conversely, if no one has ever made an identical purchase, Amazon has no mechanism to make suggestions to other people other than by way of keyword associations.
Netflix.com: Users rent movies, and place others in waiting queues, and Netflix suggests other movies that the user might find interesting. Similar to Amazon, however, this approach relies on others watching or selecting those same exact movies. The same is true at Fandango.com, whereby Fandango suggests movies to users based on same ticket purchases by other users.
Online store web analytics: Users browse through online web stores or online catalogs and web analytic software determines a length of stay on a page, how many times the page or site is revisited, what else occurred during visitation, etc. In turn, new merchandise, package deals, coupons, etc. are suggested for purchase/downloading by the user.
Social Networking sites, e.g., Facebook, LinkedIn, Plaxo, etc.: These all suggest “friends” that users might want to “connect with” based on meta data and other associations with connections to common friends and their connections/friends, and so on. Similarly, websites such as Flickr, YouTube, pandora.com, etc., offer relevancy services, but they are all founded on structured data as well as input from other users.
At Hunch.com, Hunch offers the tagline: “Hunch helps you make decisions and gets smarter the more you use it.” In other words, Hunch develops more knowledge the more it is trained, either explicitly or by people using its services, which is recorded. This is then harvested for finding relevant data.
In “Programming Collective Intelligence,” Toby Segaran, O'Reilly (ISBN-10: 0-596-52932-5; ISBN-13: 9780596529321), basic algorithms are used to “demonstrate[ ] how you can build web applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.”
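The purchase-based suggestions described in the foregoing examples can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the engine of any named site: the `suggest` function, the user names, and the item names are all hypothetical.

```python
from collections import Counter


def suggest(purchases: dict, user: str, top_n: int = 3) -> list:
    """Suggest items bought by other users who share a purchase with `user`.

    `purchases` maps a user name to the set of items that user bought.
    """
    owned = purchases.get(user, set())
    counts = Counter()
    for other, items in purchases.items():
        if other == user or not (owned & items):
            continue  # no shared purchase, so this user contributes nothing
        counts.update(items - owned)  # tally items the target user lacks
    return [item for item, _ in counts.most_common(top_n)]


purchases = {
    "ann": {"book_a"},
    "bob": {"book_a", "book_b"},
    "cy": {"book_a", "book_b", "book_c"},
    "dee": {"book_z"},
}
print(suggest(purchases, "ann"))  # ['book_b', 'book_c']
print(suggest(purchases, "dee"))  # []
```

The empty result for the last user shows the dependence on explicit user activity: when nobody else has made the same purchase, this style of system has nothing to offer.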
While there is no shortage of art directed to finding relevant data based on users' actions, there is a dearth of knowledge for situations involving unstructured data, a lack of history or other record keeping, and user collaboration, to name a few, especially in real time. As such, a need exists to find relevancy information when no action has occurred by an individual or collaborative team to explicitly or implicitly start the process of finding it. Beyond that, the need further extends to finding this information in unstructured data where no database, no meta data, etc., exists, as well as finding it in traditionally structured data (e.g., the foregoing movie example with a database storing movie meta data such as type, genre, rating, content, keywords, actors, directors, etc., as well as the number of users who have rented the movie, and indicia of those users). Even further, a need exists to take such unstructured data, establish relevant groupings, and then reveal semantic context for the data, so that relevant and related information can be provided automatically. The benefit of such an automated system is clear: user bias is eliminated, vast quantities of data can be processed, and associations may be revealed independent of the intent of the association and/or beyond what is known or assumed by a user.
In a particular example, challenges associated with databases containing vast amounts of information, such as the Internet or World Wide Web, are known. For example, a human information consumer is capable of considering information, such as on a Web page, and assigning a context to that information to allow differentiation from other Web pages using similar or identical terms but in a distinct context. However, computing devices cannot make such associations, at least not easily. The Semantic Web, broadly, is an evolving development of the Internet or World Wide Web wherein the meaning (i.e., semantics) of information and services on the Web is desirably defined, allowing greater ease in “understanding” Web content and satisfying the requests of users and machines to access and utilize the Web content. Particular challenges to developing the Semantic Web include vastness, vagueness, uncertainty, inconsistency, and deceit.
As examples of vastness, the World Wide Web contains at least 48 billion pages. The SNOMED CT medical terminology ontology contains 370,000 class names, and existing technology has not been able to eliminate all semantically duplicated terms. Thus, automated “reasoning” systems must deal with vast amounts of inputs.
Information contained on the World Wide Web is rife with imprecise concepts, i.e., “young,” “tall,” etc. This is a result of vague user queries, of concepts represented by content providers, of matching query terms to provider terms, and of trying to combine different knowledge bases with overlapping but subtly different concepts. Fuzzy logic is most commonly used to deal with vagueness.
Likewise, there are precise concepts with uncertain values, such as when a patient presents a set of symptoms that corresponds to a number of different diagnoses, each with a different probability. Probabilistic reasoning techniques are most commonly used to address uncertainty.
Logical contradictions inevitably arise during development of large ontologies, and when ontologies from separate sources are combined. Deductive reasoning may fail catastrophically when faced with inconsistency. Defeasible reasoning and paraconsistent reasoning are two techniques which may be employed when dealing with inconsistency.
Deceit occurs when a producer of information is intentionally misleading the information consumer. Cryptography techniques are currently used to alleviate such a threat.
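To make the treatment of vagueness and uncertainty above more concrete, the following Python sketch shows a fuzzy membership function for the imprecise concept “tall” and an application of Bayes' rule to competing diagnoses. The function names, the 160 cm and 190 cm thresholds, and the diagnosis labels and probabilities are arbitrary assumptions chosen for illustration only.

```python
def tall_membership(height_cm: float, low: float = 160.0,
                    high: float = 190.0) -> float:
    """Degree (0.0 to 1.0) to which a height counts as "tall" (linear ramp).

    The low/high thresholds are assumed values, not taken from any standard.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)


def posterior(priors: dict, likelihoods: dict) -> dict:
    """Bayes' rule: P(diagnosis | symptoms) for each candidate diagnosis.

    `priors` holds P(diagnosis); `likelihoods` holds P(symptoms | diagnosis).
    """
    joint = {d: priors[d] * likelihoods[d] for d in priors}
    total = sum(joint.values())
    return {d: p / total for d, p in joint.items()}


print(tall_membership(175.0))  # 0.5
print(posterior({"flu": 0.5, "cold": 0.5}, {"flu": 0.75, "cold": 0.25}))
# {'flu': 0.75, 'cold': 0.25}
```

The first function replaces a yes/no answer with a graded one, which is the essence of the fuzzy approach. The second keeps every diagnosis in play and ranks them by probability rather than selecting one outright.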
On a grander scale, the need extends even further to serve advanced notions of identifying new business intelligence, conducting operations on completely haphazard data and organizing it, providing new useful options to users, providing new user views, providing new encryption products, and identifying highly similar data, to name a few. As a byproduct, solving this need will create new opportunities in minimizing transmission bandwidth and storage capacity, among other things. Still further, however, improvements are possible extending beyond such relevancy grouping. That is, rather than relying on specific “tags,” such as keywords or metatags, to provide associations between files or groups of files, improvements relating to association of files or groups of files using particular content are contemplated herein. Naturally, any improvements along such lines should contemplate good engineering practices, such as stability, ease of implementation, unobtrusiveness, etc.