Recent data suggests that nearly eighty-five percent of all digital data is found in unstructured files and that it is growing annually at around sixty percent. One reason for the growth is that regulatory compliance acts, statutes, etc. (e.g., Sarbanes-Oxley, HIPAA, PCI) force companies to keep file data in an accessible state for extended periods of time. However, block-level operations in computers are too low-level to apply any meaningful interpretation to this stored data beyond taking snapshots and block de-duplication. While other business intelligence products have been introduced to provide capabilities greater than block-level operations, they have generally been limited to structured database analysis. They are much less meaningful when acting upon data stored in unstructured environments.
Unfortunately, entities the world over have paid enormous sums of money to create and store their data, but cannot find much of it later in instances where it is haphazardly arranged or arranged less than intuitively. Not only would locating this information bring back value, but being able to observe patterns in it might also prove valuable despite its usefulness being presently unknown. However, entities cannot expend so much time and effort in finding this data that the cost outweighs its usefulness. Notwithstanding this, there are still other scenarios, such as government compliance, litigation, audits, etc., that dictate certain data/information be found and produced, regardless of its cost in time, money and effort. Thus, a clear need is identified in the art to better find, organize and identify digital data, especially data left in unstructured states.
In search engine technology, large amounts of unrelated and unstructured digital data can be quickly gathered. However, most engines do little to organize the data other than give a hierarchical presentation. Also, when the engine finds duplicate versions of data, it offers few to no options on eliminating the replication or migrating/relocating redundancies. Thus, a further need in the art exists to overcome the drawbacks of search engines.
When it comes to large amounts of data, whether structured or not, compression techniques have been devised to preserve storage capacity, reduce bandwidth during transmission, etc. Modern compression algorithms, however, simply exist to scrunch large blocks of data into smaller blocks according to their advertised compression ratios. As is known, some do it without data loss (lossless) while others are “lossy.” None do it, unfortunately, with a view toward recognizing similarities in the data itself.
From biology, it is known that highly similar species have highly similar DNA strings. In the computing context, consider two word processing files relating to stored baseball statistics. In a first file, words might appear for a baseball batter, such as “batting average,” “on base percentage,” and “slugging percentage,” while a second file might have words for a baseball pitcher, such as “strikeouts,” “walks,” and “earned runs.” Conversely, a third file wholly unrelated to baseball, statistics or sports may have words such as “environmental protection,” “furniture,” or whatever comes to mind. It would be exceptionally useful if, during times of compression or upon later manipulation by an algorithm, a “mapping” could recognize the similarity in subject matter in the first two files, although they are not exact to one another, and provide options to a user. Appreciating that the “words” in the example files are represented in the computing context as binary bits (1's or 0's), which occurs by converting the English alphabet into a series of 1's and 0's through application of ASCII encoding techniques, it would be further useful if the compression algorithm could first recognize the similarity in subject matter of the first two files at the level of raw bit data. The reason for this is that not all files have words; some instead represent pictures (e.g., .jpeg) or spreadsheets of numbers.
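As a rough illustration of the ASCII point above, the following Python sketch renders text as the series of 1's and 0's a computer actually stores, and then compares two byte sequences without interpreting them as words. The function names (`to_bits`, `byte_ngrams`, `similarity`) are hypothetical, and the byte n-gram Jaccard overlap is only a generic stand-in measure chosen for illustration; it is not a method taken from this document, and no claim is made about how it would score the baseball example files.

```python
def to_bits(data: bytes) -> str:
    """Render raw bytes as a string of 1's and 0's, eight bits per byte."""
    return "".join(f"{byte:08b}" for byte in data)


def byte_ngrams(data: bytes, n: int = 3) -> set:
    """Collect every run of n consecutive bytes found in the data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def similarity(a: bytes, b: bytes, n: int = 3) -> float:
    """Jaccard overlap of byte n-grams (assumed stand-in measure), 0.0 to 1.0."""
    grams_a, grams_b = byte_ngrams(a, n), byte_ngrams(b, n)
    union = grams_a | grams_b
    # Inputs shorter than n bytes yield no n-grams; report no measurable overlap.
    return len(grams_a & grams_b) / len(union) if union else 0.0


# The word "on" becomes two ASCII bytes (111 and 110), i.e. sixteen bits.
print(to_bits("on".encode("ascii")))  # 0110111101101110

# The comparison never decodes the bytes, so it applies equally to non-text data.
print(similarity(b"abcd", b"abcd"))  # 1.0
print(similarity(b"abcd", b"wxyz"))  # 0.0
```

Because the comparison operates on bytes rather than decoded words, the same call could in principle be applied to picture or spreadsheet data, which is the motivation given above for working at the level of raw bits.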
Appreciating that certain products already exist in the above-identified market space, clarity on the need in the art is as follows. One, present day “keyword matching” is limited to a select set of words that have been pulled from a document into an index for matching to the same exact words elsewhere. Two, “Grep” is a modern day technique that searches one or more input files for lines containing an identical match to a specified pattern. Three, “Beyond Compare,” and similar algorithms, are line-by-line comparisons of multiple documents that highlight differences between them. Four, block level data de-duplication has no application in compliance contexts, data relocation, or business intelligence.
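The second and third techniques above can be approximated in a few lines of Python, which makes their limits concrete. This is a minimal sketch only: `grep` and `line_diff` are hypothetical helper names, the standard `re` and `difflib` modules stand in for the actual Grep and Beyond Compare tools, and the sample lines are invented for illustration.

```python
import difflib
import re


def grep(pattern: str, lines: list[str]) -> list[str]:
    """Return the lines that contain a match for the pattern, grep-style."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def line_diff(old: list[str], new: list[str]) -> list[str]:
    """Return only the removed (-) and added (+) lines between two documents."""
    return [
        entry for entry in difflib.ndiff(old, new)
        if entry.startswith(("+ ", "- "))
    ]


# Invented sample data for illustration.
pitcher_v1 = ["strikeouts: 10", "walks: 2"]
pitcher_v2 = ["strikeouts: 12", "walks: 2"]

print(grep("walks", pitcher_v1))  # ['walks: 2']
print(grep("Walks", pitcher_v1))  # [] because the pattern must match exactly
print(line_diff(pitcher_v1, pitcher_v2))
```

Both helpers answer only whether text is identical or where it differs. Neither offers any indication that a file of pitching statistics and a file of batting statistics concern related subject matter.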
In modern day “relevancy” systems, most, if not all, utilize explicit user interaction to harvest relevancy data. For example:
Amazon.com: Users purchase books or other products and Amazon suggests other books and products that the user might find interesting based on books/products that were purchased by others who bought the same exact book/product. However, Amazon's relevancy engine would not work if people did not buy books/products at the Amazon website. Also, the more that people make purchases, especially books at the same time, the more data points Amazon has to find relevant data. Conversely, if no one has ever made an exactly similar purchase, Amazon has no mechanism to make suggestions to other people other than by way of keyword associations.
Netflix.com: Users rent movies, and place others in waiting queues, and Netflix suggests other movies that the user might find interesting. Similar to Amazon, however, this approach relies on others watching or selecting those same exact movies. The same is true at Fandango.com, whereby Fandango suggests movies to users based on the same ticket purchases made by other users.
Online store web analytics: Users browse through online web stores or online catalogs and web analytic software determines a length of stay on a page, how many times the page or site is revisited, what else occurred during visitation, etc. In turn, new merchandise, package deals, coupons, etc. are suggested for purchase/downloading by the user.
Social Networking sites, e.g., Facebook, LinkedIn, Plaxo, etc.: These all suggest “friends” that users might want to “connect with” based on meta data and other associations with connections to common friends and their connections/friends, and so on. Similarly, websites such as Flickr, YouTube, pandora.com, etc., offer relevancy services, but they are all founded on structured data as well as input from other users.
At Hunch.com, Hunch offers the tagline: “Hunch helps you make decisions and gets smarter the more you use it.” In other words, Hunch develops more knowledge the more it is trained, either explicitly or by people using its services, which is recorded. This is then harvested for finding relevant data.
In “Programming Collective Intelligence,” Toby Segaran, O'Reilly (ISBN 10: 0-596-52932-5; ISBN 13: 9780596529321), basic algorithms are used to “demonstrate[ ] how you can build web applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.”
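The purchase-based and rental-based suggestions described in the foregoing examples share a common shape, which can be sketched as follows. This is an assumed, simplified co-purchase counter written for illustration; it is not the actual engine of Amazon, Netflix, or any other named service, and the function name `suggest`, the user names, and the item names are all hypothetical.

```python
from collections import Counter


def suggest(purchases: dict[str, set[str]], user: str, top_n: int = 3) -> list[str]:
    """Suggest items bought by others who share an exact purchase with the user."""
    owned = purchases.get(user, set())
    scores = Counter()
    for other, items in purchases.items():
        if other == user or not (owned & items):
            continue  # no identical purchase in common, so nothing can be inferred
        for item in items - owned:
            scores[item] += 1  # one vote per overlapping user who bought the item
    return [item for item, _ in scores.most_common(top_n)]


# Invented purchase history for illustration.
history = {
    "ann": {"book_a", "book_b"},
    "bob": {"book_a", "book_c"},
    "cat": {"book_a", "book_c", "book_d"},
    "dan": {"book_z"},
}

print(suggest(history, "ann"))  # ['book_c', 'book_d']
print(suggest(history, "dan"))  # [] since no one else bought book_z
```

The empty result for the last user illustrates the limitation noted above: without an identical prior action by some other user, an approach of this kind has nothing on which to base a suggestion.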
While no shortage exists in the art directed to finding relevant data, or not, based on users' actions, there is a dearth of knowledge when users involve themselves with unstructured data, a lack of history or other record keeping, and user collaboration, to name a few, especially in real time. As such, a need exists to find relevancy information when no action has occurred by an individual or collaborative team to explicitly or implicitly start the process of finding it. Above and beyond, the need further extends to finding this information in unstructured data where no database, no meta data, etc., exists, as well as finding it in traditionally structured data (e.g., the foregoing movie example with a database storing movie meta data such as type, genre, rating, content, keywords, actors, directors, etc., as well as the number of users who have rented the movie, and indicia of those users).
On a grander scale, the need extends even further to serve advanced notions of identifying new business intelligence, conducting operations on completely haphazard data, and organizing it, providing new useful options to users, providing new user views, providing new encryption products, and identifying highly similar data, to name a few. As a byproduct, solving this need will create new opportunities in minimizing transmission bandwidth and storage capacity, among other things. Naturally, any improvements along such lines should contemplate good engineering practices, such as stability, ease of implementation, unobtrusiveness, etc.