Conventional data storage systems and subsystems typically control use of and access to files, directories, and the data therein primarily using meta data, for example file name, file type, file location, and other like parameters. In such systems/subsystems, users are allowed or denied access, and/or permitted or forbidden to take particular actions, based on predetermined meta data criteria. For example, based on specific meta data criteria an individual may be allowed, or not, to create or delete content, read and/or write content, list or browse content, move or copy files or content, open/close files/content, view or navigate, and the like. This type of access control relies only on specific elements found in the individual files, such as file name, path, location, file type, file extension, and other such direct file content. Conversely, conventional methods provide no means for controlling access via other, more flexible parameters where perhaps predetermined meta data criteria have not yet been created. For example, unstructured or semi-structured data may not yet have identified meta data parameters or commonalities that would allow predetermined criteria to govern whether an individual is or is not allowed to access a particular file or group of files or to take a particular action, let alone to alter access privileges based on additional criteria such as user location and the like.
In other words, an individual may be allowed or denied access to a particular file based on specific, pre-established meta data criteria identified for that user or user class. However, if the same individual encounters similar, related, or otherwise relevant files that do not share the pre-established meta data criteria indicating that the user should or should not have access, then the individual may still be able to access information to which she has no right, and may even be able to extrapolate knowledge gained thereby to the prior file to which access was denied based on the information in the related file to which she was granted access.
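The gap described above can be illustrated with a minimal sketch of a conventional meta data-based check. The rule format, user classes, and file paths below are hypothetical and chosen only for illustration; they do not come from any particular storage product.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class DenyRule:
    """One hypothetical meta data rule: deny a user class any path matching a glob."""
    user_class: str
    path_glob: str


# Hypothetical policy: contractors may not open spreadsheets stored under finance/.
DENY_RULES = [DenyRule("contractor", "finance/*.xlsx")]


def is_allowed(user_class: str, path: str) -> bool:
    """Return False only when a predetermined meta data rule matches the path."""
    for rule in DENY_RULES:
        if rule.user_class == user_class and fnmatchcase(path, rule.path_glob):
            return False
    return True


# The protected file is blocked by its location and extension.
print(is_allowed("contractor", "finance/q3_forecast.xlsx"))
# A related file with different meta data passes, whatever it contains.
print(is_allowed("contractor", "shared/notes/q3_forecast_summary.txt"))
```

Because the check inspects only the path, the second file is granted even if it repeats the figures in the first, which is the situation the preceding paragraph describes.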
In that regard, recent data suggests that nearly eighty-five percent of all digital data is found in unstructured files and it is growing annually at around sixty percent. In such unstructured or semi-structured data, establishing predetermined meta data criteria determining whether a particular individual or class of individuals should or should not be allowed to access files or take some action on those files may simply not be possible. One reason for the growth is that regulatory compliance acts, statutes, etc., (e.g., Sarbanes-Oxley, HIPAA, PCI) force companies to keep file data in an accessible state for extended periods of time. However, block level operations in computers are too low-level to apply any meaningful interpretation of this stored data beyond taking snapshots and block de-duplication. While other business intelligence products have been introduced to provide capabilities greater than block-level operations, they have been generally limited to structured database analysis. They are much less meaningful when acting upon data stored in unstructured environments.
Unfortunately, entities the world over have paid enormous sums of money to create and store their data, but cannot find much of it later in instances where it is haphazardly arranged or arranged less than intuitively. Not only would locating this information bring back value, but being able to observe patterns in it might also prove valuable despite its usefulness being presently unknown. However, entities cannot expend so much time and effort in finding this data that it outweighs its usefulness. Notwithstanding this, there are still other scenarios, such as government compliance, litigation, audits, etc., that dictate certain data/information be found and produced, regardless of its cost in time, money and effort. Thus, a clear need is identified in the art to better find, organize and identify digital data, especially data left in unstructured states.
As another example, in search engine technology, large amounts of unrelated and unstructured digital data can be quickly gathered. However, most engines do little to organize the data other than give a hierarchical presentation. Also, when the engine finds duplicate versions of data, it offers few to no options on eliminating the replication or migrating/relocating redundancies. Thus, a further need in the art exists to overcome the drawbacks of search engines.
When it comes to large amounts of data, whether structured or not, compression techniques have been devised to preserve storage capacity, reduce bandwidth during transmission, etc. Modern compression algorithms, however, simply exist to scrunch large blocks of data into smaller blocks according to their advertised compression ratios. As is known, some do it without data loss (lossless) while others do it with data loss (lossy). None do it, unfortunately, with a view toward recognizing similarities in the data itself.
From biology, it is known that highly similar species have highly similar DNA strings. In the computing context, consider two word processing files relating to stored baseball statistics. In a first file, words might appear for a baseball batter, such as “batting average,” “on base percentage,” and “slugging percentage,” while a second file might have words for a baseball pitcher, such as “strikeouts,” “walks,” and “earned runs.” Conversely, a third file wholly unrelated to baseball, statistics or sports, may have words such as “environmental protection,” “furniture,” or whatever comes to mind. It would be exceptionally useful if, during times of compression or upon later manipulation by an algorithm, “mapping” could recognize the similarity in subject matter in the first two files, although not exact to one another, and provide options to a user. Appreciating that the “words” in the example files are represented in the computing context as binary bits (1's or 0's), which occurs by converting the English alphabet into a series of 1's and 0's through application of ASCII encoding techniques, it would be further useful if the compression algorithm could first recognize the similarity in subject matter of the first two files at the level of raw bit data. The reason for this is that not all files have words and instead might represent pictures (e.g., .jpeg) or spreadsheets of numbers.
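The conversion of words into 1's and 0's mentioned above can be sketched as follows. The helper name to_bits is hypothetical, and the sketch shows only the ASCII encoding step, not any similarity-recognition technique.

```python
def to_bits(text: str) -> str:
    """Encode text as ASCII and return its bits as a string, 8 bits per character."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


# Each character of a phrase from the batter file becomes one 8-bit group.
bits = to_bits("batting average")
print(len(bits))  # 15 characters x 8 bits = 120
print(bits[:8])   # bits of the first character, 'b'
```

Once in this form, a word processing file, a picture, or a spreadsheet are all just bit sequences, which is why the passage above looks to the raw bit level rather than to words.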
Appreciating that certain products already exist in the above-identified market space, clarity on the need in the art is as follows. First, present day “keyword matching” is limited to select sets of words that have been pulled from a document into an index for matching to the same exact words elsewhere. Second, “Grep” is a modern day technique that searches one or more input files for lines containing an identical match to a specified pattern. Third, “Beyond Compare,” and similar algorithms, are line-by-line comparisons of multiple documents that highlight differences between them. Fourth, block level data de-duplication has no application in compliance contexts, data relocation, or business intelligence.
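The exact-match character of the second and third techniques can be seen in a short sketch built on Python's standard re and difflib modules. The function names and sample lines are hypothetical, and the code stands in for, rather than reproduces, grep or Beyond Compare.

```python
import difflib
import re


def grep_lines(pattern: str, lines: list[str]) -> list[str]:
    """Return the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def changed_lines(a: list[str], b: list[str]) -> list[str]:
    """Return only the removed ('- ') and added ('+ ') lines of a line-by-line comparison."""
    return [d for d in difflib.ndiff(a, b) if d.startswith(("+ ", "- "))]


batter = ["batting average .310", "on base percentage .390"]
pitcher = ["strikeouts 210", "earned runs 61"]

# Pattern search finds only lines containing the literal phrase.
print(grep_lines("batting average", batter + pitcher))
# Line-by-line comparison reports every line as different.
print(changed_lines(batter, pitcher))
```

Neither result indicates that both inputs concern baseball statistics: the pattern search sees only the one matching line, and the comparison sees two files with nothing in common.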
In modern day “relevancy” systems, most, if not all, utilize explicit user interaction to harvest relevancy data. For example:
Amazon.com: Users purchase books or other products and Amazon suggests other books and products that the user might find interesting based on books/products that were purchased by others who bought the same exact book/product. However, Amazon's relevancy engine would not work if people did not buy books/products specifically at the Amazon website. Also, the more that people make purchases, especially books at the same time, the more data points Amazon has to find relevant data. Conversely, if no one has ever made an exactly similar purchase, Amazon has no mechanism to make suggestions to other people other than by way of keyword associations.
Netflix.com: Users rent movies, and place others in waiting queues, and Netflix suggests other movies that the user might find interesting. Similar to Amazon, however, this approach relies on others watching or selecting those same exact movies. The same is true at Fandango.com, whereby Fandango suggests movies to users based on same ticket purchases by other users.
Online store web analytics: Users browse through online web stores or online catalogs and web analytic software determines a length of stay on a page, how many times the page or site is revisited, what else occurred during visitation, etc. In turn, new merchandise, package deals, coupons, etc. are suggested for purchase/downloading by the user.
Social Networking sites, e.g., Facebook, LinkedIn, Plaxo, etc.: These all suggest “friends” that users might want to “connect with” based on meta data and other associations with connections to common friends and their connections/friends, and so on. Similarly, websites such as Flickr, YouTube, pandora.com, etc., offer relevancy services, but they are all founded on structured data as well as input from other users.
At Hunch.com, Hunch offers the tagline: “Hunch helps you make decisions and gets smarter the more you use it.” In other words, Hunch develops more knowledge the more it is trained, either explicitly or by people using its services, which is recorded. This is then harvested for finding relevant data.
In ISBN 10: 0-596-52932-5|ISBN 13: 9780596529321, “Programming Collective Intelligence,” Toby Segaran, O'Reilly, basic algorithms are used to “demonstrate[ ] how you can build web applications to mine the enormous amount of data created by people on the Internet. With the sophisticated algorithms in this book, you can write smart programs to access interesting datasets from other web sites, collect data from users of your own applications, and analyze and understand the data once you've found it.”
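The purchase-based relevancy described in the Amazon and Netflix examples above can be sketched, under simplifying assumptions, as a co-occurrence count over purchase histories. The data and the function also_bought are hypothetical; this is not the actual algorithm of any of the named services.

```python
from collections import Counter

# Hypothetical purchase histories: user -> set of items bought.
PURCHASES = {
    "ann": {"book_a", "book_b"},
    "bob": {"book_a", "book_b", "book_c"},
    "cat": {"book_a", "book_c"},
    "dan": {"book_d"},
}


def also_bought(item: str, purchases: dict[str, set[str]]) -> list[tuple[str, int]]:
    """Rank other items by how many buyers of `item` also bought them."""
    counts: Counter[str] = Counter()
    for basket in purchases.values():
        if item in basket:
            counts.update(basket - {item})
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


print(also_bought("book_a", PURCHASES))  # items co-purchased with book_a
print(also_bought("book_d", PURCHASES))  # no overlapping purchases, so no suggestions
```

The second call returns an empty list: without prior explicit actions by other users on the same item, this style of engine has nothing to suggest.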
While no shortage exists in the art directed to finding relevant data, or not, based on a specific user's actions, there is a dearth of knowledge when users involve themselves with unstructured data, a lack of history or other record keeping, and user collaboration, to name a few, especially in real time. As such, a need exists to find relevancy information when no action has occurred by an individual or collaborative team to explicitly or implicitly start the process of finding it. Above and beyond, the need further extends to finding this information in unstructured data where no database, no meta data, etc., exists, as well as finding it in traditionally structured data (e.g., the foregoing movie example with a database storing movie meta data such as type, genre, rating, content, keywords, actors, directors, etc. as well as the number of users who have rented the movie, and indicia of those users).
On a grander scale, the need extends even further to serve advanced notions of identifying new business intelligence, conducting operations on completely haphazard data, and organizing it, providing new useful options to users, providing new user views, providing new encryption products, and identifying highly similar data, to name a few. As a byproduct, solving this need will create new opportunities in minimizing transmission bandwidth and storage capacity, among other things.
Along these lines, a need exists in the art for methods for providing broader-reaching controls on access privileges to and operations on such data, whether unstructured or structured. Such methods will desirably allow implementation of policies for such access to/operations on related files, without the necessity for a direct link between files and groups of files or a specific user history, such as conventional meta data-based controls and/or a specific policy directed to specific users or user classes. Naturally, any improvements along such lines should contemplate good engineering practices, such as stability, ease of implementation, unobtrusiveness, etc.