Recent data suggests that nearly eighty-five percent of all data is found in computing files and is growing annually at around sixty percent. One reason for the growth is that regulatory compliance acts, statutes, etc. (e.g., Sarbanes-Oxley, HIPAA, PCI) force companies to keep file data in an accessible state for extended periods of time. However, block-level operations in computers are too low-level to apply any meaningful interpretation to this stored data beyond taking snapshots and block de-duplication. While other business intelligence products have been introduced to provide capabilities greater than block-level operations, they have been generally limited to structured database analysis. They are much less meaningful when acting upon data stored in unstructured environments.
Unfortunately, entities the world over have paid enormous sums of money to create and store their data, but cannot find much of it later in instances where it is haphazardly arranged or arranged less than intuitively. Not only would locating this information bring back value, but being able to observe patterns in it might also prove valuable, even where its usefulness is presently unknown. However, entities cannot expend so much time and effort in finding this data that the cost outweighs the data's usefulness. Notwithstanding this, there are still other scenarios, such as government compliance, litigation, audits, etc., that dictate certain data/information be found and produced, regardless of the cost in time, money and effort. Thus, a clear need is identified in the art to better find, organize and identify digital data, especially data left in unstructured states.
In search engine technology, large amounts of unrelated and unstructured digital data can be quickly gathered. However, most engines do little to organize the data other than give a hierarchical presentation. Also, when an engine finds duplicate versions of data, it offers few or no options for eliminating the replication or migrating/relocating redundancies. Thus, a further need in the art exists to overcome the drawbacks of search engines.
Also, it is typical for users to search the web by using a search engine, such as one found at Google.com or Yahoo.com. These Information Search and Retrieval systems, however, are based on indexing content, calculating a closeness vector between a query string and the indexed content, and then returning content that is closest to the query. Ultimately, it is left to the end user to determine what they are looking for, to formulate a query string, to submit that query string to the search engine, and then to wait for a search response to return. Oftentimes, there is a feedback loop where the searcher can refine their query or select items that are most similar to what they are searching for, and those items are aggregated with the original search to narrow its focus. The problems with this scenario are several, including: 1) the search is only as good as the underlying query string; 2) the search must be initiated by the end user; 3) the search is request/response oriented in that it starts with the query (request) and results are returned (response), but there is no open-ended query where results can continue to flow back to the searcher; 4) only data related to the search is returned; and 5) despite seemingly simple queries, an exorbitant amount of data is often returned to users given the vastness of web content that is indexed.
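By way of illustration only, the index-and-closeness model described above can be sketched as follows. The sketch assumes a bag-of-words term count as the index and cosine similarity as the closeness measure; neither choice is taken from any particular search engine, and the names term_vector, cosine_closeness and search are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())


def cosine_closeness(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, indexed_content):
    """Rank indexed documents by closeness to the query, closest first.

    indexed_content maps a document name to its text (an assumed layout).
    Documents sharing no term with the query are not returned.
    """
    q = term_vector(query)
    scored = [
        (cosine_closeness(q, term_vector(doc)), name)
        for name, doc in indexed_content.items()
    ]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]
```

As the sketch suggests, nothing is returned until a query string is supplied, and a document that shares no terms with the query is never surfaced, which corresponds to problems 1), 2) and 4) noted above.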
While certain solutions have been introduced to address some of these problems, the solutions themselves have caused even further problems, as illustrated:
1. RSS Feeds are known techniques to notify users when new content is available to them. Because the feeds follow a “push” rather than a “pull” model, they have no true search or filtering built into them. At present, the only filtering done relates to end users either registering for an RSS Feed or not.
2. Digg (as well as reddit, fark, StumbleUpon, etc.) is a known mechanism for crowds of users to determine through votes of “like” or “dislike,” “yes” or “no,” etc., what is relevant. However, there is very little searching or filtering available. Users simply get results.
3. As is known at del.icio.us (social bookmarking), users annotate and share bookmarks. In this way, a first user can see what a second user likes or dislikes and/or the first user can suggest recommendations to the second user. However, this model relies on users searching bookmarks that describe the underlying content of a file (some of which can be spoofed) rather than searching the content itself.
The net result of these techniques is that it remains difficult for users to create a persistent, ongoing filter that looks for new but relevant information and that is not based on key words or indexing of semantic content. To this end, a need in the art is recognized.
In the OSI model, modern day switching occurs at layers 2, 3 and 4 (L2, L3 and L4). In Layer 2 (L2), switches route network traffic based on the MAC addresses of each node. In Layer 3 (L3), switches route network traffic based on IP addresses and, when combined with NAT technology, allow many servers to act as a single server behind a single IP address. Layer 4 (L4) switches route network traffic based on layer 4 information (and sometimes information from layers above that). However, these switches have specific look up tables based on packet type and some easily referenced fields in the packets and their payloads. They rarely look at a whole packet. Also, when L4 switches do make a decision to route, they usually make an all-or-none decision to switch or not switch all subsequent traffic of the same type, or from the same source, to the same destination, independent of the content of the subsequent packets. Accordingly, a still further need in the art is recognized for an OSI model switch, such as at the application layer (L7), that could utilize an entirety of the payload, e.g., the content of the packet (and sequences of packets), when determining whether or not it matches any filtering rules of a device/appliance.
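The contrast drawn above, between a decision made from a look up table keyed on a header field and a decision made from the entirety of the payload, can be sketched as follows. This is a minimal illustration under stated assumptions: packets are modeled as plain dictionaries with hypothetical "dst_mac" and "payload" keys, filtering rules are modeled as byte strings to be found in the content, and no real switch API is implied.

```python
def forward_by_table(packet, mac_table):
    """Header-only decision: look up the destination MAC, ignore the payload.

    Unknown destinations fall back to "flood" (send out on all ports).
    """
    return mac_table.get(packet["dst_mac"], "flood")


def matches_content_rules(packets, rules):
    """Content-based decision over a sequence of packets.

    Payloads are joined in order so that a pattern split across two
    packets is still seen. rules maps a hypothetical rule name to the
    byte string that rule looks for; the names of matching rules are
    returned.
    """
    content = b"".join(p["payload"] for p in packets)
    return [name for name, needle in rules.items() if needle in content]
```

In this sketch, a table lookup returns the same port for every packet addressed to a given MAC regardless of what the packet carries, whereas the content check can match a phrase such as b"batting average" even when it arrives as b"batting ave" in one packet and b"rage" in the next.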
When it comes to large amounts of data, whether structured or not, compression techniques have been devised to preserve storage capacity, reduce bandwidth during transmission, etc. Modern compression algorithms, however, simply exist to shrink large blocks of data into smaller blocks according to their advertised compression ratios. As is known, some do it without data loss (“lossless”) while others do it with data loss (“lossy”). None do it, unfortunately, with a view toward recognizing similarities in the data itself.
From biology, it is known that highly similar species have highly similar DNA strings. In the computing context, consider two word processing files relating to stored baseball statistics. In a first file, words might appear for a baseball batter, such as “batting average,” “on base percentage,” and “slugging percentage,” while a second file might have words for a baseball pitcher, such as “strikeouts,” “walks,” and “earned runs.” Conversely, a third file wholly unrelated to baseball, statistics or sports may have words such as “environmental protection,” “furniture,” or whatever comes to mind. It would be exceptionally useful if, during times of compression or upon later manipulation by an algorithm, a “mapping” could recognize the similarity in subject matter in the first two files, although they are not exact to one another, and provide options to a user. Appreciating that the “words” in the example files are represented in the computing context as binary bits (1's or 0's), which occurs by converting the English alphabet into a series of 1's and 0's through application of ASCII encoding techniques, it would be further useful if the compression algorithm could first recognize the similarity in subject matter of the first two files at the level of raw bit data. The reason for this is that not all files have words and instead might represent pictures (e.g., .jpeg) or spreadsheets of numbers.
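The conversion of words into 1's and 0's through ASCII encoding, as described above, can be sketched as follows, together with one simple way of comparing two files at the level of raw bits. The comparison shown (overlap of sliding bit windows) is an assumption chosen only for illustration and is not a technique stated in this background; the names to_bits, bit_patterns and pattern_overlap, and the window width of 16 bits, are hypothetical.

```python
def to_bits(text):
    """Encode text as ASCII and return its bits as a string of '0' and '1'."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))


def bit_patterns(bits, width=16):
    """Set of all bit windows of the given width, sliding one bit at a time."""
    return {bits[i:i + width] for i in range(len(bits) - width + 1)}


def pattern_overlap(bits_a, bits_b, width=16):
    """Share of distinct bit windows common to both inputs.

    Returns a value from 0.0 (no window in common) to 1.0 (same set of
    windows). Inputs shorter than the window width yield 0.0.
    """
    a = bit_patterns(bits_a, width)
    b = bit_patterns(bits_b, width)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

For example, to_bits("A") yields "01000001", since the ASCII code for "A" is 65. Because the comparison operates on bits rather than words, the same functions could in principle be applied to the bits of a picture or spreadsheet file; a higher overlap indicates shared bit sequences, which by itself does not establish similarity in subject matter.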
Appreciating that certain products already exist in the above-identified market space, clarity on the need in the art is as follows. One, present day “keyword matching” is limited to a select set of words that have been pulled from a document into an index for matching to the same exact words elsewhere. Two, “grep” is a modern day technique that searches one or more input files for lines containing an identical match to a specified pattern. Three, “Beyond Compare,” and similar algorithms, are line-by-line comparisons of multiple documents that highlight differences between them. Four, block level data de-duplication has no application in compliance contexts, data relocation, or business intelligence.
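The line-oriented pattern matching attributed to grep above can be approximated in a few lines. This is a minimal sketch using regular expression matching from the Python standard library rather than the grep utility itself, and the name grep_lines is hypothetical.

```python
import re


def grep_lines(pattern, lines):
    """Return (line_number, line) pairs for lines containing a match.

    Line numbers start at 1. Matching is case-sensitive, so a line is
    reported only if it contains text matching the pattern as written.
    """
    regex = re.compile(pattern)
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]
```

Searching the lines "batting average", "strikeouts" and "Batting order" for the pattern "batting" reports only the first line. The pitcher statistic and the capitalized variant are both missed, which illustrates why matching on identical patterns does not by itself identify data that is highly similar but not exact.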
Accordingly, still further needs in the art extend to serving advanced notions of identifying new business intelligence, conducting operations on completely haphazard data and organizing it, providing new useful options to users, providing new user views, providing new encryption products, and identifying highly similar data, to name a few. As a byproduct, solving these needs will create new opportunities in minimizing transmission bandwidth and storage capacity, among other things. Naturally, any improvements along such lines should contemplate good engineering practices, such as stability, ease of implementation, unobtrusiveness, etc.