The coming of the Information Age has had profound and far-reaching effects on the lives of individuals and on the day-to-day operations of businesses alike. Globally connected communication networks such as the Internet and the World Wide Web have sparked exponential growth in information availability over just the past few decades, and, with no sign of subsiding, the ubiquity of digital information continues to accelerate at an astounding pace. Further, due to the pervasiveness of mobile computing devices in the 21st century, the ability to query for and readily retrieve information has not only been enabled; it has become an increasingly essential capability.
With exponential growth in the number of Internet-enabled devices that have come online in recent years, along with forecasted trends like the Internet of Things (IoT), continued acceleration in the creation of information and its subsequent availability can reasonably be anticipated. At the same time, while tremendous strides have been made in assembling, categorizing, and systematically making sense of or reasoning about this information via interconnected graphs of knowledge (e.g. Wikipedia), there exists a vast quantity of information that is unable to participate. Specifically, information that is purposefully withheld from being shared is often associated with one or more of the following labels: “private”, “proprietary”, “confidential”, “personal”, or “secret”, to cite a few. As a result of this obstacle, which affects individuals and business entities alike, the flow of information in such cases has become a one-way street: information can come in, but cannot flow out. This creates a conundrum in that said information cannot be readily augmented or supplemented without manual human intervention (e.g. “gatekeepers”). Stated more simply: without the ability to communicate the full range of information one has, it becomes impossible for an external party to determine what additional value could be added (e.g. additions, corrections, annotations, etc.).
In the case of an individual, this predicament may manifest itself in the form of personal contacts, as in a logical address book or rolodex, where it is a common desire to keep said contacts up to date. For each contact entry, manually changing the associated attributes is a time-consuming process. For instance, a contact may have moved to a new geography as the result of a new job; may have a new profile picture or avatar image; or could have changed job titles, having been promoted. Further, there is valuable information that the contact entry may be missing entirely and with which it could potentially be augmented: an internationalized resource identifier (IRI) at which the contact is represented by an online identity (e.g. a social network profile), or even something as simple as the contact's age or date of birth.
Unsurprisingly, contact information changes at a quicker pace in the real world than does the respective logical contact entry, especially when the procedure for updating it is in some part manual. While there are certainly methods or systems in existence that attempt to automate this process with little to no time investment or human intervention required, and despite the inherent value such a service could provide, the owner of the contact information may not want to use said methods or systems due to a concern for privacy.
Similarly, a business entity may use, for example, customer relationship management (CRM) software to manage interactions with past, current, and/or future customers. These software systems frequently record confidential information that could be detrimental to the business if revealed. Accordingly, companies may be reluctant to integrate with external systems in order to augment or complement their internal customer information because of data sensitivity. Instead, businesses may resort to any number of alternative tactics to work around this limitation.
For example, a company might purchase lists of individuals' information from data brokers and subsequently go through the arduous process of cleansing, de-duplicating, and integrating it into their existing system(s) (often collectively referred to as an extract, transform, load (ETL) process). In this scenario, notice that the flow of data is unidirectional: it can come in, but it can never go out. Thus, the business's information sensitivity concern has effectively ruled out largely automated data augmentation by external systems. Instead, more often than not, humans end up importing, cleaning, and processing this information, even though doing so is far less efficient, more error-prone, more time-consuming, and ultimately much more expensive.
Ideally, these individuals and business entities would instead be able to integrate with external systems and have their internal information augmented and, more importantly, would be able to do so without concern for revealing anything considered potentially sensitive. Although attempts in this regard have been made in various limited forms or fashions, no such combination of methods and systems exists to date.
Conventional information retrieval systems and associated methods focus on satisfying the requirements of a single business. As a result, these systems are designed to retrieve information that fits into memory on a single machine or a small number of machines using an inverted index. An inverted index allows direct and rapid access to records given a search key. For such inverted indices to deliver response times acceptable to most businesses, they are loaded from secondary permanent storage into computer memory, where they reside. They are usually organized as B-trees or some B-tree variation.
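By way of illustration only, the idea of an inverted index can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any particular conventional system: the function names, the sample records, and the whitespace tokenization are hypothetical, and a hash map (a Python dict) stands in for the B-tree organization mentioned above.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to a sorted list of ids of documents containing it.

    `documents` is a dict of {doc_id: text}. Tokenization is a plain
    whitespace split, which is an assumption made for brevity.
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Freeze the sets into sorted lists so lookups return a stable order.
    return {term: sorted(ids) for term, ids in postings.items()}


def lookup(index, term):
    """Return the ids of documents containing `term`, or an empty list."""
    return index.get(term.lower(), [])


# Hypothetical contact records used only for demonstration.
documents = {
    1: "Alice Smith engineer Boston",
    2: "Bob Jones engineer Denver",
    3: "Carol Smith manager Boston",
}
index = build_inverted_index(documents)
smith_ids = lookup(index, "smith")
```

Given a search key, the lookup goes directly to the list of matching records rather than scanning every record, which is the direct access property described above. The entire structure lives in memory, which is also why its size becomes a concern as the information store grows.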
However, scaling an inverted index to support many businesses simultaneously is impractical. First, for large information stores or information stores with a non-uniform key distribution, inverted indices contain a large number of pointers and span several operating system pages. Inserts and deletes are expensive in these cases. Second, and this is the issue that the inventive techniques described herein address, these structures require considerable space. In naive implementations, the overhead of the inverted index structure may be so large that it is impractical to use. To compensate for these limitations, advocates of inverted indices advise sharding the data by a routing mechanism whenever information retrieval requests can be routed to specific machines. When an information retrieval request cannot be routed in this way, the request must be sent to every machine, processed by each machine individually, and returned to the routing machine, which performs further processing on the aggregated results. A business with 25 machines will, in these scenarios, perform over 25 times the necessary work, since each of the 25 machines processes the request and the routing machine must then aggregate the results.
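The difference between a routed request and a request that must be sent to every machine can be illustrated with the following sketch. It is a simplified model under stated assumptions: the record fields (`customer_id`, `city`), the function names, and the choice of a CRC32 hash modulo the machine count as the routing mechanism are all hypothetical, and each "machine" is simply an in-memory list.

```python
import zlib

NUM_MACHINES = 25  # matches the 25-machine example in the text


def shard_for(routing_key, num_machines=NUM_MACHINES):
    """Pick a machine for a routing key using a stable hash (an assumed scheme)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_machines


def build_shards(records, num_machines=NUM_MACHINES):
    """Distribute records across machines by their customer_id routing key."""
    shards = [[] for _ in range(num_machines)]
    for record in records:
        shards[shard_for(record["customer_id"], num_machines)].append(record)
    return shards


def search(shards, predicate, routing_key=None):
    """Return (matches, machines_queried).

    With a routing key, only one machine is consulted. Without one, the
    request fans out to every machine and the results are aggregated here,
    playing the role of the routing machine.
    """
    if routing_key is not None:
        targets = [shard_for(routing_key, len(shards))]
    else:
        targets = range(len(shards))
    matches = []
    for machine in targets:
        matches.extend(r for r in shards[machine] if predicate(r))
    # Further processing on the aggregated results: here, a simple sort.
    matches.sort(key=lambda r: r["customer_id"])
    return matches, len(targets)


# Hypothetical records used only for demonstration.
records = [
    {"customer_id": f"c{i}", "city": "Boston" if i % 2 == 0 else "Denver"}
    for i in range(100)
]
shards = build_shards(records)

# Routable request: the search key is the routing key, so one machine suffices.
routed, routed_cost = search(
    shards, lambda r: r["customer_id"] == "c42", routing_key="c42"
)

# Non-routable request: city is not the routing key, so every machine is queried.
scattered, scatter_cost = search(shards, lambda r: r["city"] == "Boston")
```

In this model the routed request touches a single machine, whereas the request on a non-routing attribute touches all 25 before the aggregation step runs, which is the source of the multiplied work described above.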