Access control in a search engine is typically performed using an access control list (ACL), which restricts access to documents based on a user's identity or role within an enterprise. Providing fine-grained access control within a search engine is a challenge for many organizations. The technical challenge is to filter a large document set based on a large number of access control predicates. A typical approach within a search engine is to build a Boolean query out of the access control predicates and apply it as a filter on the main query. The main problem with the Boolean query approach is that it begins to slow down at a relatively low number of clauses (low tens of thousands). In addition, just parsing and building a Boolean query with tens or hundreds of thousands of clauses becomes memory-intensive and slow.
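As an illustration only, the Boolean filter approach described above can be sketched in plain Java as follows. The field name `acl`, the class name, and the query syntax are assumptions made for this sketch and do not correspond to any particular search engine's API.

```java
import java.util.List;
import java.util.StringJoiner;

public class AclFilterSketch {
    // Hypothetical name of the indexed field that holds each document's ACL tokens.
    static final String ACL_FIELD = "acl";

    // Builds one OR clause per access control predicate. The resulting string
    // grows linearly with the number of tokens, and a query parser would then
    // turn every clause into its own query object.
    static String buildFilter(List<String> tokens) {
        StringJoiner clauses = new StringJoiner(" OR ", ACL_FIELD + ":(", ")");
        for (String token : tokens) {
            clauses.add(token);
        }
        return clauses.toString();
    }

    public static void main(String[] args) {
        // prints acl:(sales OR eng OR user42)
        System.out.println(buildFilter(List.of("sales", "eng", "user42")));
    }
}
```

With three tokens the filter is trivial. With hundreds of thousands of tokens, both the string and the clause objects parsed from it become the bottleneck described above.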
For the sake of discussion, consider the case where each access control predicate is a text token with an average size of 16 bytes. In this scenario, a “bag” of 400,000 tokens would be over 6 megabytes (MB) of text. Handling each token as an individual Java String would expand this figure considerably. Java traditionally stores each character as a 2-byte value, so single-byte text doubles in size to over 12 MB. Java Strings also carry about 32 bytes of overhead per String, which adds nearly 13 MB more across 400,000 Strings. The original 6 MB of text has now expanded to over 25 MB. Building Boolean queries from the 400,000 Strings piles on still more object reference overhead. Creating, manipulating, and destroying all of these objects becomes very costly for performance.
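The arithmetic above can be checked with a short calculation. The sketch below uses only the figures given in this discussion (400,000 tokens, 16 bytes per token, 2 bytes per character, 32 bytes of per-String overhead). The class and method names are hypothetical, and actual per-String overhead varies by JVM version and settings.

```java
public class AclMemoryEstimate {
    // Size of the raw token text in bytes.
    static long rawBytes(long tokens, long avgBytesPerToken) {
        return tokens * avgBytesPerToken;
    }

    // Estimated size once every token is held as its own Java String:
    // character data at bytesPerChar each, plus a fixed per-String overhead.
    static long asJavaStrings(long tokens, long avgCharsPerToken,
                              long bytesPerChar, long overheadPerString) {
        return tokens * (avgCharsPerToken * bytesPerChar + overheadPerString);
    }

    public static void main(String[] args) {
        long tokens = 400_000;
        System.out.println(rawBytes(tokens, 16));              // 6400000
        System.out.println(asJavaStrings(tokens, 16, 2, 0));   // 12800000
        System.out.println(asJavaStrings(tokens, 16, 2, 32));  // 25600000
    }
}
```

The estimate covers only the Strings themselves. It does not include the query clause objects built on top of them.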
What is needed is a way to handle large numbers of access control predicates and to provide fine-grained access control within a search engine efficiently and effectively.