1. Field
Embodiments of the invention relate to enforcing native access control to indexed documents.
2. Description of the Related Art
An enterprise may be described as any organization (e.g., business, government entity, charity, etc.) that uses computers. The information found in an enterprise may exist in many shapes and forms. The information may be distributed throughout the enterprise and managed by various software programs, depending on the task at hand. For example, enterprise users may use a SQL application to tap into relational databases or a document management application to access documents pertinent to their work.
Controlling access to sensitive information contained within these repositories is typically enforced by the managing software programs. The extent to which the information is secured may vary from system to system, with each enforcing its own security policies and requirements. For example, file systems generally control read, write, and execute operations on files and associate security groups with the allowed operations. A security group may include a single user or multiple users. However, file systems do not control access to individual elements within a file. Once the user is permitted to open a file, the user has access to all of its contents. In contrast to a file system security model, a relational database management system may control access to individual columns of data in a table of a database, and a document management program may enforce security policies to limit access to documents within a specified period of time.
An enterprise search engine may be described as being capable of retrieving relevant documents of the enterprise in response to a query (a form of a search request). The diversity in security models for the different types of enterprise content is problematic for enterprise search engines. A goal of an enterprise search engine is to provide quick and relevant responses to inquiries for documents that users are authorized to see. In order to meet the performance and relevance requirements, most enterprise search engines build a search index that represents the content to be searched. Rather than searching the original content, the user is actually submitting queries to the index, which is like searching a card catalog in a library.
The search index includes documents that are extracted from various backend repositories. A repository may be described as a data source. Backend repositories may be described as contributing data sources to the search index. The documents contained in these backend repositories are extracted with a crawler that has security credentials of sufficient authority to access all of the documents for that repository. Normally, the user identification (“userId”) presented to the crawler is a “super” user that has access to most, if not all, of the documents in the repository. Consequently, the initial document access rights of an enterprise search index represent the access rights of this “super” user.
Different enterprise search engines use different approaches to restrict an individual user's access rights. One approach is for the enterprise search engine to provide its own security model. The administrator of the enterprise search engine defines individual access rights to the cataloged documents. This approach has several drawbacks. First, this approach attempts to employ a common security model that will satisfy all of the security requirements of the contributing backend repositories. As previously demonstrated, this may not be practical or possible as the number of different types of repositories and access controls increases. Typically, the end result is a least common denominator effect for security, causing a number of documents to lose some, if not all, of their native security controls. Second, this approach requires the administrator to redefine controlled access to documents for which access has already been defined in the originating repositories, which is a duplicative task. Lastly, the approach implies that the administrator has enterprise-wide knowledge of the access controls for all enterprise content, which is an unlikely situation.
It is therefore highly desirable for the search engine to honor the access rights of the documents as defined by the native access controls of the backend repository. Native access control refers to the access control implemented at the repository from which the document was retrieved. Typically, a native access control list (ACL) is associated with each document and is used to enforce access control to that document. In many cases, the native ACL includes security tokens representing security groups and/or individual users who have access to a document. Native ACLs may also exist at higher levels than the document within the backend repository. For example, documents may be organized into folders, which themselves may have defined ACLs (i.e., folder level ACLs). The folders, in turn, may be organized into logical file cabinets, which again can have their own defined access controls (i.e., file cabinet level ACLs). There are generally two approaches for a search engine to honor these native ACLs. One approach is to copy the native ACLs into the search index. The other approach is to leave the native ACLs in the backend repository and to have the search engine request document access authority from the repository through impersonation.
The ACL approach is to automatically copy the document's native ACLs as defined by the backend repository into the search index of the enterprise search engine. Although this approach reduces the burden on the administrator to redefine a document's ACL, the approach has several shortcomings. If the native ACLs are to retain their original security model, then the enterprise search engine would be re-implementing the corresponding security mechanisms used by the backend to enforce those ACLs. This may be a daunting task. Alternatively, the search engine could try to transform these ACLs into a single common model so that a single security filtering mechanism may be used. A true normalized model may not be achievable. The term “normalized” may be described as causing to conform to a standard or making consistent. If a normalized model is achievable, the result would be a security model representing the least common denominator of all the contributing repositories.
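The ACL approach described above can be illustrated with a minimal sketch. It assumes that native ACLs have already been transformed into a single common model of flat security tokens and copied into the index at crawl time. The names IndexedDocument, acl_tokens, and search_with_acl_filter, and the token format (e.g., "group:finance"), are hypothetical and are not taken from any particular search engine.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDocument:
    doc_id: str
    text: str
    # Security tokens copied from the native ACL at crawl time
    # (hypothetical normalized format, e.g. "group:finance").
    acl_tokens: frozenset = field(default_factory=frozenset)


def search_with_acl_filter(index, query, user_tokens):
    """Return ids of documents that match the query and share at least
    one security token with the user."""
    user_tokens = frozenset(user_tokens)
    return [
        doc.doc_id
        for doc in index
        if query.lower() in doc.text.lower()   # naive substring match
        and doc.acl_tokens & user_tokens        # non-empty intersection
    ]


index = [
    IndexedDocument("d1", "quarterly budget report", frozenset({"group:finance"})),
    IndexedDocument("d2", "budget planning notes", frozenset({"user:alice", "group:hr"})),
    IndexedDocument("d3", "team lunch menu", frozenset({"group:everyone"})),
]

alice_hits = search_with_acl_filter(index, "budget", {"user:alice", "group:everyone"})
```

In this sketch a single set intersection serves as the filtering mechanism for every repository. That simplicity is also the limitation noted above: any native control that cannot be expressed as a flat token, such as column-level or time-limited access, is lost in the transformation.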
The impersonation approach does not maintain any security information in the search index at all. In response to a query, a result set is generated from the index. Then, before the result set is presented to the user, the enterprise search engine removes those documents the user is not allowed to see by consulting in real time with the document's originating backend repositories. The enterprise search engine would, in a sense, be impersonating the end user when interacting with the native repository. Through impersonation, the enterprise search engine would be asking the native repository if the user may have access to one or more documents that were previously crawled and extracted from that repository. With this approach, document access is controlled by the native security mechanisms of the originating repository, however complex that may be. Also, the filtering is done in real time, thus reflecting the latest native ACL changes for any given document.
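The impersonation approach described above may be sketched as a post-filter over an interim result set. The sketch is illustrative only: the Repository class, its user_can_read method, and the filter_by_impersonation function are hypothetical stand-ins for whatever native interface a given backend repository exposes, and each access check is modeled as one call to the repository.

```python
class Repository:
    """Stand-in for a backend repository that enforces its own native ACLs."""

    def __init__(self, name, native_acl):
        self.name = name
        self._native_acl = native_acl  # doc_id -> set of user ids allowed to read
        self.calls = 0                 # number of access checks received

    def user_can_read(self, user_id, doc_id):
        # Hypothetical native access check, evaluated at query time.
        self.calls += 1
        return user_id in self._native_acl.get(doc_id, set())


def filter_by_impersonation(interim_results, repositories, user_id):
    """Keep only the results that the originating repository says the user may see.

    interim_results is a relevance-ordered list of (repository name, doc id) pairs.
    """
    visible = []
    for repo_name, doc_id in interim_results:
        if repositories[repo_name].user_can_read(user_id, doc_id):
            visible.append((repo_name, doc_id))
    return visible


repositories = {
    "files": Repository("files", {"f1": {"alice"}, "f2": {"bob"}}),
    "dms": Repository("dms", {"m1": {"alice", "bob"}}),
}
interim = [("files", "f2"), ("dms", "m1"), ("files", "f1")]
alice_results = filter_by_impersonation(interim, repositories, "alice")
```

No security information is stored alongside the interim results; every decision is delegated to the originating repository, so the outcome reflects whatever the native ACLs are at the time of the query.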
While the impersonation approach does meet the requirement to honor the document's original access rights, the approach has some shortcomings. First, the approach requires connectivity to all of the backend repositories that have contributed to the index. If a particular backend repository is not available, then the disposition of a document in a result set may not be determined. That is, if the backend is not available then the document probably cannot be viewed. Under this condition the document would automatically be removed from the result set.
Second, the impersonation approach may take considerable time. Search indexes are optimized for speed and generally can be searched with sub-second response times. With the impersonation approach, time is added to communicate with each backend repository to determine whether documents should be included in the final result set that is returned to the user. The more differentiated the result set, the greater the number of communications. The problem is compounded when a user is denied access to the majority of the results. For example, assume that a query submitted against the index generated 1000 interim results ranked by relevance. Further, assume that the user did not have access to the first 900 results as dictated by the backend repositories. Then over 900 impersonations would have been performed by the enterprise search engine before the result set is populated with the remaining 100 results.
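The arithmetic of the example above can be worked through with a short sketch. It assumes one access check per interim result, performed in relevance order with no batching or caching; the function name checks_needed is hypothetical.

```python
def checks_needed(access_flags, wanted):
    """Count the per-document access checks performed, in relevance order,
    until `wanted` viewable results have been found or the list is exhausted."""
    checks = 0
    found = 0
    for allowed in access_flags:
        if found >= wanted:
            break
        checks += 1
        if allowed:
            found += 1
    return checks


# 1000 interim results; the user is denied access to the first 900.
flags = [False] * 900 + [True] * 100

first_result = checks_needed(flags, 1)     # 901
first_page = checks_needed(flags, 10)      # 910
all_results = checks_needed(flags, 100)    # 1000
```

Under these assumptions, 901 checks are made before a single viewable document is found, and all 1000 interim results must be checked to return the 100 documents the user is allowed to see.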
Thus, there is a need in the art for more efficient enforcement of native access control to indexed documents.