Many computer systems use repositories for holding data that the system uses in its operations. In an enterprise resource planning system, the data may relate to the ongoing business operations performed in the system. Systems can have great volumes of data electronically stored in repositories, and this data can be updated at regular or arbitrary intervals.
In systems with great amounts of repository data, it becomes very important to provide adequate search functions for users to access the data. The system may include a search engine or equivalent providing search functionality for relevant documents according to a variety of criteria. A search engine typically has associated with it an index of repository contents. When a user enters a search query, the engine consults the index to determine whether there are any matches. In response, the search engine may send a “hit list” that enables the user to access any responsive data, for example in the form of a document. The process of creating the index based on the repository data is usually referred to as indexing.
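The relationship between a repository, its index, and a hit list can be illustrated with a minimal sketch. It assumes a toy in-memory repository and a simple term-to-document index; the names `build_index`, `search`, and the sample documents are hypothetical and are not taken from any particular search engine.

```python
from collections import defaultdict


def build_index(repository):
    """Indexing: map each lowercase term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in repository.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Consult the index and return a sorted hit list matching every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)


# Hypothetical repository contents, for illustration only.
repository = {
    "doc1": "purchase order for office supplies",
    "doc2": "sales order confirmation",
    "doc3": "supplier invoice",
}
index = build_index(repository)
hit_list = search(index, "order")
```

In this sketch the query never touches the repository itself; only the index is consulted, which is why the index must be created, and kept current, before searches can return useful results.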
When a new index is to be created, an initial indexing process is performed on the entire contents of one or more data repositories. For a repository containing a large volume of data, or for multiple distributed repositories, the indexing can take quite some time, up to a full day or more depending on system size. This may cause system resources to be unavailable or slow for a significant amount of time. In particular, one bottleneck in this initial indexing process may be the step of transmitting the data from the repository to the service or equivalent that performs the indexing. Accessing one or more knowledge entities in the repository and transmitting the retrieved data to the indexing service always takes some time. Moreover, the retrieval process may suffer from partially failed retrievals, for example when documents cannot be found where expected.
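A partially failed retrieval of the kind mentioned above can be sketched as follows. This is only an illustration under simplifying assumptions: the repository is a plain dictionary, and the function name `retrieve_batch` and the sample identifiers are hypothetical.

```python
def retrieve_batch(repository, entity_ids):
    """Fetch the requested entities, separating those found from those missing."""
    retrieved, failed = {}, []
    for entity_id in entity_ids:
        if entity_id in repository:
            retrieved[entity_id] = repository[entity_id]
        else:
            failed.append(entity_id)  # document not found where expected
    return retrieved, failed


# Hypothetical repository in which "doc2" is missing.
repository = {"doc1": "purchase order", "doc3": "supplier invoice"}
retrieved, failed = retrieve_batch(repository, ["doc1", "doc2", "doc3"])
```

Here the batch succeeds for two entities and fails for one, so a caller would need some policy for the identifiers left in `failed`.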
In existing systems, the transmission of repository data to a data recipient, such as an indexing service, may be performed in a sequential batch data retrieval process. Such a process is used in some products available from SAP AG in Walldorf (Baden), Germany. One disadvantage of this process is that it can be held up by a single batch of data that takes a long time to retrieve from the repository and/or to transmit to the indexing service. Such a delay means that the process occupies the system resource for longer, so that other indexes must wait longer until the current index is finished or, in the case of an index being updated, the content of the index is not updated for a longer time. Moreover, such systems do not have a sophisticated solution for handling failed batch jobs efficiently.
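The effect of a single slow batch on a sequential process can be shown with a small simulation. This is a sketch under stated assumptions, not a description of any actual product: the function name `run_sequential` is hypothetical, and the durations are invented numbers standing in for retrieval plus transmission time.

```python
def run_sequential(batch_durations):
    """Return the finish time of each batch when batches run one after another."""
    finish_times = []
    clock = 0.0
    for duration in batch_durations:
        clock += duration  # the next batch cannot start until this one ends
        finish_times.append(clock)
    return finish_times


# Hypothetical durations in seconds; the second batch is unusually slow.
durations = [2.0, 60.0, 2.0, 2.0]
finish_times = run_sequential(durations)
```

With these numbers, the two short batches that follow the slow one finish at 64 and 66 seconds, although each needs only 2 seconds of work; the whole job is delayed by the one slow batch.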
If the indexing process takes a long time, this reduces system efficiency and delays the moment when the new index is ready to use. For this reason, indexing jobs that involve transmitting large amounts of data typically are run at times when system use is low, such as overnight or on weekends. In contrast, the process of later updating the index with changes in the repository data may take relatively less time, because it may be possible to focus the update indexing on only the knowledge entities that have changed since the index was created, a so-called delta update. Nevertheless, this process too can impact system performance if it involves a great volume of repository data, and the index content typically is not updated for search while the update is in progress.
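A delta update can be sketched as a selection of only those knowledge entities that changed after the index was created. The sketch assumes that each entity carries a last-modified timestamp, which the text above does not specify; the function name `select_delta` and the sample dates are hypothetical.

```python
from datetime import datetime


def select_delta(entities, indexed_at):
    """Return ids of entities modified after the index was last built."""
    return sorted(
        entity_id
        for entity_id, modified_at in entities.items()
        if modified_at > indexed_at
    )


# Hypothetical last-modified timestamps per knowledge entity.
entities = {
    "doc1": datetime(2024, 1, 1),
    "doc2": datetime(2024, 1, 5),
    "doc3": datetime(2024, 1, 9),
}
indexed_at = datetime(2024, 1, 3)
changed = select_delta(entities, indexed_at)
```

Only the entities in `changed` would need to be retrieved and transmitted to the indexing service, which is why a delta update may take less time than the initial indexing.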