Databases, and especially relational databases, provide convenient, fast and flexible access to data sets. As a data set grows, searching every item in order to retrieve one or more items of data becomes slower and more difficult. Database indexes are used to speed up this process, as an index can be searched faster than the indexed data. However, indexes carry their own processing costs, because they increase both the number of writes and the storage requirements.
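The trade-off described above can be illustrated with a minimal sketch. The class `IndexedTable` and its methods are hypothetical names introduced only for this illustration, and an in-memory dictionary stands in for a real database index.

```python
from collections import defaultdict


class IndexedTable:
    """Toy table with one index on the 'name' column (illustrative only)."""

    def __init__(self):
        self.rows = []                      # the indexed data
        self.by_name = defaultdict(list)    # index: name -> row positions

    def insert(self, name, value):
        # Two writes per insert: one to the data, one to the index.
        self.rows.append((name, value))
        self.by_name[name].append(len(self.rows) - 1)

    def scan(self, name):
        # Without the index: every row must be examined.
        return [row for row in self.rows if row[0] == name]

    def lookup(self, name):
        # With the index: only the matching row positions are visited.
        return [self.rows[i] for i in self.by_name.get(name, [])]


table = IndexedTable()
table.insert("alice", 1)
table.insert("bob", 2)
table.insert("alice", 3)
```

Both `scan` and `lookup` return the same rows, but `lookup` avoids visiting non-matching rows, at the cost of the extra write and extra storage incurred by `insert`.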
A document index may be built by parsing a document, identifying individual words or keywords and adding an entry in the index identifying each document for every word. Therefore, for every word found in any document, the index lists all documents which contain that word or keyword. Alternatively, each document may have a row that identifies every word or keyword that it contains. One drawback of this approach is that words are variable in size and so require different amounts of storage space. This in turn means that the words must be stored using a field type (e.g. CLOB or VARCHAR) that can accommodate different sized words. Such field types can require 256 or 2028 bits, for example, further increasing storage requirements and slowing index generation and searching.
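The two index layouts described above can be sketched as follows. This is a minimal illustration only: the function names, the sample documents and the simple tokenization rule (lower-cased runs of letters and digits) are assumptions and are not taken from any particular system.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumed rule: a word is a lower-cased run of letters or digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    # For every word, list all documents that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def build_forward_index(documents):
    # For every document, list every word that it contains.
    return {doc_id: set(tokenize(text)) for doc_id, text in documents.items()}


# Hypothetical sample data.
documents = {
    1: "Indexes speed up search",
    2: "Search every item is slow",
}
inverted = build_inverted_index(documents)
forward = build_forward_index(documents)
```

In both layouts the stored keys are the words themselves, so each entry occupies a variable amount of space, which corresponds to the storage drawback noted above.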
With the rise of large data sets and “Big Data”, indexing has become an important issue for organizations. Even when an index is well designed, building it can require large computing resources and large amounts of storage. One approach is to index a database when it is offline, or at a time when it is rarely used, instead of indexing in real time as data is added to the database. Whilst this can alleviate some of the burden on the system when data needs to be accessed, it can cause search results to become outdated and even incorrect. Furthermore, the growing volume of data to be indexed leads to longer index updates and ever-increasing management resources.
Therefore, there is required a method and system that overcomes these problems.