This invention relates in general to a method and system for indexing documents for retrieval, and specifically to a method and system for encoding timestamps.
Recently there has been a significant increase in rapidly changing content on the web in the form of news and blog articles. A dynamic web requires a highly responsive search engine. One of the important page attributes for search engines is the publishing date of the page, referred to as a timestamp. Timestamps reflect the creation and modification of a document. Given the massive size of the dynamic web with changing content that search engines have to deal with, and the response time that users expect, it becomes imperative that these timestamps be efficiently stored in the memory associated with the search engine.
FIG. 1 illustrates the working of a typical search engine. A query 101, for example the search query “bird flu”, is submitted to the query processor 102. The query processor generates the inverted index 104 and inverted attributes 105, as shown in table 1 106, table 2 107, and table 3 108. In this example, document 3 and document 5 contain the search words. The query processor obtains the page attributes that determine the ranking of the documents, retrieves the listed documents, and generates the ranked results 103.
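The query flow described above can be sketched in a few lines of Python. This is a minimal illustration only, not the system of FIG. 1: the postings, the attribute values, the `search` function, and the choice to rank by recency are all assumptions made for the example. The only detail taken from the text is that documents 3 and 5 contain both words of the query “bird flu”.

```python
# Toy inverted index: word -> list of document IDs (hypothetical postings).
INVERTED_INDEX = {
    "bird": [1, 3, 5],
    "flu": [3, 4, 5],
}

# Page attributes held in a separate map keyed by document ID.
# "timestamp" is minutes since an arbitrary epoch (hypothetical values).
ATTRIBUTES = {
    1: {"timestamp": 900, "popularity": 0.2},
    3: {"timestamp": 1200, "popularity": 0.7},
    4: {"timestamp": 1100, "popularity": 0.4},
    5: {"timestamp": 1500, "popularity": 0.5},
}


def search(query):
    """Return IDs of documents containing every query word, newest first."""
    words = query.lower().split()
    postings = [set(INVERTED_INDEX.get(word, ())) for word in words]
    if not postings:
        return []
    # A document matches only if it appears in the postings of every word.
    matches = set.intersection(*postings)
    # Rank by the timestamp attribute (an assumed ranking rule for this sketch).
    return sorted(matches, key=lambda doc: ATTRIBUTES[doc]["timestamp"], reverse=True)


ranked = search("bird flu")
```

Here `ranked` holds documents 5 and 3, in that order, because document 5 carries the later timestamp in the toy attribute map. A real engine would combine several attributes when ranking.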
To avoid bloating the inverted index mentioned above, page attributes such as timestamps and popularity are stored in a separate map indexed by document ID. This map is typically held in memory so that results can be served to a large number of users at acceptable performance levels. In the current art, a timestamp typically requires a minimum of 3 bytes to cover a 30-year span at a granularity of minutes. It is estimated that around 10% of the web consists of dynamically changing content. To support such a massive repository, it is desirable to reduce the memory footprint as much as possible.
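The 3-byte figure follows from simple arithmetic: 30 years contain roughly 15.8 million minutes, which exceeds the 65,536 values of 2 bytes but fits within the 16,777,216 values of 3 bytes. The sketch below illustrates this baseline fixed-width encoding and a flat in-memory map indexed by document ID. It is not the encoding proposed by the invention. The epoch, the big-endian byte order, and the names `encode_timestamp`, `decode_timestamp`, and `TimestampMap` are assumptions made for the example.

```python
from datetime import datetime, timedelta

EPOCH = datetime(2000, 1, 1)          # assumed reference point
BYTES_PER_TIMESTAMP = 3
MAX_MINUTES = 2 ** (8 * BYTES_PER_TIMESTAMP) - 1   # 16,777,215

# Upper bound on the minutes in 30 years (treating every year as 366 days).
MINUTES_IN_30_YEARS = 30 * 366 * 24 * 60           # 15,811,200
assert 2 ** 16 - 1 < MINUTES_IN_30_YEARS <= MAX_MINUTES


def encode_timestamp(when):
    """Pack a datetime into 3 bytes as whole minutes since EPOCH."""
    minutes = int((when - EPOCH).total_seconds() // 60)
    if not 0 <= minutes <= MAX_MINUTES:
        raise ValueError("timestamp outside the representable range")
    return minutes.to_bytes(BYTES_PER_TIMESTAMP, "big")


def decode_timestamp(raw):
    """Recover the minute-granularity datetime from its 3-byte form."""
    return EPOCH + timedelta(minutes=int.from_bytes(raw, "big"))


class TimestampMap:
    """Flat byte buffer holding one 3-byte timestamp per document ID."""

    def __init__(self, num_docs):
        self._buf = bytearray(num_docs * BYTES_PER_TIMESTAMP)

    def set(self, doc_id, when):
        start = doc_id * BYTES_PER_TIMESTAMP
        self._buf[start:start + BYTES_PER_TIMESTAMP] = encode_timestamp(when)

    def get(self, doc_id):
        start = doc_id * BYTES_PER_TIMESTAMP
        return decode_timestamp(bytes(self._buf[start:start + BYTES_PER_TIMESTAMP]))

    def size_in_bytes(self):
        return len(self._buf)
```

With this layout the map costs exactly 3 bytes per document, so a repository of one billion documents would need about 3 GB for timestamps alone, which is the footprint a more compact encoding aims to reduce.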
Thus, there is a need for a method and system for memory-efficient encoding and decoding of timestamps. A smaller memory footprint for timestamps results in lower implementation cost, improved scalability, and faster search performance.