Searching for dates is a useful primitive in understanding and extracting relevant pieces from large collections of documents. Locating a source date for content on the web is especially useful in determining relevancy to a search request comprising a date. However, efficiently performing a query for dates is challenging since dates tend to occur in various formats in unstructured text.
For example, the date October 11, 2004 can occur in text as 11th of October 2004, 11-10-2004, 11 October, '04, Oct. 11th 04, 11/10/04, 10.11.2004, 2004 Oct 11, etc. Variations in date expression can be even more pronounced on a diversified collection such as the web, where many different people and organizations write web content such as free-form text. This is a natural consequence of the decentralized nature of the web and the few rigid requirements imposed on free-form text.
Nevertheless, the free-form text on the web is an important source of information, both current and archived. Newspapers and magazines provide news articles online; one estimate puts the number of news sources on the web at over 10,000. Covering a range of topics, these news articles cater to the needs of both businesses and individuals. Moreover, organizations such as companies and universities post a wealth of information online. Some search engine sites estimate the number of web pages indexed at over 8 billion. Given the large number of sources and the large number of pages on the web, the need for automated techniques for searching and navigating such a large collection is increasing.
Dates are an important means to understand the temporal context of the information found near the dates or on the same web page as the dates. Queries such as:
        Show all pages that mention a particular date D (e.g., 11 Oct 2004),
        Show all pages that mention any date in a given month (e.g., Oct 2004), or
        Show all pages that mention any date in a given year (e.g., 2004)
with one or more keywords with a specified context such as “on the same page”, “on the same line”, etc., are natural and useful ways to filter and navigate such large collections of pages.
Although conventional web search engines perform well on standard keyword and proximity searches, they do not adequately support searching by dates. Even a basic date query such as “find all pages that mention 11th October 2004” requires a separate search for each possible date format. Such a search is tedious and unmanageable since the number of possible date formats is sizeable. Furthermore, some formats such as 11.10.2004 are difficult to search because some search engines ignore the numbers and periods in a date format if they occur frequently.
Searching on dates using a conventional web search engine becomes more unmanageable for hierarchical date queries such as “find all pages that mention any date in October 2004”.
Conventional web search engines have further difficulty searching for dates in ambiguous formats. For example, 11.10.2004 can mean either 11th October 2004 or 10th November 2004, depending on the context. The ambiguity is further compounded when the year is specified as a two-digit number and the month, day, and year are similar in value (for example, 01/04/05).
Another conventional approach for finding a source date finds a single date for each page, representing when the page may have been written, i.e., a date-of-page. However, this date-of-page may not exist for all web pages. A date-of-page is typically not well defined and is usually a best guess based on different dates that appear on the page or in the http header of the page. Furthermore, this conventional approach still retains only one date per page even when a page contains additional dates. Consequently, the information about other dates is lost, including the locations of the other dates for proximity queries.
A further conventional approach, used to identify named entities such as the different forms in which a keyword can be referenced in text, lists all possible alternatives explicitly. This conventional approach works well in cases where the number of variants is small. However, in the context of locating source dates on the web, the large number of possible formats for each date and the large number of possible distinct dates render this approach cumbersome. Consequently, regular expression-based spotting is a better alternative for dates.
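As a rough illustration of regular expression-based spotting, the sketch below recognizes a few of the textual date forms mentioned earlier and records where each one occurs. The pattern set, the helper name `spot_dates`, and the treatment of two-digit years as belonging to the 2000s are hypothetical choices made for this example; a practical spotter would need many more patterns and a disambiguation step.

```python
import re

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
# Map the three-letter prefix of each month to its number.
MONTH_NUM = {name[:3]: i for i, name in enumerate(MONTH_NAMES, 1)}
# Match either the three-letter abbreviation or the full name, with an optional period.
MONTH_RE = (r"(?P<mon>"
            + "|".join(n[:3] + (f"(?:{n[3:]})?" if n[3:] else "") for n in MONTH_NAMES)
            + r")\.?")
ORD = r"(?:st|nd|rd|th)?"
YEAR = r"'?(?P<year>\d{4}|\d{2})\b"

PATTERNS = [
    # "11th of October 2004", "11 October, '04"
    re.compile(r"\b(?P<day>\d{1,2})" + ORD + r"\s+(?:of\s+)?" + MONTH_RE + r",?\s+" + YEAR, re.I),
    # "Oct. 11th 04", "October 11, 2004"
    re.compile(r"\b" + MONTH_RE + r"\s+(?P<day>\d{1,2})" + ORD + r",?\s+" + YEAR, re.I),
    # "2004 Oct 11"
    re.compile(r"\b(?P<year>\d{4})\s+" + MONTH_RE + r"\s+(?P<day>\d{1,2})\b", re.I),
]

def spot_dates(text):
    """Return (offset, matched_text, (year, month, day)) tuples sorted by offset."""
    hits = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            year = int(m.group("year"))
            if year < 100:
                year += 2000  # assumption: two-digit years belong to the 2000s
            month = MONTH_NUM[m.group("mon").lower()[:3]]
            hits.append((m.start(), m.group(0), (year, month, int(m.group("day")))))
    return sorted(hits)

sample = "Filed on 11th of October 2004, revised Oct. 11th 04 and archived 2004 Oct 11."
for offset, matched, ymd in spot_dates(sample):
    print(offset, repr(matched), ymd)
```

Because each hit carries its character offset, every date on a page can be retained along with its location, rather than only a single date per page.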
Yet another conventional approach comprises a natural single-step regular expression matching. In particular contexts such as weblogs (also known as blogs), this conventional approach addresses identification of dates to some extent based on the structure of blogs. However, this conventional approach does not address the wide range of possible formats for dates that appear on the web and the resulting disambiguation required to identify dates. Furthermore, efficiency and processing time become serious issues for this conventional approach considering the large number of possible formats and the large number of pages requiring processing.
What is therefore needed is a system, a computer program product, and an associated method for searching dates efficiently in a large collection of web documents. The need for such a solution has heretofore remained unsatisfied.