Additional information regarding a document is commonly referred to as document “metadata (bibliographic information).” Metadata may include, for example, the time of issue, title, author, category, and the like. In particular, the time of issue included in such metadata is temporal information that indicates when the information obtained from the document was issued. The time of issue is important in identifying the novelty of the information obtained from the document.
However, metadata relating to the time of issue is not necessarily associated with every document, and there are numerous documents with unclear metadata. To manually determine the time of issue of a document whose metadata description format and schema have not been ascertained, it is usually necessary to extract the requisite information from the document and determine whether or not that information is the time of issue.
In other words, manually determining the information that corresponds to the time of issue in a document having no associated metadata is equivalent to extracting the requisite information from a document containing diverse representation formats. In addition, even if numerous items of time information can be found, it is difficult to identify which of them is the time of issue. As a result, manual determination of the time of issue is costly.
Here, documents published on the Internet or Intranet are used as an example to describe how the time of issue of such documents is identified. Documents published on the Internet or Intranet contain diverse representation formats, and metadata is not necessarily associated with such documents in accordance with pre-defined formats and schemas. It should be noted that, while RDF (Resource Description Framework), a standard introduced by the W3C, is known as an example of a metadata definition for such documents, it is believed that not all documents have metadata associated with them in accordance with RDF and that documents with no associated metadata outnumber those that have it.
In addition, documents published on the Internet or Intranet are in many cases written in HTML (HyperText Markup Language) format. In general, the HTML format excels at representing the structure and appearance of documents and allows a high degree of freedom in representation. For this reason, HTML documents are written using widely varying representation formats.
Therefore, in the case of a document written in HTML format, in order to determine who issued what type of document and when, the requisite information has to be found by interpreting diverse representation formats. Accordingly, for a document written in HTML format, it is difficult to manually determine the information that corresponds to the time of issue, which creates the above-described cost problem.
On the other hand, one conceivable alternative is to collect documents published on the Internet or Intranet and use the time of collection as the time of issue. However, while this technique does simplify the determination of the time of issue, it cannot ensure that every document is collected without delay at the point in time when it is issued. In addition, the above-described cost problem is difficult to eliminate, because the documents have to be collected quickly and in large quantities, which increases the associated costs.
Another conceivable approach is to use time information that Web servers return in HTTP responses, such as the Last-Modified header, as the time of issue. However, since Web servers in many cases return inaccurate times, and sometimes these headers are not attached at all, numerous problems arise when this type of time information is used as the time of issue of a document.
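The following is a minimal sketch, not taken from the source, of how a Last-Modified value might be read from already-received HTTP response headers. The function name `last_modified_time` and the plain-mapping representation of the headers are assumptions made for illustration. The sketch only parses the value; it cannot tell whether the time the server reported is accurate, which is the weakness noted above.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def last_modified_time(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified value as a datetime, or None if unusable.

    `headers` is assumed to be a mapping of response header names to values.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    raw = normalised.get("last-modified")
    if raw is None:
        return None  # the server did not attach the header
    try:
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None  # the header is present but malformed


if __name__ == "__main__":
    print(last_modified_time({"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}))
    print(last_modified_time({"Content-Type": "text/html"}))
```

A caller would still have to treat a returned value as a candidate only, since a missing header yields `None` and a present header may simply reflect when the server regenerated the page.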
Against this background, Patent Document 1, for example, discloses a method for estimating the time of issue from time representations contained in a document. In the method disclosed in Patent Document 1, time representations are first extracted using rules that describe in advance the patterns of time representation contained in documents, and the rule that yields the largest number of extractions is identified. The date and time represented by the time representation extracted based on the identified rule are then estimated to be the time of issue.
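The rule-based estimation described above can be sketched roughly as follows. This is an illustrative reading of the description, not the implementation of Patent Document 1. The three regular-expression rules and the names `RULES` and `estimate_time_of_issue` are hypothetical. The description does not state which extraction is chosen when the identified rule matches several times, so the sketch assumes the first match is used.

```python
import re
from typing import Optional, Tuple

# Hypothetical rules: each one describes a single time-representation pattern.
RULES = {
    "iso": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "slash": re.compile(r"\b\d{4}/\d{1,2}/\d{1,2}\b"),
    "long": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
}


def estimate_time_of_issue(text: str) -> Optional[Tuple[str, str]]:
    """Return (rule name, time representation), or None if nothing matches."""
    # Step 1: extract time representations with every rule.
    extractions = {
        name: [match.group(0) for match in pattern.finditer(text)]
        for name, pattern in RULES.items()
    }
    # Step 2: identify the rule with the largest number of extractions.
    # Ties fall to the rule listed first in RULES (an assumption).
    best_rule = max(extractions, key=lambda name: len(extractions[name]))
    if not extractions[best_rule]:
        return None
    # Step 3: take a representation extracted by that rule as the estimate.
    # Using the first match is an assumption, not stated in the description.
    return best_rule, extractions[best_rule][0]


if __name__ == "__main__":
    sample = "Posted 2011-03-04. Updated 2011-03-05. See also March 1, 2011."
    print(estimate_time_of_issue(sample))
```

In the sample text the `iso` rule produces two extractions and the `long` rule one, so the `iso` rule is identified and its first extraction becomes the estimate.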