As computers and networks gain popularity, web-based computer documents (“documents”) become a vast source of factual information. Users may look to these documents to get answers to factual questions, such as “what is the capital of Poland” or “what is the birth date of George Washington.” The factual information included in these documents may be extracted and stored in a fact database.
When extracting facts, it is useful to know the subject with which a document is associated, because any facts extracted from the document are more likely than not associated with the same subject. If the subject is not known, not only are the extracted facts less useful, but organization and management of the extracted facts in the fact database may become more complicated.
One conventional approach to identifying the subject of a document is to select the document title as the subject. A document title (“title”) is a general or descriptive heading for a document. A document can have more than one title. For example, a document written in the Hypertext Markup Language (HTML) can have an HTML title, which is the text between the two markup tags <TITLE> and </TITLE>. A document may also have a metadata title in the associated HTML metadata, a title reflected in the associated file name (e.g., a document named “conference memo.doc” has the title “conference memo”), and a title in the document content (e.g., the title of this document is “Determining Document Subject by Using Title and Anchor Text of Related Documents”).
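The several kinds of title described above can be illustrated with a minimal sketch that collects candidate titles for one HTML document. The sketch rests on assumptions not stated in this document: the metadata title is assumed to appear as a meta tag whose name attribute is "title", the file name title is assumed to be the file name without its extension, and the names TitleCollector and candidate_titles are hypothetical. The title in the document content is not handled, because locating it requires interpreting the content itself.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath


class TitleCollector(HTMLParser):
    """Collects the text between <TITLE> and </TITLE> and, as an assumption,
    the content of a <meta name="title" ...> tag."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.html_title_parts = []
        self.meta_title = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "title":
                self.meta_title = attr.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks; keep all of them.
        if self._in_title:
            self.html_title_parts.append(data)


def candidate_titles(html, file_name):
    """Return the candidate titles found for one document (hypothetical helper)."""
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    html_title = "".join(parser.html_title_parts).strip()
    return {
        "html_title": html_title or None,
        "meta_title": parser.meta_title,
        # Assumption: the file name title is the name without its extension.
        "file_name_title": PurePosixPath(file_name).stem or None,
    }


if __name__ == "__main__":
    page = "<html><head><TITLE>Oscar Awards 2006</TITLE></head><body></body></html>"
    print(candidate_titles(page, "conference memo.doc"))
```

A single document can therefore yield several different candidate titles, and nothing in the sketch indicates which of them, if any, names the subject of the document.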
This approach of using the title as the document subject is inadequate. Some documents do not have a subject, while others have multiple subjects (e.g., a webpage entitled “Some Random Thoughts”). For documents without a subject or with multiple subjects, the document title should not be used as the document subject. Also, a document title may not reflect the subject of the document because the author may use the title for other purposes, such as advertising. For example, an online news agency may set the titles of all its documents to “The world's most trustworthy news source!” Even if the author intends the title to be the subject of the document, the title may still contain unrelated information. For example, in a document titled “CNN.com—Oscar Awards 2006,” the first section of the title (“CNN.com—”) serves as advertising for the publisher, CNN.com, and is not related to the subject of the document.
Another conventional approach to identifying the subject of a document is to extract the subject from the document content. This approach is impractical for human editors because the vast volume of documents and its rapid growth make it infeasible to perform the task at any meaningful scale. This approach is also insufficient for computers because, in order to properly extract the subject from the content of a document, a computer must process and understand the content. A document may include any machine-readable data, including any combination of text, graphics, multimedia content, and so on, making such a determination even harder.
For these reasons, what is needed is a method and system that identifies a subject for a source document.