It is frequently desirable to process documents in order to summarize them or to index them with some indication of their semantic content. A simplistic approach to such analysis is to process a document to generate a lexicon of the words and phrases which appear in the document. Such a lexicon will contain many common words which appear in most documents in the language in which the document is written. Because they appear in most documents in a particular language, these common words, or stop words, provide little indication of the content of a document. In contrast, unusual or less common words will often appear only in certain contexts, and hence the occurrence of such words can provide a useful indication of the content of a document.
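By way of illustration only, the lexicon approach described above might be sketched as follows. The stop-word list, the regular-expression tokenization, and the function names `build_lexicon` and `content_words` are all assumptions made for this sketch and are not taken from any particular system.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "is", "it", "to", "on"})


def build_lexicon(text):
    """Count every word appearing in the text, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def content_words(lexicon, stop_words=STOP_WORDS):
    """Discard stop words, keeping the less common words and their counts."""
    return Counter({w: n for w, n in lexicon.items() if w not in stop_words})


sample = "The reactor is cooled and the reactor core is shielded."
lexicon = build_lexicon(sample)
keywords = content_words(lexicon)
```

In this sketch the full lexicon still counts words such as "the" and "is", whereas the filtered result retains only words such as "reactor" and "core", which give a better indication of subject matter.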
Where a document relates to more than one subject and the structure of the document is such that it can be divided into different parts, word frequency analysis can provide a means for determining which portions of the document relate to which subjects. For less structured documents, frequency analysis may only be able to identify that a document relates to multiple subjects, without being able to distinguish which portions of the document relate to which subject.
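A minimal sketch of such per-portion frequency analysis is given below. It assumes that portions are separated by blank lines and uses a small hypothetical stop-word list; the function name `top_terms_by_portion` is likewise an assumption for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption for this sketch).
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "is", "it", "to", "on"})


def top_terms_by_portion(document, n=2):
    """Return the n most frequent non-stop words for each portion.

    Portions are assumed to be separated by blank lines.
    """
    portions = [p for p in re.split(r"\n\s*\n", document) if p.strip()]
    result = []
    for portion in portions:
        words = re.findall(r"[a-z']+", portion.lower())
        counts = Counter(w for w in words if w not in STOP_WORDS)
        result.append([w for w, _ in counts.most_common(n)])
    return result


doc = (
    "Engines burn fuel. Engines need fuel and air.\n\n"
    "Bread needs yeast. Yeast makes bread rise."
)
terms = top_terms_by_portion(doc)
```

The sketch depends on the document having identifiable portion boundaries. Applied to a document without such boundaries, it would yield a single mixed list of terms, which is the limitation noted above for less structured documents.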
This is a particular problem when attempting to classify the content of documents generated in a piecemeal manner, such as website blogs. Using blogging software, users are able to develop content for a website in a piecemeal manner, posting individual comments at different times. The individual sections may relate to the same subject or to different subjects. Analyzing individual sections or postings can help to identify the content of those sections or postings themselves. However, where different sections or postings relate to one another, restricting the analysis to individual postings means that the relationships between postings can be lost.
It would be desirable to provide a means by which the content of unstructured documents could be determined. More specifically, it would be desirable to provide a means of determining the content of unstructured documents that gives a classification system a better indication of the content of different portions of a document and of how those portions interrelate.