There are many sources throughout the world that generate documents that contain content. These documents may include breaking news, human interest stories, sports news, scientific news, business news, and the like.
The Internet provides users all over the world with virtually unlimited amounts of information in the form of articles or documents. With the growing popularity of the Internet, sources such as newspapers and magazines, which have historically published documents on paper media, are publishing documents electronically through the Internet. Numerous documents are made available through the Internet. Oftentimes, there is more information on a given topic than a typical reader can process.
For a given topic, there are typically numerous documents written by a variety of sources. To get a well-rounded view on a given topic, users often find it desirable to read documents from a variety of sources. By reading documents from different sources, the user may obtain multiple perspectives about the topic.
However, with the avalanche of documents written and available on a specific topic, the user may be overwhelmed by the sheer volume of documents. Further, a variety of factors can help determine the value of a specific document to the user. Some documents on the same topic may be duplicates, outdated, or very cursory. Without help, the user may not find a well-balanced cross section of documents for the desired topic.
A user who is interested in documents related to a specific topic typically has a finite amount of time to locate such documents. The amount of time available for locating documents may depend on scheduling constraints, loss of interest, and the like. Many documents on a specific topic that may be very valuable to the user may be overlooked or lost because of the numerous documents that the user must search through and the time limitations for locating these documents.
It would be useful, therefore, to have methods and apparatus for clustering content.