In today's world of technology, more information is available than ever before. Computers around the world typically have several gigabytes of storage and are connected to one another over networks such as the Internet. For example, the Internet contains trillions of pages of valuable information that end users can access. However, although the Internet holds a great deal of valuable data, it is also full of noise. This noise makes it difficult to analyze content in order to find documents that discuss similar topics.
Search engines, such as google.com and yahoo.com, display a list of sponsor links that are related to the given search criteria. These sponsor links belong to companies that have paid a certain amount of money to have their sites listed when a user searches for certain keywords in the search engine. Some search engines are able to remove duplicate documents from the search results. Furthermore, some web pages, such as Internet news sites, use document clustering to provide a list of articles that appear to have something in common with each other. However, these sites do not measure in any fashion how closely the articles are related to each other. This means that the articles listed as related may not actually be close in concept to each other.
Furthermore, now that blogs have become increasingly popular, it is becoming even more difficult to find content related to a given topic of interest. Blogs are typically organized by author rather than by content. For example, the blog of a particular person may discuss that person's work, civic passions, and family. Locating topics of interest in particular blogs is extremely cumbersome, because it requires the user to search selected blogs and then filter out the unwanted content.