There are two general approaches to collecting and organizing information over the Internet. One approach is to use Internet search engines. These search engines typically have spidering programs that recursively traverse Internet links, capturing non-trivial terms on each page. The pages are then organized based on the terms encountered in each document. The strength of this approach is that a very large number of documents can be spidered and made available for keyword searches. Some of the drawbacks are as follows: 1) Existing pages in the system are infrequently re-spidered, meaning that information can easily be out of date. 2) Internet pages have no consistent format, and therefore the content of a page cannot be easily discerned. 3) The documents are organized based solely on the presence of a keyword in a document.
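The spidering approach described above can be illustrated with a minimal sketch. It assumes a small in-memory stand-in for the Internet (the hypothetical PAGES table), an assumed list of trivial terms (STOP_WORDS), and a hypothetical function named spider; it uses an explicit stack in place of recursion and is not the implementation of any particular search engine.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/home": ("storm damages coastal town",
                       ["a.example/news", "b.example/sports"]),
    "a.example/news": ("town council approves storm relief",
                       ["a.example/home"]),
    "b.example/sports": ("coastal team wins final", []),
}

# Assumed set of "trivial" terms that are not worth indexing.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def spider(start_url, pages):
    """Follow links from start_url and index the non-trivial terms of each page."""
    index = defaultdict(set)  # term -> set of URLs containing that term
    visited = set()
    stack = [start_url]
    while stack:
        url = stack.pop()
        if url in visited or url not in pages:
            continue  # skip pages already seen and dead links
        visited.add(url)
        text, links = pages[url]
        for term in text.lower().split():
            if term not in STOP_WORDS:
                index[term].add(url)
        stack.extend(links)
    return index


index = spider("a.example/home", PAGES)
print(sorted(index["storm"]))  # ['a.example/home', 'a.example/news']
```

The resulting index reflects only the state of each page at the moment it was visited, which is why infrequent re-spidering leaves the index out of date.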
The other broad approach is to gather and process Internet information using information agents. These agents provide a number of ways to retrieve and organize information. Information agents are capable of accessing information from multiple sources and then filtering that information by relevance to a user. The most basic systems use non-cooperating agents to perform an information retrieval task. Enhanced systems use cooperating agents, and the most advanced systems use adaptive information agents that can deal with uncertain, incomplete, or vague information. Information agents can efficiently gather heterogeneous and frequently changing information from the Internet. While the information agent concept is appealing, much of the literature in the area describes characteristics and attributes of agents, with little detail on specific advantages of the technology. Another technical problem is that newspaper articles lack enough inherent structure for information agents to transform them to a common schema.
Once the information has been retrieved, the next challenge is how to organize it. There are a number of methods available for doing this. The most basic approach is keyword searching within a document as a way of classifying the document. This simple approach yields mixed results because documents that contain the same words may have no semantic relationship to each other.
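The weakness of keyword-based classification can be seen in a small sketch. The function name classify_by_keyword and the sample documents are hypothetical and chosen only for illustration.

```python
def classify_by_keyword(documents, keyword):
    """Return the names of documents whose text contains the keyword as a word."""
    keyword = keyword.lower()
    return [name for name, text in documents.items()
            if keyword in text.lower().split()]


# Hypothetical sample documents.
documents = {
    "finance": "the bank raised interest rates",
    "outdoors": "we fished from the river bank",
    "weather": "heavy rain is expected tonight",
}

matches = classify_by_keyword(documents, "bank")
print(matches)  # ['finance', 'outdoors']
```

Both matching documents contain the word "bank", yet one concerns a financial institution and the other a riverside, so grouping them under the same keyword says little about their actual relationship.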
A more sophisticated approach to organizing information uses a vector space model (VSM), where each unique word within a collection of documents represents a dimension in space, while each document represents a vector within that multidimensional space. Vectors that are close together in this multidimensional space form clusters, or groups of documents that are similar.
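As a minimal sketch of the vector space model, the following builds one dimension per unique word and compares documents by the cosine of the angle between their vectors. Cosine similarity is one common closeness measure and is assumed here, since the text does not name a specific one; the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def to_vector(text, vocabulary):
    """Map a document to raw term counts, one dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents.
docs = [
    "storm hits coastal town",
    "coastal town braces for storm",
    "team wins championship final",
]

# Each unique word in the collection is one dimension of the space.
vocabulary = sorted({term for doc in docs for term in doc.lower().split()})
vectors = [to_vector(doc, vocabulary) for doc in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(round(sim_01, 3), sim_02)  # 0.671 0.0
```

The first two documents share three words and so lie close together in the space, while the third shares none with the first and is orthogonal to it.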
Clustering techniques can be used for organizing documents into groups of similar documents. Through local and global weighting schemes, this approach can be adapted to compare the similarity of one document to another. One of the limitations of clustering is that the entire document set must be available at the time of the analysis, and clustering algorithms require extensive computations, typically on the order of n³ for n documents.
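The computational cost can be made concrete with a sketch of naive agglomerative clustering, chosen here as one representative technique and not necessarily the one the cited figure refers to. Single-link merging, Euclidean distance, the function names, and the sample vectors are all assumptions for illustration. Each merge round scans every pair of clusters, and there can be up to n - 1 rounds, which is where the cubic growth in pair comparisons comes from. Note also that all document vectors must be supplied up front.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(vectors, target_clusters):
    """Naive single-link agglomerative clustering.

    Returns (clusters, pair_checks), where clusters is a list of lists of
    document indices and pair_checks counts the cluster pairs examined.
    """
    clusters = [[i] for i in range(len(vectors))]  # start with one cluster per document
    pair_checks = 0
    while len(clusters) > target_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pair_checks += 1
                # Single link: distance between the closest members of each cluster.
                d = min(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the closest pair
        del clusters[j]
    return clusters, pair_checks


# Hypothetical term-count vectors for four documents.
vectors = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
clusters, pair_checks = agglomerate(vectors, 2)
print(clusters, pair_checks)  # [[0, 1], [2, 3]] 9
```

Even for four documents, nine cluster pairs are examined to reach two clusters, and the count does not include the member-to-member distance computations inside each pair.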
Another approach to organizing information is to use neural networks to determine patterns within documents. It is assumed that documents with similar word patterns are similar in content. These models are built on the premise that historic patterns will hold in the future. This is clearly not the case with newspaper articles, where topics, people, and events change at frequent intervals.
There remains a need for more effective software agents for collecting and summarizing large amounts of information from information sources, which can be web sites on the Internet.