The present invention relates to the monitoring of dynamic content in networks, e.g., the Internet, and, more specifically, to techniques which facilitate the monitoring and indexing of such content in near real time.
A vast array of software solutions facilitates the publishing of user-generated content on the Web and the Internet. Some solutions are hosted. Others operate from the user's machine or server. Some are highly configurable, providing source code which the user may customize. Current and past examples of such solutions within the context of so-called “Web logs” include Radio UserLand, Movable Type, WordPress, LiveJournal, B2, Greymatter, Blosxom, Blogger, Blogspot, TypePad, Xanga, Diaryland, Nifty, etc. In general, most of these tools and solutions may be thought of as relatively simple content management systems which facilitate “personal publishing.” The availability of these tools has resulted in the proliferation of Web logs, or “blogs,” which are typically generated by individuals on the Web.
A typical blog might include a series of postings by the “blogger,” or author of the content in the postings, relating to one or more topics. A posting might also include, for example, a link to an article relating to a current event being discussed, a link to another blog upon which the blogger is commenting or to which the blogger is responding, or a link to an authority on the subject of the posting. Blogs may also contain links outside of the regular postings which point to sites or documents in which the blogger has an interest, or to other blogs (i.e., a “blog roll”). Blogs often include a calendar with links to an archive of historical postings on the blog. These are merely exemplary characteristics of a blog, and are useful in pointing out the fact that blogs present information in a relatively structured way. In addition, blogs are only one example of mechanisms by which content may be dynamically published in electronic networks. The point is that there is a huge amount of content being dynamically generated and published on the Web and the Internet which includes links to other content and information, and which may be thought of as ongoing “conversations.”
And, as has been posited on the Internet, one can think of these ongoing and interconnected conversations as markets (e.g., see The Cluetrain Manifesto). This is to be contrasted with the traditional market model which defines markets primarily with respect to transactions. Relying primarily on information relating to transactions to monitor or evaluate a market arguably misses the most relevant information relating to the market being monitored or evaluated. Such a conventional approach can be likened to focusing on patterns of punctuation in a document rather than the substance of the document. And if one begins to focus on the substance of the conversations relating to a particular market rather than mere transaction data, the exercise then becomes tracking these conversations in meaningful and timely ways.
Unfortunately, most of the tools currently available on the Web and on the Internet are inadequate for such a task. For example, because of the way in which they operate, most search engines on the Internet are weeks or months behind in identifying and cataloguing the constantly changing content on the Web. That is, the typical search engine periodically “crawls” the Web to construct an enormous database which is, essentially, a copy of the Web. Given the size of the Web, these crawls may require on the order of weeks to complete. Many also assert that such crawling technologies actually examine less than 10% of the content available on the Web. In any case, once the documents are identified, a reverse (i.e., inverted) index is created using a key word vocabulary, and the documents are then scored relative to those key words. All of this information is then pushed to query servers which respond to key word searches.
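By way of illustration only, the indexing and scoring steps described above may be sketched as follows. This is a minimal sketch under stated assumptions and does not represent any particular search engine's implementation: the function names `build_reverse_index` and `search`, the sample documents, and the use of relative term frequency as the score are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def build_reverse_index(documents, vocabulary):
    """Map each key word to {doc_id: score}.

    The score is an assumed, simplistic one: the fraction of the
    document's words that match the key word.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        words = text.lower().split()
        counts = Counter(words)
        for word in vocabulary:
            if counts[word]:
                index[word][doc_id] = counts[word] / len(words)
    return index


def search(index, key_word):
    """Return the matching document ids, highest score first."""
    postings = index.get(key_word.lower(), {})
    return sorted(postings, key=postings.get, reverse=True)


# Hypothetical crawled documents and key word vocabulary.
documents = {
    "doc1": "blogs link to other blogs",
    "doc2": "search engines crawl the web",
    "doc3": "a blog may link to a search engine",
}
vocabulary = {"blogs", "link", "search", "crawl"}

index = build_reverse_index(documents, vocabulary)
print(search(index, "link"))
```

In this sketch the index is rebuilt in full from a complete set of crawled documents, which mirrors the batch nature of the conventional approach: a document cannot be found by a key word search until the next crawl and index build have completed.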
Given the time required to perform all of these tasks, it becomes apparent that traditional search engines are not particularly effective for identifying anything on the Web which is less than a couple of weeks old. In addition, because search engines are typically agnostic with regard to the time at which documents were created or modified, they are not particularly useful for finding content created within particular time ranges or with reference to any time-related metric.
In view of the foregoing, there is a need to provide mechanisms by which dynamic content on the Web and on the Internet may be indexed, monitored, and evaluated substantially in real time.