Automated tools to manage an enterprise's information technology (IT) infrastructure (i.e., intranet) efficiently are a necessity in today's computing environment.
Need for Analyzing Electronic Documents in an Intranet
Large corporations typically have thousands to tens of thousands of servers that serve content over their intranet. Typically, the cost of managing, maintaining and administering each of these servers comes to tens of thousands of dollars per year. Reducing the number of servers therefore yields tremendous savings, up to several hundred thousand dollars per year. Such cost savings can be achieved by performing a semantic analysis of the content residing on servers on an intranet.
This results in a need for analyzing electronic documents in such intranets. Namely, there is a need to analyze the content (unstructured and structured) that sits across an enterprise's servers (i.e., web sites), managed by various business units within the enterprise.
Challenges in Analyzing Electronic Documents in an Intranet
The number of servers in any large enterprise renders manual analysis infeasible. Large enterprises typically have thousands to tens of thousands of sites that serve content over their intranets. For example, a particular large computer company has on the order of 50,000 sites with intranet content and spends $5 billion annually on its IT infrastructure, of which roughly $500 million is spent on web hosting sites.
Automated semantic analysis of this content poses multiple challenges: fetching the content from tens of thousands of servers; storing, annotating and indexing it; and finally analyzing the content based on the annotations.
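The stages named above (fetch, store, annotate, index, analyze) can be illustrated with a minimal in-memory sketch. Everything in it is an assumption made for illustration: the server names, the paths, the `FETCHED` store, and in particular the use of a SHA-256 fingerprint of normalized text as the annotation. Exact-match fingerprinting is a simple stand-in here and is not the semantic analysis the text refers to.

```python
import hashlib
from collections import defaultdict

# Hypothetical store of already-fetched content, keyed by (server, path).
# A real system would populate this by crawling the intranet.
FETCHED = {
    ("hr.example", "/policy.html"): "Travel policy v2",
    ("sales.example", "/docs/policy.html"): "Travel  policy v2",
    ("eng.example", "/build.html"): "Build instructions",
}


def annotate(text):
    """Annotate a document with a fingerprint of its normalized text."""
    normalized = " ".join(text.split()).lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return {"fingerprint": digest}


def build_index(store):
    """Index document locations by their fingerprint annotation."""
    index = defaultdict(list)
    for location, text in store.items():
        index[annotate(text)["fingerprint"]].append(location)
    return index


def duplicated_groups(index):
    """Return groups of locations whose content appears on more than one server."""
    groups = []
    for locations in index.values():
        servers = {server for server, _path in locations}
        if len(servers) > 1:
            groups.append(sorted(locations))
    return groups


groups = duplicated_groups(build_index(FETCHED))
for group in groups:
    print(group)
```

In this sketch the analysis step reads only the index built from the annotations, never the raw text, which mirrors the ordering of stages described above.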
Also, many sites use a robots.txt file to keep certain document types from being crawled. For instance, such sites may forbid crawls for anything under the “image” directory. On certain sites, however, the robots.txt file is used specifically to keep duplicate data from showing up. For such sites, a crawler that honors the robots.txt file would never fetch the duplicate content, so that content would never be identified.
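The effect of honoring robots.txt can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `/mirror/` path standing in for a duplicate-content directory, the host name, and the `intranet-crawler` user agent are all hypothetical; only the “image” directory rule comes from the example above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; /mirror/ stands in for a directory of duplicate content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /image/
Disallow: /mirror/
"""


def allowed_urls(robots_txt, urls, user_agent="intranet-crawler"):
    """Return the URLs a crawler may fetch if it honors the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "http://site.example/index.html",
    "http://site.example/image/logo.png",
    "http://site.example/mirror/index.html",
]

visible = allowed_urls(ROBOTS_TXT, urls)
skipped = [url for url in urls if url not in visible]

print("fetched:", visible)
print("never analyzed:", skipped)
```

Under these assumptions the mirrored page lands in `skipped`, so a duplicate analysis that relies on a robots-compliant crawl would have no record of it.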