Field of the Invention
The present invention relates to the field of content crawling and more particularly to crawling hierarchically structured content sources.
Description of the Related Art
The development of the modern computer communications network and the wide-scale adoption of the global Internet as a primary source of information have transformed the way in which information is generated and shared amongst individuals. Prior to electronic methods of publishing content, individuals seeking information largely relied upon libraries and personal subscriptions to periodicals, newspapers and journals. By comparison, today one can access vast repositories of data in a matter of minutes that otherwise would consume hours, if not days, of tedious, manual scouring of print documents.
Even before the popularization of the World Wide Web, information technologists recognized the need to properly index electronic content such that the content can be accessed electronically and remotely by interested parties. Indeed, the very need to access related content led to the development of the hyperlink and of markup language formatted documents, both of which enabled the widespread adoption of the World Wide Web. The World Wide Web itself can be viewed as a vast hierarchy of related documents and content, connected through hyperlink relationships, all of which can be accessed globally over the Internet. From the very beginning, search engine technologies evolved to address the need to discover and catalog content published and accessible through the World Wide Web.
Search engines generally locate and index content on the World Wide Web and on internally defined networks by parsing content word by word to generate index records correlating each word with a location in a document. In order to automate the discovery of available content on the World Wide Web, Internet bots specifically tailored to populate search engine databases commonly are deployed and permitted to “crawl” or “spider” the accessible World Wide Web, first locating content, subsequently indexing located content, linking to related content, and repeating the process. Known as crawling or spidering, the foregoing process forms the foundation of modern search engine technologies.
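The word-by-word indexing described above can be illustrated with a minimal sketch, assuming documents are already fetched and tokenized by whitespace; the function name and document identifiers are hypothetical:

```python
from collections import defaultdict

def build_index(documents):
    """Build an inverted index correlating each word with its locations.

    documents: mapping of document id -> document text.
    Returns a mapping of word -> list of (doc_id, word_position) records.
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)

docs = {"page1": "crawl the web", "page2": "the web of pages"}
index = build_index(docs)
# index["web"] → [("page1", 2), ("page2", 1)]
```

A production search engine would add stemming, stop-word removal, and compressed posting lists, but the core index record remains a word-to-location correlation as described.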
Unlike a general content crawler, a focused crawler seeks, acquires, indexes, and maintains pages on a specific set of topics that represent a relatively small portion of the World Wide Web. Focused crawlers require a much smaller investment in computing resources and can achieve high coverage of pertinent content at a rapid rate. A focused crawler usually can begin with a seed list that contains uniform resource locators (URLs) that are relevant to a topic of interest. Subsequently, the focused crawler can crawl the URLs and follow the hyperlinks from the pages corresponding to the URLs to identify the most promising hyperlinks based upon both the content of the source pages and the hyperlink structure of the World Wide Web.
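The focused-crawl loop described above — start from seed URLs, then follow the most promising hyperlinks first — can be sketched as a best-first traversal. The `fetch` and `score` callables are hypothetical stand-ins for page retrieval and relevance scoring, which the passage leaves unspecified:

```python
import heapq

def focused_crawl(seed_urls, fetch, score, max_pages=100):
    """Best-first crawl starting from a seed list of URLs.

    fetch(url) -> (text, outlinks): retrieves a page and its hyperlinks.
    score(text, link) -> float: estimated relevance of a link (higher is better).
    Both are caller-supplied; their names here are illustrative.
    """
    # Python's heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    visited, pages = set(), {}
    while frontier and len(pages) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, outlinks = fetch(url)
        pages[url] = text
        for link in outlinks:
            if link not in visited:
                heapq.heappush(frontier, (-score(text, link), link))
    return pages
```

Prioritizing the frontier by a relevance score is what lets a focused crawler cover pertinent content quickly while ignoring most of the Web.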
The seed list, then, can resemble a site map of relevant content for a topic of interest. In this regard, site maps directly map to a Web site's entry points. In contrast, a seed list seeks to directly represent content at the application level, which differs from the organization of the content at the Web site level. To do this effectively, seed lists mirror application structure and present a hierarchical representation of content as the application originally intended it to be, and not necessarily as a Web site would present the content. Moreover, with a conventional seed list, one must choose either to crawl the entire seed list or to omit the seed list from consideration altogether. Accordingly, the use of a seed list with a focused crawler does not comport with the hierarchical nature of application data.
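A seed list mirroring application structure might be represented as a tree of nodes, each carrying its own URLs and child sections; the field names and URLs below are purely illustrative. A conventional crawler flattens the whole tree into one URL list, which is the all-or-nothing behavior the passage describes:

```python
def collect_urls(node):
    """Recursively flatten a hierarchical seed list into a single URL list."""
    urls = list(node.get("urls", []))
    for child in node.get("children", []):
        urls.extend(collect_urls(child))
    return urls

# Hypothetical seed list mirroring an application's own hierarchy.
seed_list = {
    "name": "forum-app",
    "urls": ["http://example.com/forum"],
    "children": [
        {"name": "topics", "urls": ["http://example.com/forum/topics"], "children": []},
    ],
}
```

Because flattening discards the hierarchy, there is no way to crawl only one subtree of the application; that limitation is the motivation for the hierarchical crawling addressed by the invention.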