Traditionally, data has been stored in relational databases, which provide access to that data. However, the use of relational databases for hierarchical data has exhibited various limitations. In particular, relational databases have been limited with respect to storing and accessing very large amounts of data.
For example, traditional relational databases have a limited ability to scale to large amounts of data. In addition, traditional relational databases oftentimes require costly redundant array of independent disks (RAID) devices to exhibit adequate query performance. Further, transforming hierarchical data for storage in traditional relational databases has also been limited in performance.
Still further, queries for retrieving data for web crawling purposes, for storing data, for modifying data, etc. have conventionally been unable to form part of a distributed system. For example, selecting data from a traditional relational database usually requires a large number of cross-table joins and constraints to be processed in order to define crawling behavior and to impose limitations on the amount of data to be stored. Examples of such constraints include preventing target server flooding by keeping the number of parallel requests low, tracking Internet protocol (IP) address information to avoid parallel visits to the same target servers from multiple crawler nodes through the use of mutually exclusive IP ranges, and focused selection of hyperlinks to follow via user-definable strategies (e.g. following only links that could point to potentially interesting content).
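By way of illustration only, the three kinds of crawl constraints named above may be sketched in Python as follows. This is a minimal, single-process sketch and not an implementation taken from any particular system: the names owning_node, HostLimiter and select_links are hypothetical, the equal-sized split of the IPv4 address space is merely one assumed way of forming mutually exclusive IP ranges, and the limiter shown is not thread-safe.

```python
import ipaddress
from collections import defaultdict
from typing import Callable, Iterable, List


def owning_node(ip: str, node_count: int) -> int:
    """Assign an IPv4 address to exactly one crawler node.

    Assumption: the 32-bit address space is split into node_count
    contiguous, non-overlapping ranges of (nearly) equal size, so no
    two nodes ever visit the same target address.
    """
    return (int(ipaddress.IPv4Address(ip)) * node_count) >> 32


class HostLimiter:
    """Caps the number of requests in flight per target host
    (hypothetical helper; not thread-safe)."""

    def __init__(self, max_parallel: int) -> None:
        self.max_parallel = max_parallel
        self._in_flight = defaultdict(int)

    def try_acquire(self, host: str) -> bool:
        # Refuse the request if the host is already at its limit.
        if self._in_flight[host] >= self.max_parallel:
            return False
        self._in_flight[host] += 1
        return True

    def release(self, host: str) -> None:
        # Called when a request to the host has completed.
        if self._in_flight[host] > 0:
            self._in_flight[host] -= 1


def select_links(links: Iterable[str],
                 strategy: Callable[[str], bool]) -> List[str]:
    """Keep only the hyperlinks accepted by a user-defined strategy."""
    return [link for link in links if strategy(link)]
```

In a traditional relational database, each of these checks would typically be expressed as joins and constraints over several tables, which is the overhead described above.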
There is thus a need for overcoming these and/or other issues associated with the prior art.