The present disclosure relates generally to an analysis of data, and more particularly, to an optimization of an algorithm based on an analysis of a page.
Document object model (DOM) algorithms are used extensively in computing applications and environments. For example, a crawler may need to perform computations on a page (e.g., a webpage) to allow the crawler to identify features associated with the page.
It is often desirable to obtain DOM content associated with a page from a programmatic point of view. DOM algorithms may associate an identifier (ID) with the page to determine whether a next page being visited is new or a duplicate of a page visited previously. Use of an ID may help the crawler to avoid an infinite loop (e.g., exploring the same pages repeatedly) while still covering most of the application (e.g., skipping only those select pages that would otherwise cause a loop). Another technique is the computation of a locality-sensitive hashing (LSH) key on the components of a page, which allows the crawler to determine which parts of the page the crawler has explored before and to identify the pages most similar to a current page.
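By way of illustration only, the two techniques described above may be sketched as follows. The sketch is not taken from the disclosure: the function names (`page_id`, `is_new_page`, `minhash_signature`, `lsh_keys`), the use of the tag sequence as the page's structural content, and the choice of MinHash with banding as the LSH scheme are all assumptions made for the example.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects opening tag names in document order; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in the markup."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def page_id(html):
    """Hypothetical page ID: a SHA-256 digest of the page's tag sequence."""
    structure = " ".join(tag_sequence(html))
    return hashlib.sha256(structure.encode("utf-8")).hexdigest()


def is_new_page(html, visited_ids):
    """Return True and record the ID if the page was not seen before, else False."""
    pid = page_id(html)
    if pid in visited_ids:
        return False
    visited_ids.add(pid)
    return True


def _shingles(tags, k):
    """Set of k-length windows over the tag sequence (one window if shorter than k)."""
    if len(tags) < k:
        return {tuple(tags)}
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def minhash_signature(html, num_hashes=32, k=3):
    """MinHash signature: for each salted hash function, the minimum over all shingles."""
    shingle_set = _shingles(tag_sequence(html), k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # a distinct salt stands in for a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(" ".join(s).encode("utf-8"),
                                digest_size=8, salt=salt).digest(),
                "big")
            for s in shingle_set))
    return signature


def lsh_keys(signature, bands=8):
    """Split the signature into bands and hash each band into one bucket key.

    Pages that share at least one key are treated as candidate near-duplicates.
    """
    rows = len(signature) // bands
    keys = []
    for b in range(bands):
        band = tuple(signature[b * rows:(b + 1) * rows])
        digest = hashlib.sha1(repr(band).encode("utf-8")).hexdigest()[:16]
        keys.append(f"{b}:{digest}")
    return keys
```

In such a sketch, a crawler would call `is_new_page` before exploring a page and skip the page when the call returns False, and would index each explored page under its `lsh_keys` so that pages sharing a key with the current page can be retrieved as likely similar pages. Hashing only the tag sequence is one possible design choice: two pages that differ only in text receive the same ID, which may or may not be desirable for a given application.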
DOM algorithms work directly on the DOM and frequently manipulate a large amount of text. As such, the DOM algorithms are computationally intensive and often prove to be a limiting factor (e.g., a so-called “bottleneck”) in terms of performance or execution time.