This invention relates to the field of computer systems. More particularly, methods are provided for capturing data from a displayed page more efficiently.
Current methods of extracting data from a web page are inefficient because they involve reading large amounts of information into memory, parsing that information, and then applying pattern-matching rules to find the specific data desired. Even if only a relatively small amount of data is to be captured, a relatively large amount of information (e.g., the entire page) must be processed.
For example, when the page to be scraped comprises HTML (HyperText Markup Language) code, the page's entire DOM (Document Object Model) tree is read into memory for parsing. As the code is parsed, it is searched for particular HTML tags that correspond to information that is desired.
This process of reading the page's DOM into memory, parsing it, and performing pattern-matching is processor-intensive, particularly because the HTML source code is captured in string form. Memory space and memory pointers must be maintained for as long as the processing takes.
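For illustration only, the conventional approach described above might be sketched in Python as follows. The names `Node`, `TreeBuilder`, `find_all` and `scrape_prices`, the sample page, and the `price` class attribute are hypothetical and are not drawn from any particular scraper. The tree built here is a simplified stand-in for a DOM and assumes well-formed markup with no void elements such as `<br>`.

```python
import re
from html.parser import HTMLParser


class Node:
    """One element in a simplified, DOM-like tree (hypothetical structure)."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Reads the entire page and keeps every element in memory."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self.current = self.root
        self.node_count = 0

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node
        self.node_count += 1

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def find_all(node, tag, attr=None, value=None):
    """Search the whole tree for elements matching a tag and optional attribute."""
    matches = []
    for child in node.children:
        if child.tag == tag and (attr is None or child.attrs.get(attr) == value):
            matches.append(child)
        matches.extend(find_all(child, tag, attr, value))
    return matches


def scrape_prices(page_source):
    """Parse the full page, then pattern-match the text of the matching elements."""
    builder = TreeBuilder()
    builder.feed(page_source)  # the entire page string is processed
    builder.close()
    prices = []
    for element in find_all(builder.root, "span", "class", "price"):
        match = re.search(r"\d+\.\d{2}", element.text)
        if match:
            prices.append(match.group(0))
    return prices, builder.node_count


# Hypothetical sample page: one desired value surrounded by unrelated markup.
PAGE = """<html><head><title>Catalog</title></head>
<body><div id="nav"><a href="/a">A</a><a href="/b">B</a></div>
<table><tr><td>Widget</td><td><span class="price">$19.99</span></td></tr></table>
<div id="footer"><p>Contact</p></div></body></html>"""

if __name__ == "__main__":
    prices, node_count = scrape_prices(PAGE)
    print(prices, node_count)
```

In this sketch, every element of the sample page is allocated and retained in the tree even though only a single value is wanted, which mirrors the overhead noted above.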