The identification and retrieval of data from websites is a difficult and pressing problem. While techniques have been developed for ‘crawling’ the web to classify and index websites, and to make the information available for searching and retrieval, extraction of complete data from a site remains problematic. It is known for websites to offer APIs to facilitate data extraction, but this is far from universal. Automatic complete extraction of data from a site without a suitable API is difficult because sites are not structured consistently, and page elements such as forms, sidebars, and navigation menus can be difficult to identify correctly or to interact with. Supervised systems are known, in which a user navigates to a site and identifies relevant data, which the system then uses to direct data extraction, but these are time-consuming and do not scale. Automatic full-site extraction has so far been used successfully only in settings with limited structure, such as title and body extraction from news articles or search engine results. These approaches are unsuitable for extracting highly structured data.