Optical character recognition (OCR) technology can convert images of text into machine-encoded text. Systems may use OCR to identify the content of webpages. For example, an image of a home page or landing page of a website may be converted to text to determine the entity providing content for presentation via the website, as well as contact information of the entity. Due to the sheer volume of existing websites, and the multiple webpages that a given website may include, processing large samples of websites is computationally expensive.
Furthermore, the markup language of a webpage may include information that differs from the information presented graphically to a viewer on a client device. As an example, the markup language may include multiple phone numbers, while a rendered version of the webpage shows only one of those phone numbers. In addition, a malicious entity may attempt to deceive automated processing systems or optical character recognition systems by obfuscating certain information presented on webpages. The technical challenges of optical character recognition may therefore be barriers to efficient processing of webpages and to detection of potential obfuscation or fabrication of information by malicious entities.
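As a non-limiting illustration of the phone number mismatch described above, the following minimal sketch compares the phone numbers found in a webpage's markup against those found in text recovered from a rendered image of the same page. It assumes the rendered text has already been produced by some OCR step and is supplied as a plain string. The names `PHONE_RE`, `phone_numbers`, `markup_text`, and `hidden_phone_numbers` are hypothetical, and the pattern covers only simple ten-digit, US-style numbers.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern for ten-digit, US-style phone numbers (an assumption;
# real webpages use many more formats than this covers).
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")


def phone_numbers(text):
    """Return the phone numbers found in text, normalized to digits only."""
    return {re.sub(r"\D", "", m.group()) for m in PHONE_RE.finditer(text)}


class _TextCollector(HTMLParser):
    """Collects text nodes and attribute values from markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attribute values may also carry contact information.
        for _, value in attrs:
            if value:
                self.parts.append(value)

    def handle_data(self, data):
        self.parts.append(data)


def markup_text(markup):
    """Flatten markup into a single string of text and attribute values."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join(collector.parts)


def hidden_phone_numbers(markup, rendered_text):
    """Phone numbers present in the markup but absent from the rendered text."""
    return phone_numbers(markup_text(markup)) - phone_numbers(rendered_text)


# Example: the second number is in the markup but is not displayed.
markup = (
    "<p>Call 555-010-0100</p>"
    '<span style="display:none">555-010-0199</span>'
)
rendered_text = "Call 555-010-0100"  # stand-in for OCR output of the rendered page
print(hidden_phone_numbers(markup, rendered_text))  # {'5550100199'}
```

A non-empty result only flags a discrepancy between the markup and the rendered page; whether that discrepancy reflects deliberate obfuscation is a separate determination.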