Benchmarking and performance testing of computer software and hardware are commonly performed for a variety of reasons. For example, a software developer, a hardware manufacturer, a network service provider, or another party may desire to compare the performance of different combinations of software and hardware under various conditions, such as with different hardware configurations, under different operating loads, with different network conditions, etc. When the party testing software is the software's developer, the software's code may be configured to output or store error codes or other specific data when specific problems are encountered during runtime. However, identifying specific problems that occur in, or that affect the usability of, a given system or specific software may be more difficult when such problems must be assessed by a third party, particularly when the software or service being tested or otherwise subject to an experiment presents problems visually, in a manner designed for a human viewing a computer monitor or other display, rather than in the form of an automated data output log or similar output. For example, in instances where a robot or other machine is configured to perform a series of automated actions on a keyboard, mouse, and/or touchscreen in order to perform experiments or tests of a computing system under various conditions, operating problems that are visually displayed on a screen by the computing system being tested may be difficult to identify in an automated manner (e.g., without manual human intervention that undercuts the efficiency of using a robot to conduct such an experiment).