Computer vision and machine learning networks are used to classify and identify objects in both digital images and videos. In object classification, a machine learning network is trained using a set of training data to classify particular types of objects. In object identification, the machine learning network is used to recognize specific instances of one or more object types that may exist in an image. With contemporary technology, however, evaluating and inferring object types in real-time video data is often graphics processing unit (GPU) and central processing unit (CPU) intensive. Due to the processing-intensive nature of real-time object inferencing on video data, prior art systems that infer objects from video exhibit a significant lag in receiving, processing, and rendering video output that depicts imagery of the original video data together with graphical indications of detected objects. This processing inefficiency leads to significant frame jitter and to display frame rates falling well below 50 frames per second. Certain applications, such as real-time video monitoring of medical procedures, require a high display frame rate output of the monitored procedure along with real-time inferencing and detection of objects in the video data.