Systems that implement computer vision and artificial intelligence are used to predict human actions. For instance, a robot or computer-implemented program interacts with humans based in part on visual inputs, such as images received from a camera. A computer system including a camera uses images to prompt interactions with a human user, such as shaking hands or providing menu options on a user interface. Certain systems use one or more input images (e.g., frames of a video) as visual inputs to predict a single subsequent frame. However, these techniques are limited to forecasting a single next frame with respect to a subject (e.g., a user of a system that implements computer vision). A single “predicted” frame is inadequate to accurately determine the intent of the subject. A computer system that inaccurately predicts its user's intent can cause frustration. In the case of a system capable of physical interactions, such as a robot, inaccurate predictions could endanger the user. Thus, it is beneficial to develop techniques to forecast the pose (e.g., the position and location) of the subject to a point in the future.