Creating a mixed-reality video (e.g., a video that includes imagery captured by a video camera as well as virtual imagery added “on top” of the captured imagery) can be a labor-intensive task. For example, editing video generated by a red-green-blue (RGB) camera to include an animation requires that a user access individual frames of the video and place graphics at the desired positions in each frame. Thus, if a user captures a video of himself walking and then wishes to edit the video to depict an animated glove over his hands while he walks, the user must edit individual frames of the video to include the glove in the appropriate position.
Conventional approaches for creating mixed-reality video that includes a moving human, designed to render the above-described process more efficient, require either specialized hardware sensors or markers that are applied to the human body, thereby allowing for motion capture. An exemplary specialized hardware sensor includes an RGB camera and a depth sensor synchronized with one another, wherein the sensor outputs data that can be analyzed to detect skeletal features of humans in the field of view of the sensor. These sensors, however, tend to be expensive and somewhat inflexible (as they must be maintained in a fixed position). Exemplary markers include infrared (IR) light emitters that are placed upon skin-tight suits to be worn by the humans who are to be captured in a video stream. This approach is often used for motion capture in big-budget films and/or video games, but is not well-suited for widespread adoption, as it requires use of IR cameras and further requires humans to wear “unnatural” clothing.