Creators of immersive experiences often include audio as a component. To make the experience truly immersive, the audio should align with what is being depicted in any visual component; incorrectly aligned audio and video can result in an unrealistic and disjointed environment. In a two-dimensional experience, stereo audio can provide sound using two channels, one for sounds occurring to the left and one for sounds occurring to the right relative to a location where, for example, a user is listening to the stereo audio. These two channels can be played in each ear to indicate where in the experience sound is being generated. A three-dimensional experience (e.g., augmented reality and/or virtual reality), on the other hand, can use ambisonics, or spatial audio, to indicate where sound is being generated. Ambisonics refers to a class of representations of spatial audio of different orders. First order ambisonics generally uses four channels of audio instead of the two used in stereo audio: W, X, Y, and Z, which together provide sound in three dimensions. W is omnidirectional audio, meaning audio captured from every direction. X, Y, and Z are the channels of audio along the x axis, y axis, and z axis, that is, left/right, up/down, and forward/backward. It should be appreciated that other orders of spatial audio can use additional channels (e.g., second order ambisonics can use nine channels and third order ambisonics can use sixteen channels).
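As an illustration of the four-channel structure described above, the following is a minimal sketch of encoding a mono sample into first-order ambisonics. It assumes the channel convention most common in the ambisonics literature (X pointing forward, Y pointing left, Z pointing up, with the traditional FuMa scaling of W by 1/√2); the function name and exact conventions are illustrative assumptions, not drawn from the text.

```python
import math

def encode_foa(sample, azimuth, elevation):
    """Encode a mono sample into first-order ambisonics (B-format).

    Illustrative sketch. Assumes X-forward / Y-left / Z-up axes,
    angles in radians, and FuMa-style W scaling; real pipelines may
    use other orderings and normalizations (e.g., ACN/SN3D).
    """
    w = sample * (1.0 / math.sqrt(2.0))                   # omnidirectional
    x = sample * math.cos(azimuth) * math.cos(elevation)  # forward/backward
    y = sample * math.sin(azimuth) * math.cos(elevation)  # left/right
    z = sample * math.sin(elevation)                      # up/down
    return w, x, y, z
```

For example, a source directly ahead of the listener (azimuth and elevation both zero) contributes only to W and X, with no left/right or up/down component.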
When generating a three-dimensional experience, the video component is often recorded separately from the audio component. For example, a camera capable of capturing a scene in three dimensions can be placed at a location to record a scene in multiple directions (e.g., 360 degrees or some subset thereof) with the camera as a reference point. The camera can be oriented in a particular direction such that the camera has a perspective that some direction is left, up, forward, etc., so that, as a scene is captured, the video is oriented accordingly. An ambisonic microphone placed in a position to capture audio related to the scene can have its own orientation, separate from that of the camera, with its own notion of x, y, and z. In this way, recorded audio can have an orientation different from the orientation of the camera.
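One useful property of ambisonics is that an orientation mismatch like the one described above can be corrected by rotating the sound field itself. As a sketch, a rotation about the vertical axis (e.g., when the microphone was yawed relative to the camera) leaves W and Z untouched and mixes only X and Y. This assumes the X-forward / Y-left / Z-up convention; the function name and angle convention are illustrative assumptions.

```python
import math

def rotate_foa_yaw(w, x, y, z, angle):
    """Rotate a first-order ambisonic frame about the vertical axis.

    `angle` is in radians, counterclockwise as seen from above.
    Sketch only; a full alignment would also handle pitch and roll.
    """
    xr = x * math.cos(angle) - y * math.sin(angle)
    yr = x * math.sin(angle) + y * math.cos(angle)
    return w, xr, yr, z
```

Applied per-sample (or per-block) to the recorded channels, such a rotation can bring the microphone's notion of "forward" into agreement with the camera's.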
When listening to this audio in conjunction with viewing a related visual component, a user can wear a pair of headphones that track the user's orientation as the user's head turns. Knowing the orientation of the user's head allows spatial audio to be rendered to the user so that different sounds encoded in the ambisonics recording are adjusted (e.g., in volume or frequency response) such that the audio presents a stable audio scene within which the user moves.
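One simple way to render such head-tracked audio is to point first-order "virtual microphones" in directions fixed relative to the scene and re-derive their signals as the head turns. The sketch below samples the sound field with a virtual cardioid; it assumes FuMa-scaled W and the X-forward / Y-left / Z-up convention, and is a simplification rather than a full binaural renderer (which would typically apply head-related transfer functions per ear).

```python
import math

def virtual_cardioid(w, x, y, z, azimuth, elevation):
    """Return the signal of a first-order virtual cardioid microphone
    pointed at (azimuth, elevation), both in radians.

    Sketch only. With head tracking, a renderer can re-point such
    virtual microphones each frame so a fixed source keeps its
    apparent position as the listener's head turns.
    """
    dx = math.cos(azimuth) * math.cos(elevation)
    dy = math.sin(azimuth) * math.cos(elevation)
    dz = math.sin(elevation)
    # Cardioid = 0.5 * (omni + figure-of-eight toward the look direction)
    return 0.5 * (w * math.sqrt(2.0) + x * dx + y * dy + z * dz)
```

A virtual cardioid aimed directly at a source recovers it at full level, while one aimed away from it rejects it, which is the mechanism by which sounds appear to stay put as the head rotates.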
As such, aligning spatial audio captured by a microphone with captured video can provide a more immersive user experience. Accurately aligning spatial audio with video, however, can be difficult. Some conventional methods require loading video and ambisonics audio and, thereafter, using headphones while watching the video. To determine if the audio and video are correctly aligned, a user watches the video and listens to the sound to judge whether sounds seem to come from the right locations. However, relying on human perception of sound and direction often results in inaccurate alignment. Other conventional methods have attempted to create visual representations of spatial audio to assist with such alignment. Such methods can be used in the context of augmented reality, virtual reality, and/or mixed reality post-processing editing of recorded spatial audio. However, such methods often result in visual representations of sound that do not accurately indicate where sound is actually coming from. Additionally, such methods fail to use meaningful visual attributes based on properties of the spatial audio to create visual representations of sound(s) within the spatial audio.