Depictions of large phenomena, such as the sides of a long city street or the side of a large cruise ship, are difficult to create with a camera. A single perspective photograph of buildings taken from the opposite side of a street, for example, will capture only a short portion of a street that may extend for many blocks. A photograph with a wider field of view will capture a slightly longer section of the street, but the street will appear more and more distorted towards the edges of the image. One solution to capturing a large scene is to take a photograph from much farther away. However, this is not always possible due to intervening objects (as in the case of a city block), and it will produce an image that looks quite different from what a human would see when walking along the street. Since a local observer sees streets from a closer viewpoint, perspective foreshortening is more pronounced; awnings get smaller as they extend from the observer's location, and crossing streets converge as they extend away from the observer. Images taken from a viewpoint far away lose these useful depth cues.
Previous work has introduced the concept of a photo montage, which, in general, is a process of assembling a single photograph from a series of photographs. Typically, the assumption was that all the pictures would be taken of the same subject from the same point of view, but that something in the scene was changing; in a group photo, for example, different people might or might not be smiling at any particular moment. Hence, one could employ this process to pick out all the smiling people and assemble a good image. The underlying technology relied on graph cuts: an objective function encoded what was desired (e.g., smiling people), and the system examined the pixels from the input images and chose the best ones. However, this system cannot handle a moving camera very well. In most cases, the input images must come from a still camera or, at most, a camera rotating about its optical axis.
Multi-perspective images have long been used by artists to portray large scenes. Perhaps the earliest examples can be found in ancient Chinese scroll paintings, which tell stories through space and time with multiple perspectives. Kubovy [1986] describes the multi-perspective effects that can be found in many Renaissance paintings, and explains that the primary motivation is to avoid perspective distortion in large scenes that can look odd, especially at the edges. For example, people in these paintings are often depicted from a viewpoint directly in front of them, even if the people are far from the center of the image. Otherwise they would be stretched in unsightly ways. More recently, artist Michael Koller has created multi-perspective images of San Francisco streets. The images consist of multiple regions of linear perspective photographs artfully seamed together to hide the transitions. Each building of the city block looks roughly like what you would see if you were actually standing directly in front of the building.
As detailed above, multi-perspective images are not new; they can be found in both ancient and modern art, as well as in computer graphics and vision research. However, the problem of creating a multi-perspective image that visualizes a large scene is not well-defined in general. For example, if the scene were completely planar, the scene could be modeled as a rectangle with a single texture map, and the ideal output would trivially be that texture map. However, in a world of varying depths, the problem of parallax arises. Parallax refers to the fact that as an observer moves, objects closer to the observer appear to move faster than objects farther away from the observer. Thus, images taken from shifted viewpoints of a world with varying depths do not line up in their overlapping regions.
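The parallax effect described above can be illustrated with a small numerical sketch. It assumes an idealized pinhole camera that translates sideways, parallel to the image plane, in which case a point at depth z shifts in the image by roughly f * t / z pixels for a focal length f (in pixels) and a camera translation t. The function name and all numeric values below are hypothetical and chosen only for illustration.

```python
def image_shift(focal_length_px, camera_shift_m, depth_m):
    """Apparent horizontal image shift, in pixels, of a point at depth_m
    when an idealized pinhole camera translates sideways by camera_shift_m."""
    return focal_length_px * camera_shift_m / depth_m


# Hypothetical values: 1000 px focal length, camera moved 1 m along the street.
focal_length_px = 1000.0
camera_shift_m = 1.0

# Nearer objects shift more than distant ones for the same camera motion.
shifts = {
    depth_m: image_shift(focal_length_px, camera_shift_m, depth_m)
    for depth_m in (5.0, 10.0, 50.0)
}

for depth_m, shift_px in shifts.items():
    print(f"depth {depth_m:5.1f} m -> shift {shift_px:6.1f} px")
```

Because the shift differs with depth, no single translation of one image can align both near and far objects with a neighboring image, which is why overlapping regions fail to line up.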
Researchers have also explored many uses of multi-perspective imaging. For example, rendering of multi-perspective images from 3D models was explored by several researchers [Agrawala et al. 2000; Yu and McMillan 2004; Glassner 2000]. Multi-perspective images were also used as a data structure to facilitate the generation of traditional perspective views [Wood et al. 1997; Rademacher and Bishop 1998]; however, research using captured imagery is less prevalent. Kasser and Egels [2002] studied photogrammetry extensively. Photogrammetry is the science of deducing the physical dimensions of objects from measurements on photographs, including objects in the 3-dimensional world depicted in the images as well as the position and orientation of the camera when the image was taken. In that work, aerial or satellite imagery is stitched together to create near-orthographic, top-down views of the earth. However, such work does not address the difficulty of images that depict 3-dimensional scenes, because 3-dimensional scenes introduce foreshortening and parallax problems that need not be dealt with in orthographic images.
In addition, continuously-varying viewpoint images can be created from video captured by a continuously moving camera by compositing strips from each frame; examples include pushbroom panoramas [Gupta and Hartley 1997; Seitz and Kim 2002], adaptive manifolds [Peleg et al. 2000], and x-slit images [Zomet et al. 2003]. Pushbroom panoramas can be used to visualize long scenes such as streets; however, such images will typically look quite different from what a human would perceive when viewing the scene. Pushbroom panoramas have orthographic perspective in the horizontal direction and regular perspective in the vertical direction. Thus, crossing streets will not converge to a point, and objects at varying depths will be stretched non-uniformly.
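The strip compositing behind a pushbroom panorama can be sketched in a few lines. This is a simplified illustration rather than the method of any of the cited works: it assumes the camera moves sideways at constant speed by exactly one strip width per frame, and that each frame is a plain list of pixel rows. The function name pushbroom_panorama and the toy input are hypothetical.

```python
def pushbroom_panorama(frames, strip_width=1):
    """Build a panorama by concatenating the central vertical strip of each frame.

    frames: list of images; each image is a list of rows, each row a list of pixels.
    Assumes all frames share the same size and were captured at evenly spaced
    positions along the camera path.
    """
    if not frames:
        return []
    height = len(frames[0])
    width = len(frames[0][0])
    start = (width - strip_width) // 2  # left edge of the central strip
    panorama = [[] for _ in range(height)]
    for frame in frames:
        for y in range(height):
            # Append this frame's central strip to the right of the panorama.
            panorama[y].extend(frame[y][start:start + strip_width])
    return panorama


# Toy input: three 2x3 frames whose pixel values encode 10 * frame_index + column.
frames = [[[10 * i + x for x in range(3)] for _ in range(2)] for i in range(3)]
panorama = pushbroom_panorama(frames)
print(panorama)
```

Each output column comes from a different camera position but always from the same column of the frame, which is why the result is orthographic horizontally while retaining ordinary perspective vertically.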
In later work, Roman et al. [2004] took inspiration from the deficiencies of pushbroom panoramas and the examples of artist Michael Koller to devise an interactive system for creating multi-perspective images of streets. They allow a user to create an image with multiple x-slit regions; thus, the final result consists of rectangles of single-viewpoint perspective, separated by rectangles with continuously interpolated perspective. This work was the first to demonstrate the limitations of pushbroom panoramas as well as improvements over them, but their approach has several limitations of its own. For one, they require a dense sampling of all rays along a camera path; this requirement necessitated a complex capture setup involving a high-speed 300 frame-per-second camera mounted on a truck that drives slowly down the street. Their use of video also severely limits output resolution compared to still cameras, and generates an immense amount of data that must be stored (and possibly compressed, resulting in artifacts). Moreover, since the video camera is constantly moving, short exposure times are required to avoid motion blur, which makes it much more difficult to avoid the noise of higher ISO settings while achieving bright enough images from natural light. The output of their system, unlike that of artist Michael Koller, contains regions of continuously shifting viewpoint, which can appear quite different and often worse; for example, these regions often exhibit inverted perspective, where objects farther away appear bigger rather than smaller.
Finally, other attempts have been made to render a multi-perspective image from a series of single-perspective images of a scene. Photogrammetry can produce a 3-dimensional model of the scene represented by the collection of input images, as well as the camera positions and orientations. The input images are then projected into this 3-dimensional world in order to produce the multi-perspective image output. However, this 3-dimensional world is a complex and irregular surface, and these irregularities commonly lead to distortions. Moreover, there is no satisfactory way to stitch the images together without leaving tell-tale seams in the final output.