Autostereoscopic displays are well known and examples are disclosed in EP 0 602 934, EP 0 656 555, EP 0 708 351, EP 0 726 482 and GB 9619097.0. FIG. 1 of the accompanying drawings illustrates schematically the basic components of a typical autostereoscopic display. The display comprises a display system 1 and a tracking system 2. The tracking system 2 comprises a tracking sensor 3 which supplies a sensor signal to a tracking processor 4. The tracking processor derives from the sensor signal an observer position data signal which is supplied to a display control processor 5 of the display system 1. The processor 5 converts the position data signal into a window steering signal and supplies this to a steering mechanism 6 which cooperates with a display 7 such that an observer 8 can view the display autostereoscopically throughout an extended range of observer positions.
FIG. 2 of the accompanying drawings illustrates, purely by way of example, part of a display system 1 including the display 7 and the steering mechanism 6. The steering mechanism comprises a light source 10 which comprises a linear array of individually controllable light emitting elements. A beam splitter 11 such as a partially silvered mirror transmits light from the light source 10 to a mirror 12 and reflects light from the light source 10 to another mirror 13. Light reflected by the mirror 12 passes through a lens 14 and is modulated by a spatial light modulator (SLM) in the form of a liquid crystal device (LCD) 15 with the right eye image of a stereoscopic pair. Similarly, light reflected by the mirror 13 passes through a lens 16 and is spatially modulated by an LCD 17 with a left eye image of the stereoscopic pair. A beam combiner 18, for instance in the form of a partially silvered mirror, reflects light from the LCD 15 to a viewing window 19 for the right eye of the observer 8. Light from the LCD 17 is transmitted by the beam combiner 18 and forms a viewing window 20 for the left eye of the observer 8. The width of each of the viewing windows 19 and 20 is large enough to cover all possible eye separations and typical values of eye separation are from 55 to 70 millimeters. As shown in FIG. 3, the three dimensional (3D) space containing the display 7 and the observer 8 may be described in terms of Cartesian coordinates where X represents the lateral direction, Y represents the vertical direction and Z represents the longitudinal direction. As illustrated in FIG. 4a of the accompanying drawings, diamond-shaped regions 21 and 22 of illumination are formed such that, if the right eye of the observer remains within the region 21 and the left eye of the observer remains within the region 22, a 3D image is perceived across the whole of the display 7.
These diamond-shaped regions are referred to as viewing zones and are widest at a "best viewing" window plane 23 which contains the viewing windows 19 and 20. The viewing zones 21 and 22 illustrate the theoretical longitudinal viewing freedom for the display 7.
In order to extend the viewing freedom of the observer, as described hereinbefore, observer tracking and control of the display may be provided. The positions of the viewing windows 19 and 20 are "steered" to follow movement of the head of the observer so that the eyes of the observer remain within the appropriate viewing zones. An essential part of such a display is the tracking system 2 which locates the position of the head and/or eyes of the observer. In effect, it is generally only necessary to track the centre point between the eyes of the observer because this is the position where the left and right viewing windows meet, as shown in the left part of FIG. 4b. Even for relatively large head rotations as shown in the right part of FIG. 4b, such a system accurately positions the viewing windows 19 and 20 so as to maintain autostereoscopic viewing.
Each viewing window has a useful viewing region which is characterised by an illumination profile in the plane 23 as illustrated in FIG. 5 of the accompanying drawings. The horizontal axis represents position in the plane 23 whereas the vertical axis represents illumination intensity. The ideal illumination profile would be rectangular with the adjacent window profiles exactly contiguous. However, in practice, this is not achieved.
As shown in FIG. 5, the width of the window is taken to be the width of the illumination profile at half the maximum average intensity. The profiles of the adjacent viewing windows are not exactly contiguous but have an underlap (as shown) or an overlap. There is variation in uniformity for the "top" of the profile, which represents the useful width. Outside the useful width, the intensity does not fall to zero abruptly but declines steeply to define an edge width. The profile does not reach zero intensity immediately but overlaps with the adjacent profile to give rise to cross talk. The differences between the ideal rectangular illumination profile and the actual profile result from a combination of degradation mechanisms including aberrations in optical systems, scatter, defocus, diffraction and geometrical errors in optical elements of the display. One of the objects of the tracking system 2 is to keep the eyes of the observer within the best viewing regions at all times. Ideally, the viewing windows 19 and 20 should be able to move continuously. However, in practice, the viewing windows may move in discrete steps between fixed positions. The steering mechanism 6 controls the movement or switching of the light source 10 so as to control the viewing window positions. The number of positions and the time required to switch between these positions vary with different displays and steering mechanisms. FIG. 6 of the accompanying drawings illustrates an example of the range of viewing window positions achievable for a display of the type shown in FIG. 2 where the light source 10 comprises fourteen contiguous illuminators and each viewing window has a width determined by four illuminators. This gives rise to eleven possible positions for the viewing window and a typical position is illustrated at 25. 
Each illuminator is imaged to a strip or "step" such as 26 in the window plane 23 having a width of 16 millimeters with four contiguous strips providing a window width of 64 millimeters. The tracking system 2 attempts to keep the pupil of the eye of the observer in the middle two strips as illustrated at 27. Before the viewer moves one step laterally away from the centre of the region 27, the tracking system 2 illuminates the next strip 26 in the direction of movement and extinguishes the opposite or trailing strip.
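The stepping arithmetic described above can be sketched as follows. This is a minimal illustration only: the function names `window_positions` and `step_window` are hypothetical and are not taken from any of the cited displays, and the illuminators are assumed to be indexed from 0.

```python
def window_positions(num_illuminators: int, window_width: int) -> list:
    """Return every contiguous group of illuminators that can form one window."""
    count = num_illuminators - window_width + 1
    return [range(start, start + window_width) for start in range(count)]


def step_window(start: int, direction: int,
                num_illuminators: int, window_width: int) -> int:
    """Move the window by one strip.

    Lighting the next strip in the direction of movement and extinguishing
    the trailing strip is equivalent to shifting the start index by one.
    The result is clamped so the window stays on the illuminator array.
    """
    max_start = num_illuminators - window_width
    return min(max(start + direction, 0), max_start)


positions = window_positions(14, 4)
print(len(positions))             # 11 possible window positions
print(4 * 16)                     # window width in millimeters: 64
print(step_window(3, +1, 14, 4))  # 4: window moved one strip to the right
```

With fourteen illuminators and a window four illuminators wide, the sketch reproduces the eleven positions stated in the text.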
In order to match the position data obtained by the tracking system 2 to the display window positions, a calibration process is required, for instance as disclosed in EP 0 769 881. A typical display 7 provides viewing zones in the shape of cones or wedges, such as 28 as shown in FIG. 7 of the accompanying drawings, which emanate from a common origin point referred to as the optical centre 29 of the display. The viewing zones determine the positions at which switching must take place whenever the centre of the two eyes of the observer moves from one window position to another. In this case, the viewing zones are angularly spaced in the horizontal plane specified by the lateral direction (X) and the longitudinal direction (Z) of the observer with respect to the display.
An ideal tracking and display system would respond to any head movement instantaneously. In practice, any tracking and display system requires a finite time, referred to as the system response time, to detect and respond to head movement. When there is only a finite number of steps for moving the viewing windows, an instant response may not be necessary. The performance requirements of the tracking system are then related to the distance an observer can move his eyes before the position of the viewing windows needs to be updated.
For the autostereoscopic display illustrated in FIG. 2 producing the window steps illustrated in FIG. 6, the observer can move by a distance d equivalent to one step before the system needs to respond and update the window position. The distance d and the maximum speed v of observer head movement determine the required system response time t of the tracking system such that
t = d/v
Normal head speed for an average observer is less than 300 millimeters per second but it is not unusual for the observer to move at higher speeds. This happens most often when the observer responds to sudden movement of objects in the displayed stereo image. A typical maximum head speed is about 500 millimeters per second. At this speed, with a typical value of d being 16 millimeters, the tracking and display systems have approximately 32 milliseconds in which to detect and respond to the observer head movements. If this response time is not achieved, the observer may see unpleasant visual artefacts such as flicker. FIG. 8 of the accompanying drawings illustrates at 30 the required system response time in milliseconds as a function of maximum head speed in millimeters per second.
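As a worked check of the figures quoted above, the relation between step size, head speed and required response time can be evaluated directly. The function name `required_response_time_ms` is a hypothetical helper introduced only for this illustration.

```python
def required_response_time_ms(step_mm: float, max_speed_mm_per_s: float) -> float:
    """Required response time in milliseconds for a step distance d and head speed v."""
    if max_speed_mm_per_s <= 0:
        raise ValueError("head speed must be positive")
    return 1000.0 * step_mm / max_speed_mm_per_s


# A 16 millimeter step at the typical maximum head speed of 500 millimeters per second.
print(required_response_time_ms(16, 500))  # 32.0
```

At the normal head speed of 300 millimeters per second the same step allows roughly 53 milliseconds, which shows how quickly the budget shrinks as head speed rises.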
In practice, the actual response time of the tracking and display system includes not only the time required for determining the position of the observer but also the communication time needed to pass this information to the steering mechanism and the time required to switch between the current window position and the next window position.
The required system response time is further reduced by the accuracy of the tracking system. The effect of measuring error is equivalent to a reduction in the step distance d that an observer can move before the viewing windows have moved so that the required system response time becomes
T = (d-e)/v
where e is the measuring error. The broken line 31 in FIG. 8 illustrates the response time where e is 5 millimeters. Thus, the required response time is reduced to 22 milliseconds for a maximum head speed of 500 millimeters per second.
It is desirable to reduce the measuring error e but this cannot in practice be reduced to zero and there is a limit to how small the error can be made because of a number of factors including image resolution and the algorithms used in the tracking. In general, it is difficult to determine the measuring error until the algorithm for measuring the position data is implemented. For this reason, the above equation may be rewritten as:
v = (d-e)/T
This gives the maximum head speed at which an observer can see a continuous 3D image for a given measuring error and a given response time. The smaller the measuring error and the shorter the response time, the faster an observer can move his head. The step size, the measuring error and the system response time should therefore be such as to provide a value of v which meets the desired criterion, for instance of 500 millimeters per second.
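The two forms of the relation, with the measuring error e included, can be sketched as below. The function names are hypothetical, and the only values used are those quoted in the text (d of 16 millimeters, e of 5 millimeters, v of 500 millimeters per second).

```python
def response_time_with_error_ms(step_mm: float, error_mm: float,
                                max_speed_mm_per_s: float) -> float:
    """T = (d - e) / v, returned in milliseconds."""
    if error_mm >= step_mm:
        raise ValueError("measuring error must be smaller than the step distance")
    return 1000.0 * (step_mm - error_mm) / max_speed_mm_per_s


def max_head_speed_mm_per_s(step_mm: float, error_mm: float,
                            response_time_ms: float) -> float:
    """v = (d - e) / T, with T supplied in milliseconds."""
    if response_time_ms <= 0:
        raise ValueError("response time must be positive")
    return (step_mm - error_mm) / (response_time_ms / 1000.0)


print(response_time_with_error_ms(16, 5, 500))      # 22.0
print(round(max_head_speed_mm_per_s(16, 5, 22.0)))  # 500
```

A system that needed 40 milliseconds to respond would, under the same assumptions, support a head speed of only about 275 millimeters per second.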
A known type of infrared tracking system based on detecting infrared radiation reflected from a retroreflective spot worn by an observer between his eyes is called the DynaSight sensor and is available from Origin Instruments. The 3D coordinates of the retroreflective spot with respect to an infrared sensor are obtained at a rate of up to 64 Hz. This provides the required information on the observer head position relative to the retroreflective spot so that the left and right images can be directed to the correct eyes as the head moves.
Observer head position detection based on the use of infrared video cameras is disclosed in WO96/18925 and U.S. Pat. No. 5,016,282. Other such systems are available from IScan and HHI. However, all infrared based systems suffer from some or all of the following disadvantages:
the need for an infrared video camera system;
the use of a controlled infrared light source and the resulting component costs;
the complex arrangement between the infrared source and its sensor;
the inconvenience of attaching markers to the observer;
the extra power supply required for an infrared source; and
discomfort caused by shining infrared light towards the eyes of the observer at close range.
Several tracking systems are based on the use of visible light video cameras. For instance, U.S. Pat. No. 4,975,960 discloses a system which tracks the nostrils in order to locate the mouth for vision-augmented speech recognition. However, the precision of this technique is not sufficient for many applications and, in particular, for controlling an observer tracking autostereoscopic display.
Another technique is disclosed in the following papers:
T. S. Jebara and A. Pentland, "Parametrized Structure from Motion for 3D Adaptive Feedback Tracking of Faces", MIT Media Laboratories, Perceptual Computing Technical Report 401, submitted to CVPR November 1996; A. Azarbayejani et al "Real-Time 3D Tracking of the Human Body" MIT Laboratories Perceptual Computing Section Technical Report No. 374, Proc IMAGE'COM 1996, Bordeaux, France, May 1996; N. Oliver and A. Pentland "LAFTER: Lips and Face Real Time Tracker" MIT Media Laboratory Perceptual Computing Section Technical Report No. 396 submitted to Computer Vision and Pattern Recognition Conference, CVPR'96; and A. Pentland "Smart Rooms", Scientific American, Volume 274, No. 4, pages 68 to 76, April 1996. However, these techniques rely on the use of a number of sophisticated algorithms which are impractical for commercial implementation. Further, certain lighting control is necessary to ensure reliability.
Another video camera based technique is disclosed in A. Suwa et al "A video quality improvement technique for videophone and videoconference terminal", IEEE Workshop on Visual Signal Processing and Communications, 21-22 September, 1993, Melbourne, Australia. This technique provides a video compression enhancement system using a skin colour algorithm and approximately tracks head position for improved compression ratios in videophone applications. However, the tracking precision is not sufficient for many applications.
Most conventional video cameras have an analogue output which has to be converted to digital data for computer processing. Commercially available and commercially attractive video cameras use an interlaced raster scan technique such that each frame 32 consists of two interlaced fields 33 and 34 as illustrated in FIG. 9 of the accompanying drawings. Each field requires a fixed time for digitisation before it can be processed and this is illustrated in FIG. 10 of the accompanying drawings. Thus, the first field is digitised during a period 35 and computing can start at a time 36 such that the first field can be processed in a time period 37, during which the second field can be digitised. The time interval from the start of image digitisation to the moment at which the position data are obtained is referred to as the time latency as illustrated at 38. The update frequency relates to how frequently the position data are updated. If the computing time does not exceed the time for digitising one field as illustrated in FIG. 10, the update frequency is the same as the field digitisation rate.
As described hereinbefore, the required system response time includes not only the time latency 38 but also the communication time needed to pass the position information to the window steering mechanism and the time required to switch between the current window position and the next window position.
The field digitisation is performed in parallel with the computing process by using a "ring buffer". As illustrated diagrammatically in FIG. 11, a ring buffer 39 is a memory buffer containing two memory blocks 40 and 41, each of which acts as a field buffer and is large enough to store one field of the digital image. Thus, while one of the buffers 40 and 41 is being used for digitising the current field, the other buffer makes available the previous field for processing.
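The two-block arrangement can be sketched in software as follows. This is an illustrative model only: the class name `FieldRingBuffer` and its methods are hypothetical, and the sketch runs sequentially, whereas in a real system the digitising hardware would fill one block while the processor reads the other.

```python
from typing import Optional


class FieldRingBuffer:
    """Two field-sized memory blocks used alternately for digitising and processing."""

    def __init__(self, field_size: int) -> None:
        self._blocks = [bytearray(field_size), bytearray(field_size)]
        self._write_index = 0       # block that receives the next field
        self._has_previous = False  # becomes True once one field is complete

    def digitise(self, field_data: bytes) -> None:
        """Store a newly digitised field, then swap the roles of the two blocks."""
        block = self._blocks[self._write_index]
        if len(field_data) != len(block):
            raise ValueError("field does not match the buffer size")
        block[:] = field_data
        self._write_index ^= 1
        self._has_previous = True

    def previous_field(self) -> Optional[bytes]:
        """Return the most recently completed field, or None before the first one."""
        if not self._has_previous:
            return None
        return bytes(self._blocks[self._write_index ^ 1])


buffer = FieldRingBuffer(4)
buffer.digitise(b"\x01\x02\x03\x04")  # first field goes into one block
buffer.digitise(b"\x05\x06\x07\x08")  # second field goes into the other block
print(buffer.previous_field())        # b'\x05\x06\x07\x08'
```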
The time required to capture a field of an image is 20 milliseconds for a PAL camera operating at 50 fields per second and 16.7 milliseconds for an NTSC camera operating at 60 fields per second. As described hereinbefore and illustrated in FIG. 8, for a typical autostereoscopic display, the tracking system 2, the display control processor 5 and the steering mechanism 6 shown in FIG. 1 have only about 22 milliseconds to detect and respond to head movement for a maximum head speed of 500 millimeters per second and a measuring error of 5 millimeters. If a PAL camera is used, the time left for processing the image and for covering other latencies due to communication and window steering is about 2 milliseconds. This time is increased to about 5.3 milliseconds if an NTSC camera is used. Thus, the available time limits the processing techniques which can be used if standard commercially attractive hardware is to be used. If the actual time taken exceeds this time limit, the observer may have to restrict his head movement speed in order to see a flicker-free stereo image.
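The time budget described above follows from subtracting the field capture time from the required response time. The sketch below reproduces the quoted figures; the function name `remaining_budget_ms` is a hypothetical helper.

```python
def remaining_budget_ms(response_time_ms: float, fields_per_second: float) -> float:
    """Time left for processing, communication and window steering
    once one field has been captured."""
    field_time_ms = 1000.0 / fields_per_second
    return response_time_ms - field_time_ms


print(round(remaining_budget_ms(22.0, 50), 1))  # PAL:  2.0
print(round(remaining_budget_ms(22.0, 60), 1))  # NTSC: 5.3
```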
Although the time required for digitising a video field may be reduced if a non-standard high speed camera is used, this is undesirable because of the substantially increased costs. Even if a high speed camera is used, there may be a limit to how fast it can be operated. It is very desirable to avoid the need for special light sources, whether visible or infrared, in order to achieve cost savings and improved ease of use. Thus, the tracking system 2 should be able to work with ordinary light sources whose intensities may oscillate at 100 or 120 Hz using the normal power supply i.e. twice the power supply frequency of 50 Hz, for instance in the UK, or 60 Hz, for instance in USA. If a camera is operating at a speed close to or above this frequency, images taken at different times may differ significantly in intensity. Overcoming this effect requires extra computing complexity which offsets advantages of using high speed cameras.
There is a practical limit to the computing power available in terms of cost for any potential commercial implementation. Thus, a low resolution camera is preferable so that the volume of image data is as small as possible. However, a video camera would have to cover a field of view at least as large as the viewing region of an autostereoscopic display, so that the head of the observer would occupy only a small portion of the image. The resolution of the interesting image regions such as the eyes would therefore be very low. Also, the use of field rate halves the resolution in the vertical direction.
There are many known techniques for locating the presence of an object or "target image" within an image. Many of these techniques are complicated and require excessive computing power and/or high resolution images in order to extract useful features. Such techniques are therefore impractical for many commercial applications.
A known image tracking technique is disclosed by R. Brunelli and T. Poggio "Face Recognition: Features Versus Templates", IEEE Trans on Pattern Analysis and Machine Intelligence, Volume 15 No. 10, October 1993. This technique is illustrated in FIG. 12 of the accompanying drawings. In a first step 45, a "template" which contains a copy of the target image to be located is captured. FIG. 13 illustrates an image to be searched at 46 and a template 47 containing the target image. After the template has been captured, it is used to interrogate all subsections of each image field in turn. Thus, at step 48, the latest digitised image is acquired and, at step 49, template matching is performed by finding the position at which there is a best correlation between the template and the "underlying" image area. In particular, a subsection of the image 46 having the same size and shape as the template is selected from the top left corner of the image and is correlated with the template 47. The correlation is stored and the process repeated by selecting another subsection one column of pixels to the right. This is repeated for the top row of the image and the process is then repeated by moving down one row of pixels. Thus, for an image having M by N pixels and a template having m by n pixels, there are (M-m+1) by (N-n+1) positions as illustrated in FIG. 14 of the accompanying drawings. The cross-correlation values for these positions form a two dimensional function of these positions and may be plotted as a surface as shown in FIG. 15 of the accompanying drawings. The peak of the surface indicates the best matched position.
A step 50 determines whether the peak or best correlation value is greater than a predetermined threshold. If so, it may be assumed that the target image has been found in the latest digitised image and this information may be used, for instance as suggested at 51, to control an observer tracking autostereoscopic display. When the next digitised image has been captured, the steps 49 to 51 are repeated, and so on.
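The exhaustive search and threshold test described above can be sketched as follows. This is a minimal illustration, not the cited implementation: normalised cross-correlation is assumed as the correlation measure, images are represented as lists of rows, and the names `normalised_correlation` and `match_template` and the threshold value of 0.9 are hypothetical.

```python
import math


def normalised_correlation(patch, template) -> float:
    """Normalised cross-correlation between two equally sized pixel blocks."""
    p = [value for row in patch for value in row]
    t = [value for row in template for value in row]
    mean_p = sum(p) / len(p)
    mean_t = sum(t) / len(t)
    numerator = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    denominator = math.sqrt(sum((a - mean_p) ** 2 for a in p)
                            * sum((b - mean_t) ** 2 for b in t))
    return numerator / denominator if denominator else 0.0


def match_template(image, template, threshold: float = 0.9):
    """Correlate the template at every position and return
    (best position, best score, whether the score exceeds the threshold)."""
    rows, cols = len(image), len(image[0])
    m, n = len(template), len(template[0])
    best_score, best_position = -1.0, None
    for r in range(rows - m + 1):          # move down one row of pixels at a time
        for c in range(cols - n + 1):      # move right one column of pixels at a time
            patch = [row[c:c + n] for row in image[r:r + m]]
            score = normalised_correlation(patch, template)
            if score > best_score:
                best_score, best_position = score, (r, c)
    return best_position, best_score, best_score > threshold


template = [[9, 1], [2, 8]]
image = [[0] * 6 for _ in range(5)]
image[2][3:5] = [9, 1]
image[3][3:5] = [2, 8]
position, score, found = match_template(image, template)
print(position, found)  # (2, 3) True
```

For this 5 by 6 image and 2 by 2 template the loops visit 4 by 5, that is 20, positions, which illustrates why the cost of a direct search grows quickly with image and template size.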
Another template matching technique is disclosed in U.S. Pat. No. 3,828,122, which discloses a target tracking apparatus for an airborne missile having a video camera for providing a series of images. A user defines a target on the first image by moving a large rectangle (containing a small rectangle) over the image on a display. When the small rectangle is over the target, the image inside the small rectangle is stored in a target memory. When the next image is received, the position of the image inside the large rectangle is stored in a current frame memory and the apparatus determines whether the large rectangle is still centred on the target. In particular, the contents of the target memory are correlated with the contents of the current frame memory for all positions of the small rectangle within the large rectangle and the position giving the highest correlation is selected.
Although template matching is relatively easy for computer implementation, it is a computing-intensive operation. Direct template matching requires very powerful computer hardware which is impractical for commercial implementation. EP 0 579 319 discloses a method of tracking movement of a face in a series of images from a videophone. The centroid and motion vector for the face image are determined in a first frame and used to estimate the centroid in a subsequent frame. An image is "grown" around the estimated centroid by comparing areas around the centroids in the first and subsequent frames. The motion vector is estimated by comparing face positions in the first frame and in a preceding frame.
This technique suffers from several disadvantages. For instance, the shape and size of the image grown around the centroid may differ from frame to frame. The centroid does not therefore refer to the same place in the face image, such as the mid point between the eyes of an observer, and this results in substantial errors in tracking. Further, after the initial position of the centroid has been determined in the first frame, subsequent estimations of the centroid position are based on determining the motion vector. Errors in determining the motion vector may therefore accumulate. Thus, although this technique may be suitable for videophone applications, it is not sufficiently accurate for use in observer tracking autostereoscopic displays.