There is a technology that obtains the position and the orientation of a camera from an image captured by the camera, which is attached to a personal computer (PC), a mobile terminal, or the like. Furthermore, there is an augmented reality (AR) technology that uses the position and the orientation of the camera to superimpose additional information, such as computer graphics (CG), onto a captured image displayed on a screen of a PC, a mobile terminal, or the like, thereby supporting the work of a user.
FIG. 11 is a schematic diagram illustrating an example of the AR technology. As illustrated in FIG. 11, for example, if a user captures an image that contains both a marker 11 and a check target 12 by using a camera that is built into a mobile terminal 10, object information 13 associated with the marker 11 is displayed on a screen 10a of the mobile terminal 10.
As a method of obtaining the position and the orientation of a camera, there is a conventional technology 1 that calculates the position and the orientation of the camera by using, for example, feature points included in a captured image. The conventional technology 1 detects, as a feature point, a point of interest around which the variation in shading is large, so that the position of the point of interest on the image is uniquely specified. The conventional technology 1 uses a previously created set of three-dimensional coordinates of the feature points. In the description below, the three-dimensional coordinates of the previously created feature points are referred to as map points, and a set of the map points is referred to as a three-dimensional map. The conventional technology 1 calculates the position and the orientation of the camera by associating the feature points that are present in the captured image at the present time with the map points projected onto the captured image.
FIG. 12 is a schematic diagram illustrating the conventional technology 1 that obtains the position and the orientation of a camera. In the example illustrated in FIG. 12, it is assumed that map points S1 to S6 are present. A certain map point Si is represented by Equation (1) in the world coordinate system. It is assumed that feature points x1 to x6 are present in a captured image 20. A certain feature point xi is represented by Equation (2) in a camera coordinate system. It is assumed that the map points projected on the captured image 20 are projection points x1′ to x6′. A certain projection point xi′ is represented by Equation (3) in the camera coordinate system.
$$S_i = (x, y, z) \quad (1)$$
$$x_i = (u, v) \quad (2)$$
$$x_i' = (u', v') \quad (3)$$
For example, in the conventional technology 1, the position and the orientation of the camera are obtained by calculating the camera position/orientation matrix RT that minimizes the sum of squares E calculated by Equation (4). The process of estimating the position and the orientation of the camera for each image in a series of captured images is referred to as “tracking”.
$$E = \sum_{P} \left| x_P' - x_P \right|^{2} \quad (4)$$
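As an illustration of Equation (4), the following minimal sketch evaluates the sum of squares E for a candidate camera pose. It assumes a simple pinhole camera with unit focal length and the principal point at the image origin, and it represents the matrix RT as a separate 3x3 rotation and a translation vector; the function names `project` and `reprojection_error` are hypothetical and do not come from the conventional technology 1 itself.

```python
def project(point, rotation, translation, focal=1.0):
    """Project a world-space map point onto the image (assumed pinhole model).

    rotation is a 3x3 row-major matrix and translation is a 3-vector; together
    they stand in for the camera position/orientation matrix RT.
    """
    # Transform world coordinates into camera coordinates: Xc = R * Xw + t.
    xc, yc, zc = (
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    # Perspective division (assumption: principal point at (0, 0)).
    return (focal * xc / zc, focal * yc / zc)


def reprojection_error(map_points, feature_points, rotation, translation):
    """Sum of squared distances between projection points and feature points."""
    total = 0.0
    for map_point, (u, v) in zip(map_points, feature_points):
        u_proj, v_proj = project(map_point, rotation, translation)
        total += (u_proj - u) ** 2 + (v_proj - v) ** 2
    return total


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    map_points = [(0, 0, 2), (1, 0, 2), (0, 1, 4)]
    # Synthetic feature points observed from a camera shifted by 0.1 along x.
    observed = [project(p, identity, (0.1, 0, 0)) for p in map_points]
    print(reprojection_error(map_points, observed, identity, (0.1, 0, 0)))
    print(reprojection_error(map_points, observed, identity, (0.0, 0, 0)))
```

A tracking loop would search over the pose for the smallest value of this function, for example with a nonlinear least-squares solver; the choice of solver is left open here.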
Next, how the conventional technology 1 creates a three-dimensional map will be described. FIG. 13 is a schematic diagram illustrating the conventional technology 1 that creates a three-dimensional map. For example, the conventional technology 1 uses the principle of stereo image capturing. The conventional technology 1 associates the same feature points in two captured images that are obtained from different image capturing positions. Based on the positional relationship between the multiple associated points that are present in each of the captured images, the conventional technology 1 creates a three-dimensional map in which the associated points are used as map points.
In the example illustrated in FIG. 13, it is assumed that the map point to be restored is represented by Si and that the intersection point of a first captured image 20a and the line connecting an initial image capturing position Ca of the camera to the map point Si is represented by a feature point xai. It is assumed that the intersection point of a second captured image 20b and the line connecting a second image capturing position Cb of the camera to the map point Si is represented by a feature point xbi. In this case, the associated points are the feature point xai and the feature point xbi. The conventional technology 1 calculates the three-dimensional coordinates of the map point Si from the relationship between the feature points xai and xbi and the map point Si in accordance with the principle of stereo image capturing.
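The stereo principle described for FIG. 13 can be sketched as follows. This is only one possible realization, assuming that the two image capturing positions Ca and Cb and the directions of the rays through xai and xbi are already known in a common coordinate system; it uses midpoint triangulation, which is a swapped-in technique and not necessarily the one used by the conventional technology 1. The function name `triangulate_midpoint` is hypothetical.

```python
def triangulate_midpoint(ca, da, cb, db):
    """Estimate a map point from two rays (midpoint triangulation).

    ca, cb: camera centers (3-tuples); da, db: ray directions from each
    center through the associated feature points. Returns the midpoint of
    the shortest segment connecting the two rays.
    """
    def dot(p, q):
        return sum(pi * qi for pi, qi in zip(p, q))

    w0 = tuple(a - b for a, b in zip(ca, cb))
    a, b, c = dot(da, da), dot(da, db), dot(db, db)
    d, e = dot(da, w0), dot(db, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        # Parallel rays carry no depth information.
        raise ValueError("rays are parallel; the map point cannot be restored")
    s = (b * e - c * d) / denom  # parameter along the first ray
    t = (a * e - b * d) / denom  # parameter along the second ray
    pa = tuple(ci + s * di for ci, di in zip(ca, da))
    pb = tuple(ci + t * di for ci, di in zip(cb, db))
    return tuple((x + y) / 2 for x, y in zip(pa, pb))


if __name__ == "__main__":
    ca, cb = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
    true_point = (0.5, 0.2, 3.0)
    da = tuple(p - c for p, c in zip(true_point, ca))
    db = tuple(p - c for p, c in zip(true_point, cb))
    print(triangulate_midpoint(ca, da, cb, db))
```

With noise-free rays, the two closest points coincide and the midpoint equals the original map point; with noisy feature positions, the midpoint serves as a compromise estimate.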
In general, the position and the image capturing direction of the camera at the time the first captured image is captured are used as the origin of the three-dimensional coordinates of the three-dimensional map. FIG. 14 is a schematic diagram illustrating an example of a definition of the image capturing direction of the camera. As illustrated in FIG. 14, the origin of the three-dimensional coordinates of the three-dimensional map is defined based on, for example, the position (Tx, Ty, Tz) and the orientation (Rx, Ry, Rz) of a camera 50.
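For illustration, the six pose parameters (Tx, Ty, Tz) and (Rx, Ry, Rz) can be packed into a single 4x4 homogeneous matrix. The sketch below assumes that the angles are given in radians and that the rotations are composed as R = Rz · Ry · Rx; neither the angle convention nor the composition order is specified in the description above, and the function name `pose_to_matrix` is hypothetical.

```python
import math


def pose_to_matrix(tx, ty, tz, rx, ry, rz):
    """Build a 4x4 homogeneous matrix from a position and an orientation.

    Assumption: rx, ry, rz are rotations about the x, y, and z axes in
    radians, composed as R = Rz * Ry * Rx.
    """
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    rotation = [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]
    return [
        rotation[0] + [tx],
        rotation[1] + [ty],
        rotation[2] + [tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


if __name__ == "__main__":
    # The camera of the first captured image defines the origin, so its own
    # pose is the identity matrix.
    for row in pose_to_matrix(0, 0, 0, 0, 0, 0):
        print(row)
```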
A conventional technology 2, like the conventional technology 1, uses the feature points included in a captured image. FIG. 15 is a schematic diagram illustrating the conventional technology 2. The conventional technology 2 determines whether a previously prepared recognition purpose image is included in a captured image. As a recognition purpose image, an image, such as a photograph, an illustration, an icon, or the like, is used. In the recognition purpose image, information on the coordinate position of each feature point is associated with information on a feature amount of that feature point. The feature amount is a vector of numerical values that is used to distinguish a feature point from the other feature points and that indicates the density distribution of a plurality of pixels in the vicinity of the feature point.
The conventional technology 2 compares the feature amounts of the feature points in the captured image with the feature amounts of the feature points in each of the recognition purpose images and determines that the recognition purpose image in which the match rate of the feature amounts is the highest is included in the captured image. When determining the recognition purpose image included in the captured image, the conventional technology 2 calculates, similarly to the conventional technology 1, the position and the orientation of the camera by using each of the coordinate positions associated with the determined recognition purpose image as a three-dimensional map.
In the example illustrated in FIG. 15, it is assumed that recognition purpose images 1 to 5 are stored in a database. It is assumed that feature points 1a to 1d are included in a recognition purpose image 1 and assumed that the feature amounts of the respective feature points are 70, 110, 70, and 110. It is assumed that feature points 2a to 2d are included in a recognition purpose image 2 and assumed that the feature amounts of the respective feature points are 70, 70, 110, and 110. It is assumed that feature points 3a to 3e are included in a recognition purpose image 3 and assumed that the feature amounts of the respective feature points are 108, 108, 108, 108, and 108. It is assumed that feature points 4a to 4d are included in a recognition purpose image 4 and assumed that the feature amounts of the respective feature points are 90, 90, 90, and 90. It is assumed that feature points 5a to 5c are included in a recognition purpose image 5 and assumed that the feature amounts of the respective feature points are 60, 60, and 60.
The conventional technology 2 detects feature points 6a to 6d from a captured image 6, and it is assumed that the feature amounts of the respective feature points are 90, 90, 90, and 90. The conventional technology 2 compares the feature amounts of the feature points 6a to 6d in the captured image 6 with the feature amounts of the respective feature points in the recognition purpose images 1 to 5. The conventional technology 2 detects the recognition purpose image 4, which includes the feature amounts that match the feature amounts of the feature points 6a to 6d. The conventional technology 2 determines that the recognition purpose image 4 is included in an area 7 in the captured image 6 and calculates the position and the orientation of the camera by using, as the map points, the coordinate positions associated with the feature points 4a to 4d in the recognition purpose image 4.
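The selection in the FIG. 15 example can be sketched with the feature amounts given above. The description does not define how the match rate is computed, so the sketch assumes a simple definition: each feature amount in the captured image is paired with at most one equal feature amount in the recognition purpose image, and the number of pairs is divided by the larger of the two feature counts. Scalar feature amounts are used as in the example, although an actual feature amount is a vector. The names `match_rate` and `best_recognition_image` are hypothetical.

```python
def match_rate(captured, candidate, tolerance=0):
    """Fraction of feature amounts that can be paired one to one.

    Assumed definition (not taken from the source): two feature amounts
    match when they differ by at most `tolerance`.
    """
    remaining = list(candidate)
    matched = 0
    for amount in captured:
        for index, other in enumerate(remaining):
            if abs(amount - other) <= tolerance:
                matched += 1
                del remaining[index]  # each candidate feature is used once
                break
    return matched / max(len(captured), len(candidate))


def best_recognition_image(captured, database):
    """Return the identifier of the image with the highest match rate."""
    return max(database, key=lambda key: match_rate(captured, database[key]))


# Feature amounts of the recognition purpose images 1 to 5 in FIG. 15.
DATABASE = {
    1: [70, 110, 70, 110],
    2: [70, 70, 110, 110],
    3: [108, 108, 108, 108, 108],
    4: [90, 90, 90, 90],
    5: [60, 60, 60],
}

if __name__ == "__main__":
    captured_image_6 = [90, 90, 90, 90]  # feature points 6a to 6d
    print(best_recognition_image(captured_image_6, DATABASE))
```

Under this assumed definition, the recognition purpose image 4 is the only entry with a nonzero match rate for the captured image 6, which agrees with the determination described above.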
Patent Document 1: Japanese Laid-open Patent Publication No. 2013-141049
Patent Document 2: Japanese Laid-open Patent Publication No. 2014-164483
However, with the conventional technology described above, there is a problem in that it is not possible to continuously and stably perform tracking by using a recognition purpose image.
In general, when the position and the orientation of a camera are calculated, the following relationship holds in principle between the feature points and the accuracy. Namely, the more widely the map points are distributed in a captured image, the higher the accuracy of calculating the position and the orientation of the camera becomes. Furthermore, the larger the number of map points present in a captured image, the higher the accuracy of calculating the position and the orientation becomes.
Depending on the recognition purpose image, the positional distribution of the detected feature points may be biased. FIGS. 16 and 17 are schematic diagrams each illustrating a problem of the conventional technology. In FIG. 16, in a recognition purpose image 30A, the distribution of the feature points is uniform; however, in a recognition purpose image 30B, the distribution of the feature points is biased. In the process of determining which recognition purpose image is included in the captured image, the determination accuracy is not decreased for either of the recognition purpose images 30A and 30B, regardless of whether the distribution is uniform. However, if the recognition purpose image 30B is included in the captured image and tracking is attempted by using the recognition purpose image 30B, the map points are not widely distributed in the recognition purpose image and thus the calculation accuracy of the position and the orientation of the camera is decreased.
In order to solve the problem described above, it is conceivable to relax the detection condition of the feature points and simply increase the number of feature points in the recognition purpose image. However, if the number of feature points is simply increased, a new problem occurs, such as an increase in processing time at the time of tracking.
In the example illustrated in FIG. 17, the types of the recognition purpose images that are present in the captured images are the same; however, the areas in which the recognition purpose image is detected differ. If the recognition purpose image is present in a central area 35A of the captured image, the distribution of the map points is uniform and thus the calculation accuracy of the position and the orientation of the camera is not decreased. However, if the recognition purpose image is present in an area 35B at the edge of the captured image, the map points are biased with respect to the entire captured image and thus the calculation accuracy of the position and the orientation of the camera is decreased.
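The bias described for FIGS. 16 and 17 can be made concrete with a simple measure. The sketch below divides the captured image into a grid and reports the fraction of cells that contain at least one map point; this grid coverage measure, the 4x4 grid, and the 640x480 image size are illustrative assumptions and are not part of the conventional technology.

```python
def grid_coverage(points, width, height, cols=4, rows=4):
    """Fraction of grid cells of the captured image that contain a point.

    A value near 1.0 suggests widely distributed map points; a value near
    1 / (cols * rows) suggests that the points are concentrated in one area.
    """
    occupied = set()
    for u, v in points:
        if 0 <= u < width and 0 <= v < height:
            occupied.add((int(u * cols / width), int(v * rows / height)))
    return len(occupied) / (cols * rows)


if __name__ == "__main__":
    # Sixteen points spread over the whole image, one per grid cell.
    spread = [(80 + 160 * i, 60 + 120 * j) for i in range(4) for j in range(4)]
    # Sixteen points packed into the corner, as when the recognition
    # purpose image lies at the edge of the captured image.
    edge = [(600 + 5 * i, 400 + 5 * j) for i in range(4) for j in range(4)]
    print(grid_coverage(spread, 640, 480))
    print(grid_coverage(edge, 640, 480))
```

Both point sets contain the same number of points, so the difference in coverage reflects only their distribution, which is the factor that degrades the calculation accuracy in the edge case.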