The present disclosure relates to the field of digital image analysis, and more particularly to automatically extracting data from a digital image providing a graphical representation of quantitative data.
Graphical representations such as line graphs and scatter plots are commonly used in academic, scientific, financial, patent and other analytical documents in order to graphically represent quantitative data. As examples, the results of scientific experiments, mechanical processes or performance of businesses may be summarized using such graphical representations. Documents, printed as well as digital ones, generally provide the underlying quantitative data only in the form of the respective graphical representation for illustrative purposes, but not in other data formats such as tables. Extracting the underlying quantitative data directly from a graphical representation may thus provide access to knowledge otherwise inaccessible and be of high importance for enabling a quantitative analysis of the underlying data.
Such graphical data representations, when available only as digital images, e.g. bitmap images, may comprise a broad variety of elements including lines, markers and text which are used in order to represent and characterize quantitative data. Since those elements are rasterized, specific image analysis techniques may be required in order to identify these elements and extract the informational content represented by them. In order to e.g. run statistical analyses, identify trends, forecast future behaviors, simulate models or compare their own data, e.g. experimental results, with data published in a document, engineers, researchers, scientists, financial analysts and other users may need access to the quantitative data provided by such graphical representations in a form that allows a computer to process it.
Semi-automatic methods for extracting quantitative data are known, but do not scale when data need to be extracted from large numbers of documents, as provided for example by large digital libraries.
A. Baucom and C. Echanique, “ScatterScanner: Data Extraction and Chart Restyling of Scatterplots,” in Conference on Human Factors in Computing Systems (CHI'13), 2013, pp. 1-8, describe a method for interactively redesigning scatter plots in order to adjust their design. In view of the significant variability in the possible types of plots and their design, several assumptions are made in order to reduce the degree of complexity to be handled. The method is thus limited to clean scatter plots, without gridlines or text annotations, containing only one data series represented with simple shape markers and plotted in the first quadrant of the Cartesian plane.
S. R. Choudhury and C. L. Giles, “An Architecture for Information Extraction from Figures in Digital Libraries,” in WWW '15 Companion Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 667-672, describe a method for information extraction from figures in digital libraries. The method comprises semi-automatic numerical data extraction from figures. The data extraction is not performed automatically, since the user needs to indicate the beginning and ending points of the x- and y-axes, with the mouse clicks and the axis scales being recorded. Then, curves plotted in different colors are extracted, but binary or grayscale curves pose greater challenges and require user input.
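By way of illustration only, the kind of axis calibration that such semi-automatic approaches rely on may be sketched as follows. The sketch is not taken from the cited reference; it assumes linear axes, and the function names and the numeric calibration values are hypothetical. Two reference points per axis, e.g. obtained from user clicks, together with the data values they stand for, define a linear mapping from pixel coordinates to data coordinates.

```python
def make_axis_mapper(pixel_start, pixel_end, value_start, value_end):
    """Return a function mapping a pixel coordinate to a data value.

    Assumes a linear axis; a logarithmic axis would need a different mapping.
    """
    if pixel_end == pixel_start:
        raise ValueError("axis reference points must differ")
    scale = (value_end - value_start) / (pixel_end - pixel_start)

    def to_value(pixel):
        return value_start + (pixel - pixel_start) * scale

    return to_value


# Hypothetical calibration: pixel positions of the axis end points
# (e.g. from user clicks) and the data values they represent.
x_map = make_axis_mapper(100, 500, 0.0, 10.0)
# Image rows grow downward, so the y-axis origin has the larger pixel row.
y_map = make_axis_mapper(400, 50, 0.0, 70.0)


def pixel_to_data(px, py):
    """Convert an image position (column, row) to data coordinates (x, y)."""
    return x_map(px), y_map(py)
```

Under these assumptions, a marker detected at pixel position (300, 225) would correspond approximately to the data point (5, 35). The need to supply the reference points manually for every figure is what prevents such an approach from scaling to large document collections.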
Thus, known methods are unable to sufficiently handle a digital image comprising an arbitrary graphical representation of quantitative data, possibly comprising e.g. a grid, a legend or text annotations. Existing methods make very restrictive assumptions regarding the structure of the data representation, such as the absence of a grid or of any other element that does not represent quantitative data. A second limitation of known methods may lie in the fact that no real quantitative data in original data coordinates may be extracted, but only graphical patterns resembling the original graphical representation. Thus, there is a need for an efficient and flexible method for automatically extracting data from a digital image providing a graphical representation of quantitative data.