The operation of collecting data and rendering it as meaningful images (“Data visualization”) plays a crucial role in knowledge discovery and transfer, in both academic and industrial applications. The inverse problem, i.e. understanding a visual graphic and converting it back into the data it represents (“Chart extraction”), is an essential business and scientific intelligence process that so far only humans can undertake. In secondary market research, analysts search the web and document reports from public and private sources to identify, extract and aggregate information into easy-to-consume visual forms. Likewise, the ability to retrieve the actual data contained in charts and graphs is very useful in the field of cloud storage, where it helps professional analysts within companies and enterprises to easily identify, extract and repurpose information from one document to another.
Although several academic papers and patents have been published on the subject of chart classification or search, the methodologies they teach make use of only the chart images, while little or none of the information within the graphic is extracted or interpreted automatically. This limits the ability to leverage the information entrapped in the charts for improving document search and classification in emerging digital data storage, such as the cloud storage industry. The chart extraction process remains at a prototype stage. Prior algorithms for performing data extraction rely on several simplistic assumptions concerning the shape of the examined chart and are based on arbitrary thresholds, which restrict their applicability to a few “ideal” cases.
As one example, previous work on pie chart extraction is suitable only for elliptical or 3D charts with precise and well-defined contours. However, in real cases, it is very common to find “exploded” pie charts (i.e. with one or more slices detached from the others), donut charts (i.e. annular charts with a hole in the center), multi-series pie charts, and any possible combination of the pie chart categories mentioned above.
Bar chart extraction has received more attention, but the algorithms have been developed by leveraging small sample sets that drastically underestimate the actual variety of charts found in information graphics. The fields of application of information graphics are varied and include marketing research reports, financial and econometrics reports, as well as the World Wide Web in its entirety (e.g. web blogs, web publications, web social media (Stock twitter) and many others). The algorithms for bar chart data extraction are typically based on simple heuristics and still lack the machine intelligence needed to account for the multitude of real-life charts encountered in digital documents and on the web.
Some prior works exploit much bigger datasets and deep learning models such as Convolutional Neural Networks, but they focus only on the extraction of high-level information, such as chart type, axis titles or value ranges. The main target of these works is to build high-level query answering systems, rather than to achieve a detailed data extraction and interpretation of the visual graphic.
In known prior art, it is often assumed that the image is of high quality or in vector format, so that the text is perfectly readable and there are no compression artifacts. Conversely, in real applications it is very common to deal with images degraded by lossy compression or color quantization (e.g. JPEG, GIF), or with small images from which the entrapped graphical information is difficult to extract.
As the algorithms reported in prior art are not meant for real industrial applications, such as batch processes or real-time software as a service (SaaS) applications, little investigation has been done on how to achieve the best data capture precision in the smallest amount of time. Some prior art efforts report an average of 1225 seconds to extract a pie chart and 100 seconds for a bar chart. Such processing times are orders of magnitude above what a real industrial application demands and are therefore not acceptable. Most of the preexisting methods are restricted to the realm of pure proof-of-concept demonstration.
Prior art has not paid enough attention to text extraction and text-data aggregation. This is because it is often assumed that the final result may be manually adjusted by user intervention, whereas in a real application scenario manual intervention by a human being unavoidably and drastically reduces the industrial benefit of the extraction process. A real application in the field of marketing research, financial research, econometrics research and, more broadly, information search and retrieval requires a fully automatic system. If a human being has to manually edit each extracted pie or bar chart, the actual process benefit in terms of time and manpower saved becomes marginal, and adoption of such technology would be low.
The invention is directed to overcoming one or more of the problems and solving one or more of the needs set forth above. The invention provides a fast and fully automated system for information graphic data capture and information graphic understanding, suitable for real applications.