A PDF document is based on the PostScript imaging model, so a PDF can faithfully reproduce every character, color and image of the original manuscript on any printer. Because PDF is independent of the operating system platform, it is a widely used and well-suited document format for electronic document distribution and digital information dissemination.
Although a PDF document can display the layout accurately, the structural information in the PDF, in particular the table information, is not effectively recorded or stored, which makes it difficult to restore the tables in the PDF. One currently used method is to collect the cutting areas within the table area directly from the current page, filter them to remove duplicated and invalid cutting areas, and convert the remaining cutting areas one-to-one into corresponding cells. This method has two disadvantages. First, the set of cutting areas may be incomplete, so some cells are absent from the parsed result. Second, a cutting area may be delimited wrongly, for example one cutting area may be split into two, or two cutting areas may be merged into one, so the parsed cells are wrong. To address these disadvantages, another method obtains the cells in the table area with a line-based approach, which applies both to PDFs generated from Word and to PDFs generated by other means. This method first collects all horizontal and vertical lines in the table area, then obtains the intersection points of these lines and records the coordinates (in the x direction and the y direction) of each point, and finally determines the four corner points of each cell from the coordinates of all points to obtain the final cells. However, because the drawn lines may contain errors, some cells may still be missed.
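The line-based method described above can be illustrated with a minimal sketch. It is not the implementation of any particular parser. The line representation, the function names `find_intersections` and `find_cells`, and the rule that a cell is bounded by the nearest intersection to the right and the nearest intersection below are all assumptions made for illustration, and the sketch handles only a regular grid without merged cells.

```python
from typing import List, Set, Tuple

# Hypothetical representations, chosen only for this sketch.
HLine = Tuple[float, float, float]        # (y, x_start, x_end)
VLine = Tuple[float, float, float]        # (x, y_start, y_end)
Point = Tuple[float, float]               # (x, y)
Cell = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def find_intersections(hlines: List[HLine], vlines: List[VLine]) -> Set[Point]:
    """Return every point where a horizontal line crosses a vertical line."""
    points: Set[Point] = set()
    for y, hx0, hx1 in hlines:
        for x, vy0, vy1 in vlines:
            # The lines cross only if each one spans the other's coordinate.
            if hx0 <= x <= hx1 and vy0 <= y <= vy1:
                points.add((x, y))
    return points


def find_cells(points: Set[Point]) -> List[Cell]:
    """Build a cell from each point that has all four corners present."""
    cells: List[Cell] = []
    for x0, y0 in sorted(points):
        rights = [x for x, y in points if y == y0 and x > x0]
        belows = [y for x, y in points if x == x0 and y > y0]
        if not rights or not belows:
            continue  # no neighbor to the right or below: not a top-left corner
        x1, y1 = min(rights), min(belows)
        # The diagonal corner must also be an intersection point.
        if (x1, y1) in points:
            cells.append((x0, y0, x1, y1))
    return cells


# A 2 x 2 table drawn with three horizontal and three vertical lines.
grid_h = [(0.0, 0.0, 20.0), (10.0, 0.0, 20.0), (20.0, 0.0, 20.0)]
grid_v = [(0.0, 0.0, 20.0), (10.0, 0.0, 20.0), (20.0, 0.0, 20.0)]
complete = find_cells(find_intersections(grid_h, grid_v))

# Same table, but the middle vertical line stops one unit short of the
# bottom border, a drawing error of the kind described above.
short_v = [(0.0, 0.0, 20.0), (10.0, 0.0, 19.0), (20.0, 0.0, 20.0)]
damaged = find_cells(find_intersections(grid_h, short_v))

print(len(complete), len(damaged))
```

With the correctly drawn lines the sketch recovers all four cells. With the shortened middle line, the intersection at the bottom border is lost and the two bottom cells are no longer found, which is the kind of missed cell that the line-based method suffers from when the drawn lines contain errors.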