In semi-structured documents such as checks, business cards, passports (of different countries), and credit cards, the location of the fields commonly varies from copy to copy. The known methods of identifying fields in such documents are based on a "greedy algorithm": all the fields are searched for in the text in a given order, and once a fragment of the text is identified as a field, that fragment is not considered in subsequent search procedures. This approach places strict demands on the quality of the first field search procedures and degrades the quality of the subsequent ones. The first field search procedure decides whether a text fragment is the sought field of a semi-structured document without any information about the results of subsequent search procedures or about the document as a whole. As a result, the fields are often identified incorrectly.
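The failure mode of the greedy approach can be illustrated with a minimal sketch. The detectors `looks_like_phone` and `looks_like_fax`, their scores, the threshold, and the sample fragments are all hypothetical and chosen only for illustration; they are not taken from any known method.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical detectors: each returns a confidence in [0, 1]
# that the fragment is the field it searches for.
def looks_like_phone(fragment: str) -> float:
    digits = sum(ch.isdigit() for ch in fragment)
    return 0.6 if digits >= 7 else 0.0

def looks_like_fax(fragment: str) -> float:
    digits = sum(ch.isdigit() for ch in fragment)
    if digits < 7:
        return 0.0
    return 0.9 if fragment.lower().startswith("fax") else 0.5

def greedy_assign(
    fragments: List[str],
    detectors: List[Tuple[str, Callable[[str], float]]],
    threshold: float = 0.5,
) -> Dict[str, str]:
    """Search for the fields in a fixed order; a fragment accepted as one
    field is excluded from all subsequent search procedures."""
    assigned: Dict[str, str] = {}
    used = set()
    for field, detect in detectors:
        for i, fragment in enumerate(fragments):
            if i in used:
                continue
            if detect(fragment) >= threshold:
                assigned[field] = fragment
                used.add(i)
                break  # first acceptable fragment wins
    return assigned

fragments = ["Fax 555 010 0199", "555 010 0123"]
detectors = [("phone", looks_like_phone), ("fax", looks_like_fax)]
result = greedy_assign(fragments, detectors)
print(result)
# {'phone': 'Fax 555 010 0199', 'fax': '555 010 0123'}
```

In this sketch the phone search runs first and claims the fragment that begins with "Fax", so the fax search is left with the remaining number and the two fields end up swapped, even though the fax detector would have scored the first fragment much higher.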
To solve this problem, we propose the method described herein, which uses a graph structure. The graph enables us to save the results of all search procedures and to examine different combinations of the fields during further analysis of the search results. In addition, our method allows the field search procedures to be organized as a cascade classification, which saves computational resources because only the required number of features is calculated. Our method also uses a reduced alphabet technique for generating dictionaries of keywords, which decreases the number of errors in field identification made by the search procedures that employ those dictionaries.
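The idea of saving the results of all search procedures and then examining combinations of fields can be sketched as follows. This is only an illustration under stated assumptions: the candidate table, the scores, and the function `best_combination` are hypothetical, and the exhaustive enumeration shown here is a simple stand-in for analysis of the graph, not the graph-based procedure itself.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# A candidate is (fragment index, score). Every search procedure records
# all of its candidates instead of committing to the first acceptable one.
Candidate = Tuple[Optional[int], float]

def best_combination(candidates: Dict[str, List[Candidate]]) -> Dict[str, int]:
    """Pick the assignment of fragments to fields with the highest total
    score, such that no fragment is used for more than one field."""
    fields = list(candidates)
    # Each field may also stay unassigned: (None, 0.0).
    options = [candidates[f] + [(None, 0.0)] for f in fields]
    best_score = -1.0
    best: Dict[str, int] = {}
    for combo in product(*options):
        used = [idx for idx, _ in combo if idx is not None]
        if len(used) != len(set(used)):
            continue  # the same fragment claimed by two fields
        total = sum(score for _, score in combo)
        if total > best_score:
            best_score = total
            best = {f: idx for f, (idx, _) in zip(fields, combo) if idx is not None}
    return best

# Hypothetical saved results for two fragments:
# fragment 0 = "Fax 555 010 0199", fragment 1 = "555 010 0123"
saved = {
    "phone": [(0, 0.6), (1, 0.6)],
    "fax": [(0, 0.9), (1, 0.5)],
}
print(best_combination(saved))
# {'phone': 1, 'fax': 0}
```

Because no fragment is removed from consideration before all candidates are known, the strong fax evidence for fragment 0 can outweigh the order in which the search procedures happened to run. Exhaustive enumeration grows quickly with the number of fields and candidates, so it is suitable only for small illustrations such as this one.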