Various techniques are conventionally known for improving the efficiency of transcription work. For example, in one known technique, each of the plural character strings constituting voice text data, which is obtained by performing a voice recognition process on voice data, is displayed on a screen in association with the position of that character string in the voice data (the playback position). In this technique, when a character string on the screen is selected, the voice data is played back from the playback position corresponding to the selected character string. A user (transcription worker) therefore selects a character string and corrects it while listening to the voice data.
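The association between character strings and playback positions described above can be pictured with a minimal sketch. It is an illustration only, not the implementation of any particular product: the names `RecognizedString`, `start_sec`, and `playback_position_for` are hypothetical, and the sample strings and times are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RecognizedString:
    """One character string of the voice text data (hypothetical structure)."""
    text: str
    start_sec: float  # assumed playback position of this string in the voice data


def playback_position_for(strings: List[RecognizedString], selected_index: int) -> float:
    """Return the playback position associated with the selected character string."""
    if not 0 <= selected_index < len(strings):
        raise IndexError("selected_index is outside the displayed character strings")
    return strings[selected_index].start_sec


# Invented sample data: three recognized strings with assumed start times.
voice_text = [
    RecognizedString("good", 0.0),
    RecognizedString("morning", 0.4),
    RecognizedString("everyone", 1.1),
]

# Selecting the second string on the screen would start playback at its position.
resume_at = playback_position_for(voice_text, 1)
```

In this sketch the display layer has to keep every character string paired with its playback position, which is the bookkeeping that the conventional technique requires.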
In this conventional technique, each of the plural character strings constituting the voice text data must be displayed on the screen in association with its playback position in the voice data, which complicates the configuration of display control. Moreover, during transcription work, voice data that includes fillers or grammatical errors is rarely transcribed verbatim; the content is generally corrected or refined. Because there is consequently a large difference between the voice data and the text that the user transcribes, correcting the voice recognition result of the voice data, as in the above technique, is not necessarily efficient. Accordingly, from the viewpoint of simplifying the configuration of a transcription method, transcribing the voice data without any restriction while listening to it is preferable to correcting the voice recognition result. In this case, the user is forced to repeatedly pause and rewind the playback while transcribing. When the user resumes transcribing after a pause, it is desirable that playback be resumed from the exact position in the voice data up to which the transcription has been completed.
However, it is difficult to specify the position in the voice data up to which the transcription has been completed.