In communications and Internet applications, some documents lack punctuation and need to have it added; one example is adding punctuation to speech documents.
A conventional scheme for punctuating speech documents adds punctuation automatically based on the silent intervals that occur while the speaker is talking.
Specifically, a threshold for the length of silence is set first. If the length of a silent interval in the speech is greater than the threshold, punctuation is added at that position; if it is not greater than the threshold, no punctuation is added.
Relying solely on a silence threshold to add punctuation can produce many errors, such as wrongly placed punctuation and wrong sentence breaks. For example, if the speaker talks quickly, so that there are no pauses or the pauses are shorter than the threshold, no punctuation is added anywhere in the passage. If the speaker talks slowly, pausing noticeably after almost every character, the passage is filled with punctuation. In both situations punctuation is added incorrectly, and the accuracy of punctuation adding is low.
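The threshold scheme and its two failure modes can be illustrated with a minimal sketch. The input format (a list of token and trailing-silence pairs), the function name `punctuate_by_silence`, the 0.3-second threshold, and the use of a comma as the inserted mark are all assumptions made for illustration; the scheme described above does not specify them.

```python
from typing import List, Tuple

# Hypothetical input format: (token, seconds of silence after the token).
Token = Tuple[str, float]


def punctuate_by_silence(tokens: List[Token], threshold: float, mark: str = ",") -> str:
    """Append `mark` to every token whose trailing silence exceeds `threshold`."""
    pieces = []
    for text, silence in tokens:
        # Punctuate only when the silent interval is strictly greater than the threshold.
        pieces.append(text + mark if silence > threshold else text)
    return " ".join(pieces)


# Fast speaker: every pause is shorter than the threshold, so nothing is punctuated.
fast = [("today", 0.05), ("is", 0.04), ("sunny", 0.06), ("let's", 0.05), ("go", 0.05)]

# Slow speaker: a long pause follows every token, so every token is punctuated.
slow = [(text, 0.9) for text, _ in fast]

print(punctuate_by_silence(fast, threshold=0.3))
print(punctuate_by_silence(slow, threshold=0.3))
```

With the same threshold, the fast utterance comes back with no punctuation at all and the slow one with a mark after every token, which are the two cases described above.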
To address the low accuracy of the scheme that punctuates speech documents based on a silence-length threshold, there is an improved scheme that adds punctuation based on character segmentation and the position of each character.
In the improved scheme, the sentences in a corpus are first segmented into individual characters. The position of each character in its sentence is then determined, namely at the beginning, in the middle, or at the end of the sentence, together with the punctuation status after each character, for example whether or not punctuation follows it. A language model is established from the position of each character in the corpus and the punctuation status after each character, and the established language model is then used to add punctuation to the sentences to be processed.
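The training side of this scheme can be sketched as follows. This is only an illustration under stated assumptions: the punctuation inventory `PUNCT`, the function names, and the count-based probability estimate are hypothetical, since the scheme as described does not specify the form of the language model or how it is applied to unpunctuated text.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Assumed punctuation inventory (full-width and ASCII marks).
PUNCT = set("，。？！,.?!")

Feature = Tuple[str, str, str]  # (character, position, punctuation_after)


def extract_features(sentence: str) -> List[Feature]:
    """Split a corpus sentence into characters and record, for each one,
    its position ("begin", "middle", "end") and the mark that follows it
    ("" when no punctuation follows)."""
    chars: List[str] = []
    marks: List[str] = []
    for ch in sentence:
        if ch in PUNCT:
            if chars:
                marks[-1] = ch  # attach the mark to the preceding character
        else:
            chars.append(ch)
            marks.append("")
    last = len(chars) - 1
    features = []
    for i, (ch, mark) in enumerate(zip(chars, marks)):
        position = "begin" if i == 0 else "end" if i == last else "middle"
        features.append((ch, position, mark))
    return features


def train(corpus: List[str]) -> Dict[Tuple[str, str], Counter]:
    """Count which marks follow each (character, position) pair in the corpus."""
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for sentence in corpus:
        for ch, position, mark in extract_features(sentence):
            counts[(ch, position)][mark] += 1
    return counts


def punctuation_probability(counts, ch: str, position: str, mark: str) -> float:
    """Relative frequency of `mark` after `ch` at `position` (0.0 if unseen)."""
    seen = counts.get((ch, position))
    if not seen:
        return 0.0
    return seen[mark] / sum(seen.values())


model = train(["今天天气好，我们出去玩。", "你好。"])
print(punctuation_probability(model, "好", "middle", "，"))
print(punctuation_probability(model, "好", "end", "。"))
```

The only evidence this sketch keeps for each decision is a single character and its coarse position in the sentence.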
The improved scheme establishes its language model using only the position of a single character in a sentence and whether or not punctuation follows that character. Because the information used is limited, and because it is not closely associated with punctuation status, the established language model cannot capture the true relationship between the information in a sentence and the punctuation status of that sentence.
Because the language model used in the improved scheme does not capture the true relationship between the information in a sentence and the punctuation status of that sentence, the accuracy of punctuation adding in that scheme is low as well.