The present invention is directed to computer technology. More particularly, the invention provides systems and methods for data processing. Merely by way of example, the invention has been applied to computer files. But it would be recognized that the invention has a much broader range of applicability.
For certain applications in the fields of communications and the Internet, punctuations often need to be added to files that lack punctuation. For example, punctuations can be added to certain voice files.
A conventional approach for adding punctuations to a voice file is based on character separation and character positions. Specifically, sentences in a corpus are divided into characters and the position of each character in the sentences is determined, e.g., at the beginning of a sentence, in the middle, or at the end of the sentence. Further, a punctuation state associated with each character is determined, e.g., whether there is a punctuation that follows the character. A language model is established based on the position of each character in the corpus and the punctuation state associated with the character. When adding punctuations to a voice file, the voice file is processed as a whole, and punctuations can be added based on the characters in the voice file and the language model.
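Merely by way of illustration, the conventional approach described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular system: the language model is assumed to be a simple frequency table, the position labels are assumed to be "begin", "middle", and "end", and all names (PUNCTUATION, position_label, extract_features, train, predict_state) are hypothetical.

```python
from collections import Counter, defaultdict

# Assumption: a small, illustrative set of punctuation marks.
PUNCTUATION = set(",.?!;:")


def position_label(index, length):
    """Return a coarse position of a character within its sentence."""
    if index == 0:
        return "begin"
    if index == length - 1:
        return "end"
    return "middle"


def extract_features(sentence):
    """Return (character, position, punctuation_state) triples for a sentence.

    The punctuation state of a character is the punctuation mark that
    follows it, or "none" if no punctuation follows.
    """
    chars, states = [], []
    for ch in sentence:
        if ch.isspace():
            continue  # whitespace is ignored in this sketch
        if ch in PUNCTUATION:
            if chars:
                states[-1] = ch  # attach the mark to the preceding character
        else:
            chars.append(ch)
            states.append("none")
    n = len(chars)
    return [
        (c, position_label(i, n), s)
        for i, (c, s) in enumerate(zip(chars, states))
    ]


def train(corpus):
    """Count punctuation states for each (character, position) pair."""
    model = defaultdict(Counter)
    for sentence in corpus:
        for ch, pos, state in extract_features(sentence):
            model[(ch, pos)][state] += 1
    return model


def predict_state(model, ch, pos):
    """Return the most frequent punctuation state seen for (ch, pos)."""
    counts = model.get((ch, pos))
    if not counts:
        return "none"  # unseen pair: assume no punctuation follows
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    model = train(["hi, you.", "hi."])
    print(predict_state(model, "i", "middle"))
    print(predict_state(model, "u", "end"))
```

In this sketch the only evidence available to the model is the identity of a character and its coarse position, which corresponds to the limited information the conventional approach relies on.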
But only limited information is used to establish the language model, which relies on character positions and the punctuation states associated with the characters, and the information used is not closely related to the punctuation states. Thus, such a language model often fails to capture the actual relationship between information associated with the sentences in the voice file and the punctuation states of the sentences. Furthermore, because the voice file is simply taken as a whole, its internal structural characteristics are not considered when punctuations are added. In view of the above-noted factors, the conventional approach for punctuation addition often has low accuracy.
Hence, it is highly desirable to improve the techniques for punctuation addition.