In a karaoke process, an accompaniment is played at the host end while the singing sounds of the host are captured. The audio recorded in the karaoke device therefore includes both the singing sounds of the host and the played accompaniment. The karaoke device needs to combine the captured singing sounds with the accompaniment to obtain the final singing audio. During this combination, the singing sounds must keep pace with the accompaniment at each playback time point. Otherwise, a delay of the singing sounds relative to the accompaniment may cause a dual sound phenomenon, and to a listener it sounds as if the host is not on the beat. To resolve the dual sound problem, delay prediction may be performed, and during combination, delay compensation is then performed on the singing sounds by using the predicted delay value, so that the singing sounds keep pace with the accompaniment at each playback time point.
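The compensation step described above can be illustrated with a minimal sketch. It assumes both tracks are lists of samples at the same sampling rate and that the predicted delay has already been converted to a whole number of samples. The function name `compensate_and_mix` and its parameters are hypothetical and do not come from the source.

```python
def compensate_and_mix(vocal, accompaniment, delay_samples):
    """Advance the vocal by delay_samples, then sum it with the accompaniment.

    Illustrative only: assumes equal sampling rates and a non-negative
    integer delay expressed in samples.
    """
    if delay_samples < 0:
        raise ValueError("delay_samples must be non-negative")
    # Drop the leading samples so that vocal[n + delay] lines up
    # with accompaniment[n].
    aligned = vocal[delay_samples:]
    # Pad with silence, then trim, so both tracks have equal length.
    aligned = aligned + [0.0] * (len(accompaniment) - len(aligned))
    aligned = aligned[:len(accompaniment)]
    # Mix by sample-wise addition (no gain control in this sketch).
    return [a + v for a, v in zip(accompaniment, aligned)]
```

For example, with a vocal track that lags by two samples, `compensate_and_mix([0, 0, 1, 2, 3], [10, 20, 30], 2)` aligns the first voiced sample with the first accompaniment sample before summing. A practical mixer would also handle fractional delays and clipping, which this sketch omits.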
Currently, delay compensation is performed on the singing sounds mainly by using methods based on time domain prediction, such as an energy method, an autocorrelation method, or a contour method. Although these methods can reduce the delay to some extent, their anti-noise performance is relatively poor. Consequently, the predicted delay value is not accurate, and the delay compensation effect is unsatisfactory.
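The source does not specify how these time domain methods are implemented. As a rough illustration of the general idea only, the following sketch estimates a delay by searching for the lag that maximizes the time domain cross-correlation between the accompaniment and the captured audio. The function name `estimate_delay` and the parameter `max_lag` are assumptions made for this example.

```python
def estimate_delay(reference, captured, max_lag):
    """Return the lag in [0, max_lag], in samples, that maximizes the
    cross-correlation between reference and captured.

    Illustrative time domain estimator, not the method of any
    particular karaoke device.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Number of samples for which reference[i] and captured[i + lag]
        # both exist.
        n = min(len(reference), len(captured) - lag)
        if n <= 0:
            break
        # Correlation score at this candidate lag.
        score = sum(reference[i] * captured[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Because the score is a plain sum of sample products, additive noise in the captured audio perturbs every candidate lag, which is consistent with the limited anti-noise performance noted above.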