With the development of mobile terminal technologies, there is a growing demand for intelligent mobile terminals. When voice data is played, one measure of this intelligence is whether the mobile terminal can automatically select a play mode according to an operation of the user. For example, a conventional mobile terminal is provided with an infrared sensor disposed on either side of, or in a groove of, an earphone of the mobile terminal. When the voice data is played, the infrared sensor transmits an infrared signal to detect a distance between the user and the mobile terminal, specifically a distance between the user and a screen surface of the mobile terminal. If the detected distance is smaller than a preset threshold, the mobile terminal determines that the user has brought the mobile terminal close to the ear of the user, and switches to an earphone play mode, thereby using the earphone to output the voice data. If the detected distance is not smaller than the preset threshold, the mobile terminal enters a speaker play mode and plays the voice data through the speaker. However, the conventional mobile terminal controls the play mode based only on the distance between the user and the mobile terminal, which may cause many unintended operations. For example, if a hand of the user accidentally approaches the screen surface of the mobile terminal, or a finger of the user accidentally covers the screen surface, an unnecessary switchover of the play mode is triggered. As a result, the accuracy of play mode control is reduced, system resources are wasted, and the intelligence of the mobile terminal is degraded.
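The distance-only control described above can be illustrated with a minimal sketch. It is not taken from any actual terminal implementation: the names `PlayMode`, `select_play_mode`, and `count_switchovers`, the 5.0 cm threshold, and the sample readings are all assumptions chosen for illustration. The sketch shows how a single accidental near reading, such as a hand passing over the screen surface, may trigger two unnecessary switchovers.

```python
from enum import Enum


class PlayMode(Enum):
    EARPHONE = "earphone"
    SPEAKER = "speaker"


# Hypothetical value; the description only refers to "a preset threshold".
PRESET_THRESHOLD_CM = 5.0


def select_play_mode(distance_cm, threshold_cm=PRESET_THRESHOLD_CM):
    """Distance-only rule: below the threshold use the earphone, otherwise the speaker."""
    if distance_cm < threshold_cm:
        return PlayMode.EARPHONE
    return PlayMode.SPEAKER


def count_switchovers(readings_cm, threshold_cm=PRESET_THRESHOLD_CM):
    """Count how many times the play mode changes over a sequence of readings."""
    mode = PlayMode.SPEAKER  # assumed starting mode
    switchovers = 0
    for distance_cm in readings_cm:
        new_mode = select_play_mode(distance_cm, threshold_cm)
        if new_mode != mode:
            switchovers += 1
            mode = new_mode
    return switchovers


# Terminal lying on a table; a hand passes over the screen for one reading.
readings = [30.0, 30.0, 1.0, 30.0]
print(count_switchovers(readings))  # prints 2
```

Because the rule considers only the distance, it cannot distinguish the hand in this example from the ear of the user, so the terminal switches to the earphone play mode and back again even though the user never intended to change the play mode.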