Terminal devices such as mobile phones, wearable devices, and robots all need to recognize multiple types of objects, voices, and actions from perception data such as images, videos, and voices. For example, to search by taking a photo, a mobile phone first needs to recognize a target object in the photo, and can then search for information related to the target object. As another example, to grab a target object, a robot first needs to obtain the location of the target object in its surrounding environment using camera data.
A general method for equipping a terminal device with extensive recognition capabilities is to train, with massive amounts of known sample data, a perception model that can distinguish between various objects, voices, or actions. The terminal device may then obtain, through computation based on the trained perception model, a corresponding recognition result for each newly input image, video, or voice.
As the number of types that need to be recognized increases, the perception model used to recognize perception data becomes increasingly complex in order to maintain recognition accuracy. In particular, the quantity of parameters in a perception model keeps growing. For example, a convolutional neural network (CNN) model used for image recognition already has tens of millions or even hundreds of millions of parameters. In addition, to improve user experience, a perception model in many applications needs to precisely recognize a large quantity of objects, actions, or voices in various given scenarios, which poses a great challenge to the accuracy of the perception model. In some approaches, a single perception model with fixed parameters is used to complete all recognition tasks. As a result, the complexity of the perception model grows without limit as recognition requirements are refined, which poses a great challenge to storage and computation.
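The scale of the storage challenge can be illustrated with a rough parameter count. The following is a minimal sketch only: the layer layout is an assumption modelled on a VGG-16-style image classifier, which is not a model described in this document, and the helper names (conv_params, fc_params, total_params, storage_bytes) are hypothetical. It counts weights and biases per layer and converts the total into a storage size at 4 bytes per parameter (32-bit floating point).

```python
# Illustrative layer layout (an assumption, modelled on a VGG-16-style
# network; it is not a model described in this document).
# Each convolutional entry is (input channels, output channels), 3x3 kernels.
CONV_LAYERS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]
# Each fully connected entry is (input units, output units).
# The last layer has one output per recognizable class (1000 here).
FC_LAYERS = [(25088, 4096), (4096, 4096), (4096, 1000)]


def conv_params(in_ch, out_ch, kernel=3):
    """Weights plus biases of one convolutional layer."""
    return kernel * kernel * in_ch * out_ch + out_ch


def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def total_params(conv_layers=CONV_LAYERS, fc_layers=FC_LAYERS):
    """Total parameter count over all listed layers."""
    conv = sum(conv_params(i, o) for i, o in conv_layers)
    fc = sum(fc_params(i, o) for i, o in fc_layers)
    return conv + fc


def storage_bytes(num_params, bytes_per_param=4):
    """Storage needed for the parameters alone (default: 32-bit floats)."""
    return num_params * bytes_per_param


if __name__ == "__main__":
    n = total_params()
    print(f"parameters: {n:,}")
    print(f"storage at 4 bytes each: {storage_bytes(n) / 2**20:.1f} MiB")
```

Under these assumptions the model has roughly 138 million parameters and needs more than 500 MiB for its parameters alone, before counting the memory used during computation. Because the final layer has one output per class, widening the set of recognizable types also adds parameters directly.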