The wide application of emerging network parallel technologies has been accompanied by the emergence of a new computer network concept, the coflow. A coflow is defined as a set of data streams that have a semantic relationship or a correlation with one another. Because the data streams in a coflow usually belong to the same task, the coflow has a consistent requirement for network service performance: either the completion time of the last data stream to finish is minimized, or all data streams in the coflow are transmitted within the same time limit.
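For illustration only, the performance requirement described above can be expressed as a per-coflow completion time that is governed by the last data stream to finish. The following minimal Python sketch assumes a hypothetical `Flow` record and measures the completion time from the earliest start in the coflow; neither the record layout nor that choice of reference point comes from the solutions discussed here.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    """Hypothetical record of one data stream (times in seconds)."""
    coflow_id: str
    start: float
    finish: float


def coflow_completion_times(flows):
    """Map each coflow to (latest finish - earliest start) over its streams."""
    first_start = {}
    last_finish = {}
    for f in flows:
        first_start[f.coflow_id] = min(first_start.get(f.coflow_id, f.start), f.start)
        last_finish[f.coflow_id] = max(last_finish.get(f.coflow_id, f.finish), f.finish)
    return {cid: last_finish[cid] - first_start[cid] for cid in first_start}


flows = [
    Flow("A", 0.0, 1.0),  # finishes quickly
    Flow("A", 0.0, 5.0),  # the slow stream decides when task A can proceed
    Flow("B", 1.0, 3.0),
]
times = coflow_completion_times(flows)
print(times)
```

In this example the first stream of coflow A finishes after one second, but the coflow as a whole is not complete until the second stream finishes.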
The emergence of the coflow concept brings both a significant opportunity and a significant challenge to network services. The opportunity arises because existing network scheduling algorithms usually use a single data stream as the scheduling unit and therefore do not make full use of the semantic relationships between data streams. Consequently, stream-based scheduling can optimize only stream-level performance indicators and is not effective in a cluster computing scenario. This is because, in a cluster computing scenario, a computing task can enter its next step only after all data streams belonging to that task have arrived at the destination terminal. With a scheduling algorithm that uses a data stream as the unit and ignores the semantic relationships between data streams, the first several data streams of a task may be transmitted very quickly while the last data stream of the same task experiences a very long delay. In this case, from the perspective of the terminal application, network service quality is poor. If the scheduling algorithm instead takes the synergistic relationship between data streams into account and schedules all data streams in a coflow as a whole, the data streams belonging to the same task can be transmitted within a proper time interval, so that computation of the terminal application can enter its next phase in time.
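The difference between stream-level and coflow-level scheduling can be illustrated with a toy model. The sketch below assumes a single link of unit rate that serves one data stream at a time, and compares two illustrative orderings: shortest stream first, and all streams of the coflow with the smallest total size first. The link model, the stream sizes, and both ordering policies are assumptions made for this example and are not taken from any particular scheduler.

```python
def schedule(flows, key):
    """Serve streams one at a time on a unit-rate link, in sorted order.

    Returns the time at which the last stream of each coflow finishes.
    """
    clock = 0.0
    done = {}
    for coflow_id, size in sorted(flows, key=key):
        clock += size
        done[coflow_id] = clock  # clock only grows, so the last write is the maximum
    return done


# (coflow id, stream size); coflow A has two short streams and one long one.
flows = [("A", 1), ("A", 1), ("A", 10), ("B", 3), ("B", 3)]

# Stream-level policy: shortest stream first, ignoring coflow membership.
per_stream = schedule(flows, key=lambda f: f[1])

# Coflow-level policy: serve the coflow with the smallest total size first.
totals = {}
for coflow_id, size in flows:
    totals[coflow_id] = totals.get(coflow_id, 0) + size
per_coflow = schedule(flows, key=lambda f: totals[f[0]])

print(per_stream, per_coflow)
```

Under the stream-level policy the two short streams of coflow A finish at times 1 and 2, yet coflow A completes only at time 18 and coflow B is held back until time 8. Treating each coflow as a whole lets coflow B complete at time 6 without delaying coflow A.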
However, the challenge brought by the emergence of the coflow is that coflow information usually cannot be directly and explicitly obtained from a header of a data stream. This is because a terminal application that generates the data stream usually does not provide any explicit information in the header of the data stream for coflow identification.
Currently, in an existing technical solution, a correlation relationship between active data streams in a network is identified by means of clustering. In this solution, a kernel of a terminal application that generates the data streams does not need to be modified, and the terminal application does not need to explicitly provide any coflow-related or task-related information to a network provider. Instead, the technical solution is based on the following principle: data streams belonging to a same coflow are usually sent at extremely close time points. In the technical solution, the sending time point of each data stream is extracted as a feature, the data streams are clustered by using a k-means algorithm, and the data streams in the network are then scheduled by a scheduling algorithm according to the coflow information obtained by means of clustering, so that service performance of the network is improved. However, a network generates data streams extremely frequently. Even within an extremely short unit time, for example, one second, a small data center may generate tens of thousands of data streams, and these data streams do not necessarily belong to a same coflow. Therefore, if the sending time point of a data stream is used as the sole feature, clustering accuracy may be extremely low, because data streams that have no semantic relationship may be classified into a same coflow merely because their sending time points are extremely close.
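The weakness described above can be reproduced on a very small scale. The following sketch runs a plain one-dimensional k-means (Lloyd's algorithm) on sending time points only. The function `kmeans_1d`, its evenly spaced initial centroids, the labels `X` and `Y`, and the millisecond timestamps are all assumptions made for illustration; the sketch is not the implementation used in the existing solution.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on scalars; centroids start evenly spaced (needs k >= 2)."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Move each centroid to the mean of its members; keep it if it has none.
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return assignment


# (true coflow label, sending time in ms): two tasks start in the same instant.
flows = [("X", 0), ("Y", 1), ("X", 2), ("Y", 3), ("X", 4), ("Y", 5)]

assignment = kmeans_1d([t for _, t in flows], k=2)

# Collect the true labels that end up in each cluster.
clusters = {}
for (label, _), cluster in zip(flows, assignment):
    clusters.setdefault(cluster, set()).add(label)

print(assignment, clusters)
```

Even though the number of clusters matches the true number of coflows, both resulting clusters contain data streams from both `X` and `Y`, because the sending time point alone carries no information that separates the two tasks.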