CN116663633A - Data processing method and related equipment - Google Patents

Data processing method and related equipment

Info

Publication number
CN116663633A
Authority
CN
China
Prior art keywords
model
sub
network
computing node
input data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210138879.0A
Other languages
Chinese (zh)
Inventor
赵世雄
陈旭升
崔鹤鸣
王森
陈力
张弓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202210138879.0A
Publication of CN116663633A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022 Mechanisms to release resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multi Processors (AREA)

Abstract

The application relates to the field of artificial intelligence and discloses a data processing method applied to a target computing node. The computing node cluster to which the target computing node belongs can train a super-network in parallel. The training samples of the super-network comprise a first batch and a second batch, and in the process of training the super-network, the forward propagation process according to the first batch is configured to be performed after the forward propagation (feed-forward) process according to the second batch. The method comprises the following steps: the target computing node processes the first input data through the first sub-model if the updating of the parameters of the second sub-model according to the second input data has been completed. The method and the device can improve the model precision of the finally trained super-network and the certainty of the training result.

Description

Data processing method and related equipment
Technical Field
The application relates to the field of artificial intelligence, in particular to a data processing method and related equipment.
Background
Artificial intelligence (artificial intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of sensing, reasoning and decision-making.
In recent years, neural networks have developed rapidly, and in some areas deep neural networks have already surpassed humans. However, in practical applications, because application scenarios, data sets, deployment devices, index requirements and the like differ, experienced experts often need to spend a great deal of time and effort to construct a neural network suited to the application environment. In order to improve the efficiency of building a neural network, it has been proposed to design the neural network by using neural network structure search (neural architecture search, NAS) to obtain a neural network that satisfies the application environment.
Neural network structure search can automatically search, on a specific data set, for a neural network that meets specific constraint conditions and achieves a specific target; that is, a user can complete the process of modeling with a deep neural network without scenario experience or deep-learning knowledge and skills.
In order to reduce the significant amount of time previously spent by NAS, the latest NAS algorithms combine all deep neural network (deep neural network, DNN) structures in the search space into one network covering all DNN layers. This huge and intricate network is also called a super network (super net); only this one neural network (i.e., the super net) needs to be trained, and any DNN structure in the entire search space can then be evaluated through it.
Super-network training refers to the process of performing parameterized training on all candidate units in the super-network. The super-network is essentially a (more complex) DNN, so its training iterations are similar to those of a single DNN: the network is fitted to a certain data set, and the training process consists of forward propagation and backward propagation iterations on groups of data (i.e., data batches) of the data set. In the existing methods for training the super-network in parallel, the accuracy of the trained model is poor, so a method for improving the accuracy of the super-network is needed.
Disclosure of Invention
The data processing method provided by the embodiments of the application can improve the model precision of the finally trained super-network and the certainty of the training result.
In a first aspect, the present application provides a data processing method, where the method is applied to a target computing node, the target computing node belongs to a computing node cluster, the computing node cluster includes a plurality of computing nodes including the target computing node, the plurality of computing nodes are used for training a super-network in parallel, and each computing node is used for training a part of the model of the super-network; the training samples of the super-network comprise a first batch and a second batch, and in the process of training the super-network, a forward propagation process according to the first batch is configured to be performed after a forward propagation (feed-forward) process according to the second batch;
Here, parallel is in contrast to serial training, in which the computing node cluster needs to complete the forward propagation process and the backward propagation process of the training samples of the current batch before performing the training of the next batch. This dependency does not exist in parallel training, and different computing nodes in the computing node cluster can simultaneously perform the training processes of training samples of different batches.
In one possible implementation, the target computing node may be any one of a cluster of computing nodes, which may be an execution unit, such as a GPU, performing a forward propagation process or a backward propagation process of the subnet, and the specific decision of the target computing node in performing the subnet training may be controlled by other control units, such as a CPU.
The method comprises the following steps: acquiring first input data; the first input data is input of the target computing node in the forward propagation process of training the super network by the computing node cluster according to the first batch; processing the first input data through a first sub-model based on meeting a target condition to obtain first output data; the target conditions include: the target computing node finishes updating parameters of a second sub-model according to second input data, wherein the second input data is input by the target computing node in a back propagation process of training the super-network by the computing node cluster according to the second batch, the first sub-model and the second sub-model are results obtained by carrying out model searching on the same part of the super-network in a search space, the search space comprises a plurality of types of network layers, the first sub-model and the second sub-model both comprise the same type of target network layer, and the positions of the target network layer in the first sub-model and the second sub-model are the same.
In one possible implementation, a scheduling unit (e.g., a CPU) of the target computing node may control the model training process of an execution unit (e.g., a GPU) in the target computing node (e.g., when to perform the forward propagation process or the backward propagation process of certain input data) based on the causal dependencies described above. For example, after determining that the target computing node has already received, or is about to receive, the output data produced by the previous computing node in its feed-forward process, the scheduling unit may determine whether there is a causal dependency between the structure of the sub-network in the feed-forward process to be performed currently and the structure of a previously used sub-network. If there is, it may determine whether the network layer involved in the causal dependency has already completed its parameter update, that is, whether the corresponding back propagation process has been completed. If it has, the current forward propagation process may be performed; that is, the target computing node may process the input data through the sub-model it is responsible for, based on the output data of the previous computing node, to obtain a processing result. The processing result may be transferred to the next adjacent computing node, or, when the target computing node is the last computing node, the back propagation process may be performed directly.
In this way, the scheduler can ensure that the computing node strictly satisfies the causal dependency relationship when performing super-network training, which in turn improves the model precision of the finally trained super-network and the certainty of the training result.
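As an illustration of the scheduling logic described above, the following is a minimal Python sketch of the causal-dependency check; all names (Scheduler, param_update_done, wait_queue, run_forward) are assumptions made for illustration and are not taken from this application.

```python
# Minimal sketch of the causal-dependency check described above.
# All names are illustrative assumptions, not an implementation from the patent.
from collections import deque

class Scheduler:
    def __init__(self):
        self.param_update_done = {}   # sub-model id -> True once its back-prop finished
        self.wait_queue = deque()     # forward tasks blocked on a causal dependency

    def on_forward_input(self, input_data, sub_model, depends_on):
        """Called when the node receives input for a forward pass of `sub_model`.

        `depends_on` is the earlier sub-model sharing a network layer with
        `sub_model`; its parameter update must finish first."""
        if depends_on is None or self.param_update_done.get(depends_on, False):
            return self.run_forward(sub_model, input_data)
        # Target condition not met yet: park the input in the waiting queue.
        self.wait_queue.append((input_data, sub_model, depends_on))

    def on_backward_done(self, sub_model):
        """Called after the back propagation / parameter update of `sub_model`."""
        self.param_update_done[sub_model] = True
        # Re-check parked forward tasks whose dependency is now satisfied.
        still_waiting = deque()
        while self.wait_queue:
            data, model, dep = self.wait_queue.popleft()
            if self.param_update_done.get(dep, False):
                self.run_forward(model, data)
            else:
                still_waiting.append((data, model, dep))
        self.wait_queue = still_waiting

    def run_forward(self, sub_model, input_data):
        # Placeholder for handing the forward pass to the execution unit (e.g., a GPU).
        return sub_model(input_data)
```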
In existing implementations, storage management for the parameters of the super-network is not optimized, and the super-network has to be placed entirely in the memory of the training device (e.g., a GPU), which leaves the training device short of memory. When the super-network parameters occupy most of the memory of the training device, the remaining space left for computation becomes very limited, which directly forces a reduction of the batch size (the larger the batch size, the larger the buffer space required for computation), indirectly leads to low effective utilization of the training device (the larger the batch size, the higher the multi-core parallelism of the training device and the higher its overall effective utilization), and finally results in extremely low training efficiency of the prior-art scheme in super-network training.
In order to solve the above problem, in the embodiment of the application the parameters of the super-network can be stored in CPU memory; through accurate prediction of the schedule, it can be predicted that sub-network i is about to be scheduled before sub-network i is executed, and the parameters of sub-network i can be prefetched into the memory of the training device in advance.
In one possible implementation, the first sub-model in a first memory may be stored to a second memory before the first input data is processed through the first sub-model; the first memory is not in the target computing node, and the second memory is in the target computing node; and after the parameters of the first sub-model have been updated according to the first input data, the updated first sub-model is released from the second memory back to the first memory.
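The prefetch/release idea can be illustrated with a short sketch, assuming a PyTorch-style setup; the function names and the choice of torch APIs are assumptions for illustration, not an implementation prescribed by this application.

```python
# Sketch: keep super-network parameters in host (CPU) memory and copy only the
# sub-model that is about to run onto the training device, then move it back
# after its parameters are updated. Names are illustrative assumptions.
import torch

def prefetch_sub_model(sub_model: torch.nn.Module, device: str = "cuda"):
    """Copy the sub-model's parameters from host memory to device memory
    before its forward pass is scheduled."""
    return sub_model.to(device, non_blocking=True)

def release_sub_model(sub_model: torch.nn.Module):
    """After the backward pass has updated the parameters, move the updated
    sub-model back to host memory to free device memory for other sub-models."""
    sub_model.to("cpu")
    torch.cuda.empty_cache()  # release cached blocks back to the GPU
```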
In one possible implementation, the target computing node is a graphics processor GPU, a tensor processor TPU, or a neural network processor NPU, and the first memory is a memory in a central processor CPU.
In one possible implementation, the target network layer in the first sub-model is obtained after the target computing node completes updating parameters of the target network layer in the second sub-model according to the second input data. That is, the first sub-model may be obtained after updating the second sub-model.
In one possible implementation, the target condition further includes: the target computing node does not have a back propagation task to be executed for a sub-model of the super-network, the sub-model being a result of performing model searching on the same part of the model in a search space. That is, the embodiment of the present application does not require the forward propagation process of the first sub-model to be performed immediately once the parameter update of the second sub-model is completed; in some scenarios other factors need to be considered. For example, in order to maximize parallel execution, that is, to maximize resource utilization, if the target computing node has a back propagation task to be performed (for example, gradient information transmitted by a connected computing node has already been received or is predicted to arrive), the back propagation process may be performed preferentially.
In one possible implementation, the target computing node may obtain third input data (for example, a gradient); the third input data is the input of the target computing node in the back propagation process of training the super-network by the computing node cluster according to the first batch; and the parameters of the first sub-model are updated according to the third input data. That is, the priority of back-propagation tasks is higher than the priority of forward-propagation tasks. This is done in order to perform, as early as possible, the back propagation ("read and write" operation) of the sub-network waiting on the current segment, which helps remove the causal dependency of subsequent sub-networks on this sub-network, so that the subsequent scheduling space increases.
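A minimal sketch of this task-priority rule (back propagation before forward propagation) might look as follows; the class and attribute names are hypothetical.

```python
# Sketch of the priority rule described above: pending back-propagation tasks
# are drained before any pending forward-propagation task is started.
# Names (TaskPool, backward_tasks, forward_tasks) are illustrative assumptions.
from collections import deque

class TaskPool:
    def __init__(self):
        self.backward_tasks = deque()  # e.g. (sub_model, gradient) pairs
        self.forward_tasks = deque()   # e.g. (sub_model, input_data) pairs

    def next_task(self):
        # Back propagation first: updating parameters removes the causal
        # dependency that later forward passes on shared layers are waiting on.
        if self.backward_tasks:
            return ("backward", self.backward_tasks.popleft())
        if self.forward_tasks:
            return ("forward", self.forward_tasks.popleft())
        return None
```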
In one possible implementation, the method further comprises:
after the first input data is acquired and before the first input data is processed through a first sub-model, based on that the target computing node does not complete updating of parameters of a second sub-model according to second input data, the first input data is stored into a waiting queue;
said processing said first input data by a first sub-model comprising:
And acquiring the first input data from the waiting queue, and processing the first input data through a first submodel.
In one possible implementation, the first batch and the second batch are at least one of image data, text data, audio data, and video data.
In a second aspect, the present application provides a data processing apparatus, the apparatus being applied to a target computing node, wherein the target computing node belongs to a computing node cluster, the computing node cluster includes a plurality of computing nodes including the target computing node, the plurality of computing nodes are used for training a super-network in parallel, and each computing node is used for training a part of the model of the super-network; the training samples of the super-network comprise a first batch and a second batch, and in the process of training the super-network, a forward propagation process according to the first batch is configured to be performed after a forward propagation (feed-forward) process according to the second batch; the device comprises:
the acquisition module is used for acquiring first input data; the first input data is input of the target computing node in the forward propagation process of training the super network by the computing node cluster according to the first batch;
The data processing module is used for processing the first input data through a first sub-model based on the condition that the target condition is met so as to obtain first output data; the target conditions include:
the target computing node finishes updating parameters of a second sub-model according to second input data, wherein the second input data is input by the target computing node in a back propagation process of training the super-network by the computing node cluster according to the second batch, the first sub-model and the second sub-model are results obtained by carrying out model searching on the same part of the super-network in a search space, the search space comprises a plurality of types of network layers, the first sub-model and the second sub-model both comprise the same type of target network layer, and the positions of the target network layer in the first sub-model and the second sub-model are the same.
In one possible implementation, the apparatus further includes:
the data storage module is used for storing the first sub-model in the first memory to the second memory before the first input data is processed through the first sub-model; the first memory is not in the target computing node, and the second memory is in the target computing node;
And after the parameter updating of the first sub-model according to the first input data is completed, releasing the updated first sub-model from the second memory to the first memory.
In one possible implementation, the target computing node is a graphics processor GPU, a tensor processor TPU, or a neural network processor NPU, and the first memory is a memory in a central processor CPU.
In one possible implementation, the target network layer in the first sub-model is obtained after the target computing node completes updating parameters of the target network layer in the second sub-model according to the second input data.
In one possible implementation, the target condition further includes:
the target computing node does not have a back propagation task to be executed for a sub-model of the super-network, the sub-model being a result of model searching the same part of model from a search space.
In one possible implementation, the acquiring module is further configured to:
acquiring third input data; the third input data is input of the target computing node in the reverse propagation process of training the super-network by the computing node cluster according to the first batch;
The apparatus further comprises:
and the model updating module is used for updating parameters of the first sub model according to the third input data.
In one possible implementation, the model updating module is specifically configured to:
and under the condition that at least one forward propagation task to be executed aiming at the sub-model of the super network exists in the target computing node, updating the first sub-model according to the third input data before executing the forward propagation task to be executed.
In one possible implementation, the data storage module is further configured to:
after the first input data is acquired and before the first input data is processed through a first sub-model, based on that the target computing node does not complete updating of parameters of a second sub-model according to second input data, the first input data is stored into a waiting queue;
said processing said first input data by a first sub-model comprising:
and acquiring the first input data from the waiting queue, and processing the first input data through a first submodel.
In one possible implementation, the first batch and the second batch are at least one of image data, text data, audio data, and video data.
In addition, an embodiment of the application further provides a data processing method, which is applied to a target computing node, where the target computing node belongs to a computing node cluster, the computing node cluster includes a plurality of computing nodes including the target computing node, the plurality of computing nodes are used for training a super-network in parallel, and each computing node is used for training a part of the model of the super-network; the training samples of the super-network comprise a first batch and a second batch, and in the process of training the super-network, a forward propagation process according to the first batch is configured to be performed after a forward propagation (feed-forward) process according to the second batch; the target computing node comprises an execution device and a scheduling device, and the method comprises the following steps:
the scheduling device triggers the execution device to process the first input data through a first sub-model based on determining that a target condition is met, so as to obtain first output data; the first input data is input by the execution device in the forward propagation process of training the super network by the computing node cluster according to the first batch; the target conditions include:
The execution device completes parameter updating of a second sub-model according to second input data, the second input data is input by the execution device in a back propagation process of training the super-network by the computing node cluster according to the second batch, the first sub-model and the second sub-model are results obtained by carrying out model searching on the same part of the super-network in a search space, the search space comprises a plurality of types of network layers, the first sub-model and the second sub-model both comprise the same type of target network layer, and the positions of the target network layer in the first sub-model and the second sub-model are the same.
In one possible implementation, the scheduling device is a CPU and the execution device is a graphics processor GPU, a tensor processor TPU or a neural network processor NPU.
In one possible implementation, the method further comprises:
before the execution device processes the first input data through a first sub-model, the scheduling device triggers the execution device to store the first sub-model in a first memory into a second memory; the first memory is not in the execution device, and the second memory is in the execution device;
After the execution device has completed updating the parameters of the first sub-model according to the first input data, the scheduling device triggers the execution device to release the updated first sub-model from the second memory to the first memory.
In one possible implementation, the first memory is a memory in the scheduling device.
In one possible implementation, the first sub-model is obtained after the execution device completes updating parameters of the second sub-model according to the second input data.
In one possible implementation, the target condition further includes:
the executing equipment does not have a back propagation task to be executed aiming at a sub-model of the super-network, wherein the sub-model is a result obtained by carrying out model search on the same part of model in a search space.
In one possible implementation, the method further comprises:
the scheduling device triggers the execution device to acquire third input data; the third input data is input of the target computing node in the reverse propagation process of training the super-network by the computing node cluster according to the first batch; and the execution equipment updates parameters of the first sub-model according to the third input data.
In one possible implementation, the updating the first sub-model according to the third input data includes:
and under the condition that the execution device has at least one forward propagation task to be executed for the sub-model of the super-network, before executing the forward propagation task to be executed, the scheduling device triggers the execution device to update the first sub-model according to the third input data.
In one possible implementation, the method further comprises:
after the first input data is acquired and before the first input data is processed through a first sub-model, based on that the target computing node does not complete updating of parameters of a second sub-model according to second input data, the first input data is stored into a waiting queue;
said processing said first input data by a first sub-model comprising:
and acquiring the first input data from the waiting queue, and processing the first input data through a first submodel.
In one possible implementation, the first batch and the second batch are at least one of image data, text data, audio data, and video data.
In a third aspect, an embodiment of the present application provides a data processing apparatus, which may include a memory, a processor, and a bus system, where the memory is configured to store a program, and the processor is configured to execute the program in the memory, so as to perform the method according to the first aspect and any optional method thereof.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the above-described first aspect and any of its optional methods.
In a fifth aspect, embodiments of the present application provide a computer program which, when run on a computer, causes the computer to perform the above first aspect and any of its alternative methods.
In a sixth aspect, the present application provides a chip system, the chip system comprising a processor for supporting a data processing apparatus in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods. In one possible design, the chip system further includes a memory for holding program instructions and data necessary for the execution device or the training device. The chip system may be composed of chips, or may include chips and other discrete devices.
Drawings
FIG. 1 is a schematic diagram of a structure of an artificial intelligence main body frame;
FIG. 2 is a schematic illustration of a neural network search system;
FIG. 3 is a schematic illustration of a neural network search system;
FIG. 4 is a schematic illustration of a neural network search system;
FIG. 5 is a schematic illustration of a neural network search system;
FIG. 6 is a schematic diagram of a NAS architecture;
FIG. 7 is a schematic diagram of an embodiment of a super network;
FIG. 8 is a schematic diagram of parallel training of a super-network;
FIG. 9 is a schematic diagram of an application architecture of a data processing method in an embodiment of the present application;
FIG. 10 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 12 is a flow chart of a data processing method in an embodiment of the application;
FIG. 13 is a scheduling illustration of sub-network parallel training in the prior art;
FIG. 14 is a scheduling illustration of super-network parallel training;
FIG. 15 is a schematic illustration of a system provided by an embodiment of the present application;
FIG. 16 is a flow chart of an actuator according to an embodiment of the present application;
FIG. 17 is a flowchart illustrating a scheduler implementation process according to an embodiment of the present application;
FIG. 18 is a flowchart illustrating a predictor implementation procedure according to an embodiment of the present application;
FIG. 19 is a basic architecture of a system embodiment of the present application;
FIG. 20 is a diagram illustrating the call relationship between modules according to an embodiment of the present application;
FIG. 21 is a definition of a super-net and forward propagation execution design;
FIG. 22 is a system according to an embodiment of the application;
FIG. 23 is a diagram showing the reproducibility of the calculation results of the system according to the embodiment of the present application;
FIG. 24 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 25 is a schematic structural view of a training apparatus according to an embodiment of the present application;
fig. 26 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. The terminology used in the description of the embodiments of the application herein is for the purpose of describing particular embodiments of the application only and is not intended to be limiting of the application.
Embodiments of the present application are described below with reference to the accompanying drawings. As one of ordinary skill in the art can know, with the development of technology and the appearance of new scenes, the technical scheme provided by the embodiment of the application is also applicable to similar technical problems.
The terms first, second and the like in the description, the claims and the above drawings are used for distinguishing between similar objects and are not necessarily used to describe a particular order or sequence. It is to be understood that the terms so used are interchangeable under appropriate circumstances, and are merely a way of distinguishing objects having the same attributes when describing the embodiments of the application. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 shows a schematic structural diagram of an artificial intelligence main framework, which is described below from the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects a series of processes from the acquisition of data to its processing; for example, it may comprise the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data - information - knowledge - wisdom" condensation process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (technologies for providing and processing it) up to the industrial ecological process of the system.
(1) Infrastructure of
The infrastructure provides computing capability support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the base platform. Communicating with the outside through the sensor; the computing power is provided by a smart chip (CPU, NPU, GPU, ASIC, FPGA and other hardware acceleration chips); the basic platform comprises a distributed computing framework, a network and other relevant platform guarantees and supports, and can comprise cloud storage, computing, interconnection and interworking networks and the like. For example, the sensor and external communication obtains data that is provided to a smart chip in a distributed computing system provided by the base platform for computation.
(2) Data
The data of the upper layer of the infrastructure is used to represent the data source in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also relate to the internet of things data of the traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capability
After the data has been processed, some general-purpose capabilities can be formed based on the result of the data processing, such as algorithms or a general-purpose system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5) Intelligent product and industry application
Intelligent products and industry applications refer to the products and applications of the artificial intelligence system in various fields; they are the encapsulation of the overall artificial intelligence solution, turning intelligent information decision-making into products and realizing practical deployment. The application fields mainly include: intelligent terminals, intelligent transportation, intelligent medical treatment, autonomous driving, smart cities and the like.
The application can be applied to the natural language processing field of the artificial intelligence field, particularly to the fields of neural network compression in the natural language processing field, neural network reasoning in the natural language processing field and the like, and a plurality of application scenes falling to products are introduced.
In order to better understand the scheme of the embodiment of the present application, a possible application scenario of the embodiment of the present application will be briefly described with reference to fig. 2 to 8.
Scene 1: neural network search
Referring to fig. 2, the present application may be applied to a service related to a neural network search, and in particular may be a neural network architecture search service provided by a cloud side server, where a user may transmit information related to a model search to a cloud side neural network search system (for example, a cloud server) through a user device, where the information related to the model search may be a performance requirement of the user on a searched model, etc., and further the cloud side server may obtain a search result through a certain neural network search algorithm (for example, a data processing method in an embodiment of the present application) based on the performance requirement uploaded by the user, and send the search result to the user device.
Fig. 3 illustrates a neural network search system 100. The system may obtain training data 102 for training a neural network, validation data 104 for evaluating the performance of the neural network, and performance requirements 103, and determine search results 160 (e.g., target neural networks in embodiments of the application) using the training data 102 and the validation data 104 and the performance requirements 103, the search results 160 configured to satisfy the performance requirements 103, i.e., receive input and generate output that meets the performance requirements 103. The search results 160 may be architectural information of the neural network that may define the number of layers of the neural network, the operations performed by each layer, and the connections between layers in the neural network, i.e., which layers receive input from other layers in the neural network.
The system 100 may receive training data 102, validation data 104, and performance requirements 103 in any of a variety of ways. For example, the system 100 may receive training data as an upload from a remote user of the system over a data communication network, using, for example, an application programming interface (application programming interface, API) available to the system 100, and performance requirements 103, and randomly divide the uploaded data into training data 102 and validation data 104. As another example, the system 100 may receive input from a user specifying which data the system 100 has maintained should be used to train the neural network, and then divide the specified data into training data 102 and verification data 104.
In general, the system 100 may determine the search results 160 by searching the space of candidate architectures to identify one or more best performing architectures. For example, as shown in FIG. 3, the system 100 may construct a plurality of candidate neural network architectures (e.g., candidate neural networks in embodiments of the present application) by searching a space of candidate architectures and by the candidate selection engine 130, and model training the candidate neural network architectures by the training engine 140, and the like, and the quality assessment engine 150 may evaluate the training results to determine the search results 160.
Fig. 4 shows a neural network search system including a user device and a neural network search device. The user equipment comprises intelligent terminals such as a mobile phone, a personal computer or an information processing center. The user equipment is an initiating terminal of the neural network search, and typically, a user initiates a neural network search request through the user equipment.
The neural network search device may be a device or a server having a neural network search function, such as a cloud server, a network server, an application server, or a management server. The neural network search device receives the neural network search request from the intelligent terminal through an interactive interface, performs the neural network search in manners such as machine learning, deep learning, searching, reasoning and decision-making by means of a memory for storing data and a processor for data processing, and feeds the search result back to the user device. The memory in the neural network search device may be a general term that includes a database storing historical data, which may be located on the neural network search device or on another network server.
In the neural network search system shown in fig. 4, the user device may receive an instruction of the user, for example, the user device may receive a model performance requirement for the neural network search input by the user, and then initiate a request to the neural network search device.
In fig. 4, the neural network search device may perform the data processing method of the embodiment of the present application.
Fig. 5 shows another neural network search system, in fig. 5, a user device directly serves as a neural network search device, and the user device can directly receive a model performance requirement for the neural network search from a user input and directly perform the neural network search by hardware of the user device, and a specific process is similar to that of fig. 4, and reference is made to the above description and will not be repeated here.
In fig. 5, the user equipment itself may perform the data processing method according to the embodiment of the present application.
Because the embodiments of the present application relate to a large number of applications of neural networks, for convenience of understanding, related terms and related concepts of the neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an arithmetic unit that takes xs (i.e., input data) and an intercept of 1 as inputs, and the output of the arithmetic unit may be:
$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_{s}x_{s} + b\right)$
where s = 1, 2, ... n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function (activation function) of the neural unit, used to introduce a nonlinear characteristic into the neural network so as to convert the input signal in the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining together many of the above single neural units, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be an area composed of several neural units.
(2) Deep neural network
Deep neural networks (Deep Neural Network, DNN), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers; "many" here has no particular metric. According to the positions of the different layers, the neural network inside a DNN can be divided into three categories: input layer, hidden layers, and output layer. Typically the first layer is the input layer, the last layer is the output layer, and the intermediate layers are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the (i+1)-th layer. Although a DNN appears complex, the work of each layer is not complex; it is simply the following linear relational expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has a large number of layers, the number of coefficients W and offset vectors $\vec{b}$ is also large. These parameters are defined in the DNN as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$; the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary: the coefficient from the kth neuron of the (L-1)-th layer to the jth neuron of the L-th layer is defined as $W^{L}_{jk}$. It should be noted that the input layer has no W parameters. In deep neural networks, more hidden layers make the network better able to characterize complex situations in the real world. Theoretically, a model with more parameters has higher complexity and greater "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final objective is to obtain the weight matrices of all layers of the trained deep neural network (weight matrices formed by the vectors W of many layers).
(3) Neural network structure search (neural architecture search, NAS) based on super network
As the deployment scenarios of DNNs become richer, how to design a DNN network structure with high performance (i.e., high inference accuracy) and high efficiency (low inference latency or low power consumption) for a specific scenario becomes a major difficulty: DNNs often consist of a large number of layers; assuming there are N layers and each layer has C structural choices, there are $C^N$ combinations available for selection. Relying on human experts to select or design different DNNs is inefficient and impractical. In short, the objective of neural architecture search (NAS) is to have a set of algorithms or a framework automatically find the best DNN structure according to the requirements, and the search target may be set according to DNN performance (performance) or according to hardware resource constraints (hardware constraints).
As shown in fig. 6, the NAS can be divided into three major parts, namely, search space (search space), search policy (search strategy), and performance evaluation policy (performance estimation strategy). The search space, i.e. all choices that can be adjusted when selecting the DNN structure (i.e. the choice of DNN layer type). The search strategy, i.e. in what way the best DNN structure is to be searched out in a given search space. For example, the search strategy is a random search, or a genetic algorithm, etc. The performance evaluation strategy is how to evaluate whether a DNN structure is good or bad when it is selected from the search space. For example, each DNN structure may be actually trained to obtain its actual training accuracy over the test set.
To reduce the significant amount of time previously spent by NAS (e.g., the time to actually train each neural architecture), the latest NAS algorithms combine all DNN structures in the search space into one network covering all DNN layers, which is also called a super network (supernet). Fig. 7 shows an example of building a super network: if there are three candidate units (candidates) in total in each DNN layer, then each layer of the super network contains these three different candidate units simultaneously. The benefit of this construction is that any DNN structure in the search space (e.g., the rightmost part of fig. 7) can be estimated through different candidate units during super network (supernet) training. Therefore, only one neural network (i.e., the super-network) needs to be trained, whereby any DNN structure in the entire search space can be evaluated.
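A minimal sketch of such a super-network, assuming a PyTorch-style implementation with three candidate units per layer, is shown below; the class names and the concrete candidate operations are illustrative assumptions.

```python
# Sketch of a super-network layer holding several candidate units; a sub-network
# is obtained by picking one candidate index per layer. Illustrative only.
import torch
import torch.nn as nn

class SuperLayer(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Three candidate units for this DNN layer, as in the example above.
        self.candidates = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2),
            nn.Identity(),  # "skip" candidate
        ])

    def forward(self, x: torch.Tensor, choice: int) -> torch.Tensor:
        # Only the selected candidate participates in this forward pass.
        return self.candidates[choice](x)

class SuperNet(nn.Module):
    def __init__(self, num_layers: int = 4, channels: int = 16):
        super().__init__()
        self.layers = nn.ModuleList(SuperLayer(channels) for _ in range(num_layers))

    def forward(self, x: torch.Tensor, choices: list[int]) -> torch.Tensor:
        for layer, choice in zip(self.layers, choices):
            x = layer(x, choice)
        return x
```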
(4) Super-network training
Super-network training refers to the process of performing parameterized training on all candidate units in the super-network. The super-network is essentially a (more complex) DNN, so its training iterations are similar to those of a single DNN: the network is fitted to a certain data set, and the training process consists of forward propagation and backward propagation iterations on groups of data (i.e., data batches) of the data set. The difference between super-network training and single-DNN training is that super-network training selects only one sub-network (i.e., one DNN sub-structure) from the super-network for each group of data (each data batch) and updates the corresponding parameters; in this way, not only can the discrete DNN structures in the search space be simulated, but the GPU memory requirement can also be greatly reduced. In the actual training process, the super-network training algorithm generates a global sequence containing a number of sub-network selections, and the sub-network selections in the global sequence are activated in order during training.
For a super-network training algorithm, given a random seed and a global-sequence generation algorithm, the generated global sequences are consistent. Meanwhile, because a single iteration of super-network training is consistent with the training of a single DNN, non-interfering isolation between sub-networks is achieved. The global sequence and the isolation guarantee the two core properties of super-network training. First, super-network training is reproducible: for a given super-network training, the training process and training results can be reproduced on different hardware. Second, super-network training has high precision: the isolated sub-network training ensures the convergence of each sub-network on the fitted data set and eliminates interference between sub-networks.
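The reproducibility property can be illustrated with a small sketch of a seeded global-sequence generator; the function name and parameters are assumptions, not the algorithm used in this application.

```python
# Sketch of a global sub-network-selection sequence: given the same random seed
# and generation algorithm, the sequence of sub-network choices is identical
# across runs, which underpins the reproducibility mentioned above.
import random

def generate_global_sequence(seed: int, num_batches: int,
                             num_layers: int, num_candidates: int):
    rng = random.Random(seed)          # deterministic for a fixed seed
    sequence = []
    for _ in range(num_batches):
        # One sub-network (one candidate index per layer) per data batch.
        sequence.append([rng.randrange(num_candidates) for _ in range(num_layers)])
    return sequence

# Example: the same seed always yields the same training schedule.
assert generate_global_sequence(0, 3, 4, 3) == generate_global_sequence(0, 3, 4, 3)
```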
(5) Pipeline model parallel training
Taking a graphics processor (GPU) as an example of the training device: in deep learning neural network training, because of the complexity of the model or the amount of data, a single GPU environment may not be able to hold the workload in the memory of the training hardware (GPU), so parallelized training based on multiple GPUs, i.e., parallel training, is required.
Pipeline model parallelism (pipeline model parallelism) training handles the situation where the memory of a single GPU is insufficient when the DNN model is too large (e.g., super-net training). Pipeline-based model parallelism partitions a single model into N model segments (stages), deployed on N graphics processors (GPUs), with each GPU responsible for training one segment of the model. For the processing of a single data batch, the GPUs process it in the order of the model segments. Pipeline-based model parallelism meanwhile injects multiple data batches into the partitioned model. For example, as shown in FIG. 8, a model is divided into 4 segments placed on 4 GPUs; when segment 1 (on GPU 1) has processed data batch N and sent the output to segment 2 (on GPU 2), segment 1 immediately continues to process data batch N+1; likewise, when segment 2 (on GPU 2) finishes processing data batch N and sends the output to segment 3 (GPU 3), segment 2 immediately continues to process data batch N+1; the other segments behave similarly. Thus, by injecting multiple data batches, all GPUs process data batches in parallel in a pipelined manner.
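A simplified sketch of splitting a model into pipeline segments across GPUs is shown below, assuming PyTorch; the helper names are hypothetical, and the forward loop is written sequentially only for clarity, whereas a real pipeline overlaps the segments' work on different batches.

```python
# Sketch of pipeline model parallelism as described above: the model is split
# into N consecutive segments, one per GPU. Illustrative assumptions only.
import torch
import torch.nn as nn

def split_into_segments(layers: list[nn.Module], devices: list[str]):
    """Partition a layer list into len(devices) consecutive segments and place
    each segment on its own GPU."""
    n = len(devices)
    per_segment = (len(layers) + n - 1) // n
    segments = []
    for i, device in enumerate(devices):
        seg = nn.Sequential(*layers[i * per_segment:(i + 1) * per_segment]).to(device)
        segments.append(seg)
    return segments

def pipeline_forward(segments, devices, micro_batches):
    """Each micro-batch flows through the segments in order; written
    sequentially here, while a real pipeline keeps all segments busy on
    different batches at the same time."""
    outputs = []
    for x in micro_batches:
        for seg, device in zip(segments, devices):
            x = seg(x.to(device))
        outputs.append(x)
    return outputs
```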
(6) Loss function
In training a deep neural network, since the output of the deep neural network is expected to be as close as possible to the value that is actually desired, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so the training of the deep neural network becomes a process of reducing this loss as much as possible.
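As a common concrete example (mean squared error, used here only for illustration and not tied to any particular loss in this application), the loss over N samples with predictions $\hat{y}_i$ and targets $y_i$ can be written as:

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2$$

Training then amounts to minimizing $L(\theta)$ with respect to the network parameters $\theta$.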
In knowledge distillation, a loss needs to be constructed based on the output of a teacher model and the output of a student model, wherein the model output for constructing the loss may be the output of an output layer of the model, or may be the output of an intermediate feature map of an intermediate network layer, or may be a result obtained by processing the output of the output layer and/or the output of the intermediate feature map of the intermediate network layer.
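As an illustrative sketch only, a loss constructed from the output of a teacher model and the output of a student model may take the following form; the temperature T, the mixing weight alpha and the use of output-layer logits are assumptions for illustration, not a definitive implementation of the embodiment.

```python
# A minimal sketch of a distillation loss built from teacher and student outputs.
# The temperature T and mixing weight alpha are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft part: match the student's softened distribution to the teacher's.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard part: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```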
(7) Back propagation algorithm
A convolutional neural network can use the back propagation (BP) algorithm during training to correct the parameters in the initial super-resolution model, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is passed forward until an error loss is produced at the output, and the parameters in the initial super-resolution model are updated by propagating the error loss information backwards, so that the error loss converges. The back propagation algorithm is a backward pass dominated by the error loss, aimed at obtaining the parameters of the optimal super-resolution model, such as the weight matrices.
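As a minimal sketch of one forward/backward update step (the model, optimizer, learning rate and data below are assumptions used only for illustration):

```python
# A minimal sketch of one back-propagation update step.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x, target = torch.randn(8, 16), torch.randn(8, 4)

output = model(x)               # forward pass until an output is produced
loss = loss_fn(output, target)  # error loss between prediction and target
optimizer.zero_grad()
loss.backward()                 # propagate the error loss information backwards
optimizer.step()                # update parameters so the error loss converges
```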
The application architecture in the embodiments of the present application is described below:
referring to fig. 9, the application architecture of the data processing method in the embodiment of the present application may take the form of a computing node cluster, where the computing node cluster may include, but is not limited to, a plurality of computing nodes connected in series, and the plurality of computing nodes connected in series may cooperate to perform the training of the super-network. The computing nodes may be different devices on the cloud side or the end side, or may be different chips of the same device or of different devices. The computing nodes may communicate with each other. In the training stage, the plurality of computing nodes on the end side may cooperatively deploy one block of the super-network on each device and cooperatively train to obtain the trained super-network. The trained super-network can be used to determine a performance evaluation for each sub-network, and the performance evaluations can be used to select the best sub-network (or one meeting the search requirements) from the sub-networks as the model search result.
Next, a more detailed architecture of an execution body (e.g., any one of the computing nodes in the cluster of computing nodes, e.g., the target computing node) that executes the data processing method in an embodiment of the present application will be described.
The system architecture provided by the embodiment of the present application is described in detail below with reference to fig. 10. Fig. 10 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 10, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data acquisition system 560.
The execution device 510 includes a computing module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514. The calculation module 511 may include a target model/rule 501 therein, with the preprocessing module 513 and preprocessing module 514 being optional.
The data collection device 560 is used to collect training samples (e.g., a first batch sample, a second batch sample, a third batch sample, etc., in embodiments of the present application). The training samples may be image data, text data, audio data, etc., and in the embodiment of the present application, the training samples are data used when training the super network. After the training samples are collected, the data collection device 560 stores the training samples in the database 530.
It should be appreciated that a search space may also be maintained in database 530.
The training device 520 may train the super-network based on the training samples maintained in the database 530 to obtain a trained super-network.
It should be noted that, in practical applications, the training samples maintained in the database 530 are not necessarily all acquired by the data acquisition device 560, but may be received from other devices. It should be noted that the training device 520 is not necessarily completely based on the training samples maintained by the database 530 for performing the super-network training, and it is also possible to obtain the training samples from the cloud or other places for performing the super-network training, which should not be taken as a limitation of the embodiments of the present application.
The target model/rule 501 trained according to the training device 520 may be used to determine performance evaluations of a plurality of candidate models (i.e., sub-networks of the super-network), and select a sub-network from the super-network that is optimal in performance or meets performance requirements as a search result (target model/rule 501) based on the performance evaluations, and deliver the search result to the client device 540.
The target model/rule 501 obtained by training according to the training device 520 may be applied to different systems or devices, such as the executing device 510 shown in fig. 10, where the executing device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (augmented reality, AR)/Virtual Reality (VR) device, a vehicle-mounted terminal, or may be a server or cloud.
In fig. 10, an execution device 510 configures an input/output (I/O) interface 512 for data interaction with an external device, and a user may input data (e.g., data to be processed in an embodiment of the present application) to the I/O interface 512 through a client device 540.
The preprocessing module 513 and the preprocessing module 514 are used for preprocessing according to the input data received by the I/O interface 512. It should be appreciated that there may be no pre-processing module 513 and pre-processing module 514 or only one pre-processing module. When the preprocessing module 513 and the preprocessing module 514 are not present, the calculation module 511 may be directly employed to process the input data.
In preprocessing input data by the execution device 510, or in performing processing related to computation or the like by the computation module 511 of the execution device 510, the execution device 510 may call data, codes or the like in the data storage system 550 for corresponding processing, or may store data, instructions or the like obtained by corresponding processing in the data storage system 550.
Finally, the I/O interface 512 presents the processing results to the client device 540 for presentation to the user.
In the case shown in FIG. 10, the user may manually give input data, which may be manipulated through an interface provided by the I/O interface 512. In another case, the client device 540 may automatically send the input data to the I/O interface 512, and if the client device 540 is required to automatically send the input data requiring authorization from the user, the user may set the corresponding permissions in the client device 540. The user may view the results output by the execution device 510 at the client device 540, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 540 may also be used as a data collection terminal to collect input data from the input I/O interface 512 and output data from the output I/O interface 512 as new sample data, and store the new sample data in the database 530. Of course, instead of being collected by the client device 540, the I/O interface 512 may directly store the input data of the I/O interface 512 and the output result of the I/O interface 512 as new sample data into the database 530.
It should be noted that fig. 10 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, for example, in fig. 10, the data storage system 550 is an external memory with respect to the execution device 510, and in other cases, the data storage system 550 may be disposed in the execution device 510. It should be appreciated that the execution device 510 described above may be deployed in a client device 540.
From the reasoning side of the model: the computing module 511 of the execution device 510 may include hardware circuitry (e.g., an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA), a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor or a microcontroller, etc.), or a combination of these hardware circuits. For example, the computing module 511 may be a hardware system with an instruction execution function, such as a CPU or DSP, or a hardware system without an instruction execution function, such as an ASIC or FPGA, or a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function.
It should be understood that the computing module 511 of the execution device 510 may be a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function, which is not limited herein.
From the training side of the model:
in the embodiment of the present application, the training device 520 may obtain the code stored in the memory (not shown in fig. 10, and may be integrated into the training device 520 or disposed separately from the training device 520) to implement the data processing method in the embodiment of the present application.
In an embodiment of the present application, the training device 520 may include a hardware circuit (such as an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA), a general purpose processor, a digital signal processor (digital signal processing, DSP), a microprocessor, or a microcontroller, etc.), or a combination of these hardware circuits, for example, the training device 520 may be a hardware system having an instruction execution function, such as a CPU, DSP, etc., or a hardware system not having an instruction execution function, such as an ASIC, FPGA, etc., or a combination of the above hardware systems not having an instruction execution function and a hardware system having an instruction execution function.
Specifically, the training device 520 may be a hardware system with an instruction execution function, and the data processing method provided in the embodiment of the present application may be a software code stored in a memory, and the training device 520 may obtain the software code from the memory and execute the obtained software code to implement the data processing method provided in the embodiment of the present application.
It should be understood that the training device 520 may be a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function, and that some steps of the data processing method provided in the embodiment of the present application may also be implemented by a hardware system without an instruction execution function in the training device 520, which is not limited herein.
It should be appreciated that the number of training devices described above may be multiple (each as a computing node).
As described above, when the computer cluster performs the training of the super-network, the computing nodes need to exchange information. As shown in fig. 11, the application architecture of the embodiment of the present application may be software that runs at the user application layer and on the CPU side of a computer system. Typically, for N training devices on a computer, a management process is started for each training device, running on the CPU side. Within a computer, the CPU processes may use single-machine inter-process communication (dashed double arrow in FIG. 11); between computers, the processes may use the computer's network card for inter-process communication (solid double arrow in FIG. 11).
Next, a data processing method provided by an embodiment of the present application is described, with reference to fig. 12, fig. 12 is a flowchart of a data processing method provided by an embodiment of the present application, and as shown in fig. 12, a data processing method provided by an embodiment of the present application may include:
1201. acquiring first input data; the first input data is input of the target computing node in a forward propagation process of training the super network by the computing node cluster according to the first batch.
The execution body of step 1201 may be a target computing node, where the target computing node belongs to a computing node cluster, and the computing node cluster includes a plurality of computing nodes including the target computing node, where the plurality of computing nodes are used to train a super-network in parallel, and each of the computing nodes is used to train a part of the model of the super-network.
Here, "parallel" is relative to serial training, in which the computing node cluster must complete the forward propagation process and the backward propagation process for the training samples of the current batch before performing the training of the next batch. This dependency does not exist in parallel training: different computing nodes in the computing node cluster can simultaneously carry out the training processes of training samples of different batches.
In one possible implementation, the target computing node may be any one of a cluster of computing nodes, which may be an execution unit, such as a GPU, performing a forward propagation process or a backward propagation process of the subnet, and the specific decision of the target computing node in performing the subnet training may be controlled by other control units, such as a CPU.
In one possible implementation, the target computing node may also be an apparatus including a control unit and an execution unit, which is not limited by the embodiment of the present application.
In one possible implementation, during the training process of the distributed super-network, each part of the super-network is deployed on different computing nodes, and the forward propagation process and the backward propagation process of the super-network are realized through data flow between the computing nodes. In the training process of the super-network, different computing nodes are responsible for sub-models of different parts in the super-network.
In the searching process of the super-network structure, sampling can be carried out from the search space for each part of the super-network to construct different network structures, and correspondingly, the sub-models that the different computing nodes are responsible for are also constructed by sampling from the search space.
In one possible implementation, the search space may include multiple candidate operators. A candidate operator may be a unary operator (unary operation), which performs an operation on only one piece of data, such as negation (neg), square root (sqrt), transpose, softmax, logsigmoid, softsign, etc.; or a binary operator, which operates on two pieces of data according to a rule to produce a third piece of data, such as addition (add), matrix multiplication (matmul), cosine similarity, and Euclidean distance operations.
In one possible implementation, the search space may contain, but is not limited to, operation types such as convolution, pooling, and residual connection. For example, it may include the following operation types:
1x3 and 3x1 convolution, 1x7 and 7x1 convolution, 3x3 dilated convolution, 3x3 average pooling, 3x3 max pooling, 5x5 max pooling, 7x7 max pooling, 1x1 convolution, 3x3 convolution, 3x3 separable convolution, 5x5 separable convolution, 7x7 separable convolution, skip-connect operation, zero operation (Zero: all neurons in the corresponding position are set to zero), etc.
Illustratively, 3x3 average pooling denotes average pooling with a pooling kernel size of 3×3; 3x3 max pooling denotes max pooling with a pooling kernel size of 3×3; 3x3 dilated convolution denotes a dilated (hole) convolution with a convolution kernel size of 3×3 and a dilation rate of 2; 3x3 separable convolution denotes a separable convolution with a convolution kernel size of 3×3; 5x5 separable convolution denotes a separable convolution with a convolution kernel size of 5×5.
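As a hedged illustration of how such a candidate-operation set for one searchable layer might be written down (the channel count, operation names and the particular subset of operations are assumptions, not the embodiment's search space):

```python
# A sketch of a candidate-operation set for one searchable network layer.
import torch
import torch.nn as nn

class Zero(nn.Module):
    # The "zero" operation: all neurons in the corresponding position are zero.
    def forward(self, x):
        return torch.zeros_like(x)

def candidate_ops(channels: int) -> nn.ModuleDict:
    # One searchable layer keeps every candidate operation; a sampled
    # sub-network picks exactly one of them for this position.
    return nn.ModuleDict({
        "conv_3x3":      nn.Conv2d(channels, channels, 3, padding=1),
        "dilated_3x3":   nn.Conv2d(channels, channels, 3, padding=2, dilation=2),
        "avg_pool_3x3":  nn.AvgPool2d(3, stride=1, padding=1),
        "max_pool_3x3":  nn.MaxPool2d(3, stride=1, padding=1),
        "separable_3x3": nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1)),
        "skip_connect":  nn.Identity(),
        "zero":          Zero(),
    })
```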
For example, a super-network may include 5 network layers (network layer 1, network layer 2, network layer 3, network layer 4, network layer 5), each of which may have 5 choices of structures (structure 1, structure 2, structure 3, structure 4, structure 5) in the candidate space, that is, the specific structure of network layer 1 may be selected from the 5 structures, and the other network layers are similar.
Each network layer of the super-network may sequentially include structure 4, structure 2, structure 1, structure 4, structure 5 during one sampling (e.g., sampling A), and may sequentially include structure 3, structure 1, structure 3, structure 5, structure 3 during another sampling (e.g., sampling B). If computing node A is responsible for the training of network layer 1 and network layer 2, the sub-models that computing node A is responsible for during super-network training are, respectively, structure 4 and structure 2, and structure 3 and structure 1. The training sample used in the sub-network training corresponding to sampling A may be training sample A, and the training sample used in the sub-network training corresponding to sampling B may be training sample B. That is, computing node A performs the forward propagation process and the backward propagation process of structure 4 and structure 2 on the data transmitted to it when the computing node cluster trains based on training sample A, and performs the forward propagation process and the backward propagation process of structure 3 and structure 1 on the data transmitted to it when the computing node cluster trains based on training sample B.
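The following sketch only illustrates how sub-networks could be sampled with a fixed random seed so that the generated global sequence is consistent (as described earlier); the seed value, the 5-layer/5-choice sizes and the helper names are assumptions for illustration.

```python
# A sketch of sampling sub-networks from a 5-layer super-network with 5
# candidate structures per layer; the seed value is an illustrative assumption.
import random

NUM_LAYERS, NUM_CHOICES = 5, 5
rng = random.Random(42)  # same seed + same algorithm => same global sequence

def sample_subnet():
    # One choice index per network layer, e.g. [4, 2, 1, 4, 5] for sampling A.
    return [rng.randint(1, NUM_CHOICES) for _ in range(NUM_LAYERS)]

subnet_a = sample_subnet()   # trained with batch (training sample) A
subnet_b = sample_subnet()   # trained with batch (training sample) B

# If computing node A is responsible for network layers 1 and 2, its sub-model
# for each sampling is just the first two entries of the selection list.
node_a_submodel_a = subnet_a[:2]
node_a_submodel_b = subnet_b[:2]
```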
In one possible implementation, a global order may be set for the training samples, which constrains the order in which different batches of training samples begin their feed-forward (i.e., the order used in the training process). Since different subnets sampled from the super-network employ corresponding batches of training samples, the global order of the training samples also corresponds to the training order of the subnets.
In the serial training process, the global sequence constrains the sequence of the training samples input to the computing node cluster, and on the premise of meeting the global sequence, the computing node cluster only carries out the training process of the training samples of the next batch according to the global sequence after completing the complete training process of the training samples of the current batch. It can be understood that in the serial training process, it can be ensured that the training process of the training samples of the batch with the later global sequence is performed only after the training samples of the batch with the earlier global sequence have completed the parameter updating of the subnet.
However, in the parallel training process, it cannot be guaranteed that the training process of the training samples of the batch with the later global sequence is performed after the training samples of the batch with the earlier global sequence have completed the parameter updating of the subnet. This may lead to a decrease in the accuracy of the network and uncertainty of the training results, and it is discussed next why the above problems occur:
In some scenarios, it is possible for the same training device to sample the same network structure in different sub-networks for the network layer (or other network element) at the same position in the super-network. For example, in one sampling of a sub-network, training device A samples structure a for the tenth network layer of the super-network counted from the input layer, while in another sampling of a sub-network, structure a is also sampled for the tenth network layer counted from the input layer, and the global order of the former is earlier than that of the latter.
For example, at one sampling of the super-network (e.g., sampling A), the network layers (network layer 1, network layer 2, network layer 3, network layer 4, network layer 5) of the super-network may sequentially include structure 4, structure 1, structure 4, structure 5, and at another sampling (e.g., sampling B), the network layers of the super-network may sequentially include structure 3, structure 1, structure 3, structure 5, structure 3. If computing node A is responsible for the training of network layer 1 and network layer 2, the sub-models that computing node A is responsible for during super-network training are, respectively, structure 4 and structure 1, and structure 3 and structure 1. That is, sub-networks obtained from different samplings of the super-network may have the same network structure at the same part (network layer 2 is structure 1 in both).
The training sample used in the sub-network training corresponding to sampling A may be training sample A, and the training sample used in the sub-network training corresponding to sampling B may be training sample B. That is, computing node A performs the forward propagation process and the backward propagation process of structure 4 and structure 1 on the data transmitted to it when the computing node cluster trains based on training sample A, and performs the forward propagation process and the backward propagation process of structure 3 and structure 1 on the data transmitted to it when the computing node cluster trains based on training sample B.
Ideally, in order to obtain better network accuracy, it is necessary to ensure that, for such a shared network layer, the feed-forward process of the subnet that is later in the global sequence is performed only after the training device has completed the earlier parameter update of that network layer. It should be appreciated that the above constraint may also be described as a causal dependency constraint.
For example, if the global sequence of training sample a precedes training sample B in the above example, then in order to obtain better network accuracy, it is necessary to ensure that the forward propagation process based on training sample B is performed after training device a has completed updating parameters of the network layer based on training sample a.
In the serial super-network training process, each forward propagation process is necessarily performed after the previous backward propagation process has completed, that is, after the parameters have been updated, so the above problem does not exist. In the parallel training process, however, if the causal dependency constraint is violated, the network parameters of the model used in a forward propagation process may not be the latest parameters produced by the previous update iteration(s); for example, the model parameters of the same network layer may remain identical across two different forward propagation processes, which reduces the accuracy of the model obtained by the final training.
If the causal dependency constraint is always guaranteed to hold during the training of the super-network, on one hand, better performance or faster convergence of the trained model can be ensured, and on the other hand, the certainty of model training can be improved, where certainty means that, for the same super-network, the parameters of the super-network finally obtained by training on different computing node clusters are consistent or nearly consistent.
In the existing implementation, the causal dependency constraint is not considered, and only how to maximize parallelization and how to maximize resource utilization is often considered, which reduces the accuracy and certainty of model training.
Referring to fig. 13, fig. 13 is a scheduling illustration of existing super-network parallel training, where the presence of an arrow connection indicates a causal dependency. For example, for GPU0 there is a causal dependency between batch2 and batch1; that is, when training is performed based on batch2 and batch1 respectively, the two sampled sub-models have a consistent network layer type at the same position, and the global order of batch1 is before batch2, so GPU0 needs to perform the forward propagation process based on batch2 after completing the update of the model parameters based on batch1. However, in the schedule shown in FIG. 13, GPU0 does not perform the batch1-based back propagation process until the batch2-based forward propagation process has completed, which violates this dependency.
The data processing method provided by the embodiment of the application can ensure that the causal dependency relationship can be met in the process of performing the super-network training.
Taking the example that the training samples comprise a first batch and a second batch, the sub-models for which the target computing node is responsible (obtained from two samplings) include a first sub-model and a second sub-model, where the first batch is used in the training process of the first sub-model and the second batch is used in the training process of the second sub-model. During training of the super-network, the forward propagation process according to the first batch is configured to follow the feed-forward process according to the second batch, i.e., the global order of the first batch follows the second batch. In order to ensure that the causal dependency relationship is satisfied when the target computing node performs model training, the target computing node needs to perform the forward propagation based on the first batch after the backward propagation based on the second batch has been completed.
Taking the target computing node as an example, the target computing node may receive the first input data, where when the target computing node is the first computing node, that is, the network layer that the target computing node is responsible for includes an input layer of the super-network, the first input data may be the first batch itself. When the target computing node is not the first computing node, that is, there is a previous neighboring computing node, the first input data may be output data of the forward propagation process performed on the previous neighboring computing node, that is, output data obtained by processing the input data of the neighboring computing node.
In one possible implementation, the target computing node may be responsible for training of some portion of the network layers in the super-network. For example, it may be responsible for training of network layer 1, network layer 2, network layer 3, network layer 4, network layer 5 in the super network. The first sub-model corresponds to the first batch, and the second sub-model corresponds to the second batch, i.e., during the training of the first sub-model, both the forward propagation process and the backward propagation process are performed with the first batch as the input to the computing node cluster, and for the second sub-model, both the forward propagation process and the backward propagation process are performed with the second batch as the input to the computing node cluster.
In the training process based on the samples of the first batch, the first batch may be input into the computing node cluster, and then the forward propagation process and the backward propagation process are sequentially performed according to the connection order between the computing nodes. Taking the target computing node as an example, the first input data may be the input data of the target computing node in the training process of the super-network based on the first batch. In the existing implementation, after receiving the first input data, the target computing node directly performs the forward propagation process without considering whether a causal dependency relationship exists between the sub-model used in this forward propagation process and a sub-model earlier in the global sequence.
1202. Processing the first input data through a first sub-model based on meeting a target condition to obtain first output data; the target conditions include: the target computing node finishes updating parameters of a second sub-model according to second input data, wherein the second input data is input by the target computing node in a back propagation process of training the super-network by the computing node cluster according to the second batch, the first sub-model and the second sub-model are results obtained by carrying out model searching on the same part of the super-network in a search space, the search space comprises a plurality of types of network layers, the first sub-model and the second sub-model both comprise the same type of target network layer, and the positions of the target network layer in the first sub-model and the second sub-model are the same.
In one possible implementation, the target network layer in the first sub-model is obtained after the target computing node completes updating parameters of the target network layer in the second sub-model according to the second input data. That is, the first sub-model may be obtained after updating the second sub-model.
In one possible implementation, a scheduling unit (e.g., a CPU) of the target computing node may control the model training process of an execution unit (e.g., a GPU) in the target computing node (e.g., when to perform the forward propagation process or the backward propagation process for certain input data) based on the causal dependencies described above. For example, after determining that the target computing node has received (or will receive) the output data produced by the previous computing node in its feed-forward process, the scheduling unit may determine whether a causal dependency exists between the structure of the subnet in the forward propagation process to be performed and the structure of an earlier subnet. If such a dependency exists, the scheduling unit may determine whether the earlier network layer on which it depends has already completed its parameter update, that is, whether its back propagation process has been completed. If so, the current forward propagation process may be performed; that is, the target computing node may process the input data (the output data of the previous computing node) through the sub-model it is responsible for to obtain a processing result, and the processing result may be transferred to the next adjacent computing node, or, when the target computing node is the last computing node, the back propagation process may be performed directly.
In this way, the scheduler can ensure that the computing nodes strictly satisfy the causal dependency relationship when performing super-network training, thereby improving the model accuracy of the finally trained super-network and the certainty of the training result.
In the embodiment of the application, the pipeline parallel scheduling workflow based on causal relation dependence (namely, read-write dependence among subnets) is as follows: multiple subnets can be supported for parallel training, and dependencies among the subnets are deterministically scheduled and maintained. As shown in fig. 14, by the causal dependency-based scheduling, the system can execute in parallel in a pipeline, and can maintain the dependencies between the subnets on the respective segments (in fig. 14, arrows represent the dependencies to be followed). For example, for GPU0, there is a causal dependency between batch2 and batch1, that is, when training is performed based on batch2 and batch1, there is a case that the network layer types of the same positions of the two sub-models sampled are consistent, and the global order of batch1 is before batch2, that is, GPU0 needs to perform the forward propagation process based on batch2 after completing the model parameter update based on batch 1.
In one possible implementation, when the target computing node receives the first input data and does not complete the parameter update of the second sub-model according to the second input data, the first input data may be stored in a waiting queue, and after the parameter update of the second sub-model has been completed, the first input data may be obtained from the waiting queue and processed through the first sub-model.
In one possible implementation, the target condition further includes: the target computing node does not have a back propagation task to be executed for a sub-model of the super-network, the sub-model being a result of model searching the same part of the model from the search space. That is, the embodiment of the present application is not limited to performing the forward propagation process of the first sub-model immediately once the parameter update of the second sub-model is completed; in some scenarios, other factors need to be considered. For example, in order to maximize parallel execution, that is, to maximize resource utilization, if the target computing node has a back propagation task to be performed (for example, it has already received, or is predicted to receive, gradient information transmitted by the connected computing node), the back propagation process may be performed preferentially.
In one possible implementation, the target computing node may obtain third input data (e.g., may be a gradient); the third input data is input of the target computing node in the reverse propagation process of training the super-network by the computing node cluster according to the first batch; and updating parameters of the first sub-model according to the third input data. That is, the priority of the back-propagation process task is higher than the priority of the forward-propagation task, which is done in order to perform as preferentially as possible the back-propagation ("read and write" operation) of the subnet waiting on the current segment, helping to remove the causal dependency of the subsequent subnet on this subnet, so that the subsequent scheduling space increases.
In existing implementations, storage management of the super-network parameters is not optimized, and the super-network has to be placed entirely in the memory of the training device (e.g., a GPU), which makes the training device memory tight. When the super-network parameters occupy most of the training device memory, the remaining space left for computation becomes very limited, which directly forces the batch size to be reduced (the larger the batch size, the larger the buffer space required for computation) and indirectly leads to low effective utilization of the training device (the larger the batch size, the higher the multi-core parallelism of the training device and the higher its overall effective utilization), ultimately resulting in extremely low training efficiency of the prior-art schemes in super-network training.
To solve the above problem, the embodiment of the present application can store the parameters of the super-network in the CPU memory and, through accurate scheduling prediction, predict that sub-network i is about to be scheduled before it is executed and prefetch the parameters of sub-network i into the memory of the training device in advance.
In one possible implementation, the first sub-model in the first memory may be stored to the second memory before the processing of the first input data by the first sub-model; the first memory is not in the target computing node, and the second memory is in the target computing node; and after the parameter of the first sub-model is updated according to the first input data, releasing the updated first sub-model from the second memory to the first memory.
In one possible implementation, the target computing node is a graphics processor GPU, a tensor processor TPU, or a neural network processor NPU, and the first memory is a memory in a central processor CPU.
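The following is only a minimal sketch of such parameter movement between CPU memory and the training device's memory, assuming a PyTorch sub-model and a single GPU; the function names and device identifier are illustrative assumptions.

```python
# A sketch of prefetching a sub-model's parameters from CPU memory into the
# training device before its turn, and releasing them back afterwards.
import torch
import torch.nn as nn

device = torch.device("cuda:0")

def prefetch(submodel: nn.Module):
    # Predicted to be scheduled next: move its parameters into GPU memory in advance.
    submodel.to(device, non_blocking=True)

def release(submodel: nn.Module):
    # Finished its forward/backward propagation: move parameters back to CPU memory.
    submodel.to("cpu")
```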
In one possible implementation, the first batch and the second batch are at least one of image data, text data, audio data, and video data.
An application example of the embodiment of the present application is described below in conjunction with a specific product implementation procedure:
in one possible implementation, as shown in fig. 15, the system in the embodiment of the present application may receive input through a front-end (front) module: the user defines the search space and the construction manner of the super-network through the front end, and at the same time gives the sub-network sequence for super-network training, where each sub-network is defined by a list composed of the unit choices of each layer (e.g., sub-network 7, i.e., SN7 in fig. 15, has a series of selection sequence numbers, and each sequence number represents the candidate unit selected for that layer). Pipeline segmentation of the subnet is divided equally according to the number of training devices (e.g., GPUs). For example, there are 4 training devices (GPUs) in the figure, and assuming 24 layers for subnet 7, there are 6 layers per pipeline segment (P0-P3, or Stage 0-Stage 3). Based on the unit selection lists of the sub-networks, causal dependencies are naturally formed between the sub-networks on the individual segments. For example, assuming that subnet 2 and subnet 1 have shared candidate units on segment 0 and that the global order of subnet 2 is greater than that of subnet 1 (the sequence number of a subnet is its global order), subnet 2 (SN2) depends on subnet 1 (SN1) on segment 0.
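The sketch below only illustrates the per-layer selection list and the equal pipeline segmentation described above; the particular selection values are assumptions (the 24-layer, 4-GPU sizes follow the example).

```python
# A sketch of the front-end input: each sub-network is a per-layer selection
# list, and pipeline segments are split equally over the training devices.
NUM_LAYERS, NUM_GPUS = 24, 4
LAYERS_PER_SEGMENT = NUM_LAYERS // NUM_GPUS  # 6 layers per pipeline segment

subnet_7 = [3, 1, 4, 0, 2, 1,   # segment 0 (Stage 0, GPU 0)
            0, 2, 2, 4, 1, 3,   # segment 1
            2, 0, 1, 3, 4, 2,   # segment 2
            1, 1, 0, 2, 3, 4]   # segment 3

def segment_choices(subnet, segment_id):
    # The unit choices that a given pipeline segment is responsible for.
    start = segment_id * LAYERS_PER_SEGMENT
    return subnet[start:start + LAYERS_PER_SEGMENT]
```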
In one possible implementation, each training device (e.g., GPU) corresponds to a pipeline segment (P0-P3, or Stage 0-3), each of which starts an instance of the system of the present invention (a CPU process); the right-hand block diagram shown in Figure 7 is an instance of the system of the present invention on segment 0 (Stage 0). The system comprises input/output queues and four core modules: a Scheduler (Scheduler), a Predictor (Predictor), an Executor (Context Executor), and a Storage Manager (Context Manager).
The scheduler may accept subnet input from the front end (segment 0 receives it from the front end) or from the preceding segment (e.g., when the segment number is greater than 0, segment i+1 receives it from segment i, and so on), and add the input subnet to the waiting queue (Queue List). For the subnets in the waiting queue, on the premise of guaranteeing the causal dependency relationships among the subnets, as many forward/backward propagation computation tasks of the subnets as possible are executed in parallel (that is, execution proceeds as long as there is a subnet segment that satisfies the inter-subnet dependencies), and the executed subnet segments are added to the completion queue.
In one possible implementation, each time the forward/backward propagation computation of the current subnet segment is performed, the predictor may predict the next subnet-segment forward/backward propagation computation to be performed, based on the current segment's waiting queue, completion queue, and the execution information of the adjacent segment, and communicate the prediction result to (and invoke) the storage manager.
In one possible implementation, the executor may execute the control logic of the data processing method in the embodiment of the present application and trigger some of the functions of the scheduler, predictor, and storage manager modules. According to the scheduling result of the scheduler, it performs the forward propagation or backward propagation of the current sub-network segment and transmits the computation result and the sub-network sequence to the next segment.
In one possible implementation, the storage manager may prefetch subnet segmentation parameters from the CPU storage to the storage of the training device according to the prediction result of the predictor, while transferring parameters of the subnet segment for which execution by the executor is completed from the storage of the training device to the CPU storage.
Fig. 16 shows the execution flow of the executor in the above embodiment.
As shown in fig. 16, the executor may perform the following steps: (1) fetch the next back propagation task of the current segment from the back propagation queue; if one exists, go to step 2, otherwise go to step 3; (2) if a subnet back propagation task is present in the segment's queue, perform that back propagation immediately; the purpose of this is to perform the back propagation (the "read and write" operation) of the subnets waiting on the current segment as preferentially as possible, which helps to remove the causal dependency of subsequent subnets on this subnet and thus enlarges the subsequent scheduling space; (3) check the forward propagation queue; if a subnet forward propagation task exists, move the subnet into the waiting queue and invoke the scheduler; if not, enter the next iteration.
It should be appreciated that, before entering the forward propagation or backward propagation of a subnet, the executor invokes the predictor to perform scheduling look-ahead; after performing the forward propagation or backward propagation of the subnet, the executor invokes the storage manager to move the executed subnet's parameters out of the training device's storage.
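As an illustrative sketch only of the executor loop just described: the queue contents are assumed to be task objects exposing a run() method, and the scheduler/predictor/storage-manager hooks are passed in as callables; all of these names are assumptions, not the embodiment's actual interfaces.

```python
# A sketch of one iteration of the executor on the current pipeline segment.
from collections import deque

def executor_step(backward_queue: deque, forward_queue: deque, wait_queue: list,
                  schedule, predict_next, release_params):
    if backward_queue:                        # step (1): a back propagation is pending
        task = backward_queue.popleft()
        predict_next(task)                    # scheduling look-ahead before executing
        task.run()                            # step (2): execute it immediately;
        release_params(task)                  # this removes causal dependencies
        return                                # that later subnets have on it
    if forward_queue:                         # step (3): check the forward queue
        wait_queue.append(forward_queue.popleft())
        runnable = schedule(wait_queue)       # scheduler picks a dependency-free subnet
        if runnable is not None:
            predict_next(runnable)
            runnable.run()
            release_params(runnable)
```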
Fig. 17 illustrates the flow of the scheduler, which essentially checks all subnets in the waiting queue: for a subnet in the queue (queue sequence number qidx, global sequence number qval), its layer selection on segment K is compared with the layer selection on segment K of all subnets whose global order is smaller than that of this subnet and which are not yet complete (not in the completion queue); if there is no coincidence, the forward propagation of the subnet (qidx, qval) can enter execution.
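A minimal sketch of this dependency check is given below; the `global_order` attribute and the `choices_on_segment` helper (returning the per-layer unit choices of a subnet on a given segment) are assumptions used for illustration.

```python
# A sketch of the scheduler's causal-dependency check on pipeline segment k.
def can_run_forward(subnet, k, all_subnets, completed, choices_on_segment):
    mine = choices_on_segment(subnet, k)
    for other in all_subnets:
        if other.global_order >= subnet.global_order or other in completed:
            continue  # only earlier, not-yet-completed subnets can block us
        theirs = choices_on_segment(other, k)
        # Any coinciding choice at the same layer position means `subnet`
        # must wait until `other` has finished updating those parameters.
        if any(a == b for a, b in zip(mine, theirs)):
            return False
    return True
```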
Fig. 18 shows the flow of the predictor, which essentially falls into two cases. For scheduling prediction before the start of a back propagation (lines 4-9), the subnet number corresponding to the back propagation is added to the completion list, the scheduler is re-invoked, and the next forward propagation to be executed is predicted. For forward propagation, the execution information returned by the adjacent (next) segment is checked first (lines 10-11): for example, if segment 3 is about to perform the backward propagation of subnet 3, the information of subnet 3 is sent to segment 2 before segment 3 performs the backward propagation, so segment 2 knows that the backward propagation result of subnet 3 will shortly reach segment 2, and the next backward propagation of segment 2 can be predicted to be subnet 3, and so on. If no backward propagation is to be performed (lines 16-18), the forward propagation to be performed is added to the waiting queue, the scheduling is re-run, and the next forward propagation to be executed is predicted.
In one possible implementation, embodiments of the present application may be implemented based on PyTorch, for example, entirely in the Python programming language. Fig. 19 illustrates the basic architecture of an embodiment of the present system. It may be deployed in a multi-GPU cluster, where each GPU corresponds to a PyTorch process (i.e., a runtime instance of an embodiment of the present application), and the processes are connected using the Gloo network communication library. Note that GPU A and GPU B may be within the same computer, in which case Gloo communicates through GPU A -> PCIe -> CPU memory -> PCIe -> GPU B; GPU A and GPU B may also be within different computers, in which case Gloo communicates through GPU A -> PCIe -> CPU memory -> the network communication device on the computer -> CPU memory -> PCIe -> GPU B. An operating instance may comprise four main modules: the scheduler, the predictor, the executor, and the storage manager.
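As a hedged sketch of how such per-GPU processes could be connected through the Gloo backend (the environment-variable rendezvous, address and port are assumptions for illustration):

```python
# A sketch of connecting the per-GPU runtime instances with the Gloo backend.
import os
import torch.distributed as dist

def init_instance(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # One process per GPU / pipeline segment, connected through Gloo.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
```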
FIG. 20 illustrates the call relationships between the modules in an embodiment of the present application. The main process is the Executor (Executor), whose internal execution logic may refer to FIG. 16; the executor may call the scheduler function and obtain the scheduling result returned by it. The executor can also call the storage manager to asynchronously migrate the parameters of the executed subnet candidate units (from GPU storage to CPU storage). The executor can call the predictor function before each subnet is executed; the predictor function generates a prediction result and directly calls the storage manager with it to asynchronously migrate the predicted subnet candidate unit parameters. The storage manager is an independent thread and manages the parameters of the super-network asynchronously.
FIG. 21 illustrates the definition and forward propagation execution design of the super-network: it consists of a list of elements, each representing one DNN layer, and each layer in turn consists of a list of candidate unit operators. In forward propagation, execution follows the selection path given by the input subnet selection list; the forward propagation functions are executed using PyTorch modules, and PyTorch automatically generates the backward propagation execution graph from the forward propagation record. The entry of this execution graph is the forward propagation error tensor (x in the figure), so the executor only needs to call the backward graph entry (x.backward()) when executing the back propagation.
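The sketch below only illustrates this structure under assumed sizes (linear layers, 4 layers, 3 candidates per layer): a super-network as a list of layers, each layer a list of candidate operators, a forward pass that follows the input selection list, and back propagation triggered via .backward() on the resulting tensor.

```python
# A sketch of the super-network structure and its forward/backward execution.
import torch
import torch.nn as nn

class SuperNet(nn.Module):
    def __init__(self, num_layers=4, dim=64, num_candidates=3):
        super().__init__()
        # A list of DNN layers, each holding a list of candidate unit operators.
        self.layers = nn.ModuleList([
            nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_candidates)])
            for _ in range(num_layers)
        ])

    def forward(self, x, selection):
        # `selection` is the sub-network's per-layer choice list.
        for layer, choice in zip(self.layers, selection):
            x = torch.relu(layer[choice](x))
        return x

supernet = SuperNet()
x = supernet(torch.randn(8, 64), selection=[0, 2, 1, 0])
loss = x.mean()   # stand-in for the real error tensor
loss.backward()   # autograd executes the recorded backward propagation graph
```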
Fig. 22 shows a comparison of the normalized throughput of the system in the embodiment of the present application (darkest bar in the figure) and the baseline systems over search spaces of different sizes (c0 > c1 > c2 > c3). Overall, the system of the present application shows significant throughput improvement over the baseline systems on the large search spaces (c0 and c1), for two reasons. First, through accurate scheduling prediction and storage management, the system of the present application minimizes the super-network parameters held in the training device's storage and maximally frees the storage space available for training computation, so it can support a larger batch size in super-network training, achieve higher overall utilization of the training device, and thus achieve higher throughput. Second, the parallel scheduling workflow of the present application achieves maximum parallelism on the premise of guaranteeing the causal dependencies among the subnets.
Fig. 23 shows the reproducibility of the computation results of the system (CSP) in the embodiment of the present application: the super-network error and the final search accuracy of the training of the present application are completely consistent across different numbers of training devices (GPUs), whereas the synchronization modes BSP (VPipe and GPipe) and ASP (PipeDream) used by the baseline systems cannot ensure consistency of loss and accuracy across different GPU clusters. Moreover, both BSP and ASP show varying degrees of error increase and accuracy decrease compared to CSP, indicating that CSP training accuracy is the highest.
In addition, the embodiment of the present application also provides a data processing method, which is applied to a target computing node, wherein the target computing node belongs to a computing node cluster, the computing node cluster comprises a plurality of computing nodes including the target computing node, the plurality of computing nodes are used to train a super-network in parallel, and each computing node is used to train a part of the model of the super-network; the training samples of the super-network comprise a first batch and a second batch, and in the process of training the super-network, the forward propagation process according to the first batch is configured to be performed after the feed-forward process according to the second batch; the target computing node comprises an execution device and a scheduling device, and the method comprises the following steps:
The scheduling device triggers the execution device to process the first input data through a first sub-model based on determining that a target condition is met, so as to obtain first output data; the first input data is input by the execution device in the forward propagation process of training the super network by the computing node cluster according to the first batch; the target conditions include:
the execution device completes parameter updating of a second sub-model according to second input data, the second input data is input by the execution device in a back propagation process of training the super-network by the computing node cluster according to the second batch, the first sub-model and the second sub-model are results obtained by carrying out model searching on the same part of the super-network in a search space, the search space comprises a plurality of types of network layers, the first sub-model and the second sub-model both comprise the same type of target network layer, and the positions of the target network layer in the first sub-model and the second sub-model are the same.
In one possible implementation, the scheduling device is a CPU and the execution device is a graphics processor GPU, a tensor processor TPU or a neural network processor NPU.
In one possible implementation, the method further comprises:
before the execution device processes the first input data through a first sub-model, the scheduling device triggers the execution device to store the first sub-model in a first memory into a second memory; the first memory is not in the execution device, and the second memory is in the execution device;
after the execution device has completed updating the parameters of the first sub-model according to the first input data, the scheduling device triggers the execution device to release the updated first sub-model from the second memory to the first memory.
In one possible implementation, the first memory is a memory in the scheduling device.
In one possible implementation, the first sub-model is obtained after the execution device completes updating parameters of the second sub-model according to the second input data.
In one possible implementation, the target condition further includes:
the executing equipment does not have a back propagation task to be executed aiming at a sub-model of the super-network, wherein the sub-model is a result obtained by carrying out model search on the same part of model in a search space.
In one possible implementation, the method further comprises:
the scheduling device triggers the execution device to acquire third input data; the third input data is input of the target computing node in the reverse propagation process of training the super-network by the computing node cluster according to the first batch; and the execution equipment updates parameters of the first sub-model according to the third input data.
In one possible implementation, the updating the first sub-model according to the third input data includes:
and under the condition that the execution device has at least one forward propagation task to be executed for the sub-model of the super-network, before executing the forward propagation task to be executed, the scheduling device triggers the execution device to update the first sub-model according to the third input data.
In one possible implementation, the method further comprises:
after the first input data is acquired and before the first input data is processed through a first sub-model, based on that the target computing node does not complete updating of parameters of a second sub-model according to second input data, the first input data is stored into a waiting queue;
Said processing said first input data by a first sub-model comprising:
and acquiring the first input data from the waiting queue, and processing the first input data through a first submodel.
In one possible implementation, the first batch and the second batch are at least one of image data, text data, audio data, and video data.
In order to better implement the above-described scheme of the embodiment of the present application on the basis of the embodiments corresponding to fig. 1 to 23, a related apparatus for implementing the above-described scheme is further provided below. Referring specifically to fig. 24, fig. 24 is a schematic structural diagram of a data processing apparatus 2400 according to an embodiment of the present application, where the data processing apparatus 2400 may be applied to a target computing node, where the target computing node belongs to a computing node cluster, and the computing node cluster includes a plurality of computing nodes including the target computing node, where the plurality of computing nodes are used to train a super-network in parallel, and each of the computing nodes is used to train a part of the model of the super-network; the training samples of the super-network comprise a first batch and a second batch, and in the process of training the super-network, the forward propagation process according to the first batch is configured to be performed after the feed-forward process according to the second batch; the device comprises:
An acquisition module 2401 for acquiring first input data; the first input data is input of the target computing node in the forward propagation process of training the super network by the computing node cluster according to the first batch;
the specific description of the acquiring module 2401 may refer to the description of step 1201 in the above embodiment, and will not be repeated here.
A data processing module 2402, configured to process the first input data through a first sub-model to obtain first output data based on the target condition being satisfied; the target conditions include:
the target computing node finishes updating parameters of a second sub-model according to second input data, wherein the second input data is input by the target computing node in a back propagation process of training the super-network by the computing node cluster according to the second batch, the first sub-model and the second sub-model are results obtained by carrying out model searching on the same part of the super-network in a search space, the search space comprises a plurality of types of network layers, the first sub-model and the second sub-model both comprise the same type of target network layer, and the positions of the target network layer in the first sub-model and the second sub-model are the same.
The specific description of the data processing module 2402 may refer to the description of step 1202 in the above embodiment, which is not repeated here.
In one possible implementation, the apparatus further includes:
the data storage module is used for storing the first sub-model in the first memory to the second memory before the first input data is processed through the first sub-model; the first memory is not in the target computing node, and the second memory is in the target computing node;
and after the parameter updating of the first sub-model according to the first input data is completed, releasing the updated first sub-model from the second memory to the first memory.
In one possible implementation, the target computing node is a graphics processor GPU, a tensor processor TPU, or a neural network processor NPU, and the first memory is a memory in a central processor CPU.
In one possible implementation, the target network layer in the first sub-model is obtained after the target computing node completes updating parameters of the target network layer in the second sub-model according to the second input data.
In one possible implementation, the target condition further includes:
the target computing node does not have a back propagation task to be executed for a sub-model of the super-network, the sub-model being a result of performing a model search on the same part of the super-network in the search space.
In one possible implementation, the acquiring module is further configured to:
acquiring third input data; the third input data is input of the target computing node in the back propagation process of training the super-network by the computing node cluster according to the first batch;
the apparatus further comprises:
and the model updating module is used for updating parameters of the first sub-model according to the third input data.
In one possible implementation, the model updating module is specifically configured to:
update the first sub-model according to the third input data before executing the forward propagation task to be executed, in a case where the target computing node has at least one forward propagation task to be executed for a sub-model of the super-network.
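Read as a scheduling rule, this simply says that pending back propagation work takes priority over queued forward propagation work. A hypothetical scheduler loop illustrating the rule is sketched below; the queue names and schedule_step are invented for the illustration.

```python
from collections import deque

forward_tasks = deque()   # forward propagation tasks pending on the target computing node
backward_tasks = deque()  # back propagation (parameter update) tasks pending on the node

def schedule_step():
    # A parameter update such as "update the first sub-model with the third
    # input data" runs before any pending forward propagation task.
    if backward_tasks:
        backward_tasks.popleft()()
    elif forward_tasks:
        forward_tasks.popleft()()
```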
In one possible implementation, the data storage module is further configured to:
store, after the first input data is acquired and before the first input data is processed through the first sub-model, the first input data into a waiting queue based on the target computing node not having completed the parameter update of the second sub-model according to the second input data;
the processing of the first input data through the first sub-model includes:
acquiring the first input data from the waiting queue, and processing the first input data through the first sub-model.
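As a rough illustration of this buffering behaviour (and nothing more), the sketch below uses Python's standard queue module; on_first_input, drain_waiting_queue, and the second_update_done callback are hypothetical names.

```python
import queue

waiting_queue = queue.Queue()

def on_first_input(first_input_data, second_update_done, first_sub_model):
    # If the second sub-model's parameter update has not completed, buffer the
    # first input data instead of processing it immediately.
    if not second_update_done():
        waiting_queue.put(first_input_data)
        return None
    return first_sub_model(first_input_data)

def drain_waiting_queue(first_sub_model):
    # Once the update has completed, take the buffered input from the waiting
    # queue and process it through the first sub-model.
    while not waiting_queue.empty():
        first_sub_model(waiting_queue.get())
```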
In one possible implementation, the first batch and the second batch are at least one of image data, text data, audio data, and video data.
Referring to fig. 25, fig. 25 is a schematic structural diagram of a training device according to an embodiment of the present application. Specifically, the training device 2500 is implemented by one or more servers. The training device 2500 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 2525 (for example, one or more processors), a memory 2532, and one or more storage media 2530 (for example, one or more mass storage devices) storing application programs 2542 or data 2544. The memory 2532 and the storage medium 2530 may be transitory or persistent. The program stored in the storage medium 2530 may include one or more modules (not shown), and each module may include a series of instruction operations on the training device. Still further, the central processing unit 2525 may be configured to communicate with the storage medium 2530, to perform, on the training device 2500, the series of instruction operations in the storage medium 2530.
The training device 2500 may also include one or more power supplies 2526, one or more wired or wireless network interfaces 2550, one or more input/output interfaces 2558, or one or more operating systems 2541 such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
In this embodiment of the present application, the central processing unit 2525 is configured to perform the data processing method shown in fig. 12.
Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Embodiments of the present application also provide a computer-readable storage medium storing a program for signal processing which, when run on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
The training device provided by the embodiments of the present application may be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute the computer-executable instructions stored in the storage unit, so that the chip in the execution device performs the data processing method described in the above embodiments, or so that the chip in the training device performs the data processing method described in the above embodiments. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may alternatively be a storage unit located outside the chip on the wireless access device side, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, referring to fig. 26, fig. 26 is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be implemented as a neural network processor NPU 2600. The NPU 2600 is mounted as a coprocessor on a host CPU (Host CPU), and the host CPU distributes tasks. The core of the NPU is the arithmetic circuit 2603; the controller 2604 controls the arithmetic circuit 2603 to fetch matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 2603 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 2603 is a two-dimensional systolic array. The arithmetic circuit 2603 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2603 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2602 and buffers it in each PE of the arithmetic circuit. The arithmetic circuit fetches the data of matrix A from the input memory 2601, performs a matrix operation with matrix B, and stores the obtained partial result or final result of the matrix in the accumulator (accumulator) 2608.
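Purely as a numerical illustration of how partial results build up in an accumulator (this is plain NumPy, not the NPU's instruction set), the product C = A·B can be formed as a running sum of rank-1 partial products:

```python
import numpy as np

A = np.random.rand(4, 8)   # input matrix A, as fetched from the input memory
B = np.random.rand(8, 4)   # weight matrix B, as fetched from the weight memory

# Accumulate partial results, mimicking an accumulator collecting the partial
# sums produced by the processing elements.
C = np.zeros((4, 4))
for k in range(A.shape[1]):
    C += np.outer(A[:, k], B[k, :])   # one rank-1 partial product

assert np.allclose(C, A @ B)          # matches the full matrix product
```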
The unified memory 2606 is used for storing input data and output data. Weight data is transferred directly to the weight memory 2602 through the direct memory access controller (Direct Memory Access Controller, DMAC) 2605. Input data is also carried into the unified memory 2606 through the DMAC.
The bus interface unit (Bus Interface Unit, BIU) 2610 is used for interaction between the AXI bus and the DMAC and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 2609.
The bus interface unit 2610 is used for the instruction fetch memory 2609 to fetch instructions from an external memory, and is also used for the direct memory access controller 2605 to fetch the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2606, to transfer weight data to the weight memory 2602, or to transfer input data to the input memory 2601.
The vector calculation unit 2607 includes a plurality of operation processing units, and further processes the output of the arithmetic circuit when needed, for example vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. The vector calculation unit 2607 is mainly used for network calculation of non-convolution/non-fully-connected layers in the neural network, such as batch normalization (batch normalization), pixel-level summation, and up-sampling of feature planes.
In some implementations, the vector calculation unit 2607 can store the processed output vector to the unified memory 2606. For example, the vector calculation unit 2607 may apply a linear function and/or a nonlinear function to the output of the arithmetic circuit 2603, for example to perform linear interpolation on feature planes extracted by a convolutional layer, or to apply a nonlinear function to a vector of accumulated values so as to generate activation values. In some implementations, the vector calculation unit 2607 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 2603, for example for use in a subsequent layer of the neural network.
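As a rough sketch of the kind of post-processing described for the vector calculation unit (batch-normalization-style rescaling, element-wise summation, and a nonlinear activation), the NumPy function below is illustrative only and is not tied to any actual NPU API; all parameter names are assumptions.

```python
import numpy as np

def vector_postprocess(matrix_out, mean, var, gamma, beta, residual, eps=1e-5):
    # Batch-normalization-style rescaling of the matrix unit's output.
    x = gamma * (matrix_out - mean) / np.sqrt(var + eps) + beta
    x = x + residual              # pixel-level (element-wise) summation
    return np.maximum(x, 0.0)     # nonlinear activation producing activation values
```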
The instruction fetch buffer (instruction fetch buffer) 2609 is connected to the controller 2604 and is used for storing instructions used by the controller 2604.
The unified memory 2606, the input memory 2601, the weight memory 2602, and the instruction fetch memory 2609 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
It should be further noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solution of this embodiment. In addition, in the accompanying drawings of the apparatus embodiments provided by the present application, the connection relationship between modules indicates that the modules have a communication connection with each other, which may be specifically implemented as one or more communication buses or signal lines.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is the preferred embodiment in most cases. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, may be embodied in the form of a software product stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, including several instructions for causing a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in the embodiments of the present application.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a training device or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.

Claims (21)

1. A data processing method, wherein the method is applied to a target computing node, wherein the target computing node belongs to a computing node cluster, the computing node cluster comprises a plurality of computing nodes including the target computing node, the plurality of computing nodes are used for training a super-network in parallel, and each computing node is used for training a part of a model of the super-network; the training samples of the super-network comprise a first batch and a second batch, and in the process of training the super-network, a forward propagation process according to the first batch is configured to be performed after a forward propagation process according to the second batch; the method comprises the following steps:
acquiring first input data; the first input data is input of the target computing node in the forward propagation process of training the super network by the computing node cluster according to the first batch;
processing the first input data through a first sub-model based on meeting a target condition to obtain first output data; the target conditions include:
the target computing node finishes updating parameters of a second sub-model according to second input data, wherein the second input data is input by the target computing node in a back propagation process of training the super-network by the computing node cluster according to the second batch, the first sub-model and the second sub-model are results obtained by carrying out model searching on the same part of the super-network in a search space, the search space comprises a plurality of types of network layers, the first sub-model and the second sub-model both comprise the same type of target network layer, and the positions of the target network layer in the first sub-model and the second sub-model are the same.
2. The method according to claim 1, wherein the method further comprises:
before the first input data is processed through the first sub-model, the first sub-model in a first memory is stored into a second memory; the first memory is not in the target computing node, and the second memory is in the target computing node;
and after the parameter updating of the first sub-model according to the first input data is completed, releasing the updated first sub-model from the second memory to the first memory.
3. The method of claim 2, wherein the target computing node is a graphics processor GPU, a tensor processor TPU, or a neural network processor NPU, and the first memory is a memory in a central processor CPU.
4. A method according to any one of claims 1 to 3, wherein the target network layer in the first sub-model is obtained after the target computing node completes updating parameters of the target network layer in the second sub-model according to the second input data.
5. The method of any one of claims 1 to 4, wherein the target conditions further comprise:
the target computing node does not have a back propagation task to be executed for a sub-model of the super-network, the sub-model being a result of performing a model search on the same part of the super-network in the search space.
6. The method according to any one of claims 1 to 5, further comprising:
acquiring third input data; the third input data is input of the target computing node in the back propagation process of training the super-network by the computing node cluster according to the first batch;
and updating parameters of the first sub-model according to the third input data.
7. The method of claim 6, wherein updating the first sub-model based on the third input data comprises:
updating the first sub-model according to the third input data before executing the forward propagation task to be executed, in a case where the target computing node has at least one forward propagation task to be executed for a sub-model of the super-network.
8. The method according to any one of claims 1 to 7, further comprising:
after the first input data is acquired and before the first input data is processed through the first sub-model, storing the first input data into a waiting queue based on the target computing node not having completed the parameter update of the second sub-model according to the second input data;
wherein the processing of the first input data through the first sub-model comprises:
acquiring the first input data from the waiting queue, and processing the first input data through the first sub-model.
9. The method of any one of claims 1 to 8, wherein the first batch and the second batch are at least one of image data, text data, audio data, video data.
10. A data processing apparatus, wherein the apparatus is applied to a target computing node, wherein the target computing node belongs to a computing node cluster, the computing node cluster comprises a plurality of computing nodes including the target computing node, the plurality of computing nodes are used for training a super-network in parallel, and each computing node is used for training a part of a model of the super-network; the training samples of the super-network comprise a first batch and a second batch, and in the process of training the super-network, a forward propagation process according to the first batch is configured to be performed after a forward propagation process according to the second batch; the apparatus comprises:
The acquisition module is used for acquiring first input data; the first input data is input of the target computing node in the forward propagation process of training the super network by the computing node cluster according to the first batch;
the data processing module is used for processing the first input data through a first sub-model based on the condition that the target condition is met so as to obtain first output data; the target conditions include:
the target computing node finishes updating parameters of a second sub-model according to second input data, wherein the second input data is input by the target computing node in a back propagation process of training the super-network by the computing node cluster according to the second batch, the first sub-model and the second sub-model are results obtained by carrying out model searching on the same part of the super-network in a search space, the search space comprises a plurality of types of network layers, the first sub-model and the second sub-model both comprise the same type of target network layer, and the positions of the target network layer in the first sub-model and the second sub-model are the same.
11. The apparatus of claim 10, wherein the apparatus further comprises:
The data storage module is used for storing the first sub-model in the first memory to the second memory before the first input data is processed through the first sub-model; the first memory is not in the target computing node, and the second memory is in the target computing node;
and after the parameter updating of the first sub-model according to the first input data is completed, releasing the updated first sub-model from the second memory to the first memory.
12. The apparatus of claim 11, wherein the target computing node is a graphics processor GPU, a tensor processor TPU, or a neural network processor NPU, and the first memory is a memory in a central processor CPU.
13. The apparatus according to any one of claims 10 to 12, wherein the target network layer in the first sub-model is obtained after the target computing node completes parameter updating of the target network layer in the second sub-model according to the second input data.
14. The apparatus of any one of claims 10 to 13, wherein the target condition further comprises:
the target computing node does not have a back propagation task to be executed for a sub-model of the super-network, the sub-model being a result of performing a model search on the same part of the super-network in the search space.
15. The apparatus of any one of claims 10 to 14, wherein the acquisition module is further configured to:
acquiring third input data; the third input data is input of the target computing node in the back propagation process of training the super-network by the computing node cluster according to the first batch;
the apparatus further comprises:
and the model updating module is used for updating parameters of the first sub-model according to the third input data.
16. The apparatus according to claim 15, wherein the model updating module is specifically configured to:
update the first sub-model according to the third input data before executing the forward propagation task to be executed, in a case where the target computing node has at least one forward propagation task to be executed for a sub-model of the super-network.
17. The apparatus of any one of claims 10 to 16, wherein the data storage module is further configured to:
store, after the first input data is acquired and before the first input data is processed through the first sub-model, the first input data into a waiting queue based on the target computing node not having completed the parameter update of the second sub-model according to the second input data;
wherein the processing of the first input data through the first sub-model comprises:
acquiring the first input data from the waiting queue, and processing the first input data through the first sub-model.
18. The apparatus of any one of claims 10 to 17, wherein the first batch and the second batch are at least one of image data, text data, audio data, video data.
19. A data processing apparatus, the apparatus comprising a memory and a processor; the memory stores code, the processor being configured to retrieve the code and to perform the method of any of claims 1 to 9.
20. A computer storage medium storing one or more instructions which, when executed by one or more computers, cause the one or more computers to implement the method of any one of claims 1 to 9.
21. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any of claims 1 to 9.