CN116562364A - Deep learning model collaborative deduction method, device and equipment based on knowledge distillation - Google Patents

Deep learning model collaborative deduction method, device and equipment based on knowledge distillation

Info

Publication number
CN116562364A
Authority
CN
China
Prior art keywords
node
cluster
target
model
edge
Prior art date
Legal status
Pending
Application number
CN202310305693.4A
Other languages
Chinese (zh)
Inventor
王莉
徐连明
费爱国
李靚
彭鲜
吴鑫
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202310305693.4A
Publication of CN116562364A
Status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/098: Distributed learning, e.g. federated learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14: Error detection or correction of the data by redundancy in operation
    • G06F11/1402: Saving, restoring, recovering or retrying
    • G06F11/1446: Point-in-time backing up or restoration of persistent data
    • G06F11/1448: Management of the data involved in backup or backup restore
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2323: Non-hierarchical techniques based on graph theory, e.g. minimum spanning trees [MST] or graph cuts
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/096: Transfer learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Discrete Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a knowledge distillation-based deep learning model collaborative deduction method, device and equipment. The method comprises the following steps: obtaining the FLOPS, storage capacity and successful data transmission probability corresponding to each edge node; clustering the edge nodes based on the FLOPS, storage capacity and successful data transmission probability to obtain K target node clusters for redundancy backup; performing set division processing on a plurality of convolution filters of the last convolution layer of a preset teacher model to obtain K filter sets; determining a model to be trained that is adapted to the capacity of each target node cluster based on the K target node clusters, the K filter sets and a plurality of preset models; jointly training the model to be trained of each target node cluster by a knowledge distillation technique based on a plurality of preset sample data, to obtain a student model of each target node cluster; and, for each target node cluster, deploying the student model of the target node cluster to each edge node in the target node cluster, so as to improve deduction performance.

Description

Deep learning model collaborative deduction method, device and equipment based on knowledge distillation
Technical Field
The invention relates to the technical field of edge intelligence, in particular to a method, a device and equipment for collaborative deduction of a highly robust deep learning model based on knowledge distillation.
Background
As the computing and processing functions of Internet of Things devices shift to the network edge, the demand for intelligent services on edge devices at the network edge increases, and deep neural networks (Deep Neural Networks, DNN) play an important role in meeting these demands.
At present, distributed DNN deduction techniques fall into two classes. One class divides the input of the DNN model layers, partitioning the model's computational load and distributing it to a plurality of edge devices for collaborative deduction; the other class directly divides the original DNN model into a plurality of sub-models and deploys each sub-model on a single edge device for distributed DNN collaborative deduction. Although these methods can effectively divide the computational load of the DNN model, the nature of convolution operations requires frequent communication among edge devices to synchronize intermediate results, and the resulting communication overhead limits the performance gains of distributed deduction. Furthermore, the input-division method requires every device to accommodate the complete DNN model, which is a strong assumption for resource-constrained edge devices. To reduce communication overhead, a distributed deduction method with independent models has been proposed, in which a plurality of lightweight models with the same structure and independent functions are derived from the original DNN model through knowledge distillation and cooperatively perform distributed deduction.
However, since the storage and computing capacities of edge devices are heterogeneous, uniformly dividing the computational load in this method causes a mismatch between the models and the device capacities, which is unfavorable for improving resource utilization and deduction performance. On the other hand, because the states of edge devices and wireless communication links are unstable, the deduction process runs the risk of losing results due to device or communication failures, thereby reducing accuracy. Lacking a robustness design, the method therefore lacks resilience in the face of failures.
Disclosure of Invention
The invention provides a knowledge distillation-based, highly robust deep learning model collaborative deduction method, device and equipment, which are used for solving the problems in the prior art that collaborative deduction performance is low due to synchronous communication between models and the mismatch between model and device capacities, and that deduction lacks resilience when nodes fail, thereby achieving the purpose of improving collaborative deduction performance.
In a first aspect, the present invention provides a knowledge distillation-based deep learning model collaborative deduction method, including:
acquiring the number of floating-point operations per second (FLOPS), the storage capacity and the probability of successful data transmission corresponding to each edge node;
clustering each edge node based on the FLOPS, the storage capacity and the probability of successful data transmission to obtain K target node clusters;
Performing set division processing on a plurality of convolution filters of the last convolution layer in a preset teacher model to obtain K filter sets;
determining a model to be trained of each target node cluster based on the K target node clusters, the K filter sets and a plurality of preset models;
based on a plurality of preset sample data, performing joint training on the model to be trained of each target node cluster by adopting a knowledge distillation technology to obtain student models of each target node cluster;
aiming at each target node cluster, respectively deploying a student model of the target node cluster into each edge node in the target node cluster; and each edge node in each target node cluster is used for executing collaborative deduction when a corresponding student model is operated.
In a second aspect, the present invention further provides a deep learning model collaborative deduction device based on knowledge distillation, including:
the acquisition module is used for acquiring the number of floating-point operations per second (FLOPS), the storage capacity and the probability of successful data transmission corresponding to each edge node;
the clustering module is used for carrying out clustering processing on each edge node based on the FLOPS, the storage capacity and the data successful transmission probability to obtain K target node clusters;
The division module is used for carrying out set division processing on a plurality of convolution filters of the last convolution layer in a preset teacher model to obtain K filter sets;
the determining module is used for determining a model to be trained of each target node cluster based on the K target node clusters, the K filter sets and a plurality of preset models;
the training module is used for carrying out joint training on the to-be-trained models of the target node clusters by adopting a knowledge distillation technology based on a plurality of preset sample data to obtain student models of the target node clusters;
the deployment module is used for respectively deploying the student models of the target node clusters to the edge nodes in the target node clusters aiming at the target node clusters; and each edge node in each target node cluster is used for executing collaborative deduction when a corresponding student model is operated.
In a third aspect, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the knowledge-distillation-based deep learning model collaborative deduction method as described in any one of the above when executing the program.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a knowledge-distillation-based deep learning model collaborative deduction method as described in any one of the above.
In a fifth aspect, the present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a knowledge-based distillation based deep learning model collaborative deduction method as described in any of the above.
The invention provides a deep learning model collaborative deduction method, device and equipment based on knowledge distillation, wherein in the method, clustering processing is carried out on all edge nodes based on FLOPS, storage capacity and data successful transmission probability to obtain K target node clusters, and the purpose is to carry out redundancy backup on a student model in the node clusters, so that under the condition that a certain edge node in the node clusters fails, other edge nodes can continue to carry out deduction, and the accumulated successful transmission probability of each node cluster is ensured to meet the requirement, thereby improving the elasticity of collaborative deduction to the failure. In addition, a plurality of convolution filters of the last convolution layer in a preset teacher model are subjected to set division processing to obtain K filter sets, and important filters are ensured to be uniformly distributed in the K filter sets, so that each filter set has equal importance on a deduction result. Further, when determining the model to be trained, comprehensively considering FLOPS, storage capacity and data transmission rate of each edge node, and distributing a preset model and a filter set which are adaptive to the capacity of each node cluster, so that deduction time delay of each node cluster is close, resources are fully utilized, and deduction performance is improved.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an application scenario provided by the present invention;
FIG. 2 is a schematic flow chart of a knowledge distillation-based deep learning model collaborative deduction method;
FIG. 3 is a schematic flow chart of a method for obtaining K target node clusters according to the present invention;
FIG. 4 is a flow chart of a method for obtaining K filter sets according to the present invention;
FIG. 5 is a schematic flow chart of a method for determining a model to be trained of each target node cluster according to the present invention;
FIG. 6 is a schematic flow chart of a method for optimizing a model to be trained of a target node cluster;
FIG. 7 is a schematic diagram showing a comparison of collaborative deduction performance simulation results of a knowledge distillation-based deep learning model collaborative deduction method;
FIG. 8 is a second schematic diagram comparing collaborative deduction performance simulation results of the knowledge distillation-based deep learning model collaborative deduction method provided by the present invention;
FIG. 9 is a third comparison diagram of collaborative deduction performance simulation results of a knowledge distillation-based deep learning model collaborative deduction method according to the present invention;
fig. 10 is a schematic structural diagram of a deep learning model collaborative deduction device based on knowledge distillation provided by the invention;
fig. 11 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the present disclosure, the term "include" and variations thereof denote non-limiting inclusion; the term "or" and variations thereof denote "and/or". The terms "first", "second" and the like herein are used to distinguish similar objects and do not necessarily describe a particular sequence or chronological order. In the present invention, "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
Next, an application scenario of the technical solution shown in the present invention will be described with reference to fig. 1.
Fig. 1 is a schematic view of an application scenario provided by the present invention. As shown in fig. 1, for example, the application scenario includes: the system comprises a plurality of node clusters, a plurality of student models, a geographic area and a deduction result receiving end.
For example, the plurality of node clusters includes node cluster 1, node cluster 2, node cluster 3, and node cluster 4. Each node cluster includes at least one edge node therein. Each edge node in the node cluster can deploy a student model corresponding to the node cluster. Alternatively, the edge node may be a drone, a smart camera, or a shock sensor, among others.
For example, the plurality of student models includes a student model 1, a student model 2, a student model 3, and a student model 4.
The node clusters are in one-to-one correspondence with the student models, for example: node cluster 1 corresponds to student model 1.
Optionally, the deduction result receiving end may be a user terminal, and the user terminal may be a mobile phone, a tablet computer or a notebook computer, for example.
In the case that an edge node in the node cluster 1 acquires data to be processed from a geographical area, the edge node transmits the data to be processed to all edge nodes in other node clusters (e.g., node cluster 2, node cluster 3, and node cluster 4).
Aiming at each edge node of other node clusters, the edge nodes run corresponding student models, the data to be processed are deduced through the student models, partition results are obtained, and the partition results are returned to the edge nodes in the node cluster 1.
Under the condition that the edge nodes in the node cluster 1 receive a plurality of partition results, each edge node in the node cluster 1 splices the partition results to obtain a splicing result, performs calculation on the splicing result to obtain a deduction result, and sends the deduction result to a deduction result receiving end.
It should be noted that deduction refers to the process in which a node runs a model and the model processes the data to be processed.
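For illustration only, the following is a minimal Python sketch of how the partition results returned by the node clusters might be spliced and turned into a deduction result; the final softmax aggregation over the spliced vector is an assumption, since the actual calculation performed on the splicing result is defined by the deployed student models.

import numpy as np

def splice_and_deduce(partition_results):
    """Concatenate the partition results returned by the node clusters and
    derive a final deduction result. The softmax aggregation below is an
    illustrative assumption only."""
    # Each partition result is the output of one node cluster's student model.
    spliced = np.concatenate(partition_results, axis=-1)   # splicing result
    # Illustrative final computation: a softmax over the spliced vector.
    exp = np.exp(spliced - spliced.max())
    probs = exp / exp.sum()
    return int(np.argmax(probs)), probs

# Hypothetical partial outputs from node clusters 2, 3 and 4.
parts = [np.array([0.2, 1.1]), np.array([0.7, 0.1]), np.array([2.3, 0.4])]
label, probs = splice_and_deduce(parts)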
The deep learning model collaborative deduction method based on knowledge distillation provided by the invention is described below with reference to the specific embodiment of fig. 2.
Fig. 2 is a schematic flow chart of a knowledge distillation-based deep learning model collaborative deduction method provided by the invention. As shown in fig. 2, the deep learning model collaborative deduction method based on knowledge distillation provided in this embodiment includes:
step 201, obtaining floating point operation times FLOPS, storage capacity and data successful transmission probability corresponding to each edge node.
Optionally, the implementation subject of the knowledge distillation-based deep learning model collaborative deduction method provided by the invention can be electronic equipment, and also can be a knowledge distillation-based deep learning model collaborative deduction device arranged in the electronic equipment. The knowledge distillation based deep learning model collaborative deduction means may be implemented by a combination of software and/or hardware.
For example, the electronic device may be a device having a data transmitting/receiving function, such as a server or a desktop computer.
Optionally, FLOPS is the number of floating-point operations performed per second by the edge node (floating-point operations per second).
Optionally, the probability of successful data transmission is the probability that data is successfully transmitted between the edge node and the data node (a node preset for each edge node; the data node is used for acquiring the data to be processed).
Step 202, clustering each edge node based on FLOPS, storage capacity and probability of successful data transmission to obtain K target node clusters.
Optionally, each target node cluster comprises at least one edge node.
Optionally, the K target node clusters satisfy the following clustering rules: the clusters are pairwise disjoint, i.e. $M_i \cap M_j = \varnothing$ for $i \neq j$; the edge nodes in each cluster are similar in storage capacity and FLOPS, i.e. each $m_i \in M_k$ is close to the cluster averages $\bar{c}_{M_k}$ and $\bar{f}_{M_k}$; and the cumulative transmission success probability of each cluster satisfies $P_{M_k} \geq p_{th}$,

where $\mathcal{M}$ represents the set comprising the plurality of target node clusters, $M_k$ represents a target node cluster, $m_i$ represents an edge node, $c_{m_i}$ represents the storage capacity of $m_i$, $\bar{c}_{M_k}$ represents the average storage capacity of $M_k$, $f_{m_i}$ represents the FLOPS of $m_i$, $\bar{f}_{M_k}$ represents the average FLOPS of $M_k$, $P_{M_k}$ represents the cumulative transmission success probability of $M_k$, $p_{th}$ represents the preset probability threshold, $M_i$ represents the i-th target node cluster, $M_j$ represents the j-th target node cluster, and $\cap$ represents set intersection.

Optionally, $\bar{c}_{M_k}$ can be obtained by Equation 1:

$$\bar{c}_{M_k} = \frac{1}{|M_k|} \sum_{m_i \in M_k} c_{m_i} \quad \text{(Equation 1)}$$

where $|M_k|$ is the number of edge nodes included in the node cluster $M_k$.

Optionally, $\bar{f}_{M_k}$ can be obtained by Equation 2:

$$\bar{f}_{M_k} = \frac{1}{|M_k|} \sum_{m_i \in M_k} f_{m_i} \quad \text{(Equation 2)}$$

Optionally, the probabilities of successful data transmission of the edge nodes in a node cluster are processed through a probability calculation model to obtain the cumulative transmission success probability of the node cluster.

The probability calculation model is as follows:

$$P_{M_k} = 1 - \prod_{m_i \in M_k} \left(1 - p_{m_i}\right)$$

where $P_{M_k}$ represents the cumulative transmission success probability of $M_k$, $p_{m_i}$ represents the probability of successful data transmission of $m_i$, and $\prod(\cdot)$ represents the cumulative product.
In step 202, clustering is performed on the edge nodes based on the FLOPS, the storage capacity and the probability of successful data transmission to obtain K target node clusters, so that the edge nodes in each target node cluster satisfy a preset distance threshold and the cumulative probability of successful data transmission satisfies a preset probability threshold.
In detail, for a detailed description of obtaining K target node clusters, please refer to the embodiment of fig. 3.
And 203, performing set division processing on a plurality of convolution filters of the last convolution layer in a preset teacher model to obtain K filter sets.
Optionally, the preset teacher model is a pre-trained DNN model.
Optionally, each of the K filter sets comprises at least one convolution filter.
In particular, for a detailed description of the K filter sets, please refer to the embodiment of fig. 4.
In step 203, the set division process is performed on the multiple convolution filters of the last convolution layer in the preset teacher model to obtain K filter sets, so that the important filters can be uniformly distributed in each filter set.
Step 204, determining a model to be trained of each target node cluster based on the K target node clusters, the K filter sets and a plurality of preset models.
In detail, for a detailed description of determining the model to be trained for each target node cluster, please refer to the embodiment of fig. 5.
In step 204, based on the K target node clusters, the K filter sets, and a plurality of preset models, a model to be trained of each target node cluster is determined, so that the capability of each target node cluster can be adapted to the model to be trained.
Step 205, based on a plurality of preset sample data, performing joint training on the model to be trained of each target node cluster by adopting a knowledge distillation technology to obtain student models of each target node cluster;
optionally, knowledge distillation and a classical random gradient descent method are adopted, and based on a plurality of preset sample data, the model to be trained of each target node cluster is subjected to combined training, so that the student model of each target node cluster is obtained.
In the process of joint training, the calculation model of the loss function value is as follows:

$$L(\theta_S) = \alpha\, \mathcal{H}\!\left(y, P_S\right) + \beta\, \mathcal{H}\!\left(P_T^{\tau}, P_S^{\tau}\right) + \sum_{P \in \mathcal{P}} \mathcal{L}_{A}\!\left(A_T^{P}, A_S^{P}\right)$$

where $L(\theta_S)$ represents the loss function value; $\theta_S$ represents the parameter vector (comprising the parameters of the models to be trained of the target node clusters); $\mathcal{H}(\cdot,\cdot)$ represents the standard cross-entropy operation; $\mathcal{H}(y, P_S)$ represents the hard-label loss in the knowledge distillation loss function, with $y$ representing the true label vector (the true labels of the plurality of preset sample data) and $P_S$ representing the predicted label vector (the predicted labels output by the models to be trained of the target node clusters); $\mathcal{H}(P_T^{\tau}, P_S^{\tau})$ represents the soft-label loss in the knowledge distillation loss function, with $P_T^{\tau}$ representing the label probability distribution output by the softmax layer of the preset teacher model and $P_S^{\tau}$ representing the label probability distribution output by the softmax layers of the models to be trained of the target node clusters; $P$ represents a filter set, $\mathcal{P}$ represents the set comprising the K filter sets, $A_T^{P}$ represents the convolutional-layer activation value vector of $P$ in the preset teacher model, $A_S^{P}$ represents the convolutional-layer activation value vector of $P$ in the student model to be trained, and $\mathcal{L}_{A}(A_T^{P}, A_S^{P})$ represents the activation migration loss that transfers the knowledge of the convolution filters from the teacher model to the models to be trained; $\alpha$ represents the weight of the hard-label loss and $\beta$ represents the weight of the soft-label loss.
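By way of illustration, a simplified PyTorch-style sketch of this loss is given below; the soft-label term is written in the standard temperature-scaled KL-divergence form of knowledge distillation (which differs from the cross entropy only by a term constant with respect to the student), and the L2 distance between normalized activation vectors used for the activation-migration term is an assumption rather than the exact form of this embodiment.

import torch
import torch.nn.functional as F

def kd_joint_loss(y, student_logits, teacher_logits,
                  student_acts, teacher_acts, alpha, beta, tau):
    """Joint knowledge-distillation loss: hard-label loss + soft-label loss
    + activation-migration loss summed over the K filter sets.
    student_acts / teacher_acts: lists of per-filter-set activation tensors."""
    # Hard-label loss (cross entropy with the true labels), weight alpha.
    hard = F.cross_entropy(student_logits, y)
    # Soft-label loss (teacher vs. student softened distributions), weight beta.
    soft = F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                    F.softmax(teacher_logits / tau, dim=1),
                    reduction="batchmean") * (tau ** 2)
    # Activation-migration loss over the K filter sets (assumed L2 form).
    at = sum(F.mse_loss(F.normalize(a_s.flatten(1), dim=1),
                        F.normalize(a_t.flatten(1), dim=1))
             for a_s, a_t in zip(student_acts, teacher_acts))
    return alpha * hard + beta * soft + at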
Optionally, the models to be trained of the target node clusters satisfy the following construction and allocation rules: the filter sets $\{g(S_i)\}$ are obtained from $T_f$ according to Rule R and are pairwise disjoint, i.e. $g(S_i) \cap g(S_j) = \varnothing$ for $i \neq j$; the storage capacity requirement of each model to be trained does not exceed the target storage capacity of its node cluster, i.e. $c_{S_i} \leq c_{f(S_i)}$; and the models are allocated so as to minimize the deduction delay $t(S_i, f(S_i))$ of the target node clusters,

where $S$ represents the set of models to be trained of the target node clusters, $S_i$ represents the i-th model to be trained in $S$, $S_j$ represents the j-th model to be trained in $S$, $f(S_i)$ represents the correspondence between the i-th model to be trained and its target node cluster, $g(S_i)$ represents the correspondence between the i-th model to be trained and its filter-set partition, $T_f$ represents the set of the plurality of convolution filters in the last convolution layer of the preset teacher model, Rule R represents the preset rule to be satisfied when obtaining the K filter sets, $\cap$ represents the set intersection operation, $c_{S_i}$ represents the storage capacity requirement of the i-th model to be trained, $c_{f(S_i)}$ represents the target storage capacity of the node cluster corresponding to the i-th model to be trained, and $t(S_i, f(S_i))$ represents the deduction delay of the target node cluster executing $S_i$.
In particular, please refer to the embodiment of fig. 4 for a detailed description of Rule R.
Step 206, aiming at each target node cluster, respectively deploying the student models of the target node clusters into each edge node in the target node cluster; each edge node in each target node cluster is used to perform collaborative deduction when running a corresponding student model.
In the embodiment of fig. 2, clustering is performed on each edge node based on the FLOPS, the storage capacity and the probability of successful data transmission to obtain K target node clusters, so that the purpose of the clustering is to perform redundancy backup on a student model in the node clusters, so that under the condition that a fault occurs in one edge node in the node clusters, other edge nodes can continue deduction, and the accumulated successful transmission probability of each node cluster is ensured to meet the requirement, thereby improving the fault-facing elasticity of collaborative deduction. In addition, a plurality of convolution filters of the last convolution layer in a preset teacher model are subjected to set division processing to obtain K filter sets, and important filters are ensured to be uniformly distributed in the K filter sets, so that each filter set has equal importance on a deduction result. Further, when determining the model to be trained, comprehensively considering FLOPS and storage capacity of each edge node, and distributing a preset model and a filter set which are adaptive to the capacity of each node cluster to each node cluster, so that deduction time delay of each node cluster is close, resources are fully utilized, and deduction performance is improved.
Fig. 3 is a schematic flow chart of a method for obtaining K target node clusters provided by the present invention. As shown in fig. 3, the method includes:
Step 301, clustering each edge node based on FLOPS, storage capacity and probability of successful data transmission to obtain a plurality of initial node clusters.
In some embodiments, step 301 specifically includes:
arranging the plurality of edge nodes in ascending order of FLOPS, breaking ties by ascending order of storage capacity when the FLOPS are the same, to obtain a node set;
determining a first edge node in the node set as a cluster center node of a preset node cluster;
performing node division operation on each edge node except for the first edge node in the node set: acquiring an ith node cluster set; determining distances between the edge nodes and cluster center nodes of each node cluster respectively based on FLOPS and storage capacity of the edge nodes and FLOPS and storage capacity of the cluster center nodes of each node cluster in the ith node cluster set; arranging the node clusters according to the sequence from small to large distances to obtain a target node cluster set; based on the successful transmission probability of the data, determining the cumulative transmission success probability of each node cluster in the target node cluster set; if a first node cluster with the accumulated transmission success probability smaller than a preset probability threshold value and the distance between the cluster center node and the edge node smaller than a preset distance threshold value exists in the target node cluster set, dividing the edge node into the first node cluster to obtain an i+1th node cluster set, and updating the cluster center node of the first node cluster; otherwise, creating a node cluster, and dividing the edge node into the created node cluster to obtain an (i+1) th node cluster set, wherein the (i+1) th node cluster set comprises each node cluster in the (i) th node cluster set and the created node cluster;
Updating the ith node cluster set into an (i+1) th node cluster set, and repeatedly executing N times of node dividing operations to obtain a plurality of initial node clusters; wherein N is the total number of other edge nodes in the node set except the first edge node;
initially, i is equal to 1, and the ith node cluster set includes a preset node cluster.
The following describes the resulting node set with reference to specific example 1.
Example 1: given edge nodes m1, m2 and m3, if m1 has a FLOPS of 30M and a storage capacity of 32 gigabytes (GB), m2 has a FLOPS of 40M and a storage capacity of 128 GB, and m3 has a FLOPS of 40M and a storage capacity of 64 GB, then arranging m1, m2 and m3 in ascending order of FLOPS, breaking ties by ascending storage capacity, yields the node set {m1, m3, m2}.
Next, obtaining the (i+1)-th node cluster set is described with reference to Example 2.
Example 2: consider edge node m1 in the node set (other than the first edge node) and an i-th node cluster set {M1, M2, M3}. If the distance between the cluster core node of M1 and m1 is 30, the distance between the cluster core node of M2 and m1 is 20, the distance between the cluster core node of M3 and m1 is 21, and the preset distance threshold is 25, then the target node cluster set is {M2, M3, M1}. If the cumulative transmission success probability of M1 is 90%, that of M2 is 91%, that of M3 is 91%, and the preset probability threshold is 95%, then edge node m1 is divided into M2, yielding the (i+1)-th node cluster set, which comprises M3, M1 and the M2 into which m1 has been divided.
In some embodiments, determining the distance between the edge node and the cluster core node of each node cluster, respectively, based on the FLOPS and the storage capacity of the edge node, and the FLOPS and the storage capacity of the cluster core node of each node cluster in the ith node cluster set, comprises:
processing FLOPS and storage capacity of the edge nodes and FLOPS and storage capacity of cluster core nodes of each node cluster through a distance calculation model to obtain distances between the edge nodes and the cluster core nodes of each node cluster;
the distance calculation model is as follows:
wherein ,mi Representing edge nodes, M k Representing a node cluster in the ith node cluster set,represents M k Is a cluster core node of d represents m i And M is as follows k Distance between cluster core nodes, +.>Represents m i Storage capacity of>Represents M k Storage capacity of cluster core node, +.>Represents m i FLOPS, & gt>Represents M k FLOPS of cluster core nodes of (C).
In some embodiments, the probability calculation model is used to determine the cumulative probability of successful transmission of each node cluster in the target node cluster set based on the probability of successful transmission of data, which is not described herein.
Alternatively, step 301 may also be implemented by the following code:
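By way of illustration only, a simplified Python sketch of this clustering procedure is given below; the Euclidean distance over (storage capacity, FLOPS), the cluster-center update as the component-wise mean, and the helper names are assumptions rather than the exact listing of this embodiment.

import math

def distance(node, center):
    """Assumed distance in the (storage capacity, FLOPS) plane."""
    return math.hypot(node["storage"] - center["storage"],
                      node["flops"] - center["flops"])

def cluster_success(cluster):
    """Cumulative transmission success probability of a cluster
    (assuming the cluster succeeds if at least one node succeeds)."""
    prob_fail = 1.0
    for n in cluster:
        prob_fail *= 1.0 - n["p_tx"]
    return 1.0 - prob_fail

def center_of(cluster):
    """Assumed cluster-center update: component-wise mean of member nodes."""
    k = len(cluster)
    return {"storage": sum(n["storage"] for n in cluster) / k,
            "flops": sum(n["flops"] for n in cluster) / k}

def initial_clustering(nodes, p_th, d_th):
    """Step 301 sketch: greedy division of edge nodes into initial clusters."""
    # Sort by FLOPS ascending, breaking ties by storage capacity ascending.
    nodes = sorted(nodes, key=lambda n: (n["flops"], n["storage"]))
    clusters = [[nodes[0]]]                      # first node seeds the first cluster
    for node in nodes[1:]:
        # Order existing clusters by distance from the node to their centers.
        ordered = sorted(clusters, key=lambda c: distance(node, center_of(c)))
        for c in ordered:
            # Join the first cluster that still needs redundancy and is close enough.
            if cluster_success(c) < p_th and distance(node, center_of(c)) < d_th:
                c.append(node)
                break
        else:
            clusters.append([node])              # otherwise create a new cluster
    return clusters

# Hypothetical edge nodes: FLOPS (M), storage (GB), transmission success probability.
nodes = [{"flops": 30, "storage": 32, "p_tx": 0.90},
         {"flops": 40, "storage": 128, "p_tx": 0.91},
         {"flops": 40, "storage": 64, "p_tx": 0.91}]
clusters = initial_clustering(nodes, p_th=0.95, d_th=25)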
Step 302, adjusting the edge nodes in the plurality of initial node clusters to obtain K target node clusters.
In some embodiments, step 302 specifically includes:
performing a node adjustment operation: acquiring a plurality of i-th node clusters; determining, among the plurality of i-th node clusters, the target i-th node cluster with the smallest cumulative successful transmission probability; when the cumulative successful transmission probability of the target i-th node cluster is smaller than the preset probability threshold, determining the distance between each edge node in the target i-th node cluster and the cluster center node of each other node cluster, and dividing each such edge node into the other node cluster corresponding to the minimum distance, to obtain a plurality of (i+1)-th node clusters; wherein the other node clusters are the node clusters, among the plurality of i-th node clusters, other than the target i-th node cluster, and the cluster center node of each other node cluster is a node determined based on the FLOPS and storage capacities of the edge nodes in that node cluster;
updating the ith node clusters into the (i+1) th node clusters, repeatedly executing node adjustment operation until the respective accumulated successful transmission probabilities of the final plurality of node clusters are greater than or equal to a preset probability threshold, and determining the final plurality of node clusters as K target node clusters;
Initially, i is equal to 1, and the plurality of i-th node clusters are a plurality of initial node clusters.
The acquisition of the plurality of (i+1)-th node clusters is described below in connection with Example 3.
Example 3: the plurality of i-th node clusters includes M1, M2 and M3, where M3 has the smallest cumulative probability of successful transmission, i.e., M3 is the target i-th node cluster. When the cumulative probability of successful transmission of M3 is less than the preset probability threshold, and M3 includes edge nodes m1 and m2: the distance from m1 to M1 is 20 and from m1 to M2 is 25, so m1 is divided into M1 because its distance to the cluster center node of M1 is smallest; the distance from m2 to M1 is 21 and from m2 to M2 is 19, so m2 is divided into M2 because its distance to the cluster center node of M2 is smallest.
Alternatively, step 302 may also be implemented by the following code:
sort the plurality of node clusters in ascending order of cumulative successful transmission probability;
while the cumulative successful transmission probability of the first node cluster M1 is less than the preset probability threshold do
    divide each edge node of M1 into the other node cluster corresponding to the minimum distance between that edge node and the cluster center nodes of the other node clusters;
    re-sort the plurality of node clusters in ascending order of cumulative successful transmission probability;
end while
Return // output the K target node clusters
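By way of illustration, a self-contained Python sketch of this adjustment loop; the node representation mirrors the step-301 sketch above, and the distance and cluster-center definitions are again assumptions.

def adjust_clusters(clusters, p_th):
    """Step 302 sketch: dissolve the cluster with the lowest cumulative
    success probability and re-assign its nodes until every remaining
    cluster meets the preset probability threshold p_th."""
    def success(c):
        prob_fail = 1.0
        for n in c:
            prob_fail *= 1.0 - n["p_tx"]
        return 1.0 - prob_fail

    def center(c):
        return (sum(n["storage"] for n in c) / len(c),
                sum(n["flops"] for n in c) / len(c))

    def dist(n, c):
        cs, cf = center(c)
        return ((n["storage"] - cs) ** 2 + (n["flops"] - cf) ** 2) ** 0.5

    while True:
        clusters = sorted(clusters, key=success)     # ascending cumulative probability
        worst = clusters[0]
        if success(worst) >= p_th or len(clusters) == 1:
            return clusters                          # every cluster meets the threshold
        others = clusters[1:]
        for node in worst:
            # Re-assign each node of the worst cluster to the nearest other cluster.
            target = min(others, key=lambda c: dist(node, c))
            target.append(node)
        clusters = others                            # drop the now-empty worst cluster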
In the embodiment of fig. 3, clustering is performed on the edge nodes based on the FLOPS, the storage capacity and the probability of successful data transmission to obtain a plurality of initial node clusters, and the edge nodes in the plurality of initial node clusters are adjusted to obtain K target node clusters. This maximizes intra-cluster similarity, so that resources are fully utilized when the same model is deployed and the deduction completion delays of the nodes in the same cluster are similar; moreover, when some edge node in a target node cluster cannot operate because its battery is exhausted or communication fails, the other normally operating edge nodes in that cluster serve as redundant nodes, so that the deduction of the student model is not disrupted, thereby improving deduction robustness.
Fig. 4 is a schematic flow chart of a method for obtaining K filter sets according to the present invention. As shown in fig. 4, the method includes:
step 401, obtaining average activation values of each of a plurality of convolution filters.
The average activation value is an importance evaluation index of the convolution filter.
Optionally, the average activation value of each convolution filter may be obtained as follows: for each convolution filter, preset verification data is input, the verification values output on each channel of the filter are obtained, and the average scalar value of these verification values is determined as the average activation value of the convolution filter.
Step 402, determining an adjacency weight matrix of the target graph based on the average activation value; wherein the target graph is a complete graph constructed based on a plurality of convolution filters.
Optionally, by a graph partitioning (Graph Partition) method, a complete graph $G = G(V, E)$ is constructed based on the plurality of convolution filters in the preset teacher model, where $V$ represents the vertex set with $V = T_f$, and $E = \{e_{i,j}\}$ represents the edge set, $e_{i,j}$ denoting the edge between the i-th convolution filter $T_{f_i}$ and the j-th convolution filter $T_{f_j}$.
Optionally, the elements in the adjacency weight matrix of the target graph are edge weights between convolution filters in the target graph.
Alternatively, the edge weights between the convolution filters in the target graph can be obtained by the following equation 3:
$$w_{ij} = w_{ji} = \sum_{val} a_i\, a_j\, \lvert a_i - a_j \rvert \quad \text{(Equation 3)}$$

where $w_{ij}$ represents the edge weight between convolution filters $T_{f_i}$ and $T_{f_j}$ in the target graph, $w_{ji}$ represents the edge weight between $T_{f_j}$ and $T_{f_i}$, $a_i$ represents the average activation value of convolution filter $T_{f_i}$, $a_j$ represents the average activation value of convolution filter $T_{f_j}$, and $\sum_{val}$ represents summation over the preset verification data.
Alternatively, steps 401 to 402 may also be implemented by the following codes.
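By way of illustration, a NumPy sketch of computing the average activation values (step 401) and the adjacency weight matrix of Equation 3 (step 402) is given below; the activation tensor layout and the per-sample summary of filter activity are assumptions about implementation details.

import numpy as np

def average_activations(activations):
    """Average activation value of each convolution filter.
    `activations` is assumed to have shape (num_samples, num_filters, H, W):
    the last convolution layer's outputs on the preset verification data."""
    return activations.mean(axis=(0, 2, 3))           # one scalar per filter

def adjacency_weights(activations):
    """Adjacency weight matrix of the complete filter graph (Equation 3):
    w_ij = sum over verification data of a_i * a_j * |a_i - a_j|,
    where a_i is the per-sample mean activation of filter i (an assumption
    about how the per-sample activity is summarized)."""
    per_sample = activations.mean(axis=(2, 3))         # (num_samples, num_filters)
    n_filters = per_sample.shape[1]
    w = np.zeros((n_filters, n_filters))
    for i in range(n_filters):
        for j in range(n_filters):
            if i != j:
                a_i, a_j = per_sample[:, i], per_sample[:, j]
                w[i, j] = np.sum(a_i * a_j * np.abs(a_i - a_j))
    return w

# Hypothetical activations of 64 filters on 100 verification samples.
acts = np.random.rand(100, 64, 8, 8)
a_avg = average_activations(acts)
w = adjacency_weights(acts)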
And step 403, dividing the target graph based on the adjacency weight matrix and K by a standard division algorithm of spectral clustering to obtain K filter sets.
It can be seen from Equation 3 that the closer the average activation values of $T_{f_i}$ and $T_{f_j}$, the smaller the edge weight $w_{ij}$. Since the average activation value is the importance evaluation index of a convolution filter, the importance of the convolution filters and their distribution in the complete graph $G = G(V, E)$ satisfy: a convolution filter of high importance is distributed closer to convolution filters of ordinary importance, while two convolution filters of similar importance are distributed farther apart.
Based on the above distribution relation, the complete graph $G = G(V, E)$ is segmented by the canonical segmentation algorithm of spectral clustering such that the sum of the edge weights cut by the segmentation is minimized and the sum of the edge weights within each subgraph is maximized, yielding K subgraphs; all convolution filters included in one subgraph are taken as one filter set, thereby obtaining K filter sets.
Alternatively, step 403 may also be implemented by the following code:
// w represents the adjacency weight matrix (composed of the $w_{ij}$) of the complete graph constructed based on the plurality of convolution filters; the target graph is divided by the canonical segmentation algorithm of spectral clustering to obtain P, where P represents the set comprising the K filter sets.
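By way of illustration, the following Python sketch performs the division with scikit-learn's spectral clustering on a precomputed affinity matrix; the use of SpectralClustering and its parameters are assumptions about one possible implementation of the canonical segmentation.

import numpy as np
from sklearn.cluster import SpectralClustering

def partition_filters(w, K):
    """Divide the complete filter graph into K filter sets based on the
    adjacency weight matrix w, using spectral clustering with a precomputed
    affinity (an assumed stand-in for the canonical segmentation algorithm)."""
    labels = SpectralClustering(n_clusters=K,
                                affinity="precomputed",
                                assign_labels="discretize",
                                random_state=0).fit_predict(w)
    # P[k] collects the indices of the convolution filters in the k-th set.
    return [np.where(labels == k)[0].tolist() for k in range(K)]

# Example with a hypothetical 64x64 weight matrix and K = 4 filter sets.
w = np.abs(np.random.rand(64, 64))
w = (w + w.T) / 2             # symmetric, non-negative affinity as required
np.fill_diagonal(w, 0.0)
P = partition_filters(w, K=4)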
It should be noted that Rule R consists of steps 401 to 403.
In the embodiment of fig. 4, the target graph is segmented based on the adjacency weight matrix and K by using a canonical segmentation algorithm of spectral clustering, so as to obtain K filter sets, so that important convolution filters in the filter sets can be uniformly distributed among the K filter sets, the defect that the deduction precision is affected due to the fact that the importance difference of the convolution filter sets is large is avoided, and the deduction precision is improved.
Fig. 5 is a schematic flow chart of a method for determining a model to be trained of each target node cluster according to the present invention.
As shown in fig. 5, the method includes:
step 501, determining K target preset models in a plurality of preset models based on the target storage capacity of each target node cluster and the storage capacity requirement of each preset model; the target storage capacity of the target node cluster is the minimum storage capacity of at least one edge node in the corresponding target node cluster.
Optionally, for each target node cluster, determining a plurality of to-be-selected models in a plurality of preset models based on the target storage capacity of the target node cluster and the storage capacity requirement of each preset model; the storage capacity requirements of each of the multiple candidate models are smaller than the target storage capacity of the target node cluster;
And determining one model randomly selected from the multiple models to be selected as a target preset model.
For example, when the target node cluster is M1 and M1 includes edge nodes m1, m2 and m3, if the storage capacity of m1 is 32 GB, that of m2 is 64 GB and that of m3 is 16 GB, then the target storage capacity of M1 is 16 GB.
Step 502, constructing K models to be trained based on K target preset models and K filter sets.
Optionally, the K target preset models are sorted in ascending order of their number of floating-point operations (FLOPs) to obtain the sorted K target preset models;
sorting the K filter sets in ascending order of their sizes to obtain the sorted K filter sets;
according to the order from small to large, performing one-to-one correspondence on the K ordered target preset models and the K ordered filter sets to obtain filter sets corresponding to the target preset models;
aiming at each target preset model, a model to be trained is constructed based on a filter set corresponding to the target preset model and the target preset model.
The size of the filter set is the number of convolution filters included in the filter set.
For example, if there are 3 ordered target preset models { Y1, Y3, Y2}, 3 ordered filter sets { P3, P1, P2}, the 3 models to be trained that are constructed are s1=buildmodel (Y1, P3), s2=buildmodel (Y3, P1), and s3=buildmodel (Y2, P2). The buildmodel represents replacing the filter of the last convolution layer in the target preset model with a corresponding filter set. For example, buildmodel (Y3, P1) indicates that the filter of the last convolutional layer in Y3 is replaced with a filter set P1.
Alternatively, steps 501 to 502 may also be implemented by the following code:
for i = 1 → K do
    select a preset model according to the target storage capacity of target node cluster M_i;
end for
construct the models to be trained based on the FLOPs of the selected preset models and the sizes of the corresponding filter sets.
Step 503, determining a set of models to be selected of each target node cluster based on the target storage capacity of each target node cluster and the K models to be trained.
Optionally, for each target node cluster, determining a model to be trained with a storage capacity smaller than the target storage capacity of the target node cluster from the K models to be trained as a model set to be selected of the target node cluster.
For example, when the target node cluster is M1 and the three models to be trained are S1, S2 and S3, if the target storage capacity of M1 is 16 GB and the storage capacity requirements of S1, S2 and S3 are 2 GB, 32 GB and 12 GB respectively, then the candidate model set of M1 is {S1, S3}.
Alternatively, step 503 may also be implemented by the following code:
for i = 1 → K do
    for each model to be trained S_j ∈ S: if the storage capacity requirement of S_j is less than or equal to the target storage capacity of target node cluster M_i, add S_j to the candidate model set of M_i; // S represents the set of the K models to be trained
end for
step 504, sorting each target node cluster based on the number of to-be-trained models included in the to-be-selected model set and the average FLOPS of each target node cluster, so as to obtain a node cluster set.
Optionally, determining, for each target node cluster, a sum of FLOPS of edge nodes in the target node cluster; the ratio of the sum value to the total number of edge nodes in the target node cluster is determined as the average FLOPS of the target node cluster.
Optionally, all target node clusters are sorted in ascending order of the number of models to be trained included in their candidate model sets, and, when the numbers are the same, in ascending order of the average FLOPS of the target node clusters, to obtain the node cluster set.
For example, for target node clusters M1, M2 and M3, if the average FLOPS of M1 is 10M, that of M2 is 15M and that of M3 is 20M, and the candidate model set of M1 includes 4 models to be trained while the candidate model sets of M2 and M3 each include 3 models to be trained, then sorting M1, M2 and M3 yields the node cluster set {M2, M3, M1}.
Step 505, determining, for each target node cluster in the node cluster set, a model to be trained that satisfies a preset condition as a model to be trained of the target node cluster when the model to be trained that satisfies the preset condition exists in the set of models to be selected of the target node cluster.
The preset conditions include at least one of the following:
the model set to be selected comprises a model to be trained;
the number of floating-point operations (FLOPs) of the model to be trained is the smallest;
no corresponding target node cluster is allocated.
For example, in the node cluster set {M2, M3, M1}, the candidate model set of M2 is {S1, S2}, that of M3 is {S4}, and that of M1 is {S1, S2, S3}. If the FLOPs of S1 is 20M, the FLOPs of S2 is 25M and the FLOPs of S3 is 20M, then the model to be trained of M2 is S1, that of M3 is S4, and that of M1 is S3.
Alternatively, steps 504 to 505 may also be implemented by the following code:
// sort the target node clusters in ascending order of the number of models to be trained included in their candidate model sets, breaking ties (when the numbers are the same) by ascending order of the average FLOPS of the target node clusters;
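By way of illustration, the following Python sketch strings steps 501 to 505 together; the dictionary fields, the first-fit choice in step 501 (the embodiment allows a random choice among all fitting models) and the tie-breaking details are assumptions.

def allocate_models(clusters, preset_models, filter_sets):
    """Steps 501-505 sketch: pick K preset models that fit the clusters'
    target storage capacities, pair them with the K filter sets, and
    assign one model to be trained to each target node cluster.
    Assumes at least one preset model fits every cluster."""
    # Step 501: one fitting preset model per cluster (first fit here).
    chosen = []
    for c in clusters:
        fitting = [m for m in preset_models if m["storage_req"] <= c["target_storage"]]
        chosen.append(fitting[0])
    # Step 502: pair models (ascending FLOPs) with filter sets (ascending size).
    chosen.sort(key=lambda m: m["flops"])
    sets_sorted = sorted(filter_sets, key=len)
    candidates = [{"model": m, "filters": p, "assigned": False,
                   "storage_req": m["storage_req"], "flops": m["flops"]}
                  for m, p in zip(chosen, sets_sorted)]
    # Step 503: candidate set of each cluster = models it can store.
    for c in clusters:
        c["candidates"] = [s for s in candidates
                           if s["storage_req"] <= c["target_storage"]]
    # Step 504: order clusters by candidate-set size, ties by average FLOPS.
    clusters.sort(key=lambda c: (len(c["candidates"]), c["avg_flops"]))
    # Step 505: give each cluster the unassigned candidate with minimum FLOPs.
    for c in clusters:
        free = [s for s in c["candidates"] if not s["assigned"]]
        if free:
            pick = min(free, key=lambda s: s["flops"])
            pick["assigned"] = True
            c["model_to_train"] = pick
    return clusters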
In the embodiment of fig. 5, K target preset models are determined among the plurality of preset models based on the target storage capacity of each target node cluster and the storage capacity requirement of each preset model, and K models to be trained are built from the K target preset models and the K filter sets. For each target node cluster in the node cluster set, when a model to be trained satisfying the preset conditions exists in the candidate model set of the target node cluster, that model is determined as the model to be trained of the target node cluster. This respects the storage capacity limitation of each edge node while allocating the models to be trained to the target node clusters so as to minimize the deduction delay, and it avoids the waste of computing resources caused by the uniform division manner in the prior art, thereby maximizing the utilization of computing resources.
In some embodiments, after determining the model to be trained of the target node cluster, tuning operation may also be performed on the model to be trained, and the tuning process is described below in connection with the embodiment of fig. 6.
Fig. 6 is a schematic flow chart of a method for tuning a model to be trained of a target node cluster. As shown in fig. 6, the method includes:
step 601, determining deduction time delay of a model to be trained of each target node cluster.
Alternatively, the deduction delay is determined by the following equation 4:
$$t(S_i, M_k) = \frac{F_{S_i}}{\bar{f}_{M_k}} + \frac{D_{S_i}}{\bar{r}_{M_k}} \quad \text{(Equation 4)}$$

where $t(S_i, M_k)$ represents the deduction delay of the target node cluster $M_k$ executing the model to be trained $S_i$; $F_{S_i}$ represents the FLOPs of $S_i$; $\bar{f}_{M_k}$ represents the average FLOPS of $M_k$; $D_{S_i}$ represents the amount of result data of the filter set corresponding to $S_i$ (i.e., the total number of bits of the output data of all convolution filters in the filter set); and $\bar{r}_{M_k}$ represents the average data transmission rate from the edge nodes of $M_k$ to the data node (the average of the data transmission rates from each edge node to the data node, these rates being obtained in advance).
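For example, a direct transcription of Equation 4 as reconstructed above; units must be consistent between FLOPs and FLOPS and between data amount and transmission rate (an assumption about how the quantities are expressed).

def deduction_delay(model_flops, cluster_avg_flops, result_bits, cluster_avg_rate):
    """Equation 4 sketch: computation delay plus communication delay.
    model_flops       - FLOPs of the model to be trained S_i
    cluster_avg_flops - average FLOPS of the target node cluster M_k
    result_bits       - total output bits of the filter set of S_i
    cluster_avg_rate  - average edge-node-to-data-node transmission rate (bit/s)"""
    return model_flops / cluster_avg_flops + result_bits / cluster_avg_rate

# Hypothetical numbers: a 40 MFLOPs model on a 20 MFLOPS cluster,
# 0.5 Mbit of result data over a 2 Mbit/s average link.
t = deduction_delay(40e6, 20e6, 0.5e6, 2e6)   # 2.0 s + 0.25 s = 2.25 s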
Step 602, determining at least one second node cluster set comprising the to-be-trained models with the same structure in the to-be-trained models of each target node cluster, and performing partition adjustment processing on each second node cluster set until the partition is stable, so as to obtain a first intermediate to-be-trained model of each target node cluster.
Models to be trained with the same structure are models to be trained whose layers are the same except for the last convolution layer.
For example, when the target node clusters include M1, M2, M3 and M4, if the models to be trained of M1 and M2 have the same structure, M1 and M2 are divided into one second node cluster set; if the models to be trained of M3 and M4 have the same structure, M3 and M4 are divided into another second node cluster set.
Specifically, the partition adjustment processing includes:
step 6021, determining the maximum deduction time delay t for each second node cluster set maX Corresponding first node cluster M I And a minimum deduction time delay t MIN Corresponding second node cluster M J
Step 6022, judging whether the maximum integer k satisfies a preset adjustment condition:
will M I The last convolved filter set P in the corresponding model to be trained i Of the k convolution filters divided into M j The last convolved filter set P in the corresponding model to be trained j In (1) can make M j The corresponding deduction time delay does not exceed M I Corresponding deduction time delay, but for any k'>k, divide P i Of (1) are convolved with filters to P j Will lead to M j Corresponding deduction time delay exceeds M i Corresponding deduction time delay;
step 6023, if yes, based on the spectral clustering principle, comparing P i K convolution filters with minimum correlation among the k convolution filters are divided into P j
Step 6024, repeating the steps 6021 to 6023 until the partition is stable, and obtaining a first intermediate model to be trained of each target node cluster.
M j The corresponding deduction time delay is M j Execution includes P i The deduction time delay of the module to be trained of the convolution filter.
M i The corresponding deduction time delay is M i Performing division into P j The deduction time delay of the module to be trained outside the convolution filter.
In the application, judging whether the maximum integer k meets the preset adjustment condition or not can minimize the influence of the adjustment on the importance of the filter set, and the filter set is adjusted to enable the deduction time delay of the two clusters to be closest, thereby reducing t max
Alternatively, step 602 may also be implemented by the following code.
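By way of illustration, the following Python sketch outlines the partition-adjustment loop; the delay function, the bounded loop and the choice of the k least-correlated filters by their total edge weight within P_i are assumptions rather than the exact listing of this embodiment.

def adjust_partitions(group, w, delay_fn):
    """Step 602 sketch for one second node cluster set.
    group   : list of dicts {"cluster": ..., "filters": [filter indices]}
    w       : adjacency weight matrix of the filter graph
    delay_fn: delay_fn(cluster, filters) -> deduction delay of that cluster
              when its model carries the given filter set."""
    for _ in range(100):                          # bounded for safety in this sketch
        group.sort(key=lambda g: delay_fn(g["cluster"], g["filters"]))
        fast, slow = group[0], group[-1]
        if fast is slow:
            return group
        # Largest k such that moving k filters from the slow cluster's set
        # keeps the fast cluster's delay within the slow cluster's delay.
        k = 0
        while k < len(slow["filters"]) - 1:
            candidate = slow["filters"][:k + 1]
            if delay_fn(fast["cluster"], fast["filters"] + candidate) <= \
                    delay_fn(slow["cluster"], slow["filters"][k + 1:]):
                k += 1
            else:
                break
        if k == 0:
            return group                          # partition is stable
        # Move the k filters least tied to the rest of P_i (assumed: smallest
        # total edge weight to the other filters of the slow cluster's set).
        ranked = sorted(slow["filters"],
                        key=lambda f: sum(w[f][g] for g in slow["filters"] if g != f))
        moved = set(ranked[:k])
        slow["filters"] = [f for f in slow["filters"] if f not in moved]
        fast["filters"] = fast["filters"] + list(moved)
    return group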
Step 603, determining the deduction delay of the first intermediate model to be trained of each target node cluster, and determining the target maximum deduction delay t_m among these deduction delays.
Step 604, adjusting the first intermediate model to be trained of each target node cluster into a second intermediate model to be trained that has larger FLOPs while its deduction delay does not exceed the target maximum deduction delay t_m.
Alternatively, step 604 may be implemented by the following code.
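By way of illustration, a short Python sketch of the model-upgrade check of step 604; the dictionary fields and the choice of the largest-FLOPs feasible preset model are assumptions.

def upgrade_model(cluster, current_model, preset_models, t_m, delay_fn):
    """Step 604 sketch: replace the first intermediate model with a preset
    model of larger FLOPs whose deduction delay still does not exceed t_m."""
    candidates = [m for m in preset_models
                  if m["flops"] > current_model["flops"]
                  and delay_fn(cluster, m) <= t_m]
    if not candidates:
        return current_model                      # model adjustment condition not met
    return max(candidates, key=lambda m: m["flops"])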
Meeting the model adjustment condition indicates that there exists, among the plurality of preset models, a preset model whose FLOPs are larger than those of the first intermediate model to be trained and whose deduction delay does not exceed t_m.
The second intermediate model to be trained of the target node cluster is such a preset model, i.e., a preset model among the plurality of preset models whose FLOPs are larger than those of the first intermediate model to be trained and whose deduction delay does not exceed t_m.
And step 605, carrying out partition adjustment processing on the second intermediate model to be trained of each target node cluster until the partition is stable, and obtaining the target model to be trained of each target node cluster.
Specifically, the specific process of step 605 is the same as the specific process of step 602, and will not be described here again.
Further, based on a plurality of preset sample data, the target to-be-trained models of all target node clusters are subjected to joint training, and the student models of all target node clusters are obtained.
In the embodiment of Fig. 6, a portion of the filters may be moved from the node cluster with the largest deduction delay to the node cluster with the smallest deduction delay, so as to balance the largest and smallest deduction delays and thereby reduce the overall deduction delay. In addition, the partition adjustment processing based on the spectral clustering principle compensates for large size differences among the filter sets, further improving deduction accuracy and robustness. Further, when determining the models to be trained, the FLOPS, storage capacity and data transmission rate of each edge node are comprehensively considered, and a preset model and a filter set adapted to the capability of each node cluster are allocated, so that the deduction time delays of the node clusters are close to one another, resources are fully utilized, and deduction performance is improved.
Further, in step 604, a model with stronger learning ability can be adopted without affecting the maximum deduction delay, which is beneficial to improving the computing resource utilization rate of the target node cluster and maximizing the deduction accuracy.
The performance of the knowledge-distillation-based deep learning model collaborative deduction method according to the present invention is described below with reference to Tables 1 and 2 and Figs. 7 to 9. Based on a teacher model trained on the CIFAR-10 data set, the simulation deduction results of the method provided by the present invention (denoted my_NoNN) are compared with those of my_baseline, non_Redundancy and uniform_NoNN.
The my_baseline scheme is close to my_NoNN in concept, likewise taking into account the heterogeneity of the edge network and the robustness of the deduction. my_baseline differs from my_NoNN as follows: (1) the node clustering process in my_baseline is guided by the average successful transmission probability; (2) the construction and distribution process of the models to be trained strictly satisfies the deduction delay constraint. That is, with the communication delay threshold and the calculation delay threshold given as constraints, the size of the filter set and the model structure allocated to each node cluster are determined according to formula 4, and the plurality of convolution filters are then allocated one by one in order of importance.
The non_Redundancy scheme only considers the differences in storage capacity and FLOPS of the edge nodes; it neither sets redundant nodes within a node cluster to improve robustness nor includes an edge-node clustering process.
The uniform_NoNN scheme considers neither the differences in storage capacity and FLOPS of the edge nodes nor the clustering of edge nodes into node clusters with redundant nodes to improve robustness; instead, each edge node runs its own student model, the filter sets corresponding to the student models are of equal size, and the preset model structures adopted are the same.
my_NoNN, the teacher model (Teacher), non_Redundancy and uniform_NoNN are all implemented based on the PyTorch framework, and all simulation deductions use 8 edge nodes.
Table 1 shows exemplary simulated collaborative deduction performance comparison results for Teacher, my_NoNN, my_baseline, non_Redundancy and uniform_NoNN.
TABLE 1
It can be seen from Table 1 that my_NoNN obtains student models whose parameters and FLOPs are significantly reduced relative to the teacher model, while the loss of deduction accuracy caused by the differences in storage capacity and FLOPs of the edge nodes is less than 0.7%, and the deduction accuracy of my_NoNN is the highest compared with the non_Redundancy and uniform_NoNN methods.
Fig. 7 is one of the schematic diagrams comparing collaborative deduction performance simulation results of the knowledge-distillation-based deep learning model collaborative deduction method provided by the present invention. As shown in Fig. 7, it includes the simulation results obtained with my_NoNN, Teacher, non_Redundancy and uniform_NoNN.
In the simulation deduction of Fig. 7, faulty edge nodes are introduced to test the robustness of each scheme. As the number of faulty nodes increases, more deduction results are lost, so the deduction accuracy of my_NoNN, non_Redundancy and uniform_NoNN all show a marked declining trend. However, the accuracy decline curve of my_NoNN is significantly flatter than those of uniform_NoNN and non_Redundancy, the two schemes that do not introduce redundant student models. With 4 faulty edge nodes, my_NoNN still reaches an average deduction accuracy of 88.08%, which shows that my_NoNN can effectively improve deduction accuracy and deduction robustness.
Fig. 8 is a second comparison diagram of collaborative deduction performance simulation results of the knowledge-distillation-based deep learning model collaborative deduction method provided by the present invention. As shown in Fig. 8, it includes the simulation results obtained by running non_Redundancy, my_baseline, uniform_NoNN and my_NoNN on edge nodes at 6 heterogeneity grades.
The results shown in Fig. 8 were obtained based on the 6 heterogeneity grades in Table 2.
TABLE 2
Heterogeneity grade                              0     1     2     3     4     5
FLOPS maximum difference                         0    10    15    20    25    30
Maximum difference in data transmission rate     0   100   200   300   400   500
It can be seen from Table 2 and Fig. 8 that the collaborative deduction performance of my_NoNN is substantially identical to that of the other three schemes when all devices within the cluster are homogeneous. As the heterogeneity of the edge node set increases, the deduction time delay of uniform_NoNN, which ignores this heterogeneity and assigns almost the same preset model to every edge node, increases markedly; the other three schemes adapt well to the heterogeneity of the edge node set, and their deduction time delays change gently. In the tuning stage, my_NoNN iteratively adjusts the filter sets and model structures to match the heterogeneity of the edge node set, and therefore obtains the minimum deduction time delay at every heterogeneity grade.
Fig. 9 is a third comparison diagram of collaborative deduction performance simulation results of the knowledge-distillation-based deep learning model collaborative deduction method according to the present invention. As shown in Fig. 9, it includes the simulation results obtained with my_NoNN, Teacher, non_Redundancy and uniform_NoNN.
In the simulation deduction of Fig. 9, the average transmission success probability of the edge node set and the preset probability threshold p_th are varied.
It can be seen from Fig. 9 that as the average transmission success probability of the edge node set increases, meaning that the communication condition becomes better, the deduction delay of the system gradually decreases under every p_th. This is because the better the communication condition, the less redundancy the student models need; the number of node clusters therefore increases, the filter sets become smaller, and the deduction delay becomes lower. For a given average transmission success probability of the edge node set, a larger p_th improves system robustness but generally results in a larger deduction delay. The choice of p_th therefore needs to weigh robustness against deduction delay.
The knowledge-distillation-based deep learning model collaborative deduction device provided by the present invention is described below; the device described below and the knowledge-distillation-based deep learning model collaborative deduction method described above may be referred to in correspondence with each other.
Fig. 10 is a schematic structural diagram of a deep learning model collaborative deduction device based on knowledge distillation. As shown in fig. 10, the deep learning model collaborative deduction device based on knowledge distillation includes:
an obtaining module 1010, configured to obtain the floating point operations per second (FLOPS), storage capacity, and probability of successful data transmission corresponding to each edge node;
a clustering module 1020, configured to perform clustering processing on the edge nodes based on the FLOPS, the storage capacity, and the probability of successful data transmission, to obtain K target node clusters;
the dividing module 1030 is configured to perform set division processing on a plurality of convolution filters of a last convolution layer in a preset teacher model, so as to obtain K filter sets;
a determining module 1040, configured to determine a model to be trained of each target node cluster based on the K target node clusters, the K filter sets, and a plurality of preset models;
the training module 1050 is configured to perform joint training on the model to be trained of each target node cluster by using a knowledge distillation technology based on a plurality of preset sample data, so as to obtain a student model of each target node cluster;
a deployment module 1060, configured to deploy, for each target node cluster, a student model of the target node cluster to each edge node in the target node cluster; and each edge node in each target node cluster is used for executing collaborative deduction when a corresponding student model is operated.
According to the deep learning model collaborative deduction device based on knowledge distillation provided by the invention, the clustering module 1020 is specifically configured to:
clustering each edge node based on the FLOPS, the storage capacity and the probability of successful data transmission to obtain a plurality of initial node clusters;
and adjusting the edge nodes in the initial node clusters to obtain the K target node clusters.
According to the deep learning model collaborative deduction device based on knowledge distillation provided by the invention, the clustering module 1020 is specifically configured to:
arranging the plurality of edge nodes according to the sequence from small to large of the FLOPS and the sequence from small to large of the storage capacity under the condition that the FLOPS is the same to obtain a node set;
determining a first edge node in the node set as a cluster center node of a preset node cluster;
performing node division operation for each edge node except the first edge node in the node set: acquiring an ith node cluster set; determining distances between the edge node and cluster center nodes of each node cluster respectively based on FLOPS and storage capacity of the edge node and FLOPS and storage capacity of the cluster center nodes of each node cluster in the ith node cluster set; arranging the node clusters according to the sequence from small to large of the distances to obtain a target node cluster set; based on the probability of successful data transmission, determining the cumulative probability of successful data transmission of each node cluster in the target node cluster set; if a first node cluster with the accumulated transmission success probability smaller than the preset probability threshold and the distance between the cluster center node and the edge node smaller than the preset distance threshold exists in the target node cluster set, dividing the edge node into the first node cluster to obtain an i+1th node cluster set, and updating the cluster center node of the first node cluster; otherwise, creating a node cluster, and dividing the edge node into the newly created node cluster to obtain an i+1th node cluster set, wherein the i+1th node cluster set comprises each node cluster in the i node cluster set and the newly created node cluster;
Updating the ith node cluster set into the (i+1) th node cluster set, and repeatedly executing the node dividing operation for N times to obtain a plurality of initial node clusters; wherein N is the total number of other edge nodes in the node set except the first edge node;
initially, i is equal to 1, and the ith node cluster set comprises the preset node cluster.
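By way of illustration, the node division operation of the clustering module can be sketched as follows in Python. The helpers distance(node, center_node) and cumulative_success_prob(cluster) are assumed here and sketched after the corresponding model descriptions below; edge-node objects with .flops and .storage attributes, and the simplified update of the cluster center node, are likewise assumptions.

def divide_nodes(nodes, p_th, d_th, distance, cumulative_success_prob):
    # Arrange nodes by ascending FLOPS, breaking ties by ascending storage capacity.
    nodes = sorted(nodes, key=lambda n: (n.flops, n.storage))
    clusters = [[nodes[0]]]      # the first edge node seeds the preset node cluster
    centers = [nodes[0]]         # its cluster center node
    for node in nodes[1:]:
        # Node clusters ordered by the distance from this node to their center nodes.
        order = sorted(range(len(clusters)), key=lambda i: distance(node, centers[i]))
        for i in order:
            if (cumulative_success_prob(clusters[i]) < p_th
                    and distance(node, centers[i]) < d_th):
                clusters[i].append(node)   # join the first qualifying node cluster
                centers[i] = node          # placeholder update of the cluster center node
                break
        else:
            clusters.append([node])        # otherwise create a new node cluster
            centers.append(node)
    return clusters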
According to the deep learning model collaborative deduction device based on knowledge distillation provided by the invention, the clustering module 1020 is specifically configured to:
processing FLOPS and storage capacity of the edge node and FLOPS and storage capacity of the cluster center nodes of each node cluster through a distance calculation model to obtain distances between the edge node and the cluster center nodes of each node cluster;
the distance calculation model is as follows:
wherein m_i represents the edge node, M_k represents the node cluster, and d represents the distance between the edge node and the cluster center node of the node cluster; the remaining symbols in the formula represent, respectively, the storage capacity of the edge node, the storage capacity of the cluster center node of the node cluster, the FLOPS of the edge node, and the FLOPS of the cluster center node of the node cluster.
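Since the formula itself is not reproduced in this text, the following is only one plausible reading of the distance calculation model: a Euclidean distance over the FLOPS and storage-capacity coordinates of the edge node and the cluster center node. This is an assumption; in practice the two quantities would likely be normalized to comparable scales first.

import math

def distance(node, center_node):
    # d(m_i, M_k) built from the FLOPS gap and the storage-capacity gap between the
    # edge node m_i and the cluster center node of M_k (attributes are hypothetical).
    return math.hypot(node.flops - center_node.flops,
                      node.storage - center_node.storage)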
According to the deep learning model collaborative deduction device based on knowledge distillation provided by the invention, the clustering module 1020 is specifically configured to:
processing the successful transmission probability of the data of each edge node in the node cluster by a probability calculation model aiming at each node cluster to obtain the cumulative transmission success probability of the node cluster;
the probability calculation model is as follows:
wherein ,representing the cumulative transmission success probability of the node cluster, M k Representing the node cluster, m i Representing edge nodes in said cluster of nodes, < >>Representing the probability of successful data transmission of the edge node, and pi (·) represents the accumulation operation.
According to the deep learning model collaborative deduction device based on knowledge distillation provided by the invention, the clustering module 1020 is specifically configured to:
performing node adjustment operations: acquiring a plurality of ith node clusters; determining a target ith node cluster with the smallest cumulative successful transmission probability from the plurality of ith node clusters; determining the distance between each edge node in the target ith node cluster and the cluster center node of each other node cluster according to the situation that the accumulated successful transmission probability of the target ith node cluster is smaller than a preset probability threshold; dividing the edge node into other node clusters corresponding to the minimum distance to obtain a plurality of i+1th node clusters; the other node clusters are node clusters except the target ith node cluster in the plurality of ith node clusters, and cluster center nodes of the other node clusters are nodes determined based on FLOPS and storage capacity of each edge node in the other node clusters;
Updating the plurality of ith node clusters into the plurality of (i+1) th node clusters, repeatedly executing the node adjustment operation until the respective accumulated successful transmission probabilities of the final plurality of node clusters are greater than or equal to the preset probability threshold, and determining the final plurality of node clusters as the K target node clusters;
initially, i is equal to 1, and the plurality of ith node clusters are the plurality of initial node clusters.
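The node adjustment operation can be sketched along the same lines: the cluster with the smallest cumulative success probability is dissolved into the remaining clusters, each of its nodes joining the other cluster with the nearest center node, until every cluster meets the probability threshold. distance(), cumulative_success_prob() and pick_center() (which selects a center node from FLOPS and storage capacity) are the same assumptions as in the sketches above.

def adjust_clusters(clusters, p_th, distance, cumulative_success_prob, pick_center):
    # clusters: the plurality of initial node clusters, each a list of edge nodes
    while len(clusters) > 1:
        weakest = min(clusters, key=cumulative_success_prob)
        if cumulative_success_prob(weakest) >= p_th:
            break                    # every remaining cluster meets the threshold
        others = [c for c in clusters if c is not weakest]
        for node in weakest:
            # Move the node to the other cluster whose center node is nearest.
            target = min(others, key=lambda c: distance(node, pick_center(c)))
            target.append(node)
        clusters = others            # the weakest cluster is dissolved
    return clusters                  # the K target node clusters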
According to the deep learning model collaborative deduction device based on knowledge distillation provided by the invention, the dividing module 1030 is specifically configured to:
acquiring average activation values of the convolution filters respectively;
determining an adjacency weight matrix of the target graph based on the average activation value; wherein the target graph is a complete graph constructed based on the plurality of convolution filters;
and dividing the target graph based on the adjacency weight matrix and K through a standard division algorithm of spectral clustering to obtain the K filter sets.
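By way of illustration, the filter-set division can be sketched with an off-the-shelf spectral clustering routine. The construction of the adjacency weight matrix from the average activation values (here, an RBF similarity of per-filter activation profiles) is an assumption; the patent only states that the matrix is determined based on the average activation values of the filters of the last convolution layer.

import numpy as np
from sklearn.cluster import SpectralClustering

def partition_filters(avg_activations, K):
    # avg_activations: array of shape (num_filters, d) or (num_filters,), e.g. the
    # average activation of each filter per class or per sample batch (assumed form).
    a = np.asarray(avg_activations, dtype=float)
    if a.ndim == 1:
        a = a.reshape(-1, 1)
    # Adjacency weight matrix of the complete filter graph: pairwise similarity of
    # the average-activation profiles (one possible choice).
    sq_dists = ((a[:, None, :] - a[None, :, :]) ** 2).sum(-1)
    affinity = np.exp(-sq_dists / (2.0 * sq_dists.mean() + 1e-8))
    labels = SpectralClustering(n_clusters=K, affinity="precomputed",
                                assign_labels="discretize",
                                random_state=0).fit_predict(affinity)
    # The K filter sets, as lists of filter indices.
    return [np.where(labels == k)[0].tolist() for k in range(K)]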
According to the deep learning model collaborative deduction device based on knowledge distillation provided by the invention, the determining module 1040 is specifically configured to:
determining K target preset models in the preset models based on the target storage capacity of each target node cluster and the storage capacity requirement of each preset model; the target storage capacity of the target node cluster is the minimum storage capacity of at least one edge node in the corresponding target node cluster;
Constructing K models to be trained based on the K target preset models and the K filter sets;
determining a model set to be selected of each target node cluster based on the target storage capacity of each target node cluster and the K models to be trained;
sequencing each target node cluster based on the number of to-be-trained models included in the to-be-selected model set and the average FLOPS of each target node cluster to obtain a node cluster set;
aiming at each target node cluster in a node cluster set, determining the model to be trained meeting the preset conditions as the model to be trained of the target node cluster under the condition that the model to be trained meeting the preset conditions exists in a model to be selected of the target node cluster;
the preset conditions include at least one of the following:
the model set to be selected comprises a model to be trained;
the floating point operation times FLOPs of the model to be trained are minimum;
no corresponding target node cluster is allocated.
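The determination of the models to be trained can be sketched as a greedy assignment. The helpers storage_need(model), flops_of(model) and avg_flops(cluster), and the ordering of the clusters (fewest candidate models first, then smaller average FLOPS), are assumptions; the patent specifies the sorting keys but not their direction.

def assign_models(cluster_ids, models_to_train, target_storage, storage_need,
                  flops_of, avg_flops):
    # target_storage[c]: minimum storage capacity among the edge nodes of cluster c
    candidates = {c: [m for m in models_to_train
                      if storage_need(m) <= target_storage[c]] for c in cluster_ids}
    # Node cluster set: clusters ordered by the number of candidate models and by
    # average FLOPS, so the most constrained clusters choose first.
    order = sorted(cluster_ids, key=lambda c: (len(candidates[c]), avg_flops(c)))
    assigned, taken = {}, set()
    for c in order:
        # Preset conditions: the candidate set contains a model to be trained, the
        # model has the minimum FLOPs, and it is not yet allocated to another cluster.
        free = [m for m in candidates[c] if id(m) not in taken]
        if free:
            choice = min(free, key=flops_of)
            assigned[c] = choice
            taken.add(id(choice))
    return assigned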
Fig. 11 is a schematic structural diagram of an electronic device provided by the present invention. As shown in Fig. 11, the electronic device may include: a processor 1110, a communication interface (Communications Interface) 1120, a memory 1130, and a communication bus 1140. The processor 1110, the communication interface 1120, and the memory 1130 communicate with each other through the communication bus 1140. The processor 1110 may invoke logic instructions in the memory 1130 to perform the knowledge-distillation-based deep learning model collaborative deduction method.
Further, the logic instructions in the memory 1130 described above may be implemented in the form of software functional units and sold or used as a stand-alone product, stored on a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, where the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, the computer can execute the deep learning model collaborative deduction method based on knowledge distillation provided by the above methods. In yet another aspect, the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the knowledge distillation based deep learning model collaborative deduction method provided by the above methods.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden. From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments. Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A knowledge distillation-based deep learning model collaborative deduction method is characterized by comprising the following steps of:
acquiring the floating point operations per second (FLOPS), storage capacity and probability of successful data transmission corresponding to each edge node;
clustering each edge node based on the FLOPS, the storage capacity and the probability of successful data transmission to obtain K target node clusters;
performing set division processing on a plurality of convolution filters of the last convolution layer in a preset teacher model to obtain K filter sets;
determining a model to be trained of each target node cluster based on the K target node clusters, the K filter sets and a plurality of preset models;
based on a plurality of preset sample data, performing joint training on the model to be trained of each target node cluster by adopting a knowledge distillation technology to obtain student models of each target node cluster;
aiming at each target node cluster, respectively deploying a student model of the target node cluster into each edge node in the target node cluster; and each edge node in each target node cluster is used for executing collaborative deduction when a corresponding student model is operated.
2. The method of claim 1, wherein the clustering the edge nodes based on the FLOPS, the storage capacity, and the probability of successful data transmission to obtain K target node clusters comprises:
Clustering each edge node based on the FLOPS, the storage capacity and the probability of successful data transmission to obtain a plurality of initial node clusters;
and adjusting the edge nodes in the initial node clusters to obtain the K target node clusters.
3. The method of claim 2, wherein the clustering the edge nodes based on the FLOPS, the storage capacity, and the probability of successful data transmission to obtain a plurality of initial node clusters comprises:
arranging the plurality of edge nodes according to the sequence from small to large of the FLOPS and the sequence from small to large of the storage capacity under the condition that the FLOPS is the same to obtain a node set;
determining a first edge node in the node set as a cluster center node of a preset node cluster;
performing node division operation for each edge node except the first edge node in the node set: acquiring an ith node cluster set; determining distances between the edge node and cluster center nodes of each node cluster respectively based on FLOPS and storage capacity of the edge node and FLOPS and storage capacity of the cluster center nodes of each node cluster in the ith node cluster set; arranging the node clusters according to the sequence from small to large of the distances to obtain a target node cluster set; based on the probability of successful data transmission, determining the cumulative probability of successful data transmission of each node cluster in the target node cluster set; if a first node cluster with the accumulated transmission success probability smaller than the preset probability threshold and the distance between the cluster center node and the edge node smaller than the preset distance threshold exists in the target node cluster set, dividing the edge node into the first node cluster to obtain an i+1th node cluster set, and updating the cluster center node of the first node cluster; otherwise, creating a node cluster, and dividing the edge node into the newly created node cluster to obtain an i+1th node cluster set, wherein the i+1th node cluster set comprises each node cluster in the i node cluster set and the newly created node cluster;
Updating the ith node cluster set into the (i+1) th node cluster set, and repeatedly executing the node dividing operation for N times to obtain a plurality of initial node clusters; wherein N is the total number of other edge nodes in the node set except the first edge node;
initially, i is equal to 1, and the ith node cluster set comprises the preset node cluster.
4. The method of claim 3, wherein the determining the distance between the edge node and the cluster core node of each node cluster based on the FLOPS and storage capacity of the edge node and the FLOPS and storage capacity of the cluster core node of each node cluster in the set of ith node clusters, respectively, comprises:
processing FLOPS and storage capacity of the edge node and FLOPS and storage capacity of the cluster center nodes of each node cluster through a distance calculation model to obtain distances between the edge node and the cluster center nodes of each node cluster;
the distance calculation model is as follows:
wherein m_i represents the edge node, M_k represents the node cluster, and d represents the distance between the edge node and the cluster center node of the node cluster; the remaining symbols in the formula represent, respectively, the storage capacity of the edge node, the storage capacity of the cluster center node of the node cluster, the FLOPS of the edge node, and the FLOPS of the cluster center node of the node cluster.
5. The method of claim 3, wherein determining the cumulative probability of transmission success for each node cluster in the set of target node clusters based on the probability of successful transmission of data comprises:
processing the successful transmission probability of the data of each edge node in the node cluster by a probability calculation model aiming at each node cluster to obtain the cumulative transmission success probability of the node cluster;
the probability calculation model is as follows:
wherein ,representing the cumulative transmission success probability of the node cluster, M k Representing the node cluster, m i Representing edge nodes in said cluster of nodes, < >>Representing the probability of successful data transmission of the edge node, and pi (·) represents the accumulation operation.
6. The method of claim 2, wherein said adjusting the edge nodes in the plurality of initial node clusters to obtain the K target node clusters comprises:
performing node adjustment operations: acquiring a plurality of ith node clusters; determining a target ith node cluster with the smallest cumulative successful transmission probability from the plurality of ith node clusters; determining the distance between each edge node in the target ith node cluster and the cluster center node of each other node cluster according to the situation that the accumulated successful transmission probability of the target ith node cluster is smaller than a preset probability threshold; dividing the edge node into other node clusters corresponding to the minimum distance to obtain a plurality of i+1th node clusters; the other node clusters are node clusters except the target ith node cluster in the plurality of ith node clusters, and cluster center nodes of the other node clusters are nodes determined based on FLOPS and storage capacity of each edge node in the other node clusters;
Updating the plurality of ith node clusters into the plurality of (i+1) th node clusters, repeatedly executing the node adjustment operation until the respective accumulated successful transmission probabilities of the final plurality of node clusters are greater than or equal to the preset probability threshold, and determining the final plurality of node clusters as the K target node clusters;
initially, i is equal to 1, and the plurality of ith node clusters are the plurality of initial node clusters.
7. The method according to any one of claims 1 to 6, wherein the performing set division processing on the plurality of convolution filters of the last convolution layer in the preset teacher model to obtain K filter sets includes:
acquiring average activation values of the convolution filters respectively;
determining an adjacency weight matrix of the target graph based on the average activation value; wherein the target graph is a complete graph constructed based on the plurality of convolution filters;
and dividing the target graph based on the adjacency weight matrix and K through a standard division algorithm of spectral clustering to obtain the K filter sets.
8. The method according to any one of claims 1 to 6, wherein the determining a model to be trained for each target node cluster based on the K target node clusters, the K filter sets, and a plurality of preset models comprises:
Determining K target preset models in the preset models based on the target storage capacity of each target node cluster and the storage capacity requirement of each preset model; the target storage capacity of the target node cluster is the minimum storage capacity of at least one edge node in the corresponding target node cluster;
constructing K models to be trained based on the K target preset models and the K filter sets;
determining a model set to be selected of each target node cluster based on the target storage capacity of each target node cluster and the K models to be trained;
sequencing each target node cluster based on the number of to-be-trained models included in the to-be-selected model set and the average FLOPS of each target node cluster to obtain a node cluster set;
aiming at each target node cluster in a node cluster set, determining the model to be trained meeting the preset conditions as the model to be trained of the target node cluster under the condition that the model to be trained meeting the preset conditions exists in a model to be selected of the target node cluster;
the preset conditions include at least one of the following:
the model set to be selected comprises a model to be trained;
The floating point operation times FLOPs of the model to be trained are minimum;
no corresponding target node cluster is allocated.
9. The deep learning model collaborative deduction device based on knowledge distillation is characterized by comprising:
the acquisition module is used for acquiring the floating point operations per second (FLOPS), storage capacity and probability of successful data transmission corresponding to each edge node;
the clustering module is used for carrying out clustering processing on each edge node based on the FLOPS, the storage capacity and the data successful transmission probability to obtain K target node clusters;
the division module is used for carrying out set division processing on a plurality of convolution filters of the last convolution layer in a preset teacher model to obtain K filter sets;
the determining module is used for determining a model to be trained of each target node cluster based on the K target node clusters, the K filter sets and a plurality of preset models;
the training module is used for carrying out joint training on the to-be-trained models of the target node clusters by adopting a knowledge distillation technology based on a plurality of preset sample data to obtain student models of the target node clusters;
the deployment module is used for respectively deploying the student models of the target node clusters to the edge nodes in the target node clusters aiming at the target node clusters; and each edge node in each target node cluster is used for executing collaborative deduction when a corresponding student model is operated.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements a knowledge-distillation based deep learning model collaborative deduction method according to any one of claims 1 to 8 when executing the program.
CN202310305693.4A 2023-03-27 2023-03-27 Deep learning model collaborative deduction method, device and equipment based on knowledge distillation Pending CN116562364A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310305693.4A CN116562364A (en) 2023-03-27 2023-03-27 Deep learning model collaborative deduction method, device and equipment based on knowledge distillation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310305693.4A CN116562364A (en) 2023-03-27 2023-03-27 Deep learning model collaborative deduction method, device and equipment based on knowledge distillation

Publications (1)

Publication Number Publication Date
CN116562364A true CN116562364A (en) 2023-08-08

Family

ID=87488728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310305693.4A Pending CN116562364A (en) 2023-03-27 2023-03-27 Deep learning model collaborative deduction method, device and equipment based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN116562364A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117372785A (en) * 2023-12-04 2024-01-09 吉林大学 Image classification method based on feature cluster center compression
CN117372785B (en) * 2023-12-04 2024-03-26 吉林大学 Image classification method based on feature cluster center compression

Similar Documents

Publication Publication Date Title
CN109978142B (en) Neural network model compression method and device
CN111147307B (en) Service function chain reliable deployment method based on deep reinforcement learning
US20210174264A1 (en) Training tree-based machine-learning modeling algorithms for predicting outputs and generating explanatory data
US20180268296A1 (en) Machine learning-based network model building method and apparatus
EP3805999A1 (en) Resource-aware automatic machine learning system
US20220414426A1 (en) Neural Architecture Search Method and Apparatus, Device, and Medium
CN103699606A (en) Large-scale graphical partition method based on vertex cut and community detection
CN115358487A (en) Federal learning aggregation optimization system and method for power data sharing
CN110968426A (en) Edge cloud collaborative k-means clustering model optimization method based on online learning
CN112418320B (en) Enterprise association relation identification method, device and storage medium
CN116562364A (en) Deep learning model collaborative deduction method, device and equipment based on knowledge distillation
CN117241295B (en) Wireless communication network performance optimization method, device and storage medium
WO2021238305A1 (en) Universal distributed graph processing method and system based on reinforcement learning
US20200349416A1 (en) Determining computer-executed ensemble model
CN115080248A (en) Scheduling optimization method for scheduling device, and storage medium
CN114650321A (en) Task scheduling method for edge computing and edge computing terminal
CN115865912B (en) Network edge online service function chain deployment method, system and equipment
CN116709290A (en) Disaster area emergency communication method and system based on unmanned aerial vehicle edge calculation
US20190370651A1 (en) Deep Co-Clustering
US20140214734A1 (en) Classifying a submission
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment
Palomo et al. A new self-organizing neural gas model based on Bregman divergences
CN117424813B (en) Node expansion method for block chain
Chen et al. Deep reinforcement learning based container cluster placement strategy in edge computing environment
CN116882284A (en) Deep learning model construction method suitable for low-orbit satellite service

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination