CN116152240B - Industrial defect detection model compression method based on knowledge distillation


Info

Publication number
CN116152240B
Authority
CN
China
Prior art keywords
model
distillation
layer
student
teacher
Prior art date
Legal status
Active
Application number
CN202310412539.7A
Other languages
Chinese (zh)
Other versions
CN116152240A (en)
Inventor
Chen Yu
Chen Zhen
Current Assignee
Xiamen Weitu Software Technology Co., Ltd.
Original Assignee
Xiamen Weitu Software Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Xiamen Weitu Software Technology Co., Ltd.
Priority to CN202310412539.7A
Publication of CN116152240A
Application granted
Publication of CN116152240B
Legal status: Active
Anticipated expiration

Classifications

    • G06T 7/0004 Industrial image inspection (G Physics › G06 Computing; Calculating or Counting › G06T Image data processing or generation, in general › G06T 7/00 Image analysis › G06T 7/0002 Inspection of images, e.g. flaw detection)
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections (G06N Computing arrangements based on specific computational models › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/08 Learning methods)
    • G06T 2207/20081 Training; Learning (G06T 2207/00 Indexing scheme for image analysis or image enhancement › G06T 2207/20 Special algorithmic details)
    • G06T 2207/20084 Artificial neural networks [ANN] (same indexing scheme)
    • Y02P 90/30 Computing systems specially adapted for manufacturing (Y02P Climate change mitigation technologies in the production or processing of goods › Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation)


Abstract

The invention discloses a knowledge-distillation-based method for compressing an industrial defect detection model, comprising the following steps: constructing and training a teacher model; and distilling the teacher model into a student model by knowledge distillation, which comprises: intermediate network layer feature distillation, in which the student model learns the feature information of the teacher model's intermediate network layers by feature distillation, so that at intermediate layers of different depths the student model iteratively approximates the teacher model's feature information at the corresponding layer, after which the student model network is trained using the teacher model's soft labels; and softmax layer target distillation, in which the student model learns the softmax-layer feature information of the teacher model by combining a soft target with a hard target. Using knowledge distillation, the invention transfers the capability of a pre-trained, structurally complex, high-precision industrial defect detection model onto a lightweight model with a simple structure, simplifying the model parameters and improving detection efficiency while guaranteeing a high-precision detection rate.

Description

Industrial defect detection model compression method based on knowledge distillation
Technical Field
The invention relates to the technical field of industrial product defect detection, in particular to a method for compressing an industrial defect detection model based on knowledge distillation.
Background
With the continuous development of deep learning technology and the growth of hardware computing power, model detection capability has become increasingly strong and model network parameters increasingly complex; in practical application scenarios, however, properties such as light weight and high efficiency receive more attention. In industrial defect detection, cycle time and memory occupation must always be considered, so the field favors reducing model parameters as far as possible without significantly reducing detection accuracy; that is, models are required to be lightweight and real-time.
The network of a current mainstream deep-learning-based industrial defect detection model generally consists of a convolutional network layer (input layer), intermediate network layers, and a final softmax layer (output layer). Taking the deep residual network ResNet-50 as an example, the intermediate network layers generally comprise a pooling layer, several residual layers, and the like, and intermediate-layer feature information is generated during model training.
The output layer's values are probability values for each image category. The softmax output contains a large number of negative labels: in multi-class classification, the core softmax function maps the outputs, including those of the many negative labels, into the interval (0, 1) so that they can be interpreted as probabilities, and classification proceeds on that basis. The softmax function is expressed as:

$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}$$

where $z_i$ is the final output of the network, representing the likelihood that the input image belongs to category $i$ (the larger the value, the greater the likelihood), $p_i$ is the probability after softmax that the input image belongs to category $i$, and $N$ is the number of categories.
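For illustration, a minimal sketch of this mapping (PyTorch; the logit values are invented for the example and are not from the patent):

```python
import torch
import torch.nn.functional as F

# Raw network outputs (logits) z_i for one image over N = 4 categories.
logits = torch.tensor([2.0, 0.5, -1.0, 0.1])

# softmax maps the logits into (0, 1) so they can be read as class probabilities.
probs = F.softmax(logits, dim=0)

print(probs)        # ~tensor([0.7030, 0.1569, 0.0350, 0.1051])
print(probs.sum())  # tensor(1.) - the N probabilities sum to one
```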
Deeper and wider deep learning models often require a large amount of parameter computation and place high demands on hardware computing resources; deep-learning-based defect detection models therefore achieve high accuracy but are limited in practical application. To solve this problem, the classification model's parameters must go from complex to simple and lightweight: the parameter count must be reduced so the model can adapt to devices with limited computing power. Moreover, as industrial defect detection application scenarios multiply, targeted compression of the model becomes particularly important.
Aiming at the complex parameters and large computation of existing industrial detection algorithms, the invention adopts a knowledge distillation method to lighten the model, reducing the algorithm model's parameter count, accelerating detection, and lowering memory occupation without losing detection accuracy.
Disclosure of Invention
The invention aims to provide a knowledge-distillation-based industrial defect detection model compression method which, through teacher-student knowledge distillation, transfers the capability of a pre-trained, structurally complex, high-precision industrial defect detection model onto a lightweight model with a simple structure, simplifying model parameters and improving detection efficiency while meeting the required high-precision detection rate.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
an industrial defect detection model compression method based on knowledge distillation comprises the following steps:
constructing and training a teacher model;
distilling the teacher model into the student model by means of knowledge distillation, comprising:
intermediate network layer feature distillation: the student model learns the feature information of the teacher model's intermediate network layers by feature distillation, so that at intermediate layers of different depths the student model iteratively approximates the teacher model's feature information at the corresponding intermediate layer; the student model network is then trained using the teacher model's soft labels;
softmax layer target distillation: the student model learns the softmax-layer feature information of the teacher model by combining a soft target with a hard target; the soft target means guiding the student model's learning with the probability values output by the teacher model's softmax layer, and the hard target means guiding it with the teacher model's final label value.
Preferably, the intermediate network layer feature distillation specifically comprises:
selecting the intermediate network layers to be distilled, and connecting a convolution layer behind the student model's intermediate network layer so that the student model's output feature size is consistent with the teacher model's output feature size;
training the parameters of the student model's intermediate network layers in the feature distillation mode;
training the student model's intermediate network layers using the soft labels of the teacher model's intermediate network layers.
Further, in the intermediate network layer feature distillation process, the sizes of the student model's and the teacher model's intermediate-layer output features are kept consistent by computing the difference loss $L$ between them, calculated as:

$$L = \sum_{i} \lambda_i \left\| F_i^{t} - F_i^{s} \right\|^2$$

where $\lambda_i$ is the distillation coefficient of each stage, $i$ denotes a stage layer shared by the student and teacher models in the intermediate network, $F_i^{t}$ is the output feature information of the teacher model at stage $i$, and $F_i^{s}$ is the output feature information of the student model at stage $i$.
Preferably, the softmax layer target distillation specifically comprises:
setting the probability values output by the teacher model's softmax layer as the initial soft label, and generating a deformed soft label from the initial soft label according to a set distillation temperature $T$, the label value finally obtained by the teacher model being the hard label;
performing high-temperature distillation on the student model using the deformed soft label and the hard label simultaneously;
dynamically adjusting the contribution of the deformed soft label and the hard label to the whole high-temperature distillation process according to the feedback result of the high-temperature distillation.
Further, the softmax function of the deformed soft label, generated from the initial soft label according to the set distillation temperature $T$, is expressed as:

$$p_i^{T} = \frac{\exp(z_i / T)}{\sum_{j=1}^{N} \exp(z_j / T)}$$

where $z_i$ represents the final output of the model, i.e. the likelihood that the input image belongs to category $i$, $p_i^{T}$ represents the probability value obtained for category $i$ after the softmax layer, and $T$ is the distillation temperature.
Further, in the high-temperature distillation process, the objective function is a weighted sum of the distillation loss $L_{soft}$ and the student model loss $L_{hard}$:

$$L = \alpha L_{soft} + \beta L_{hard}$$

where $\alpha$ and $\beta$ respectively represent the weights of the distillation loss $L_{soft}$ and the student model loss $L_{hard}$ in the objective function. The distillation loss $L_{soft}$ is given by:

$$L_{soft} = -\sum_{i=1}^{N} p_i^{T} \log q_i^{T}$$

where $p_i^{T}$ is the value the teacher model outputs for category $i$ through the softmax layer at temperature $T$, and $q_i^{T}$ is the value the student model outputs for category $i$ through the softmax layer at temperature $T$, with:

$$p_i^{T} = \frac{\exp(y_i / T)}{\sum_{j=1}^{N} \exp(y_j / T)}, \qquad q_i^{T} = \frac{\exp(z_i / T)}{\sum_{j=1}^{N} \exp(z_j / T)}$$

where $y_i$ is the teacher model's softmax-layer output, $z_i$ is the student model's softmax-layer output, and $N$ is the number of categories.

The student model loss $L_{hard}$ is given by:

$$L_{hard} = -\sum_{i=1}^{N} c_i \log q_i^{1}$$

where $c_i$ is the true label value of category $i$ and $q_i^{1}$ is the value the student model outputs for category $i$ through the softmax layer at $T = 1$.
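For concreteness, a sketch of this combined objective in code (PyTorch; the function name, the weight and temperature values, and the batch shapes are our illustrative assumptions; the $T^2$ factor on $L_{soft}$, revisited in the detailed description, keeps the gradient contributions of the two losses comparable):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=4.0, alpha=0.9, beta=0.1):
    """L = alpha * L_soft + beta * L_hard (T**2 rescales L_soft, whose
    gradients shrink as 1/T**2 when the logits are divided by T)."""
    # L_soft: cross-entropy between softened teacher and student distributions.
    p_T = F.softmax(teacher_logits.detach() / T, dim=1)   # p_i^T (teacher frozen)
    log_q_T = F.log_softmax(student_logits / T, dim=1)    # log q_i^T
    l_soft = -(p_T * log_q_T).sum(dim=1).mean()

    # L_hard: ordinary cross-entropy against the true labels at T = 1.
    l_hard = F.cross_entropy(student_logits, labels)

    return alpha * (T ** 2) * l_soft + beta * l_hard

# Illustrative call: a batch of 8 images over N = 5 defect categories.
s = torch.randn(8, 5, requires_grad=True)   # student logits
t = torch.randn(8, 5)                       # teacher logits
y = torch.randint(0, 5, (8,))               # true labels
distillation_loss(s, t, y).backward()
```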
Further, dynamically adjusting the contribution of the deformed soft label and the hard label to the whole distillation according to the feedback result of the high-temperature distillation specifically comprises: dynamically adjusting the weight values $\alpha$ and $\beta$ in the objective function according to the feedback result of the high-temperature distillation, so as to balance the contributions of the soft and hard labels to the whole high-temperature distillation process.
Further, the student model adopts a depthwise separable convolution DenseNet as its input layer.
By adopting the above scheme, the invention achieves the following beneficial effects:
aiming at a complex industrial defect detection Model (teacher Model), the invention learns the characteristic information of an intermediate network layer of the teacher Model in a characteristic distillation mode, learns the characteristic information of a softmax layer of the teacher Model in a target distillation mode, and obtains a lightweight teacher Model Lightweight Model (student Model), which can ensure that the running rate of the Model is greatly improved on the premise that the precision of an original Model is basically not lost, and can meet the requirement of high detection rate under the condition of limited industrial quality inspection computing resources.
The industrial defect detection model compression method of the invention can be applied to the industrial defect detection modules of equipment or systems such as a top-cover welding visual inspection system, an automatic loading and unloading machine (universal), a Mylar-wrapping machine CCD inspection device, a sealing-nail welding visual inspection system, an EPD burn-in and lighting AOI inspection device, a battery cell appearance inspection machine, a blade battery six-sided inspection system, a two-dimensional barcode reader (VCR), a bending machine, a PSA small-part attaching machine (single channel), and the like.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings needed in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person skilled in the art could derive other variants from them without inventive effort.
FIG. 1 is a flow chart of the steps of a method for compressing an industrial defect detection model based on knowledge distillation in accordance with the present invention;
FIG. 2 is a block diagram of the overall scheme design of the knowledge-distillation-based industrial defect detection model compression method of the present invention;
FIG. 3 is a schematic diagram of a characteristic distillation structure;
FIG. 4 is a schematic diagram of a target distillation structure;
FIG. 5 is a parameter analysis diagram of an ordinary convolution layer;
FIG. 6 is a parameter analysis diagram of a DenseNet layer.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention provides an industrial defect detection model compression method based on knowledge distillation. Model compression methods such as parameter pruning, precision conversion, and neural architecture search can damage the structure of the original model, and with it the knowledge the original model has learned from its data. Knowledge distillation instead protects the knowledge learned by the original model and transfers it into a compressed model, so that the compressed model, though much smaller in volume than the original, can learn the same knowledge. In knowledge distillation, the model to be compressed is called the teacher neural network, or the "teacher model" below, and the compressed model is called the student neural network, or the "student model" below.
As shown in fig. 1-2, the above knowledge-distillation-based industrial defect detection model compression method specifically comprises the following two stages:
S1, an original model training stage: constructing a data set and using it to train a teacher model;
S2, distilling the teacher model into the student model by knowledge distillation, which specifically comprises the following steps:
S21, intermediate network layer feature distillation: the student model learns the feature information of the teacher model's intermediate network layers by feature distillation, so that at intermediate layers of different depths the student model iteratively approximates the teacher model's feature information at the corresponding layer; the student model network is then trained using the teacher model's soft labels, a soft label being a probability value output by the teacher model's softmax layer. The student model thus fits not only the teacher model's soft labels but also the outputs of several hidden layers in the intermediate network (the feature information extracted by the teacher model). Referring to fig. 3, the intermediate network layer feature distillation process specifically comprises:
S211, selecting the intermediate network layers to be distilled. The output feature sizes of the teacher's and student's intermediate layers may differ, so a convolution layer must be connected behind the student model's intermediate layer to make its output feature size consistent with the teacher's. Specifically, the difference loss $L$ between the student's and teacher's intermediate-layer output features is computed so that the two output feature sizes are kept consistent; the output feature difference loss $L$ is calculated as follows (a brief code sketch follows the formula):

$$L = \sum_{i} \lambda_i \left\| F_i^{t} - F_i^{s} \right\|^2$$

where $\lambda_i$ is the distillation coefficient of each stage, $i$ denotes a stage layer shared by the student and teacher models in the intermediate network, $F_i^{t}$ is the output feature information of the teacher model at stage $i$, and $F_i^{s}$ is the output feature information of the student model at stage $i$.
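As a sketch of one stage of this loss (PyTorch; the channel counts, the 1×1 adapter convolution, and the coefficient value are illustrative assumptions, not prescribed by the patent):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Suppose stage i of the teacher outputs 256 channels but the student only 64.
# A convolution behind the student's intermediate layer aligns the feature
# sizes, as required above, before the difference loss is computed.
adapter = nn.Conv2d(64, 256, kernel_size=1)   # illustrative adapter choice

teacher_feat = torch.randn(8, 256, 28, 28)    # F_i^t (teacher is frozen)
student_feat = torch.randn(8, 64, 28, 28)     # F_i^s before alignment

lambda_i = 0.5                                # stage distillation coefficient
loss_i = lambda_i * F.mse_loss(adapter(student_feat), teacher_feat.detach())

# The total intermediate-layer loss L sums loss_i over all selected stages i.
```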
S212, training parameters of an intermediate network layer of the student model according to a characteristic distillation mode, wherein the student model learns output characteristic information of the teacher model;
s213, training the middle network layer of the student model by using the soft label of the middle network layer of the teacher model.
S22, softmax layer target distillation: the student model learns the softmax-layer feature information of the teacher model by combining a soft target with a hard target. In a classification model, the network's final output is typically a softmax layer whose values are probabilities for each class. When the probability values output by the teacher model's softmax layer guide the student model's learning, the approach is called a soft target; when the teacher model's final label value guides it, the approach is called a hard target. The output of the network's last softmax layer contains a large amount of negative-label reasoning information, which helps the model extract image feature information more fully and comprehensively, while the hard target aids learning of the current sample's feature information; this embodiment therefore exploits the advantages of both modes, teaching the student model according to its aptitude. Referring to fig. 4, the softmax layer target distillation specifically comprises:
s221, setting a probability value output by a soft max layer of a teacher model as an initial soft label (soft-target), and generating a deformed soft label according to a set distillation temperature T based on the initial soft label, wherein a label value finally obtained by the teacher model is a Hard label (Hard-target); the softmax function of the deformed soft label is expressed as:
$$p_i^{T} = \frac{\exp(z_i / T)}{\sum_{j=1}^{N} \exp(z_j / T)}$$

where $z_i$ represents the final output of the model, i.e. the likelihood that the input image belongs to category $i$ (the larger the value, the greater the likelihood), $p_i^{T}$ represents the probability value obtained for category $i$ after the softmax layer, and $T$ is the distillation temperature. When $T = 1$, the formula reduces to the softmax function of the initial soft label; the higher $T$ is, the smoother the output probability distribution and the larger its entropy, so the feature information carried by the negative labels is correspondingly amplified and model learning pays more attention to them, as the sketch below illustrates;
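A small sketch makes the smoothing effect of $T$ concrete (PyTorch; the logit and temperature values are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([6.0, 2.0, -2.0])   # one positive and two negative labels

for T in (1.0, 4.0, 10.0):
    # Dividing the logits by T before softmax flattens the distribution,
    # amplifying the relative weight of the negative labels.
    print(T, F.softmax(logits / T, dim=0))

# T=1.0  -> ~[0.9817, 0.0180, 0.0003]  (negative labels nearly invisible)
# T=4.0  -> ~[0.6652, 0.2447, 0.0900]
# T=10.0 -> ~[0.4718, 0.3162, 0.2120]  (smoother, higher-entropy distribution)
```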
s222, performing high-temperature distillation on the student model by using the deformed soft tag and the hard tag simultaneously; in the high temperature distillation process, the objective function is determined by distillation loss valueL soft (distill loss) and student model loss valuesL hard (student) weighting, the objective function is expressed as:
$$L = \alpha L_{soft} + \beta L_{hard}$$

where $\alpha$ and $\beta$ respectively represent the weights of the distillation loss $L_{soft}$ and the student model loss $L_{hard}$ in the objective function; the distillation loss $L_{soft}$ is given by:

$$L_{soft} = -\sum_{i=1}^{N} p_i^{T} \log q_i^{T}$$

where $p_i^{T}$ is the value the teacher model outputs for class $i$ through the softmax layer at temperature $T$, and $q_i^{T}$ is the value the student model outputs for class $i$ through the softmax layer at temperature $T$, with:

$$p_i^{T} = \frac{\exp(y_i / T)}{\sum_{j=1}^{N} \exp(y_j / T)}, \qquad q_i^{T} = \frac{\exp(z_i / T)}{\sum_{j=1}^{N} \exp(z_j / T)}$$

where $y_i$ is the teacher model's softmax-layer output, $z_i$ is the student model's softmax-layer output, and $N$ is the number of categories.

The student model loss $L_{hard}$ is given by:

$$L_{hard} = -\sum_{i=1}^{N} c_i \log q_i^{1}$$

where $c_i$ is the true label value of class $i$ ($c_i$ is 1 for the positive label and 0 for a negative label), and $q_i^{1}$ is the value the student model outputs for class $i$ through the softmax layer when $T$ is 1.
S223, dynamically adjusting the contribution of the deformed soft label and the hard label to the whole high-temperature distillation process according to its feedback result; specifically, the weight values $\alpha$ and $\beta$ in the objective function are dynamically adjusted to balance the contributions of the soft and hard labels. The cross-entropy between the student model's softmax-layer output at distillation temperature $T = 1$ and the true labels is the student model loss $L_{hard}$; introducing $L_{hard}$ can be understood as using the true label values to effectively correct the student model's learning and prevent it from being led astray by occasional teacher model errors. Experiments show that a smaller weight on $L_{hard}$ produces better results; and since the contribution of $L_{soft}$ is $1/T^2$ that of $L_{hard}$, the weight of $L_{soft}$ must be multiplied by a $T^2$ coefficient if the contributions of the two losses are to be kept consistent.
In this embodiment, the student model preferably adopts a depthwise separable convolution DenseNet as its input layer, replacing the ordinary convolution layer used in the teacher model. Under the same input and output conditions, the depthwise separable convolution DenseNet (the "DenseNet layer" below) has a smaller parameter count and less computation; the DenseNet layer works in single-channel form, convolving each channel of the image data separately, so the number of feature information channels obtained after the DenseNet layer stays consistent with the number of input channels.
In the ordinary convolution layer shown in fig. 5, the convolution uses $n$ kernels of size $k \times k \times m$, so its parameter count is $k \times k \times m \times n$; this convolution produces a $w \times h \times n$ feature map, so its computation is $k \times k \times m \times n \times w \times h$. The DenseNet layer consists of a depthwise convolution layer and a pointwise convolution layer, as shown in fig. 6: the depthwise convolution applies a $k \times k$ kernel per channel over the $m$ input channels, and the pointwise convolution uses $n$ kernels of size $1 \times 1 \times m$, so its parameter count is $k \times k \times m + m \times n$; passing through the DenseNet layer likewise produces a $w \times h \times n$ feature map, so the DenseNet layer's computation is $k \times k \times m \times w \times h + m \times n \times w \times h$.
Comparing the parameter counts and computation, the DenseNet layer requires $1/n + 1/k^2$ of the ordinary convolution layer, so using DenseNet layers instead of ordinary convolution layers gives the model a lighter structure.
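These counts are easy to verify numerically; a sketch (PyTorch; the values of k, m, and n are arbitrary examples):

```python
import torch.nn as nn

k, m, n = 3, 64, 128   # kernel size, input channels, output channels

standard  = nn.Conv2d(m, n, kernel_size=k, padding=1, bias=False)
depthwise = nn.Conv2d(m, m, kernel_size=k, padding=1, groups=m, bias=False)
pointwise = nn.Conv2d(m, n, kernel_size=1, bias=False)

p_std = sum(p.numel() for p in standard.parameters())    # k*k*m*n   = 73728
p_sep = sum(p.numel() for p in depthwise.parameters()) \
      + sum(p.numel() for p in pointwise.parameters())   # k*k*m+m*n = 8768

print(p_sep / p_std)   # 0.1189... == 1/n + 1/k**2 = 1/128 + 1/9
```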
The following is an application example of the industrial defect detection model compression method based on knowledge distillation in this embodiment:
according to the invention, knowledge distillation is applied to compression of an industrial laser welding defect detection model, test pictures are 1000 welded defect sample pictures, the pictures are randomly divided into 10 groups, 100 pictures are respectively input into the model for testing, the results of all groups are taken to obtain an average value, and the test results are shown in the following table 1:
table 1: comparison of test results
The knowledge-distillation-based industrial defect detection model compression method described above can greatly improve the model's running speed with essentially no loss of the original model's accuracy, and can meet the requirement of a high detection rate under the limited computing resources of industrial quality inspection.
The knowledge-distillation-based industrial defect detection model compression method of the invention can be applied to defect detection for various industrial products, such as electronic products, mechanical equipment, precision instruments, and parts, and to the industrial product defect detection modules of equipment or systems such as a top-cover welding visual inspection system, an automatic loading and unloading machine (universal), a Mylar-wrapping machine CCD inspection device, a sealing-nail welding visual inspection system, an EPD burn-in and lighting AOI inspection device, a battery cell appearance inspection machine, a blade battery six-sided inspection system, a two-dimensional barcode reader (VCR), a bending machine, a PSA small-part attaching machine (single channel), and the like.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "examples," "particular examples," or "an alternative embodiment," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-described embodiments do not limit the scope of the present invention. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the above embodiments should be included in the scope of the present invention.

Claims (2)

1. The industrial defect detection model compression method based on knowledge distillation is characterized by comprising the following steps of:
constructing and training a teacher model;
distilling the teacher model into the student model by means of knowledge distillation, comprising:
intermediate network layer feature distillation: the student model learns the feature information of the teacher model's intermediate network layers by feature distillation, so that at intermediate layers of different depths the student model iteratively approximates the teacher model's feature information at the corresponding intermediate layer, after which the student model network is trained using the teacher model's soft labels; this specifically comprises the following steps:
selecting an intermediate network layer to be distilled, and connecting a convolution layer behind the intermediate network layer of the student model to ensure that the output characteristic information size of the student model is consistent with the output characteristic information size of the teacher model;
training parameters of an intermediate network layer of the student model according to the characteristic distillation mode;
training an intermediate network layer of the student model using the soft labels of the teacher model intermediate network layer;
softmax layer target distillation: the student model learns the softmax-layer feature information of the teacher model by combining a soft target with a hard target; the soft target means using the probability values output by the teacher model's softmax layer to guide the student model in learning the teacher model's knowledge, and the hard target means using the teacher model's final label value to guide that learning; this specifically comprises:
setting a probability value output by a teacher model softmax layer as an initial soft label, and generating a deformed soft label according to a set distillation temperature T based on the initial soft label, wherein a label value finally obtained by the teacher model is a hard label;
performing high-temperature distillation on the student model using the deformed soft label and the hard label simultaneously;
dynamically adjusting the contribution of the deformed soft label and the hard label to the whole high-temperature distillation process according to the feedback result of the high-temperature distillation;
the softmax function of the deformed soft label, generated from the initial soft label according to the set distillation temperature $T$, is expressed as:

$$p_i^{T} = \frac{\exp(z_i / T)}{\sum_{j=1}^{N} \exp(z_j / T)}$$

where $z_i$ represents the final output of the model, i.e. the likelihood that the input image belongs to category $i$, $p_i^{T}$ represents the probability value obtained for category $i$ after the softmax layer, and $T$ is the distillation temperature;
in the high-temperature distillation process, the objective function is a weighted sum of the distillation loss $L_{soft}$ and the student model loss $L_{hard}$:

$$L = \alpha L_{soft} + \beta L_{hard}$$

where $\alpha$ and $\beta$ respectively represent the weights of the distillation loss $L_{soft}$ and the student model loss $L_{hard}$ in the objective function; the distillation loss $L_{soft}$ is given by:

$$L_{soft} = -\sum_{i=1}^{N} p_i^{T} \log q_i^{T}$$

where $p_i^{T}$ is the value the teacher model outputs for class $i$ through the softmax layer at temperature $T$, and $q_i^{T}$ is the value the student model outputs for class $i$ through the softmax layer at temperature $T$, with:

$$p_i^{T} = \frac{\exp(y_i / T)}{\sum_{j=1}^{N} \exp(y_j / T)}, \qquad q_i^{T} = \frac{\exp(z_i / T)}{\sum_{j=1}^{N} \exp(z_j / T)}$$

where $y_i$ is the teacher model's softmax-layer output, $z_i$ is the student model's softmax-layer output, and $N$ is the number of categories;

the student model loss $L_{hard}$ is given by:

$$L_{hard} = -\sum_{i=1}^{N} c_i \log q_i^{1}$$

where $c_i$ is the true label value of category $i$ and $q_i^{1}$ is the value the student model's softmax layer outputs for category $i$ when $T$ is 1;
dynamically adjusting the contribution of the deformed soft label and the hard label to the whole distillation according to the feedback result of the high-temperature distillation specifically comprises: dynamically adjusting the weight values $\alpha$ and $\beta$ in the objective function according to the feedback result of the high-temperature distillation, so as to balance the contributions of the soft and hard labels to the whole high-temperature distillation process;
the student model adopts a depthwise separable convolution DenseNet as its input layer.
2. The knowledge-distillation-based industrial defect detection model compression method of claim 1, wherein: in the intermediate network layer feature distillation process, the difference loss $L$ between the intermediate-layer output feature information of the student model and that of the teacher model is computed so that the two output feature sizes are kept consistent, the output feature difference loss $L$ being calculated as:

$$L = \sum_{i} \lambda_i \left\| F_i^{t} - F_i^{s} \right\|^2$$

where $\lambda_i$ is the distillation coefficient of each stage, $i$ denotes a stage layer shared by the student and teacher models in the intermediate network, $F_i^{t}$ is the output feature information of the teacher model at stage $i$, and $F_i^{s}$ is the output feature information of the student model at stage $i$.
CN202310412539.7A 2023-04-18 2023-04-18 Industrial defect detection model compression method based on knowledge distillation Active CN116152240B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310412539.7A CN116152240B (en) 2023-04-18 2023-04-18 Industrial defect detection model compression method based on knowledge distillation


Publications (2)

Publication Number Publication Date
CN116152240A CN116152240A (en) 2023-05-23
CN116152240B true CN116152240B (en) 2023-07-25

Family

ID=86360365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310412539.7A Active CN116152240B (en) 2023-04-18 2023-04-18 Industrial defect detection model compression method based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN116152240B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2584727B (en) * 2019-06-14 2024-02-28 Vision Semantics Ltd Optimised machine learning
JP7283835B2 * 2020-12-17 2023-05-30 Zhijiang Lab Automatic Compression Method and Platform for Pre-trained Language Models Based on Multilevel Knowledge Distillation
CN113361589A * 2021-06-01 2021-09-07 Yang Jingjing Rare or endangered plant leaf identification method based on transfer learning and knowledge distillation
CN113554716A * 2021-07-28 2021-10-26 Guangdong University of Technology Knowledge distillation-based tile color difference detection method and device
CN113887610B * 2021-09-29 2024-02-02 Inner Mongolia University of Technology Pollen image classification method based on cross-attention distillation Transformer
CN115393671A * 2022-08-25 2022-11-25 Hohai University Rock class prediction method based on multi-teacher knowledge distillation and normalized attention
CN115631393A * 2022-09-28 2023-01-20 Southwest University of Science and Technology Image processing method based on characteristic pyramid and knowledge guided knowledge distillation
CN115965964B * 2023-01-29 2024-01-23 China Agricultural University Egg freshness identification method, system and equipment

Also Published As

Publication number Publication date
CN116152240A (en) 2023-05-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant