WO2023185209A1 - Model Pruning - Google Patents

Model Pruning (模型剪枝)

Info

Publication number
WO2023185209A1
Authority
WO
WIPO (PCT)
Prior art keywords
pruning
output
model
target model
information
Prior art date
Application number
PCT/CN2023/071540
Other languages
English (en)
French (fr)
Inventor
余昉
王萌
黄堃
程远
Original Assignee
支付宝(杭州)信息技术有限公司
Priority date
Filing date
Publication date
Application filed by 支付宝(杭州)信息技术有限公司
Publication of WO2023185209A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Definitions

  • the embodiments of this specification relate to the field of computer technology, and in particular to methods, devices and computer equipment for model pruning.
  • deep learning has achieved great success in applications of artificial intelligence, including computer vision, speech recognition, natural language processing, etc.
  • deep learning models are usually large in scale and require a high amount of storage and computing resources, making it difficult to efficiently apply deep learning models to various hardware devices.
  • Embodiments of this specification provide a model pruning method, device and computer equipment to reduce the occupation of storage resources and computing resources.
  • the technical solutions of the embodiments of this specification are as follows.
  • a first aspect of the embodiments of this specification provides a model pruning method, which includes: determining mask information according to pruning parameters, the mask information being used to indicate the effective status of a pruning object in a target model; inputting a sample into the target model after the mask information is added, to obtain a first output of the target model; optimizing parameter information according to the first output, the parameter information including model parameters of the target model and the pruning parameters; performing the above steps iteratively until an end condition is met; and pruning the pruning object according to the mask information.
  • a second aspect of the embodiments of this specification provides a model pruning method, which includes: inputting a sample into the target model after a plug-in output module is added, to obtain a first output of the target model and a second output of the plug-in output module, the target model including a plurality of unit modules with the same structure stacked in sequence, and the plug-in output module being connected to a unit module; optimizing parameter information according to the first output and the second output, the parameter information including model parameters of the target model and model parameters of the plug-in output module; performing the above steps iteratively until an end condition is met; and pruning the plurality of unit modules according to performance indicators of the plug-in output module.
  • a third aspect of the embodiments of this specification provides a model pruning device, including: an iteration unit, used to iteratively execute the following steps until an end condition is met: determining mask information according to pruning parameters, the mask information being used to indicate the effective status of a pruning object in a target model; inputting a sample into the target model after the mask information is added, to obtain a first output of the target model; and optimizing parameter information according to the first output, the parameter information including model parameters of the target model and the pruning parameters; and a pruning unit, used to prune the pruning object according to the mask information.
  • a fourth aspect of the embodiments of this specification provides a model pruning device, including: an iteration unit, used to iteratively execute the following steps until an end condition is met: inputting a sample into the target model after a plug-in output module is added, to obtain a first output of the target model and a second output of the plug-in output module, the target model including a plurality of unit modules with the same structure stacked in sequence, and the plug-in output module being connected to a unit module; and optimizing parameter information according to the first output and the second output, the parameter information including model parameters of the target model and model parameters of the plug-in output module; and a pruning unit, used to prune the plurality of unit modules according to performance indicators of the plug-in output module.
  • a fifth aspect of the embodiments of this specification provides a computer device, including at least one processor and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, and the program instructions include instructions for performing the method described in the first aspect or the second aspect.
  • the mask information can be determined based on the pruning parameters, and the pruning parameters can be adaptively adjusted during the learning process of the target model. In this way, pruning the pruning object based on the mask information can improve the pruning effect, for example, the pruning accuracy can be improved, thereby reducing the occupation of storage resources and computing resources.
  • the technical solution provided by the embodiments of this specification adds a plug-in output module to the target model, and the plug-in output module can be trained together with the target model. Use the performance indicators of the plug-in output module to prune multiple unit modules in the target model. This achieves pruning of the target model in the depth dimension, reducing the occupation of storage resources and computing resources.
  • Figure 1 is a schematic diagram of pruning a fully connected neural network in an embodiment of this specification
  • Figure 2 is a schematic structural diagram of the vision transformer in the embodiment of this specification.
  • Figure 3 is a schematic flow chart of the model pruning method in the embodiment of this specification.
  • Figure 4 is a schematic diagram of the first mask information of the linear layer in the embodiment of this specification.
  • Figure 5 is a schematic diagram of the second mask information of the multi-head self-attention layer in the embodiment of this specification.
  • Figure 6 is a schematic diagram of pruning the target model in the depth dimension in the embodiment of this specification.
  • Figure 7 is a schematic flow chart of the model pruning method in the embodiment of this specification.
  • Figure 8 is a schematic structural diagram of the model pruning device in the embodiment of this specification.
  • Figure 9 is a schematic structural diagram of the model pruning device in the embodiment of this specification.
  • Figure 10 is a schematic structural diagram of a computer device in an embodiment of this specification.
  • Model pruning can include structured pruning and unstructured pruning. By pruning the model, model redundancy can be reduced and a lightweight model can be obtained.
  • the lightweight model is smaller in scale and takes up less storage and computing resources.
  • Figure 1 is a schematic diagram of pruning a fully connected neural network.
  • the fully connected neural network may include an L1 layer, an L2 layer, an L3 layer, and an L4 layer. Pruning the fully connected neural network may include: deleting one neuron in the L2 layer, deleting one neuron in the L3 layer, deleting 14 connecting edges between the L1 layer and the L2 layer, deleting 8 connecting edges between the L2 layer and the L3 layer, and deleting 4 connecting edges between the L3 layer and the L4 layer. The connecting edges are used to represent weight parameters.
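  • As an illustrative sketch of this kind of structured pruning (assuming PyTorch tensors; the helper name drop_neuron and the layer sizes are hypothetical, not from the patent), deleting a neuron removes a row of its incoming weight matrix and the corresponding column of the outgoing weight matrix:

```python
import torch

# Deleting neuron j of an intermediate layer removes row j of the incoming
# weight matrix (edges entering the neuron) and column j of the outgoing
# weight matrix (edges leaving the neuron).
def drop_neuron(w_in: torch.Tensor, w_out: torch.Tensor, j: int):
    keep = [i for i in range(w_in.shape[0]) if i != j]
    return w_in[keep, :], w_out[:, keep]

w2 = torch.randn(4, 5)   # L1 -> L2 weights (4 neurons in L2)
w3 = torch.randn(3, 4)   # L2 -> L3 weights
w2p, w3p = drop_neuron(w2, w3, j=1)
print(w2p.shape, w3p.shape)  # torch.Size([3, 5]) torch.Size([3, 3])
```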
  • pruning parameters can be preset, and the model can be pruned according to the preset pruning parameters.
  • the convolutional neural network model may include a batch normalization layer (BN, Batch Normalization).
  • the batch normalization layer has a scaling coefficient γ (a learnable model parameter).
  • an L1 norm penalty can be applied to the scaling coefficient γ, and the convolutional neural network model after applying the L1 norm penalty can be sparsely trained.
  • the loss function used in sparse training can be expressed as L = l(f(x), y) + λ·Σ_{γ∈Γ}|γ|
  • l(f(x), y) represents the loss function used in normal training
  • x represents the sample
  • f(x) represents the prediction result of the convolutional neural network model
  • y represents the sample label of the sample
  • λ represents the sparsification coefficient
  • Σ_{γ∈Γ}|γ| represents the sum of the absolute values of the scaling coefficients γ of all batch normalization layers (i.e., the L1 norm of the scaling coefficients γ).
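  • A minimal sketch of this sparse-training loss, assuming a PyTorch CNN whose BatchNorm2d scaling coefficients γ are penalized (the names sparse_training_loss and lam are illustrative):

```python
import torch
import torch.nn as nn

# L = l(f(x), y) + λ·Σ|γ|, with the L1 penalty summed over the scaling
# coefficients γ (the .weight tensors) of every batch normalization layer.
def sparse_training_loss(model: nn.Module, pred, target, lam: float = 1e-4):
    base = nn.functional.cross_entropy(pred, target)   # l(f(x), y)
    l1 = sum(m.weight.abs().sum()
             for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    return base + lam * l1
```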
  • pruning parameters need to be preset based on manual experience. Such pruning parameters set based on manual experience may not match the actual situation of the model, thus affecting the pruning effect (for example, lower pruning accuracy).
  • the model may include multiple structurally identical unit modules stacked in sequence.
  • the depth of the model may refer to the number of unit modules.
  • the width of the model can refer to the number of processing units in the unit module.
  • the model may include a convolutional neural network model.
  • the convolutional neural network model may include a plurality of intermediate layers with the same structure stacked in sequence.
  • the intermediate layers may include convolutional layers, linear layers, etc.
  • the depth of the convolutional neural network model may refer to the number of intermediate layers.
  • the width of the convolutional neural network model may refer to the number of neurons in the intermediate layer.
  • the model may include a vision transformer (Vision Transformer).
  • the vision transformer may include multiple encoding modules (Encoder Blocks) with the same structure stacked in sequence.
  • the depth of the vision transformer may refer to the number of encoding modules.
  • the width of the vision transformer may refer to the number of self-attention heads in the multi-head self-attention layer.
  • models are often pruned in the width dimension, and models are rarely pruned in the depth dimension. This makes it impossible to reduce model redundancy in the depth dimension.
  • Embodiments of this specification provide a data processing system.
  • the data processing system may include a pruning device and a business device.
  • the pruning device and the business device may each be a personal computer, a server, or a server cluster including multiple servers.
  • the pruning device can be used to perform pruning processing on the target model to obtain a lightweight target model, and can send the lightweight target model to the business device.
  • the business device can receive the lightweight target model, and can input business data into the lightweight target model to obtain a prediction result.
  • the target model may include an image classification model, a text classification model, an audio classification model, etc.
  • the service data may include image data, text data, audio data, etc.
  • the prediction result may be used to indicate whether the business data is target business data.
  • the prediction result can be used to indicate whether the business data is abnormal business data.
  • the abnormal business data may include business data involving illegal content such as fraud.
  • the target model may include a neural network model.
  • the neural network model may include a fully connected neural network model, a neural network model based on a self-attention mechanism, etc.
  • the neural network model based on the self-attention mechanism may include a transformer (Transformer) and the like.
  • the transformer may include a vision transformer.
  • the vision transformer can be applied to image classification scenarios.
  • the image classification scenarios include but are not limited to product classification scenarios (such as similar product recommendation), prohibited item detection scenarios, face recognition scenarios (such as liveness detection), urban planning scenarios (such as road network extraction), meteorological scenarios (such as cloud layer extraction), car insurance claim scenarios (determining the type of an image, such as a car damage image, a property damage image, or a document image), etc.
  • the vision transformer may include the basic vision transformer and its variants. See Figure 2.
  • the vision transformer may include a vector conversion layer, a plurality of encoding modules (Encoder Blocks) with the same structure stacked in sequence, and a classifier (Classifier).
  • the vector conversion layer is used to perform vector conversion to generate vector representation (Embedding).
  • the number of coding modules can be P.
  • the P can be 8, 9, 12, etc.
  • the encoding module can be used to extract feature vectors.
  • the classifier can be used for prediction.
  • Each encoding module can include a normalization layer (Norm), a multi-head self-attention layer (Multi-Head Attention), and a multi-layer perceptron (MLP).
  • the normalization layer is used for normalization.
  • the multi-head self-attention layer may include a linear layer (Linear, also known as a fully connected layer, hereafter referred to as the first linear layer), and multiple self-attention heads.
  • the first linear layer is used to aggregate the outputs of multiple self-attention heads.
  • the number of self-attention heads can be H.
  • the H can be 3, 4, 6, etc.
  • the plurality of self-attention heads have the same structure.
  • Each self-attention head may include a linear layer (Linear, hereinafter referred to as the second linear layer) and a self-attention layer (Self-Attention).
  • the second linear layer is used to perform linear transformation.
  • the self-attention layer is used to perform operations based on the attention mechanism.
  • image data can be segmented to obtain multiple image patches (Patch).
  • the plurality of image blocks have the same height and width.
  • the image block can be flattened to obtain the data sequence of the image block; the data sequence can be input to the vector conversion layer to obtain the vector representation of the image block (Patch Embedding).
  • the vector representation of the image block can be input to the multiple encoding modules to obtain the feature vector output by the last encoding module; the feature vector output by the last encoding module can be input to the classifier to obtain the prediction result of the vision transformer.
  • the feature vector can be fused with the output of the multi-head self-attention layer, and the fusion result can be used as the input of the normalization layer.
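  • The input pipeline described above can be sketched as follows, assuming PyTorch and illustrative sizes (a 224×224 image, 16×16 patches, 384-dimensional embeddings):

```python
import torch

# Split the image into patches of equal height and width, flatten each patch
# into a data sequence, and map it through a linear vector conversion layer
# to obtain the patch embeddings (Patch Embedding).
img = torch.randn(3, 224, 224)                  # C, H, W
p = 16                                          # patch height and width
patches = img.unfold(1, p, p).unfold(2, p, p)   # 3 x 14 x 14 x 16 x 16
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * p * p)  # 196 x 768
embed = torch.nn.Linear(3 * p * p, 384)         # vector conversion layer
patch_embeddings = embed(patches)               # 196 x 384
```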
  • Embodiments of this specification provide a model pruning method.
  • the model pruning method can be applied to pruning equipment.
  • the model pruning method can be used to prune the target model to reduce model redundancy and obtain a lightweight target model. See Figure 3.
  • the model pruning method may include the following steps.
  • Step S11: Determine the mask information according to the pruning parameters.
  • the mask information is used to indicate the effective status of the pruning object in the target model.
  • the effective status can be understood as the participation status of the pruning object in the prediction of the target model, such as whether it participates in prediction.
  • the pruning objects may include linear layers and/or multi-head self-attention layers.
  • the mask information of the linear layer is called the first mask information below, and the mask information of the multi-head self-attention layer is called the second mask information.
  • the first mask information is used to indicate the effective state of the linear layer.
  • the second mask information is used to indicate the effective state of the multi-head self-attention layer.
  • the target model may include a vision transformer. Therefore, the pruning objects may include the linear layers in the self-attention heads, such as the first linear layer and the second linear layer. Of course, the pruning objects can also include other linear layers, such as the linear layers in the multi-layer perceptron.
  • the weight parameters of the linear layer may include multiple sub-weight parameters.
  • the first mask information may include a plurality of first sub-mask information. There is a corresponding relationship between the first sub-mask information and the sub-weight parameters.
  • the first sub-mask information is used to indicate the valid status of the sub-weight parameter.
  • the value of the first sub-mask information may be 0 or 1. 0 is used to indicate that the sub-weight parameter is invalid, and 1 is used to indicate that the sub-weight parameter is valid. Of course, 0 or 1 here are only examples, and the first submask information can also have other values.
  • the weight parameters can be expressed as weight matrices. Elements in the weight matrix can represent sub-weight parameters.
  • the first mask information can be expressed as a mask matrix.
  • the elements in the mask matrix can represent the first sub-mask information.
  • the mask matrix and the weight matrix are matrices of the same order.
  • the first sub-mask information with the same two-dimensional coordinates has a corresponding relationship with the sub-weight parameter.
  • the two-dimensional coordinates may include row numbers and column numbers.
  • the multi-head self-attention layer may include multiple self-attention heads.
  • the second mask information may include a plurality of second sub-mask information. There is a correspondence between the second sub-mask information and the self-attention head.
  • the second submask information is used to indicate the valid state of the self-attention head.
  • the value of the second sub-mask information may be 0 or 1. 0 is used to indicate that the self-attention head is invalid, and 1 is used to indicate that the self-attention head is valid. Of course, 0 or 1 here are only examples, and the second submask information can also have other values.
  • both the number of self-attention heads and the number of second sub-mask information can be H. Each second sub-mask information can correspond to a self-attention head.
  • the pruning parameters include pruning threshold and value information of the pruned object.
  • the first pruning rate of the pruning object can be calculated based on the pruning threshold; the mask information of the pruning object can be determined based on the first pruning rate and the value information.
  • the pruning threshold may be a critical value used to define a valid state.
  • the initial value of the pruning threshold may be an empirical value, such as 10, 16, or 20.
  • the first pruning rate can be understood as the pruning ratio of the pruning object.
  • Value information is used to measure the importance of the pruned object.
  • Value information may include multiple sub-value information.
  • the mask information may include multiple sub-mask information. There may be a corresponding relationship between sub-value information and sub-mask information.
  • the comparison benchmark can be determined based on the first pruning rate and the plurality of sub-value information; the sub-value information can be compared with the comparison benchmark; and the sub-mask information corresponding to the sub-value information can be assigned a value based on the comparison result. Specifically, if the sub-value information is smaller than the comparison benchmark, the corresponding sub-mask information may be assigned a first preset value (for example, a value of 0). If the sub-value information is greater than or equal to the comparison benchmark, the corresponding sub-mask information may be assigned a second preset value (for example, a value of 1).
  • the value information of the linear layer is called the first value information below, and the value information of the multi-head self-attention layer is called the second value information.
  • the first value information is used to measure the importance of the linear layer.
  • the importance of the linear layer can be characterized by the effective state of the sub-weight parameters in the weight parameters.
  • the second value information is used to measure the importance of the multi-head self-attention layer.
  • the importance of the multi-head self-attention layer can be characterized by the effective state of the self-attention heads in the multi-head self-attention layer.
  • the first mask information of the linear layer may be determined according to the first pruning rate and the first value information.
  • the first value information may include a plurality of first sub-value information.
  • the first mask information may include a plurality of first sub-mask information. Each first sub-value information may correspond to multiple first sub-mask information.
  • the comparison benchmark can be determined based on the first pruning rate and the plurality of first sub-value information; the first sub-value information can be compared with the comparison benchmark; and the first sub-mask information corresponding to the first sub-value information can be assigned a value based on the comparison result.
  • the first pruning rate can be expressed as r, 0 < r < 1.
  • the number of first sub-value information in the first value information may be m.
  • the r×m-th smallest first sub-value information can be selected as the comparison benchmark.
  • the m first sub-value information can be arranged in ascending order; the r×m-th first sub-value information can be selected as the comparison benchmark.
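  • A sketch of this benchmark selection and mask assignment, assuming PyTorch and the rule described above (sub-values below the r×m-th smallest are masked out); row_mask is a hypothetical helper name:

```python
import torch

# Sort the m first sub-value items, take the (r*m)-th smallest as the
# comparison benchmark, and assign 0 (invalid) to sub-mask entries whose
# sub-value is below the benchmark and 1 (valid) otherwise.
def row_mask(values: torch.Tensor, r: float) -> torch.Tensor:
    m = values.numel()
    k = max(int(r * m), 1)                   # rank of the comparison benchmark
    benchmark = values.sort().values[k - 1]  # r*m-th smallest first sub-value
    return (values >= benchmark).float()

values = torch.tensor([0.7, 0.1, 0.4, 0.9, 0.2, 0.5])
print(row_mask(values, r=0.5))  # tensor([1., 0., 1., 1., 0., 1.])
```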
  • if the first sub-value information is smaller than the comparison benchmark, the corresponding first sub-mask information may be assigned a first preset value (for example, a value of 0), thereby indicating that the sub-weight parameter corresponding to the first sub-mask information is invalid. If the first sub-value information is greater than or equal to the comparison benchmark, the corresponding first sub-mask information may be assigned a second preset value (for example, a value of 1), thereby indicating that the sub-weight parameter corresponding to the first sub-mask information is valid.
  • the weight parameters of the linear layer can be expressed as a weight matrix. Elements in the weight matrix can represent sub-weight parameters.
  • the first mask information can be expressed as a mask matrix. The elements in the mask matrix can represent the first sub-mask information. The number of rows of the mask matrix and the weight matrix is m, and the number of columns is n. The first sub-mask information with the same two-dimensional coordinates has a corresponding relationship with the sub-weight parameter.
  • the first value information may include m first sub-value information. Each first sub-value information may correspond to a row of first sub-mask information in the mask matrix.
  • if the first sub-value information is smaller than the comparison benchmark, the corresponding first sub-mask information may be assigned a value of 0, thereby indicating that the sub-weight parameter corresponding to the first sub-mask information is invalid. If the first sub-value information is greater than or equal to the comparison benchmark, the corresponding first sub-mask information may be assigned a value of 1, thereby indicating that the sub-weight parameter corresponding to the first sub-mask information is valid. As shown in Figure 4.
  • the gray first sub-value information indicates the first sub-value information that is less than the comparison benchmark, and the white first sub-value information indicates the first sub-value information that is greater than or equal to the comparison benchmark.
  • the gray first submask information represents the first submask information with a value of 0, and the white first submask information represents the first submask information with a value of 1.
  • Gray sub-weight parameters represent invalid sub-weight parameters, and white sub-weight parameters represent valid sub-weight parameters.
  • the second mask information of the multi-head self-attention layer may be determined based on the first pruning rate and the second value information.
  • the second value information may include a plurality of second sub-value information.
  • the second mask information may include a plurality of second sub-mask information. Each second sub-value information may correspond to one second sub-mask information.
  • the comparison benchmark can be determined based on the first pruning rate and the plurality of second sub-value information; the second sub-value information can be compared with the comparison benchmark; and the second sub-mask information corresponding to the second sub-value information can be assigned a value based on the comparison result.
  • the first pruning rate can be expressed as r, 0 < r < 1.
  • the number of second sub-value information in the second value information may be H.
  • the r×H-th smallest second sub-value information can be selected as the comparison benchmark.
  • the H second sub-value information can be arranged in ascending order, and the r×H-th second sub-value information can be selected as the comparison benchmark.
  • if the second sub-value information is smaller than the comparison benchmark, the corresponding second sub-mask information may be assigned a first preset value (for example, a value of 0), thereby indicating that the self-attention head corresponding to the second sub-mask information is invalid. If the second sub-value information is greater than or equal to the comparison benchmark, the corresponding second sub-mask information may be assigned a second preset value (for example, a value of 1), thereby indicating that the self-attention head corresponding to the second sub-mask information is valid.
  • the multi-head self-attention layer can include H self-attention heads.
  • the second mask information may include H pieces of second sub-mask information. Each second sub-mask information can correspond to a self-attention head.
  • the second value information may include H pieces of second sub-value information. Each second sub-value information may correspond to one second sub-mask information.
  • if the second sub-value information is smaller than the comparison benchmark, the corresponding second sub-mask information may be assigned a value of 0, thereby indicating that the self-attention head corresponding to the second sub-mask information is invalid. If the second sub-value information is greater than or equal to the comparison benchmark, the corresponding second sub-mask information may be assigned a value of 1, thereby indicating that the self-attention head corresponding to the second sub-mask information is valid. As shown in Figure 5.
  • the gray second sub-value information represents the second sub-value information that is less than the comparison benchmark, and the white second sub-value information represents the second sub-value information that is greater than or equal to the comparison benchmark.
  • the gray second submask information represents the second submask information with a value of 0, and the white second submask information represents the second submask information with a value of 1.
  • Gray self-attention heads represent invalid self-attention heads, and white self-attention heads represent effective self-attention heads.
  • Step S13: Input the sample into the target model after the mask information is added, to obtain the first output of the target model.
  • the target model after adding the mask information is a target model in which the pruning object has been masked using the mask information. Mask processing sets whether the pruning object participates in the prediction of the target model.
  • the first mask information may be used to perform mask processing on the linear layer.
  • the first mask information can be expressed as a mask matrix
  • the weight parameters of the linear layer can be expressed as a weight matrix.
  • the Hadamard product of the mask matrix and the weight matrix can be calculated, so that, after masking the linear layer, the output of the linear layer contains m elements, and the j-th element can be expressed as y_j = Σ_{k=1}^{n} M_{j,k}·W_{j,k}·x_k
  • M represents the mask matrix.
  • the number of rows of the mask matrix M is m and the number of columns is n.
  • M_{j,k} represents the element in row j and column k of the mask matrix M.
  • W represents the weight matrix.
  • the number of rows of the weight matrix W is m and the number of columns is n.
  • W_{j,k} represents the element in row j and column k of the weight matrix W.
  • x represents the input.
  • the x contains n elements.
  • x_k represents the k-th element of x.
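  • A sketch of this masked linear layer, assuming PyTorch and illustrative sizes:

```python
import torch

# y_j = Σ_k M[j,k]·W[j,k]·x[k]: take the Hadamard product of the m×n mask
# matrix M and weight matrix W, then apply the usual matrix-vector product.
m, n = 4, 6
M = torch.randint(0, 2, (m, n)).float()  # first mask information (0/1)
W = torch.randn(m, n)                    # weight matrix of the linear layer
x = torch.randn(n)                       # input with n elements

y = (M * W) @ x                          # masked output with m elements
```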
  • the second mask information can be used to perform mask processing on the multi-head self-attention layer.
  • the second mask information may include a plurality of second sub-mask information
  • the multi-head self-attention layer may include a plurality of self-attention heads.
  • Each second sub-mask information can correspond to a self-attention head.
  • the second sub-mask information can be multiplied with the output of the corresponding self-attention head, so that, after masking the multiple self-attention heads, the output of the multi-head self-attention layer can be expressed as MSA(x) = W_proj·Concat(M_1·Attn_1(x), …, M_H·Attn_H(x))
  • W_proj represents the weight matrix of the first linear layer in the multi-head self-attention layer.
  • H represents the number of self-attention heads.
  • M_h represents the second sub-mask information corresponding to the h-th self-attention head.
  • Attn_h(x) represents the output of the h-th self-attention head.
  • θ_h represents the model parameters of the self-attention layer in the h-th self-attention head.
  • W_{h,v} represents the weight matrix of the second linear layer in the h-th self-attention head.
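  • A sketch of the masked multi-head self-attention output, assuming PyTorch, with the head outputs Attn_h(x) taken as precomputed; the sizes are illustrative:

```python
import torch

# Each head's output is scaled by its 0/1 second sub-mask value M_h before
# the head outputs are concatenated and aggregated by the first linear
# layer W_proj: MSA(x) = W_proj · Concat(M_h · Attn_h(x)).
H, d_head, d_model = 6, 64, 384
head_outputs = [torch.randn(d_head) for _ in range(H)]  # Attn_h(x), h = 1..H
mask = torch.tensor([1., 1., 0., 1., 0., 1.])           # second sub-mask info
W_proj = torch.randn(d_model, H * d_head)               # first linear layer

masked = torch.cat([mask[h] * head_outputs[h] for h in range(H)])
msa_out = W_proj @ masked
```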
  • the samples may include image samples, text samples, audio samples, etc.
  • the sample has a sample label.
  • the sample label may be used to represent the category of the sample.
  • the target model may include an image classification model, a text classification model, an audio classification model, etc.
  • One or more samples can be input to the target model with mask information added to obtain the first output of the target model.
  • the first output may be the prediction result of the target model.
  • Step S15: Optimize the parameter information according to the first output.
  • parameter information may be optimized based on a loss function.
  • the parameter information may include model parameters and pruning parameters of the target model.
  • the loss function may include a first term and a second term.
  • the first term is used to constrain the second pruning rate of the target model, so that the overall target model approaches or reaches the expected pruning rate.
  • the first term may include an augmented Lagrangian function, such as a two-stage augmented Lagrangian function.
  • the second pruning rate can be understood as the pruning ratio of the target model, which can be calculated based on the first pruning rate.
  • the second pruning rate can be calculated according to the formula R = (Σ_{l=1}^{L} r_l·n_l) / N.
  • L represents the number of pruning objects
  • r_l represents the first pruning rate of the l-th pruning object
  • n_l represents the number of parameters of the l-th pruning object
  • N represents the number of parameters of the target model.
  • the second term is used to represent the deviation of the first output from the sample label.
  • the second term may include cross-entropy loss function (Cross-Entropy Loss), mean square error loss function (MSE), etc.
  • the loss function can be expressed as L = L_CE + L_p.
  • L_p represents the first term.
  • the first term may be a two-stage augmented Lagrangian function.
  • λ_1 and λ_2 represent Lagrange multipliers.
  • λ_1 and λ_2 can be fixed values.
  • λ_1 and λ_2 can also be learnable parameters.
  • the initial values of λ_1 and λ_2 can be empirical values, such as 0, 1, 3, etc.
  • R represents the second pruning rate of the target model.
  • R_t represents the expected pruning rate of the target model.
  • L_CE represents the second term.
  • the second term may be a cross-entropy loss function.
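  • Under these definitions, the loss can be sketched as follows, assuming PyTorch; the two-stage augmented Lagrangian form λ_1·(R − R_t) + λ_2·(R − R_t)² is an assumption consistent with the multipliers defined above, not a formula quoted from the publication:

```python
import torch

# L = L_CE + L_p, where R is the parameter-weighted second pruning rate and
# L_p pushes R toward the expected pruning rate R_t (assumed two-stage
# augmented Lagrangian form).
def pruning_loss(pred, target, r, n_params, N, R_t, lam1, lam2):
    L_ce = torch.nn.functional.cross_entropy(pred, target)  # second term
    R = (r * n_params).sum() / N                            # second pruning rate
    L_p = lam1 * (R - R_t) + lam2 * (R - R_t) ** 2          # first term
    return L_ce + L_p
```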
  • the loss function can be used to calculate the loss information based on the first output; the parameter information can be optimized using the backpropagation mechanism based on the loss information.
  • the backpropagation mechanism can be used to calculate the gradient of parameter information; the parameter information can be adjusted according to the gradient of parameter information.
  • the parameter information may include pruning parameters
  • the pruning parameters may include pruning thresholds and value information.
  • the gradient of the pruning threshold can be calculated according to the following formula.
  • from the gradient of the pruning threshold, it can be seen that the larger the number of parameters of a pruning object is, the more likely the pruning object is to be pruned.
  • the gradient of the first value information can be calculated according to the following formula.
  • the gradient of the second value information can be calculated according to the following formula.
  • Step S17: Perform pruning processing on the pruning object according to the mask information.
  • steps S11 to S15 may be executed iteratively until the end condition is met.
  • the end condition can be flexibly set according to actual needs.
  • the end condition may be that the number of iterations reaches a preset threshold.
  • the end condition may also be that the second pruning rate of the target model reaches the expected pruning rate.
  • the pruning objects in the target model can be pruned based on the current mask information.
  • the mask information can be determined based on the pruning parameters. Pruning parameters can be adaptively adjusted during the learning process of the target model. In this way, pruning the pruning object based on the mask information can improve the pruning effect, for example, the pruning accuracy can be improved.
  • the pruning object is pruned according to the mask information, and the target model is pruned in the width dimension.
  • the mask information may include first mask information.
  • the first mask information may include a plurality of first sub-mask information.
  • the pruning objects may include linear layers.
  • the weight parameters of the linear layer can include multiple sub-weight parameters.
  • the sub-weight parameters can be deleted according to the values of the first sub-mask information. Specifically, the first sub-mask information has a corresponding relationship with the sub-weight parameters, and the sub-weight parameter corresponding to first sub-mask information whose value is the first preset value may be deleted.
  • the first mask information can be expressed as a mask matrix.
  • the elements in the mask matrix can represent the first sub-mask information.
  • the first value information may include a plurality of first sub-value information. Each first sub-value information may correspond to a row of first sub-mask information in the mask matrix. Therefore, for each row of the mask matrix, the values of the first sub-mask information in the row are equal.
  • the weight parameters of the linear layer can be expressed as a weight matrix, and the elements in the weight matrix can represent sub-weight parameters. Therefore, by deleting sub-weight parameters according to the values of the first sub-mask information, the sub-weight parameters in the weight matrix can be deleted row by row, thereby reducing the number of rows of the weight matrix. Reducing the number of rows of the weight matrix reduces the number of elements contained in the output of the linear layer. In other words, this scenario example prunes the output of the linear layer.
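  • A sketch of this row-wise deletion, assuming PyTorch:

```python
import torch

# Because each first sub-value item governs a whole row of the mask matrix,
# rows of the weight matrix whose mask row is 0 can be physically deleted,
# shrinking the linear layer's output dimension.
W = torch.randn(4, 6)                       # weight matrix, m = 4 rows
row_mask = torch.tensor([1., 0., 1., 1.])   # per-row first sub-mask values
W_pruned = W[row_mask.bool(), :]            # keep only valid rows
print(W_pruned.shape)                       # torch.Size([3, 6])
```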
  • the mask information may include second mask information.
  • the second mask information may include a plurality of second sub-mask information.
  • the pruning object may include a multi-head self-attention layer.
  • the multi-head self-attention layer may include multiple self-attention heads.
  • the self-attention head can be deleted according to the value of the second sub-mask information.
  • the second sub-mask information has a corresponding relationship with the self-attention head.
  • the self-attention head corresponding to the second sub-mask information whose value is the first preset value can be deleted.
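  • A sketch of head deletion together with the matching column pruning of the first linear layer, assuming PyTorch and illustrative sizes:

```python
import torch

# Heads whose second sub-mask value is the first preset value (0) are
# dropped; the first linear layer W_proj loses the column block that
# consumed each deleted head's output.
H, d_head, d_model = 6, 64, 384
head_mask = torch.tensor([1, 1, 0, 1, 0, 1], dtype=torch.bool)
W_proj = torch.randn(d_model, H * d_head)
cols = head_mask.repeat_interleave(d_head)   # expand head mask to columns
W_proj_pruned = W_proj[:, cols]              # drop columns of deleted heads
print(W_proj_pruned.shape)                   # torch.Size([384, 256])
```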
  • an output module can also be added to the target model.
  • the added output module will be called plug-in output module below.
  • the plug-in output module can be used for prediction.
  • the plug-in output module may be a plug-in classification module, such as a plug-in classifier (Plugged-in Classifier).
  • the plug-in classification module can be used for classification prediction.
  • the plug-in output module can also be a plug-in regression module.
  • the plug-in regression module can be used for regression prediction.
  • the target model may include a plurality of unit modules with the same structure stacked in sequence. Among the plurality of unit modules with the same structure, the output of the previous unit module can be passed as input to the next unit module for processing by the next unit module.
  • the plug-in output module can be connected to a unit module and used to make predictions based on the output of the unit module.
  • the number of plug-in output modules may be one or more. Each plug-in output module can be connected to one unit module. Each unit module can be connected to zero or one plug-in output module.
  • the target model itself may also include an output module.
  • the output modules in the target model are called internal output modules below.
  • the internal output module may be used for prediction, for example, prediction based on the output of the last unit module among the plurality of unit modules with the same structure.
  • the internal output module may be an internal classification module, such as an internal classifier.
  • the internal classification module can be used for classification prediction.
  • the internal output module can also be an internal regression module.
  • the internal regression module can be used for regression prediction.
  • the target model may include a vision transformer.
  • the plurality of unit modules with the same structure may be the encoding modules.
  • multiple plug-in classifiers can be added to the vision transformer.
  • Each plug-in classifier can be connected to an encoding module.
  • the vision transformer may include P encoding modules with the same structure stacked in sequence. The P can be 8, 9, 12, etc.
  • P-1 plug-in classifiers can be added to the vision transformer.
  • the P-1 plug-in classifiers can be connected to the P-1 encoding modules other than the last encoding module.
  • Each plug-in classifier can be connected to an encoding module.
  • the last encoding module can be connected to the internal classifier of the vision transformer.
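  • A sketch of such a stack with plug-in classifiers, assuming PyTorch; the encoding module is abstracted as any callable module, and the linear classification heads are an illustrative choice rather than the patent's mandated architecture:

```python
import torch
import torch.nn as nn

# P encoding modules stacked in sequence; P-1 plug-in classifiers attached
# to all blocks except the last, whose output feeds the internal classifier.
class StackWithPlugins(nn.Module):
    def __init__(self, block_fn, P: int, d_model: int, n_classes: int):
        super().__init__()
        self.blocks = nn.ModuleList([block_fn() for _ in range(P)])
        self.plugins = nn.ModuleList(
            [nn.Linear(d_model, n_classes) for _ in range(P - 1)])
        self.internal_head = nn.Linear(d_model, n_classes)

    def forward(self, x):
        second_outputs = []                      # plug-in predictions
        for i, blk in enumerate(self.blocks):
            x = blk(x)
            if i < len(self.blocks) - 1:
                second_outputs.append(self.plugins[i](x))
        first_output = self.internal_head(x)     # target model prediction
        return first_output, second_outputs

# e.g. StackWithPlugins(lambda: nn.Identity(), P=4, d_model=384, n_classes=10)
```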
  • in step S13, the sample can be input into the target model after adding both the mask information and the plug-in output module, to obtain the first output of the target model and the second output of the plug-in output module.
  • the parameter information may be optimized based on the first output and the second output.
  • the loss function can be used to calculate the loss information based on the first output and the second output; the parameter information can be optimized using the backpropagation mechanism based on the loss information.
  • the backpropagation mechanism can be used to calculate the gradient of parameter information; the parameter information can be adjusted according to the gradient of parameter information.
  • the first output may be the prediction result of the target model.
  • the second output may be the prediction result of the plug-in output module.
  • the parameter information may include model parameters of the target model, model parameters of the plug-in output module, and pruning parameters.
  • the loss function may include a first term, a second term, and a third term.
  • the first term is used to constrain the second pruning rate of the target model, so that the overall target model approaches or reaches the expected pruning rate.
  • the second term is used to represent the deviation of the first output from the sample label.
  • the third term is used to represent the deviation between the second output and the sample label.
  • the third term may include multiple sub-terms.
  • the third term may be the sum of the plurality of sub-terms. Each sub-term is used to represent the deviation between the second output of one plug-in output module and the sample label.
  • the sub-terms may include a cross-entropy loss function, a mean square error loss function, etc.
  • the loss function can be expressed as L = L_CE + L_p + L_C.
  • L_p represents the first term.
  • L_CE represents the second term.
  • L_C represents the third term, e.g., L_C = Σ_i L_ci.
  • L_ci represents the deviation between the second output of the i-th plug-in output module and the sample label.
  • the L_ci may be a cross-entropy loss function.
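  • A sketch of this combined loss, assuming PyTorch; L_p is the pruning-rate term sketched earlier, and second_outputs collects the plug-in predictions:

```python
import torch.nn.functional as F

# L = L_CE + L_p + L_C, where the third term L_C sums the cross-entropy
# deviations L_ci of every plug-in output module's second output from the
# sample label.
def total_loss(first_output, second_outputs, target, L_p):
    L_ce = F.cross_entropy(first_output, target)
    L_c = sum(F.cross_entropy(o, target) for o in second_outputs)
    return L_ce + L_p + L_c
```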
  • the performance indicators may include accuracy, recall, precision, F1-Score and any combination thereof.
  • the verification data can be used to test the performance of the plug-in output modules and obtain the performance indicators of the plug-in output modules; the target unit module can be determined based on the performance indicators of the plug-in output modules; and the unit modules after the target unit module can be deleted. For example, the plug-in output module with the best performance can be selected as the target plug-in output module, the unit module connected to the target plug-in output module can be used as the target unit module, and the unit modules after the target unit module can be deleted. As another example, the verification data can also be used to test the performance of the internal output module and obtain the performance indicators of the internal output module.
  • the output module with the best performance can be selected as the target output module; the unit module connected to the target output module can be used as the target unit module; and the unit modules after the target unit module can be deleted.
  • the output module with the best performance can be an internal output module or a plug-in output module.
  • the target output module may be an internal output module or a plug-in output module.
  • output modules other than the target output module may include internal output modules and/or plug-in output modules.
  • the mask information can be determined based on the pruning parameters, and the pruning parameters can be adaptively adjusted during the learning process of the target model. In this way, pruning the pruning object based on the mask information can improve the pruning effect, for example, the pruning accuracy can be improved.
  • the embodiment of this specification also provides another model pruning method.
  • the model pruning method can be applied to pruning equipment.
  • the model pruning method is used to prune the target model to reduce model redundancy and obtain a lightweight target model.
  • the model pruning method may include the following steps.
  • Step S21: Input the sample into the target model after the plug-in output module is added, to obtain the first output of the target model and the second output of the plug-in output module.
  • an output module can be added to the target model.
  • the added output module will be called plug-in output module below.
  • the plug-in output module can be used for prediction.
  • the plug-in output module may be a plug-in classification module, such as a plug-in classifier (Plugged-in Classifier).
  • the plug-in classification module can be used for classification prediction.
  • the plug-in output module can also be a plug-in regression module.
  • the plug-in regression module can be used for regression prediction.
  • the target model may include a plurality of unit modules with the same structure stacked in sequence. Among the plurality of unit modules with the same structure, the output of the previous unit module can be passed as input to the next unit module for processing by the next unit module.
  • the plug-in output module can be connected to a unit module and used to make predictions based on the output of the unit module.
  • the number of plug-in output modules may be one or more. Each plug-in output module can be connected to one unit module. Each unit module can be connected to zero or one plug-in output module.
  • the target model itself may also include an output module.
  • the output modules in the target model are called internal output modules below.
  • the internal output module may be used for prediction, for example, prediction based on the output of the last unit module among the plurality of unit modules with the same structure.
  • the internal output module may be an internal classification module, such as an internal classifier.
  • the internal classification module can be used for classification prediction.
  • the internal output module can also be an internal regression module.
  • the internal regression module can be used for regression prediction.
  • the samples may include image samples, text samples, audio samples, etc.
  • the sample has a sample label.
  • the sample label may be used to represent the category of the sample.
  • the target model may include an image classification model, a text classification model, an audio classification model, etc.
  • One or more samples can be input into the target model after adding the plug-in output module, to obtain the first output of the target model and the second output of the plug-in output module.
  • Step S23: Optimize the parameter information according to the first output and the second output.
  • a loss function can be used to calculate the loss information based on the first output and the second output; a backpropagation mechanism can be used to optimize the parameter information based on the loss information.
  • the backpropagation mechanism can be used to calculate the gradient of parameter information; the parameter information can be adjusted according to the gradient of parameter information.
  • the first output may be the prediction result of the target model.
  • the second output may be the prediction result of the plug-in output module.
  • the parameter information may include model parameters of the target model and model parameters of the plug-in output module.
  • the loss function may include a second term and a third term.
  • the second term is used to represent the deviation of the first output from the sample label.
  • the third term is used to represent the deviation between the second output and the sample label.
  • the third term may include multiple sub-terms.
  • the third term may be the sum of the plurality of sub-terms.
  • each sub-term is used to represent the deviation between the second output of one plug-in output module and the sample label.
  • the sub-terms may include a cross-entropy loss function, a mean square error loss function, etc.
  • the loss function can be expressed as L = L_CE + L_C.
  • L_CE represents the second term.
  • L_C represents the third term.
  • L_ci represents the deviation between the second output of the i-th plug-in output module and the sample label.
  • the L_ci may be a cross-entropy loss function.
  • Step S25: Prune the multiple unit modules according to the performance indicators of the plug-in output module.
  • steps S21 to S25 may be executed iteratively until the end condition is met.
  • the end condition can be flexibly set according to actual needs. For example, the end condition may be that the number of iterations reaches a preset threshold. After the iteration process is completed, multiple unit modules can be pruned according to the performance indicators of the plug-in output module.
  • the performance indicators may include accuracy, recall, precision, F1-Score and any combination thereof.
  • the verification data can be used to test the performance of the plug-in output modules and obtain the performance indicators of the plug-in output modules; the target unit module can be determined based on the performance indicators of the plug-in output modules; and the unit modules after the target unit module can be deleted. For example, the plug-in output module with the best performance can be selected as the target output module, the unit module connected to the target output module can be used as the target unit module, and the unit modules after the target unit module can be deleted. As another example, the verification data can also be used to test the performance of the internal output module and obtain the performance indicators of the internal output module.
  • the output module with the best performance can be selected as the target output module; the unit module connected to the target output module can be used as the target unit module; and the unit modules after the target unit module can be deleted.
  • the output module with the best performance can be an internal output module or a plug-in output module.
  • the target output module may be an internal output module or a plug-in output module.
  • output modules other than the target output module may include internal output modules and/or plug-in output modules.
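  • A sketch of this depth pruning in plain Python, with hypothetical names; accuracies stands for the performance indicators measured on the verification data:

```python
# Score each output module (plug-in classifiers plus the internal one) on
# the verification data, pick the best as the target output module, and
# delete every unit module after the one it is attached to.
def prune_depth(blocks, heads, accuracies):
    best = max(range(len(accuracies)), key=lambda i: accuracies[i])
    return blocks[: best + 1], heads[best]   # truncated stack, target head

blocks = ["blk1", "blk2", "blk3", "blk4"]
heads = ["clf1", "clf2", "clf3", "internal"]
print(prune_depth(blocks, heads, [0.71, 0.78, 0.80, 0.79]))
# (['blk1', 'blk2', 'blk3'], 'clf3')
```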
  • the model pruning method in the embodiments of this specification adds a plug-in output module to the target model, and the plug-in output module can be trained together with the target model. The performance indicators of the plug-in output module are used to prune multiple unit modules in the target model. This enables the target model to be pruned in the depth dimension, allowing the target model to achieve greater parallel operation efficiency.
  • An embodiment of this specification also provides a model pruning device.
  • the model pruning device may be provided on a computer device.
  • the computer device may be a personal computer, a server, a server cluster containing multiple servers, or the like.
  • the model pruning device may include the following units.
  • the iteration unit 31 is used to iteratively execute the following steps until the end condition is met: determine the mask information according to the pruning parameters, the mask information being used to indicate the effective status of the pruning object in the target model; input the sample into the target model after the mask information is added, to obtain the first output of the target model; and optimize the parameter information according to the first output, the parameter information including the model parameters and pruning parameters of the target model. The pruning unit 33 is used to prune the pruning object according to the mask information.
  • the embodiment of this specification also provides another model pruning device.
  • the model pruning device may be provided on a computer device.
  • the computer device may be a personal computer, a server, a server cluster containing multiple servers, or the like.
  • the model pruning device may include the following units.
  • the iteration unit 41 is used to iteratively execute the following steps until the end condition is met: input the sample into the target model after the plug-in output module is added, to obtain the first output of the target model and the second output of the plug-in output module, the target model including multiple unit modules with the same structure stacked in sequence, and the plug-in output module being connected to a unit module; optimize the parameter information according to the first output and the second output, the parameter information including the model parameters of the target model and the model parameters of the plug-in output module. The pruning unit 43 is used to prune the multiple unit modules according to the performance indicators of the plug-in output module.
  • Figure 10 is a schematic diagram of the hardware structure of the computer device in this embodiment.
  • the computer device may include one or more (only one is shown in the figure) processors, memories and transmission modules.
  • the hardware structure shown in FIG. 10 is only illustrative, and it does not limit the hardware structure of the above-mentioned computer equipment.
  • the computer device may also include more or fewer components than shown in Figure 10, or have a configuration different from that shown in Figure 10.
  • the memory may include high-speed random access memory; or may also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. Of course, the memory may also include remotely located network storage.
  • the memory may be used to store program instructions or modules of application software, such as the program instructions or modules of the embodiments corresponding to Figure 3 or Figure 7 of this specification.
  • the processor may be implemented in any suitable manner.
  • the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASIC), programmable logic controllers, embedded microcontrollers, etc.
  • the processor can read and execute program instructions or modules in the memory.
  • the transmission module may be used for data transmission via a network, for example, data transmission via a network such as the Internet, an intranet, a local area network, a mobile communication network, etc.
  • the computer storage media include but are not limited to random access memory (RAM), read-only memory (ROM), cache (Cache), hard disk drives (HDD), memory cards (Memory Card), and so on.
  • the computer storage medium stores computer program instructions. When the computer program instructions are executed, the program instructions or modules of the embodiment corresponding to Figure 3 or Figure 7 of this specification are implemented.
  • PLD: Programmable Logic Device
  • FPGA: Field Programmable Gate Array
  • HDL: Hardware Description Language, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, RHDL, VHDL (Very-High-Speed Integrated Circuit Hardware Description Language), and Verilog
  • a typical implementation device is a computer.
  • the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • the computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and includes a number of instructions to cause a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments, or in certain parts of the embodiments, of this specification.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • the present description may also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected through communication networks.
  • in a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Stored Programmes (AREA)

Abstract

The embodiments of this specification disclose a model pruning method, device and computer equipment. The method includes: determining mask information according to pruning parameters, the mask information being used to indicate the effective status of pruning objects in a target model; inputting samples into the target model after the mask information has been added, to obtain a first output of the target model; optimizing parameter information according to the first output, the parameter information including model parameters of the target model and the pruning parameters; performing the above steps iteratively until an end condition is met; and pruning the pruning objects according to the mask information. The embodiments of this specification can prune the target model so as to reduce the occupation of storage resources and computing resources.

Description

Model Pruning
Technical Field
The embodiments of this specification relate to the field of computer technology, and in particular to model pruning methods, devices and computer equipment.
Background
In recent years, deep learning has achieved great success in artificial intelligence applications, including computer vision, speech recognition and natural language processing. However, deep learning models are usually large in scale and occupy a high amount of storage and computing resources, making it difficult to apply deep learning models efficiently on various hardware devices.
Therefore, deep learning models need to be pruned to reduce the occupation of storage resources and computing resources.
Summary
The embodiments of this specification provide a model pruning method, device and computer equipment to reduce the occupation of storage resources and computing resources. The technical solutions of the embodiments of this specification are as follows.
A first aspect of the embodiments of this specification provides a model pruning method, including: determining mask information according to pruning parameters, the mask information being used to indicate the effective status of pruning objects in a target model; inputting samples into the target model after the mask information has been added, to obtain a first output of the target model; optimizing parameter information according to the first output, the parameter information including model parameters of the target model and the pruning parameters; performing the above steps iteratively until an end condition is met; and pruning the pruning objects according to the mask information.
A second aspect of the embodiments of this specification provides a model pruning method, including: inputting samples into a target model after plug-in output modules have been added, to obtain a first output of the target model and second outputs of the plug-in output modules, the target model including multiple unit modules of identical structure stacked in sequence, with the plug-in output modules connected to unit modules; optimizing parameter information according to the first output and the second outputs, the parameter information including model parameters of the target model and model parameters of the plug-in output modules; performing the above steps iteratively until an end condition is met; and pruning the multiple unit modules according to the performance indicators of the plug-in output modules.
A third aspect of the embodiments of this specification provides a model pruning device, including: an iteration unit, configured to iteratively execute the following steps until an end condition is met: determining mask information according to pruning parameters, the mask information being used to indicate the effective status of pruning objects in a target model; inputting samples into the target model after the mask information has been added, to obtain a first output of the target model; and optimizing parameter information according to the first output, the parameter information including model parameters of the target model and the pruning parameters; and a pruning unit, configured to prune the pruning objects according to the mask information.
A fourth aspect of the embodiments of this specification provides a model pruning device, including: an iteration unit, configured to iteratively execute the following steps until an end condition is met: inputting samples into a target model after plug-in output modules have been added, to obtain a first output of the target model and second outputs of the plug-in output modules, the target model including multiple unit modules of identical structure stacked in sequence, with the plug-in output modules connected to unit modules; and optimizing parameter information according to the first output and the second outputs, the parameter information including model parameters of the target model and model parameters of the plug-in output modules; and a pruning unit, configured to prune the multiple unit modules according to the performance indicators of the plug-in output modules.
A fifth aspect of the embodiments of this specification provides a computer device, including at least one processor and a memory storing program instructions, wherein the program instructions are configured to be executable by the at least one processor, and the program instructions include instructions for executing the method of the first aspect or the second aspect.
With the technical solutions provided by the embodiments of this specification, the mask information can be determined from the pruning parameters, and the pruning parameters can be adaptively adjusted during the learning process of the target model. Pruning the pruning objects according to such mask information can improve the pruning effect, for example the pruning precision, thereby reducing the occupation of storage resources and computing resources. In addition, in the technical solutions provided by the embodiments of this specification, plug-in output modules are added to the target model and can be trained together with it; the performance indicators of the plug-in output modules are used to prune the multiple unit modules in the target model. Pruning of the target model in the depth dimension is thus achieved, reducing the occupation of storage resources and computing resources.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments recorded in this specification; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a schematic diagram of pruning a fully connected neural network in an embodiment of this specification;
Figure 2 is a schematic structural diagram of a vision transformer in an embodiment of this specification;
Figure 3 is a schematic flowchart of a model pruning method in an embodiment of this specification;
Figure 4 is a schematic diagram of the first mask information of a linear layer in an embodiment of this specification;
Figure 5 is a schematic diagram of the second mask information of a multi-head self-attention layer in an embodiment of this specification;
Figure 6 is a schematic diagram of pruning a target model in the depth dimension in an embodiment of this specification;
Figure 7 is a schematic flowchart of a model pruning method in an embodiment of this specification;
Figure 8 is a schematic structural diagram of a model pruning device in an embodiment of this specification;
Figure 9 is a schematic structural diagram of a model pruning device in an embodiment of this specification;
Figure 10 is a schematic structural diagram of a computer device in an embodiment of this specification.
Detailed Description
The technical solutions in the embodiments of this specification will be described clearly and completely below with reference to the drawings in the embodiments of this specification. Obviously, the described embodiments are only some of the embodiments of this specification, not all of them. Based on the embodiments in this specification, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the scope of protection of this specification.
Model pruning can include structured pruning, unstructured pruning and so on. By pruning a model, model redundancy can be reduced and a lightweight model can be obtained. The lightweight model is smaller in scale and occupies fewer storage and computing resources.
For example, Figure 1 is a schematic diagram of pruning a fully connected neural network. As shown in Figure 1, the fully connected neural network can include layers L1, L2, L3 and L4. Pruning the fully connected neural network can include: deleting one neuron in layer L2, deleting one neuron in layer L3, deleting 14 connection edges between layers L1 and L2, deleting 8 connection edges between layers L2 and L3, and deleting 4 connection edges between layers L3 and L4. The connection edges represent weight parameters.
In the related art, pruning parameters (for example a pruning threshold) can be set in advance, and the model can be pruned according to the preset pruning parameters. Taking a convolutional neural network model as an example, the model can include batch normalization (BN) layers. A batch normalization layer has a scaling factor γ (a learnable model parameter). An L1-norm penalty can be imposed on the scaling factors γ, and the convolutional neural network model with this penalty can undergo sparsity training. The loss function used in sparsity training can be expressed as
L = l(f(x), y) + λ·Σ|γ|.
Here l(f(x), y) is the loss function used in normal training, x is a sample, f(x) is the prediction result of the convolutional neural network model, y is the sample label of the sample, λ is the sparsity coefficient, and Σ|γ| is the sum of the absolute values of the scaling factors γ of the batch normalization layers (that is, the L1 norm of the scaling factors γ). After sparsity training is completed, scaling factors γ with small values can be selected according to the pruning parameters, and the selected scaling factors γ can be used for pruning, for example by deleting the convolutional-layer channels corresponding to the smaller scaling factors γ.
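As a rough, hedged illustration of this related-art scheme only (not the method of the embodiments below), the penalty can be added to an ordinary task loss in a framework such as PyTorch; the function name and the coefficient value here are illustrative assumptions:

```python
import torch

def sparsity_training_loss(model, task_loss, lam=1e-4):
    # Related-art sketch: add the L1 norm of every batch-normalization
    # scaling factor gamma (bn.weight) to the normal training loss,
    # i.e. L = l(f(x), y) + lambda * sum(|gamma|).
    l1 = sum(bn.weight.abs().sum()
             for bn in model.modules()
             if isinstance(bn, torch.nn.BatchNorm2d))
    return task_loss + lam * l1
```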
In the above related art, the pruning parameters need to be preset based on human experience. Pruning parameters set in this way may not match the actual situation of the model, which affects the pruning effect (for example, the pruning precision is low).
In addition, a model can include multiple unit modules of identical structure stacked in sequence. The depth of the model can refer to the number of these unit modules, and the width of the model can refer to the number of processing units in a unit module. For example, the model can be a convolutional neural network model including multiple identically structured intermediate layers stacked in sequence; the intermediate layers can include convolutional layers, linear layers and so on; the depth of the convolutional neural network model can refer to the number of intermediate layers, and its width can refer to the number of neurons in an intermediate layer. As another example, the model can be a vision transformer including multiple identically structured encoder blocks stacked in sequence; the depth of the vision transformer can refer to the number of encoder blocks, and its width can refer to the number of self-attention heads in a multi-head self-attention layer. In the related art, models are often pruned in the width dimension and rarely in the depth dimension, so model redundancy cannot be reduced in the depth dimension.
The embodiments of this specification provide a data processing system.
The data processing system can include a pruning device and a business device. The pruning device and the business device can each be a personal computer, a server, or a server cluster containing multiple servers. The pruning device can be used to prune a target model to obtain a lightweight target model, and can send the lightweight target model to the business device. The business device can receive the lightweight target model, input business data into it, and obtain a prediction result. The target model can include an image classification model, a text classification model, an audio classification model and so on. The business data can include image data, text data, audio data and so on. The prediction result can be used to indicate whether the business data is target business data, for example whether the business data is abnormal business data. Abnormal business data can include business data involving illegal content such as fraud.
The target model can include a neural network model. The neural network model can include a fully connected neural network model, a neural network model based on the self-attention mechanism, and so on. The neural network model based on the self-attention mechanism can include a Transformer and the like; the Transformer can include a Vision Transformer.
The vision transformer can be applied to image classification scenarios, including but not limited to commodity classification scenarios (e.g., recommending similar commodities), contraband detection scenarios (e.g., detecting prohibited items), face recognition scenarios (e.g., liveness detection), urban planning scenarios (e.g., road network extraction), meteorological scenarios (e.g., cloud layer extraction), and vehicle insurance claim scenarios (determining the purpose of an image, e.g., a vehicle damage image, a property damage image or a certificate image). The vision transformer can include the basic vision transformer and its variants. Referring to Figure 2, the vision transformer can include a vector conversion layer, multiple identically structured encoder blocks stacked in sequence, and a classifier. The vector conversion layer is used to perform vector conversion to generate embeddings. In practice, the number of encoder blocks can be P, where P can be 8, 9, 12, etc. The encoder blocks can be used to extract feature vectors. The classifier can be used for prediction.
Each encoder block can include a normalization layer (Norm), a multi-head self-attention layer (Multi-Head Attention), a multi-layer perceptron (MLP) and so on. The normalization layer is used for normalization. The multi-head self-attention layer can include a linear layer (Linear, also known as a fully connected layer, hereinafter called the first linear layer) and multiple self-attention heads. The first linear layer is used to aggregate the outputs of the multiple self-attention heads. In practice, the number of self-attention heads can be H, where H can be 3, 4, 6, etc. The multiple self-attention heads have the same structure. Each self-attention head can include a linear layer (Linear, hereinafter called the second linear layer) and a self-attention layer (Self-Attention). The second linear layer is used to perform a linear transformation. The self-attention layer is used to perform attention-based operations.
In an image classification scenario, image data can be divided into multiple patches of the same height and width. Each patch can be flattened to obtain its data sequence; the data sequence can be input into the vector conversion layer to obtain the patch embedding. The patch embeddings can be input into the multiple encoder blocks to obtain the feature vector output by the last encoder block; that feature vector can be input into the classifier to obtain the prediction result of the vision transformer.
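A minimal sketch of the patch-splitting step (tensor reshaping only; the patch size and shapes are illustrative assumptions):

```python
import torch

def image_to_patch_sequences(images, patch=16):
    # images: (B, C, H, W), with H and W divisible by `patch`.
    B, C, H, W = images.shape
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)  # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return x  # (B, num_patches, patch_dim): one flattened data sequence per patch

patches = image_to_patch_sequences(torch.randn(2, 3, 224, 224))
print(patches.shape)  # torch.Size([2, 196, 768])
```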
It should be noted that, in Figure 2, the circled symbol (an inline image in the original) denotes fusion. For example, in an encoder block, the feature vector can be fused with the output of the multi-head self-attention layer, and the fusion result can serve as the input of the normalization layer.
The embodiments of this specification provide a model pruning method. The model pruning method can be applied to a pruning device, and can be used to prune a target model so as to reduce model redundancy and obtain a lightweight target model. Referring to Figure 3, the model pruning method can include the following steps.
Step S11: determine mask information according to pruning parameters.
In some embodiments, the mask information is used to indicate the effective status of pruning objects in the target model; the effective status can be understood as the participation status of the pruning object when the target model makes predictions, for example whether it participates in prediction. The pruning objects can include linear layers and/or multi-head self-attention layers. For ease of description, the mask information of a linear layer is hereinafter called the first mask information, and the mask information of a multi-head self-attention layer is called the second mask information. The first mask information is used to indicate the effective status of a linear layer, and the second mask information is used to indicate the effective status of a multi-head self-attention layer. In some scenario examples, the target model can include a vision transformer, so the pruning objects can include the linear layers in the self-attention heads, for example the first linear layer and the second linear layer. Of course, the pruning objects can also include other linear layers, for example the linear layers in the multi-layer perceptron.
In some implementations of this embodiment, the weight parameter of a linear layer can include multiple sub-weight parameters, and the first mask information can include multiple pieces of first sub-mask information. There is a correspondence between first sub-mask information and sub-weight parameters, and the first sub-mask information indicates the effective status of the corresponding sub-weight parameter. For example, the value of the first sub-mask information can be 0 or 1, where 0 indicates that the sub-weight parameter is invalid and 1 indicates that it is valid. Of course, 0 and 1 here are only examples, and the first sub-mask information can also take other values. In some scenario examples, the weight parameter can be expressed as a weight matrix whose elements represent sub-weight parameters, and the first mask information can be expressed as a mask matrix whose elements represent first sub-mask information. The mask matrix and the weight matrix are matrices of the same order, and first sub-mask information and sub-weight parameters with the same two-dimensional coordinates (row number and column number) correspond to each other.
In other implementations of this embodiment, a multi-head self-attention layer can include multiple self-attention heads, and the second mask information can include multiple pieces of second sub-mask information. There is a correspondence between second sub-mask information and self-attention heads, and the second sub-mask information indicates the effective status of the corresponding self-attention head. For example, the value of the second sub-mask information can be 0 or 1, where 0 indicates that the self-attention head is invalid and 1 indicates that it is valid. Of course, 0 and 1 here are only examples, and the second sub-mask information can also take other values. In some scenario examples, the number of self-attention heads and the number of pieces of second sub-mask information can both be H, with each piece of second sub-mask information corresponding to one self-attention head.
In some embodiments, the pruning parameters include a pruning threshold and value information of the pruning object. A first pruning rate of the pruning object can be calculated according to the pruning threshold, and the mask information of the pruning object can be determined according to the first pruning rate and the value information.
The pruning threshold can be a critical value used to delimit the effective status. The initial value of the pruning threshold can be an empirical value, for example 10, 16 or 20. The first pruning rate can be understood as the trimming ratio of the pruning object. The first pruning rate can be calculated using a Sigmoid, Tanh or ReLU function, for example according to the formula r = σ(β), where r is the first pruning rate, β is the pruning threshold, and σ is the Sigmoid function.
The value information is used to measure the importance of the pruning object. The value information can include multiple pieces of sub-value information, the mask information can include multiple pieces of sub-mask information, and sub-value information can correspond to sub-mask information. A comparison baseline can then be determined from the first pruning rate and the multiple pieces of sub-value information; each piece of sub-value information can be compared with the baseline, and the corresponding sub-mask information can be assigned according to the comparison result. Specifically, if the sub-value information is smaller than the baseline, the corresponding sub-mask information can be assigned a first preset value (for example the number 0); if the sub-value information is greater than or equal to the baseline, the corresponding sub-mask information can be assigned a second preset value (for example the number 1).
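A minimal sketch of this rate-then-baseline rule, assuming a Sigmoid for the rate and 0/1 mask values (the function and variable names are illustrative):

```python
import torch

def make_mask(beta, sub_values):
    # First pruning rate r = sigmoid(beta): the trimming ratio of the object.
    r = torch.sigmoid(beta)
    m = sub_values.numel()
    k = int(r.item() * m)                   # index of the (r*m)-th smallest value
    if k == 0:
        return torch.ones_like(sub_values)  # nothing falls below the baseline
    baseline = sub_values.sort().values[k - 1]
    # Sub-values below the baseline get the first preset value 0 (invalid);
    # the rest get the second preset value 1 (valid).
    return (sub_values >= baseline).float()

mask = make_mask(torch.tensor(0.0), torch.randn(8))  # r = 0.5 here
```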
For ease of description, the value information of a linear layer is hereinafter called the first value information, and the value information of a multi-head self-attention layer is called the second value information. The first value information measures the importance of a linear layer, which can be characterized by the effective status of the sub-weight parameters in its weight parameter. The second value information measures the importance of a multi-head self-attention layer, which can be characterized by the effective status of the self-attention heads in that layer.
In some implementations of this embodiment, the first mask information of a linear layer can be determined according to the first pruning rate and the first value information. Specifically, the first value information can include multiple pieces of first sub-value information, the first mask information can include multiple pieces of first sub-mask information, and each piece of first sub-value information can correspond to multiple pieces of first sub-mask information. A comparison baseline can be determined from the first pruning rate and the multiple pieces of first sub-value information; each piece of first sub-value information can be compared with the baseline, and the first sub-mask information corresponding to it can be assigned according to the comparison result.
The first pruning rate can be expressed as r, 0 ≤ r ≤ 1. The number of pieces of first sub-value information can be m. The (r×m)-th smallest piece of first sub-value information can be selected as the comparison baseline; for example, the m pieces of first sub-value information can be arranged in ascending order, and the (r×m)-th one selected as the baseline. Of course, this is not a limitation in practice, and the baseline can also be determined in other ways.
If a piece of first sub-value information is smaller than the baseline, the corresponding first sub-mask information can be assigned the first preset value (for example 0), indicating that the corresponding sub-weight parameters are invalid. If it is greater than or equal to the baseline, the corresponding first sub-mask information can be assigned the second preset value (for example 1), indicating that the corresponding sub-weight parameters are valid.
In some scenario examples, referring to Figure 4, the weight parameter of a linear layer can be expressed as a weight matrix whose elements represent sub-weight parameters, and the first mask information can be expressed as a mask matrix whose elements represent first sub-mask information. The mask matrix and the weight matrix both have m rows and n columns, and first sub-mask information and sub-weight parameters with the same two-dimensional coordinates correspond to each other. The first value information can include m pieces of first sub-value information, each corresponding to one row of first sub-mask information in the mask matrix.
If a piece of first sub-value information is smaller than the baseline, the corresponding first sub-mask information can be assigned the value 0, indicating that the corresponding sub-weight parameters are invalid; if it is greater than or equal to the baseline, the corresponding first sub-mask information can be assigned the value 1, indicating that the corresponding sub-weight parameters are valid. In Figure 4, gray first sub-value information denotes values smaller than the baseline and white denotes values greater than or equal to the baseline; gray first sub-mask information denotes the value 0 and white denotes the value 1; gray sub-weight parameters are invalid and white sub-weight parameters are valid.
In other implementations of this embodiment, the second mask information of a multi-head self-attention layer can be determined according to the first pruning rate and the second value information. Specifically, the second value information can include multiple pieces of second sub-value information, the second mask information can include multiple pieces of second sub-mask information, and each piece of second sub-value information can correspond to one piece of second sub-mask information. A comparison baseline can be determined from the first pruning rate and the multiple pieces of second sub-value information; each piece of second sub-value information can be compared with the baseline, and the second sub-mask information corresponding to it can be assigned according to the comparison result.
The first pruning rate can be expressed as r, 0 ≤ r ≤ 1. The number of pieces of second sub-value information can be H. The (r×H)-th smallest piece of second sub-value information can be selected as the comparison baseline; for example, the H pieces of second sub-value information can be arranged in ascending order, and the (r×H)-th one selected as the baseline. Of course, this is not a limitation in practice, and the baseline can also be determined in other ways.
If a piece of second sub-value information is smaller than the baseline, the corresponding second sub-mask information can be assigned the first preset value (for example 0), indicating that the corresponding self-attention head is invalid. If it is greater than or equal to the baseline, the corresponding second sub-mask information can be assigned the second preset value (for example 1), indicating that the corresponding self-attention head is valid.
In some scenario examples, referring to Figure 5, a multi-head self-attention layer can include H self-attention heads, and the second mask information can include H pieces of second sub-mask information, each corresponding to one self-attention head. The second value information can include H pieces of second sub-value information, each corresponding to one piece of second sub-mask information.
If a piece of second sub-value information is smaller than the baseline, the corresponding second sub-mask information can be assigned the value 0, indicating that the corresponding self-attention head is invalid; if it is greater than or equal to the baseline, the corresponding second sub-mask information can be assigned the value 1, indicating that the corresponding self-attention head is valid. As shown in Figure 5, gray second sub-value information denotes values smaller than the baseline and white denotes values greater than or equal to the baseline; gray second sub-mask information denotes the value 0 and white denotes the value 1; gray self-attention heads are invalid and white self-attention heads are valid.
Step S13: input samples into the target model after the mask information has been added, and obtain the first output of the target model.
In some embodiments, the target model after the mask information has been added is the target model in which the pruning objects have been masked using the mask information. Through the mask processing, the participation status of the pruning objects during prediction by the target model can be set.
In some implementations of this embodiment, the linear layer can be masked using the first mask information.
For example, the first mask information can be expressed as a mask matrix and the weight parameter of the linear layer as a weight matrix, and the Hadamard product of the mask matrix and the weight matrix can be computed. After the linear layer is masked, its output contains m elements, of which the j-th element can be expressed as
y_j = Σ_{k=1..n} M_{j,k} · W_{j,k} · x_k.
Here M is the mask matrix with m rows and n columns, and M_{j,k} is its element in row j and column k; W is the weight matrix with m rows and n columns, and W_{j,k} is its element in row j and column k; x is the input, which contains n elements, and x_k is its k-th element.
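A minimal sketch of this masked linear layer (the Hadamard product of mask and weight, applied to the input; all shapes are illustrative):

```python
import torch

def masked_linear(x, W, M):
    # y_j = sum_k M[j,k] * W[j,k] * x[k]; M * W is the Hadamard product.
    return (M * W) @ x

m, n = 4, 6
W = torch.randn(m, n)                 # weight matrix, m rows and n columns
M = torch.ones(m, n); M[2] = 0.0      # row 2 masked: its sub-value was low
y = masked_linear(torch.randn(n), W, M)
print(y[2])                           # tensor(0.): the masked row outputs zero
```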
In some implementations of this embodiment, the multi-head self-attention layer can be masked using the second mask information.
For example, the second mask information can include multiple pieces of second sub-mask information, and the multi-head self-attention layer can include multiple self-attention heads, each corresponding to one piece of second sub-mask information. The second sub-mask information can be multiplied by the output of the corresponding self-attention head. After the multi-head self-attention layer is masked, its output can be expressed as
MHA(x) = W_proj · Concat_{h=1..H}( M_h · Attn_h(x) ),
where W_proj is the weight matrix of the first linear layer in the multi-head self-attention layer, H is the number of self-attention heads, M_h is the second sub-mask information corresponding to the h-th self-attention head, and Attn_h(x) is the output of the h-th self-attention head:
Attn_h(x) = α_h · W_{h,v} · x,
where α_h is a model parameter of the self-attention layer in the h-th self-attention head, and W_{h,v} is the weight matrix of the second linear layer in the h-th self-attention head.
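A single-token sketch of the masked multi-head self-attention layer, following the per-head form above; the precomputed attention parameters alpha_h and all shapes here are illustrative assumptions:

```python
import torch

def masked_mha(x, alpha, W_v, W_proj, head_mask):
    # Each head computes Attn_h(x) = alpha_h @ (W_{h,v} @ x); its output is
    # scaled by the second sub-mask M_h (0 or 1), and the concatenated head
    # outputs are aggregated by the first linear layer W_proj.
    heads = [head_mask[h] * (alpha[h] @ (W_v[h] @ x))
             for h in range(len(head_mask))]
    return W_proj @ torch.cat(heads)

H, d, dh = 3, 12, 4                               # H heads of width d/H
out = masked_mha(torch.randn(d),
                 torch.randn(H, dh, dh),          # alpha_h, illustrative
                 torch.randn(H, dh, d),           # W_{h,v}
                 torch.randn(d, d),               # W_proj
                 torch.tensor([1., 0., 1.]))      # the second head is masked
```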
In some embodiments, the samples can include image samples, text samples, audio samples and so on. A sample has a sample label, which can indicate the category of the sample. The target model can include an image classification model, a text classification model, an audio classification model and so on. One or more samples can be input into the target model after the mask information has been added, to obtain the first output of the target model. The first output can be the prediction result of the target model.
Step S15: optimize parameter information according to the first output.
In some embodiments, the parameter information can be optimized according to a loss function. The parameter information can include the model parameters of the target model and the pruning parameters. The loss function can include a first term and a second term. The first term is used to constrain the second pruning rate of the target model, so that the target model as a whole approaches or reaches the expected pruning rate. The first term can include an augmented Lagrangian function, for example a two-stage augmented Lagrangian function; of course, in practice other functions can also be used to constrain the second pruning rate. The second pruning rate can be understood as the trimming ratio of the target model as a whole and can be calculated from the first pruning rates, for example according to the formula
R = ( Σ_{l=1..L} r_l · n_l ) / N,
where L is the number of pruning objects, r_l is the first pruning rate of the l-th pruning object, n_l is the number of parameters of the l-th pruning object, and N is the number of parameters of the target model. The second term represents the deviation between the first output and the sample labels and can include a cross-entropy loss function, a mean squared error (MSE) loss function, and so on.
In some scenario examples, the loss function can be expressed as
L = L_CE + L_p,
where L_p is the first term, which can be a two-stage augmented Lagrangian function:
L_p = λ_1 · (R - R_t) + λ_2 · (R - R_t)²,
where λ_1 and λ_2 are Lagrange multipliers. λ_1 and λ_2 can be fixed values, or they can be learnable parameters whose initial values are empirical values, for example 0, 1 or 3. R is the second pruning rate of the target model, and R_t is the expected pruning rate of the target model. Through the two-stage augmented Lagrangian function, the target model as a whole can be guided during learning to approach or reach the expected pruning rate. L_CE is the second term, which can be a cross-entropy loss function.
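A minimal sketch of this loss, assuming the two-stage augmented Lagrangian form given above (lam1 and lam2 could equally be learnable tensors):

```python
import torch.nn.functional as F

def pruning_loss(logits, labels, R, R_t, lam1, lam2):
    # L = L_CE + L_p, where L_p = lam1*(R - R_t) + lam2*(R - R_t)^2 pulls
    # the model-level second pruning rate R toward the expected rate R_t.
    l_ce = F.cross_entropy(logits, labels)            # second term
    l_p = lam1 * (R - R_t) + lam2 * (R - R_t) ** 2    # first term
    return l_ce + l_p
```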
In some embodiments, loss information can be calculated from the first output using the loss function, and the parameter information can be optimized from the loss information using the back-propagation mechanism. For example, the gradients of the parameter information can be calculated by back-propagation, and the parameter information can be adjusted according to these gradients.
For example, the parameter information can include the pruning parameters, which can include the pruning threshold and the value information. The gradient of the pruning threshold of the l-th pruning object, ∂L/∂β_l, can be calculated by the chain rule; consistent with the definitions above, its differentiable part is
∂L/∂β_l = ( λ_1 + 2·λ_2·(R - R_t) ) · (n_l / N) · r_l·(1 - r_l).
From the gradient of the pruning threshold it can be seen that the larger the number of parameters of a pruning object, the more that pruning object tends to be pruned. The original publication further gives formulas (inline images in the source that are not recoverable from this extraction) for the gradient of the j-th piece of first sub-value information and for the gradient of the h-th piece of second sub-value information, which are likewise obtained by back-propagating the loss through the mask assignment.
Step S17: prune the pruning objects according to the mask information.
In some embodiments, steps S11 to S15 can be executed iteratively until the end condition is met. The end condition can be set flexibly according to actual needs; for example, it can be that the number of iterations reaches a preset threshold, or that the second pruning rate of the target model reaches the expected pruning rate.
After the iteration process ends, the pruning objects in the target model can be pruned according to the current mask information. The mask information can be determined from the pruning parameters, and the pruning parameters can be adaptively adjusted during the learning process of the target model. Pruning the pruning objects according to such mask information can improve the pruning effect, for example the pruning precision. In addition, pruning the pruning objects according to the mask information realizes pruning of the target model in the width dimension.
In some embodiments, the mask information can include the first mask information, which can include multiple pieces of first sub-mask information. The pruning object can include a linear layer whose weight parameter includes multiple sub-weight parameters. Sub-weight parameters can be deleted according to the values of the first sub-mask information; for example, since first sub-mask information corresponds to sub-weight parameters, the sub-weight parameters corresponding to first sub-mask information whose value is the first preset value can be deleted.
It is worth noting that, in some scenario examples, the first mask information can be expressed as a mask matrix whose elements represent first sub-mask information, and the first value information can include multiple pieces of first sub-value information, each corresponding to one row of first sub-mask information in the mask matrix, so that within each row of the mask matrix all first sub-mask information takes the same value. The weight parameter of the linear layer can be expressed as a weight matrix whose elements represent sub-weight parameters. Deleting sub-weight parameters according to the values of the first sub-mask information can then delete sub-weight parameters of the weight matrix row by row, reducing the number of rows of the weight matrix. Reducing the number of rows of the weight matrix reduces the number of elements in the output of the linear layer; that is, this scenario example can prune the output of the linear layer.
In some embodiments, the mask information can include the second mask information, which can include multiple pieces of second sub-mask information. The pruning object can include a multi-head self-attention layer, which can include multiple self-attention heads. Self-attention heads can be deleted according to the values of the second sub-mask information; for example, since second sub-mask information corresponds to self-attention heads, the self-attention heads corresponding to second sub-mask information whose value is the first preset value can be deleted.
In some embodiments, an output module can also be added to the target model. The added output module is hereinafter called a plug-in output module. The plug-in output module can be used for prediction. Specifically, the plug-in output module can be a plug-in classification module, for example a plugged-in classifier, which can be used for classification prediction. Of course, the plug-in output module can also be a plug-in regression module, which can be used for regression prediction.
The target model can include multiple unit modules of identical structure stacked in sequence. Among these unit modules, the output of a preceding unit module can be passed as input to the next unit module for processing by that next unit module. A plug-in output module can be connected to a unit module and used to make predictions based on that unit module's output. The number of plug-in output modules can be one or more; each plug-in output module can be connected to one unit module, and each unit module can be connected to zero or one plug-in output module.
Of course, the target model itself can also include an output module. To distinguish it from the plug-in output modules, the output module in the target model is hereinafter called the internal output module. The internal output module can be used for prediction, for example based on the output of the last of the multiple identically structured unit modules. Specifically, the internal output module can be an internal classification module, for example an internal classifier, which can be used for classification prediction. Of course, the internal output module can also be an internal regression module, which can be used for regression prediction.
For example, referring to Figure 6, the target model can include a vision transformer, and the multiple identically structured modules can include encoder blocks. Multiple plug-in classifiers can be added to the vision transformer, each connected to one encoder block. Specifically, for example, the vision transformer can include P identically structured encoder blocks stacked in sequence, where P can be 8, 9, 12, etc. P-1 plug-in classifiers can be added to the vision transformer and connected to the P-1 encoder blocks other than the last one, with each plug-in classifier connected to one encoder block. The last encoder block can be connected to the internal classifier of the vision transformer.
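A minimal sketch of attaching plug-in classifiers to a stack of identical unit modules (the class, dimension and head names are illustrative assumptions):

```python
import torch.nn as nn

class PluggedModel(nn.Module):
    def __init__(self, blocks, dim, num_classes):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)               # P stacked unit modules
        self.plug_heads = nn.ModuleList(                  # P-1 plug-in classifiers
            nn.Linear(dim, num_classes) for _ in blocks[:-1])
        self.internal_head = nn.Linear(dim, num_classes)  # internal classifier

    def forward(self, x):
        second_outputs = []
        for i, block in enumerate(self.blocks):
            x = block(x)                                  # output feeds next block
            if i < len(self.blocks) - 1:
                second_outputs.append(self.plug_heads[i](x))
        return self.internal_head(x), second_outputs      # first and second outputs
```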
In some embodiments, in step S13, samples can be input into the target model after both the mask information and the plug-in output modules have been added, to obtain the first output of the target model and the second outputs of the plug-in output modules. In step S15, the parameter information can be optimized according to the first output and the second outputs. After the iteration process ends, the multiple unit modules can be pruned according to the performance indicators of the plug-in output modules. In this way, the target model can be pruned in the width dimension and the depth dimension simultaneously. Pruning the target model in the depth dimension enables it to obtain greater parallel operating efficiency.
Specifically, loss information can be calculated from the first output and the second outputs using a loss function, and the parameter information can be optimized from the loss information using the back-propagation mechanism; for example, the gradients of the parameter information can be calculated by back-propagation, and the parameter information can be adjusted according to these gradients. The first output can be the prediction result of the target model, and the second outputs can be the prediction results of the plug-in output modules. The parameter information can include the model parameters of the target model, the model parameters of the plug-in output modules, and the pruning parameters.
The loss function can include a first term, a second term and a third term. The first term constrains the second pruning rate of the target model so that the model as a whole approaches or reaches the expected pruning rate. The second term represents the deviation between the first output and the sample labels. The third term represents the deviation between the second outputs and the sample labels. Specifically, the third term can include multiple sub-terms and can be the sum of these sub-terms, with each sub-term representing the deviation between the second output of one plug-in output module and the sample labels. The sub-terms can include cross-entropy loss functions, mean squared error loss functions, and so on.
In some scenario examples, the loss function can be expressed as
L = L_CE + L_p + L_C,
where L_p is the first term, L_CE is the second term and L_C is the third term:
L_C = Σ_{i=1..P} L_ci,
where P is the number of plug-in output modules and L_ci is the deviation between the second output of the i-th plug-in output module and the sample labels; L_ci can be a cross-entropy loss function.
Specifically, the performance indicators can include accuracy, recall, precision, F1-score and any combination thereof. The performance of the plug-in output modules can be tested with validation data to obtain their performance indicators; a target unit module can be determined according to these indicators, and the unit modules after the target unit module can be deleted. For example, the plug-in output module with the best performance can be selected as the target plug-in output module, the unit module it is connected to can be taken as the target unit module, and the unit modules after the target unit module can be deleted. As another example, the performance of the internal output module can also be tested with validation data to obtain its performance indicator; the output module with the best performance can then be selected as the target output module, the unit module it is connected to taken as the target unit module, and the unit modules after the target unit module deleted. The output module with the best performance can be either the internal output module or a plug-in output module.
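A minimal sketch of this selection rule, assuming `accuracies[i]` holds the validation accuracy of the output module attached after block i (the last entry being the internal output module) and a model shaped like the earlier sketch:

```python
def prune_depth(model, accuracies):
    # Keep the blocks up to (and including) the block whose attached output
    # module performs best; delete every unit module after it.
    target = max(range(len(accuracies)), key=accuracies.__getitem__)
    model.blocks = model.blocks[: target + 1]
    return model, target
```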
Further, the output modules other than the target output module can also be deleted, the target output module can be used as the output module of the target model itself, and the output of the target output module can be used as the output of the target model. The target output module can be the internal output module or a plug-in output module; the other output modules can include the internal output module and/or plug-in output modules.
With the model pruning method of the embodiments of this specification, the mask information can be determined from the pruning parameters, and the pruning parameters can be adaptively adjusted during the learning process of the target model. Pruning the pruning objects according to such mask information can improve the pruning effect, for example the pruning precision.
The embodiments of this specification also provide another model pruning method. The model pruning method can be applied to a pruning device and is used to prune a target model so as to reduce model redundancy and obtain a lightweight target model.
Referring to Figure 7, the model pruning method can include the following steps.
Step S21: input samples into the target model after the plug-in output modules have been added, and obtain the first output of the target model and the second outputs of the plug-in output modules.
In some embodiments, an output module can be added to the target model. The added output module is hereinafter called a plug-in output module. The plug-in output module can be used for prediction. Specifically, the plug-in output module can be a plug-in classification module, for example a plugged-in classifier, which can be used for classification prediction. Of course, the plug-in output module can also be a plug-in regression module, which can be used for regression prediction.
The target model can include multiple unit modules of identical structure stacked in sequence. Among these unit modules, the output of a preceding unit module can be passed as input to the next unit module for processing by that next unit module. A plug-in output module can be connected to a unit module and used to make predictions based on that unit module's output. The number of plug-in output modules can be one or more; each plug-in output module can be connected to one unit module, and each unit module can be connected to zero or one plug-in output module.
Of course, the target model itself can also include an output module. To distinguish it from the plug-in output modules, the output module in the target model is hereinafter called the internal output module. The internal output module can be used for prediction, for example based on the output of the last of the multiple identically structured unit modules. Specifically, the internal output module can be an internal classification module, for example an internal classifier, which can be used for classification prediction. Of course, the internal output module can also be an internal regression module, which can be used for regression prediction.
In some embodiments, the samples can include image samples, text samples, audio samples and so on. A sample has a sample label, which can indicate the category of the sample. The target model can include an image classification model, a text classification model, an audio classification model and so on. One or more samples can be input into the target model after the plug-in output modules have been added, to obtain the first output of the target model and the second outputs of the plug-in output modules.
Step S23: optimize parameter information according to the first output and the second outputs.
In some embodiments, loss information can be calculated from the first output and the second outputs using a loss function, and the parameter information can be optimized from the loss information using the back-propagation mechanism; for example, the gradients of the parameter information can be calculated by back-propagation, and the parameter information can be adjusted according to these gradients. The first output can be the prediction result of the target model, and the second outputs can be the prediction results of the plug-in output modules. The parameter information can include the model parameters of the target model and the model parameters of the plug-in output modules.
The loss function can include a second term and a third term. The second term represents the deviation between the first output and the sample labels. The third term represents the deviation between the second outputs and the sample labels. Specifically, the third term can include multiple sub-terms and can be the sum of these sub-terms, with each sub-term representing the deviation between the second output of one plug-in output module and the sample labels. The sub-terms can include cross-entropy loss functions, mean squared error loss functions, and so on.
In some scenario examples, the loss function can be expressed as
L = L_CE + L_C,
where L_CE is the second term and L_C is the third term:
L_C = Σ_{i=1..P} L_ci,
where P is the number of plug-in output modules and L_ci is the deviation between the second output of the i-th plug-in output module and the sample labels; L_ci can be a cross-entropy loss function.
Step S25: prune the multiple unit modules according to the performance indicators of the plug-in output modules.
In some embodiments, steps S21 to S23 can be executed iteratively until the end condition is met. The end condition can be set flexibly according to actual needs; for example, it can be that the number of iterations reaches a preset threshold. After the iteration process ends, the multiple unit modules can be pruned according to the performance indicators of the plug-in output modules.
In some embodiments, the performance indicators can include accuracy, recall, precision, F1-score and any combination thereof. The performance of the plug-in output modules can be tested with validation data to obtain their performance indicators; a target unit module can be determined according to these indicators, and the unit modules after the target unit module can be deleted. For example, the plug-in output module with the best performance can be selected as the target output module, the unit module it is connected to taken as the target unit module, and the unit modules after the target unit module deleted. As another example, the performance of the internal output module can also be tested with validation data to obtain its performance indicator; the output module with the best performance can then be selected as the target output module, the unit module it is connected to taken as the target unit module, and the unit modules after the target unit module deleted. The output module with the best performance can be either the internal output module or a plug-in output module.
Further, the output modules other than the target output module can also be deleted, the target output module can be used as the output module of the target model itself, and the output of the target output module can be used as the output of the target model. The target output module can be the internal output module or a plug-in output module; the other output modules can include the internal output module and/or plug-in output modules.
With the model pruning method of the embodiments of this specification, plug-in output modules are added to the target model and can be trained together with the target model. The performance indicators of the plug-in output modules are used to prune the multiple unit modules in the target model, thus achieving pruning of the target model in the depth dimension and enabling the target model to obtain greater parallel operating efficiency.
The embodiments of this specification also provide a model pruning device. The model pruning device can be deployed on a computer device, which can be a personal computer, a server, or a server cluster containing multiple servers.
Referring to Figure 8, the model pruning device can include the following units:
an iteration unit 31, configured to iteratively execute the following steps until the end condition is met: determining mask information according to pruning parameters, the mask information being used to indicate the effective status of pruning objects in the target model; inputting samples into the target model after the mask information has been added, to obtain the first output of the target model; and optimizing parameter information according to the first output, the parameter information including the model parameters of the target model and the pruning parameters; and a pruning unit 33, configured to prune the pruning objects according to the mask information.
The embodiments of this specification also provide another model pruning device. The model pruning device can be deployed on a computer device, which can be a personal computer, a server, or a server cluster containing multiple servers.
Referring to Figure 9, the model pruning device can include the following units:
an iteration unit 41, configured to iteratively execute the following steps until the end condition is met: inputting samples into the target model after the plug-in output modules have been added, to obtain the first output of the target model and the second outputs of the plug-in output modules, the target model including multiple unit modules of identical structure stacked in sequence, with the plug-in output modules connected to unit modules; and optimizing parameter information according to the first output and the second outputs, the parameter information including the model parameters of the target model and the model parameters of the plug-in output modules; and a pruning unit 43, configured to prune the multiple unit modules according to the performance indicators of the plug-in output modules.
An embodiment of the computer device of this specification is described below. Figure 10 is a schematic diagram of the hardware structure of the computer device in this embodiment. As shown in Figure 10, the computer device can include one or more processors, memories and transmission modules (only one of each is shown in the figure). Of course, those of ordinary skill in the art can understand that the hardware structure shown in Figure 10 is only illustrative and does not limit the hardware structure of the above computer device. In practice, the computer device can also include more or fewer component units than shown in Figure 10, or have a configuration different from that shown in Figure 10.
The memory can include high-speed random access memory; or it can also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. Of course, the memory can also include remotely located network storage. The memory can be used to store program instructions or modules of application software, for example the program instructions or modules of the embodiments corresponding to Figure 3 or Figure 7 of this specification.
The processor can be implemented in any suitable manner. For example, the processor can take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so on. The processor can read and execute the program instructions or modules in the memory.
The transmission module can be used for data transmission via a network, for example via the Internet, an intranet, a local area network or a mobile communication network.
This specification also provides an embodiment of a computer storage medium. The computer storage medium includes, but is not limited to, random access memory (RAM), read-only memory (ROM), cache, hard disk drive (HDD), memory card, and so on. The computer storage medium stores computer program instructions which, when executed, implement the program instructions or modules of the embodiments corresponding to Figure 3 or Figure 7 of this specification.
It should be noted that the embodiments in this specification are described in a progressive manner; for identical or similar parts between the embodiments, reference can be made to one another, and each embodiment focuses on its differences from the others. In particular, the device embodiments, the computer device embodiment and the computer storage medium embodiment are basically similar to the method embodiments, so their descriptions are relatively brief; for relevant details, refer to the descriptions of the method embodiments. In addition, it can be understood that, after reading this specification, those skilled in the art can conceive of arbitrary combinations of some or all of the embodiments listed herein without creative effort, and these combinations are also within the scope disclosed and protected by this specification.
In the 1990s, an improvement to a technology could be clearly distinguished as a hardware improvement (for example, an improvement to circuit structures such as diodes, transistors and switches) or a software improvement (an improvement to a method flow). However, with the development of technology, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with hardware entity modules. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. Designers program a digital system onto a single PLD by themselves, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, this programming is mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the original code to be compiled must be written in a specific programming language called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also understand that a hardware circuit implementing a logical method flow can easily be obtained simply by lightly programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The systems, devices, modules or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, the computer can be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
From the description of the above embodiments, those skilled in the art can clearly understand that this specification can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of this specification, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and includes a number of instructions to cause a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods described in the various embodiments, or in certain parts of the embodiments, of this specification.
This specification can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices.
This specification can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform specific tasks or implement specific abstract data types. This specification can also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
Although this specification has been depicted through embodiments, those of ordinary skill in the art know that there are many variations and changes to this specification that do not depart from its spirit, and it is hoped that the appended claims include these variations and changes without departing from the spirit of this specification.

Claims (15)

  1. A model pruning method, comprising:
    determining mask information according to pruning parameters, the mask information being used to indicate the effective status of pruning objects in a target model;
    inputting samples into the target model after the mask information has been added, to obtain a first output of the target model;
    optimizing parameter information according to the first output, the parameter information comprising model parameters of the target model and the pruning parameters;
    performing the above steps iteratively until an end condition is met; and
    pruning the pruning objects according to the mask information.
  2. The method according to claim 1, wherein the pruning parameters comprise a pruning threshold and value information, the value information being used to measure the importance of the pruning object, and determining the mask information comprises:
    calculating a first pruning rate of the pruning object according to the pruning threshold; and
    determining the mask information of the pruning object according to the first pruning rate and the value information.
  3. The method according to claim 1, wherein optimizing the parameter information comprises:
    optimizing the parameter information according to a loss function, the loss function comprising a first term and a second term, the first term being used to constrain a second pruning rate of the target model, and the second term representing the deviation between the first output and the sample labels.
  4. The method according to claim 1, wherein the pruning objects comprise a linear layer and/or a multi-head self-attention layer; and
    the mask information comprises first mask information and/or second mask information, the first mask information being used to indicate the effective status of the linear layer, and the second mask information being used to indicate the effective status of the multi-head self-attention layer.
  5. The method according to claim 4, wherein the first mask information comprises multiple pieces of first sub-mask information, the weight parameter of the linear layer comprises multiple sub-weight parameters, and pruning the pruning objects comprises:
    deleting sub-weight parameters according to the values of the first sub-mask information.
  6. The method according to claim 4, wherein the second mask information comprises multiple pieces of second sub-mask information, the multi-head self-attention layer comprises multiple self-attention heads, and pruning the pruning objects comprises:
    deleting self-attention heads according to the values of the second sub-mask information.
  7. The method according to claim 1, wherein inputting the samples into the target model after the mask information has been added comprises:
    inputting the samples into the target model after the mask information and plug-in output modules have been added, to obtain second outputs of the plug-in output modules, the target model comprising multiple unit modules of identical structure stacked in sequence, with the plug-in output modules connected to unit modules; and
    optimizing the parameter information comprises:
    optimizing the parameter information according to the first output and the second outputs, the parameter information comprising model parameters of the plug-in output modules.
  8. The method according to claim 7, wherein optimizing the parameter information comprises:
    optimizing the parameter information according to a loss function,
    the loss function comprising a third term, the third term representing the deviation between the second outputs and the sample labels.
  9. The method according to claim 7, further comprising:
    pruning the multiple unit modules according to the performance indicators of the plug-in output modules.
  10. A model pruning method, comprising:
    inputting samples into a target model after plug-in output modules have been added, to obtain a first output of the target model and second outputs of the plug-in output modules, the target model comprising multiple unit modules of identical structure stacked in sequence, with the plug-in output modules connected to unit modules;
    optimizing parameter information according to the first output and the second outputs, the parameter information comprising model parameters of the target model and model parameters of the plug-in output modules;
    performing the above steps iteratively until an end condition is met; and
    pruning the multiple unit modules according to the performance indicators of the plug-in output modules.
  11. The method according to claim 10, wherein optimizing the parameter information comprises:
    optimizing the parameter information according to a loss function, the loss function comprising a second term and a third term, the second term representing the deviation between the first output and the sample labels, and the third term representing the deviation between the second outputs and the sample labels.
  12. The method according to claim 10, wherein pruning the multiple unit modules comprises:
    selecting a target unit module from the multiple unit modules according to the performance indicators of the plug-in output modules; and
    deleting the unit modules after the target unit module.
  13. A model pruning device, comprising:
    an iteration unit, configured to iteratively execute the following steps until an end condition is met: determining mask information according to pruning parameters, the mask information being used to indicate the effective status of pruning objects in a target model; inputting samples into the target model after the mask information has been added, to obtain a first output of the target model; and optimizing parameter information according to the first output, the parameter information comprising model parameters of the target model and the pruning parameters; and
    a pruning unit, configured to prune the pruning objects according to the mask information.
  14. A model pruning device, comprising:
    an iteration unit, configured to iteratively execute the following steps until an end condition is met: inputting samples into a target model after plug-in output modules have been added, to obtain a first output of the target model and second outputs of the plug-in output modules, the target model comprising multiple unit modules of identical structure stacked in sequence, with the plug-in output modules connected to unit modules; and optimizing parameter information according to the first output and the second outputs, the parameter information comprising model parameters of the target model and model parameters of the plug-in output modules; and
    a pruning unit, configured to prune the multiple unit modules according to the performance indicators of the plug-in output modules.
  15. A computer device, comprising:
    at least one processor; and
    a memory storing program instructions, wherein the program instructions are configured to be executable by the at least one processor, the program instructions comprising instructions for executing the method according to any one of claims 1 to 12.
PCT/CN2023/071540 2022-03-31 2023-01-10 Model pruning WO2023185209A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210330396.0A CN114819140A (zh) 2022-03-31 2022-03-31 Model pruning method, device and computer equipment
CN202210330396.0 2022-03-31

Publications (1)

Publication Number Publication Date
WO2023185209A1 true WO2023185209A1 (zh) 2023-10-05

Family

ID=82532593

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/071540 WO2023185209A1 (zh) 2022-03-31 2023-01-10 Model pruning

Country Status (2)

Country Link
CN (1) CN114819140A (zh)
WO (1) WO2023185209A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114819140A (zh) 2022-03-31 2022-07-29 Alipay (Hangzhou) Information Technology Co., Ltd. Model pruning method, device and computer equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340227A (zh) * 2020-05-15 2020-06-26 Alipay (Hangzhou) Information Technology Co., Ltd. Method and device for compressing a business prediction model by means of a reinforcement learning model
CN111667069A (zh) * 2020-06-10 2020-09-15 Industrial and Commercial Bank of China Pre-trained model compression method, device and electronic equipment
CN112396179A (zh) * 2020-11-20 2021-02-23 Zhejiang University of Technology Flexible deep learning network model compression method based on channel gradient pruning
CN112949840A (zh) * 2021-04-20 2021-06-11 National University of Defense Technology Channel-attention-guided dynamic channel pruning method and device for convolutional neural networks
CN113569017A (zh) * 2021-01-28 2021-10-29 Tencent Technology (Shenzhen) Co., Ltd. Model processing method and device, electronic equipment and storage medium
CN113610215A (zh) * 2021-07-09 2021-11-05 Beijing Dajia Internet Information Technology Co., Ltd. Task processing network generation and task processing method and device, electronic equipment and storage medium
CN113988267A (zh) * 2021-11-03 2022-01-28 Ctrip Travel Information Technology (Shanghai) Co., Ltd. Generation method of a user intention recognition model, user intention recognition method and equipment
CN114037074A (zh) * 2021-11-09 2022-02-11 Beijing Baidu Netcom Science and Technology Co., Ltd. Model pruning method and device, electronic equipment and storage medium
CN114819140A (zh) * 2022-03-31 2022-07-29 Alipay (Hangzhou) Information Technology Co., Ltd. Model pruning method, device and computer equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10832123B2 (en) * 2016-08-12 2020-11-10 Xilinx Technology Beijing Limited Compression of deep neural networks with proper use of mask
US11200495B2 (en) * 2017-09-08 2021-12-14 Vivante Corporation Pruning and retraining method for a convolution neural network
US11423312B2 (en) * 2018-05-14 2022-08-23 Samsung Electronics Co., Ltd Method and apparatus for universal pruning and compression of deep convolutional neural networks under joint sparsity constraints
US20210256383A1 (en) * 2020-02-13 2021-08-19 Northeastern University Computer-implemented methods and systems for privacy-preserving deep neural network model compression
CN111488986B (zh) * 2020-04-13 2023-06-27 SenseTime Group Limited Model compression method, image processing method and device
CN111667068A (zh) * 2020-06-02 2020-09-15 Tsinghua University Mask-based deep graph convolutional neural network model pruning method and system
CN111539224B (zh) * 2020-06-25 2023-08-25 Beijing Baidu Netcom Science and Technology Co., Ltd. Pruning method and device for semantic understanding models, electronic equipment and storage medium
KR102256288B1 (ko) * 2021-03-19 2021-05-27 Rebellions Inc. Pruning-based training method and system for artificial neural network acceleration hardware

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340227A (zh) * 2020-05-15 2020-06-26 Alipay (Hangzhou) Information Technology Co., Ltd. Method and device for compressing a business prediction model by means of a reinforcement learning model
CN111667069A (zh) * 2020-06-10 2020-09-15 Industrial and Commercial Bank of China Pre-trained model compression method, device and electronic equipment
CN112396179A (zh) * 2020-11-20 2021-02-23 Zhejiang University of Technology Flexible deep learning network model compression method based on channel gradient pruning
CN113569017A (zh) * 2021-01-28 2021-10-29 Tencent Technology (Shenzhen) Co., Ltd. Model processing method and device, electronic equipment and storage medium
CN112949840A (zh) * 2021-04-20 2021-06-11 National University of Defense Technology Channel-attention-guided dynamic channel pruning method and device for convolutional neural networks
CN113610215A (zh) * 2021-07-09 2021-11-05 Beijing Dajia Internet Information Technology Co., Ltd. Task processing network generation and task processing method and device, electronic equipment and storage medium
CN113988267A (zh) * 2021-11-03 2022-01-28 Ctrip Travel Information Technology (Shanghai) Co., Ltd. Generation method of a user intention recognition model, user intention recognition method and equipment
CN114037074A (zh) * 2021-11-09 2022-02-11 Beijing Baidu Netcom Science and Technology Co., Ltd. Model pruning method and device, electronic equipment and storage medium
CN114819140A (zh) * 2022-03-31 2022-07-29 Alipay (Hangzhou) Information Technology Co., Ltd. Model pruning method, device and computer equipment

Also Published As

Publication number Publication date
CN114819140A (zh) 2022-07-29

Similar Documents

Publication Publication Date Title
EP3399460B1 (en) Captioning a region of an image
CN110084281B (zh) Image generation method, neural network compression method, and related device and equipment
EP3627397B1 (en) Processing method and apparatus
US11030997B2 (en) Slim embedding layers for recurrent neural language models
US9916531B1 (en) Accumulator constrained quantization of convolutional neural networks
CN112633419B (zh) Few-shot learning method and device, electronic equipment and storage medium
CN108734210B (zh) Object detection method based on cross-modal multi-scale feature fusion
CN109766557B (zh) Sentiment analysis method and device, storage medium and terminal equipment
WO2023134082A1 (zh) Training method and device for an image caption generation module, and electronic equipment
CN110569814A (zh) Video category recognition method and device, computer equipment and computer storage medium
CN112966754B (zh) Sample screening method, sample screening device and terminal equipment
Moya Rueda et al. Neuron pruning for compressing deep networks using maxout architectures
CN112733701A (zh) Robust scene recognition method and system based on capsule networks
WO2023185209A1 (zh) Model pruning
CN115878805A (zh) Sentiment analysis method and device, electronic equipment and storage medium
CN113421267B (zh) Point cloud semantic and instance joint segmentation method and system based on improved PointConv
van Spengler et al. Poincare resnet
EP3166022A1 (en) Method and apparatus for image search using sparsifying analysis operators
CN111126501A (zh) Image recognition method, terminal device and storage medium
CN114155388B (zh) Image recognition method and device, computer equipment and storage medium
WO2019076095A1 (zh) Processing method and device
CN112329925B (zh) Model generation method, feature extraction method, device and electronic equipment
CN113515920B (zh) Method for extracting formulas from tables, electronic device and computer-readable medium
US20220121926A1 (en) Tensor ring decomposition for neural networks
CN113365072B (zh) Feature map compression method and device, computing device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23777588

Country of ref document: EP

Kind code of ref document: A1