CN110909874A - Convolution operation optimization method and device of neural network model - Google Patents
Convolution operation optimization method and device of neural network model Download PDFInfo
- Publication number
- CN110909874A (application CN201911155114.2A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- sub
- feature map
- input feature
- channels
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a convolution operation optimization method and device for a neural network model, wherein the method comprises the following steps: segmenting the input feature map in the channel dimension to obtain sub-input feature maps for each group; performing a different group convolution operation on each sub-input feature map, extracting the feature information of the channels contained in each group, to obtain different sub-output feature maps; and shuffling and combining the sub-output feature maps of all the groups to obtain the output feature map. The invention reduces the operation time of the network model while preserving the model's effectiveness.
Description
Technical Field
The invention relates to the technical field of neural network models, in particular to a convolution operation optimization method and device of a neural network model.
Background
In recent years, with the rapid development of deep neural networks, academia and industry have seen major breakthroughs in deep learning across many fields. However, the size and computational cost of neural network models have become bottlenecks in practical applications, making such models difficult to deploy in scenarios with strict real-time requirements, such as the typical application scenario of online video quality detection.
Reducing the amount of computation in a neural network can effectively reduce its operation time. However, doing so may also reduce the expressive power of the model, thereby affecting its actual effectiveness.
Disclosure of Invention
The invention aims to provide a convolution operation optimization method and device for a neural network model that reduce the operation time of the network model while preserving the model's effectiveness.
The technical scheme provided by the invention is as follows:
A convolution operation optimization method for a neural network model, applied to a convolution operation that extracts feature information from an input feature map to obtain an output feature map, comprises the following steps: segmenting the input feature map in the channel dimension to obtain sub-input feature maps for each group; performing a different group convolution operation on each sub-input feature map, extracting the feature information of the channels contained in each group, to obtain different sub-output feature maps; and shuffling and combining the sub-output feature maps of all the groups to obtain the output feature map.
Further, segmenting the input feature map in the channel dimension to obtain sub-input feature maps for each group includes: segmenting the input feature map according to its number of channels to obtain a sub-input feature map for each channel.
Further, the group convolution operation includes: processing the information of each channel of the sub-input feature map separately using a plurality of depth separable convolutions, wherein each depth separable convolution is a convolution kernel with a channel count of 1, and the number of depth separable convolutions used equals the number of channels of the sub-input feature map; and combining the features of the different channels extracted by the depth separable convolutions through a 1 × 1 point convolution to obtain a sub-output feature map.
Further, the group convolution operation includes: raising the dimension of the sub-input feature map through a 1 × 1 point convolution to obtain a sub-input feature map with an increased number of channels; performing feature extraction on the sub-input feature map with the increased number of channels through a depth separable convolution to obtain feature information; and reducing the dimension of the feature information through a 1 × 1 point convolution to obtain a sub-output feature map, wherein the number of channels of the sub-output feature map equals that of the sub-input feature map.
Further, raising the dimension of the sub-input feature map through the 1 × 1 point convolution includes: raising the dimension through the 1 × 1 point convolution and applying a nonlinear activation function to the raised-dimension convolution result. The feature extraction through the depth separable convolution includes: first performing feature extraction through the depth separable convolution, then applying the nonlinear activation function. Reducing the dimension of the feature information through the 1 × 1 point convolution includes: reducing the dimension through the 1 × 1 point convolution and then applying a linear activation function.
Further, the nonlinear activation function is a ReLU6 activation function.
The invention also provides a convolution operation optimization device for a neural network model, applied to a convolution operation that extracts feature information from an input feature map to obtain an output feature map, comprising: a channel decomposition module for segmenting the input feature map in the channel dimension to obtain sub-input feature maps for each group; a group convolution module for performing a different group convolution operation on each sub-input feature map, extracting the feature information of the channels contained in each group, to obtain different sub-output feature maps; and a feature shuffling module for shuffling and combining the sub-output feature maps of all the groups to obtain the output feature map.
Further, the group convolution module includes: a single-channel feature extraction unit for processing the information of each channel of the sub-input feature map separately using a plurality of depth separable convolutions, wherein each depth separable convolution is a convolution kernel with a channel count of 1, and the number of depth separable convolutions used equals the number of channels of the sub-input feature map; and a feature merging unit for combining the features of the different channels extracted by the depth separable convolutions through a 1 × 1 point convolution to obtain a sub-output feature map.
Further, the group convolution module includes: a dimension-raising unit for raising the dimension of the sub-input feature map through a 1 × 1 point convolution to obtain a sub-input feature map with an increased number of channels; a feature extraction unit for performing feature extraction on the sub-input feature map with the increased number of channels through a depth separable convolution to obtain feature information; and a dimension-reduction unit for reducing the dimension of the feature information through a 1 × 1 point convolution to obtain a sub-output feature map, wherein the number of channels of the sub-output feature map equals that of the sub-input feature map.
Further, the group convolution module further includes: a first nonlinear activation unit for applying a nonlinear activation function to the raised-dimension convolution result; a second nonlinear activation unit for applying the nonlinear activation function after feature extraction through the depth separable convolution; and a linear activation unit for applying a linear activation function after the dimension of the feature information is reduced through the 1 × 1 point convolution.
The convolution operation optimization method and device of the neural network model provided by the invention can bring the following beneficial effects:
1. The invention optimizes the computational cost and parameter count of the convolution operation by performing channel decomposition of the feature map and channel decomposition of the convolution kernel, thereby optimizing the convolution operations of the neural network model.
2. The invention optimizes the convolution operations of the CNN while preserving the model's effectiveness, thereby reducing the computational cost and operation time of the network model and meeting real-time requirements in typical application scenarios.
Drawings
The above features, technical features, advantages and implementations of a method and apparatus for optimizing convolution operations of a neural network model will be further described in the following detailed description of preferred embodiments with reference to the accompanying drawings.
FIG. 1 is a flow diagram of one embodiment of a method for convolutional optimization of a neural network model of the present invention;
FIG. 2 is a flow diagram of another embodiment of a method for convolutional optimization of a neural network model of the present invention;
FIG. 3 is a flow diagram of another embodiment of a method for convolutional optimization of a neural network model of the present invention;
FIG. 4 is a schematic structural diagram of an embodiment of an apparatus for optimizing convolution operations of a neural network model according to the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for optimizing convolution operations of a neural network model according to another embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for optimizing convolution operations of a neural network model according to another embodiment of the present invention;
FIG. 7 is a block diagram of the Group-Incorporation structure of FIG. 1.
The reference numbers illustrate:
110: channel decomposition module; 120: group convolution module; 130: feature shuffling module; 121: single-channel feature extraction unit; 122: feature merging unit; 123: dimension-raising unit; 124: feature extraction unit; 125: dimension-reduction unit; 126: first nonlinear activation unit; 127: second nonlinear activation unit; 128: linear activation unit.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".
In an embodiment of the present invention, as shown in fig. 1, a convolution operation optimization method for a neural network model, applied to a convolution operation for extracting feature information from an input feature map to obtain an output feature map, includes:
step S100 is to segment the input feature map in the channel dimension to obtain sub-input feature maps of each group.
Specifically, the input feature map may be the original input image of a model (for example, one based on a convolutional neural network), or the input feature information of a convolutional layer within the model. The data volume of a feature map is expressed as H (height) × W (width) × C (number of channels), meaning data for C channels, each channel holding H × W values. Since the present invention concerns only the data volume of the various feature maps, not the specific feature data, the description below denotes a feature map simply by H × W × C.
Segmenting the input feature map in the channel dimension means grouping certain channels together. For example, if the input feature map is 27 × 27 × 16, its channels can be grouped 4 at a time into 4 (16/4) groups, each group containing the data of 4 channels and forming one sub-input feature map.
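The channel split described above can be sketched in a few lines of Python (an illustrative sketch, not part of the patent text; the feature map is modeled as a nested list indexed [channel][row][col], using the 27 × 27 × 16 example):

```python
def split_channels(feature_map, groups):
    """Split a feature map (a list of channels) into equal groups along the channel axis."""
    c = len(feature_map)
    assert c % groups == 0, "channel count must be divisible by the number of groups"
    size = c // groups
    return [feature_map[g * size:(g + 1) * size] for g in range(groups)]

# A 27x27x16 input feature map: 16 channels, each a 27x27 plane of data.
fmap = [[[0.0] * 27 for _ in range(27)] for _ in range(16)]
subs = split_channels(fmap, 4)  # 4 sub-input feature maps of 4 channels each
```

Each element of `subs` is one sub-input feature map and is fed to its own group convolution.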
The above process is referred to as channel decomposition of the input feature map.
Step S200 performs different group convolution operations on each sub-input feature map, and extracts feature information of channels included in each group to obtain different sub-output feature maps.
Specifically, the group convolution operation refers to a convolution operation for each group. The input of the group convolution operation is a sub-input feature map, and the output is a sub-output feature map, so as to extract feature information from the sub-input feature map, wherein the feature information forms the sub-output feature map.
The group convolution operation may use convolution kernels of the same size as a conventional convolution operation, but the number of kernels per group is only 1/g of the number of conventional kernels (the kernels used in a conventional convolution operation, referred to here as conventional convolution kernels), where g is the number of groups. The number of conventional convolution kernels equals the number of channels of the output feature map, while the number of kernels in each group convolution equals the number of output channels divided by the number of groups; this is also called channel decomposition of the convolution kernel.
In order to ensure that the input feature map can be uniformly segmented and the conventional convolution kernel can be uniformly decomposed during channel decomposition, the group number is usually a common divisor of the number of channels of the input feature map and the number of channels of the output feature map.
Computational efficiency is improved since each group convolution only needs to process feature maps of a few channels.
For example, suppose the input feature map is D_F × D_F × m, where D_F is the height and width of the input feature map and m is its number of channels, and the output feature map is D_F × D_F × n, where n is its number of channels. With a convolution kernel size of s × s, conventional convolution uses n kernels of size s × s × m. The computation for one kernel to convolve the entire input feature map is s × s × m × D_F × D_F, so the total computation for n kernels is s × s × m × D_F × D_F × n.
With group convolution, suppose the input data is divided into g groups. Each sub-input feature map is then D_F × D_F × (m/g) and the corresponding sub-output feature map is D_F × D_F × (n/g). Each group convolution uses (n/g) kernels of size s × s × (m/g), so the computation per group is s × s × (m/g) × D_F × D_F × (n/g). With g groups in total, the overall computation is s × s × (m/g) × D_F × D_F × (n/g) × g, which is 1/g of the conventional convolution.
In addition, group convolution also requires fewer parameters than conventional convolution, as the following analysis shows:
the parameters required for conventional convolution are m × s × s × n. Adopting grouping convolution, wherein the parameter required by each grouping convolution is (m/g) multiplied by s multiplied by (n/g); a total of g sets of convolutions, so the total parameter is (m/g) × s × s × (n/g) × g, reduced to the original 1/g.
Step S300 shuffles and combines the sub-output feature maps of all the groups to obtain an output feature map.
Specifically, after segmentation along the channel dimension, the features of the sub-output feature maps are relatively sparse. If the sub-output feature maps of all groups were simply concatenated to form the output feature map, information would remain isolated between the different groups and the expressive power would drop. Therefore, after the group convolutions, the feature maps of the different groups are mixed by channel shuffling, realizing feature interaction between the groups.
For example, in the Group-Incorporation structure shown in FIG. 7, taking an online video detection model as an example, the input video image has three channels (RGB), and the pixel values of the three channels form the input feature map; that is, the input feature map has 3 channels. Segmenting and grouping the input feature map by the three RGB channels yields 3 sub-input feature maps: sub-input feature map 1, sub-input feature map 2, and sub-input feature map 3. Each sub-input feature map contains the pixel values of only one channel.
A group convolution is performed on each sub-input feature map, yielding 3 sub-output feature maps, each containing the features of 3 channels (here 11 denotes the data of channel 1 of the first sub-output feature map, 12 channel 2 of the first, 13 channel 3 of the first, 21 channel 1 of the second, 31 channel 1 of the third, and so on).
The 3 sub-output feature maps are then channel-shuffled and combined, mixing the features of different channels to realize feature interaction between channels and obtain the output feature map. The data of the first 3 channels of the output feature map come from channel 1 of sub-output feature maps 1-3 respectively; the middle 3 channels and the last 3 channels are handled analogously.
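The shuffle in this example can be sketched as a pure index permutation (an illustrative sketch, not the patent's implementation; the channel labels follow the 11/21/31 notation above):

```python
def channel_shuffle(channels, groups):
    """Reorder a concatenated list of group outputs so the result interleaves
    one channel from each group (the usual reshape-transpose-flatten shuffle)."""
    per = len(channels) // groups
    # channel i of group g moves to position i * groups + g
    return [channels[g * per + i] for i in range(per) for g in range(groups)]

# 3 sub-output feature maps of 3 channels each, concatenated in group order:
concatenated = ["11", "12", "13", "21", "22", "23", "31", "32", "33"]
shuffled = channel_shuffle(concatenated, groups=3)
# → ["11", "21", "31", "12", "22", "32", "13", "23", "33"]
```

The first three output channels now come from channel 1 of sub-output feature maps 1-3, exactly as described above.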
In this embodiment, group convolution reduces the computational cost and parameter count of the convolution operation and thereby optimizes it, which helps the network model to be applied in scenarios with high real-time requirements.
In another embodiment of the present invention, as shown in fig. 2, a convolution operation optimization method for a neural network model includes:
step S100 is to segment the input feature map in the channel dimension to obtain sub-input feature maps of each group.
Step S200 performs different group convolution operations on each sub-input feature map, and extracts feature information of channels included in each group to obtain different sub-output feature maps.
The group convolution operation on each sub-input feature map specifically comprises the following steps:
step S210, respectively processing information of each channel of the sub-input feature map by using a plurality of depth-separable convolutions, where the depth-separable convolutions are convolution kernels with channels of 1, and the number of the depth-separable convolutions is equal to the number of channels of the sub-input feature map;
step S220 combines the features of the different channels extracted by the depth separable convolution through a 1 × 1 point convolution to obtain a sub-output feature map.
Specifically, a depth separable convolution is a convolution kernel with a channel count of 1; each kernel processes the features of only one channel of the sub-input feature map. Since each depth separable convolution learns the features of only one channel, the features of different channels are then combined through a 1 × 1 point convolution, so that cross-channel features can be learned.
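As an illustrative sketch under assumed toy sizes (not the patent's implementation), the depthwise-then-pointwise combination can be written directly in plain Python:

```python
def depthwise_conv(fmap, kernels):
    """Depthwise convolution: kernel k (channel count 1) is applied only to channel k."""
    out = []
    for ch, ker in zip(fmap, kernels):
        s = len(ker)
        h, w = len(ch), len(ch[0])
        plane = [[sum(ker[i][j] * ch[r + i][c + j] for i in range(s) for j in range(s))
                  for c in range(w - s + 1)] for r in range(h - s + 1)]
        out.append(plane)
    return out

def pointwise_conv(fmap, weights):
    """1x1 point convolution: each output channel is a weighted sum over input channels."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [[[sum(wk * fmap[k][r][c] for k, wk in enumerate(wrow))
              for c in range(w)] for r in range(h)] for wrow in weights]

# A 2-channel 4x4 sub-input feature map, 3x3 depthwise kernels, then a 1x1 combine.
fmap = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
dw = depthwise_conv(fmap, [[[1.0] * 3] * 3, [[0.5] * 3] * 3])
out = pointwise_conv(dw, [[1.0, 1.0]])  # one output channel mixing both inputs
```

With all-ones input, the first depthwise plane is 9.0 everywhere and the second is 4.5, so the combined output is 13.5 at each position, showing how the point convolution merges per-channel features.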
For example, if the sub-input feature map is D_F × D_F × m1, then m1 depth separable convolutions of size s × s × 1 are required to process the features of all its channels. The total computation of all the depth separable convolutions is s × s × m1 × D_F × D_F.
If the sub-output feature map has n1 channels, n1 point convolutions of size 1 × 1 × m1 are needed, with a total computation of m1 × D_F × D_F × n1. The computation of this combination is therefore s × s × m1 × D_F × D_F + m1 × D_F × D_F × n1. Relative to the conventional convolution computation (s × s × m1 × D_F × D_F × n1), the cost of this combination is reduced to:
(s × s × m1 × D_F × D_F + m1 × D_F × D_F × n1) / (s × s × m1 × D_F × D_F × n1) = 1/n1 + 1/s².
furthermore, the depth separable convolution also enables compression of the model through a reduction in the number of parameters. Continuing with the previous example, if conventional convolution is used, the required parameters are m1 × s × s × n 1; and the number of parameters for the depth separable convolution is s × s × m1+ n1 × m1, the compression ratio of the parameters is:
step S300 shuffles and combines the sub-output feature maps of all the groups to obtain an output feature map.
In this embodiment, on the basis of group convolution, the group convolution operation is implemented with depth separable convolutions and 1 × 1 point convolutions, further reducing the computational cost and parameter count of the convolution operation and optimizing it.
In another embodiment of the present invention, as shown in fig. 3, a convolution operation optimization method for a neural network model includes:
step S110, the input feature diagram is segmented according to the number of the channels of the input feature diagram, and a sub-input feature diagram of each channel is obtained.
Specifically, taking the online video detection model as an example, the input video image has three channels of RGB, the pixel values of each channel form a two-dimensional matrix, and the pixel values of the three channels form an input feature map, in other words, the number of channels of the input feature map is 3. And segmenting and grouping the input feature map according to three channels of RGB to obtain 3 sub-input feature maps, wherein each sub-input feature map only comprises a pixel value of one channel.
Step S200 performs different group convolution operations on each sub-input feature map, and extracts feature information of channels included in each group to obtain different sub-output feature maps.
The group convolution operation on each sub-input feature map specifically comprises the following steps:
step S230 performs dimension increase on the sub-input feature map by 1 × 1 point convolution, and performs nonlinear operation on the dimension-increased convolution result by using a nonlinear activation function to obtain a sub-input feature map with an increased number of channels.
Step S240, firstly, performing feature extraction on the sub-input feature map with the increased number of channels through depth separable convolution, and then performing nonlinear operation through the nonlinear activation function to obtain feature information;
step S250, performing dimension reduction on the feature information through 1 × 1 point convolution, and performing activation processing through a linear activation function to obtain a sub-output feature map, where the number of channels of the sub-output feature map is equal to the number of channels of the sub-input feature map.
Specifically, the sub-input feature map contains the pixel values of only one channel, so its channel count is already small; the 1 × 1 point-by-point convolution therefore raises the dimension, increasing the number of channels of the sub-input feature map. After the point convolution, a nonlinear activation function is applied to increase the nonlinear expressive power.
A 3 × 3 depth separable convolution is then used to extract features, greatly reducing the computation in the network. After the depth separable convolution, the nonlinear activation function is applied again to increase the nonlinear expressive power.
Optionally, the nonlinear activation function is a ReLU6 activation function. The ReLU6 is a special ReLU function that limits the maximum output to 6, suitable for handling high-dimensional feature inputs.
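A minimal sketch of the two activation functions mentioned here (illustrative only, not the patent's code):

```python
def relu6(x):
    """ReLU6: standard ReLU with the output additionally clamped at 6."""
    return min(max(0.0, x), 6.0)

def linear_activation(x):
    """Identity activation, as used after the final 1x1 projection (step S250)."""
    return x

assert relu6(-2.0) == 0.0   # negative inputs are zeroed, like plain ReLU
assert relu6(3.5) == 3.5    # values in [0, 6] pass through unchanged
assert relu6(7.0) == 6.0    # large activations are capped at 6
```

The cap at 6 bounds the activation range, which is the property the text relies on when contrasting ReLU6 with the linear activation used after the final projection.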
Finally, a 1 × 1 point-by-point convolution performs dimension reduction. This restores the channel count of the feature map and also provides a degree of channel shuffling, mixing the feature maps extracted by the depth separable convolution across channels and improving expressive power. Since the dimensionality of the features after this last 1 × 1 point convolution is already minimal, applying ReLU6 again would destroy features and cause a large loss of information; a linear activation function is therefore used after the final 1 × 1 point convolution.
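The channel counts through steps S230-S250 can be traced as follows (a sketch; the expansion factor of 6 is an assumption for illustration, since the patent does not fix one):

```python
def bottleneck_channels(in_ch, expansion):
    """Trace channel counts through the expand / depthwise / project steps."""
    expanded = in_ch * expansion  # step S230: 1x1 point convolution raises dimension
    after_dw = expanded           # step S240: depthwise convolution keeps channel count
    projected = in_ch             # step S250: 1x1 point convolution restores channel count
    return [in_ch, expanded, after_dw, projected]

trace = bottleneck_channels(in_ch=1, expansion=6)  # single-channel sub-input map
assert trace[0] == trace[-1]  # output channels equal input channels, per step S250
```

The equal input and output channel counts are what let the sub-output feature maps be shuffled and recombined into an output feature map of the expected size.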
Step S300 shuffles and combines the sub-output feature maps of all the groups to obtain an output feature map.
The method of this embodiment was applied to a corrupted-video detection model; the model was trained iteratively and, after convergence, the VFD-SmartNet network was obtained. Its performance was compared against the AlexNet, VGG16, ResNet18, ResNet34, ResNet-like, and DenseNet-like models, as follows:
the data in the table show that the operation speed is related to the parameter quantity, the parameter quantity of the VFD-SmartNet model is obviously reduced, the operation speed is obviously improved, the aim of network acceleration is well achieved, the recall ratio and the precision ratio are also maintained at higher levels, and the model can improve the operation speed on the premise of keeping the accuracy of the model so as to guarantee the requirement of real-time performance.
In another embodiment of the present invention, as shown in fig. 4, an apparatus for optimizing convolution operations of a neural network model includes:
and the channel decomposition module 110 is configured to segment the input feature map in a channel dimension to obtain sub-input feature maps of each group.
Specifically, the input feature map may be the original input image of a model (for example, one based on a convolutional neural network), or the input feature information of a particular convolutional layer in the model. The data volume of a feature map is expressed as H (height) × W (width) × C (number of channels), meaning data for C channels, each channel holding H × W values.
Splitting the input feature map in the channel dimension means grouping certain channels together. For example, if the input feature map is 27 × 27 × 16, grouping its 16 channels four at a time yields 4 (16/4) groups; the data of the 4 channels in each group forms a sub-input feature map.
The above process is referred to as channel decomposition of the input feature map.
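Channel decomposition itself is just a slice along the channel axis. A minimal pure-Python sketch, using nested lists and illustrative sizes only:

```python
def channel_split(fmap, groups):
    """Split an H x W x C feature map (nested lists, indexed
    [row][column][channel]) into `groups` sub-maps of C/groups channels each."""
    channels = len(fmap[0][0])
    assert channels % groups == 0, "group count must divide the channel count"
    step = channels // groups
    return [[[px[g * step:(g + 1) * step] for px in row] for row in fmap]
            for g in range(groups)]

# A 1 x 2 feature map with 16 channels, split into 4 groups of 4 channels each
fmap = [[list(range(16)), list(range(16, 32))]]
subs = channel_split(fmap, 4)
```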
And the grouping convolution module 120 is configured to perform different grouping convolution operations on each sub-input feature map, and extract feature information of channels included in each group to obtain different sub-output feature maps.
Specifically, a group convolution operation is a convolution performed on a single group. Its input is a sub-input feature map and its output is a sub-output feature map: feature information is extracted from the sub-input feature map, and that information forms the sub-output feature map.
A group convolution may use convolution kernels of the same size as a conventional convolution, but the number of kernels is only 1/g of the number of conventional kernels (the kernels used in a conventional convolution are here called conventional convolution kernels). The number of conventional kernels equals the number of channels of the output feature map, while the number of kernels in each group convolution equals the number of output channels divided by the number of groups; this is also called channel decomposition of the convolution kernels.
In order to ensure that the input feature map can be uniformly segmented and the conventional convolution kernel can be uniformly decomposed during channel decomposition, the group number is usually a common divisor of the number of channels of the input feature map and the number of channels of the output feature map.
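Under this constraint, the admissible group counts are exactly the divisors of gcd(m, n). A small helper (the concrete channel counts below are illustrative, not from the patent):

```python
import math

def valid_group_counts(in_channels, out_channels):
    """Group counts that evenly divide both the input channel count m and
    the output channel count n, i.e. the divisors of gcd(m, n)."""
    g_max = math.gcd(in_channels, out_channels)
    return [g for g in range(1, g_max + 1) if g_max % g == 0]

print(valid_group_counts(16, 24))  # divisors of gcd(16, 24) = 8: [1, 2, 4, 8]
```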
Computational efficiency is improved since each group convolution only needs to process feature maps of a few channels.
For example, let the input feature map be D_F × D_F × m, where D_F is the height and width of the input feature map and m is its number of channels, and let the output feature map be D_F × D_F × n, where n is its number of channels. With a convolution kernel size of s × s, conventional convolution uses n kernels of size s × s × m. The amount of computation for each kernel to convolve the input feature map is s × s × m × D_F × D_F, so the total for n kernels is s × s × m × D_F × D_F × n.
With group convolution, suppose the input data is divided into g groups. Each sub-input feature map is then D_F × D_F × (m/g) and the corresponding sub-output feature map is D_F × D_F × (n/g). Each group convolution uses (n/g) kernels of size s × s × (m/g), so its computation is s × s × (m/g) × D_F × D_F × (n/g). With g group convolutions in total, the overall computation is s × s × (m/g) × D_F × D_F × (n/g) × g, which is 1/g of the conventional convolution.
In addition, compared with conventional convolution, group convolution also reduces the number of parameters needed for the convolution operation, as the following analysis shows:
A conventional convolution requires m × s × s × n parameters. With group convolution, each group requires (m/g) × s × s × (n/g) parameters; with g groups in total, the overall parameter count is (m/g) × s × s × (n/g) × g, i.e. 1/g of the original.
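Both the 1/g computation claim and the 1/g parameter claim can be checked with a few lines of arithmetic (the concrete sizes below are illustrative, not from the patent):

```python
def conv_macs(df, m, n, s, g=1):
    """Multiply-accumulate count of g group convolutions, each mapping a
    df x df x (m/g) sub-input to a df x df x (n/g) sub-output with s x s kernels."""
    return s * s * (m // g) * df * df * (n // g) * g

def conv_params(m, n, s, g=1):
    """Weight count: each of the g groups needs (m/g) x s x s x (n/g) parameters."""
    return (m // g) * s * s * (n // g) * g

df, m, n, s, g = 28, 16, 32, 3, 4
assert conv_macs(df, m, n, s, g) * g == conv_macs(df, m, n, s)  # computation: 1/g
assert conv_params(m, n, s, g) * g == conv_params(m, n, s)      # parameters:  1/g
```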
And the feature shuffling module 130 is used for shuffling and combining the sub-output feature maps of all the groups to obtain the output feature map.
Specifically, after segmentation along the channel dimension, the features of each sub-output feature map are relatively sparse. If the grouped sub-output feature maps were simply concatenated to form the output feature map, information would be isolated between the feature maps of different groups and the information expressive capacity would drop. Therefore, after group convolution, the feature maps of different groups are mixed by channel shuffling, realizing feature interaction between the groups.
For example, in the Group-Inception structure shown in fig. 7, consider an online video detection model. The input video image has three RGB channels whose pixel values form the input feature map, i.e. the input feature map has 3 channels. Splitting and grouping the input feature map by the three RGB channels yields 3 sub-input feature maps: sub-input feature map 1, sub-input feature map 2, and sub-input feature map 3. Each sub-input feature map contains the pixel values of only one channel.
Group convolution is then performed on each sub-input feature map separately, yielding 3 sub-output feature maps, each containing features of 3 channels (11 denotes the data of channel 1 of the first sub-output feature map, 12 channel 2 of the first, 13 channel 3 of the first, 21 channel 1 of the second, 31 channel 1 of the third, and so on).
The 3 sub-output feature maps are channel-shuffled and combined, mixing the features of different channels and realizing feature interaction among them, to obtain the output feature map. The first 3 channels of the output feature map are composed of channel 1 of sub-output feature maps 1-3 respectively; the middle 3 channels and the last 3 channels are handled in the same way.
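Channel shuffling, as described above, interleaves one channel from each group in turn. A pure-Python sketch using the same 11/21/31 labels as fig. 7 (one-pixel maps for brevity):

```python
def channel_shuffle(sub_outputs):
    """Combine g sub-output maps (each H x W x c, nested lists) so that the
    output channel order is: ch1 of every map, then ch2 of every map, ..."""
    g = len(sub_outputs)
    height, width = len(sub_outputs[0]), len(sub_outputs[0][0])
    c = len(sub_outputs[0][0][0])
    return [[[sub_outputs[k][y][x][j] for j in range(c) for k in range(g)]
             for x in range(width)] for y in range(height)]

# One-pixel sub-output maps; the value "kj" marks channel j of sub-output map k
sub1 = [[[11, 12, 13]]]
sub2 = [[[21, 22, 23]]]
sub3 = [[[31, 32, 33]]]
out = channel_shuffle([sub1, sub2, sub3])
print(out[0][0])  # [11, 21, 31, 12, 22, 32, 13, 23, 33]
```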
In this embodiment, group convolution reduces the computation and parameter counts of the convolution operation and thereby optimizes it, which makes the network model well suited to scenarios with strict real-time requirements.
In another embodiment of the present invention, as shown in fig. 5, an apparatus for optimizing convolution operations of a neural network model includes:
and the channel decomposition module 110 is configured to segment the input feature map in a channel dimension to obtain sub-input feature maps of each group.
And the grouping convolution module 120 is configured to perform different grouping convolution operations on each sub-input feature map, and extract feature information of channels included in each group to obtain different sub-output feature maps.
Wherein, the packet convolution module 120 includes:
a single-channel feature extraction unit 121, configured to separately process information of each channel of the sub-input feature map by using a plurality of depth-separable convolutions, where the depth-separable convolutions are convolution kernels with a channel 1, and the number of depth-separable convolutions used is equal to the number of channels of the sub-input feature map;
and the feature merging unit 122 is configured to merge the features of the different channels extracted by the depth separable convolution through 1 × 1 point convolution to obtain a sub-output feature map.
Specifically, a depth separable convolution uses kernels whose channel number is 1, each kernel processing the features of only one channel of the sub-input feature map. Because each depth separable kernel learns the features of only one channel, the features of different channels are subsequently merged by 1 × 1 point convolution so that cross-channel features can be learned.
For example, if the sub-input feature map is D_F × D_F × m1, then m1 depth separable kernels of size s × s × 1 are required to process the features of all its channels. The computation for all the depth separable convolutions is s × s × m1 × D_F × D_F.
If the sub-output feature map has n1 channels, n1 point convolutions of size 1 × 1 × m1 are needed, with a total point-convolution computation of m1 × D_F × D_F × n1. The computation of this combination is therefore s × s × m1 × D_F × D_F + m1 × D_F × D_F × n1. Relative to the computation of the conventional convolution method (s × s × m1 × D_F × D_F × n1), the computational cost of this combination is reduced to 1/n1 + 1/s².
Furthermore, the depth separable convolution also compresses the model by reducing the parameter count. Continuing the example: conventional convolution requires m1 × s × s × n1 parameters, while the depth separable combination needs only s × s × m1 + n1 × m1, so the parameter compression ratio is likewise 1/n1 + 1/s².
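The reduction ratio implied by the two computation counts above works out to 1/n1 + 1/s², and the parameter compression ratio is the same expression. A quick numerical check (sizes are illustrative only):

```python
def macs_conventional(df, m1, n1, s):
    return s * s * m1 * df * df * n1

def macs_separable(df, m1, n1, s):
    # depthwise cost + 1x1 pointwise cost, as given in the text
    return s * s * m1 * df * df + m1 * df * df * n1

def cost_ratio(s, n1):
    """Closed-form ratio 1/n1 + 1/s^2 shared by computation and parameters."""
    return 1.0 / n1 + 1.0 / (s * s)

df, m1, n1, s = 28, 16, 32, 3
r = macs_separable(df, m1, n1, s) / macs_conventional(df, m1, n1, s)
assert abs(r - cost_ratio(s, n1)) < 1e-12
p = (s * s * m1 + n1 * m1) / (m1 * s * s * n1)   # parameter ratio
assert abs(p - cost_ratio(s, n1)) < 1e-12
```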
and the feature shuffling module 130 is used for shuffling and combining the sub-output feature maps of all the groups to obtain the output feature map.
In this embodiment, on the basis of group convolution, each group convolution operation is implemented with a depth separable convolution followed by a 1 × 1 point convolution, which further reduces the computation and parameter counts and optimizes the convolution operation.
In another embodiment of the present invention, as shown in fig. 4 and 6, an apparatus for optimizing convolution operations of a neural network model includes:
and the channel decomposition module 110 is configured to split the input feature map according to the number of channels of the input feature map, so as to obtain a sub-input feature map of each channel.
Specifically, taking the online video detection model as an example, the input video image has three RGB channels; the pixel values of each channel form a two-dimensional matrix, and the pixel values of the three channels together form the input feature map. In other words, the input feature map has 3 channels. Splitting and grouping the input feature map by the three RGB channels yields 3 sub-input feature maps, each containing the pixel values of only one channel.
And the grouping convolution module 120 is configured to perform different grouping convolution operations on each sub-input feature map, and extract feature information of channels included in each group to obtain different sub-output feature maps.
Wherein, the packet convolution module 120 includes:
a dimension raising unit 123, configured to raise the dimension of the sub-input feature map through 1 × 1 point convolution;
the first nonlinear activation unit 126 is configured to perform nonlinear operation on the convolution result after the dimension increase by using a nonlinear activation function to obtain a sub-input feature map with an increased channel number;
a feature extraction unit 124, configured to perform feature extraction on the sub-input feature map with the increased number of channels through depth separable convolution;
a second nonlinear activation unit 127, configured to perform a nonlinear operation through the nonlinear activation function after performing feature extraction through depth separable convolution, so as to obtain feature information;
a dimension reduction unit 125, configured to perform dimension reduction on the feature information through a 1 × 1 point convolution;
a linear activation unit 128, configured to perform dimension reduction on the feature information through 1 × 1 point convolution, and then perform activation processing through a linear activation function to obtain a sub-output feature map, where the number of channels of the sub-output feature map is equal to the number of channels of the sub-input feature map.
Specifically, the sub-input feature map contains the pixel values of only one channel, so its channel count is already small. A 1 × 1 point-by-point convolution is therefore applied first to raise the dimension and increase the number of channels of the sub-input feature map. After the point convolution, a nonlinear activation function is applied to increase the nonlinear expressive capacity.
A 3 × 3 depth separable convolution is then used to extract features, which greatly reduces the amount of computation in the network. After the depth separable convolution, a nonlinear activation function is again applied to increase the nonlinear expressive capacity.
Optionally, the nonlinear activation function is the ReLU6 activation function. ReLU6 is a variant of ReLU that caps the output at 6 and is well suited to high-dimensional feature inputs.
Finally, a 1 × 1 point-by-point convolution performs dimension reduction. On the one hand this restores the channel count of the feature map; on the other hand it also has a certain channel-shuffling effect, mixing the features extracted by the depth separable convolution across channels and improving the information expressive capacity. The feature dimension after this last 1 × 1 point convolution is already minimal; applying ReLU6 here would destroy features and cause a large loss of information, so a linear activation function is used after the final 1 × 1 point convolution.
And the feature shuffling module 130 is configured to shuffle and combine the sub-output feature maps of all the groups to obtain the output feature map.
The method of this embodiment was applied to a damaged-video detection model; the model was trained iteratively, and the VFD-SmartNet network was obtained after convergence. Its performance was compared with the AlexNet, VGG16, ResNet18, ResNet34, ResNet-like, and DenseNet-like models, as follows:
As can be seen from the data in the table, the operation speed of the VFD-SmartNet model is significantly improved, which achieves the goal of network acceleration while keeping recall and precision at high levels. The model therefore improves operation speed without sacrificing accuracy, meeting real-time requirements.
It should be noted that the above embodiments can be freely combined as required. The foregoing is only a preferred embodiment of the present invention; for those skilled in the art, various modifications and refinements can be made without departing from the principle of the present invention, and these should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A convolution operation optimization method of a neural network model is applied to convolution operation for extracting characteristic information from an input characteristic diagram and obtaining an output characteristic diagram, and is characterized by comprising the following steps:
segmenting the input feature map in the channel dimension to obtain sub-input feature maps of each group;
respectively carrying out different group convolution operations on each sub-input feature map, and extracting feature information of channels contained in each group to obtain different sub-output feature maps;
and shuffling and combining the sub-output characteristic graphs of all the groups to obtain the output characteristic graph.
2. The convolution optimization method of claim 1, wherein the slicing the input feature map in channel dimensions to obtain sub-input feature maps for each group comprises:
and segmenting the input feature diagram according to the number of the channels of the input feature diagram to obtain a sub-input feature diagram of each channel.
3. The convolution optimization method of claim 1, wherein the set of convolution operations comprises:
processing information of each channel of the sub-input feature map respectively by adopting a plurality of depth separable convolutions, wherein the depth separable convolutions are convolution kernels with channels being 1, and the number of the adopted depth separable convolutions is equal to the number of the channels of the sub-input feature map;
and combining the features of the different channels extracted by the depth separable convolution through 1 x 1 point convolution to obtain a sub-output feature map.
4. The convolution optimization method of claim 1, wherein the set of convolution operations comprises:
performing dimension increasing on the sub-input feature map through 1 × 1 point convolution to obtain a sub-input feature map with increased channel number;
performing feature extraction on the sub-input feature map with the increased number of channels through depth separable convolution to obtain feature information;
and performing dimensionality reduction on the feature information through 1 × 1 point convolution to obtain a sub-output feature map, wherein the number of channels of the sub-output feature map is equal to that of channels of the sub-input feature map.
5. The convolution optimization method of claim 4, wherein:
the step of performing dimensionality raising on the sub-input feature map through 1 × 1 point convolution comprises the following steps:
performing dimension increasing on the sub-input feature map through 1 × 1 point convolution, and performing nonlinear operation on a convolution result after dimension increasing by adopting a nonlinear activation function;
the feature extraction by depth separable convolution comprises:
firstly, carrying out feature extraction through depth separable convolution, and then carrying out nonlinear operation through the nonlinear activation function;
the dimension reduction of the characteristic information through 1 × 1 point convolution includes:
and reducing the dimension of the characteristic information through 1 multiplied by 1 point convolution, and performing activation processing through a linear activation function.
6. The convolution optimization method of claim 5, wherein:
the nonlinear activation function is a ReLU6 activation function.
7. A convolution operation optimization device of a neural network model is applied to convolution operation for extracting characteristic information from an input characteristic diagram and obtaining an output characteristic diagram, and is characterized by comprising the following steps:
the channel decomposition module is used for segmenting the input feature map in channel dimensions to obtain sub-input feature maps of each group;
the grouping convolution module is used for respectively carrying out different grouping convolution operations on each sub-input feature map, extracting feature information of channels contained in each group and obtaining different sub-output feature maps;
and the characteristic shuffling module is used for shuffling and combining the sub-output characteristic graphs of all the groups to obtain the output characteristic graph.
8. The convolution optimization device of claim 7, wherein the packet convolution module comprises:
a single-channel feature extraction unit, configured to separately process information of each channel of the sub-input feature map by using a plurality of depth-separable convolutions, where the depth-separable convolutions are convolution kernels with a channel of 1, and the number of the depth-separable convolutions used is equal to the number of channels of the sub-input feature map;
and the feature merging unit is used for merging the features of the different channels extracted by the depth separable convolution through 1 x 1 point convolution to obtain a sub-output feature map.
9. The convolution optimization device of claim 7, wherein the packet convolution module comprises:
the dimension increasing unit is used for increasing the dimension of the sub-input feature map through 1 multiplied by 1 point convolution to obtain the sub-input feature map with increased channel number;
the characteristic extraction unit is used for extracting the characteristics of the sub-input characteristic graphs with the increased channel number through depth separable convolution to obtain characteristic information;
and the dimension reduction unit is used for performing dimension reduction on the feature information through 1 multiplied by 1 point convolution to obtain a sub-output feature map, wherein the number of channels of the sub-output feature map is equal to that of the channels of the sub-input feature map.
10. The convolution optimization device of claim 9, wherein the packet convolution module further comprises:
the first nonlinear activation unit is used for carrying out nonlinear operation on the convolution result after the dimension is increased by adopting a nonlinear activation function;
a second nonlinear activation unit for performing a nonlinear operation by a nonlinear activation function after feature extraction by depth separable convolution;
and the linear activation unit is used for performing activation processing through a linear activation function after dimension reduction is performed on the characteristic information through 1 × 1 point convolution.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911155114.2A CN110909874A (en) | 2019-11-22 | 2019-11-22 | Convolution operation optimization method and device of neural network model |
CN202310411791.6A CN116416561A (en) | 2019-11-22 | 2019-11-22 | Video image processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911155114.2A CN110909874A (en) | 2019-11-22 | 2019-11-22 | Convolution operation optimization method and device of neural network model |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310411791.6A Division CN116416561A (en) | 2019-11-22 | 2019-11-22 | Video image processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110909874A true CN110909874A (en) | 2020-03-24 |
Family
ID=69818785
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310411791.6A Pending CN116416561A (en) | 2019-11-22 | 2019-11-22 | Video image processing method and device |
CN201911155114.2A Pending CN110909874A (en) | 2019-11-22 | 2019-11-22 | Convolution operation optimization method and device of neural network model |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310411791.6A Pending CN116416561A (en) | 2019-11-22 | 2019-11-22 | Video image processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN116416561A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111445012A (en) * | 2020-04-28 | 2020-07-24 | 南京大学 | FPGA-based packet convolution hardware accelerator and method thereof |
CN111445012B (en) * | 2020-04-28 | 2023-04-18 | 南京大学 | FPGA-based packet convolution hardware accelerator and method thereof |
CN111445019A (en) * | 2020-04-30 | 2020-07-24 | 南京大学 | Device and method for realizing channel shuffling operation in packet convolution |
CN111738424A (en) * | 2020-06-29 | 2020-10-02 | 湖南国科微电子股份有限公司 | Neural network processing method, neural network processing device, electronic equipment and storage medium |
CN111738424B (en) * | 2020-06-29 | 2023-12-26 | 湖南国科微电子股份有限公司 | Neural network processing method and device, electronic equipment and storage medium |
CN112288028A (en) * | 2020-11-06 | 2021-01-29 | 神思电子技术股份有限公司 | Image identification method based on stream convolution |
CN112418401A (en) * | 2020-11-20 | 2021-02-26 | 中山大学 | Local distributed image identification method for terminal application |
CN112363844A (en) * | 2021-01-12 | 2021-02-12 | 之江实验室 | Convolutional neural network vertical segmentation method for image processing |
CN113313056A (en) * | 2021-06-16 | 2021-08-27 | 中国科学技术大学 | Compact 3D convolution-based lip language identification method, system, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN116416561A (en) | 2023-07-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110909874A (en) | Convolution operation optimization method and device of neural network model | |
CN107944556B (en) | Deep neural network compression method based on block item tensor decomposition | |
CN109934331B (en) | Apparatus and method for performing artificial neural network forward operations | |
CN107340993B (en) | Arithmetic device and method | |
CN109886391B (en) | Neural network compression method based on space forward and backward diagonal convolution | |
CN110781912A (en) | Image classification method based on channel expansion inverse convolution neural network | |
CN111210432A (en) | Image semantic segmentation method based on multi-scale and multi-level attention mechanism | |
CN111882053B (en) | Neural network model compression method based on splicing convolution | |
CN113111889A (en) | Target detection network processing method for edge computing terminal | |
CN110728354B (en) | Image processing method based on improved sliding type grouping convolution neural network | |
CN110782001B (en) | Improved method for using shared convolution kernel based on group convolution neural network | |
TWI738048B (en) | Arithmetic framework system and method for operating floating-to-fixed arithmetic framework | |
CN114882278A (en) | Tire pattern classification method and device based on attention mechanism and transfer learning | |
Fuketa et al. | Image-classifier deep convolutional neural network training by 9-bit dedicated hardware to realize validation accuracy and energy efficiency superior to the half precision floating point format | |
Qi et al. | Learning low resource consumption cnn through pruning and quantization | |
US20230143985A1 (en) | Data feature extraction method and related apparatus | |
CN116434039A (en) | Target detection method based on multiscale split attention mechanism | |
CN112418388A (en) | Method and device for realizing deep convolutional neural network processing | |
Nguyen et al. | Development of an object recognition algorithm based on neural networks with using a hierarchical classifier | |
CN115375922A (en) | Lightweight significance detection method based on multi-scale space attention | |
CN114492631A (en) | Spatial attention calculation method based on channel attention | |
CN113554104A (en) | Image classification method based on deep learning model | |
CN113313253A (en) | Neural network compression method, data processing device and computer equipment | |
CN110807479A (en) | Neural network convolution calculation acceleration method based on Kmeans algorithm | |
CN110378466A (en) | Quantization method and system based on neural network difference |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200324 |
|
RJ01 | Rejection of invention patent application after publication |