CN110909874A - Convolution operation optimization method and device of neural network model - Google Patents

Convolution operation optimization method and device of neural network model

Info

Publication number
CN110909874A
Authority
CN
China
Prior art keywords
convolution
sub
feature map
input feature
channels
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911155114.2A
Other languages
Chinese (zh)
Inventor
杜渂
邱祥平
陈春东
雷霆
彭明喜
周赵云
陈健
王聚全
杨博
刘冉东
王月
王孟轩
张胜
韩国令
和传志
曹若麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Di'aisi Information Technology Ltd By Share Ltd
Original Assignee
Di'aisi Information Technology Ltd By Share Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Di'aisi Information Technology Ltd By Share Ltd filed Critical Di'aisi Information Technology Ltd By Share Ltd
Priority to CN201911155114.2A priority Critical patent/CN110909874A/en
Priority to CN202310411791.6A priority patent/CN116416561A/en
Publication of CN110909874A publication Critical patent/CN110909874A/en
Pending legal-status Critical Current

Classifications

    • G06V20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06N3/0464: Computing arrangements based on biological models; neural networks; convolutional networks [CNN, ConvNet]
    • G06N3/048: Computing arrangements based on biological models; neural networks; activation functions
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • Y02D10/00: Climate change mitigation technologies in information and communication technologies; energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a convolution operation optimization method and device for a neural network model. The method comprises the following steps: splitting the input feature map along the channel dimension to obtain the sub-input feature maps of each group; performing a different group convolution operation on each sub-input feature map, extracting the feature information of the channels contained in each group to obtain different sub-output feature maps; and shuffling and combining the sub-output feature maps of all groups to obtain the output feature map. The invention reduces the running time of the network model while preserving the model's effectiveness.

Description

Convolution operation optimization method and device of neural network model
Technical Field
The invention relates to the technical field of neural network models, in particular to a convolution operation optimization method and device of a neural network model.
Background
In recent years, with the rapid development of deep neural networks, academia and industry have achieved major breakthroughs in deep learning across many fields. However, the size and computational cost of neural network models have become bottlenecks in practical applications, making such models difficult to deploy in scenarios with strict real-time requirements, such as the typical application of online video quality detection.
Reducing the amount of computation in a neural network can effectively reduce its running time. However, doing so may weaken the model's expressive power and thereby degrade its practical effectiveness.
Disclosure of Invention
The invention aims to provide a convolution operation optimization method and device for a neural network model that reduce the running time of the network model while preserving its effectiveness.
The technical scheme provided by the invention is as follows:
A convolution operation optimization method of a neural network model is applied to a convolution operation that extracts feature information from an input feature map to obtain an output feature map, and comprises the following steps: splitting the input feature map along the channel dimension to obtain the sub-input feature maps of each group; performing a different group convolution operation on each sub-input feature map, extracting the feature information of the channels contained in each group to obtain different sub-output feature maps; and shuffling and combining the sub-output feature maps of all groups to obtain the output feature map.
Further, the splitting of the input feature map in the channel dimension to obtain the sub-input feature maps of each group comprises: splitting the input feature map according to its number of channels to obtain a sub-input feature map for each channel.
Further, the group convolution operation comprises: processing the information of each channel of the sub-input feature map separately with a plurality of depthwise separable convolutions, where each depthwise separable convolution is a convolution kernel with a channel count of 1 and the number of depthwise separable convolutions used equals the number of channels of the sub-input feature map; and combining the features of the different channels extracted by the depthwise separable convolutions through a 1 × 1 point convolution to obtain a sub-output feature map.
Further, the group convolution operation comprises: raising the dimension of the sub-input feature map through a 1 × 1 point convolution to obtain a sub-input feature map with an increased number of channels; performing feature extraction on the sub-input feature map with the increased number of channels through a depthwise separable convolution to obtain feature information; and reducing the dimension of the feature information through a 1 × 1 point convolution to obtain a sub-output feature map, where the number of channels of the sub-output feature map equals that of the sub-input feature map.
Further, the raising of the dimension of the sub-input feature map through a 1 × 1 point convolution comprises: raising the dimension through a 1 × 1 point convolution and applying a nonlinear activation function to the dimension-raised convolution result. The feature extraction through a depthwise separable convolution comprises: first extracting features through the depthwise separable convolution, then applying the nonlinear activation function. The reducing of the dimension of the feature information through a 1 × 1 point convolution comprises: reducing the dimension through a 1 × 1 point convolution and applying a linear activation function.
Further, the nonlinear activation function is a ReLU6 activation function.
The invention also provides a convolution operation optimization device of a neural network model, applied to a convolution operation that extracts feature information from an input feature map to obtain an output feature map, comprising: a channel decomposition module for splitting the input feature map in the channel dimension to obtain the sub-input feature maps of each group; a group convolution module for performing a different group convolution operation on each sub-input feature map and extracting the feature information of the channels contained in each group to obtain different sub-output feature maps; and a feature shuffling module for shuffling and combining the sub-output feature maps of all groups to obtain the output feature map.
Further, the group convolution module comprises: a single-channel feature extraction unit for processing the information of each channel of the sub-input feature map separately with a plurality of depthwise separable convolutions, where each depthwise separable convolution is a convolution kernel with a channel count of 1 and the number of depthwise separable convolutions used equals the number of channels of the sub-input feature map; and a feature merging unit for combining the features of the different channels extracted by the depthwise separable convolutions through a 1 × 1 point convolution to obtain a sub-output feature map.
Further, the group convolution module comprises: a dimension-raising unit for raising the dimension of the sub-input feature map through a 1 × 1 point convolution to obtain a sub-input feature map with an increased number of channels; a feature extraction unit for extracting features from the sub-input feature map with the increased number of channels through a depthwise separable convolution to obtain feature information; and a dimension-reduction unit for reducing the dimension of the feature information through a 1 × 1 point convolution to obtain a sub-output feature map, where the number of channels of the sub-output feature map equals that of the sub-input feature map.
Further, the group convolution module further comprises: a first nonlinear activation unit for applying a nonlinear activation function to the dimension-raised convolution result; a second nonlinear activation unit for applying the nonlinear activation function after feature extraction through the depthwise separable convolution; and a linear activation unit for applying a linear activation function after the dimension of the feature information is reduced through the 1 × 1 point convolution.
The convolution operation optimization method and device of the neural network model provided by the invention bring the following beneficial effects:
1. The invention optimizes the computation and parameter counts of the convolution operation by performing channel decomposition of the feature map and channel decomposition of the convolution kernels, thereby optimizing the convolution operations of the neural network model.
2. The invention optimizes the CNN convolution operations in the network while preserving the model's effectiveness, reducing the computation and running time of the network model and meeting real-time requirements in typical application scenarios.
Drawings
The above features, technical features, advantages and implementations of a method and apparatus for optimizing convolution operations of a neural network model will be further described in the following detailed description of preferred embodiments with reference to the accompanying drawings.
FIG. 1 is a flow diagram of one embodiment of a method for convolutional optimization of a neural network model of the present invention;
FIG. 2 is a flow diagram of another embodiment of a method for convolutional optimization of a neural network model of the present invention;
FIG. 3 is a flow diagram of another embodiment of a method for convolutional optimization of a neural network model of the present invention;
FIG. 4 is a schematic structural diagram of an embodiment of an apparatus for optimizing convolution operations of a neural network model according to the present invention;
FIG. 5 is a schematic structural diagram of an apparatus for optimizing convolution operations of a neural network model according to another embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for optimizing convolution operations of a neural network model according to another embodiment of the present invention;
FIG. 7 is a block diagram of the Group-Inception structure of FIG. 1.
The reference numbers illustrate:
110: channel decomposition module; 120: group convolution module; 130: feature shuffling module; 121: single-channel feature extraction unit; 122: feature merging unit; 123: dimension-raising unit; 124: feature extraction unit; 125: dimension-reduction unit; 126: first nonlinear activation unit; 127: second nonlinear activation unit; 128: linear activation unit.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention and the technical solutions in the prior art, the following description refers to the accompanying drawings. Obviously, the drawings described below are only some examples of the invention; a person skilled in the art can derive other drawings and embodiments from them without inventive effort.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".
In an embodiment of the present invention, as shown in fig. 1, a convolution operation optimization method for a neural network model, applied to a convolution operation for extracting feature information from an input feature map to obtain an output feature map, includes:
step S100 is to segment the input feature map in the channel dimension to obtain sub-input feature maps of each group.
Specifically, the input feature map may be the original input image of a model (for example, one based on a convolutional neural network), or the input feature information of a particular convolutional layer within the model. The data volume of a feature map is expressed as H (height) × W (width) × C (number of channels), meaning data for C channels with H × W values per channel. Since the invention concerns only the data volumes of the various feature maps and not the specific feature data, the description below is simplified and feature maps are denoted H × W × C.
Splitting the input feature map along the channel dimension means assigning certain channels to a group. For example, an input feature map with 16 channels can be grouped 4 channels at a time into 4 (16/4) groups, each group containing the data of 4 channels and forming one sub-input feature map.
The above process is referred to as channel decomposition of the input feature map.
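The channel decomposition above can be sketched as a split along the channel axis. This is an illustrative sketch only, not part of the patent; the array shapes and the use of NumPy are assumptions:

```python
import numpy as np

# Hypothetical example: a 27 x 27 x 16 (H x W x C) input feature map is
# split along the channel dimension into 4 groups of 4 channels each.
x = np.random.rand(27, 27, 16)
groups = 4
sub_inputs = np.split(x, groups, axis=2)  # one sub-input feature map per group

print(len(sub_inputs))       # 4
print(sub_inputs[0].shape)   # (27, 27, 4)
```

Concatenating the sub-maps back along the channel axis recovers the original feature map, so the split itself loses no information.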
Step S200 performs different group convolution operations on each sub-input feature map, and extracts feature information of channels included in each group to obtain different sub-output feature maps.
Specifically, the group convolution operation refers to a convolution operation for each group. The input of the group convolution operation is a sub-input feature map, and the output is a sub-output feature map, so as to extract feature information from the sub-input feature map, wherein the feature information forms the sub-output feature map.
The group convolution operation may use convolution kernels of the same size as a conventional convolution operation, but the number of kernels per group is only 1/g of the number of conventional convolution kernels (the kernels used in a conventional convolution operation, referred to as conventional convolution kernels), where g is the number of groups. The number of conventional convolution kernels equals the number of channels of the output feature map, while the number of kernels in each group convolution equals the number of output channels divided by the number of groups; this is also referred to as channel decomposition of the convolution kernels.
In order to ensure that the input feature map can be uniformly segmented and the conventional convolution kernel can be uniformly decomposed during channel decomposition, the group number is usually a common divisor of the number of channels of the input feature map and the number of channels of the output feature map.
Computational efficiency is improved since each group convolution only needs to process feature maps of a few channels.
For example, suppose the input feature map is D_F × D_F × m, where D_F is the height and width of the input feature map and m is its number of channels, and the output feature map is D_F × D_F × n, where n is its number of channels. The convolution kernel size is s × s. With conventional convolution, n kernels of size s × s × m are used; the computation for one kernel to convolve the input feature map is s × s × m × D_F × D_F, so the computation for all n kernels is s × s × m × D_F × D_F × n.
With group convolution, suppose the input data is divided into g groups. Each sub-input feature map is D_F × D_F × (m/g) and the corresponding sub-output feature map is D_F × D_F × (n/g). Each group convolution uses n/g kernels of size s × s × (m/g), so the computation for each group convolution is s × s × (m/g) × D_F × D_F × (n/g). There are g group convolutions, so the total computation is s × s × (m/g) × D_F × D_F × (n/g) × g, which is 1/g of the computation of conventional convolution.
In addition, compared with conventional convolution, group convolution also reduces the parameters required for the convolution operation, as the following analysis shows:
Conventional convolution requires m × s × s × n parameters. With group convolution, each group requires (m/g) × s × s × (n/g) parameters; with g groups in total, the overall parameter count is (m/g) × s × s × (n/g) × g, reduced to 1/g of the original.
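The two 1/g reductions above can be checked numerically. The sizes below are arbitrary assumptions for illustration, not values taken from the patent:

```python
# Assumed sizes: D_F = 28, m = 16 input channels, n = 32 output channels,
# s = 3 kernel size, g = 4 groups.
DF, m, n, s, g = 28, 16, 32, 3, 4

# Conventional convolution: n kernels of size s x s x m.
conv_ops = s * s * m * DF * DF * n
conv_params = m * s * s * n

# Group convolution: g groups, each with n/g kernels of size s x s x (m/g).
group_ops = s * s * (m // g) * DF * DF * (n // g) * g
group_params = (m // g) * s * s * (n // g) * g

print(conv_ops // group_ops)        # 4, i.e. computation reduced to 1/g
print(conv_params // group_params)  # 4, i.e. parameters reduced to 1/g
```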
Step S300 shuffles and combines the sub-output feature maps of all the groups to obtain an output feature map.
Specifically, after splitting along the channel dimension, the features of the sub-output feature maps are relatively sparse. If the sub-output feature maps of all groups were simply concatenated to form the output feature map, the feature maps of different groups would remain informationally isolated, reducing expressive capacity. Therefore, after the group convolutions, the feature maps of different groups are mixed by channel shuffling, realizing feature interaction between the groups.
For example, consider the Group-Inception structure shown in FIG. 7, taking an online video detection model as an example. The input video image has three RGB channels, whose pixel values form the input feature map; that is, the input feature map has 3 channels. Splitting and grouping the input feature map by the three RGB channels yields 3 sub-input feature maps: sub-input feature map 1, sub-input feature map 2, and sub-input feature map 3. Each sub-input feature map contains the pixel values of only one channel.
A group convolution is performed on each sub-input feature map, yielding 3 sub-output feature maps, each containing the features of 3 channels (11 denotes the data of channel 1 of the first sub-output feature map, 12 the data of channel 2 of the first, 13 the data of channel 3 of the first, 21 the data of channel 1 of the second, 31 the data of channel 1 of the third, and so on).
The 3 sub-output feature maps are then channel-shuffled and combined, mixing the features of different channels to realize feature interaction between channels and obtain the output feature map. The data of the first 3 channels of the output feature map come from channel 1 of sub-output feature maps 1-3; the middle 3 channels and the last 3 channels are handled analogously.
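A minimal NumPy sketch of the channel shuffle described above. The reshape/transpose formulation is an assumption about implementation, not taken from the patent; the channel labels follow the text (11, 12, 13 for the first sub-output map, and so on):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave the channels of an H x W x C map across `groups` groups."""
    h, w, c = x.shape
    x = x.reshape(h, w, groups, c // groups)  # separate the group axis
    x = x.transpose(0, 1, 3, 2)               # swap group and within-group axes
    return x.reshape(h, w, c)

# Three 3-channel sub-output maps labelled as in the text:
# channels 11,12,13 / 21,22,23 / 31,32,33
subs = [np.full((2, 2, 3), 10 * i) + np.arange(1, 4) for i in (1, 2, 3)]
cascaded = np.concatenate(subs, axis=2)   # simple cascade: 11,12,13,21,...,33
shuffled = channel_shuffle(cascaded, groups=3)
print(shuffled[0, 0].tolist())  # [11, 21, 31, 12, 22, 32, 13, 23, 33]
```

After shuffling, the first three output channels hold channel 1 of each sub-output feature map, matching the behaviour described for the output feature map.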
In this embodiment, group convolution reduces the computation and parameter counts of the convolution operation, optimizing it and making the network model better suited to scenarios with strict real-time requirements.
In another embodiment of the present invention, as shown in fig. 2, a convolution operation optimization method for a neural network model includes:
step S100 is to segment the input feature map in the channel dimension to obtain sub-input feature maps of each group.
Step S200 performs different group convolution operations on each sub-input feature map, and extracts feature information of channels included in each group to obtain different sub-output feature maps.
The group convolution operation on each sub-input feature map specifically comprises the following steps:
step S210, respectively processing information of each channel of the sub-input feature map by using a plurality of depth-separable convolutions, where the depth-separable convolutions are convolution kernels with channels of 1, and the number of the depth-separable convolutions is equal to the number of channels of the sub-input feature map;
step S220 combines the features of the different channels extracted by the depth separable convolution through a 1 × 1 point convolution to obtain a sub-output feature map.
Specifically, a depthwise separable convolution is a convolution kernel with a channel count of 1; each kernel processes the features of only one channel of the sub-input feature map. Because each depthwise separable convolution learns the features of a single channel, the features of different channels are then combined through a 1 × 1 point convolution so that cross-channel features can be learned.
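Steps S210-S220 can be sketched with a naive NumPy implementation. The shapes, "valid" padding, and stride 1 are assumptions for illustration, not specified by the patent:

```python
import numpy as np

def depthwise_separable(x, dw_kernels, pw_kernels):
    """S210: one s x s depthwise kernel per channel; S220: 1 x 1 point conv.

    x:          H x W x m1 sub-input feature map
    dw_kernels: s x s x m1 (channel count 1 per kernel)
    pw_kernels: m1 x n1 (n1 point convolutions of size 1 x 1 x m1)
    """
    h, w, m1 = x.shape
    s = dw_kernels.shape[0]
    oh, ow = h - s + 1, w - s + 1
    dw = np.zeros((oh, ow, m1))
    for c in range(m1):                      # each kernel sees only one channel
        for i in range(oh):
            for j in range(ow):
                dw[i, j, c] = np.sum(x[i:i + s, j:j + s, c] * dw_kernels[:, :, c])
    return dw @ pw_kernels                   # merge channel features: oh x ow x n1

x = np.random.rand(6, 6, 4)
out = depthwise_separable(x, np.random.rand(3, 3, 4), np.random.rand(4, 8))
print(out.shape)  # (4, 4, 8)
```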
For example, suppose the sub-input feature map is D_F × D_F × m1. Then m1 depthwise separable convolutions of size s × s × 1 are required to process the features of all channels of the sub-input feature map, and the computation for all the depthwise separable convolutions is s × s × m1 × D_F × D_F.
If the sub-output feature map has n1 channels, n1 point convolutions of size 1 × 1 × m1 are needed, and the computation for all the point convolutions is m1 × D_F × D_F × n1. The computation of this combination is s × s × m1 × D_F × D_F + m1 × D_F × D_F × n1. Relative to the computation of the conventional convolution method (s × s × m1 × D_F × D_F × n1), the computational cost of this combination is reduced to:
(s × s × m1 × D_F × D_F + m1 × D_F × D_F × n1) / (s × s × m1 × D_F × D_F × n1) = 1/n1 + 1/(s × s)
Furthermore, the depthwise separable convolution also compresses the model by reducing the number of parameters. Continuing the previous example: conventional convolution requires m1 × s × s × n1 parameters, while the depthwise separable combination requires s × s × m1 + n1 × m1 parameters, so the compression ratio of the parameters is:
(s × s × m1 + n1 × m1) / (m1 × s × s × n1) = 1/n1 + 1/(s × s)
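Both ratios above reduce to 1/n1 + 1/s², which can be verified numerically with assumed sizes (not values from the patent):

```python
# Assumed sizes for illustration only.
m1, n1, s, DF = 16, 32, 3, 28

# Computation: depthwise + point convolutions versus conventional convolution.
sep_ops = s * s * m1 * DF * DF + m1 * DF * DF * n1
conv_ops = s * s * m1 * DF * DF * n1

# Parameters: depthwise + point convolutions versus conventional convolution.
sep_params = s * s * m1 + n1 * m1
conv_params = m1 * s * s * n1

expected = 1 / n1 + 1 / (s * s)   # 1/n1 + 1/s^2
print(abs(sep_ops / conv_ops - expected) < 1e-12)        # True
print(abs(sep_params / conv_params - expected) < 1e-12)  # True
```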
step S300 shuffles and combines the sub-output feature maps of all the groups to obtain an output feature map.
In this embodiment, on the basis of group convolution, the group convolution operation is implemented with depthwise separable convolutions and 1 × 1 point convolutions, further reducing the computation and parameter counts and optimizing the convolution operation.
In another embodiment of the present invention, as shown in fig. 3, a convolution operation optimization method for a neural network model includes:
Step S110 splits the input feature map according to its number of channels, obtaining a sub-input feature map for each channel.
Specifically, taking the online video detection model as an example, the input video image has three RGB channels; the pixel values of each channel form a two-dimensional matrix, and the pixel values of the three channels together form the input feature map. In other words, the input feature map has 3 channels. Splitting and grouping the input feature map by the three RGB channels yields 3 sub-input feature maps, each containing the pixel values of only one channel.
Step S200 performs different group convolution operations on each sub-input feature map, and extracts feature information of channels included in each group to obtain different sub-output feature maps.
The group convolution operation on each sub-input feature map specifically comprises the following steps:
step S230 performs dimension increase on the sub-input feature map by 1 × 1 point convolution, and performs nonlinear operation on the dimension-increased convolution result by using a nonlinear activation function to obtain a sub-input feature map with an increased number of channels.
Step S240, firstly, performing feature extraction on the sub-input feature map with the increased number of channels through depth separable convolution, and then performing nonlinear operation through the nonlinear activation function to obtain feature information;
step S250, performing dimension reduction on the feature information through 1 × 1 point convolution, and performing activation processing through a linear activation function to obtain a sub-output feature map, where the number of channels of the sub-output feature map is equal to the number of channels of the sub-input feature map.
Specifically, each sub-input feature map contains the pixel values of only one channel, so its channel count is already small; a 1 × 1 point-by-point convolution therefore raises the dimension, increasing the number of channels of the sub-input feature map. After the point convolution, a nonlinear activation function is applied to increase nonlinear expressive capacity.
Then a 3 × 3 depthwise separable convolution extracts features, greatly reducing the computation in the network. After the depthwise separable convolution, the nonlinear activation function is applied again to increase nonlinear expressive capacity.
Optionally, the nonlinear activation function is the ReLU6 activation function. ReLU6 is a ReLU variant that caps the maximum output at 6 and is suited to processing high-dimensional feature inputs.
Finally, a 1 × 1 point-by-point convolution reduces the dimension. This restores the channel count of the feature map, and also provides a certain channel-shuffling effect by mixing the feature maps extracted by the depthwise separable convolution across channels, improving expressive capacity. The feature dimensionality after this last 1 × 1 point convolution is already minimal; applying ReLU6 again would damage the features and cause a large loss of information, so a linear activation function is used after the final 1 × 1 point convolution.
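A hedged NumPy sketch of the expand / depthwise-extract / project sequence of steps S230-S250. The function names, shapes, "valid" padding, and random weights are assumptions; the patent specifies only the operation order and activations:

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)  # ReLU capped at 6

def expand_extract_project(x, w_up, dw, w_down):
    """S230: 1x1 conv raises channels (+ ReLU6); S240: depthwise conv (+ ReLU6);
    S250: 1x1 conv restores channels (linear activation)."""
    up = relu6(x @ w_up)                       # H x W x c_exp after expansion
    s, c_exp = dw.shape[0], dw.shape[2]
    h, w = up.shape[0], up.shape[1]
    oh, ow = h - s + 1, w - s + 1
    feat = np.zeros((oh, ow, c_exp))
    for ch in range(c_exp):                    # depthwise: one kernel per channel
        for i in range(oh):
            for j in range(ow):
                feat[i, j, ch] = np.sum(up[i:i + s, j:j + s, ch] * dw[:, :, ch])
    return relu6(feat) @ w_down                # project back; no final nonlinearity

x = np.random.rand(5, 5, 1)                    # single-channel sub-input map
out = expand_extract_project(x, np.random.rand(1, 6),
                             np.random.rand(3, 3, 6), np.random.rand(6, 1))
print(out.shape)  # (3, 3, 1): output channels equal the sub-input's
```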
Step S300 shuffles and combines the sub-output feature maps of all the groups to obtain an output feature map.
The method of this embodiment was applied to a damaged-video detection model; after iterative training to convergence, the VFD-SmartNet network was obtained. Its performance was compared with the AlexNet, VGG16, ResNet18, ResNet34, ResNet-like, and DenseNet-like models, as follows:
[Table: performance comparison of VFD-SmartNet against the AlexNet, VGG16, ResNet18, ResNet34, ResNet-like, and DenseNet-like models in parameter count, running speed, recall, and precision; original table image not reproduced]
the data in the table show that the operation speed is related to the parameter quantity, the parameter quantity of the VFD-SmartNet model is obviously reduced, the operation speed is obviously improved, the aim of network acceleration is well achieved, the recall ratio and the precision ratio are also maintained at higher levels, and the model can improve the operation speed on the premise of keeping the accuracy of the model so as to guarantee the requirement of real-time performance.
In another embodiment of the present invention, as shown in fig. 4, an apparatus for optimizing convolution operations of a neural network model includes:
and the channel decomposition module 110 is configured to segment the input feature map in a channel dimension to obtain sub-input feature maps of each group.
Specifically, the input feature map may be an original image input to a model (for example, one based on a convolutional neural network), or the feature information input to a certain convolutional layer within the model. The data volume of a feature map is expressed as H (height) × W (width) × C (number of channels), meaning data for C channels, each channel holding H × W values.
Splitting the input feature map in the channel dimension means grouping certain channels together. For example, if the input feature map is 27 × 27 × 16, its 16 channels can be divided into 4 groups (16/4) of 4 channels each; the data of the 4 channels in each group form one sub-input feature map.
The above process is referred to as channel decomposition of the input feature map.
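The channel decomposition described above can be sketched with numpy (illustrative shapes only, not code from the patent): a 27 × 27 × 16 feature map is split along the channel axis into 4 sub-input feature maps of 4 channels each.

```python
import numpy as np

H, W, C, groups = 27, 27, 16, 4
feature_map = np.zeros((H, W, C))

# Split along the channel axis: each sub-input feature map keeps all
# H x W positions but only C / groups channels.
sub_maps = np.split(feature_map, groups, axis=2)

print(len(sub_maps))      # number of groups
print(sub_maps[0].shape)  # shape of one sub-input feature map
```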
And the grouping convolution module 120 is configured to perform different grouping convolution operations on each sub-input feature map, and extract feature information of channels included in each group to obtain different sub-output feature maps.
Specifically, the group convolution operation refers to a convolution operation for each group. The input of the group convolution operation is a sub-input feature map, and the output is a sub-output feature map, so as to extract feature information from the sub-input feature map, wherein the feature information forms the sub-output feature map.
The group convolution operation may use convolution kernels of the same size as a conventional convolution operation, but the number of kernels per group is only 1/g of the number of conventional kernels (the kernels used in a conventional convolution operation are referred to here as conventional convolution kernels). The number of conventional kernels equals the number of channels of the output feature map, while the number of kernels in each group convolution equals the number of output channels divided by the number of groups; this is also referred to as channel decomposition of the convolution kernels.
In order to ensure that the input feature map can be uniformly segmented and the conventional convolution kernel can be uniformly decomposed during channel decomposition, the group number is usually a common divisor of the number of channels of the input feature map and the number of channels of the output feature map.
Computational efficiency is improved since each group convolution only needs to process feature maps of a few channels.
For example, let the input feature map be D_F × D_F × m, where D_F is the height and width of the input feature map and m is its number of channels, and let the output feature map be D_F × D_F × n, where n is its number of channels. With a convolution kernel size of s × s, conventional convolution uses n kernels of size s × s × m; the amount of computation for each kernel to convolve the input feature map is s × s × m × D_F × D_F, so the total for n kernels is s × s × m × D_F × D_F × n.
With grouped convolution, assuming the input data is divided into g groups, each sub-input feature map is D_F × D_F × (m/g) and the corresponding sub-output feature map is D_F × D_F × (n/g). Each group convolution uses n/g kernels of size s × s × (m/g), so its computation amount is s × s × (m/g) × D_F × D_F × (n/g). With g group convolutions in total, the overall computation is s × s × (m/g) × D_F × D_F × (n/g) × g, i.e. the computation is reduced to 1/g of the conventional convolution.
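The 1/g reduction derived above can be checked with a small counting sketch (pure arithmetic with illustrative sizes, no actual convolution is performed):

```python
def conv_macs(df, m, n, s):
    """Multiply-accumulates for conventional convolution: n kernels of size s x s x m."""
    return s * s * m * df * df * n

def group_conv_macs(df, m, n, s, g):
    """MACs for grouped convolution: g groups, each with n/g kernels of size s x s x (m/g)."""
    return s * s * (m // g) * df * df * (n // g) * g

df, m, n, s, g = 28, 16, 32, 3, 4
print(conv_macs(df, m, n, s) / group_conv_macs(df, m, n, s, g))  # ratio equals g
```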
In addition, compared with conventional convolution, grouped convolution also reduces the number of parameters required for the convolution operation, analyzed as follows:
The parameters required for conventional convolution number m × s × s × n. With grouped convolution, each group convolution requires (m/g) × s × s × (n/g) parameters; with g groups in total, the overall parameter count is (m/g) × s × s × (n/g) × g, reduced to 1/g of the original.
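The parameter-count reduction can likewise be verified by direct counting (weights only, biases omitted; the sizes below are illustrative assumptions):

```python
def conv_params(m, n, s):
    """Conventional convolution: n kernels of size s x s x m."""
    return m * s * s * n

def group_conv_params(m, n, s, g):
    """Grouped convolution: g groups of n/g kernels of size s x s x (m/g)."""
    return (m // g) * s * s * (n // g) * g

m, n, s, g = 16, 32, 3, 4
print(conv_params(m, n, s) // group_conv_params(m, n, s, g))  # reduction factor equals g
```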
And the feature shuffling module 130 is used for shuffling and combining the sub-output feature maps of all the groups to obtain the output feature map.
Specifically, after segmentation along the channel dimension, the features of each sub-output feature map are relatively sparse. If the sub-output feature maps of all groups were simply concatenated to form the output feature map, the feature maps of different groups would remain informationally isolated, reducing the information expression capability. Therefore, after the group convolutions, the feature maps of different groups are mixed by channel shuffling, realizing feature interaction between the groups.
For example, in the Group-Inception structure shown in fig. 7, taking an online video detection model as an example, the input video image has three RGB channels whose pixel values form the input feature map; that is, the number of channels of the input feature map is 3. Splitting and grouping the input feature map by the three RGB channels yields 3 sub-input feature maps: sub-input feature map 1, sub-input feature map 2 and sub-input feature map 3. Each sub-input feature map contains the pixel values of only one channel.
Group convolution is performed on each sub-input feature map separately, yielding 3 sub-output feature maps, each containing features of 3 channels (11 denotes the data of channel 1 of the first sub-output feature map, 12 denotes channel 2 of the first, 13 denotes channel 3 of the first; 21 denotes channel 1 of the second sub-output feature map, 31 denotes channel 1 of the third, and so on).
Channel shuffling and combination are then performed on the 3 sub-output feature maps, mixing the features of different channels, realizing feature interaction between channels, and producing the output feature map. The data of the first 3 channels of the output feature map come from channel 1 of sub-output feature maps 1–3 respectively; the middle 3 channels and the last 3 channels are handled similarly.
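A common way to realize this shuffle (a sketch; the patent does not prescribe a particular implementation) is to reshape the channel axis into (groups, channels_per_group), transpose, and flatten back. The example below uses the 11/21/31 labeling from the text, encoding "group, channel" in each value:

```python
import numpy as np

def channel_shuffle(x, groups):
    """x has shape (H, W, C); interleave channels across groups."""
    h, w, c = x.shape
    x = x.reshape(h, w, groups, c // groups)
    x = x.transpose(0, 1, 3, 2)  # swap the group and per-group channel axes
    return x.reshape(h, w, c)

# A 1 x 1 map with 9 channels from 3 groups of 3; value "gc" means
# channel c of sub-output feature map g.
x = np.array([[[11, 12, 13, 21, 22, 23, 31, 32, 33]]], dtype=float)
print(channel_shuffle(x, groups=3)[0, 0])  # first 3 output channels are 11, 21, 31
```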
In the embodiment, the operation amount and the parameter amount of the convolution operation are reduced and the convolution operation is optimized in a grouping convolution mode, so that the network model is favorably applied to scenes with high real-time performance.
In another embodiment of the present invention, as shown in fig. 5, an apparatus for optimizing convolution operations of a neural network model includes:
and the channel decomposition module 110 is configured to segment the input feature map in a channel dimension to obtain sub-input feature maps of each group.
And the grouping convolution module 120 is configured to perform different grouping convolution operations on each sub-input feature map, and extract feature information of channels included in each group to obtain different sub-output feature maps.
Wherein, the grouping convolution module 120 includes:
a single-channel feature extraction unit 121, configured to separately process the information of each channel of the sub-input feature map using a plurality of depthwise separable convolutions, where a depthwise separable convolution is a convolution kernel whose channel count is 1, and the number of depthwise separable convolutions used equals the number of channels of the sub-input feature map;
and the feature merging unit 122 is configured to merge the features of the different channels extracted by the depth separable convolution through 1 × 1 point convolution to obtain a sub-output feature map.
Specifically, a depthwise separable convolution is a convolution kernel whose channel count is 1; each kernel processes the features of only one channel of the sub-input feature map. Since each depthwise separable convolution learns the features of only one channel, the features of different channels are subsequently combined by a 1 × 1 point convolution, so that cross-channel features can be learned.
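The two-stage operation just described can be sketched in numpy (an illustrative implementation with random weights and "valid" padding, not the patent's code): one s × s kernel per input channel, followed by a 1 × 1 pointwise convolution that mixes the per-channel outputs.

```python
import numpy as np

def depthwise_separable(x, depth_kernels, point_weights):
    """x: (H, W, m1); depth_kernels: (m1, s, s); point_weights: (n1, m1)."""
    h, w, m1 = x.shape
    s = depth_kernels.shape[1]
    oh, ow = h - s + 1, w - s + 1  # "valid" padding for simplicity
    depth_out = np.zeros((oh, ow, m1))
    for c in range(m1):  # each kernel sees only its own channel
        for i in range(oh):
            for j in range(ow):
                depth_out[i, j, c] = np.sum(x[i:i+s, j:j+s, c] * depth_kernels[c])
    # 1 x 1 pointwise convolution = per-pixel matrix multiply over channels
    return depth_out @ point_weights.T

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))                       # m1 = 4 input channels
out = depthwise_separable(x,
                          rng.normal(size=(4, 3, 3)),  # one 3x3 kernel per channel
                          rng.normal(size=(6, 4)))     # n1 = 6 output channels
print(out.shape)  # (6, 6, 6)
```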
For example, if the sub-input feature map is D_F × D_F × m1, then m1 depthwise separable convolutions of size s × s × 1 are required to process the features of all channels of the sub-input feature map. The computation amount of all the depthwise separable convolutions is s × s × m1 × D_F × D_F.
If the sub-output feature map has n1 channels, n1 point convolutions of size 1 × 1 × m1 are needed, and the computation amount of all the point convolutions is m1 × D_F × D_F × n1. The total computation of this combination is s × s × m1 × D_F × D_F + m1 × D_F × D_F × n1. Relative to the computation of the conventional convolution method (s × s × m1 × D_F × D_F × n1), the computational cost of this combination is reduced to:
(s × s × m1 × D_F × D_F + m1 × D_F × D_F × n1) / (s × s × m1 × D_F × D_F × n1) = 1/n1 + 1/s²
Furthermore, the depthwise separable convolution also compresses the model by reducing the number of parameters. Continuing the previous example, conventional convolution requires m1 × s × s × n1 parameters, while the depthwise separable combination requires s × s × m1 + n1 × m1 parameters, giving a compression ratio of:
(s × s × m1 + n1 × m1) / (m1 × s × s × n1) = 1/n1 + 1/s²
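Both ratios above reduce to 1/n1 + 1/s², which can be confirmed numerically (pure arithmetic with illustrative sizes):

```python
def cost_ratio(m1, n1, s, df):
    """Computation of depthwise separable + pointwise vs. conventional convolution."""
    separable = s * s * m1 * df * df + m1 * df * df * n1
    conventional = s * s * m1 * df * df * n1
    return separable / conventional

def param_ratio(m1, n1, s):
    """Parameter count of the separable combination vs. conventional convolution."""
    return (s * s * m1 + n1 * m1) / (m1 * s * s * n1)

m1, n1, s, df = 16, 32, 3, 28
print(cost_ratio(m1, n1, s, df))   # equals 1/n1 + 1/s**2
print(param_ratio(m1, n1, s))      # same closed form
```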
and the feature shuffling module 130 is used for shuffling and combining the sub-output feature maps of all the groups to obtain the output feature map.
In this embodiment, on the basis of grouped convolution, the group convolution operation is implemented using depthwise separable convolution and 1 × 1 point convolution, further reducing the computation and parameter amounts of the convolution operation and thereby optimizing it.
In another embodiment of the present invention, as shown in fig. 4 and 6, an apparatus for optimizing convolution operations of a neural network model includes:
and the channel decomposition module 110 is configured to split the input feature map according to the number of channels of the input feature map, so as to obtain a sub-input feature map of each channel.
Specifically, taking the online video detection model as an example, the input video image has three channels of RGB, the pixel values of each channel form a two-dimensional matrix, and the pixel values of the three channels form an input feature map, in other words, the number of channels of the input feature map is 3. And segmenting and grouping the input feature map according to three channels of RGB to obtain 3 sub-input feature maps, wherein each sub-input feature map only comprises a pixel value of one channel.
And the grouping convolution module 120 is configured to perform different grouping convolution operations on each sub-input feature map, and extract feature information of channels included in each group to obtain different sub-output feature maps.
Wherein, the grouping convolution module 120 includes:
a dimension raising unit 123, configured to raise the dimension of the sub-input feature map by 1 × 1 dot convolution;
the first nonlinear activation unit 126 is configured to perform nonlinear operation on the convolution result after the dimension increase by using a nonlinear activation function to obtain a sub-input feature map with an increased channel number;
a feature extraction unit 124, configured to perform feature extraction on the sub-input feature map with the increased number of channels through depth separable convolution;
a second nonlinear activation unit 127, configured to perform a nonlinear operation through the nonlinear activation function after performing feature extraction through depth separable convolution, so as to obtain feature information;
a dimension reduction unit 125, configured to perform dimension reduction on the feature information through a 1 × 1 point convolution;
a linear activation unit 128, configured to perform dimension reduction on the feature information through 1 × 1 point convolution, and then perform activation processing through a linear activation function to obtain a sub-output feature map, where the number of channels of the sub-output feature map is equal to the number of channels of the sub-input feature map.
Specifically, a sub-input feature map contains the pixel values of only one channel, so its channel count is already small; a 1 × 1 point-by-point convolution is therefore used to raise the dimension, increasing the number of channels of the sub-input feature map. After the point convolution, a nonlinear activation function is applied to increase the nonlinear expression capability.
A 3 × 3 depthwise separable convolution is then used to extract features, greatly reducing the amount of computation in the network. After the depthwise separable convolution, a nonlinear activation function is again applied to increase the nonlinear expression capability.
Optionally, the nonlinear activation function is the ReLU6 activation function. ReLU6 is a special ReLU function that limits the maximum output to 6, making it suitable for handling high-dimensional feature inputs.
Finally, a 1 × 1 point-by-point convolution performs dimensionality reduction. On one hand this restores the number of channels of the feature map; on the other hand it also has a certain channel-shuffling effect, mixing the feature maps extracted by the depthwise separable convolution across channels and improving the information expression capability. The dimensionality of the features after this last 1 × 1 point convolution is already minimal; if ReLU6 were applied again, the features would be damaged, causing a large loss of information. A linear activation function is therefore used after the final 1 × 1 point convolution.
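The whole expand → depthwise → project branch described above can be condensed into a numpy sketch (shapes only; weights are random, the depthwise step is simplified to a per-channel 3 × 3 "valid" convolution, and all dimensions are illustrative assumptions): a 1 × 1 expansion with ReLU6, a 3 × 3 depthwise convolution with ReLU6, and a linear 1 × 1 projection back to the original channel count.

```python
import numpy as np

def relu6(x):
    return np.clip(x, 0.0, 6.0)

def pointwise(x, w):  # 1 x 1 convolution: per-pixel channel mixing
    return x @ w.T

def depthwise3x3(x, k):  # k: (channels, 3, 3), one kernel per channel
    h, w, c = x.shape
    out = np.zeros((h - 2, w - 2, c))
    for ch in range(c):
        for i in range(h - 2):
            for j in range(w - 2):
                out[i, j, ch] = np.sum(x[i:i+3, j:j+3, ch] * k[ch])
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(10, 10, 1))                          # single-channel sub-input
expanded = relu6(pointwise(x, rng.normal(size=(8, 1))))   # raise 1 -> 8 channels, ReLU6
feats = relu6(depthwise3x3(expanded, rng.normal(size=(8, 3, 3))))
out = pointwise(feats, rng.normal(size=(1, 8)))           # linear projection back to 1 channel
print(out.shape)  # (8, 8, 1)
```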
And the feature shuffling module 130 is used for shuffling and combining the sub-output feature maps of all the groups to obtain the output feature map.
The method of the embodiment is applied to a damaged video detection model, iterative training is carried out on the model, and the VFD-SmartNet network is obtained after convergence. Comparing it with AlexNet, VGG16, ResNet18, ResNet34, ResNet-like, DenseNet-like models, the performance is compared as follows:
[Table: performance comparison of VFD-SmartNet with AlexNet, VGG16, ResNet18, ResNet34, ResNet-like and DenseNet-like models in terms of operation speed, recall and precision]
As can be seen from the data in the table, the operation speed of the VFD-SmartNet model is significantly improved, achieving the goal of network acceleration, while recall and precision remain at high levels; the model improves operation speed without sacrificing accuracy, meeting the real-time requirement.
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention; those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and such improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A convolution operation optimization method of a neural network model is applied to convolution operation for extracting characteristic information from an input characteristic diagram and obtaining an output characteristic diagram, and is characterized by comprising the following steps:
segmenting the input feature map in the channel dimension to obtain sub-input feature maps of each group;
respectively carrying out different group convolution operations on each sub-input feature map, and extracting feature information of channels contained in each group to obtain different sub-output feature maps;
and shuffling and combining the sub-output characteristic graphs of all the groups to obtain the output characteristic graph.
2. The convolution optimization method of claim 1, wherein the slicing the input feature map in channel dimensions to obtain sub-input feature maps for each group comprises:
and segmenting the input feature diagram according to the number of the channels of the input feature diagram to obtain a sub-input feature diagram of each channel.
3. The convolution optimization method of claim 1, wherein the set of convolution operations comprises:
processing information of each channel of the sub-input feature map respectively by adopting a plurality of depth separable convolutions, wherein the depth separable convolutions are convolution kernels with channels being 1, and the number of the adopted depth separable convolutions is equal to the number of the channels of the sub-input feature map;
and combining the features of the different channels extracted by the depth separable convolution through 1 x 1 point convolution to obtain a sub-output feature map.
4. The convolution optimization method of claim 1, wherein the set of convolution operations comprises:
performing dimension increasing on the sub-input feature map through 1 × 1 point convolution to obtain a sub-input feature map with increased channel number;
performing feature extraction on the sub-input feature map with the increased number of channels through depth separable convolution to obtain feature information;
and performing dimensionality reduction on the feature information through 1 × 1 point convolution to obtain a sub-output feature map, wherein the number of channels of the sub-output feature map is equal to that of channels of the sub-input feature map.
5. The convolution optimization method of claim 4, wherein:
the step of performing dimensionality raising on the sub-input feature map through 1 × 1 point convolution comprises the following steps:
performing dimension increasing on the sub-input feature map through 1 × 1 point convolution, and performing nonlinear operation on a convolution result after dimension increasing by adopting a nonlinear activation function;
the feature extraction by depth separable convolution comprises:
firstly, carrying out feature extraction through depth separable convolution, and then carrying out nonlinear operation through the nonlinear activation function;
the dimension reduction of the characteristic information through 1 × 1 point convolution includes:
and reducing the dimension of the characteristic information through 1 multiplied by 1 point convolution, and performing activation processing through a linear activation function.
6. The convolution optimization method of claim 5, wherein:
the nonlinear activation function is a ReLU6 activation function.
7. A convolution operation optimization device of a neural network model is applied to convolution operation for extracting characteristic information from an input characteristic diagram and obtaining an output characteristic diagram, and is characterized by comprising the following steps:
the channel decomposition module is used for segmenting the input feature map in channel dimensions to obtain sub-input feature maps of each group;
the grouping convolution module is used for respectively carrying out different grouping convolution operations on each sub-input feature map, extracting feature information of channels contained in each group and obtaining different sub-output feature maps;
and the characteristic shuffling module is used for shuffling and combining the sub-output characteristic graphs of all the groups to obtain the output characteristic graph.
8. The convolution optimization device of claim 7, wherein the packet convolution module comprises:
a single-channel feature extraction unit, configured to separately process information of each channel of the sub-input feature map by using a plurality of depth-separable convolutions, where the depth-separable convolutions are convolution kernels with a channel of 1, and the number of the depth-separable convolutions used is equal to the number of channels of the sub-input feature map;
and the feature merging unit is used for merging the features of the different channels extracted by the depth separable convolution through 1 x 1 point convolution to obtain a sub-output feature map.
9. The convolution optimization device of claim 7, wherein the packet convolution module comprises:
the dimension increasing unit is used for increasing the dimension of the sub-input feature map through 1 multiplied by 1 point convolution to obtain the sub-input feature map with increased channel number;
the characteristic extraction unit is used for extracting the characteristics of the sub-input characteristic graphs with the increased channel number through depth separable convolution to obtain characteristic information;
and the dimension reduction unit is used for performing dimension reduction on the feature information through 1 multiplied by 1 point convolution to obtain a sub-output feature map, wherein the number of channels of the sub-output feature map is equal to that of the channels of the sub-input feature map.
10. The convolution optimization device of claim 9, wherein the packet convolution module further comprises:
the first nonlinear activation unit is used for carrying out nonlinear operation on the convolution result after the dimension is increased by adopting a nonlinear activation function;
a second nonlinear activation unit for performing a nonlinear operation by a nonlinear activation function after feature extraction by depth separable convolution;
and the linear activation unit is used for performing activation processing through a linear activation function after dimension reduction is performed on the characteristic information through 1 × 1 point convolution.
CN201911155114.2A 2019-11-22 2019-11-22 Convolution operation optimization method and device of neural network model Pending CN110909874A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911155114.2A CN110909874A (en) 2019-11-22 2019-11-22 Convolution operation optimization method and device of neural network model
CN202310411791.6A CN116416561A (en) 2019-11-22 2019-11-22 Video image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911155114.2A CN110909874A (en) 2019-11-22 2019-11-22 Convolution operation optimization method and device of neural network model

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202310411791.6A Division CN116416561A (en) 2019-11-22 2019-11-22 Video image processing method and device

Publications (1)

Publication Number Publication Date
CN110909874A true CN110909874A (en) 2020-03-24

Family

ID=69818785

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310411791.6A Pending CN116416561A (en) 2019-11-22 2019-11-22 Video image processing method and device
CN201911155114.2A Pending CN110909874A (en) 2019-11-22 2019-11-22 Convolution operation optimization method and device of neural network model

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202310411791.6A Pending CN116416561A (en) 2019-11-22 2019-11-22 Video image processing method and device

Country Status (1)

Country Link
CN (2) CN116416561A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445012A (en) * 2020-04-28 2020-07-24 南京大学 FPGA-based packet convolution hardware accelerator and method thereof
CN111445012B (en) * 2020-04-28 2023-04-18 南京大学 FPGA-based packet convolution hardware accelerator and method thereof
CN111445019A (en) * 2020-04-30 2020-07-24 南京大学 Device and method for realizing channel shuffling operation in packet convolution
CN111738424A (en) * 2020-06-29 2020-10-02 湖南国科微电子股份有限公司 Neural network processing method, neural network processing device, electronic equipment and storage medium
CN111738424B (en) * 2020-06-29 2023-12-26 湖南国科微电子股份有限公司 Neural network processing method and device, electronic equipment and storage medium
CN112288028A (en) * 2020-11-06 2021-01-29 神思电子技术股份有限公司 Image identification method based on stream convolution
CN112418401A (en) * 2020-11-20 2021-02-26 中山大学 Local distributed image identification method for terminal application
CN112363844A (en) * 2021-01-12 2021-02-12 之江实验室 Convolutional neural network vertical segmentation method for image processing
CN113313056A (en) * 2021-06-16 2021-08-27 中国科学技术大学 Compact 3D convolution-based lip language identification method, system, device and storage medium

Also Published As

Publication number Publication date
CN116416561A (en) 2023-07-11

Similar Documents

Publication Publication Date Title
CN110909874A (en) Convolution operation optimization method and device of neural network model
CN107944556B (en) Deep neural network compression method based on block item tensor decomposition
CN109934331B (en) Apparatus and method for performing artificial neural network forward operations
CN107340993B (en) Arithmetic device and method
CN109886391B (en) Neural network compression method based on space forward and backward diagonal convolution
CN110781912A (en) Image classification method based on channel expansion inverse convolution neural network
CN111210432A (en) Image semantic segmentation method based on multi-scale and multi-level attention mechanism
CN111882053B (en) Neural network model compression method based on splicing convolution
CN113111889A (en) Target detection network processing method for edge computing terminal
CN110728354B (en) Image processing method based on improved sliding type grouping convolution neural network
CN110782001B (en) Improved method for using shared convolution kernel based on group convolution neural network
TWI738048B (en) Arithmetic framework system and method for operating floating-to-fixed arithmetic framework
CN114882278A (en) Tire pattern classification method and device based on attention mechanism and transfer learning
Fuketa et al. Image-classifier deep convolutional neural network training by 9-bit dedicated hardware to realize validation accuracy and energy efficiency superior to the half precision floating point format
Qi et al. Learning low resource consumption cnn through pruning and quantization
US20230143985A1 (en) Data feature extraction method and related apparatus
CN116434039A (en) Target detection method based on multiscale split attention mechanism
CN112418388A (en) Method and device for realizing deep convolutional neural network processing
Nguyen et al. Development of an object recognition algorithm based on neural networks with using a hierarchical classifier
CN115375922A (en) Lightweight significance detection method based on multi-scale space attention
CN114492631A (en) Spatial attention calculation method based on channel attention
CN113554104A (en) Image classification method based on deep learning model
CN113313253A (en) Neural network compression method, data processing device and computer equipment
CN110807479A (en) Neural network convolution calculation acceleration method based on Kmeans algorithm
CN110378466A (en) Quantization method and system based on neural network difference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200324)