CN111368850A - Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal - Google Patents


Info

Publication number
CN111368850A
Authority
CN
China
Prior art keywords
convolution
feature
module
feature mapping
mapping
Prior art date
Legal status
Granted
Application number
CN201811589348.3A
Other languages
Chinese (zh)
Other versions
CN111368850B (en)
Inventor
刘阳
罗小伟
林福辉
Current Assignee
Spreadtrum Communications Tianjin Co Ltd
Original Assignee
Spreadtrum Communications Tianjin Co Ltd
Priority date
Filing date
Publication date
Application filed by Spreadtrum Communications Tianjin Co Ltd filed Critical Spreadtrum Communications Tianjin Co Ltd
Priority to CN201811589348.3A priority Critical patent/CN111368850B/en
Publication of CN111368850A publication Critical patent/CN111368850A/en
Application granted granted Critical
Publication of CN111368850B publication Critical patent/CN111368850B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

An image feature extraction method, an image target detection method, an image feature extraction device, an image target detection device, a convolution device, a CNN network device and a terminal are provided. The convolution device comprises: a channel expansion module, configured to perform a convolution operation on the feature mapping of image data and expand the number of channels of the feature mapping obtained by convolution to obtain a first feature mapping; a depth separation convolution module, configured to perform depth separation convolution on the first feature mapping output by the channel expansion module to obtain a second feature mapping; and a channel compression module, configured to receive the second feature mapping output by the depth separation convolution module, perform a convolution operation on the second feature mapping, and compress the number of channels of the data after the convolution operation to obtain a third feature mapping, where the number of channels of the third feature mapping is smaller than that of the first feature mapping. With the technical solution provided by the invention, the computational complexity of image convolution can be reduced, computational efficiency is improved, and the difficulty of feature extraction is reduced.

Description

Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal
Technical Field
The invention relates to the technical field of target detection, in particular to a method and a device for extracting image features and detecting a target, a convolution device, a CNN network device and a terminal.
Background
Target detection is a core problem in the field of computer vision. Its main purpose is to analyze image or video information and judge whether certain objects (such as human faces, pedestrians, automobiles and the like) are present; if so, the specific location of each object is determined. Target detection technology can be widely applied in fields such as security monitoring, automatic driving and human-machine interaction, and is a prerequisite for higher-order tasks such as behavior analysis and semantic analysis.
There are many target detection methods. The most influential among the conventional methods are the component-based Deformable Part Model (DPM) and the AdaBoost cascade model. The former is mainly applied in the field of pedestrian detection, the latter mainly in the field of face detection. However, the detection accuracy and adaptability of both methods have been surpassed by deep learning methods based on Convolutional Neural Networks (CNN), which are now widely applied to target detection. CNN-based target detection methods can be divided into two categories. One category is based on target candidate windows, typically represented by the Faster Regions with CNN features (Faster R-CNN) detection method. The other category is candidate-window-independent (Proposal Free) detection methods; typical examples include the Single Shot MultiBox Detector (SSD) and the You Only Look Once (YOLO) detection methods.
However, target detection accuracy depends heavily on the feature extraction performed on the image data, and feature extraction relies on image convolution to extract salient features. Existing image convolution methods have high complexity and long run times when extracting image features.
Disclosure of Invention
The technical problem solved by the invention is how to optimize a convolution device so as to reduce the complexity of the convolution computation and improve computational efficiency, which helps to keep high feature extraction accuracy while reducing the complexity of feature extraction.
To solve the above technical problem, an embodiment of the present invention provides an image convolution device, including: a channel expansion module, configured to perform a convolution operation on the feature mapping of image data and expand the number of channels of the feature mapping obtained by convolution to obtain a first feature mapping; a depth separation convolution module, configured to perform depth separation convolution on the first feature mapping output by the channel expansion module to obtain a second feature mapping; and a channel compression module, configured to receive the second feature mapping output by the depth separation convolution module, perform a convolution operation on the second feature mapping, and compress the number of channels of the data after the convolution operation to obtain a third feature mapping, where the number of channels of the third feature mapping is smaller than that of the first feature mapping.
Optionally, the channel expansion module includes: a first convolution layer sub-module, configured to determine (e·Min) as the number of output channels of the first convolution layer sub-module and perform (e·Min) M × M convolutions on the feature mapping input to the channel expansion module, where M and Min are positive integers, e represents a preset expansion coefficient, e > 1, e is a positive integer, and Min represents the number of channels of the feature mapping; a first batch normalization layer sub-module, configured to perform batch normalization on the output result of the first convolution layer sub-module; and a first limited linear unit layer sub-module, configured to perform limited linear processing on the data output by the first batch normalization layer sub-module to obtain the first feature mapping.
Optionally, the depth separation convolution module includes: a depth separation convolution layer sub-module, configured to perform N × N depth separation convolution on the first feature mapping, where N > M and N is a positive integer; a second batch normalization layer sub-module, configured to perform batch normalization on the convolution result obtained by the depth separation convolution layer sub-module; and a second limited linear unit layer sub-module, configured to perform limited linear processing on the data obtained by the second batch normalization layer sub-module to obtain the second feature mapping.
Optionally, the channel compression module includes: a second convolution layer sub-module, configured to determine (e·Min) as the number of input channels of the second convolution layer sub-module and perform Mout M × M convolutions on the second feature mapping, where Mout is a positive integer and represents the number of output channels of the channel compression module; and a third batch normalization layer sub-module, configured to perform batch normalization on the convolution result output by the second convolution layer sub-module.
Optionally, the channel expansion module includes: a first convolution batch processing layer sub-module, configured to determine (e·Min) as the number of output channels of the first convolution batch processing layer sub-module and to perform (e·Min) M × M convolutions together with batch normalization on the feature mapping input to the channel expansion module using the following formula, where M is a positive integer, e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the number of channels of the feature mapping:

z = (s·w/δ)·x + s·(b − m)/δ + t

and a first limited linear unit layer sub-module, configured to perform limited linear processing on the output data of the first convolution batch processing layer sub-module to obtain the first feature mapping; where z is the output data of the first convolution batch processing layer sub-module, w is the weight parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, b is the bias parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, x is the feature mapping of the image data, m is the preset mean parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, δ is its preset standard deviation parameter, s is its preset scale parameter, and t is its preset offset parameter.
Optionally, the depth separation convolution module includes: a depth separation convolution batch processing layer sub-module, configured to perform N × N depth separation convolution and batch normalization on the data input to the depth separation convolution module using the following formula, where N > M and N is a positive integer:

z1 = (s1·w1/δ1)·x1 + s1·(b1 − m1)/δ1 + t1

and a second limited linear unit layer sub-module, configured to perform limited linear processing on the output data of the depth separation convolution batch processing layer sub-module to obtain the second feature mapping; where z1 is the second feature mapping, w1 is the weight parameter of the depth separation convolution batch processing layer sub-module determined based on the first feature mapping, x1 is the first feature mapping, b1 is the bias parameter of the depth separation convolution batch processing layer sub-module determined based on the first feature mapping, m1 is its preset mean parameter, δ1 is its preset standard deviation parameter, s1 is its preset scale parameter, and t1 is its preset offset parameter.
Optionally, the channel compression module includes a second convolution batch processing layer sub-module, configured to determine (e · Min) as the number of input channels of the second convolution batch processing layer sub-module, perform Mout times of M × M convolution on data input to the channel compression module by using the following formula, and perform batch normalization, where Mout is a positive integer and represents the number of output channels of the channel compression module,
z2 = (s2·w2/δ2)·x2 + s2·(b2 − m2)/δ2 + t2

where z2 is the third feature mapping, w2 is the weight parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping, x2 is the second feature mapping, b2 is the bias parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping, m2 is its preset mean parameter, δ2 is its preset standard deviation parameter, s2 is its preset scale parameter, and t2 is its preset offset parameter.
Alternatively, M is 1 and N is 3.
Optionally, the convolution device further includes: a residual module, configured to calculate a sum of each data element of the feature map and each data element of the output data when the number of channels of the feature map input to the channel expansion module is equal to the number of channels of the output data of the channel compression module.
Optionally, the convolution device further includes: and the point-by-point convolution module is suitable for performing point-by-point convolution on the data input to the point-by-point convolution module.
To solve the foregoing technical problem, an embodiment of the present invention further provides a CNN network device, including an input layer module and a first convolution layer module connected to the input layer module, where the CNN network device further includes a convolution device configured to perform a convolution operation on the feature mapping of the image data output by the first convolution layer module, the convolution device being the convolution device described above.
Optionally, the CNN network device further includes: and the second convolution layer module is used for receiving the third feature mapping output by the convolution device and performing point-by-point convolution on the third feature mapping.
Optionally, the CNN network device further includes a third convolutional layer module connected to the second convolutional layer module, where the third convolutional layer module includes a plurality of cascaded third convolutional layer sub-modules, each third convolutional layer sub-module is configured to perform N × N convolution or M × M convolution with a sliding step size P, P is a positive integer greater than 1, and M, N is a positive integer.
Optionally, the CNN network device further includes a feature layer extraction module, which includes a plurality of cascaded feature layer extraction submodules, where each feature layer extraction submodule is configured to receive convolution results output by the second convolutional layer module and each third convolutional layer submodule, and perform N × N convolution on each convolution result to extract feature information of the image data.
In order to solve the above technical problem, an embodiment of the present invention further provides an image target detection apparatus, including: a feature extraction module adapted to extract feature information of image data based on the CNN network device; the prediction module is suitable for predicting a preset anchor point window based on the characteristic information to obtain a prediction result; and the suppression module is suitable for carrying out non-extreme suppression processing on the prediction result to obtain each detection target.
In order to solve the above technical problem, an embodiment of the present invention further provides a method for extracting features of an image, including: performing convolution operation on the feature mapping of the image data, and expanding the channel number of the feature mapping obtained by convolution to obtain a first feature mapping; performing depth separation convolution on the first feature mapping to obtain a second feature mapping; and performing convolution operation on the second feature mapping, and compressing the channel number of the data subjected to the convolution operation to obtain a third feature mapping, so that the channel number of the third feature mapping is smaller than the channel number of the first feature mapping.
Optionally, the performing convolution operation on the feature mapping of the image data and expanding the number of channels of the feature mapping obtained by convolution to obtain the first feature mapping includes determining (e · Min) as the number of channels of the first feature mapping, where e represents a preset expansion coefficient, e >1, and e and Min are positive integers, and Min represents the number of channels of the feature mapping, performing (e · Min) M × M convolution on the feature mapping to obtain a first convolution result, where M is a positive integer, performing batch normalization on the first convolution result to obtain a first normalization result, and performing limited linear processing on the first normalization result to obtain the first feature mapping.
Optionally, the performing the depth separation convolution on the first feature mapping to obtain the second feature mapping includes performing N × N depth separation convolution on the first feature mapping to obtain a second convolution result, where N is greater than M and N is a positive integer, performing batch normalization on the second convolution result to obtain a second normalization result, and performing limited linear processing on the second normalization result to obtain the second feature mapping.
Optionally, the performing convolution operation on the second feature mapping and compressing the number of output channels of the data after convolution operation includes determining Mout as the number of channels of the third feature mapping, where Mout is a positive integer, performing Mout times of M × M convolution on the second feature mapping to obtain a third convolution result, and performing batch normalization on the third convolution result to obtain the third feature mapping.
Optionally, the performing a convolution operation on the feature mapping of the image data and expanding the number of channels of the feature mapping obtained by convolution to obtain the first feature mapping includes: determining (e·Min) as the number of channels of the first feature mapping, where e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the number of channels of the feature mapping; performing (e·Min) M × M convolutions on the feature mapping together with batch normalization using the following formula, where M is a positive integer:

z = (s·w/δ)·x + s·(b − m)/δ + t

and performing limited linear processing on the batch-normalized output data to obtain the first feature mapping; where z is the first feature mapping, w is the weight parameter determined by the feature mapping, b is the bias parameter corresponding to the feature mapping, x is the feature mapping of the image data, m is a preset mean parameter, δ is a preset standard deviation parameter, s is a preset scale parameter, and t is a preset offset parameter.
Optionally, the performing depth separation convolution on the first feature mapping to obtain the second feature mapping includes: performing N × N depth separation convolution on the first feature mapping together with batch normalization using the following formula, where N > M and N is a positive integer:

z1 = (s1·w1/δ1)·x1 + s1·(b1 − m1)/δ1 + t1

and performing limited linear processing on the batch-normalized output data to obtain the second feature mapping; where z1 is the second feature mapping, w1 is the weight parameter corresponding to the first feature mapping, x1 is the first feature mapping, b1 is the bias parameter corresponding to the first feature mapping, m1 is a preset mean parameter, δ1 is a preset standard deviation parameter, s1 is a preset scale parameter, and t1 is a preset offset parameter.
Optionally, the performing convolution operation on the second feature mapping and compressing the number of output channels of the data after the convolution operation includes determining Mout as the number of channels of the third feature mapping, where Mout is a positive integer and represents the number of output channels of the channel compression module, performing Mout times of M × M convolution on the second feature mapping by using the following formula and performing batch normalization,
z2 = (s2·w2/δ2)·x2 + s2·(b2 − m2)/δ2 + t2

where z2 is the third feature mapping, w2 is the weight parameter determined by the second feature mapping, x2 is the second feature mapping, b2 is the bias parameter determined by the second feature mapping, m2 is a preset mean parameter, δ2 is a preset standard deviation parameter, s2 is a preset scale parameter, and t2 is a preset offset parameter.
Alternatively, M is 1 and N is 3.
Optionally, the feature extraction method further includes: when the number of channels of the feature map is equal to the number of channels of the third feature map, calculating a sum of each data element of the feature map and each data element of the third feature map to obtain a fourth feature map.
Optionally, the feature extraction method further includes: and performing point-by-point convolution on the fourth feature map to obtain a fifth feature map.
Optionally, the feature extraction method further includes: and performing point-by-point convolution on the third feature map to obtain a sixth feature map.
In order to solve the above technical problem, an embodiment of the present invention further provides an image target detection method, including: extracting feature information of the image data based on the feature extraction method of the image; predicting a preset anchor point window based on the characteristic information to obtain a prediction result; and carrying out non-extreme value suppression processing on the prediction result to obtain each detection target.
In order to solve the foregoing technical problem, an embodiment of the present invention further provides a terminal, including a memory and a processor, where the memory stores computer instructions executable on the processor, and the processor executes the computer instructions to perform the steps of the foregoing method.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
An embodiment of the present invention provides an image convolution device, including: a channel expansion module, configured to perform a convolution operation on the feature mapping of image data and expand the number of channels of the feature mapping obtained by convolution to obtain a first feature mapping; a depth separation convolution module, configured to perform depth separation convolution on the first feature mapping output by the channel expansion module to obtain a second feature mapping; and a channel compression module, configured to receive the second feature mapping output by the depth separation convolution module, perform a convolution operation on the second feature mapping, and compress the number of channels of the data after the convolution operation to obtain a third feature mapping, where the number of channels of the third feature mapping is smaller than that of the first feature mapping. The technical solution provided by the embodiment of the present invention performs convolution processing on the feature mapping of the image data: after the channel expansion module expands the number of channels of the data, the depth separation convolution operation is performed by the depth separation convolution module, so that more feature information can be extracted, and the number of channels of the third feature mapping obtained after the operation is then compressed. In this way, the scale of the convolution operation can be reduced while keeping high detection accuracy, the complexity of the convolution operation is lowered, and lightweight feature extraction on a mobile terminal becomes possible.
Further, an embodiment of the present invention provides a CNN network device, which includes an input layer module, a first convolution layer module connected to the input layer module, and a convolution device configured to perform a convolution operation on the feature mapping of the image data output by the first convolution layer module, the convolution device being the convolution device described above. Compared with the prior art, the CNN network device provided by the embodiment of the invention has a smaller convolution operation scale, making lightweight feature extraction easy to realize on a mobile terminal, and the smaller operation scale also reduces the computational complexity of CNN forward inference.
Further, an embodiment of the present invention provides an image target detection apparatus, including: a feature extraction module adapted to extract feature information of image data based on the CNN network device; the prediction module is suitable for predicting a preset anchor point window based on the characteristic information to obtain a prediction result; and the suppression module is suitable for carrying out non-extreme suppression processing on the prediction result to obtain each detection target. The target detection device provided by the embodiment of the invention adopts the convolution device with lower calculation complexity as the CNN basic network, so that the target detection complexity can be reduced on the premise of keeping higher detection precision, and the target detection device is favorable for being applied to mobile terminal equipment.
Further, (e·Min) is determined as the number of channels of the first feature mapping, where e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the number of channels of the feature mapping; the feature mapping is subjected to (e·Min) M × M convolutions together with batch normalization using the following formula:

z = (s·w/δ)·x + s·(b − m)/δ + t

and limited linear processing is performed on the batch-normalized output data to obtain the first feature mapping; where z is the first feature mapping, w is the weight parameter of the feature mapping, b is the bias parameter of the image data, x is the feature mapping of the image data, m is a preset mean parameter, δ is a preset standard deviation parameter, s is a preset scale parameter, and t is a preset offset parameter. With the technical solution provided by the embodiment of the present invention, when a CNN network is used for image processing, the batch normalization layer and the convolution layer associated with it can be merged, so that multiplication and division operations are reduced, the computational complexity of feature extraction is lowered, and the computation scale is reduced.
Drawings
FIG. 1 is a schematic structural diagram of a depth separable convolution module of a MobileNet network according to the prior art;
FIG. 2 is a schematic structural diagram of a convolution device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a specific structure of the convolution device shown in FIG. 2;
FIG. 4 is a schematic diagram of a functional decomposition of the convolution device shown in FIG. 3;
FIG. 5 is a schematic diagram of another embodiment of the convolution device shown in FIG. 2;
fig. 6 is a schematic structural diagram of a CNN network according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an image object detection apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a classification network according to an embodiment of the present invention;
FIG. 9 is a flowchart illustrating a method for extracting features of an image according to an embodiment of the present invention;
fig. 10 is a flowchart illustrating an image target detection method according to an embodiment of the present invention.
Detailed Description
As mentioned in the background, the prior art still has various disadvantages, and needs to optimize the convolution processing of the image and the target detection method.
Specifically, a deep learning method based on a Convolutional Neural Network (CNN) can be applied to the field of target detection. One category is detection methods based on target candidate windows, typically represented by the Faster Regions with CNN features (Faster R-CNN) detection method. Its main principle is to compute a number of target candidate windows on the shared image convolution features using a Region Proposal Network (RPN), and then classify and regress the feature information inside the target candidate windows to obtain target category information and position information, thereby completing the target detection task.
The Faster R-CNN based detection method can achieve high detection accuracy. However, since the target candidate windows are obtained by relying on the Region Proposal Network (RPN), the detection time is long, and the method is not suitable for scenarios with high real-time requirements.
The other category is candidate-window-independent (Proposal Free) detection methods; typical candidate-window-independent methods mainly include the Single Shot MultiBox Detector (SSD) detection method and the You Only Look Once (YOLO) detection method. The SSD and YOLO detection methods do not need to additionally compute target candidate windows and have no corresponding feature resampling process. During target detection, SSD and YOLO directly preset a number of anchor point windows (Anchor Boxes) with different scales and aspect ratios over the whole image area; at detection time the whole CNN network only needs one forward pass, after which the confidence of the target category is computed for each anchor point window, and offsets are applied on the basis of the anchor point windows to obtain the accurate target position. Compared with YOLO, the main difference of SSD is that SSD extracts more complete multi-scale image information for prediction, so SSD has higher detection accuracy.
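For illustration only, a small sketch of presetting anchor point windows with several scales and aspect ratios at every cell of a feature map is given below in Python; the particular scales, aspect ratios and feature-map size are assumed values and are not taken from SSD, YOLO or the present disclosure.

```python
import itertools
import math

def make_anchor_windows(feature_size, scales=(0.1, 0.2), aspect_ratios=(1.0, 2.0, 0.5)):
    """Return anchor point windows as (cx, cy, w, h) in relative [0, 1] image coordinates,
    one set of scales x aspect ratios centred at every cell of a feature_size x feature_size map."""
    anchors = []
    for row, col in itertools.product(range(feature_size), repeat=2):
        cx, cy = (col + 0.5) / feature_size, (row + 0.5) / feature_size
        for s, ar in itertools.product(scales, aspect_ratios):
            anchors.append((cx, cy, s * math.sqrt(ar), s / math.sqrt(ar)))
    return anchors

anchors = make_anchor_windows(feature_size=5)
print(len(anchors))  # 5 * 5 cells x 2 scales x 3 aspect ratios = 150 anchor windows
```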
In the prior art, the YOLO-based detection method relies on a small number of feature maps for classification and regression, loses more of the available information, performs poorly on small targets, and has low localization accuracy for targets.
The SSD-based detection method uses multiple feature maps for classification and regression; compared with YOLO, it works better on small targets and improves the localization accuracy of targets. Specifically, when the SSD detector is used to detect targets, information from multiple feature maps can be selected to predict the preset anchor point windows based on the forward-propagating convolutional neural network, and post-processing such as Non-Maximum Suppression (NMS) is performed to obtain the final detection result. The predicted variables include the confidence of the target category and the offset of the target position. The classic SSD detector uses the VGG16 classification network of the Visual Geometry Group at the University of Oxford as its base CNN network, which is computationally complex and not suitable for mobile terminals or embedded devices.
Further, the industry has proposed an improved SSD detector that uses a Mobile Network (MobileNet) as the Base Network. The MobileNet network uses a depth separable convolution module as shown in FIG. 1. The depth separable convolution module 100 includes a depth separation convolution module 101 and a 1 × 1 convolution module 102. The depth separation convolution module 101 is composed of a 3 × 3 depth separation convolution layer, a batch normalization layer, and a restricted linear unit layer; the 1 × 1 convolution module 102 is composed of a 1 × 1 convolution layer, a batch normalization layer, and a restricted linear unit layer. Compared with a standard convolution layer, the computational complexity of the depth separable convolution module 100 can generally be reduced by an order of magnitude, and a convolution network constructed from the depth separable convolution module 100 can still maintain high accuracy; a detailed description can be found in reference [1]: A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," arXiv:1704.04861, 2017.
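For context, a minimal PyTorch-style sketch of this prior-art depth separable convolution module (3 × 3 depth separation convolution followed by 1 × 1 convolution, each with batch normalization and a restricted linear unit) might look as follows; the channel counts are assumed, and ReLU6 is used as the restricted linear unit by convention rather than by any requirement of this disclosure.

```python
import torch
import torch.nn as nn

def depth_separable_block(in_channels: int, out_channels: int, stride: int = 1) -> nn.Sequential:
    """Prior-art style block: 3x3 depthwise conv (groups=in_channels) + BN + ReLU6,
    followed by 1x1 pointwise conv + BN + ReLU6."""
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, 3, stride=stride, padding=1, groups=in_channels, bias=False),
        nn.BatchNorm2d(in_channels),
        nn.ReLU6(inplace=True),
        nn.Conv2d(in_channels, out_channels, 1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU6(inplace=True),
    )

# Rough multiplication count per output position, illustrating the order-of-magnitude
# saving of a depth separable block over a standard 3x3 convolution (assumed sizes).
cin, cout, k = 64, 128, 3
standard = k * k * cin * cout        # 73,728 multiplications
separable = k * k * cin + cin * cout # 8,768 multiplications (about 8.4x fewer)
print(standard, separable)

x = torch.randn(1, cin, 32, 32)
print(depth_separable_block(cin, cout)(x).shape)  # torch.Size([1, 128, 32, 32])
```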
An embodiment of the present invention provides an image convolution device, including: a channel expansion module, configured to perform a convolution operation on the feature mapping of image data and expand the number of channels of the feature mapping obtained by convolution to obtain a first feature mapping; a depth separation convolution module, configured to perform depth separation convolution on the first feature mapping output by the channel expansion module to obtain a second feature mapping; and a channel compression module, configured to receive the second feature mapping output by the depth separation convolution module, perform a convolution operation on the second feature mapping, and compress the number of channels of the data after the convolution operation to obtain a third feature mapping, where the number of channels of the third feature mapping is smaller than that of the first feature mapping.
The technical solution provided by the embodiment of the present invention performs convolution processing on the feature mapping of the image data: after the channel expansion module expands the number of channels of the data, the depth separation convolution operation is performed by the depth separation convolution module, so that more feature information can be extracted, and the number of channels of the third feature mapping obtained after the operation is then compressed. In this way, the scale of the convolution operation can be reduced while keeping high detection accuracy, the complexity of the convolution operation is lowered, and lightweight feature extraction on a mobile terminal becomes possible.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Fig. 2 is a schematic structural diagram of a convolution device according to an embodiment of the present invention. The convolution device 200 may be used in a CNN network as a convolution layer of the CNN network and performs convolution operation on input data. The convolution device 200 may include a channel expansion module 201, a depth separation convolution module 202, and a channel compression module 203.
In an implementation, the channel expansion module 201 may be configured to perform a convolution operation on a feature map (also called a feature mapping) of the image data input to the convolution device 200. The feature map is obtained by convolving the original image data, and its dimensions are [height, width, number of channels]. Typically, the number of channels of the feature map is much higher than the number of channels of the image data.
Those skilled in the art understand that, for the CNN network, the convolutional layer parameters include the number of convolutional kernels, step size and padding (padding), which together determine the size of the feature map output by the convolutional layer, and are important hyper-parameters of the CNN network. Where the number of convolution kernels can be specified as an arbitrary value smaller than the size of the input image, the larger the number of convolution kernels, the more complex the extractable input features. The number of channels of the first feature mapping is increased, and more feature information of the image can be extracted. To extract more feature information of the image data, the channel expansion module 201 may expand the number of channels of the data (i.e., the first feature map) obtained by convolution to obtain the first feature map with more feature information.
The depth separation convolution module 202 may receive the first feature map from the channel expansion module 201, and perform depth separation convolution on the first feature map output by the channel expansion module 201 to obtain a second feature map. The specific convolution operation step of the deep separation convolution can be referred to in reference [1], and is not described in detail here.
The channel compression module 203 may receive the second feature map output by the depth separation convolution module 202, perform convolution operation on the second feature map, and compress the output channel number of the data after the convolution operation to obtain a third feature map, so that the channel number of the third feature map is smaller than the channel number of the feature map, so as to extract significant feature information, and reduce the dimensionality of features (i.e., the third feature map) to reduce the amount of computation.
As a non-limiting example, referring to fig. 3, the channel expansion module 201 may include: a first convolution layer submodule 2011, a first batch normalization layer submodule 2012, and a first restricted linear unit layer submodule 2013.
Specifically, the first convolution layer submodule 2011 may perform (e · Min) M × M convolutions on the feature map input to the channel expansion module 201, where M is a positive integer, e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the number of channels of the feature map of the image data. In general, M = 1, so that point-by-point convolution is performed on the feature map, which is beneficial to reducing the computational complexity.
In a specific implementation, the depth separation convolution module 202 may include a depth separation convolution layer submodule 2021, a second batch normalization layer submodule 2022, and a second restricted linear unit layer submodule 2023. Specifically, the depth separation convolution layer submodule 2021 may be configured to perform N × N depth separation convolution on the first feature map, where N > M and N is a positive integer, for example N = 3. The second batch normalization layer submodule 2022 may be configured to perform batch normalization on the convolution result obtained by the depth separation convolution layer submodule 2021. The second restricted linear unit layer submodule 2023 may be configured to perform restricted linear processing on the data obtained by the second batch normalization layer submodule 2022 to obtain the second feature map. The depth separation convolution module 202 can reduce the computational complexity of the convolution apparatus 200 while maintaining high precision.
In a specific implementation, the channel compression module 203 may include a second convolutional layer submodule 2031 and a third batch normalization layer submodule 2032. Specifically, the second convolutional layer submodule 2031 may be configured to determine (e · Min) as the number of input channels of the second convolutional layer submodule 2031 and to perform Mout M × M convolutions on the second feature map; preferably, M = 1. The third batch normalization layer submodule 2032 may be configured to perform batch normalization on the convolution result output by the second convolutional layer submodule 2031 to obtain the third feature map, where the number of channels Mout of the third feature map is smaller than the number of channels (e · Min) of the first feature map.
Further, the convolution apparatus 200 may further include a residual module 204. In a specific implementation, when the number of channels of the feature map of the image data input to the channel expansion module 201 is equal to the number of channels of the third feature map output by the channel compression module 203, the residual module 204 may be configured to calculate the sum of each data element of the feature map and each data element of the third feature map. As a variation, when the number of channels of the feature map input to the channel expansion module 201 is not equal to the number of channels of the third feature map output by the channel compression module 203, the convolution device 200 does not include the residual module 204. Those skilled in the art understand that the residual module 204 can reduce the training difficulty of the CNN network, improve the generalization ability of the model, improve the efficiency of the deep neural network during back propagation, and effectively avoid gradient vanishing.
As a preferred embodiment, M = 1 and N = 3. In this case, the function of each module and/or sub-module in the convolution device 200 may be as shown in fig. 4. Referring to fig. 4, the channel expansion module 201 may be configured to perform 1 × 1 convolution, batch normalization and restricted linear processing; the depth separation convolution module 202 may be configured to perform 3 × 3 depth separation convolution, batch normalization and restricted linear processing; and the channel compression module 203 may be configured to perform 1 × 1 convolution and batch normalization. When the number of channels of the feature map of the image data is equal to the number of channels of the third feature map, the convolution device 200 may include the residual module 204, and the residual module 204 may add each data element of the feature map to the corresponding data element of the third feature map output by the channel compression module 203 to obtain the output result of the convolution device 200.
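For illustration only, the following is a minimal sketch of such an expansion–depthwise–compression block, written with PyTorch (an assumption; the patent does not prescribe any framework). The module names, the choice of ReLU6 as the restricted linear unit, and the parameter values are illustrative and are not part of the original disclosure.

```python
import torch
import torch.nn as nn

class ExpandDepthwiseCompressBlock(nn.Module):
    """Sketch of the convolution device: 1x1 channel expansion -> 3x3 depth separation
    convolution -> 1x1 channel compression, with an optional residual add."""
    def __init__(self, min_channels: int, mout_channels: int, e: int = 6, stride: int = 1):
        super().__init__()
        mid = e * min_channels                      # expanded channel count (e * Min)
        self.expand = nn.Sequential(                # channel expansion module: 1x1 conv + BN + ReLU6
            nn.Conv2d(min_channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU6(inplace=True),
        )
        self.depthwise = nn.Sequential(             # depth separation convolution module: 3x3, groups=channels
            nn.Conv2d(mid, mid, kernel_size=3, stride=stride, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU6(inplace=True),
        )
        self.compress = nn.Sequential(              # channel compression module: 1x1 conv + BN, no activation
            nn.Conv2d(mid, mout_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(mout_channels),
        )
        # residual module: only when input and output channel counts (and spatial size) match
        self.use_residual = (stride == 1 and min_channels == mout_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.compress(self.depthwise(self.expand(x)))
        return x + out if self.use_residual else out

# Example: a [1, 32, 56, 56] feature map expanded to 192 channels internally, compressed back to 32.
block = ExpandDepthwiseCompressBlock(min_channels=32, mout_channels=32, e=6)
print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])
```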
In a specific implementation, the data dimensions of the feature map of the image data input to the convolution device 200 may be the three-dimensional shape [Fh, Fw, Min], where Fh represents the height of the feature map, Fw represents the width of the feature map, Min represents the number of channels of the feature map, and Fh, Fw and Min are all positive integers. If the expansion coefficient is e, with e > 1, and the feature map input to the convolution device 200 has dimensions [Fh, Fw, Min], then the weight dimensions of the first convolution layer sub-module 2011 in the channel expansion module 201 can be expressed as [1, 1, Min, e × Min]. With reference to fig. 4, the first feature map obtained by performing 1 × 1 convolution on the feature map of the image data has dimensions [Fh, Fw, e × Min]; that is, the expansion coefficient expands the number of channels fed into the depth separation convolution module 202. The second feature map output by the depth separation convolution module 202 has dimensions [Fh, Fw, e × Min], and the channel compression module 203 performs 1 × 1 convolution on it and compresses the number of channels, so that the third feature map output by the channel compression module 203 has dimensions [Fh, Fw, Mout], with Mout smaller than e × Min.
Specifically, the number of input channels and the number of output channels of all modules (including sub-modules) of the convolution device 200 can be multiplied by a multiplicative coefficient β; that is, when the output data dimensions of a certain module are [Fh, Fw, Mout], they become [Fh, Fw, Mout × β]. By default β = 1. When β is changed, the number of parameters of the convolution device 200 changes accordingly, and the corresponding amount of computation changes with it. In a specific implementation, the value of β can be determined by weighing the model size, the computational complexity and the recognition accuracy of the convolution device 200 and of the CNN network device in which it is located.
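As a rough illustration (not taken from the patent), the effect of the multiplicative coefficient β on the parameter count of a single 1 × 1 convolution can be worked out as follows; the channel numbers are assumed values.

```python
# Parameters of a 1x1 convolution: in_channels * out_channels (bias ignored).
# Scaling both channel counts by beta scales the parameter count by roughly beta**2.
def conv1x1_params(in_ch: int, out_ch: int, beta: float = 1.0) -> int:
    return round(in_ch * beta) * round(out_ch * beta)

base = conv1x1_params(192, 32)            # beta = 1.0 -> 6144 parameters
slim = conv1x1_params(192, 32, beta=0.5)  # beta = 0.5 -> 1536 parameters (about 1/4)
print(base, slim)
```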
Further, the convolution device 200 contains many convolution operations and batch normalization operations performed on the convolution results. Implementing batch normalization requires additional multiplication and division operations, which is time-consuming. Considering that the parameters of each batch normalization layer sub-module are fixed once the convolution device 200 has completed training, the convolution operation and the batch normalization operation can be merged to simplify the computation. Specifically, formula (1) is the convolution operation, where x is the input data, w is the weight parameter of the sub-module used for the convolution operation, b is the bias parameter of that sub-module, and y is the output of the convolution operation. Formula (2) is the batch normalization operation, where m is the mean parameter of the trained batch normalization layer sub-module, δ is its standard deviation parameter, s is its scale parameter, and t is its offset parameter. The merged result is shown in formula (3), where z is the output of the sub-module obtained by merging the convolution operation and the batch normalization. With formula (3), the parameter computation can be done off-line:

y = w·x + b        (1)

z = s·(y − m)/δ + t        (2)

z = (s·w/δ)·x + s·(b − m)/δ + t        (3)

In a specific implementation, when formula (3) is used to perform convolution and batch normalization, w represents the weight parameter corresponding to each module or sub-module, z represents the merged result, b represents the bias parameter determined by the feature mapping, and m, δ, s and t are fixed values representing the preset parameters of each module or sub-module: m represents a preset mean parameter, δ a preset standard deviation parameter, s a preset scale parameter, and t a preset offset parameter.
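A minimal numerical sketch of this off-line merging follows, using PyTorch tensors (an assumption; the parameter names follow formulas (1)–(3) rather than any particular framework's batch normalization attributes).

```python
import torch

def fold_batchnorm(w, b, m, delta, s, t):
    """Merge a convolution (w, b) with a following batch normalization (m, delta, s, t)
    into an equivalent convolution (w_merged, b_merged), per formula (3)."""
    # w: [out_channels, in_channels, kH, kW]; b, m, delta, s, t: [out_channels]
    scale = s / delta                               # s/δ, one factor per output channel
    w_merged = w * scale.reshape(-1, 1, 1, 1)       # (s·w/δ)
    b_merged = scale * (b - m) + t                  # s·(b − m)/δ + t
    return w_merged, b_merged

# Check the merged parameters against convolution followed by batch normalization.
out_c, in_c = 4, 3
w = torch.randn(out_c, in_c, 1, 1)
b = torch.randn(out_c)
m, delta, s, t = torch.randn(out_c), torch.rand(out_c) + 0.5, torch.randn(out_c), torch.randn(out_c)
x = torch.randn(1, in_c, 8, 8)

y = torch.nn.functional.conv2d(x, w, b)                                   # formula (1)
z_ref = (s.reshape(1, -1, 1, 1) * (y - m.reshape(1, -1, 1, 1))
         / delta.reshape(1, -1, 1, 1) + t.reshape(1, -1, 1, 1))            # formula (2)
w2, b2 = fold_batchnorm(w, b, m, delta, s, t)
z = torch.nn.functional.conv2d(x, w2, b2)                                  # formula (3)
print(torch.allclose(z, z_ref, atol=1e-5))                                 # True
```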
Based on the above optimization method, the convolution device 200 can be simplified. Specifically, referring to fig. 5, the channel expansion module 201 may include: a first convolution batch processing layer sub-module 2011' and a first restricted linear unit layer sub-module 2012'.
In a specific implementation, the first convolution batch processing layer sub-module 2011' determines (e · Min) as its number of output channels and performs (e · Min) M × M convolutions together with batch normalization on the feature mapping input to the channel expansion module using the following formula, where M is a positive integer, e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the number of channels of the feature mapping:

z = (s·w/δ)·x + s·(b − m)/δ + t
in a specific implementation, z is output data of the first convolution batch layer sub-module 2011 ', w is a weight parameter of the first convolution batch layer sub-module corresponding to the feature mapping, b is a bias parameter of the first convolution batch layer sub-module 2011' corresponding to the feature mapping, x is the feature mapping of the image data, m is a preset mean parameter of the channel expansion module 201, δ is a preset standard deviation parameter of the channel expansion module 201, s is a preset scale parameter of the channel expansion module 201, and t is a preset bias parameter of the channel expansion module 201.
The first restricted linear unit layer sub-module 2012' may be configured to perform restricted linear processing on the output data of the first convolution batch processing layer sub-module 2011' to obtain the first feature map.
In a specific implementation, the depth separation convolution module 202 may include: a depth separation convolution batch processing layer sub-module 2021' and a second restricted linear unit layer sub-module 2022'.
Specifically, the depth separation convolution batch processing layer sub-module 2021' may be configured to perform N × N depth separation convolution and batch normalization on the data input to the depth separation convolution module using the following formula, where N > M and N is a positive integer:

z1 = (s1·w1/δ1)·x1 + s1·(b1 − m1)/δ1 + t1

where z1 is the second feature mapping, w1 is the weight parameter of the depth separation convolution batch processing layer sub-module determined based on the first feature mapping, x1 is the first feature mapping, b1 is the bias parameter of the depth separation convolution batch processing layer sub-module determined based on the first feature mapping, m1 is a preset mean parameter of the depth separation convolution module, δ1 is a preset standard deviation parameter of the depth separation convolution module, s1 is a preset scale parameter of the depth separation convolution module, and t1 is a preset offset parameter of the depth separation convolution module. The second restricted linear unit layer sub-module 2022' may be configured to perform restricted linear processing on the output data of the depth separation convolution batch processing layer sub-module to obtain the second feature map.
In a specific implementation, the channel compression module 203 may include a second convolution batch processing layer sub-module 2031'. Specifically, the second convolution batch processing layer sub-module 2031' may be configured to determine (e · Min) as its number of input channels and to perform Mout M × M convolutions together with batch normalization on the data input to the channel compression module using the following formula, where Mout is a positive integer and represents the number of output channels of the channel compression module:

z2 = (s2·w2/δ2)·x2 + s2·(b2 − m2)/δ2 + t2

where z2 is the third feature mapping, w2 is the weight parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping, x2 is the second feature mapping, b2 is the bias parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping, m2 is a preset mean parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping, δ2 is a preset standard deviation parameter, s2 is a preset scale parameter, and t2 is a preset offset parameter of the second convolution batch processing layer sub-module determined based on the second feature mapping. When M = 1, the M × M convolution is a point-by-point convolution, which reduces the amount of computation.
In a specific implementation, the convolution apparatus 200 may further include: a residual block 204. Specifically, when the number of channels of the feature map input to the channel expansion module 201 is equal to the number of channels of the output data of the channel compression module 203, the residual module 204 may calculate the sum of each data element of the feature map and each data element of the output data.
As a preferred embodiment, M is 1, and N is 3, which may specifically refer to the embodiment shown in fig. 4 and will not be described herein again.
Further, when the convolution device 200 does not include the residual module 204, the convolution device 200 may further include a point-by-point convolution module (not shown). The point-by-point convolution module may be located after the channel compression module 203 and perform point-by-point convolution on the data output by the channel compression module 203. As a variation, when the convolution device 200 includes the residual module 204, the convolution device 200 may further include a point-by-point convolution module (not shown) located after the residual module 204. The point-by-point convolution module may perform point-by-point convolution on the data output by the residual module 204 to further reduce the dimensionality of the output data and reduce the computational complexity.
Fig. 6 is a schematic structural diagram of a CNN network device according to an embodiment of the present invention. Referring to fig. 6, the CNN network device 300 may include an input layer module 301, a first convolution layer module 302 connected to the input layer module 301, and the convolution device 200 shown in fig. 2 to 5. The convolution device 200 can perform convolution operation on the image data output by the first convolution layer module 302 to extract feature information and reduce data dimensionality.
In a specific implementation, the CNN network device 300 may further include a second convolutional layer module 303. The second convolutional layer module 303 may receive the image data output by the convolutional device and perform point-by-point convolution on the image data.
In a specific embodiment, the second convolutional layer module 303 may further be connected to a third convolutional layer module 304. The third convolutional layer module 304 may include a plurality of cascaded third convolutional layer submodules 3041, and each third convolutional layer submodule 3041 may be configured to perform N × N convolution or M × M convolution with a sliding step size P, where P is a positive integer greater than 1 and M, N are positive integers. For example, the third convolutional layer submodule 3041 may perform 3 × 3 convolution with a sliding step size of 2. The sliding step size refers to the distance the convolution kernel moves between two adjacent scan positions on the feature map: when the sliding step size is 1, the convolution kernel scans the elements of the feature map one by one; when the sliding step size is n, it skips (n − 1) pixels in the next scan.
Further, the CNN network apparatus 300 may further include an extracted feature layer module 305. The extracted feature layer module 305 may include a plurality of cascaded extracted feature layer submodules 3051, and each extracted feature layer submodule 3051 may be configured to receive the convolution results output by the second convolutional layer module 303 and by each third convolutional layer submodule 3041, and to perform N × N convolution on each convolution result to extract feature information of the image data, for example with N equal to 3.
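As a non-limiting sketch of the structure just described, each third convolutional layer submodule may halve the spatial size of the feature map with a stride-2 convolution, and each extracted feature layer submodule may produce a prediction from one scale. The channel counts and the number of cascaded submodules below are illustrative assumptions only:

```python
import torch.nn as nn

# Illustrative channel counts for a small feature pyramid; each stride-2 convolution
# halves the spatial size of the feature map (sliding step size 2).
channels = [512, 256, 256, 128]
third_conv_submodules = nn.ModuleList([
    nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)   # 3 x 3 convolution, sliding step 2
    for c_in, c_out in zip(channels[:-1], channels[1:])
])
extracted_feature_submodules = nn.ModuleList([
    nn.Conv2d(c, 24, kernel_size=3, padding=1)   # N x N convolution with N = 3; 24 = p + 4 with p = 20 classes
    for c in channels
])
```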
Those skilled in the art will understand that, in the CNN network device 300, the convolution operation and batch normalization may also be combined by using the above optimization method to reduce the computational complexity, and will not be described in detail here.
Fig. 7 is a schematic structural diagram of an image target detection apparatus according to an embodiment of the present invention. The target detection apparatus 400 enables multi-target detection to be performed on a mobile terminal.
Specifically, the object detection apparatus 400 may include: a feature extraction module 401 adapted to extract feature information of image data based on the CNN network apparatus 300 shown in fig. 6; a prediction module 402 adapted to predict a preset anchor point window based on the feature information to obtain a prediction result; and a suppression module 403 adapted to perform Non-Maximum Suppression (NMS) processing on the prediction result to obtain each detection target.
Those skilled in the art will understand that a target detection device 400 based on the CNN network device generally takes the basic CNN network cut out from a classification network as the feature extraction module 401 for feature extraction. Specifically, target detection may be performed on a forward-propagation CNN network by selecting the information of a plurality of extracted feature submodules to predict the preset anchor point windows, where the prediction variables include the confidence of the target category and the offset of the target position, and then performing non-maximum suppression to obtain the final detection result.
Fig. 8 is a schematic diagram of a classification network according to an embodiment of the present invention. As shown in fig. 8, the classification network 500 is used to train a basic CNN network 501. As a non-limiting example, the basic CNN network 501 may include a 3 × 3 convolutional layer module 5011, a plurality of cascaded convolution devices 5012, and a 1 × 1 convolutional layer module 5013. It should be noted that, in a specific implementation, the sliding step size of the depth separation convolution module in a cascaded convolution device 5012 may be 1 or 2 when performing the N × N depth separation convolution, and the spatial scale of the depth separation convolution result is reduced if the sliding step size is greater than 1.
The basic CNN network 501 may be pre-trained on the image network database (ImageNet) data set, which may specifically refer to the prior art and is not described in detail here.
After the classification network has been pre-trained, the basic CNN network 501 may be cut out of it for use in a detection device. The number of convolution devices 5012 in the basic CNN network 501 may be adjusted according to the specific task. It should be noted that, in order to obtain high-resolution convolution feature results, the output data of some of the cascaded convolution devices 5012 may be used as high-resolution convolution feature layers in subsequent processing modules (not shown).
Those skilled in the art will appreciate that after pre-training is completed and the basic CNN network 501 is obtained, other modules may be added to it to obtain the target detection device, which may then be trained.
The training objective function of the target detection device can cover a plurality of object classes, so that objects of multiple classes can be detected simultaneously. Specifically, let

f_{ij}^{p} \in \{0, 1\}

be an indicator of the matching result between the i-th anchor point window and the j-th annotation window of target class p. If the overlap rate of the two windows is higher than a threshold T0, f_{ij}^{p} is 1; otherwise it is 0. The matching policy allows

\sum_{i} f_{ij}^{p} \geq 1,

so that multiple anchor point windows can match the same annotation window. The trained global target loss function is a weighted sum of the confidence loss function and the localization loss function, as shown in equation (4):

L(f, c, t, g) = \frac{1}{N}\left(L_{conf}(f, c) + \alpha\, L_{loc}(f, t, g)\right)    (4)

where N is the number of matched anchor point windows (if N is 0, the target loss is set to 0), α is the weight coefficient of the localization loss, f denotes the indicator vector, c denotes the confidence vector, t denotes the prediction window position vector, g denotes the target annotation window vector, L_{conf}(f, c) denotes the confidence loss function, and L_{loc}(f, t, g) denotes the localization loss function.
In a specific implementation, the confidence loss function is a softmax loss computed over the confidences of a plurality of classes, as shown in equations (5) and (6):

L_{conf}(f, c) = -\sum_{i \in Pos} f_{ij}^{p} \log(\hat{c}_{i}^{p}) - \sum_{i \in Neg} \log(\hat{c}_{i}^{0})    (5)

\hat{c}_{i}^{p} = \frac{\exp(c_{i}^{p})}{\sum_{p} \exp(c_{i}^{p})}    (6)

where log denotes the logarithmic function, exp denotes the exponential function, and \hat{c}_{i}^{p} is the confidence that the i-th prediction window belongs to target class p. Pos denotes the positive sample set and Neg denotes the negative sample set. An anchor point window whose overlap rate with all target annotation windows is less than T0 is a negative sample. p = 0 denotes the background class, i.e., the negative sample class.
In a specific implementation, the localization loss function quantitatively estimates the difference between the prediction window and the target annotation window. Before the localization loss function is calculated, the target annotation window is encoded with respect to the anchor point window, as shown in equation (7):

\hat{g}_{j}^{cx} = \frac{g_{j}^{cx} - d_{i}^{cx}}{d_{i}^{w}}, \quad \hat{g}_{j}^{cy} = \frac{g_{j}^{cy} - d_{i}^{cy}}{d_{i}^{h}}, \quad \hat{g}_{j}^{w} = \log\frac{g_{j}^{w}}{d_{i}^{w}}, \quad \hat{g}_{j}^{h} = \log\frac{g_{j}^{h}}{d_{i}^{h}}    (7)

where (d_{i}^{cx}, d_{i}^{cy}, d_{i}^{w}, d_{i}^{h}) are the abscissa and ordinate of the center position, the width and the height of the i-th anchor point window; (g_{j}^{cx}, g_{j}^{cy}, g_{j}^{w}, g_{j}^{h}) are the abscissa and ordinate of the center position, the width and the height of the j-th target annotation window; and (\hat{g}_{j}^{cx}, \hat{g}_{j}^{cy}, \hat{g}_{j}^{w}, \hat{g}_{j}^{h}) are the abscissa and ordinate of the center position, the width and the height of the j-th target annotation window after encoding.
The localization loss function may then be calculated using the smoothed first-order norm, as shown in equation (8):

L_{loc}(f, t, g) = \sum_{i \in Pos} \sum_{m \in \{cx, cy, w, h\}} f_{ij}^{p} \, H_{L1}(t_{i}^{m} - \hat{g}_{j}^{m})    (8)

where m ∈ {cx, cy, w, h} indexes the window position parameters, namely the abscissa and ordinate of the center position, the width and the height; t_{i}^{m} is the m-th position parameter of the i-th prediction window; and \hat{g}_{j}^{m} is the m-th position parameter of the encoded j-th target annotation window. The smoothed first-order norm H_{L1} is given by equation (9):

H_{L1}(x) = \begin{cases} 0.5\,x^{2}, & |x| < 1 \\ |x| - 0.5, & |x| \geq 1 \end{cases}    (9)
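As a non-limiting illustrative sketch, equations (7) to (9) may be implemented as follows; the center-form box layout, the PyTorch library, and the function names are assumptions made only for illustration:

```python
import torch

def encode(anchors, gt):
    """Equation (7): encode annotation windows gt with respect to matched anchor windows.
    Both tensors have shape [N, 4] in (cx, cy, w, h) form."""
    g_cx = (gt[:, 0] - anchors[:, 0]) / anchors[:, 2]
    g_cy = (gt[:, 1] - anchors[:, 1]) / anchors[:, 3]
    g_w = torch.log(gt[:, 2] / anchors[:, 2])
    g_h = torch.log(gt[:, 3] / anchors[:, 3])
    return torch.stack([g_cx, g_cy, g_w, g_h], dim=1)

def smooth_l1(x):
    """Equation (9): smoothed first-order norm H_L1."""
    return torch.where(x.abs() < 1, 0.5 * x ** 2, x.abs() - 0.5)

def localization_loss(pred, anchors, gt):
    """Equation (8): summed over positive anchors and the four position parameters."""
    return smooth_l1(pred - encode(anchors, gt)).sum()
```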
Those skilled in the art will appreciate that training the target detection device may use the training data as input, forward-propagate the entire network, and calculate the loss value according to equation (4); the model parameters of the whole network are then updated through back propagation. In a specific implementation, iterative optimization may be performed using the Stochastic Gradient Descent (SGD) method to obtain the model parameters. Further, after training is completed, target detection may be performed on new images using the trained model parameters.
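A minimal training-loop sketch, assuming a PyTorch model `detector` that returns confidences and window predictions, a data loader, a hypothetical anchor-matching helper `match_anchors`, and the loss functions sketched above (all of these names are illustrative assumptions rather than part of the disclosed embodiments):

```python
import torch

# detector, train_loader and match_anchors are assumed helpers: match_anchors pairs the
# preset anchor point windows of one image with its annotation windows and returns the
# class labels, positive/negative masks and the matched (anchor, ground-truth) box pairs.
optimizer = torch.optim.SGD(detector.parameters(), lr=1e-3, momentum=0.9)
alpha = 1.0                                                # weight of the localization loss

for image, annotations in train_loader:                    # batch size 1 for simplicity
    conf, loc = detector(image)                            # forward propagation
    labels, pos, neg, matched_anchors, matched_gt = match_anchors(annotations)
    n = max(int(pos.sum()), 1)                             # number N of matched anchor windows
    loss = (confidence_loss(conf, labels, pos, neg)
            + alpha * localization_loss(loc[pos], matched_anchors, matched_gt)) / n  # equation (4)
    optimizer.zero_grad()
    loss.backward()                                        # back propagation
    optimizer.step()                                       # stochastic gradient descent update
```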
In a specific implementation, with reference to fig. 6 and fig. 7, each third convolutional layer submodule 3041 in the third convolutional layer module 304 can be used to perform a 3 × 3 convolution with a sliding step size of 2 followed by a point-by-point convolution, so that the data dimensionality of the third convolutional layer submodules 3041 is gradually reduced and their output results correspond to different data dimensions.
For example, taking fig. 6 as an example, the output data Xi of a third convolutional layer submodule 3041 has data dimensions [Hi, Wi, Ci], whose values respectively represent the height, width, and number of channels of the output data Xi. The data of the corresponding extracted feature layer submodule 3051 is Fi, with data dimensions [Kh, Kw, Ci, p + 4], where Kh, Kw and Ci respectively represent the height, width, and number of input channels of the extracted feature layer submodule 3051, (p + 4) is its number of output channels, p represents the number of object categories, and 4 corresponds to the four position parameters of an object. Convolving Xi with Fi yields the prediction data Yi, with data dimensions [Hi, Wi, p + 4].
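A non-limiting sketch of this prediction-head convolution; the concrete sizes (Hi, Wi, Ci, p) and the PyTorch library are illustrative assumptions, and PyTorch stores the channel dimension first rather than last:

```python
import torch
import torch.nn as nn

# Illustrative sizes: one selected feature layer with Hi = Wi = 19 and Ci = 512 channels,
# p = 20 object categories plus 4 position parameters per anchor shape.
Hi, Wi, Ci, p = 19, 19, 512, 20
Xi = torch.randn(1, Ci, Hi, Wi)                      # output of a third convolutional layer submodule
Fi = nn.Conv2d(Ci, p + 4, kernel_size=3, padding=1)  # extracted feature layer submodule (Kh = Kw = 3)
Yi = Fi(Xi)                                          # prediction data
print(Yi.shape)                                      # [1, p + 4, Hi, Wi] (channels-first layout)
```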
Since objects in an actual scene have different scales and aspect ratios, several anchor point windows may be generated at every position of the selected third convolutional layer submodules 3041. The target scale parameter s_k can thus be calculated from the index k of the selected third convolutional layer submodule 3041, as shown in equation (11):

s_{k} = s_{min} + \frac{s_{max} - s_{min}}{m - 1}\,(k - 1), \quad k \in [1, m]    (11)

where s_min is the minimum scale, s_max is the maximum scale, m is the number of selected third convolutional layer submodules 3041, and s_k is the target scale of the k-th selected layer.
Further, a sequence of aspect ratios a_r ∈ {1, 2, 3, 1/2, 1/3} may be set, so that the width of any anchor point window of the k-th selected third convolutional layer submodule 3041 is w_{k}^{a} = s_{k}\sqrt{a_{r}} and its height is h_{k}^{a} = s_{k} / \sqrt{a_{r}}.
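A non-limiting sketch of the anchor-window generation described by equation (11) and the aspect-ratio sequence above; the function name and the example values of s_min and s_max are illustrative assumptions:

```python
import math

def anchor_sizes(num_layers, s_min=0.2, s_max=0.9, ratios=(1, 2, 3, 1/2, 1/3)):
    """Return (width, height) pairs, relative to the input image size, for each selected layer."""
    sizes = []
    for k in range(1, num_layers + 1):
        s_k = s_min + (s_max - s_min) * (k - 1) / (num_layers - 1)   # equation (11)
        layer_sizes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in ratios]
        sizes.append(layer_sizes)
    return sizes

# Example: 6 selected feature layers, 5 anchor shapes per position.
print(anchor_sizes(6)[0])
```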
Fig. 9 is a flowchart illustrating a method for extracting features of an image according to an embodiment of the present invention. The feature extraction method may be performed by using the CNN network device shown in fig. 6. Specifically, the feature extraction method may include the steps of:
step S101, performing convolution operation on the feature mapping of the image data, and expanding the channel number of the feature mapping obtained by convolution to obtain a first feature mapping;
step S102, carrying out depth separation convolution on the first feature mapping to obtain a second feature mapping;
step S103, performing convolution operation on the second feature map, and compressing the number of channels of the data after the convolution operation to obtain a third feature map, so that the number of channels of the third feature map is smaller than the number of channels of the first feature map.
Specifically, in step S101, the image data may be convolved to obtain a feature map of the image, and then the feature map input to the convolution device may be convolved, and the number of channels of the convolved feature map may be expanded to obtain the first feature map.
In a specific implementation, (e · Min) may be determined as the number of channels of the first feature mapping, where e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the number of channels of the feature mapping of the image data. (e · Min) M × M convolutions are then performed on the feature mapping to obtain a first convolution result, where M is a positive integer; batch normalization is performed on the first convolution result to obtain a first normalized result; and limited linear processing is performed on the first normalized result to obtain the first feature mapping.
As a variation, (e · Min) may be determined as the number of channels of the first feature mapping, where e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the number of channels of the feature mapping, and (e · Min) M × M convolutions and batch normalization may be performed on the feature mapping using the following formula, where M is a positive integer:

z = s \cdot \frac{(w \ast x + b) - m}{\delta} + t

Limited linear processing is then performed on the batch-normalized output data to obtain the first feature mapping; wherein z is the first feature mapping, w is a weight parameter corresponding to the feature mapping, b is a bias parameter corresponding to the feature mapping, x is the feature mapping of the image data, m is a preset mean parameter, δ is a preset standard deviation parameter, s is a preset scale parameter, and t is a preset offset parameter.
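The combined convolution and batch-normalization formula above allows the batch-normalization parameters to be folded into the convolution weights at inference time. A non-limiting PyTorch sketch of one such folding (the function name fold_bn_into_conv and the layer configuration are illustrative assumptions):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold z = s * ((w*x + b) - m) / delta + t into a single convolution with
    adjusted weights w' = (s / delta) * w and bias b' = s * (b - m) / delta + t."""
    delta = torch.sqrt(bn.running_var + bn.eps)          # preset standard deviation parameter
    scale = bn.weight / delta                            # s / delta, one value per output channel
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_(scale * (bias - bn.running_mean) + bn.bias)
    return fused
```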
In step S102, a depth separation convolution may be performed on the first feature mapping to obtain the second feature mapping. Specifically, an N × N depth separation convolution may be performed on the first feature mapping to obtain a second convolution result, where N > M and N is a positive integer; the second convolution result may be batch-normalized to obtain a second normalized result; and limited linear processing may be performed on the second normalized result to obtain the second feature mapping.
As a variation, an N × N depth separation convolution and batch normalization may be performed on the first feature mapping using the following formula, where N > M and N is a positive integer:

z_{1} = s_{1} \cdot \frac{(w_{1} \ast x_{1} + b_{1}) - m_{1}}{\delta_{1}} + t_{1}

Limited linear processing is then performed on the batch-normalized output data to obtain the second feature mapping; wherein z1 is the second feature mapping, w1 is a weight parameter determined based on the first feature mapping, x1 is the first feature mapping, b1 is a bias parameter determined based on the first feature mapping, m1 is a preset mean parameter, δ1 is a preset standard deviation parameter, s1 is a preset scale parameter, and t1 is a preset offset parameter.
In step S103, a convolution operation may be performed on the second feature map, and the number of channels of the data after the convolution operation is compressed to obtain a third feature map, so that the number of channels of the third feature map is smaller than the number of channels of the first feature map.
Specifically, Mout can be determined as the number of channels of the third feature map, Mout times of M × M convolution is conducted on the second feature map to obtain a third convolution result, and batch processing normalization is conducted on the third convolution result to obtain the third feature map.
As a variation, Mout may be determined as the number of channels of the third feature mapping, and Mout M × M convolutions and batch normalization may be performed on the second feature mapping using the following formula:

z_{2} = s_{2} \cdot \frac{(w_{2} \ast x_{2} + b_{2}) - m_{2}}{\delta_{2}} + t_{2}

wherein z2 is the third feature mapping, w2 is a weight parameter determined based on the second feature mapping, x2 is the second feature mapping, b2 is a bias parameter determined based on the second feature mapping, m2 is a preset mean parameter, δ2 is a preset standard deviation parameter, s2 is a preset scale parameter, and t2 is a preset offset parameter.
As a preferred embodiment, M is 1 and N is 3.
Further, when the number of channels of the feature map of the image data is equal to the number of channels of the third feature map, a sum of each data element of the feature map and each data element of the third feature map is calculated to obtain a fourth feature map.
Further, a point-by-point convolution may be performed on the fourth feature map to obtain a fifth feature map.
Further, a point-by-point convolution may be performed on the third feature map to obtain a sixth feature map.
Fig. 10 is a flowchart illustrating an image target detection method according to an embodiment of the present invention. The object detection method can be used for multi-object detection of image data and can be applied to mobile terminals. Specifically, the target detection method may include:
step S201, extracting feature information of the image data based on the feature extraction method of the image shown in fig. 9;
step S202, predicting a preset anchor point window based on the characteristic information to obtain a prediction result;
step S203, carrying out non-maximum suppression processing on the prediction result to obtain each detection target.
In a specific implementation, step S201 may be performed, namely, feature information of the image data is extracted according to the feature extraction method of the image shown in fig. 9.
In step S202, a preset anchor point window may be predicted based on the characteristic information to obtain a prediction result.
In step S203, non-maximum suppression processing may be performed on the prediction result to obtain each detection target.
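A non-limiting sketch of the non-maximum suppression of step S203 for a single target class; the IoU threshold value and the box representation are illustrative assumptions:

```python
def iou(a, b):
    """Intersection-over-union (overlap rate) of two windows in (x1, y1, x2, y2) corner form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """boxes: list of (x1, y1, x2, y2); scores: matching confidence list.
    Returns the indices of the boxes kept after non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                                   # highest-confidence remaining window
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```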
For performance comparison, the target detection device provided in the embodiment of the present invention was trained and tested on the computer vision standard data set PASCAL VOC, with the VOC 2012 and VOC 2007 training sets used for training and the VOC 2007 test set used for testing. The input image data is 300 × 300 pixels, 17 convolution devices are used in the basic CNN network, the expansion coefficient e of the convolution devices is 6, and experimental simulation is performed with the multiplicative coefficient β set to 1 and to 0.75. It should be noted that the basic CNN network is trained on a single Graphics Processing Unit (GPU).
In a specific implementation, the VOC data set has 20 target classes, and the index used to evaluate detection performance is the mean average precision (mAP), as shown in equation (12):

p_{interp}(r) = \max_{\tilde{r} \ge r} p(\tilde{r}), \quad AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1.0\}} p_{interp}(r), \quad mAP = \frac{1}{Q} \sum_{q=1}^{Q} AP_{q}    (12)

where r denotes the recall, p(r) denotes the precision at recall r, p_{interp}(r) denotes the maximum precision over recalls greater than or equal to r, AP denotes the precision mean computed over the recall levels {0, 0.1, … , 1.0}, mAP denotes the average of the AP values over the detected object classes, and the number of detected object classes Q is 20.
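A non-limiting sketch of the 11-point interpolated AP and mAP computation of equation (12); the precision–recall inputs are assumed to be already computed:

```python
def average_precision(recalls, precisions):
    """11-point interpolated AP of equation (12).
    recalls, precisions: parallel lists describing the precision-recall curve."""
    ap = 0.0
    for r in [i / 10 for i in range(11)]:                      # r in {0, 0.1, ..., 1.0}
        p_at_r = [p for rec, p in zip(recalls, precisions) if rec >= r]
        ap += max(p_at_r) if p_at_r else 0.0                   # p_interp(r)
    return ap / 11

def mean_average_precision(per_class_curves):
    """mAP: average of the per-class AP values (Q = number of classes)."""
    aps = [average_precision(r, p) for r, p in per_class_curves]
    return sum(aps) / len(aps)
```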
TABLE 1
Table 1 compares the performance of the target detection device provided in the embodiments of the present invention with that of the conventional MobileNet-SSD detector, where the multiplicative coefficient β is 1 in the first embodiment and 0.75 in the second embodiment. It can be seen from the table that the mean average precision of the target detection device of the first embodiment is slightly lower than that of the MobileNet-SSD detector, while its model size (in MB) is about one half of that of the MobileNet-SSD detector.
Therefore, the embodiment of the invention can provide the convolution device with lower computational complexity, and the convolution neural network and the target detection device with lower computational complexity can be obtained based on the convolution device.
Further, the embodiment of the present invention also discloses a storage medium on which computer instructions are stored; when the computer instructions are executed, the technical solutions of the methods described in the embodiments shown in fig. 9 and fig. 10 are performed. Preferably, the storage medium may be a computer-readable storage medium such as a non-volatile memory or a non-transitory memory, and may include a ROM, a RAM, a magnetic disk or an optical disk, etc.
Further, an embodiment of the present invention further discloses a terminal, which includes a memory and a processor, where the memory stores computer instructions capable of running on the processor, and the processor, when running the computer instructions, executes the technical solutions of the methods described in the embodiments shown in fig. 9 and fig. 10.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (28)

1. An apparatus for convolving an image, comprising:
the channel expansion module is used for performing convolution operation on the feature mapping of the image data and expanding the channel number of the feature mapping obtained by convolution to obtain a first feature mapping;
the depth separation convolution module is used for performing depth separation convolution on the first feature mapping output by the channel expansion module to obtain a second feature mapping;
and the channel compression module is used for receiving the second feature mapping output by the depth separation convolution module, carrying out convolution operation on the second feature mapping, and compressing the channel number of the data subjected to the convolution operation to obtain a third feature mapping, wherein the channel number of the third feature mapping is smaller than that of the first feature mapping.
2. The convolution device of claim 1, wherein the channel expansion module comprises: a first convolution layer sub-module for determining (e · Min) as the number of output channels of the first convolution layer sub-module and performing (e · Min) M × M convolutions on the feature map input to the channel expansion module, wherein M, e and Min are positive integers, e represents a preset expansion coefficient, e > 1, and Min represents the number of channels of the feature map;
the first batch normalization layer submodule is used for carrying out batch normalization on the output result of the first convolution layer submodule;
and the first restricted linear unit layer submodule is used for performing restricted linear processing on the data output by the first batch normalization layer submodule to obtain the first feature mapping.
3. The convolution device of claim 2, wherein the depth separation convolution module comprises:
a depth separation convolution layer submodule for performing N × N depth separation convolution on the first feature map, wherein N > M and N is a positive integer;
the second batch processing normalization layer submodule is used for carrying out batch processing normalization on the convolution result obtained by the depth separation convolution layer submodule;
and the second limited linear unit layer submodule is used for performing limited linear processing on the data obtained by the second batch processing normalization layer submodule to obtain the second feature mapping.
4. The convolution device of claim 3, wherein the channel compression module comprises:
a second convolutional layer submodule for determining (e · Min) as the number of input channels of the second convolutional layer submodule, and performing Mout times of M × M convolution on the second feature map, where Mout is a positive integer and represents the number of output channels of the channel compression module;
and the third batch normalization layer submodule is used for carrying out batch normalization on the convolution result output by the second convolution layer submodule.
5. The convolution device of claim 1, wherein the channel expansion module comprises a first convolution batch layer sub-module for determining (e · Min) as the number of output channels of the first convolution batch layer sub-module, and performing (e · Min) M × M convolutions and batch normalization on the feature map input to the channel expansion module using the following formula, wherein M is a positive integer, e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the number of channels of the feature map,

z = s \cdot \frac{(w \ast x + b) - m}{\delta} + t
the first limited linear unit layer submodule is used for performing limited linear processing on the output data of the first convolution batch layer sub-module to obtain the first feature mapping;
wherein z is output data of the first convolution batch processing layer sub-module, w is a weight parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, b is a bias parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, x is the feature mapping of the image data, m is a preset mean parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, δ is a preset standard deviation parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, s is a preset scale parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping, and t is a preset offset parameter of the first convolution batch processing layer sub-module corresponding to the feature mapping.
6. The convolution device of claim 5, wherein the depth separation convolution module comprises:
a deep separation convolution batch layer submodule for performing N × N deep separation convolution and batch normalization on the data input to the deep separation convolution module by adopting the following formula, wherein N > M and N is a positive integer,
z_{1} = s_{1} \cdot \frac{(w_{1} \ast x_{1} + b_{1}) - m_{1}}{\delta_{1}} + t_{1}
the second limited linear unit layer submodule is used for performing limited linear processing on the output data of the deep separation convolution batch processing layer submodule to obtain second feature mapping;
wherein z1 is the second feature mapping, w1 is a weight parameter of the deep separation convolution batch layer sub-module determined based on the first feature mapping, x1 is the first feature mapping, b1 is a bias parameter of the deep separation convolution batch layer sub-module determined based on the first feature mapping, m1 is a preset mean parameter of the deep separation convolution batch layer sub-module determined based on the first feature mapping, δ1 is a preset standard deviation parameter of the deep separation convolution batch layer sub-module determined based on the first feature mapping, s1 is a preset scale parameter of the deep separation convolution batch layer sub-module determined based on the first feature mapping, and t1 is a preset offset parameter of the deep separation convolution batch layer sub-module determined based on the first feature mapping.
7. The convolution device of claim 6, wherein the channel compression module comprises a second convolution batch layer sub-module for determining (e · Min) as the number of input channels of the second convolution batch layer sub-module, performing Mout M × M convolutions on the data input to the channel compression module and performing batch normalization using the following formula, wherein Mout is a positive integer and represents the number of output channels of the channel compression module,

z_{2} = s_{2} \cdot \frac{(w_{2} \ast x_{2} + b_{2}) - m_{2}}{\delta_{2}} + t_{2}
wherein z2 is the third feature mapping, w2 is a weight parameter of the second convolution batch layer sub-module determined based on the second feature mapping, x2 is the second feature mapping, b2 is a bias parameter of the second convolution batch layer sub-module determined based on the second feature mapping, m2 is a preset mean parameter of the second convolution batch layer sub-module determined based on the second feature mapping, δ2 is a preset standard deviation parameter of the second convolution batch layer sub-module determined based on the second feature mapping, s2 is a preset scale parameter of the second convolution batch layer sub-module determined based on the second feature mapping, and t2 is a preset offset parameter of the second convolution batch layer sub-module determined based on the second feature mapping.
8. The convolution device of claim 3, 4, 6 or 7, wherein M is 1 and N is 3.
9. The convolution device of claim 1, further comprising:
a residual module, configured to calculate a sum of each data element of the feature map and each data element of the output data when the number of channels of the feature map input to the channel expansion module is equal to the number of channels of the output data of the channel compression module.
10. The convolution device according to any one of claims 1 to 7 and 9, further comprising:
and the point-by-point convolution module is suitable for performing point-by-point convolution on the data input to the point-by-point convolution module.
11. A CNN network device, comprising an input layer module, a first convolutional layer module connected to the input layer module, and further comprising:
convolution means for performing a convolution operation on a feature map of image data output by the first convolution layer module, the convolution means being the convolution means according to any one of claims 1 to 10.
12. The CNN network device of claim 11, further comprising:
and the second convolution layer module is used for receiving the third feature mapping output by the convolution device and performing point-by-point convolution on the third feature mapping.
13. The CNN network device of claim 12, further comprising:
a third convolutional layer module connected to the second convolutional layer module, the third convolutional layer module comprising a plurality of cascaded third convolutional layer submodules, each third convolutional layer submodule being configured to perform an N × N convolution or an M × M convolution with a sliding step size P, P being a positive integer greater than 1, M, N being a positive integer.
14. The CNN network device of claim 13, further comprising:
and the feature layer extracting module comprises a plurality of cascaded feature layer extracting submodules, and each feature layer extracting submodule is used for receiving the convolution results output by the second convolution layer module and each third convolution layer submodule and carrying out N × N convolution on each convolution result so as to extract the feature information of the image data.
15. An object detection apparatus for an image, comprising:
a feature extraction module adapted to extract feature information of image data based on the CNN network device of any one of claims 11 to 14;
the prediction module is suitable for predicting a preset anchor point window based on the characteristic information to obtain a prediction result;
and the suppression module is suitable for carrying out non-extreme suppression processing on the prediction result to obtain each detection target.
16. A method for extracting features of an image, comprising:
performing convolution operation on the feature mapping of the image data, and expanding the channel number of the feature mapping obtained by convolution to obtain a first feature mapping;
performing depth separation convolution on the first feature mapping to obtain a second feature mapping;
and performing convolution operation on the second feature mapping, and compressing the channel number of the data subjected to the convolution operation to obtain a third feature mapping, so that the channel number of the third feature mapping is smaller than the channel number of the first feature mapping.
17. The feature extraction method of claim 16, wherein performing a convolution operation on the feature map of the image data and expanding the number of channels of the feature map obtained by the convolution to obtain the first feature map comprises:
determining (e · Min) as the channel number of the first feature mapping, wherein e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the channel number of the feature mapping;
performing (e · Min) M × M convolutions on the feature map to obtain a first convolution result, wherein M is a positive integer;
carrying out batch processing normalization on the first convolution result to obtain a first normalization result;
and performing limited linear processing on the first normalization result to obtain the first feature mapping.
18. The feature extraction method of claim 17, wherein the depth separation convolution of the first feature map to obtain a second feature map comprises:
performing N × N depth separation convolution on the first feature map to obtain a second convolution result, wherein N > M and N is a positive integer;
carrying out batch processing normalization on the second convolution result to obtain a second normalization result;
and performing limited linear processing on the second normalization result to obtain the second feature mapping.
19. The feature extraction method according to claim 18, wherein performing a convolution operation on the second feature map and compressing the number of output channels of the data after the convolution operation comprises:
determining Mout as the number of channels of the third feature mapping, wherein Mout is a positive integer;
conducting Mout times of M × M convolution on the second feature mapping to obtain a third convolution result;
and carrying out batch processing normalization on the third convolution result to obtain the third feature mapping.
20. The feature extraction method of claim 16, wherein performing a convolution operation on the feature map of the image data and expanding the number of channels of the feature map obtained by the convolution to obtain the first feature map comprises:
determining (e · Min) as the channel number of the first feature mapping, wherein e represents a preset expansion coefficient, e > 1, e and Min are positive integers, and Min represents the channel number of the feature mapping;
performing (e · Min) M × M convolutions on the feature map and performing batch normalization by adopting the following formula, wherein M is a positive integer;

z = s \cdot \frac{(w \ast x + b) - m}{\delta} + t
performing limited linear processing on the output data after batch processing normalization to obtain the first feature mapping;
wherein z is the first feature mapping, w is a weight parameter determined by the feature mapping, b is a bias parameter corresponding to the feature data, x is the feature mapping of the image data, m is a preset mean parameter, δ is a preset standard deviation parameter, s is a preset scale parameter, and t is a preset offset parameter.
21. The feature extraction method of claim 20, wherein the depth separation convolution of the first feature map to obtain a second feature map comprises:
performing N × N deep separation convolution on the first feature mapping by adopting the following formula and performing batch normalization, wherein N is greater than M and is a positive integer;
z_{1} = s_{1} \cdot \frac{(w_{1} \ast x_{1} + b_{1}) - m_{1}}{\delta_{1}} + t_{1}
performing limited linear processing on the output data after batch processing normalization to obtain the second feature mapping;
wherein z1 is the second feature mapping, w1 is a weight parameter corresponding to the first feature mapping, x1 is the first feature mapping, b1 is a bias parameter corresponding to the first feature mapping, m1 is a preset mean parameter, δ1 is a preset standard deviation parameter, s1 is a preset scale parameter, and t1 is a preset offset parameter.
22. The feature extraction method of claim 21, wherein performing a convolution operation on the second feature map and compressing the number of output channels of the convolved data comprises:
determining Mout as the number of channels of the third feature mapping, wherein Mout is a positive integer and represents the number of output channels of the channel compression module;
performing Mout M × M convolutions and batch normalization on the second feature map using the following formula,

z_{2} = s_{2} \cdot \frac{(w_{2} \ast x_{2} + b_{2}) - m_{2}}{\delta_{2}} + t_{2}
wherein z2 is the third feature mapping, w2 is a weight parameter determined based on the second feature mapping, x2 is the second feature mapping, b2 is a bias parameter determined based on the second feature mapping, m2 is a preset mean parameter, δ2 is a preset standard deviation parameter, s2 is a preset scale parameter, and t2 is a preset offset parameter.
23. The feature extraction method according to claim 18, 19, 21 or 22, wherein M is 1 and N is 3.
24. The feature extraction method according to claim 16, further comprising:
when the number of channels of the feature map is equal to the number of channels of the third feature map, calculating a sum of each data element of the feature map and each data element of the third feature map to obtain a fourth feature map.
25. The feature extraction method according to claim 24, further comprising:
and performing point-by-point convolution on the fourth feature map to obtain a fifth feature map.
26. The feature extraction method according to any one of claims 16 to 22, further comprising:
and performing point-by-point convolution on the third feature map to obtain a sixth feature map.
27. An object detection method for an image, comprising:
extracting feature information of the image data based on the feature extraction method of the image according to any one of claims 16 to 26;
predicting a preset anchor point window based on the characteristic information to obtain a prediction result;
and carrying out non-extreme value suppression processing on the prediction result to obtain each detection target.
28. A terminal comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the method of any one of claims 16 to 26 or claim 27.
CN201811589348.3A 2018-12-25 2018-12-25 Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal Active CN111368850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811589348.3A CN111368850B (en) 2018-12-25 2018-12-25 Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal


Publications (2)

Publication Number Publication Date
CN111368850A true CN111368850A (en) 2020-07-03
CN111368850B CN111368850B (en) 2022-11-25

Family

ID=71205952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811589348.3A Active CN111368850B (en) 2018-12-25 2018-12-25 Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal

Country Status (1)

Country Link
CN (1) CN111368850B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137406A1 (en) * 2016-11-15 2018-05-17 Google Inc. Efficient Convolutional Neural Networks and Techniques to Reduce Associated Computational Costs
CN108446694A (en) * 2017-02-16 2018-08-24 杭州海康威视数字技术股份有限公司 A kind of object detection method and device
CN108510473A (en) * 2018-03-09 2018-09-07 天津工业大学 The FCN retinal images blood vessel segmentations of convolution and channel weighting are separated in conjunction with depth
CN108898112A (en) * 2018-07-03 2018-11-27 东北大学 A kind of near-infrared human face in-vivo detection method and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507861A (en) * 2020-12-04 2021-03-16 江苏科技大学 Pedestrian detection method based on multilayer convolution feature fusion
CN112867010A (en) * 2021-01-14 2021-05-28 中国科学院国家空间科学中心 Radio frequency fingerprint embedded real-time identification method and system based on convolutional neural network
CN112867010B (en) * 2021-01-14 2023-04-18 中国科学院国家空间科学中心 Radio frequency fingerprint embedded real-time identification method and system based on convolutional neural network
CN113205131A (en) * 2021-04-28 2021-08-03 阿波罗智联(北京)科技有限公司 Image data processing method and device, road side equipment and cloud control platform
WO2023097423A1 (en) * 2021-11-30 2023-06-08 Intel Corporation Apparatus and method for dynamic quadruple convolution in 3d cnn
WO2024012143A1 (en) * 2022-07-15 2024-01-18 华为技术有限公司 Image data processing method and apparatus, and storage medium

Also Published As

Publication number Publication date
CN111368850B (en) 2022-11-25

Similar Documents

Publication Publication Date Title
CN111368850B (en) Image feature extraction method, image target detection method, image feature extraction device, image target detection device, convolution device, CNN network device and terminal
CN111126258B (en) Image recognition method and related device
EP3971772B1 (en) Model training method and apparatus, and terminal and storage medium
CN110569738B (en) Natural scene text detection method, equipment and medium based on densely connected network
CN111898432B (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN110555405B (en) Target tracking method and device, storage medium and electronic equipment
CN109902697B (en) Multi-target detection method and device and mobile terminal
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN113435461B (en) Point cloud local feature extraction method, device, equipment and storage medium
CN114998756A (en) Yolov 5-based remote sensing image detection method and device and storage medium
CN115761888A (en) Tower crane operator abnormal behavior detection method based on NL-C3D model
CN114492755A (en) Target detection model compression method based on knowledge distillation
CN114581789A (en) Hyperspectral image classification method and system
CN116977859A (en) Weak supervision target detection method based on multi-scale image cutting and instance difficulty
CN114155388B (en) Image recognition method and device, computer equipment and storage medium
CN111061774B (en) Search result accuracy judging method and device, electronic equipment and storage medium
CN111382761B (en) CNN-based detector, image detection method and terminal
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention
CN115131551A (en) Target feature extraction method based on cross-correlation self-attention mechanism
US20220405576A1 (en) Multi-layer neural network system and method
CN109212501B (en) Radar high-resolution range profile target identification method based on local embedding
CN112364892B (en) Image identification method and device based on dynamic model
CN116385844B (en) Feature map distillation method, system and storage medium based on multi-teacher model
CN114549591B (en) Method and device for detecting and tracking time-space domain behaviors, storage medium and equipment
CN113688747B (en) Method, system, device and storage medium for detecting personnel target in image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant