CN110751214A - Target detection method and system based on lightweight deformable convolution - Google Patents

Target detection method and system based on lightweight deformable convolution

Info

Publication number
CN110751214A
CN110751214A (application number CN201911001669.1A)
Authority
CN
China
Prior art keywords
network
target
layer
source network
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911001669.1A
Other languages
Chinese (zh)
Inventor
张明鑫
张伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201911001669.1A priority Critical patent/CN110751214A/en
Publication of CN110751214A publication Critical patent/CN110751214A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method and system based on lightweight deformable convolution. The method comprises: constructing a depth separable feature extraction network as a source network; constructing a target network that uses the same algorithm structure as the source network but without the deformable convolution layer replacement; approximating the outputs of the source network and the target network with a distance loss by mimicking the features extracted by multiple network layers of the source network; at the last feature extraction layer before the classification layer, making the source network approach the output of the corresponding target network layer; obtaining a new feature extraction network model after joint training of the mutual learning framework of the source network and the target network; and using the new feature extraction network model as a feature extractor to extract features of image data, thereby completing target detection. Through mutual learning of the target network and the source network, the invention improves recognition accuracy, reduces the amount of data computation, and lowers the burden on hardware equipment.

Description

Target detection method and system based on lightweight deformable convolution
Technical Field
The invention relates to the technical field of target detection, in particular to a target detection method and a target detection system based on lightweight deformable convolution.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Target detection is a very active research direction in the fields of computer vision, pattern recognition and machine learning, and is widely used in many applications. In short, given a picture, target recognition answers whether the picture contains a certain object, while target detection answers where the object appears in the picture, that is, a bounding rectangle of the object must be given. Target detection is a fundamental problem in visual research and also a very challenging one. According to the type of task, its difficulties and challenges can be divided into three levels: the instance level, the category level and the semantic level. Because of its strong performance, deep learning has attracted the attention of many experts in industry for the target detection task. However, its performance depends heavily on large amounts of data computation and places high requirements on hardware, which greatly limits its practical deployment and application. Lightweight network technology, which reduces the number of model parameters and the amount of data computation while maintaining sufficient accuracy, has therefore received wide attention from industry and academia, and some initiatives for deep-learning-based target detection already exist.
However, to the inventor's knowledge, deep-learning-based target detection tasks often face the following problems during implementation. In target classification, differences in illumination, shooting angle and distance during image acquisition, non-rigid deformation of objects and partial occlusion by other objects cause large changes in the appearance of object instances, which brings great difficulty to visual recognition algorithms. Enhancing the performance of an algorithm inevitably increases the amount of computation; when the algorithm is deployed on mobile devices with limited performance, computing power and storage capacity become constraints, using high-performance computing devices greatly increases cost, and new requirements are placed on system units such as the power supply.
Disclosure of Invention
In order to solve the problems, the invention provides a target detection method and a system based on lightweight deformable convolution.
In some embodiments, the following technical scheme is adopted:
a target detection method based on lightweight deformable convolution comprises the following steps:
constructing a depth separable feature extraction network as a source network;
constructing a target network that uses the same algorithm structure as the source network but without the deformable convolution layer replacement;
approximating the outputs of the source network and the target network with a distance loss by mimicking the features extracted by multiple network layers of the source network;
at the last feature extraction layer before the classification layer, making the source network approach the output of the corresponding target network layer;
after joint training of the mutual learning framework of the source network and the target network, obtaining a new feature extraction network model;
and using the new feature extraction network model as a feature extractor to extract features of the image data, thereby completing target detection.
In some embodiments, the following technical scheme is adopted:
a lightweight deformable convolution-based target detection system comprising:
means for constructing a depth separable feature extraction network as a source network;
means for constructing a target network that uses the same algorithm structure as the source network but without the deformable convolution layer replacement;
means for approximating the outputs of the source network and the target network with a distance loss by mimicking the features extracted by multiple network layers of the source network;
means for making the source network approach the output of the corresponding target network layer at the last feature extraction layer before the classification layer;
means for obtaining a new feature extraction network model after joint training of the mutual learning framework of the source network and the target network;
and means for extracting features of the image data with the new feature extraction network model as a feature extractor to complete target detection.
In some embodiments, the following technical scheme is adopted:
a terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; the computer readable storage medium stores a plurality of instructions adapted to be loaded by a processor and to perform the above-described method of object detection based on lightweight deformable convolution.
In some embodiments, the following technical scheme is adopted:
a computer-readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to execute the above-mentioned object detection method based on lightweight deformable convolution.
Compared with the prior art, the invention has the beneficial effects that:
1. The method of the invention internally transforms the parameters related to the convolution kernel, and the transformation is learned entirely from data without manually designed features; the method is lightweight, easy to extend to other network structures, and directly supports end-to-end training and prediction.
2. Through mutual learning of the target network and the source network, the invention improves recognition accuracy, reduces the amount of data computation, and lowers the burden on hardware equipment.
Drawings
FIG. 1 is a diagram illustrating a lightweight deformable convolutional network according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a general convolution output according to a first embodiment of the present invention;
fig. 3 is a schematic diagram of k general 1 × 1 convolutions of the input of N × H × W × C according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a feature extraction network Mobilenetv1 according to a first embodiment of the present invention;
FIG. 5 is a diagram illustrating a deformable convolution according to an embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating deformable RoI pooling in accordance with one embodiment of the present invention;
fig. 7 is a schematic diagram of a target detection method based on lightweight deformable convolution according to an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
In one or more embodiments, a new target detection method constructed by a deep learning framework based on feature learning is disclosed, which comprises the following steps:
(1) Two modules, deformable convolution and deformable pooling (deformable RoI pooling), are used to replace single or multiple 3x3 convolution layers and pooling layers in existing algorithms, such as the traditional feature extraction networks VGG, GoogleNet, ResNet, DenseNet and ResNeXt and the lightweight feature networks MobileNet and ShuffleNet, and a depth separable feature extraction network is constructed to serve as the source network.
In this embodiment, the feature extraction network used is MobileNet v1, which is the basic network of both the source network and the target network. The specific network structure is as follows:
the core of Mobilene v1 is the splitting of the convolution into Depthwise + Pointwise two parts.
To explain MobileNet v1, assume an input of N × H × W × C, where N is the number of feature maps, H and W are the height and width of the features, and C is the number of channels, and assume k 3x3 convolution kernels. With pad = 1 and stride = 1, the output of an ordinary convolution is N × H × W × k, as shown in fig. 2.
Depthwise applies one 3x3 convolution per input channel; Pointwise then applies k ordinary 1 x 1 convolutions to the N × H × W × C feature map, as shown in fig. 3, which corresponds to collecting the features at each point, i.e., the pointwise features. The final output of Depthwise + Pointwise is likewise N × H × W × k.
Thus, an ordinary convolution is split into the two parts Depthwise + Pointwise. The actual conversion used in MobileNet v1 is shown in FIG. 4.
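As an illustration of this split, a minimal PyTorch sketch is given below; the module and variable names are chosen for this example only (note that PyTorch stores tensors as N x C x H x W rather than the N x H x W x C ordering used above):

    import torch
    import torch.nn as nn

    class DepthwiseSeparableConv(nn.Module):
        # Depthwise: one 3x3 filter per input channel (groups = C).
        # Pointwise: k ordinary 1x1 convolutions across all C channels.
        def __init__(self, c_in, k_out):
            super().__init__()
            self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
            self.pointwise = nn.Conv2d(c_in, k_out, 1, bias=False)

        def forward(self, x):                           # x: N x C x H x W
            return self.pointwise(self.depthwise(x))

    x = torch.randn(1, 32, 56, 56)
    print(DepthwiseSeparableConv(32, 64)(x).shape)      # torch.Size([1, 64, 56, 56])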
The number of multiplications for the different convolutions is compared as follows:
the ordinary convolution requires H × W × C × k × 3 × 3 multiplications;
the Depthwise part requires H × W × C × 3 × 3 multiplications;
the Pointwise part requires H × W × C × k multiplications.
Through the Depthwise + Pointwise split, the computation of the ordinary convolution is therefore compressed by the factor
(H × W × C × 3 × 3 + H × W × C × k) / (H × W × C × k × 3 × 3) = 1/k + 1/9.
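As a worked check of this ratio (the feature sizes and channel numbers below are chosen only for illustration), with C = 32 input channels and k = 64 output channels on a 56 x 56 feature map:

    H, W, C, k = 56, 56, 32, 64
    ordinary  = H * W * C * k * 3 * 3                  # 57,802,752 multiplications
    separable = H * W * C * 3 * 3 + H * W * C * k      # 7,325,696 multiplications
    print(separable / ordinary)                        # about 0.127, i.e. 1/k + 1/9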
In addition to this convolution optimization, the basic network structure of MobileNet v1 is given in Table 1; Table 4 shows the MobileNet body after the corresponding convolutions are replaced with deformable convolutions.
Table 1 MobileNet v1 main body structure
At this point MobileNet v1 is already very small, but the network can be compressed further by uniformly multiplying the number of convolution kernels (filters) in all convolutional layers by a reduction factor α (the width factor), where α ∈ (0, 1).
The total number of multiplications of the depth separable convolution is then further reduced to H × W × αC × 3 × 3 + H × W × αC × αk. Of course, compressing the network in this way comes at a cost in accuracy; Table 2 below shows the performance of MobileNet v1 for different values of α on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset at 224 × 224 resolution.
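The effect of the width factor can be illustrated with the same toy numbers as above (a sketch only; the placement of α follows the formula just given):

    H, W, C, k = 56, 56, 32, 64
    def separable_cost(alpha):
        # depthwise on alpha*C channels, pointwise from alpha*C to alpha*k channels
        return H * W * int(alpha * C) * 3 * 3 + H * W * int(alpha * C) * int(alpha * k)
    print(separable_cost(1.0))   # 7,325,696 multiplications
    print(separable_cost(0.5))   # 2,057,216 multiplications, roughly 28% of the full-width cost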
Table 2 Comparison of MobileNet v1 results for multiple width factors
Table 3 shows that, at an input resolution of 224 × 224, the accuracy gap between MobileNet v1 (α = 1.0) and GoogleNet and VGG16 is very small, while the amounts of computation and parameters are much smaller.
Table 3 Comparison of MobileNet v1 with the GoogleNet and VGG16 networks
Table 4 Main body structure of the deformable-convolution MobileNet
The method of this embodiment internally transforms the parameters related to the convolution kernel, and the transformation is learned entirely from data without manually designed features. The method is lightweight, easy to extend to other network structures, and directly supports end-to-end training and prediction.
The source network has a stronger feature extraction capability, but this increases training difficulty and the amount of data computation, which is unfavorable for deployment on devices with limited performance.
The two new modules introduced in this embodiment greatly improve the ability of CNNs to model geometric transformations:
1. Deformable convolution.
A 2D offset is added to the regular grid sampling locations of the standard convolution, which allows the sampling grid to deform freely. A conventional convolution only needs to train the weight at each position of the convolution window, whereas a deformable convolution network must additionally train parameters describing the shape of the convolution window (an offset vector for each sampled pixel). As shown in fig. 5, the deformable convolution (Deformable Conv) learns offsets for the input feature map through an extra convolution layer and then obtains the output feature map through interpolation. Because the offsets are learned from the preceding features by added convolution layers, the deformation adapts to the input features in a local, dense and adaptive way.
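A minimal sketch of this idea using the deformable convolution operator from torchvision is given below; the block structure, the zero-initialized offset branch and all names are assumptions of this sketch rather than the patent's own implementation:

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformableConvBlock(nn.Module):
        def __init__(self, c_in, c_out, k=3):
            super().__init__()
            # Extra conv layer that learns a (dx, dy) offset for each of the k*k sampling points.
            self.offset_conv = nn.Conv2d(c_in, 2 * k * k, kernel_size=k, padding=k // 2)
            nn.init.zeros_(self.offset_conv.weight)   # start from the regular, undeformed grid
            nn.init.zeros_(self.offset_conv.bias)
            self.deform_conv = DeformConv2d(c_in, c_out, kernel_size=k, padding=k // 2)

        def forward(self, x):
            offset = self.offset_conv(x)              # N x (2*k*k) x H x W
            return self.deform_conv(x, offset)        # samples at grid positions shifted by the offsets

    x = torch.randn(2, 32, 28, 28)
    print(DeformableConvBlock(32, 64)(x).shape)       # torch.Size([2, 64, 28, 28])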
2. Deformable RoI pooling.
Referring to fig. 6, an offset is added to each regular bin of the conventional RoI pooling. The offsets are likewise learned from the preceding features and the RoI, enabling adaptive localization of objects with different shapes.
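torchvision does not provide a deformable RoI pooling operator, so the following is only a simplified illustrative sketch of the idea: pool the RoI once, predict one (dx, dy) offset per bin from the pooled features, then re-sample each bin at its shifted centre. The offset scaling gamma, the single sample per bin and all names are assumptions of this sketch:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.ops import roi_align

    class DeformableRoIPoolSketch(nn.Module):
        def __init__(self, channels, out_size=7, gamma=0.1):
            super().__init__()
            self.out_size, self.gamma = out_size, gamma
            # fc head that predicts one (dx, dy) offset per pooled bin
            self.offset_fc = nn.Linear(channels * out_size * out_size, 2 * out_size * out_size)

        def forward(self, feat, roi):
            # feat: (1, C, Hf, Wf) feature map; roi: (x1, y1, x2, y2) in feature-map coordinates
            k = self.out_size
            x1, y1, x2, y2 = roi
            boxes = torch.tensor([[0.0, x1, y1, x2, y2]], device=feat.device)
            pooled = roi_align(feat, boxes, (k, k))                       # (1, C, k, k)
            off = self.offset_fc(pooled.flatten(1)).view(k, k, 2)
            # regular bin centres inside the RoI, then shifted by the learned offsets
            ys = y1 + (torch.arange(k, device=feat.device) + 0.5) * (y2 - y1) / k
            xs = x1 + (torch.arange(k, device=feat.device) + 0.5) * (x2 - x1) / k
            cy, cx = torch.meshgrid(ys, xs, indexing="ij")
            cx = cx + self.gamma * off[..., 0] * (x2 - x1)
            cy = cy + self.gamma * off[..., 1] * (y2 - y1)
            Hf, Wf = feat.shape[-2:]
            grid = torch.stack([2 * cx / (Wf - 1) - 1, 2 * cy / (Hf - 1) - 1], dim=-1)
            return F.grid_sample(feat, grid.unsqueeze(0), align_corners=True)   # (1, C, k, k)

    feat = torch.randn(1, 32, 50, 50)
    print(DeformableRoIPoolSketch(32)(feat, (4.0, 6.0, 30.0, 40.0)).shape)       # torch.Size([1, 32, 7, 7])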
These two modules replace the corresponding parts of existing CNNs, but they increase the amount of computation and the number of parameters and make training more difficult.
Secondly, to meet the growing demand for deploying deep learning, making the network structure smaller is a very important and challenging task.
Existing methods for compressing deep learning network models fall roughly into two groups. One group proposes new convolution computation methods to reduce parameters and compress the model, such as SqueezeNet and MobileNet; the other group prunes a trained model, mainly through methods such as pruning, weight sharing, quantization and neural network binarization. These are all worthwhile attempts. The core of this embodiment is "mimicry learning": in biology, an animal acquires features similar to those of a successful species during evolution, successfully confusing the cognition of predators and approaching the mimicked species.
In the field of deep learning, model distillation is a very common means of improving the target detection ability of small networks. Its essential idea is to distill a very large and powerful network into a relatively small one, because the small network better meets the requirements of low storage and high efficiency; the reason its detection ability is inferior to that of the large network lies mainly in the difficulty of training its model parameters.
(2) Through mutual learning of the target network and the source network, the recognition accuracy is improved, the amount of data computation is reduced, and the burden on hardware equipment is lowered.
The target network uses the same algorithm structure as the source network, but without the deformable convolution replacement. A distance loss is used to bring the outputs of the source network and the target network closer by mimicking the features extracted by multiple network layers of the source network; the embodiment of the invention uses a KL divergence distance loss function and a sine distance loss objective function so that the multi-layer outputs of the two networks approach each other layer by layer, and a neural network model whose performance approaches that of the target network but which is lighter in weight is extracted. Referring to fig. 1, at the last feature extraction layer before the classification layer, the source network is made to approach the output of the corresponding target network layer; because the target network involves less data computation and fewer parameters, this facilitates training of the source network. In this way the lightweight deformable convolution network proposed by this embodiment is obtained.
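A minimal sketch of such layer-by-layer feature approximation is shown below; for brevity it uses a plain mean-squared-error distance in place of the KL divergence and sine distance losses named above, and the choice of layers and weighting are assumptions of the sketch:

    import torch
    import torch.nn.functional as F

    def feature_mimic_loss(mimic_feats, reference_feats, weight=1.0):
        # mimic_feats / reference_feats: lists of intermediate feature maps taken from
        # matching layers of the two networks; each pair is pulled together by an L2 distance,
        # with the reference features detached so the fitting is one-way.
        loss = 0.0
        for fm, fr in zip(mimic_feats, reference_feats):
            loss = loss + F.mse_loss(fm, fr.detach())
        return weight * loss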
(3) After joint training of the mutual learning framework of the source network and the target network, a high-performance and more lightweight feature extraction network model is obtained, which constitutes the target detection method based on lightweight deformable convolution proposed by this disclosure.
Target detection requires acquiring scene information containing the target; in this application this specifically means image information. The values of the RGB image, a digital matrix with color codes from 0 to 255, are divided by 255 so that their distribution becomes 0 to 1; at the same time, the image is resized to a fixed size to reduce the amount of data computation.
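An illustrative preprocessing sketch follows; the fixed size of 224 x 224 and the use of OpenCV are assumptions of this example, the text only requires resizing to a fixed size and rescaling the 0-255 values to 0-1:

    import cv2
    import numpy as np

    def preprocess(image_bgr, size=(224, 224)):
        # Resize to a fixed size, then rescale the 0-255 color codes to the 0-1 range.
        img = cv2.resize(image_bgr, size)
        return img.astype(np.float32) / 255.0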
The processed image information is multi-dimensional data, and after passing through the feature extraction network it is output as one-dimensional feature data. The new feature extraction network model is used as a feature extractor to extract features of the image data, thereby completing target detection.
In the present embodiment, a deep mutual learning strategy is used: during training, the two networks learn cooperatively from each other, rather than learning unidirectionally from a large network.
Experiments show that various network architectures benefit from the mutual learning strategy, and attractive results have been obtained on two datasets, the CIFAR-100 dataset and the pedestrian re-identification Market-1501 dataset. Mutual learning does not require a powerful "teacher" network; a group of student networks trained together effectively can be stronger than networks guided by a "teacher". However, the existing deep mutual learning framework only applies a distance loss to the output features of the last layer, which makes it difficult to keep the networks fitting each other in a controllable way. This embodiment therefore improves on the existing framework by adding one-way fitting of multi-layer features on top of the prediction layer.
The model design is shown in fig. 7. Each individual network is trained with its original softmax loss, and a loss term based on the KL divergence with respect to the other network's predicted values is added, namely:
L1 = LCE1 + DKL(p2 || p1)
L2 = LCE2 + DKL(p1 || p2)
wherein the KL divergence is calculated as:
DKL(p2 || p1) = Σi p2(i) log( p2(i) / p1(i) )
where p1 and p2 are the outputs of network 1 and network 2, respectively; DKL is the KL divergence distance similarity calculation; LCE1 and LCE2 are the image recognition loss functions of network 1 and network 2, respectively, i.e., the distance functions between the training output results and the real results; and L1 and L2 are the total loss functions of model1 and model2, respectively.
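A sketch of these two losses in PyTorch is given below; the function name is illustrative, and the other network's prediction is detached so that each loss only updates its own model, in line with the training steps described later:

    import torch
    import torch.nn.functional as F

    def mutual_learning_losses(logits1, logits2, labels):
        # L1 = LCE1 + DKL(p2 || p1),  L2 = LCE2 + DKL(p1 || p2)
        p1, p2 = F.softmax(logits1, dim=1), F.softmax(logits2, dim=1)
        lce1 = F.cross_entropy(logits1, labels)
        lce2 = F.cross_entropy(logits2, labels)
        # F.kl_div(log q, p) computes sum p * (log p - log q) = DKL(p || q)
        dkl_p2_p1 = F.kl_div(torch.log(p1 + 1e-8), p2.detach(), reduction="batchmean")
        dkl_p1_p2 = F.kl_div(torch.log(p2 + 1e-8), p1.detach(), reduction="batchmean")
        return lce1 + dkl_p2_p1, lce2 + dkl_p1_p2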
The function implementation flow shown in fig. 7 is as follows:
For an image of arbitrary size P x Q, the network first scales it to a fixed size M x N, and the M x N image is fed into the feature extraction network. The result is then fed into the recognition network: the method uses a Faster R-CNN network that applies a 3x3 convolution, then generates the coordinate offsets of the recognition boxes and computes the candidate regions; the RoI pooling layer uses the candidate regions to extract the candidate region features from the feature map produced by the feature network and sends them to the subsequent fully connected and classification network for the classification task.
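For illustration, a minimal sketch of wiring a custom feature extractor into torchvision's Faster R-CNN is given below; the placeholder backbone, channel count, anchor sizes and number of classes are assumptions of this sketch, not values taken from the patent:

    import torch
    import torch.nn as nn
    from torchvision.models.detection import FasterRCNN
    from torchvision.models.detection.rpn import AnchorGenerator
    from torchvision.ops import MultiScaleRoIAlign

    # Placeholder backbone standing in for the lightweight deformable feature extractor.
    backbone = nn.Sequential(
        nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    )
    backbone.out_channels = 128   # FasterRCNN needs the backbone to expose its output channels

    anchor_gen = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                 aspect_ratios=((0.5, 1.0, 2.0),))
    roi_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)

    model = FasterRCNN(backbone, num_classes=21,
                       rpn_anchor_generator=anchor_gen, box_roi_pool=roi_pool)
    model.eval()
    with torch.no_grad():
        print(model([torch.rand(3, 300, 400)])[0].keys())   # dict_keys(['boxes', 'labels', 'scores'])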
Training:
Step 1: initialize two different networks, or initialize two identical networks in different ways;
Step 2: forward model1 to obtain p1, and forward model2 to obtain p2;
Step 3: calculate the cross entropy loss of model1, and calculate the KL divergence with p1 as the variable;
Step 4: update model1 by back-propagating the loss of step 3;
Step 5: forward model2 to obtain p2, and forward model1 to obtain p1;
Step 6: calculate the cross entropy loss of model2, and calculate the KL divergence with p2 as the variable;
Step 7: update model2 by back-propagating the loss of step 6;
Step 8: repeat steps 2-7 until the models converge, that is, until the output results become stable and the accuracy no longer rises.
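Under the same assumptions as the loss sketch above (two classification networks model1 and model2, a data loader yielding images and labels, and one optimizer per model, all illustrative), steps 2-7 can be sketched as the following alternating update:

    import torch
    import torch.nn.functional as F

    def train_mutually(model1, model2, loader, opt1, opt2, epochs=10):
        for _ in range(epochs):
            for images, labels in loader:
                # Steps 2-4: update model1 against model2's fixed prediction p2.
                p2 = F.softmax(model2(images), dim=1).detach()
                logits1 = model1(images)
                loss1 = F.cross_entropy(logits1, labels) + \
                        F.kl_div(F.log_softmax(logits1, dim=1), p2, reduction="batchmean")
                opt1.zero_grad(); loss1.backward(); opt1.step()
                # Steps 5-7: update model2 against model1's fixed prediction p1.
                p1 = F.softmax(model1(images), dim=1).detach()
                logits2 = model2(images)
                loss2 = F.cross_entropy(logits2, labels) + \
                        F.kl_div(F.log_softmax(logits2, dim=1), p1, reduction="batchmean")
                opt2.zero_grad(); loss2.backward(); opt2.step()
        return model1, model2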
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they are not intended to limit the scope of the present invention, and those skilled in the art should understand that various modifications and variations can be made on the basis of the technical solution of the present invention without inventive effort.

Claims (8)

1. A target detection method based on lightweight deformable convolution is characterized by comprising the following steps:
constructing a depth separable feature extraction network as a source network;
constructing a target network that uses the same algorithm structure as the source network but without the deformable convolution layer replacement;
approximating the outputs of the source network and the target network with a distance loss by mimicking the features extracted by multiple network layers of the source network;
at the last feature extraction layer before the classification layer, making the source network approach the output of the corresponding target network layer;
after joint training of the mutual learning framework of the source network and the target network, obtaining a new feature extraction network model;
and using the new feature extraction network model as a feature extractor to extract features of the image data, thereby completing target detection.
2. The target detection method based on lightweight deformable convolution according to claim 1, wherein constructing a depth separable feature extraction network as a source network specifically comprises: replacing the original convolution layers and pooling layers in the feature extraction network with deformable convolution and deformable pooling; and using the obtained new feature extraction network as the source network.
3. The target detection method based on lightweight deformable convolution according to claim 2, wherein the deformable convolution adds a 2D offset to the sampling grid positions of the standard convolution so that the sampling grid can deform freely.
4. The target detection method based on lightweight deformable convolution according to claim 2, wherein the deformable pooling adds an offset to each regular bin of the original region-of-interest pooling; the offsets are learned from the preceding features and the region of interest, enabling adaptive localization of targets with different shapes.
5. The target detection method based on lightweight deformable convolution according to claim 1, wherein approximating the outputs of the source network and the target network with a distance loss by mimicking the features extracted by multiple network layers of the source network specifically comprises:
using a KL divergence distance loss function and a sine distance loss objective function to make the multi-layer outputs of the two networks approach each other layer by layer, and extracting a neural network model whose performance approaches that of the target network and which is lighter in weight.
6. A lightweight deformable convolution-based target detection system, comprising:
means for constructing a depth separable feature extraction network as a source network;
means for constructing a target network that uses the same algorithm structure as the source network but without the deformable convolution layer replacement;
means for approximating the outputs of the source network and the target network with a distance loss by mimicking the features extracted by multiple network layers of the source network;
means for making the source network approach the output of the corresponding target network layer at the last feature extraction layer before the classification layer;
means for obtaining a new feature extraction network model after joint training of the mutual learning framework of the source network and the target network;
and means for extracting features of the image data with the new feature extraction network model as a feature extractor to complete target detection.
7. A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; the computer-readable storage medium storing a plurality of instructions adapted to be loaded by the processor and to execute the target detection method based on lightweight deformable convolution according to any one of claims 1-5.
8. A computer-readable storage medium having a plurality of instructions stored therein, wherein the instructions are adapted to be loaded by a processor of a terminal device and to execute the target detection method based on lightweight deformable convolution according to any one of claims 1-5.
CN201911001669.1A 2019-10-21 2019-10-21 Target detection method and system based on lightweight deformable convolution Pending CN110751214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911001669.1A CN110751214A (en) 2019-10-21 2019-10-21 Target detection method and system based on lightweight deformable convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911001669.1A CN110751214A (en) 2019-10-21 2019-10-21 Target detection method and system based on lightweight deformable convolution

Publications (1)

Publication Number Publication Date
CN110751214A true CN110751214A (en) 2020-02-04

Family

ID=69279146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911001669.1A Pending CN110751214A (en) 2019-10-21 2019-10-21 Target detection method and system based on lightweight deformable convolution

Country Status (1)

Country Link
CN (1) CN110751214A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564025A (en) * 2018-04-10 2018-09-21 广东电网有限责任公司 A kind of infrared image object identification method based on deformable convolutional neural networks
CN109409443A (en) * 2018-11-28 2019-03-01 北方工业大学 Multi-scale deformable convolution network target detection method based on deep learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564025A (en) * 2018-04-10 2018-09-21 广东电网有限责任公司 A kind of infrared image object identification method based on deformable convolutional neural networks
CN109409443A (en) * 2018-11-28 2019-03-01 北方工业大学 Multi-scale deformable convolution network target detection method based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIFENG DAI et al.: "Deformable Convolutional Networks", ICCV *
YING ZHANG et al.: "Deep Mutual Learning", arXiv *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368839A (en) * 2020-02-13 2020-07-03 沈阳工业大学 Quick identification method based on light DAB-Net network
CN111461211A (en) * 2020-03-31 2020-07-28 中国科学院计算技术研究所 Feature extraction method for lightweight target detection and corresponding detection method
WO2021196240A1 (en) * 2020-04-03 2021-10-07 清华大学 Representation learning algorithm oriented to cross-network application
CN111582101B (en) * 2020-04-28 2021-10-01 中国科学院空天信息创新研究院 Remote sensing image target detection method and system based on lightweight distillation network
CN111582101A (en) * 2020-04-28 2020-08-25 中国科学院空天信息创新研究院 Remote sensing image detection method and system
CN111597941A (en) * 2020-05-08 2020-08-28 河海大学 Target detection method for dam defect image
CN111597941B (en) * 2020-05-08 2021-02-09 河海大学 Target detection method for dam defect image
CN111860175A (en) * 2020-06-22 2020-10-30 中国科学院空天信息创新研究院 Unmanned aerial vehicle image vehicle detection method and device based on lightweight network
CN111832576A (en) * 2020-07-17 2020-10-27 济南浪潮高新科技投资发展有限公司 Lightweight target detection method and system for mobile terminal
WO2022027987A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Image recognition model training method, and image recognition method
CN112149742A (en) * 2020-09-25 2020-12-29 济南浪潮高新科技投资发展有限公司 Intelligent labeling method based on target detection network
CN112287791A (en) * 2020-10-21 2021-01-29 济南浪潮高新科技投资发展有限公司 Intelligent violence and terrorism behavior detection method based on equipment side
CN112396053A (en) * 2020-11-25 2021-02-23 北京联合大学 Method for detecting object of all-round fisheye image based on cascade neural network
CN112733594A (en) * 2020-12-01 2021-04-30 贵州电网有限责任公司 Machine room figure re-identification method based on deformable convolutional network
CN112651346A (en) * 2020-12-29 2021-04-13 青海三新农电有限责任公司 Streaming media video identification and detection method based on deep learning
CN112668536A (en) * 2021-01-06 2021-04-16 北京理工大学 Lightweight rotating target detection and identification method based on airborne photoelectric video
CN112668536B (en) * 2021-01-06 2023-08-25 北京理工大学 Lightweight rotary target detection and identification method based on airborne photoelectric video
CN112381077A (en) * 2021-01-18 2021-02-19 南京云创大数据科技股份有限公司 Method for hiding face image information
CN112949731A (en) * 2021-03-11 2021-06-11 江苏禹空间科技有限公司 Target detection method, device, storage medium and equipment based on multi-expert model
CN112861803A (en) * 2021-03-16 2021-05-28 厦门博海中天信息科技有限公司 Image identification method, device, server and computer readable storage medium
WO2022193312A1 (en) * 2021-03-19 2022-09-22 京东方科技集团股份有限公司 Electrocardiogram signal identification method and electrocardiogram signal identification apparatus based on multiple leads
CN113536896A (en) * 2021-05-28 2021-10-22 国网河北省电力有限公司石家庄供电分公司 Small target detection method, device and storage medium based on improved fast RCNN
CN113536896B (en) * 2021-05-28 2022-07-08 国网河北省电力有限公司石家庄供电分公司 Insulator defect detection method and device based on improved Faster RCNN and storage medium
CN114549534A (en) * 2022-01-17 2022-05-27 中国矿业大学(北京) Mining area land utilization identification method, device, equipment, medium and product
CN114663671A (en) * 2022-02-21 2022-06-24 佳都科技集团股份有限公司 Target detection method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110751214A (en) Target detection method and system based on lightweight deformable convolution
CN109949255B (en) Image reconstruction method and device
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
Zhou et al. Contextual ensemble network for semantic segmentation
Teow Understanding convolutional neural networks using a minimal model for handwritten digit recognition
EP3640844A1 (en) Target recognition method and apparatus for a deformed image
CN112236779A (en) Image processing method and image processing device based on convolutional neural network
WO2022247147A1 (en) Methods and systems for posture prediction
CN110222718B (en) Image processing method and device
Peng et al. Video action recognition via neural architecture searching
CN113705769A (en) Neural network training method and device
CN104268841A (en) Infrared image preprocessing method based on compressed sensing algorithm
CN111062410B (en) Star information bridge weather prediction method based on deep learning
Partovi et al. Roof Type Selection based on patch-based classsification using deep learning for high Resolution Satellite Imagery
CN111931901A (en) Neural network construction method and device
CN113536970A (en) Training method of video classification model and related device
Gonçalves et al. Carcass image segmentation using CNN-based methods
Gao et al. A mobile application for plant recognition through deep learning
CN108062559A (en) A kind of image classification method based on multiple receptive field, system and device
Liu et al. Multi-scale promoted self-adjusting correlation learning for facial action unit detection
Vallurupalli et al. Efficient semantic segmentation using gradual grouping
Quiroga et al. A study of convolutional architectures for handshape recognition applied to sign language
CN116993760A (en) Gesture segmentation method, system, device and medium based on graph convolution and attention mechanism
CN110826604A (en) Material sorting method based on deep learning
Ocegueda-Hernandez et al. A lightweight convolutional neural network for pose estimation of a planar model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200204