CN110796239A - Deep learning target detection method based on channel and space fusion perception - Google Patents

Deep learning target detection method based on channel and space fusion perception

Info

Publication number
CN110796239A
Authority
CN
China
Prior art keywords
channel
perception
space
neural network
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911048207.5A
Other languages
Chinese (zh)
Inventor
吴林煌
杨绣郡
范振嘉
陈志峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201911048207.5A priority Critical patent/CN110796239A/en
Publication of CN110796239A publication Critical patent/CN110796239A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a deep learning target detection method based on channel and spatial fusion perception, which comprises: first constructing a channel and spatial fusion perception module, embedding the module into a deep neural network architecture, and then performing target detection on a target picture with the improved deep neural network architecture. The channel and spatial fusion perception module is constructed as follows: channel perception is first applied to the originally input feature map, followed by cascaded spatial perception. The method neither deepens nor widens the network and introduces no additional space vectors, while ensuring both real-time performance and accuracy.

Description

Deep learning target detection method based on channel and space fusion perception
Technical Field
The invention relates to the technical field of image recognition, in particular to a deep learning target detection method based on channel and space fusion perception.
Background
Currently, target detection frameworks based on deep learning are mainly divided into two categories: two-stage detectors and single-stage detectors. Two-stage target detection, named after its two-stage processing of an image and also called the region-based approach, abstracts detection into two processes: first, a number of regions that may contain objects, i.e. local crops of the image called candidate regions, are proposed from the image; second, a deep convolutional neural network encodes the feature vectors of the generated regions, and these feature vectors are used to predict the category of each candidate region, thereby obtaining the class of the object in each region. Two-stage detection algorithms rely on neural networks with high computational cost and trade speed for precision, which gave rise to single-stage target detection algorithms. A single-stage detector has no intermediate region-proposal process and obtains the prediction results directly from the image; typical examples include YOLO and SSD. The network architectures of these direct-regression algorithms are comparatively simpler; compared with two-stage target detection such as Mask R-CNN, their speed is increased by about 8 times while their accuracy is reduced by about 12%.
over the past few years, improvements in detector performance have largely relied on increasing the depth or width of the network: compared with AlexNet, VGG-16 increases the network depth by stacking convolution layers to improve the expressive power of the model; ResNet trains very deep networks effectively through residual blocks, with model depths continuing to increase (e.g., from 16 layers to 152 layers), enabling high-capacity models that improve performance; GoogleNet uses the inception module to apply convolution kernels of different scales to the same feature map, increasing the model width to improve learning. Although performance can be improved by simply pursuing deeper and wider network structures, the more complex the network, the higher the computational cost and the lower the inference speed. DSSD and RetinaNet, for example, can compete with the performance of top two-stage networks, but their gains come from the extremely deep ResNet-101 backbone, which limits efficiency.
In addition to relying solely on network depth, many current methods improve network performance by designing functional modules that enhance the feature characterization capability: the FPN combines deep features with shallow features, strengthening the spatially strong shallow features with richer semantic information from deeper layers; on the basis of SSD, the DSSD fuses the features of the deeper base network ResNet-101 with deconvolution layers and, through skip connections, provides better characterization capability for the shallow feature maps, but its speed drops noticeably while the performance improves markedly. These methods do not deepen the model to enhance the feature representation of the network; instead, they laterally enhance the learning of deep features in the convolutional neural network by directly performing operations such as superposition, sampling and concatenation on the feature maps. However, these operations reprocess the feature map as a whole rather than its interior, and they all perform feature fusion by introducing additional space vectors; they do not particularly emphasize, from inside the feature map, the relative importance of its channels or spatial positions.
Disclosure of Invention
In view of this, the present invention provides a method for detecting a deep learning target by channel and space fusion sensing, which does not deepen the depth or width of a network, does not introduce additional space vectors, and simultaneously ensures real-time performance and precision.
The invention is realized by adopting the following scheme: a deep learning target detection method based on channel and space fusion perception specifically comprises the following steps:
constructing a channel and space fusion sensing module, embedding the channel and space fusion sensing module into a deep neural network architecture, and carrying out target detection on a target picture by using the improved deep neural network architecture;
the channel and spatial fusion perception module is constructed as follows: channel perception is first applied to the originally input feature map, followed by cascaded spatial perception.
Further, the channel sensing of the feature map of the original input specifically includes the following steps:
step S11: feature map F to be inputH×W×CBased on channel slicing, dividing into C slice zones Z ═ Z1,z2,...,zCTherein of
Figure BDA0002254648920000031
Represents the ith patch; h, W, C represents the height, width and channel number of the characteristic diagram;
step S12: carrying out global average pooling operation on the C slice areas to obtain a vector U1={u1,u2,...,uC},
Figure BDA0002254648920000032
The kth element ukThe calculation formula of (2) is as follows:
Figure BDA0002254648920000033
in the formula, zk(i, j) is the patch zkPixel values corresponding to the upper coordinates (i, j);
step S13: will U1Through a full connection layer operation and ReLu activation, the method is obtained
Figure BDA0002254648920000034
Wherein W1Refers to the weight coefficients of the full convolution of the layer,
Figure BDA0002254648920000035
r refers to the scaling factor, δ (-) refers to the ReLu activation operation, resulting in a dimension of
Figure BDA0002254648920000036
Step S14: will be provided with
Figure BDA0002254648920000037
Through a full connection layer operation and activated by using a sigmoid function, obtaining
Figure BDA0002254648920000038
Wherein W2Refers to the weight coefficients of the full convolution of the layer,
Figure BDA0002254648920000039
σ (-) refers to sigmoid activation operation, with the resulting dimension being
Figure BDA00022546489200000310
Step S15: will U2The value of the channel is multiplied by the channel slice division vector Z of the input original feature map F correspondingly to obtain the feature map F obtained by channel perception1The formula is as follows:
F1=U2·Z;
where, denotes the kth vector Z in ZkAll values of (2) are multiplied by U2K value of (U)2k
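For illustration only, the channel perception of steps S11 to S15 can be sketched in PyTorch as follows; this is a minimal sketch rather than the patented implementation, and the class name ChannelPerception and the default scaling factor r = 16 are assumptions made for the example.

import torch
import torch.nn as nn

class ChannelPerception(nn.Module):
    """Channel perception (steps S11-S15): global average pooling over each channel
    slice, two fully connected layers with ReLU/sigmoid, then per-channel rescaling
    of the input feature map."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # step S12: U1, shape (N, C, 1, 1)
        self.fc1 = nn.Linear(channels, channels // r)   # step S13: W1, reduces C to C/r
        self.fc2 = nn.Linear(channels // r, channels)   # step S14: W2, restores C
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = f.shape
        u1 = self.pool(f).view(n, c)                              # step S12
        u2 = self.sigmoid(self.fc2(self.relu(self.fc1(u1))))      # steps S13-S14
        return f * u2.view(n, c, 1, 1)                            # step S15: F1 = U2 . Z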
Further, the cascade for performing spatial perception specifically includes the following steps:
step S21: feature map F obtained by channel perception1Based on spatial slicing, the slices are divided into H multiplied by W slices, namely Z '═ Z'1,z'2,z'3,...,z'H×MTherein of
Figure BDA0002254648920000041
So vector
Figure BDA0002254648920000042
Wherein K ═ hxw;
step S22: all vectors Z' obtained in step S21 are weighted by W3The convolution layer of (1) also obtains a spatial perception weight while compressing the channel, and the calculation formula is U3=W3Z' wherein
Figure BDA0002254648920000043
Symbol denotes each
Figure BDA0002254648920000044
And
Figure BDA0002254648920000045
linear combination to obtain a value
Figure BDA0002254648920000046
Then mapping the probability to [0,1] by using sigmoid activation function]In the interval, the formula is as follows:
U3=σ(W3*Z′);
in the formula, W3Sigma (-) is sigmoid activated function expression as parameter weight of convolution layer;
step S23: will U3Feature map F obtained by the above numerical value and channel perception1The values on the space slice vector Z' are correspondingly multiplied to obtain a feature map F after space perception cascade2The calculation formula is as follows:
F2=U3⊙Z;
wherein ⊙ represents the kth vector Z 'in Z'kAll values of
Figure BDA0002254648920000047
All ride on U3K value of
Figure BDA0002254648920000048
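Continuing the sketch above (and reusing the ChannelPerception class defined there), the spatial perception of steps S21 to S23 and the cascade of the two stages might be written as follows; the 1×1 convolution stands in for W_3, and the class names are again illustrative assumptions, not the patent's code.

import torch
import torch.nn as nn

class SpatialPerception(nn.Module):
    """Spatial perception (steps S21-S23): a 1x1 convolution compresses the channels
    into a single spatial weight map U3, which rescales F1 per spatial position."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)   # step S22: W3
        self.sigmoid = nn.Sigmoid()

    def forward(self, f1: torch.Tensor) -> torch.Tensor:
        u3 = self.sigmoid(self.conv(f1))     # U3 = sigma(W3 * Z'), shape (N, 1, H, W)
        return f1 * u3                       # step S23: F2 = U3 (.) Z'

class FusionPerception(nn.Module):
    """Channel perception followed by cascaded spatial perception."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.channel = ChannelPerception(channels, r)
        self.spatial = SpatialPerception(channels)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.spatial(self.channel(f))

# quick shape check on a random feature map: output keeps the input size
x = torch.randn(1, 512, 16, 16)
assert FusionPerception(512)(x).shape == x.shape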
Further, the method of the embodiment specifically includes the following steps:
step S1: acquiring an image training data set and a test data set and respective label files thereof from an MSCOCO or PASCAL VOC official website;
step S2: scaling the images of the training data set to the same size, and then inputting the images into a deep neural network;
step S3: constructing a deep neural network architecture;
step S4: constructing a channel and space fusion sensing module, and embedding the channel and space fusion sensing module into a built deep neural network architecture;
step S4 is as described above, and includes the following two steps:
step S41: the input channel and the spatial fusion perception module are a layer with the size of FH×W×CH, W, C are respectively the height, width and channel number of the characteristic diagram. The characteristic diagram is firstly based on channel slices through channel perception, and channel perception is carried out through operations of global average pooling, full connection (FC for short), activation function and the like, so that the weight distribution of each channel is continuously learned and updated, and the method is combined with the methodMultiplying the initial characteristic diagram F to obtain the size F1 H×W×CA characteristic diagram of (1);
step S42: f obtained in step S411The feature graph is based on space slices and carries out space perception through operations such as convolution, activation functions and the like, so that the weight distribution of each space is continuously learned and updated, and the feature graph F is obtained through channel perception1Multiplying to obtain the final size F2 H×W×CThe characteristic diagram of (1).
Step S5: training the modified deep neural network, and storing each weighted value of the neural network;
step S6: and (5) inputting the image of the test data set downloaded in the step (S1) into the trained deep neural network of the embedded channel and spatial fusion perception module, and outputting a detection result. And the target detection effect is evaluated.
Further, in step S5, the training uses the average accuracy mAP as an evaluation index of the target detection, and the calculation is as follows:
Figure BDA0002254648920000051
wherein R is recall rate and P is accuracy rate;
when the recall rate and the accuracy rate are calculated, α is taken as the coincidence rate between the predicted bounding box and the labeled real bounding box, meanwhile, the predicted box with α being more than or equal to 0.5 is taken as a positive example, the predicted box with α being less than 0.5 is taken as a negative example, wherein the calculation of α is as follows:
Figure BDA0002254648920000061
in the formula, BoxpreFor predicted bounding boxes, BoxgtTo label the real bounding box, ∩ is the intersection area between the two, and ∪ is the combined area of the two.
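As a concrete illustration of the overlap ratio α and the positive/negative assignment described above, a small Python sketch (with boxes represented as (x1, y1, x2, y2) tuples, a representation assumed for the example) might look like this:

def iou(box_pre, box_gt):
    """Overlap ratio alpha between a predicted box and a ground-truth box,
    both given as (x1, y1, x2, y2)."""
    x1 = max(box_pre[0], box_gt[0])
    y1 = max(box_pre[1], box_gt[1])
    x2 = min(box_pre[2], box_gt[2])
    y2 = min(box_pre[3], box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_pre = (box_pre[2] - box_pre[0]) * (box_pre[3] - box_pre[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_pre + area_gt - inter
    return inter / union if union > 0 else 0.0

def is_positive(box_pre, box_gt, threshold=0.5):
    """A prediction counts as a positive example when alpha >= 0.5."""
    return iou(box_pre, box_gt) >= threshold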
Compared with the prior art, the invention has the following beneficial effects: the method neither deepens nor widens the network and introduces no additional space vectors, while ensuring both real-time performance and accuracy; it offers a new approach for the field of target detection, is highly portable, can be embedded into various deep learning networks, and can be widely applied in fields that require target recognition.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a diagram illustrating the effect of partial data set downloaded from the MSCOCO official website according to the embodiment of the present invention.
Fig. 3 is a diagram illustrating the effect of the tag file downloaded from the MSCOCO official website according to the embodiment of the present invention.
Fig. 4 is a block diagram of a deep learning network architecture of ResNet-18 according to an embodiment of the present invention.
Fig. 5 is a block diagram of a residual connection structure according to an embodiment of the present invention.
Fig. 6 is an architecture block diagram of embedding a channel and space fusion aware module into a deep neural network architecture according to an embodiment of the present invention.
Fig. 7 is a block diagram of a channel and spatial perception fusion module architecture according to an embodiment of the present invention. Wherein, (a) is a channel perception module block diagram, and (b) is a space perception module block diagram.
Fig. 8 is a block diagram of an inference phase structure according to an embodiment of the present invention.
Fig. 9 is a diagram of the results uploaded to the CodaLab official website for evaluation according to an embodiment of the present invention.
FIG. 10 is a diagram showing the CodaLab official website evaluation results according to the embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
As shown in fig. 1, the present embodiment provides a deep learning target detection method based on channel and space fusion sensing, which specifically includes:
constructing a channel and space fusion sensing module, embedding the channel and space fusion sensing module into a deep neural network architecture, and carrying out target detection on a target picture by using the improved deep neural network architecture;
the channel and spatial fusion perception module is constructed as follows: channel perception is first applied to the originally input feature map, followed by cascaded spatial perception.
In this embodiment, performing channel perception on the originally input feature map specifically comprises the following steps:
step S11: the input feature map F ∈ R^(H×W×C) is sliced along the channel dimension into C slices Z = {z_1, z_2, ..., z_C}, where z_i ∈ R^(H×W×1) denotes the i-th slice, and H, W and C are the height, width and number of channels of the feature map;
step S12: global average pooling is applied to the C slices to obtain a vector U_1 = {u_1, u_2, ..., u_C} with U_1 ∈ R^(1×1×C), where the k-th element u_k is computed as:
u_k = (1 / (H × W)) · Σ_{i=1..H} Σ_{j=1..W} z_k(i, j);
where z_k(i, j) is the pixel value of slice z_k at coordinates (i, j);
step S13: U_1 is passed through a fully connected layer followed by a ReLU activation, giving U' = δ(W_1 · U_1), where W_1 ∈ R^((C/r)×C) denotes the weights of that fully connected layer, r is a scaling factor and δ(·) denotes the ReLU activation; the resulting dimension is 1×1×(C/r);
step S14: U' is passed through another fully connected layer and activated with a sigmoid function, giving U_2 = σ(W_2 · U'), where W_2 ∈ R^(C×(C/r)) denotes the weights of that fully connected layer and σ(·) denotes the sigmoid activation; the resulting dimension is 1×1×C;
step S15: the values of U_2 are multiplied element-wise with the channel slices Z of the original input feature map F to obtain the channel-perceived feature map F_1, as follows:
F_1 = U_2 · Z;
where · denotes that all values of the k-th slice z_k in Z are multiplied by the k-th value u_{2,k} of U_2.
In this embodiment, the cascaded spatial perception specifically comprises the following steps:
step S21: the channel-perceived feature map F_1 is sliced along the spatial dimensions into H×W slices Z' = {z'_1, z'_2, z'_3, ..., z'_{H×W}}, where z'_i ∈ R^(1×1×C), so the slice vector Z' ∈ R^(H×W×C), with K = H×W slices in total;
step S22: all vectors Z' obtained in step S21 are passed through a convolution layer with weights W_3 (a 1 × 1 convolution), which compresses the channels and simultaneously produces the spatial perception weights; the computation is U_3 = W_3 * Z', where W_3 ∈ R^(1×1×C) and the symbol * denotes the linear combination of each slice z'_k ∈ R^(1×1×C) with W_3, yielding one value u_{3,k} per spatial position; the result is then mapped into the interval [0, 1] with the sigmoid activation function, giving:
U_3 = σ(W_3 * Z');
where W_3 is the parameter weight of the convolution layer and σ(·) is the sigmoid activation function;
step S23: the values of U_3 are multiplied element-wise with the values on the spatial slice vector Z' of the channel-perceived feature map F_1 to obtain the feature map F_2 after cascaded spatial perception, as follows:
F_2 = U_3 ⊙ Z';
where ⊙ denotes that all values of the k-th slice z'_k in Z' are multiplied by the k-th value u_{3,k} of U_3.
In this embodiment, the method specifically comprises the following steps:
step S1: acquire an image training data set, a test data set and their respective label files from the MSCOCO or PASCAL VOC official website;
step S2: scale the images of the training data set to the same size and input them into the deep neural network;
step S3: build the deep neural network architecture;
step S4: construct the channel and spatial fusion perception module and embed it into the built deep neural network architecture; step S4 comprises the following two steps:
step S41: the input to the channel and spatial fusion perception module is a feature map of size F^(H×W×C), where H, W and C are the height, width and number of channels of the feature map; the feature map is first sliced along channels for channel perception, carried out through global average pooling, fully connected (FC) layers and activation functions, so that the weight distribution of each channel is continuously learned and updated and is multiplied with the initial feature map F to obtain a feature map of size F_1^(H×W×C);
step S42: the feature map F_1 obtained in step S41 is sliced along the spatial dimensions for spatial perception, carried out through convolution and activation functions, so that the weight distribution of each spatial position is continuously learned and updated and is multiplied with the channel-perceived feature map F_1 to obtain the final feature map of size F_2^(H×W×C);
step S5: train the modified deep neural network and save each weight value of the neural network;
step S6: input the images of the test data set downloaded in step S1 into the trained deep neural network with the embedded channel and spatial fusion perception module, output the detection results, and evaluate the target detection performance.
In this embodiment, in step S5, the training uses the mean average precision mAP as the evaluation index for target detection; the average precision of each class is computed as:
AP = ∫_0^1 P(R) dR;
and mAP is the mean of the AP values over all classes, where R is the recall and P is the precision;
when computing the recall and precision, α is taken as the overlap ratio between the predicted bounding box and the labeled ground-truth bounding box; predicted boxes with α ≥ 0.5 are regarded as positive examples and predicted boxes with α < 0.5 as negative examples, where α is computed as:
α = area(Box_pre ∩ Box_gt) / area(Box_pre ∪ Box_gt);
where Box_pre is the predicted bounding box, Box_gt is the labeled ground-truth bounding box, ∩ denotes the intersection of the two and ∪ denotes the union of the two.
This embodiment provides a novel module, called channel and spatial fusion perception, which enhances deep feature learning in a neural network and can be embedded into any neural network. The original feature map is first partitioned along channels and the feature distribution over the channels is relearned through fully connected layers; the obtained weights are multiplied with all channels. The result is then partitioned along the spatial dimensions, the feature distribution over spatial positions is relearned through a convolution layer, and the obtained weights are multiplied with all spatial positions to produce the fusion-perceived feature map. In this way the features of the feature map are redistributed over both channels and space, so that important features carry more weight and negligible features have less influence.
Particularly, the method comprises the steps of data set downloading, deep neural network construction, channel and space fusion perception module embedding (channel perception and space perception) and target detection performance evaluation;
The data set downloading is performed by downloading a data set and the corresponding label files, such as the MSCOCO and PASCAL VOC data sets, from the currently mainstream data set websites.
PASCAL VOC provides an excellent standardized data set for image recognition and classification; the challenge has been held every year since 2005. It contains 4 major categories (vehicle, household, animal and person), subdivided into 20 subclasses, and is mainly concerned with classification and detection tasks. Two versions of PASCAL VOC are commonly used: the 2007 and 2012 versions. The combination of 2007 trainval and 2012 trainval is commonly used as the training set, with 16551 pictures containing 40058 objects in total, and 2007 test is commonly used as the validation set, with 4952 pictures containing 12032 objects.
The MSCOCO data set is a large image data set developed and maintained by Microsoft, collecting a large number of everyday scene pictures containing common objects. The detection task covers 12 major categories divided into 80 subclasses; each class contains more instances and the distribution is balanced, and compared with PASCAL VOC, MSCOCO also contains more small objects. train2017 is commonly used as the training set and contains 118287 pictures; val2017 is commonly used as the validation set and contains 5000 pictures; test2017 is commonly used as the test set and contains 20288 pictures.
The deep neural network is built by selecting a suitable framework from current deep neural networks; mainstream detector backbones include VGG, AlexNet, ResNet and GoogleNet. Smaller neural networks have fewer parameters to learn in the training stage and are therefore faster in both the training and inference stages.
The channel and spatial fusion perception module in this embodiment is embedded at a suitable position of the deep neural network architecture. The module first performs channel perception on the originally input feature map, followed by cascaded spatial perception.
In channel perception, the input feature map is first sliced along channels and then globally average-pooled, which extracts the global information of the feature map and compresses the spatial dimensions; next, a fully connected operation and a ReLU activation reduce the number of channels to lower the computational complexity; then another fully connected layer and a sigmoid function map the values into [0, 1] and restore the original channel dimension; finally, the sliced vector is multiplied element-wise with the vector obtained after the sigmoid operation to obtain the channel perception result. The two fully connected layers learn the features of the input feature map end to end, and this operation yields the perception over the channels.
In spatial perception, the feature map obtained through channel perception is first sliced along the spatial dimensions, then convolved and activated with a sigmoid function to normalize the spatial perception weights; finally, the normalized vector is multiplied element-wise with the previously obtained spatial vector, giving the result of channel and spatial fusion perception. As the network keeps learning, the convolution layer continuously updates its parameters and thereby adaptively adjusts the spatial coefficients toward ignoring or emphasizing positions. Through the cascade of channel and spatial perception, the weights are redistributed, making the network more sensitive to useful information while suppressing useless information more clearly.
When evaluating the target detection performance, the test data set downloaded from the official website is input into the trained deep neural network with the embedded channel and spatial fusion perception module to obtain a test result file, which is compared against the annotated label file provided by the official website to obtain the evaluation result.
Specifically, the following describes a specific embodiment of the present embodiment with reference to the drawings.
The step S1 specifically includes:
step S1: acquiring an image training data set and a test data set from an MSCOCO official website, and respective label files thereof; in this embodiment, train2017 is used as a training data set, and test2017 is used as a testing data set. A partial screenshot of a training data set is shown in fig. 2, and the label files of the training set and the test set are shown in fig. 3;
the step S2 specifically includes:
step S2: the image size of the training data set is scaled to 512 x 512, and then input into a deep neural network;
in this embodiment, the step S3 specifically includes the following steps:
step S3: as shown in fig. 4, a deep learning network architecture of ResNet-18 is built, and the size of the input image is 512 × 512 × 3, wherein the length and the width of the image are 512, and the number of channels is 3;
the deep neural network architecture involved in the step S3 is as follows:
an input layer: since the input is a 512 × 512 RGB image whose dimension in three-dimensional space is 512 × 512 × 3, the image vector size of the input layer is [512, 512, 3];
conv1: in this embodiment, the size of the first convolution layer is 7 × 7, the convolution depth is 64 and the stride is set to 2 (before convolution, the original input image is zero-padded with 2 rows/columns on the top, bottom, left and right), which is equivalent to convolving the pixels of the input image under the window with 64 sliding windows of size 7 × 7 at stride 2; the resulting image vector size is [256, 256, 64];
in this embodiment, the first pooling stride is set to 2, the pooling size is 3 × 3 and the pooling mode is max pooling, i.e. the image pixels obtained from convolution layer 1 are scanned with a 3 × 3 sliding window at stride 2 and the maximum value under the window is kept, so the image vector size obtained after this pooling layer is [128, 128, 64];
conv2_x: as shown in fig. 5, this is the residual connection part of the ResNet-18 network in this embodiment; x denotes the input of the current first layer and is routed in two directions: one is a directly connected identity mapping that skips two 3 × 3 convolution layers of depth 64 and feeds the value directly into the ReLU activation of the subsequent layer, and the other is the output mapping H(X) through two weight layers (a convolution layer followed by batch normalization (BN) and a ReLU activation, performed twice); F(X) is equivalent to a residual function, as shown in the following formula:
F(X) = H(X) - X;
conv3_x, conv4_x and conv5_x are similar in structure to conv2_x; the difference lies in the vector size of the input x and the convolution depth of each convolution layer, and the vector size of the image output at conv5_x is [16, 16, 512];
in this embodiment, the size of the first deconvolution is set to 4 × 4, the convolution depth to 256 and the stride to 2, which corresponds to mapping each pixel of the feature map output by conv5_x onto a 4 × 4 sliding window with 256 deconvolution kernels at stride 2; the image vector size of the output map is [32, 32, 256];
similarly, deconv2_x and deconv3_x have a structure similar to deconv1_x, the only difference being the vector size of the input feature map; the vector size of the image output at deconv3_x is [128, 128, 64];
an output layer: finally, the vector obtained from deconv3_x passes through a 3 × 3 convolution layer, a ReLU activation layer and a 1 × 1 convolution layer, and the network outputs the probabilities of the 80 classes (the 80 classes labeled by the COCO category labels) and the four coordinates of the predicted bounding box (the upper-left and lower-right corner coordinates of the bounding box).
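The overall network of conv1 through the output layer can be approximated with the sketch below; it uses the torchvision ResNet-18 as the backbone and three 4 × 4, stride-2 deconvolution stages. This is a rough, assumed reading of the description (the intermediate deconvolution depths and the class name DetectorSketch are illustrative assumptions), not the exact network of the embodiment.

import torch
import torch.nn as nn
import torchvision

class DetectorSketch(nn.Module):
    """ResNet-18 backbone (conv1 .. conv5_x), three 4x4 stride-2 deconvolutions,
    and a head producing 80 class maps plus 4 box-coordinate maps."""
    def __init__(self, num_classes: int = 80):
        super().__init__()
        backbone = torchvision.models.resnet18()
        # keep everything up to conv5_x (drops the final average pooling and fc layer)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        deconv_channels = [512, 256, 128, 64]
        layers = []
        for c_in, c_out in zip(deconv_channels[:-1], deconv_channels[1:]):
            layers += [
                nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            ]
        self.deconv = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes + 4, kernel_size=1),  # 80 class maps + 4 box coords
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.deconv(self.backbone(x)))

# a 512x512 input reaches conv5_x at 16x16; three deconvolutions give 128x128 output maps
out = DetectorSketch()(torch.randn(1, 3, 512, 512))
print(out.shape)  # torch.Size([1, 84, 128, 128])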
In this embodiment, steps S41 and S42 are specifically as follows:
As shown in fig. 6, the channel and spatial fusion perception module is embedded into the deep neural network architecture.
step S41: taking the input of deconv1_x as an example, the input to the channel and spatial fusion perception module is a feature map of size F^(16×16×512), i.e. the height and width of the feature map F are both 16 and the number of channels is 512. The feature map is first sliced along channels for channel perception, carried out through global average pooling, fully connected (FC) layers and activation functions, so that the weight distribution of each channel is continuously learned and updated and is multiplied with the initial feature map F to obtain a feature map of size F_1^(16×16×512);
step S42: the feature map F_1 obtained in step S41 is sliced along the spatial dimensions for spatial perception, carried out through convolution and activation functions, so that the weight distribution of each spatial position is continuously learned and updated and is multiplied with the channel-perceived feature map F_1 to obtain the final feature map of size F_2^(16×16×512); a usage sketch is given below.
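A hedged usage sketch of steps S41 and S42, reusing the FusionPerception and DetectorSketch classes from the earlier sketches and the concrete 16 × 16 × 512 size of the deconv1_x input; the placement of the module here is an assumed example, since the description only requires embedding it at a suitable position.

import torch

# feature map at the input of deconv1_x: batch of 1, 512 channels, 16 x 16 spatial size
f = torch.randn(1, 512, 16, 16)

# channel perception (step S41) followed by cascaded spatial perception (step S42)
fusion = FusionPerception(channels=512, r=16)
f2 = fusion(f)
print(f2.shape)  # torch.Size([1, 512, 16, 16]) -- same size, reweighted features

# one possible way to embed the module right before the deconvolution stages of the
# DetectorSketch above (an illustrative choice, not the patent's mandated position)
detector = DetectorSketch()
detector.deconv = torch.nn.Sequential(fusion, *detector.deconv)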
As shown in fig. 7 (a), S41 specifically comprises the following steps:
step S411: the input feature map F ∈ R^(16×16×512) is sliced along the channel dimension into 512 slices, i.e. Z = {z_1, z_2, z_3, ..., z_512}, where z_i ∈ R^(16×16×1) denotes the i-th slice;
step S412: global average pooling is applied to the 512 slices to obtain a vector U_1 = {u_1, u_2, u_3, ..., u_512} with U_1 ∈ R^(1×1×512), where the k-th element is computed as:
u_k = (1 / (16 × 16)) · Σ_{i=1..16} Σ_{j=1..16} z_k(i, j);
where z_k(i, j) is the pixel value of slice z_k at coordinates (i, j);
step S413: U_1 is passed through a fully connected layer followed by a ReLU activation, giving U' = δ(W_1 · U_1), where W_1 denotes the weights of that fully connected layer, r is a scaling factor (r = 16 in this embodiment) and δ(·) denotes the ReLU activation; the resulting dimension is 1×1×(512/16) = 1×1×32;
step S414: U' is passed through another fully connected layer and activated with a sigmoid function, giving U_2 = σ(W_2 · U'), where W_2 denotes the weights of that fully connected layer and σ(·) denotes the sigmoid activation; the resulting dimension is 1×1×512;
step S415: the values of U_2 are multiplied element-wise with the channel slices Z of the original input feature map F, as follows:
F_1 = U_2 · Z;
where · denotes that all values of the k-th slice z_k in Z are multiplied by the k-th value u_{2,k} of U_2.
As shown in fig. 7 (b), in this embodiment, step S42 specifically comprises the following steps:
step S421: the channel-perceived feature map F_1 is sliced along the spatial dimensions into 16 × 16 slices, i.e. Z' = {z'_1, z'_2, z'_3, ..., z'_{16×16}}, where z'_i ∈ R^(1×1×512), so the slice vector Z' ∈ R^(16×16×512), with K = 16 × 16 = 256 slices in total;
step S422: all vectors Z' obtained in step S421 are passed through a convolution layer with weights W_3, whose purpose is to obtain the spatial perception weights while compressing the channels; the computation is U_3 = W_3 * Z', where W_3 ∈ R^(1×1×512) and the symbol * denotes the linear combination of each slice z'_k with W_3, yielding one value u_{3,k} per spatial position; the result is then mapped into the interval [0, 1] with the sigmoid activation function, as shown below:
U_3 = σ(W_3 * Z');
where W_3 is the parameter weight of the convolution layer and σ(·) is the sigmoid activation function;
step S423: the values of U_3 are multiplied element-wise with the values in the spatial slice vector Z' of F_1, as follows:
F_2 = U_3 ⊙ Z';
where ⊙ denotes that all values of the k-th slice z'_k in Z' are multiplied by the k-th value u_{3,k} of U_3.
In this embodiment, step S5 is specifically as follows:
step S5: train the deep neural network with the embedded channel and spatial fusion perception module, and save each weight value of the neural network;
as shown in fig. 8, in this embodiment, the step S6 specifically includes the following steps:
step S61: the images of the MSCOCO test2017 test data set downloaded in step S1 are input into the deep neural network with the embedded channel and spatial fusion perception module trained in step S5 to obtain the json file of the final detection results; as shown in fig. 9, the json file is named, packed into a zip file and uploaded to the CodaLab website for evaluation.
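A minimal sketch of writing the detection-result json in the public MSCOCO results format and packing it into a zip archive for the CodaLab evaluation; the detections list and the file names here are illustrative dummy data, not actual results of the method.

import json
import zipfile

# each entry: COCO image id, COCO category id, box as [x, y, width, height], confidence
detections = [
    {"image_id": 397133, "category_id": 1, "bbox": [100.0, 50.0, 80.0, 160.0], "score": 0.92},
]

with open("detections_test2017_results.json", "w") as f:
    json.dump(detections, f)

# the evaluation server expects the json packed into a zip archive before uploading
with zipfile.ZipFile("detections_test2017_results.zip", "w") as zf:
    zf.write("detections_test2017_results.json")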
Step S62: the results of the inventive evaluation on MSCOCO are shown in fig. 10, where mAP is 35.6%, and AP5054.8% (AP value when α ═ 0.5), AP7538.4% (when α is 0)AP value at 75), APS17.7% (when bounding box area)<322AP value of) of the APM36.4% (when 32)2<Area of boundary frame<AP value of 96), APL48.0% (when bounding box area)>962And configured at the server to: fps (reasoning speed per second) on i9-900K CPU, 2080Ti GPU, CUDA 10.1, CUDNN7.6 and Pyorch 1.1.0 reaches 111 frames/second, so the deep learning target detection method integrating perception with space not only has strong real-time performance, but also has high performance.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims (5)

1. A deep learning target detection method based on channel and space fusion perception is characterized by comprising the following steps:
constructing a channel and space fusion sensing module, embedding the channel and space fusion sensing module into a deep neural network architecture, and carrying out target detection on a target picture by using the improved deep neural network architecture;
the channel and spatial fusion perception module is constructed as follows: channel perception is first applied to the originally input feature map, followed by cascaded spatial perception.
2. The deep learning target detection method based on channel and space fusion perception according to claim 1, wherein performing channel perception on the originally input feature map specifically comprises the following steps:
step S11: the input feature map F ∈ R^(H×W×C) is sliced along the channel dimension into C slices Z = {z_1, z_2, ..., z_C}, where z_i ∈ R^(H×W×1) denotes the i-th slice, and H, W and C are the height, width and number of channels of the feature map;
step S12: global average pooling is applied to the C slices to obtain a vector U_1 = {u_1, u_2, ..., u_C} with U_1 ∈ R^(1×1×C), where the k-th element u_k is computed as:
u_k = (1 / (H × W)) · Σ_{i=1..H} Σ_{j=1..W} z_k(i, j);
where z_k(i, j) is the pixel value of slice z_k at coordinates (i, j);
step S13: U_1 is passed through a fully connected layer followed by a ReLU activation, giving U' = δ(W_1 · U_1), where W_1 ∈ R^((C/r)×C) denotes the weights of that fully connected layer, r is a scaling factor and δ(·) denotes the ReLU activation; the resulting dimension is 1×1×(C/r);
step S14: U' is passed through another fully connected layer and activated with a sigmoid function, giving U_2 = σ(W_2 · U'), where W_2 ∈ R^(C×(C/r)) denotes the weights of that fully connected layer and σ(·) denotes the sigmoid activation; the resulting dimension is 1×1×C;
step S15: the values of U_2 are multiplied element-wise with the channel slices Z of the original input feature map F to obtain the channel-perceived feature map F_1, as follows:
F_1 = U_2 · Z;
where · denotes that all values of the k-th slice z_k in Z are multiplied by the k-th value u_{2,k} of U_2.
3. The deep learning target detection method based on channel and space fusion perception according to claim 1, wherein the cascaded spatial perception specifically comprises the following steps:
step S21: the channel-perceived feature map F_1 is sliced along the spatial dimensions into H×W slices Z' = {z'_1, z'_2, z'_3, ..., z'_{H×W}}, where z'_i ∈ R^(1×1×C), so the slice vector Z' ∈ R^(H×W×C), with K = H×W slices in total;
step S22: all vectors Z' obtained in step S21 are passed through a convolution layer with weights W_3 (a 1 × 1 convolution), which compresses the channels and simultaneously produces the spatial perception weights; the computation is U_3 = W_3 * Z', where W_3 ∈ R^(1×1×C) and the symbol * denotes the linear combination of each slice z'_k ∈ R^(1×1×C) with W_3, yielding one value u_{3,k} per spatial position; the result is then mapped into the interval [0, 1] with the sigmoid activation function, giving:
U_3 = σ(W_3 * Z');
where W_3 is the parameter weight of the convolution layer and σ(·) is the sigmoid activation function;
step S23: the values of U_3 are multiplied element-wise with the values on the spatial slice vector Z' of the channel-perceived feature map F_1 to obtain the feature map F_2 after cascaded spatial perception, as follows:
F_2 = U_3 ⊙ Z';
where ⊙ denotes that all values of the k-th slice z'_k in Z' are multiplied by the k-th value u_{3,k} of U_3.
4. The method for detecting the deep learning target by fusing the perception of the channel and the space according to any one of claims 1 to 3, characterized by comprising the following steps:
step S1: acquiring an image training data set and a test data set and their respective label files from the MSCOCO or PASCAL VOC official website;
step S2: scaling the images of the training data set to the same size, and then inputting the images into a deep neural network;
step S3: constructing a deep neural network architecture;
step S4: constructing a channel and space fusion sensing module, and embedding the channel and space fusion sensing module into a built deep neural network architecture;
step S5: training the modified deep neural network, and storing each weighted value of the neural network;
step S6: and (5) inputting the image of the test data set downloaded in the step (S1) into the trained deep neural network of the embedded channel and spatial fusion perception module, and outputting a detection result.
5. The deep learning target detection method based on channel and space fusion perception according to claim 4, wherein in step S5, the training uses the mean average precision mAP as the evaluation index for target detection; the average precision of each class is computed as:
AP = ∫_0^1 P(R) dR;
and mAP is the mean of the AP values over all classes, where R is the recall and P is the precision;
when computing the recall and precision, α is set as the overlap ratio between the predicted bounding box and the labeled ground-truth bounding box; predicted boxes with α ≥ 0.5 are regarded as positive examples and predicted boxes with α < 0.5 as negative examples.
CN201911048207.5A 2019-10-30 2019-10-30 Deep learning target detection method based on channel and space fusion perception Pending CN110796239A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911048207.5A CN110796239A (en) 2019-10-30 2019-10-30 Deep learning target detection method based on channel and space fusion perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911048207.5A CN110796239A (en) 2019-10-30 2019-10-30 Deep learning target detection method based on channel and space fusion perception

Publications (1)

Publication Number Publication Date
CN110796239A true CN110796239A (en) 2020-02-14

Family

ID=69442233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911048207.5A Pending CN110796239A (en) 2019-10-30 2019-10-30 Deep learning target detection method based on channel and space fusion perception

Country Status (1)

Country Link
CN (1) CN110796239A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401294A (en) * 2020-03-27 2020-07-10 山东财经大学 Multitask face attribute classification method and system based on self-adaptive feature fusion
CN111414994A (en) * 2020-03-03 2020-07-14 哈尔滨工业大学 FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN112712138A (en) * 2021-01-19 2021-04-27 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN113158790A (en) * 2021-03-15 2021-07-23 河北工业职业技术学院 Processing edge lane line detection system based on geometric context coding network model
CN115631153A (en) * 2022-10-14 2023-01-20 佳源科技股份有限公司 Pipe gallery visual defect detection method and system based on perception learning structure

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084794A (en) * 2019-04-22 2019-08-02 华南理工大学 A kind of cutaneum carcinoma image identification method based on attention convolutional neural networks
CN110110751A (en) * 2019-03-31 2019-08-09 华南理工大学 A kind of Chinese herbal medicine recognition methods of the pyramid network based on attention mechanism
CN110197208A (en) * 2019-05-14 2019-09-03 江苏理工学院 A kind of textile flaw intelligent measurement classification method and device
KR20190113119A (en) * 2018-03-27 2019-10-08 삼성전자주식회사 Method of calculating attention for convolutional neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190113119A (en) * 2018-03-27 2019-10-08 삼성전자주식회사 Method of calculating attention for convolutional neural network
CN110110751A (en) * 2019-03-31 2019-08-09 华南理工大学 A kind of Chinese herbal medicine recognition methods of the pyramid network based on attention mechanism
CN110084794A (en) * 2019-04-22 2019-08-02 华南理工大学 A kind of cutaneum carcinoma image identification method based on attention convolutional neural networks
CN110197208A (en) * 2019-05-14 2019-09-03 江苏理工学院 A kind of textile flaw intelligent measurement classification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SANGHYUN WOO et al.: "CBAM: Convolutional Block Attention Module", Computer Vision – ECCV 2018 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414994A (en) * 2020-03-03 2020-07-14 哈尔滨工业大学 FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN111401294A (en) * 2020-03-27 2020-07-10 山东财经大学 Multitask face attribute classification method and system based on self-adaptive feature fusion
CN112712138A (en) * 2021-01-19 2021-04-27 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN113158790A (en) * 2021-03-15 2021-07-23 河北工业职业技术学院 Processing edge lane line detection system based on geometric context coding network model
CN115631153A (en) * 2022-10-14 2023-01-20 佳源科技股份有限公司 Pipe gallery visual defect detection method and system based on perception learning structure

Similar Documents

Publication Publication Date Title
CN110796239A (en) Deep learning target detection method based on channel and space fusion perception
CN112529150B (en) Model structure, model training method, image enhancement method and device
CN110909651B (en) Method, device and equipment for identifying video main body characters and readable storage medium
EP4145353A1 (en) Neural network construction method and apparatus
JP2020519995A (en) Action recognition in video using 3D space-time convolutional neural network
CN112446398A (en) Image classification method and device
CN111275057B (en) Image processing method, device and equipment
CN106874857A (en) A kind of living body determination method and system based on video analysis
CN110222718B (en) Image processing method and device
CN112862828B (en) Semantic segmentation method, model training method and device
US11244188B2 (en) Dense and discriminative neural network architectures for improved object detection and instance segmentation
EP3942462B1 (en) Convolution neural network based landmark tracker
CN113221787A (en) Pedestrian multi-target tracking method based on multivariate difference fusion
CN113469074B (en) Remote sensing image change detection method and system based on twin attention fusion network
CN111192277A (en) Instance partitioning method and device
CN113065576A (en) Feature extraction method and device
KR20200067631A (en) Image processing apparatus and operating method for the same
CN111695673A (en) Method for training neural network predictor, image processing method and device
CN111738074B (en) Pedestrian attribute identification method, system and device based on weak supervision learning
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN111291695B (en) Training method and recognition method for recognition model of personnel illegal behaviors and computer equipment
CN115018039A (en) Neural network distillation method, target detection method and device
CN115620122A (en) Training method of neural network model, image re-recognition method and related equipment
CN113627421A (en) Image processing method, model training method and related equipment
CN113673308A (en) Object identification method, device and electronic system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200214