CN110796239A - Deep learning target detection method based on channel and space fusion perception - Google Patents

Deep learning target detection method based on channel and space fusion perception

Info

Publication number
CN110796239A
Authority
CN
China
Prior art keywords
channel
perception
space
neural network
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911048207.5A
Other languages
Chinese (zh)
Inventor
吴林煌
杨绣郡
范振嘉
陈志峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201911048207.5A priority Critical patent/CN110796239A/en
Publication of CN110796239A publication Critical patent/CN110796239A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a deep learning target detection method based on channel and spatial fusion perception, which comprises: first constructing a channel and spatial fusion perception module, embedding the module into a deep neural network architecture, and then performing target detection on a target picture with the improved deep neural network architecture. The channel and spatial fusion perception module is constructed as follows: channel perception is first applied to the originally input feature map, followed by cascaded spatial perception. The method neither deepens nor widens the network and introduces no additional space vectors, while ensuring both real-time performance and accuracy.

Description

Deep learning target detection method based on channel and space fusion perception
Technical Field
The invention relates to the technical field of image recognition, in particular to a deep learning target detection method based on channel and space fusion perception.
Background
Currently, target detection frameworks based on deep learning are mainly divided into two categories: two-stage detectors and single-stage detectors. Two-stage target detection, named after its two-stage processing of an image and also called the region-based approach, abstracts detection into two processes: first, a number of regions that may contain objects, i.e. local crops of the image called candidate regions, are proposed from the image; second, a deep convolutional neural network encodes the feature vectors of the generated regions, and these feature vectors are used to predict the category of each candidate region, thereby obtaining the class of the object in each region. Two-stage detection algorithms rely on neural networks with high computational cost and trade speed for precision, which gave rise to single-stage target detection algorithms. A single-stage detector has no intermediate region-proposal process and obtains the prediction results directly from the image; typical examples include YOLO and SSD. The network architectures of these direct-regression algorithms are comparatively simpler; compared with two-stage target detection such as Mask R-CNN, their speed is increased by about 8 times while their accuracy is reduced by about 12%.
over the past few years, improvements in detector performance have largely relied on increasing the depth or width of the network: compared with AlexNet, VGG-16 increases the network depth by stacking convolution layers to improve the expressive power of the model; ResNet trains very deep networks effectively through residual blocks, with model depths continuing to increase (e.g., from 16 layers to 152 layers), enabling high-capacity models that improve performance; GoogleNet uses the inception module to apply convolution kernels of different scales to the same feature map, increasing the model width to improve learning. Although performance can be improved by simply pursuing deeper and wider network structures, the more complex the network, the higher the computational cost and the lower the inference speed. DSSD and RetinaNet, for example, can compete with the performance of top two-stage networks, but their gains come from the extremely deep ResNet-101 backbone, which limits efficiency.
In addition to relying solely on network depth, many current methods improve network performance by designing functional modules that enhance the feature characterization capability: the FPN combines deep features with shallow features, strengthening the spatially strong shallow features with richer semantic information from deeper layers; on the basis of SSD, the DSSD fuses the features of the deeper base network ResNet-101 with deconvolution layers and, through skip connections, provides better characterization capability for the shallow feature maps, but its speed drops noticeably while the performance improves markedly. These methods do not deepen the model to enhance the feature representation of the network; instead, they laterally enhance the learning of deep features in the convolutional neural network by directly performing operations such as superposition, sampling and concatenation on the feature maps. However, these operations reprocess the feature map as a whole rather than its interior, and they all perform feature fusion by introducing additional space vectors; they do not particularly emphasize, from inside the feature map, the relative importance of its channels or spatial positions.
Disclosure of Invention
In view of this, the present invention provides a method for detecting a deep learning target by channel and space fusion sensing, which does not deepen the depth or width of a network, does not introduce additional space vectors, and simultaneously ensures real-time performance and precision.
The invention is realized by adopting the following scheme: a deep learning target detection method based on channel and space fusion perception specifically comprises the following steps:
constructing a channel and space fusion sensing module, embedding the channel and space fusion sensing module into a deep neural network architecture, and carrying out target detection on a target picture by using the improved deep neural network architecture;
the channel and spatial fusion perception module is constructed as follows: channel perception is first applied to the originally input feature map, followed by cascaded spatial perception.
Further, the channel sensing of the feature map of the original input specifically includes the following steps:
step S11: feature map F to be inputH×W×CBased on channel slicing, dividing into C slice zones Z ═ Z1,z2,...,zCTherein of
Figure BDA0002254648920000031
Represents the ith patch; h, W, C represents the height, width and channel number of the characteristic diagram;
step S12: carrying out global average pooling operation on the C slice areas to obtain a vector U1={u1,u2,...,uC},
Figure BDA0002254648920000032
The kth element ukThe calculation formula of (2) is as follows:
Figure BDA0002254648920000033
in the formula, zk(i, j) is the patch zkPixel values corresponding to the upper coordinates (i, j);
step S13: will U1Through a full connection layer operation and ReLu activation, the method is obtained
Figure BDA0002254648920000034
Wherein W1Refers to the weight coefficients of the full convolution of the layer,
Figure BDA0002254648920000035
r refers to the scaling factor, δ (-) refers to the ReLu activation operation, resulting in a dimension of
Figure BDA0002254648920000036
Step S14: will be provided with
Figure BDA0002254648920000037
Through a full connection layer operation and activated by using a sigmoid function, obtaining
Figure BDA0002254648920000038
Wherein W2Refers to the weight coefficients of the full convolution of the layer,
Figure BDA0002254648920000039
σ (-) refers to sigmoid activation operation, with the resulting dimension being
Figure BDA00022546489200000310
Step S15: will U2The value of the channel is multiplied by the channel slice division vector Z of the input original feature map F correspondingly to obtain the feature map F obtained by channel perception1The formula is as follows:
F1=U2·Z;
where, denotes the kth vector Z in ZkAll values of (2) are multiplied by U2K value of (U)2k
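For illustration only, the channel perception of steps S11 to S15 can be sketched in PyTorch as follows; this is a minimal sketch rather than the patented implementation, and the class name ChannelPerception and the default scaling factor r = 16 are assumptions made for the example.

import torch
import torch.nn as nn

class ChannelPerception(nn.Module):
    """Channel perception (steps S11-S15): global average pooling over each channel
    slice, two fully connected layers with ReLU/sigmoid, then per-channel rescaling
    of the input feature map."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # step S12: U1, shape (N, C, 1, 1)
        self.fc1 = nn.Linear(channels, channels // r)   # step S13: W1, reduces C to C/r
        self.fc2 = nn.Linear(channels // r, channels)   # step S14: W2, restores C
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = f.shape
        u1 = self.pool(f).view(n, c)                              # step S12
        u2 = self.sigmoid(self.fc2(self.relu(self.fc1(u1))))      # steps S13-S14
        return f * u2.view(n, c, 1, 1)                            # step S15: F1 = U2 . Z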
Further, the cascade for performing spatial perception specifically includes the following steps:
step S21: feature map F obtained by channel perception1Based on spatial slicing, the slices are divided into H multiplied by W slices, namely Z '═ Z'1,z'2,z'3,...,z'H×MTherein of
Figure BDA0002254648920000041
So vector
Figure BDA0002254648920000042
Wherein K ═ hxw;
step S22: all vectors Z' obtained in step S21 are weighted by W3The convolution layer of (1) also obtains a spatial perception weight while compressing the channel, and the calculation formula is U3=W3Z' wherein
Figure BDA0002254648920000043
Symbol denotes each
Figure BDA0002254648920000044
And
Figure BDA0002254648920000045
linear combination to obtain a value
Figure BDA0002254648920000046
Then mapping the probability to [0,1] by using sigmoid activation function]In the interval, the formula is as follows:
U3=σ(W3*Z′);
in the formula, W3Sigma (-) is sigmoid activated function expression as parameter weight of convolution layer;
step S23: will U3Feature map F obtained by the above numerical value and channel perception1The values on the space slice vector Z' are correspondingly multiplied to obtain a feature map F after space perception cascade2The calculation formula is as follows:
F2=U3⊙Z;
wherein ⊙ represents the kth vector Z 'in Z'kAll values of
Figure BDA0002254648920000047
All ride on U3K value of
Figure BDA0002254648920000048
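Continuing the sketch above (and reusing the ChannelPerception class defined there), the spatial perception of steps S21 to S23 and the cascade of the two stages might be written as follows; the 1×1 convolution stands in for W_3, and the class names are again illustrative assumptions, not the patent's code.

import torch
import torch.nn as nn

class SpatialPerception(nn.Module):
    """Spatial perception (steps S21-S23): a 1x1 convolution compresses the channels
    into a single spatial weight map U3, which rescales F1 per spatial position."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)   # step S22: W3
        self.sigmoid = nn.Sigmoid()

    def forward(self, f1: torch.Tensor) -> torch.Tensor:
        u3 = self.sigmoid(self.conv(f1))     # U3 = sigma(W3 * Z'), shape (N, 1, H, W)
        return f1 * u3                       # step S23: F2 = U3 (.) Z'

class FusionPerception(nn.Module):
    """Channel perception followed by cascaded spatial perception."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.channel = ChannelPerception(channels, r)
        self.spatial = SpatialPerception(channels)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.spatial(self.channel(f))

# quick shape check on a random feature map: output keeps the input size
x = torch.randn(1, 512, 16, 16)
assert FusionPerception(512)(x).shape == x.shape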
Further, the method of the embodiment specifically includes the following steps:
step S1: acquiring an image training data set and a test data set and respective label files thereof from an MSCOCO or PASCAL VOC official website;
step S2: scaling the images of the training data set to the same size, and then inputting the images into a deep neural network;
step S3: constructing a deep neural network architecture;
step S4: constructing a channel and space fusion sensing module, and embedding the channel and space fusion sensing module into a built deep neural network architecture;
step S4 is as described above, and includes the following two steps:
step S41: the input channel and the spatial fusion perception module are a layer with the size of FH×W×CH, W, C are respectively the height, width and channel number of the characteristic diagram. The characteristic diagram is firstly based on channel slices through channel perception, and channel perception is carried out through operations of global average pooling, full connection (FC for short), activation function and the like, so that the weight distribution of each channel is continuously learned and updated, and the method is combined with the methodMultiplying the initial characteristic diagram F to obtain the size F1 H×W×CA characteristic diagram of (1);
step S42: f obtained in step S411The feature graph is based on space slices and carries out space perception through operations such as convolution, activation functions and the like, so that the weight distribution of each space is continuously learned and updated, and the feature graph F is obtained through channel perception1Multiplying to obtain the final size F2 H×W×CThe characteristic diagram of (1).
Step S5: training the modified deep neural network, and storing each weighted value of the neural network;
step S6: and (5) inputting the image of the test data set downloaded in the step (S1) into the trained deep neural network of the embedded channel and spatial fusion perception module, and outputting a detection result. And the target detection effect is evaluated.
Further, in step S5, the training uses the average accuracy mAP as an evaluation index of the target detection, and the calculation is as follows:
Figure BDA0002254648920000051
wherein R is recall rate and P is accuracy rate;
when the recall rate and the accuracy rate are calculated, α is taken as the coincidence rate between the predicted bounding box and the labeled real bounding box, meanwhile, the predicted box with α being more than or equal to 0.5 is taken as a positive example, the predicted box with α being less than 0.5 is taken as a negative example, wherein the calculation of α is as follows:
Figure BDA0002254648920000061
in the formula, BoxpreFor predicted bounding boxes, BoxgtTo label the real bounding box, ∩ is the intersection area between the two, and ∪ is the combined area of the two.
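As a concrete illustration of the overlap ratio α and the positive/negative assignment described above, a small Python sketch (with boxes represented as (x1, y1, x2, y2) tuples, a representation assumed for the example) might look like this:

def iou(box_pre, box_gt):
    """Overlap ratio alpha between a predicted box and a ground-truth box,
    both given as (x1, y1, x2, y2)."""
    x1 = max(box_pre[0], box_gt[0])
    y1 = max(box_pre[1], box_gt[1])
    x2 = min(box_pre[2], box_gt[2])
    y2 = min(box_pre[3], box_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_pre = (box_pre[2] - box_pre[0]) * (box_pre[3] - box_pre[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_pre + area_gt - inter
    return inter / union if union > 0 else 0.0

def is_positive(box_pre, box_gt, threshold=0.5):
    """A prediction counts as a positive example when alpha >= 0.5."""
    return iou(box_pre, box_gt) >= threshold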
Compared with the prior art, the invention has the following beneficial effects: the method neither deepens nor widens the network and introduces no additional space vectors, while ensuring both real-time performance and accuracy; it offers a new approach for the field of target detection, is highly portable, can be embedded into various deep learning networks, and can be widely applied in fields that require target recognition.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Fig. 2 is a diagram illustrating the effect of partial data set downloaded from the MSCOCO official website according to the embodiment of the present invention.
Fig. 3 is a diagram illustrating the effect of the tag file downloaded from the MSCOCO official website according to the embodiment of the present invention.
Fig. 4 is a block diagram of a deep learning network architecture of ResNet-18 according to an embodiment of the present invention.
Fig. 5 is a block diagram of a residual connection structure according to an embodiment of the present invention.
Fig. 6 is an architecture block diagram of embedding a channel and space fusion aware module into a deep neural network architecture according to an embodiment of the present invention.
Fig. 7 is a block diagram of a channel and spatial perception fusion module architecture according to an embodiment of the present invention. Wherein, (a) is a channel perception module block diagram, and (b) is a space perception module block diagram.
Fig. 8 is a block diagram of an inference phase structure according to an embodiment of the present invention.
Fig. 9 is a diagram of the results uploaded to the CodaLab official website for evaluation according to an embodiment of the present invention.
FIG. 10 is a diagram showing the CodaLab official website evaluation results according to the embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
As shown in fig. 1, the present embodiment provides a deep learning target detection method based on channel and space fusion sensing, which specifically includes:
constructing a channel and space fusion sensing module, embedding the channel and space fusion sensing module into a deep neural network architecture, and carrying out target detection on a target picture by using the improved deep neural network architecture;
the channel and spatial fusion perception module is constructed as follows: channel perception is first applied to the originally input feature map, followed by cascaded spatial perception.
In this embodiment, performing channel perception on the originally input feature map specifically comprises the following steps:
step S11: the input feature map F ∈ R^(H×W×C) is sliced along the channel dimension into C slices Z = {z_1, z_2, ..., z_C}, where z_i ∈ R^(H×W×1) denotes the i-th slice, and H, W and C are the height, width and number of channels of the feature map;
step S12: global average pooling is applied to the C slices to obtain a vector U_1 = {u_1, u_2, ..., u_C} with U_1 ∈ R^(1×1×C), where the k-th element u_k is computed as:
u_k = (1 / (H × W)) · Σ_{i=1..H} Σ_{j=1..W} z_k(i, j);
where z_k(i, j) is the pixel value of slice z_k at coordinates (i, j);
step S13: U_1 is passed through a fully connected layer followed by a ReLU activation, giving U' = δ(W_1 · U_1), where W_1 ∈ R^((C/r)×C) denotes the weights of that fully connected layer, r is a scaling factor and δ(·) denotes the ReLU activation; the resulting dimension is 1×1×(C/r);
step S14: U' is passed through another fully connected layer and activated with a sigmoid function, giving U_2 = σ(W_2 · U'), where W_2 ∈ R^(C×(C/r)) denotes the weights of that fully connected layer and σ(·) denotes the sigmoid activation; the resulting dimension is 1×1×C;
step S15: the values of U_2 are multiplied element-wise with the channel slices Z of the original input feature map F to obtain the channel-perceived feature map F_1, as follows:
F_1 = U_2 · Z;
where · denotes that all values of the k-th slice z_k in Z are multiplied by the k-th value u_{2,k} of U_2.
In this embodiment, the cascaded spatial perception specifically comprises the following steps:
step S21: the channel-perceived feature map F_1 is sliced along the spatial dimensions into H×W slices Z' = {z'_1, z'_2, z'_3, ..., z'_{H×W}}, where z'_i ∈ R^(1×1×C), so the slice vector Z' ∈ R^(H×W×C), with K = H×W slices in total;
step S22: all vectors Z' obtained in step S21 are passed through a convolution layer with weights W_3 (a 1 × 1 convolution), which compresses the channels and simultaneously produces the spatial perception weights; the computation is U_3 = W_3 * Z', where W_3 ∈ R^(1×1×C) and the symbol * denotes the linear combination of each slice z'_k ∈ R^(1×1×C) with W_3, yielding one value u_{3,k} per spatial position; the result is then mapped into the interval [0, 1] with the sigmoid activation function, giving:
U_3 = σ(W_3 * Z');
where W_3 is the parameter weight of the convolution layer and σ(·) is the sigmoid activation function;
step S23: the values of U_3 are multiplied element-wise with the values on the spatial slice vector Z' of the channel-perceived feature map F_1 to obtain the feature map F_2 after cascaded spatial perception, as follows:
F_2 = U_3 ⊙ Z';
where ⊙ denotes that all values of the k-th slice z'_k in Z' are multiplied by the k-th value u_{3,k} of U_3.
In this embodiment, the method specifically comprises the following steps:
step S1: acquire an image training data set, a test data set and their respective label files from the MSCOCO or PASCAL VOC official website;
step S2: scale the images of the training data set to the same size and input them into the deep neural network;
step S3: build the deep neural network architecture;
step S4: construct the channel and spatial fusion perception module and embed it into the built deep neural network architecture; step S4 comprises the following two steps:
step S41: the input to the channel and spatial fusion perception module is a feature map of size F^(H×W×C), where H, W and C are the height, width and number of channels of the feature map; the feature map is first sliced along channels for channel perception, carried out through global average pooling, fully connected (FC) layers and activation functions, so that the weight distribution of each channel is continuously learned and updated and is multiplied with the initial feature map F to obtain a feature map of size F_1^(H×W×C);
step S42: the feature map F_1 obtained in step S41 is sliced along the spatial dimensions for spatial perception, carried out through convolution and activation functions, so that the weight distribution of each spatial position is continuously learned and updated and is multiplied with the channel-perceived feature map F_1 to obtain the final feature map of size F_2^(H×W×C);
step S5: train the modified deep neural network and save each weight value of the neural network;
step S6: input the images of the test data set downloaded in step S1 into the trained deep neural network with the embedded channel and spatial fusion perception module, output the detection results, and evaluate the target detection performance.
In this embodiment, in step S5, the training uses the mean average precision mAP as the evaluation index for target detection; the average precision of each class is computed as:
AP = ∫_0^1 P(R) dR;
and mAP is the mean of the AP values over all classes, where R is the recall and P is the precision;
when computing the recall and precision, α is taken as the overlap ratio between the predicted bounding box and the labeled ground-truth bounding box; predicted boxes with α ≥ 0.5 are regarded as positive examples and predicted boxes with α < 0.5 as negative examples, where α is computed as:
α = area(Box_pre ∩ Box_gt) / area(Box_pre ∪ Box_gt);
where Box_pre is the predicted bounding box, Box_gt is the labeled ground-truth bounding box, ∩ denotes the intersection of the two and ∪ denotes the union of the two.
This embodiment provides a novel module, called channel and spatial fusion perception, which enhances deep feature learning in a neural network and can be embedded into any neural network. The original feature map is first partitioned along channels and the feature distribution over the channels is relearned through fully connected layers; the obtained weights are multiplied with all channels. The result is then partitioned along the spatial dimensions, the feature distribution over spatial positions is relearned through a convolution layer, and the obtained weights are multiplied with all spatial positions to produce the fusion-perceived feature map. In this way the features of the feature map are redistributed over both channels and space, so that important features carry more weight and negligible features have less influence.
Particularly, the method comprises the steps of data set downloading, deep neural network construction, channel and space fusion perception module embedding (channel perception and space perception) and target detection performance evaluation;
The data set downloading is performed by downloading a data set and the corresponding label files, such as the MSCOCO and PASCAL VOC data sets, from the currently mainstream data set websites.
PASCAL VOC provides an excellent standardized data set for image recognition and classification; the challenge has been held every year since 2005. It contains 4 major categories (vehicle, household, animal and person), subdivided into 20 subclasses, and is mainly concerned with classification and detection tasks. Two versions of PASCAL VOC are commonly used: the 2007 and 2012 versions. The combination of 2007 trainval and 2012 trainval is commonly used as the training set, with 16551 pictures containing 40058 objects in total, and 2007 test is commonly used as the validation set, with 4952 pictures containing 12032 objects.
The MSCOCO data set is a large image data set developed and maintained by Microsoft, collecting a large number of everyday scene pictures containing common objects. The detection task covers 12 major categories divided into 80 subclasses; each class contains more instances and the distribution is balanced, and compared with PASCAL VOC, MSCOCO also contains more small objects. train2017 is commonly used as the training set and contains 118287 pictures; val2017 is commonly used as the validation set and contains 5000 pictures; test2017 is commonly used as the test set and contains 20288 pictures.
The deep neural network is built by selecting a suitable framework from current deep neural networks; mainstream detector backbones include VGG, AlexNet, ResNet and GoogleNet. Smaller neural networks have fewer parameters to learn in the training stage and are therefore faster in both the training and inference stages.
The channel and spatial fusion perception module in this embodiment is embedded at a suitable position of the deep neural network architecture. The module first performs channel perception on the originally input feature map, followed by cascaded spatial perception.
In channel perception, the input feature map is first sliced along channels and then globally average-pooled, which extracts the global information of the feature map and compresses the spatial dimensions; next, a fully connected operation and a ReLU activation reduce the number of channels to lower the computational complexity; then another fully connected layer and a sigmoid function map the values into [0, 1] and restore the original channel dimension; finally, the sliced vector is multiplied element-wise with the vector obtained after the sigmoid operation to obtain the channel perception result. The two fully connected layers learn the features of the input feature map end to end, and this operation yields the perception over the channels.
In spatial perception, the feature map obtained through channel perception is first sliced along the spatial dimensions, then convolved and activated with a sigmoid function to normalize the spatial perception weights; finally, the normalized vector is multiplied element-wise with the previously obtained spatial vector, giving the result of channel and spatial fusion perception. As the network keeps learning, the convolution layer continuously updates its parameters and thereby adaptively adjusts the spatial coefficients toward ignoring or emphasizing positions. Through the cascade of channel and spatial perception, the weights are redistributed, making the network more sensitive to useful information while suppressing useless information more clearly.
When evaluating the target detection performance, the test data set downloaded from the official website is input into the trained deep neural network with the embedded channel and spatial fusion perception module to obtain a test result file, which is compared against the annotated label file provided by the official website to obtain the evaluation result.
Specifically, the following describes a specific embodiment of the present embodiment with reference to the drawings.
The step S1 specifically includes:
step S1: acquiring an image training data set and a test data set from an MSCOCO official website, and respective label files thereof; in this embodiment, train2017 is used as a training data set, and test2017 is used as a testing data set. A partial screenshot of a training data set is shown in fig. 2, and the label files of the training set and the test set are shown in fig. 3;
the step S2 specifically includes:
step S2: the image size of the training data set is scaled to 512 x 512, and then input into a deep neural network;
in this embodiment, the step S3 specifically includes the following steps:
step S3: as shown in fig. 4, a deep learning network architecture of ResNet-18 is built, and the size of the input image is 512 × 512 × 3, wherein the length and the width of the image are 512, and the number of channels is 3;
the deep neural network architecture involved in the step S3 is as follows:
an input layer: since the input is a 512 × 512 RGB image whose dimension in three-dimensional space is 512 × 512 × 3, the image vector size of the input layer is [512, 512, 3];
conv1: in this embodiment, the size of the first convolution layer is 7 × 7, the convolution depth is 64 and the stride is set to 2 (before convolution, the original input image is zero-padded with 2 rows/columns on the top, bottom, left and right), which is equivalent to convolving the pixels of the input image under the window with 64 sliding windows of size 7 × 7 at stride 2; the resulting image vector size is [256, 256, 64];
in this embodiment, the first pooling stride is set to 2, the pooling size is 3 × 3 and the pooling mode is max pooling, i.e. the image pixels obtained from convolution layer 1 are scanned with a 3 × 3 sliding window at stride 2 and the maximum value under the window is kept, so the image vector size obtained after this pooling layer is [128, 128, 64];
conv2_x: as shown in fig. 5, this is the residual connection part of the ResNet-18 network in this embodiment; x denotes the input of the current first layer and is routed in two directions: one is a directly connected identity mapping that skips two 3 × 3 convolution layers of depth 64 and feeds the value directly into the ReLU activation of the subsequent layer, and the other is the output mapping H(X) through two weight layers (a convolution layer followed by batch normalization (BN) and a ReLU activation, performed twice); F(X) is equivalent to a residual function, as shown in the following formula:
F(X) = H(X) - X;
conv3_x, conv4_x and conv5_x are similar in structure to conv2_x; the difference lies in the vector size of the input x and the convolution depth of each convolution layer, and the vector size of the image output at conv5_x is [16, 16, 512];
in this embodiment, the size of the first deconvolution is set to 4 × 4, the convolution depth to 256 and the stride to 2, which corresponds to mapping each pixel of the feature map output by conv5_x onto a 4 × 4 sliding window with 256 deconvolution kernels at stride 2; the image vector size of the output map is [32, 32, 256];
similarly, deconv2_x and deconv3_x have a structure similar to deconv1_x, the only difference being the vector size of the input feature map; the vector size of the image output at deconv3_x is [128, 128, 64];
an output layer: finally, the vector obtained from deconv3_x passes through a 3 × 3 convolution layer, a ReLU activation layer and a 1 × 1 convolution layer, and the network outputs the probabilities of the 80 classes (the 80 classes labeled by the COCO category labels) and the four coordinates of the predicted bounding box (the upper-left and lower-right corner coordinates of the bounding box).
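The overall network of conv1 through the output layer can be approximated with the sketch below; it uses the torchvision ResNet-18 as the backbone and three 4 × 4, stride-2 deconvolution stages. This is a rough, assumed reading of the description (the intermediate deconvolution depths and the class name DetectorSketch are illustrative assumptions), not the exact network of the embodiment.

import torch
import torch.nn as nn
import torchvision

class DetectorSketch(nn.Module):
    """ResNet-18 backbone (conv1 .. conv5_x), three 4x4 stride-2 deconvolutions,
    and a head producing 80 class maps plus 4 box-coordinate maps."""
    def __init__(self, num_classes: int = 80):
        super().__init__()
        backbone = torchvision.models.resnet18()
        # keep everything up to conv5_x (drops the final average pooling and fc layer)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        deconv_channels = [512, 256, 128, 64]
        layers = []
        for c_in, c_out in zip(deconv_channels[:-1], deconv_channels[1:]):
            layers += [
                nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            ]
        self.deconv = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, num_classes + 4, kernel_size=1),  # 80 class maps + 4 box coords
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.deconv(self.backbone(x)))

# a 512x512 input reaches conv5_x at 16x16; three deconvolutions give 128x128 output maps
out = DetectorSketch()(torch.randn(1, 3, 512, 512))
print(out.shape)  # torch.Size([1, 84, 128, 128])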
In this embodiment, steps S41 and S42 are specifically as follows:
As shown in fig. 6, the channel and spatial fusion perception module is embedded into the deep neural network architecture.
step S41: taking the input of deconv1_x as an example, the input to the channel and spatial fusion perception module is a feature map of size F^(16×16×512), i.e. the height and width of the feature map F are both 16 and the number of channels is 512. The feature map is first sliced along channels for channel perception, carried out through global average pooling, fully connected (FC) layers and activation functions, so that the weight distribution of each channel is continuously learned and updated and is multiplied with the initial feature map F to obtain a feature map of size F_1^(16×16×512);
step S42: the feature map F_1 obtained in step S41 is sliced along the spatial dimensions for spatial perception, carried out through convolution and activation functions, so that the weight distribution of each spatial position is continuously learned and updated and is multiplied with the channel-perceived feature map F_1 to obtain the final feature map of size F_2^(16×16×512); a usage sketch is given below.
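A hedged usage sketch of steps S41 and S42, reusing the FusionPerception and DetectorSketch classes from the earlier sketches and the concrete 16 × 16 × 512 size of the deconv1_x input; the placement of the module here is an assumed example, since the description only requires embedding it at a suitable position.

import torch

# feature map at the input of deconv1_x: batch of 1, 512 channels, 16 x 16 spatial size
f = torch.randn(1, 512, 16, 16)

# channel perception (step S41) followed by cascaded spatial perception (step S42)
fusion = FusionPerception(channels=512, r=16)
f2 = fusion(f)
print(f2.shape)  # torch.Size([1, 512, 16, 16]) -- same size, reweighted features

# one possible way to embed the module right before the deconvolution stages of the
# DetectorSketch above (an illustrative choice, not the patent's mandated position)
detector = DetectorSketch()
detector.deconv = torch.nn.Sequential(fusion, *detector.deconv)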
As shown in fig. 7 (a), S41 specifically comprises the following steps:
step S411: the input feature map F ∈ R^(16×16×512) is sliced along the channel dimension into 512 slices, i.e. Z = {z_1, z_2, z_3, ..., z_512}, where z_i ∈ R^(16×16×1) denotes the i-th slice;
step S412: global average pooling is applied to the 512 slices to obtain a vector U_1 = {u_1, u_2, u_3, ..., u_512} with U_1 ∈ R^(1×1×512), where the k-th element is computed as:
u_k = (1 / (16 × 16)) · Σ_{i=1..16} Σ_{j=1..16} z_k(i, j);
where z_k(i, j) is the pixel value of slice z_k at coordinates (i, j);
step S413: U_1 is passed through a fully connected layer followed by a ReLU activation, giving U' = δ(W_1 · U_1), where W_1 denotes the weights of that fully connected layer, r is a scaling factor (r = 16 in this embodiment) and δ(·) denotes the ReLU activation; the resulting dimension is 1×1×(512/16) = 1×1×32;
step S414: U' is passed through another fully connected layer and activated with a sigmoid function, giving U_2 = σ(W_2 · U'), where W_2 denotes the weights of that fully connected layer and σ(·) denotes the sigmoid activation; the resulting dimension is 1×1×512;
step S415: the values of U_2 are multiplied element-wise with the channel slices Z of the original input feature map F, as follows:
F_1 = U_2 · Z;
where · denotes that all values of the k-th slice z_k in Z are multiplied by the k-th value u_{2,k} of U_2.
As shown in fig. 7 (b), in this embodiment, step S42 specifically comprises the following steps:
step S421: the channel-perceived feature map F_1 is sliced along the spatial dimensions into 16 × 16 slices, i.e. Z' = {z'_1, z'_2, z'_3, ..., z'_{16×16}}, where z'_i ∈ R^(1×1×512), so the slice vector Z' ∈ R^(16×16×512), with K = 16 × 16 = 256 slices in total;
step S422: all vectors Z' obtained in step S421 are passed through a convolution layer with weights W_3, whose purpose is to obtain the spatial perception weights while compressing the channels; the computation is U_3 = W_3 * Z', where W_3 ∈ R^(1×1×512) and the symbol * denotes the linear combination of each slice z'_k with W_3, yielding one value u_{3,k} per spatial position; the result is then mapped into the interval [0, 1] with the sigmoid activation function, as shown below:
U_3 = σ(W_3 * Z');
where W_3 is the parameter weight of the convolution layer and σ(·) is the sigmoid activation function;
step S423: the values of U_3 are multiplied element-wise with the values in the spatial slice vector Z' of F_1, as follows:
F_2 = U_3 ⊙ Z';
where ⊙ denotes that all values of the k-th slice z'_k in Z' are multiplied by the k-th value u_{3,k} of U_3.
In this embodiment, step S5 is specifically as follows:
step S5: train the deep neural network with the embedded channel and spatial fusion perception module, and save each weight value of the neural network;
as shown in fig. 8, in this embodiment, the step S6 specifically includes the following steps:
step S61: the images of the MSCOCO test2017 test data set downloaded in step S1 are input into the deep neural network with the embedded channel and spatial fusion perception module trained in step S5 to obtain the json file of the final detection results; as shown in fig. 9, the json file is named, packed into a zip file and uploaded to the CodaLab website for evaluation.
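A minimal sketch of writing the detection-result json in the public MSCOCO results format and packing it into a zip archive for the CodaLab evaluation; the detections list and the file names here are illustrative dummy data, not actual results of the method.

import json
import zipfile

# each entry: COCO image id, COCO category id, box as [x, y, width, height], confidence
detections = [
    {"image_id": 397133, "category_id": 1, "bbox": [100.0, 50.0, 80.0, 160.0], "score": 0.92},
]

with open("detections_test2017_results.json", "w") as f:
    json.dump(detections, f)

# the evaluation server expects the json packed into a zip archive before uploading
with zipfile.ZipFile("detections_test2017_results.zip", "w") as zf:
    zf.write("detections_test2017_results.json")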
Step S62: the results of the inventive evaluation on MSCOCO are shown in fig. 10, where mAP is 35.6%, and AP5054.8% (AP value when α ═ 0.5), AP7538.4% (when α is 0)AP value at 75), APS17.7% (when bounding box area)<322AP value of) of the APM36.4% (when 32)2<Area of boundary frame<AP value of 96), APL48.0% (when bounding box area)>962And configured at the server to: fps (reasoning speed per second) on i9-900K CPU, 2080Ti GPU, CUDA 10.1, CUDNN7.6 and Pyorch 1.1.0 reaches 111 frames/second, so the deep learning target detection method integrating perception with space not only has strong real-time performance, but also has high performance.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims (5)

1. A deep learning target detection method based on channel and space fusion perception is characterized by comprising the following steps:
constructing a channel and space fusion sensing module, embedding the channel and space fusion sensing module into a deep neural network architecture, and carrying out target detection on a target picture by using the improved deep neural network architecture;
the channel and spatial fusion perception module is constructed as follows: channel perception is first applied to the originally input feature map, followed by cascaded spatial perception.
2. The deep learning target detection method based on channel and space fusion perception according to claim 1, wherein performing channel perception on the originally input feature map specifically comprises the following steps:
step S11: the input feature map F ∈ R^(H×W×C) is sliced along the channel dimension into C slices Z = {z_1, z_2, ..., z_C}, where z_i ∈ R^(H×W×1) denotes the i-th slice, and H, W and C are the height, width and number of channels of the feature map;
step S12: global average pooling is applied to the C slices to obtain a vector U_1 = {u_1, u_2, ..., u_C} with U_1 ∈ R^(1×1×C), where the k-th element u_k is computed as:
u_k = (1 / (H × W)) · Σ_{i=1..H} Σ_{j=1..W} z_k(i, j);
where z_k(i, j) is the pixel value of slice z_k at coordinates (i, j);
step S13: U_1 is passed through a fully connected layer followed by a ReLU activation, giving U' = δ(W_1 · U_1), where W_1 ∈ R^((C/r)×C) denotes the weights of that fully connected layer, r is a scaling factor and δ(·) denotes the ReLU activation; the resulting dimension is 1×1×(C/r);
step S14: U' is passed through another fully connected layer and activated with a sigmoid function, giving U_2 = σ(W_2 · U'), where W_2 ∈ R^(C×(C/r)) denotes the weights of that fully connected layer and σ(·) denotes the sigmoid activation; the resulting dimension is 1×1×C;
step S15: the values of U_2 are multiplied element-wise with the channel slices Z of the original input feature map F to obtain the channel-perceived feature map F_1, as follows:
F_1 = U_2 · Z;
where · denotes that all values of the k-th slice z_k in Z are multiplied by the k-th value u_{2,k} of U_2.
3. The deep learning target detection method based on channel and space fusion perception according to claim 1, wherein the cascaded spatial perception specifically comprises the following steps:
step S21: the channel-perceived feature map F_1 is sliced along the spatial dimensions into H×W slices Z' = {z'_1, z'_2, z'_3, ..., z'_{H×W}}, where z'_i ∈ R^(1×1×C), so the slice vector Z' ∈ R^(H×W×C), with K = H×W slices in total;
step S22: all vectors Z' obtained in step S21 are passed through a convolution layer with weights W_3 (a 1 × 1 convolution), which compresses the channels and simultaneously produces the spatial perception weights; the computation is U_3 = W_3 * Z', where W_3 ∈ R^(1×1×C) and the symbol * denotes the linear combination of each slice z'_k ∈ R^(1×1×C) with W_3, yielding one value u_{3,k} per spatial position; the result is then mapped into the interval [0, 1] with the sigmoid activation function, giving:
U_3 = σ(W_3 * Z');
where W_3 is the parameter weight of the convolution layer and σ(·) is the sigmoid activation function;
step S23: the values of U_3 are multiplied element-wise with the values on the spatial slice vector Z' of the channel-perceived feature map F_1 to obtain the feature map F_2 after cascaded spatial perception, as follows:
F_2 = U_3 ⊙ Z';
where ⊙ denotes that all values of the k-th slice z'_k in Z' are multiplied by the k-th value u_{3,k} of U_3.
4. The method for detecting the deep learning target by fusing the perception of the channel and the space according to any one of claims 1 to 3, characterized by comprising the following steps:
step S1: acquiring an image training data set and a test data set and their respective label files from the MSCOCO or PASCAL VOC official website;
step S2: scaling the images of the training data set to the same size, and then inputting the images into a deep neural network;
step S3: constructing a deep neural network architecture;
step S4: constructing a channel and space fusion sensing module, and embedding the channel and space fusion sensing module into a built deep neural network architecture;
step S5: training the modified deep neural network, and storing each weighted value of the neural network;
step S6: and (5) inputting the image of the test data set downloaded in the step (S1) into the trained deep neural network of the embedded channel and spatial fusion perception module, and outputting a detection result.
5. The deep learning target detection method based on channel and space fusion perception according to claim 4, wherein in step S5, the training uses the mean average precision mAP as the evaluation index for target detection; the average precision of each class is computed as:
AP = ∫_0^1 P(R) dR;
and mAP is the mean of the AP values over all classes, where R is the recall and P is the precision;
when computing the recall and precision, α is set as the overlap ratio between the predicted bounding box and the labeled ground-truth bounding box; predicted boxes with α ≥ 0.5 are regarded as positive examples and predicted boxes with α < 0.5 as negative examples.
CN201911048207.5A 2019-10-30 2019-10-30 Deep learning target detection method based on channel and space fusion perception Pending CN110796239A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911048207.5A CN110796239A (en) 2019-10-30 2019-10-30 Deep learning target detection method based on channel and space fusion perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911048207.5A CN110796239A (en) 2019-10-30 2019-10-30 Deep learning target detection method based on channel and space fusion perception

Publications (1)

Publication Number Publication Date
CN110796239A true CN110796239A (en) 2020-02-14

Family

ID=69442233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911048207.5A Pending CN110796239A (en) 2019-10-30 2019-10-30 Deep learning target detection method based on channel and space fusion perception

Country Status (1)

Country Link
CN (1) CN110796239A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401294A (en) * 2020-03-27 2020-07-10 山东财经大学 Multitask face attribute classification method and system based on self-adaptive feature fusion
CN111414994A (en) * 2020-03-03 2020-07-14 哈尔滨工业大学 FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN112712138A (en) * 2021-01-19 2021-04-27 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN113158790A (en) * 2021-03-15 2021-07-23 河北工业职业技术学院 Processing edge lane line detection system based on geometric context coding network model
CN115631153A (en) * 2022-10-14 2023-01-20 佳源科技股份有限公司 Pipe gallery visual defect detection method and system based on perception learning structure

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084794A (en) * 2019-04-22 2019-08-02 华南理工大学 A kind of cutaneum carcinoma image identification method based on attention convolutional neural networks
CN110110751A (en) * 2019-03-31 2019-08-09 华南理工大学 A kind of Chinese herbal medicine recognition methods of the pyramid network based on attention mechanism
CN110197208A (en) * 2019-05-14 2019-09-03 江苏理工学院 A kind of textile flaw intelligent measurement classification method and device
KR20190113119A (en) * 2018-03-27 2019-10-08 삼성전자주식회사 Method of calculating attention for convolutional neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190113119A (en) * 2018-03-27 2019-10-08 삼성전자주식회사 Method of calculating attention for convolutional neural network
CN110110751A (en) * 2019-03-31 2019-08-09 华南理工大学 A kind of Chinese herbal medicine recognition methods of the pyramid network based on attention mechanism
CN110084794A (en) * 2019-04-22 2019-08-02 华南理工大学 A kind of cutaneum carcinoma image identification method based on attention convolutional neural networks
CN110197208A (en) * 2019-05-14 2019-09-03 江苏理工学院 A kind of textile flaw intelligent measurement classification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SANGHYUN WOO et al.: "CBAM: Convolutional Block Attention Module", Computer Vision – ECCV 2018 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414994A (en) * 2020-03-03 2020-07-14 哈尔滨工业大学 FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN111401294A (en) * 2020-03-27 2020-07-10 山东财经大学 Multitask face attribute classification method and system based on self-adaptive feature fusion
CN112712138A (en) * 2021-01-19 2021-04-27 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN113158790A (en) * 2021-03-15 2021-07-23 河北工业职业技术学院 Processing edge lane line detection system based on geometric context coding network model
CN115631153A (en) * 2022-10-14 2023-01-20 佳源科技股份有限公司 Pipe gallery visual defect detection method and system based on perception learning structure

Similar Documents

Publication Publication Date Title
CN110796239A (en) Deep learning target detection method based on channel and space fusion perception
CN112529150B (en) Model structure, model training method, image enhancement method and device
CN110909651B (en) Method, device and equipment for identifying video main body characters and readable storage medium
EP4145353A1 (en) Neural network construction method and apparatus
JP2020519995A (en) Action recognition in video using 3D space-time convolutional neural network
CN112446398A (en) Image classification method and device
CN111275057B (en) Image processing method, device and equipment
CN106874857A (en) A kind of living body determination method and system based on video analysis
CN110222718B (en) Image processing method and device
CN112862828B (en) Semantic segmentation method, model training method and device
US11244188B2 (en) Dense and discriminative neural network architectures for improved object detection and instance segmentation
EP3942462B1 (en) Convolution neural network based landmark tracker
CN113221787A (en) Pedestrian multi-target tracking method based on multivariate difference fusion
CN113469074B (en) Remote sensing image change detection method and system based on twin attention fusion network
CN111192277A (en) Instance partitioning method and device
CN113065576A (en) Feature extraction method and device
KR20200067631A (en) Image processing apparatus and operating method for the same
CN111695673A (en) Method for training neural network predictor, image processing method and device
CN111738074B (en) Pedestrian attribute identification method, system and device based on weak supervision learning
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN111291695B (en) Training method and recognition method for recognition model of personnel illegal behaviors and computer equipment
CN115018039A (en) Neural network distillation method, target detection method and device
CN115620122A (en) Training method of neural network model, image re-recognition method and related equipment
CN113627421A (en) Image processing method, model training method and related equipment
CN113673308A (en) Object identification method, device and electronic system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200214