CN111666842B - Shadow detection method based on dual-stream atrous convolutional neural network - Google Patents

Shadow detection method based on dual-stream atrous convolutional neural network

Info

Publication number
CN111666842B
CN111666842B (application CN202010449023.6A)
Authority
CN
China
Prior art keywords
shadow
layer
pooling
network
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010449023.6A
Other languages
Chinese (zh)
Other versions
CN111666842A (en)
Inventor
李大威
王思凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Donghua University
Original Assignee
Donghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Donghua University
Priority to CN202010449023.6A
Publication of CN111666842A
Application granted
Publication of CN111666842B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Abstract

The invention relates to a shadow detection method based on a dual-stream atrous convolutional neural network, which comprises the following steps: the image containing shadow is input into the network in RGB three-channel form; image features are extracted by a pooling stream and a residual stream respectively; global and local features of the feature maps are fused by a multi-level atrous pooling module; the pooling stream upsamples the feature map to the size of the input image in decoder form, the residual stream continuously preserves low-dimensional features, and the two streams are fused after both have been upsampled to the input image size; the network is trained with a cross-entropy loss function to obtain the set of weights with the lowest loss value; these weights are used to detect shadows in test images, and an argmax function generates the shadow binary map. The invention achieves high shadow detection accuracy and preserves shadow edges well. The method can be used to remove falsely detected person and target shadow pixels after common algorithms such as object detection and change detection.

Description

Shadow detection method based on dual-stream atrous convolutional neural network
Technical Field
The invention relates to the technical field of deep learning and image processing, in particular to a robust shadow detection method for single images based on a semantic segmentation network.
Background
Shadow detection marks the shadow regions of a color image pixel by pixel. The task is to train a designed network structure to obtain the set of weights with the highest accuracy and use it to detect the shadow in a single picture; the shadow is called the foreground and marked white, while the remainder is called the background and marked black. Shadows are ubiquitous in natural scenes: one appears whenever an object blocks the path of light from a light source. In most cases, however, shadows in an image interfere with and complicate image processing. For example, in tasks such as foreground detection and segmentation, a shadow is often mistaken for the target, because, like the detected target, it differs markedly from the background in color, which greatly reduces detection accuracy. If shadows can be detected before the machine vision task is carried out, task accuracy can be improved substantially. Shadow detection has therefore long been a key task in the field of machine vision.
Research on shadow detection has developed along two essentially different lines: hand-crafted features and deep learning. In early traditional algorithms, researchers analyzed the structural and color features of shadows from an optical and image-processing perspective, which required the algorithm designer to possess substantial optics and image-processing knowledge, for example analyzing color histograms in the HSV space or the influence of ambient-light intensity and object transparency on shadows. More recently, researchers have applied deep learning: detectors based on convolutional neural networks extract a feature map from each image through sampling operations such as convolution and pooling, so the network learns the structural features of shadows by itself and achieves end-to-end shadow detection.
Disclosure of Invention
The purpose of the invention is to accurately detect the shadow regions in a single image.
In order to achieve the above object, the technical solution of the present invention is to provide a shadow detection method based on a dual-stream atrous convolutional neural network, characterized by comprising the following steps:
step S1: inputting the single pictures in the training set into the designed network in sequence in RGB three-channel form;
step S2: the input image is first processed by two streams, a pooling stream and a residual stream, wherein: the pooling stream is downsampled in encoder form by atrous convolution modules, gradually extracting high-dimensional features; the residual stream consists of several cross-stream residual modules, which use the atrous convolution module to extract features by convolution and superimpose the previous layer's feature map with the feature information of the corresponding pooling stream to preserve low-dimensional features;
step S3: sending the feature maps obtained from the first four layers of the pooling stream into a multi-level atrous pooling module, pooling the feature maps of the first three layers to the same size by atrous convolutions with different dilation rates to obtain the three-layer feature maps of the first part, performing global average pooling on the fourth-layer feature map and bilinearly interpolating it to the pooled size of the first part to obtain the second-part feature map, and finally fusing the four layers of feature maps to obtain the output of the final downsampling part;
step S4: upsampling the output feature map of the multi-level atrous pooling module in decoder form by atrous convolution modules, completely symmetrically to the downsampling process, finally upsampling the image to the same size as the input image;
step S5: after the input layer, hidden layer and output layer of the network are determined, the images and labels in the data set are all sent to the network for training according to steps S1 to S4; the labels are shadow binary maps of the same size as the images, marking shadow and non-shadow regions pixel by pixel; the number of training rounds is determined from the convergence trend of the training loss, whose calculation is divided into two steps: in the first step, the logits value of a sample is denoted x and converted to a probability with the softmax function

y_i = e^{x_i} / Σ_j e^{x_j}

in the second step, the loss value is calculated with the weighted cross-entropy formula -z × Σ y' × log(y), where y' is the label, y is the logits probability value calculated in the first step, and z is a self-defined weight;
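A minimal PyTorch sketch of this two-step loss follows; softmax as the logits-to-probability conversion and a single scalar z are assumptions where the text leaves details open:

```python
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, labels, z=1.0):
    """Step 1: logits x -> probabilities y via softmax.
    Step 2: weighted cross entropy -z * sum(y' * log(y)).
    logits: (N, 2, H, W) raw network outputs; labels: (N, H, W), 1 = shadow."""
    y = F.softmax(logits, dim=1)                      # x -> probability y
    y_true = F.one_hot(labels.long(), num_classes=2)  # label y' as one-hot, (N, H, W, 2)
    y_true = y_true.permute(0, 3, 1, 2).float()       # -> (N, 2, H, W)
    per_pixel = -z * (y_true * torch.log(y + 1e-8)).sum(dim=1)
    return per_pixel.mean()                           # average over all pixels
```

In training this would be called as, e.g., loss = weighted_cross_entropy(net(img), gt_mask, z=1.5) and minimized with any standard optimizer.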
step S6: testing the image to be detected with the stored weights; once the weight parameters are determined, an input image produces a shadow feature map and a non-shadow feature map at the network output, and the two feature maps are then converted into the detected shadow binary map by an argmax function.
Preferably, in step S2, the selected atrous convolution module comprises four layers: the first layer is an ordinary convolution with a 3 × 3 kernel; the second layer is an atrous convolution with dilation rate 3 and kernel size 11 × 11; the third layer is the same as the second layer; the fourth layer is the same as the first layer.
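A sketch of this four-layer module in PyTorch; the kernel configuration follows the Fig. 3 description (3 × 3 kernels, rate 3 in the two middle layers), and the channel widths and BatchNorm/ReLU placement are assumptions:

```python
import torch.nn as nn

class AtrousConvModule(nn.Module):
    """Four-layer ACM sketch: ordinary 3x3 conv, two rate-3 atrous
    convolutions, then another ordinary 3x3 conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        def block(cin, cout, dilation):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        self.layers = nn.Sequential(
            block(in_ch,  out_ch, dilation=1),  # layer 1: ordinary convolution
            block(out_ch, out_ch, dilation=3),  # layer 2: atrous, rate 3
            block(out_ch, out_ch, dilation=3),  # layer 3: same as layer 2
            block(out_ch, out_ch, dilation=1),  # layer 4: same as layer 1
        )

    def forward(self, x):
        return self.layers(x)
```

With padding equal to the dilation rate, every layer preserves the spatial size, which is what the encoder/decoder symmetry of steps S2 and S4 requires.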
Preferably, the loss function in step S5 is a weighted cross-entropy loss.
Preferably, in step S6, after a single color image is input into the designed network, a shadow feature map and a non-shadow feature map are output and stored as arrays. The argmax function compares the detected values of corresponding pixels in the two feature maps: if the foreground value of a pixel is larger, the pixel is considered foreground, i.e. a required shadow pixel, marked 255 and displayed as a white region in the shadow binary map; if the background value is larger, the pixel is considered background, i.e. non-shadow, marked 0 and displayed as a black region in the shadow binary map. The detected shadow binary map is obtained in this way.
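A sketch of the argmax conversion, assuming the two maps arrive as separate (H, W) tensors:

```python
import torch

def to_shadow_binary_map(shadow_map, non_shadow_map):
    """Pick the larger of the two detected values at every pixel:
    shadow wins -> 255 (white), non-shadow wins -> 0 (black)."""
    stacked = torch.stack([non_shadow_map, shadow_map], dim=0)  # (2, H, W)
    idx = torch.argmax(stacked, dim=0)  # 1 wherever the shadow value is larger
    return (idx * 255).to(torch.uint8)
```

Stacking the background map first makes the argmax index coincide with the 0/1 shadow label, so multiplying by 255 yields the black/white map directly.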
The invention adopts a dual-stream network structure: the residual stream keeps low-level image features stable during learning, while the pooling stream extracts and fuses image features from low level to deep level. By introducing atrous convolution, the network detects both large and broken shadows well. The method can be used to remove falsely detected person and target shadow pixels after common algorithms such as object detection and change detection; it can also be used on its own as an image shadow-region detector.
Owing to the above technical scheme, compared with the prior art, the invention has the following advantages and positive effects:
1) The invention extracts features with atrous convolution modules. Unlike ordinary convolution, atrous convolution enlarges the receptive field while retaining the pixel-position information of the features, which greatly reduces missed detections caused by shadows covering differently textured regions. For example, when a vehicle's shadow covers an asphalt road with white lane markings, other detectors easily miss the shadow on the white markings, whereas our detector detects the whole shadow region.
2) The multi-level atrous pooling module extracts global features and fuses them with local features, so the network captures both the local and the global characteristics of shadows and can better decide whether a dark region is a shadow or merely a dark-colored object. For example, other detectors easily detect a person in black clothes together with the shadow as shadow, while our detector distinguishes the black clothes from the shadow well.
3) To reduce the computational complexity of the network, the invention designs a cross-stream residual module, which sums the previous stage's residual-stream output feature map, that output passed through an atrous convolution module, and the corresponding pooling-stream output feature map upsampled to the same size, thereby preserving low-dimensional features and accelerating training.
4) Training on a large number of data sets strengthens the detector's robustness: shadows on various textures are detected accurately, and the detector fully adapts to changes in the detection scene.
5) Detection is fast: a single 480 × 480 color image takes only 0.12 s, which suits shadow detection in video sequences.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is the network structure diagram of the invention, which is composed of three modules: the atrous convolution module (ACM), the multi-level atrous pooling module (MLAPM) and the cross-stream residual module (CSRM); the legend is at the upper right corner.
Fig. 3 is the internal structure diagram of the atrous convolution module, where rate is the dilation rate of a convolution kernel, i.e. the size of the holes injected into the kernel. With rate 3, a 3 × 3 convolution kernel performs atrous convolution with a receptive field the same size as that of a 7 × 7 ordinary kernel. In this module we use ordinary convolution for the first and fourth layers and rate-3 atrous convolution for the second and third layers.
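As a check on these figures: the effective extent of a dilated kernel is k_eff = k + (k - 1)(r - 1), so a 3 × 3 kernel at rate 3 covers 3 + 2 × 2 = 7 pixels per side, exactly the 7 × 7 ordinary-convolution receptive field stated above.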
Fig. 4 is the internal structure of the multi-level atrous pooling module, which consists of two parts. The first part fuses the feature maps of the first three network layers: atrous convolutions with dilation rates of different sizes are applied, and the three groups of feature maps are then sampled to the same size. The second part is a simplified multi-scale pyramid pooling: global average pooling extracts the global features, which are then upsampled to the same size as the first part. Finally the features of the two parts are fused into one group.
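A structural sketch of this module in PyTorch; the specific dilation rates, channel widths, and the choice of the third feature map's resolution as the common size are assumptions, with only the two-part layout taken from the figure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelAtrousPooling(nn.Module):
    """Two-part MLAPM sketch: three atrous branches for the first three
    pooling-stream feature maps, plus global average pooling on the fourth."""
    def __init__(self, chs=(64, 128, 256, 512), out_ch=256):
        super().__init__()
        # one atrous conv per shallow feature map (rates are assumptions)
        self.branches = nn.ModuleList([
            nn.Conv2d(c, out_ch, kernel_size=3, padding=r, dilation=r)
            for c, r in zip(chs[:3], (6, 4, 2))
        ])
        self.global_proj = nn.Conv2d(chs[3], out_ch, kernel_size=1)
        self.fuse = nn.Conv2d(out_ch * 4, out_ch, kernel_size=1)

    def forward(self, f1, f2, f3, f4):
        size = f3.shape[-2:]  # common size (assumed: the third map's resolution)
        part1 = [F.interpolate(branch(f), size=size, mode='bilinear',
                               align_corners=False)
                 for branch, f in zip(self.branches, (f1, f2, f3))]
        g = F.adaptive_avg_pool2d(f4, 1)              # global average pooling
        part2 = F.interpolate(self.global_proj(g), size=size,
                              mode='bilinear', align_corners=False)
        return self.fuse(torch.cat(part1 + [part2], dim=1))  # fuse into one group
```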
Fig. 5 is the cross-stream residual module, which fuses the pooling-stream information with the residual-stream information and preserves the previous stage's features using the feature-preserving scheme of residual networks, keeping the local features complete.
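A sketch of this three-way fusion in PyTorch; the inner atrous block is an abbreviated stand-in for the ACM sketched above, and the 1 × 1 projection used to match channel widths is an assumption:

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossStreamResidual(nn.Module):
    """CSRM sketch: sum of (a) the previous residual-stream feature map,
    (b) its atrous-convolved version, and (c) the matching pooling-stream
    feature map upsampled to the same size."""
    def __init__(self, ch, pool_ch):
        super().__init__()
        self.acm = nn.Sequential(  # abbreviated stand-in for the ACM above
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=3, dilation=3), nn.ReLU(inplace=True),
        )
        self.proj = nn.Conv2d(pool_ch, ch, kernel_size=1)  # match channel widths

    def forward(self, res_feat, pool_feat):
        pool_up = F.interpolate(pool_feat, size=res_feat.shape[-2:],
                                mode='bilinear', align_corners=False)
        # the three-way sum preserves the low-dimensional features
        return res_feat + self.acm(res_feat) + self.proj(pool_up)
```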
FIG. 6 shows shadow detection results of the invention on the SBU and ISTD data sets, where input is the color image under test and ground truth is the manually labeled shadow map.
Fig. 7 shows shadow detection on the "Bungalows" video sequence of the CDnet2012 data set; five frames are randomly selected for display. The second row shows the WeSamBe foreground detection algorithm, which can be seen to detect foreground objects together with their shadows, although the shadows are false detections. The third row shows our shadow detection algorithm applied to the WeSamBe result, with the detected shadow regions shown in light gray.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that those skilled in the art, after reading the teaching of the present invention, may make various changes or modifications to it, and such equivalents likewise fall within the scope defined by the appended claims.
As shown in fig. 1, this embodiment discloses a shadow detection method based on a dual-stream atrous convolutional neural network, with the following specific steps:
step S1: acquiring a single color image and inputting it into the designed network in RGB three-channel form;
step S2: the input image is first processed by two streams, a pooling stream and a residual stream. The pooling stream is downsampled in encoder form by atrous convolution modules, gradually extracting high-dimensional features such as semantics. The residual stream consists of several cross-stream residual modules: an atrous convolution module extracts features by convolution, and the previous layer's feature map is superimposed with the feature information of the corresponding pooling stream to preserve low-dimensional features. Notably, the selected atrous convolution module contains four layers: the first layer is an ordinary convolution with a 3 × 3 kernel; the second layer is an atrous convolution with dilation rate 3 and kernel size 11 × 11; the third layer is the same as the second, and the fourth the same as the first.
Step S3: the feature maps obtained from the first four layers of the pooling stream are sent to the multi-level atrous pooling module, and the feature maps of the first three layers are pooled to the same size by atrous convolutions with different dilation rates, forming the first part. The second part is the fourth-layer feature map after global average pooling, bilinearly interpolated to the size of the first part. Finally, the four layers of feature maps are fused to obtain the final output of the downsampling part.
Step S4: the output feature map of the multi-level atrous pooling module is upsampled in decoder form by atrous convolution modules; the upsampling is completely symmetrical to the downsampling process, and the image is finally upsampled to the same size as the input image.
Step S5: after the input layer, hidden layer and output layer of the network are determined, the images and labels in the data set (binary maps of the same size as the images, marking shadow and non-shadow regions pixel by pixel) are all sent to the network for training according to the four steps above. The number of training epochs can be determined from the convergence trend of the training loss. The loss function is a weighted cross-entropy, calculated in two steps: in the first step, the logits value of a sample is denoted x and converted to a probability with the softmax function y_i = e^{x_i} / Σ_j e^{x_j}; in the second step, the loss value is calculated with the weighted cross-entropy formula -z × Σ y' × log(y), where y' is the label, y is the logits probability value calculated in the first step, and z is a self-defined weight. The network continuously minimizes the loss function, and the training parameters are stored once the best training result is reached.
Step S6: the stored weights are used to test the picture to be detected; the image is input into the network under these weight parameters to obtain the two classes of feature maps, and the two feature maps are converted into the detected shadow binary map by an argmax function.
Converting the two feature maps into a shadow binary map in step S6 specifically comprises the following steps:
step 6.1, the detected image is sent into the network and the output is stored as arrays; our network divides the pixels into two classes: foreground (shadow) and background (non-shadow);
step 6.2, using the argmax function, the values at corresponding positions in the two feature maps are compared; if the foreground value of a pixel is larger, the pixel is considered foreground, i.e. a required shadow pixel, marked 255 and displayed as white; likewise, if the background value is larger, the pixel is background, i.e. non-shadow, marked 0 and displayed as black. The detected shadow binary map is obtained in this way.

Claims (4)

1. A shadow detection method based on a dual-stream atrous convolutional neural network, characterized by comprising the following steps:
step S1: inputting the single pictures in the training set into the designed network in sequence in RGB three-channel form;
step S2: the input image is first processed by two streams, a pooling stream and a residual stream, wherein: the pooling stream is downsampled in encoder form by atrous convolution modules, gradually extracting high-dimensional features; the residual stream consists of several cross-stream residual modules, which use the atrous convolution module to extract features by convolution and superimpose the previous layer's feature map with the feature information of the corresponding pooling stream to preserve low-dimensional features;
step S3: sending the feature maps obtained from the first four layers of the pooling stream into a multi-level atrous pooling module, pooling the feature maps of the first three layers to the same size by atrous convolutions with different dilation rates to obtain the three-layer feature maps of the first part, performing global average pooling on the fourth-layer feature map and bilinearly interpolating it to the pooled size of the first part to obtain the second-part feature map, and finally fusing the four layers of feature maps to obtain the output of the final downsampling part;
step S4: upsampling the output feature map of the multi-level atrous pooling module in decoder form by atrous convolution modules, completely symmetrically to the downsampling process, finally upsampling the image to the same size as the input image;
step S5: after the input layer, hidden layer and output layer of the network are determined, the images and labels in the data set are all sent to the network for training according to steps S1 to S4; the labels are shadow binary maps of the same size as the images, marking shadow and non-shadow regions pixel by pixel; the number of training rounds is determined from the convergence trend of the training loss, whose calculation is divided into two steps: in the first step, the logits value of a sample is denoted x and converted to a probability with the softmax function

y_i = e^{x_i} / Σ_j e^{x_j}

in the second step, the loss value is calculated with the weighted cross-entropy formula -z × Σ y' × log(y), where y' is the label, y is the logits probability value calculated in the first step, and z is a self-defined weight;
step S6: testing the image to be detected with the stored weights; once the weight parameters are determined, an input image produces a shadow feature map and a non-shadow feature map at the network output, and the two feature maps are then converted into the detected shadow binary map by an argmax function.
2. The shadow detection method based on the dual-stream atrous convolutional neural network of claim 1, characterized in that in step S2 the selected atrous convolution module comprises four layers: the first layer is an ordinary convolution with a 3 × 3 kernel; the second layer is an atrous convolution with dilation rate 3 and kernel size 11 × 11; the third layer is the same as the second layer; the fourth layer is the same as the first layer.
3. The shadow detection method based on the dual-stream atrous convolutional neural network of claim 1, characterized in that the loss function in step S5 is a weighted cross-entropy loss.
4. The shadow detection method based on the dual-stream atrous convolutional neural network of claim 1, characterized in that in step S6, after a single color image is input into the designed network, a shadow feature map and a non-shadow feature map are output and stored as arrays; the argmax function compares the detected values of corresponding pixels in the two feature maps: if the foreground value of a pixel is larger, the pixel is considered foreground, i.e. a required shadow pixel, marked 255 and displayed as a white region in the shadow binary map; if the background value is larger, the pixel is considered background, i.e. non-shadow, marked 0 and displayed as a black region in the shadow binary map; the detected shadow binary map is obtained in this way.
CN202010449023.6A 2020-05-25 2020-05-25 Shadow detection method based on dual-stream atrous convolutional neural network Active CN111666842B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010449023.6A CN111666842B (en) 2020-05-25 2020-05-25 Shadow detection method based on dual-stream atrous convolutional neural network

Publications (2)

Publication Number Publication Date
CN111666842A CN111666842A (en) 2020-09-15
CN111666842B true CN111666842B (en) 2022-08-26

Family

ID=72384497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010449023.6A Active CN111666842B (en) 2020-05-25 2020-05-25 Shadow detection method based on dual-stream atrous convolutional neural network

Country Status (1)

Country Link
CN (1) CN111666842B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257766B (en) * 2020-10-16 2023-09-29 中国科学院信息工程研究所 Shadow recognition detection method in natural scene based on frequency domain filtering processing
CN112949829A (en) * 2021-03-05 2021-06-11 深圳海翼智新科技有限公司 Feature graph pooling method, data processing method and computing device
CN113065578B (en) * 2021-03-10 2022-09-23 合肥市正茂科技有限公司 Image visual semantic segmentation method based on double-path region attention coding and decoding
CN113052775B (en) * 2021-03-31 2023-05-23 华南理工大学 Image shadow removing method and device
CN113178010B (en) * 2021-04-07 2022-09-06 湖北地信科技集团股份有限公司 High-resolution image shadow region restoration and reconstruction method based on deep learning
CN113920124B (en) * 2021-06-22 2023-04-11 西安理工大学 Brain neuron iterative segmentation method based on segmentation and error guidance
CN113870124B (en) * 2021-08-25 2023-06-06 西北工业大学 Weak supervision-based double-network mutual excitation learning shadow removing method

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986124A (en) * 2018-06-20 2018-12-11 天津大学 In conjunction with Analysis On Multi-scale Features convolutional neural networks retinal vascular images dividing method
CN109711413A (en) * 2018-12-30 2019-05-03 陕西师范大学 Image, semantic dividing method based on deep learning
CN109711448A (en) * 2018-12-19 2019-05-03 华东理工大学 Based on the plant image fine grit classification method for differentiating key field and deep learning
CN110084249A (en) * 2019-04-24 2019-08-02 哈尔滨工业大学 The image significance detection method paid attention to based on pyramid feature
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
CN110781776A (en) * 2019-10-10 2020-02-11 湖北工业大学 Road extraction method based on prediction and residual refinement network
CN110852267A (en) * 2019-11-11 2020-02-28 复旦大学 Crowd density estimation method and device based on optical flow fusion type deep neural network
CN111028235A (en) * 2019-11-11 2020-04-17 东北大学 Image segmentation method for enhancing edge and detail information by utilizing feature fusion
CN111079649A (en) * 2019-12-17 2020-04-28 西安电子科技大学 Remote sensing image ground feature classification method based on lightweight semantic segmentation network
CN111179244A (en) * 2019-12-25 2020-05-19 汕头大学 Automatic crack detection method based on cavity convolution
CN111192245A (en) * 2019-12-26 2020-05-22 河南工业大学 Brain tumor segmentation network and method based on U-Net network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10783640B2 (en) * 2017-10-30 2020-09-22 Beijing Keya Medical Technology Co., Ltd. Systems and methods for image segmentation using a scalable and compact convolutional neural network

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
CN108986124A (en) * 2018-06-20 2018-12-11 天津大学 In conjunction with Analysis On Multi-scale Features convolutional neural networks retinal vascular images dividing method
CN109711448A (en) * 2018-12-19 2019-05-03 华东理工大学 Based on the plant image fine grit classification method for differentiating key field and deep learning
CN109711413A (en) * 2018-12-30 2019-05-03 陕西师范大学 Image, semantic dividing method based on deep learning
CN110084249A (en) * 2019-04-24 2019-08-02 哈尔滨工业大学 The image significance detection method paid attention to based on pyramid feature
CN110781776A (en) * 2019-10-10 2020-02-11 湖北工业大学 Road extraction method based on prediction and residual refinement network
CN110852267A (en) * 2019-11-11 2020-02-28 复旦大学 Crowd density estimation method and device based on optical flow fusion type deep neural network
CN111028235A (en) * 2019-11-11 2020-04-17 东北大学 Image segmentation method for enhancing edge and detail information by utilizing feature fusion
CN111079649A (en) * 2019-12-17 2020-04-28 西安电子科技大学 Remote sensing image ground feature classification method based on lightweight semantic segmentation network
CN111179244A (en) * 2019-12-25 2020-05-19 汕头大学 Automatic crack detection method based on cavity convolution
CN111192245A (en) * 2019-12-26 2020-05-22 河南工业大学 Brain tumor segmentation network and method based on U-Net network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille. "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. *
Pingping Zhang, Wei Liu, Yinjie Lei, Hongyu Wang, Huchuan Lu. "RAPNet: Residual Atrous Pyramid Network for Importance-Aware Street Scene Parsing." IEEE Transactions on Image Processing, 2020. *

Also Published As

Publication number Publication date
CN111666842A (en) 2020-09-15

Similar Documents

Publication Publication Date Title
CN111666842B (en) Shadow detection method based on dual-stream atrous convolutional neural network
CN109934200B (en) RGB color remote sensing image cloud detection method and system based on improved M-Net
CN110956094B (en) RGB-D multi-mode fusion personnel detection method based on asymmetric double-flow network
CN110309808B (en) Self-adaptive smoke root node detection method in large-scale space
CN111797712B (en) Remote sensing image cloud and cloud shadow detection method based on multi-scale feature fusion network
CN110033040B (en) Flame identification method, system, medium and equipment
CN111680690B (en) Character recognition method and device
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN111353544B (en) Improved Mixed Pooling-YOLOV 3-based target detection method
CN109840483B (en) Landslide crack detection and identification method and device
CN111582092B (en) Pedestrian abnormal behavior detection method based on human skeleton
CN110866455B (en) Pavement water body detection method
CN110555464A (en) Vehicle color identification method based on deep learning model
CN114841972A (en) Power transmission line defect identification method based on saliency map and semantic embedded feature pyramid
CN114742799B (en) Industrial scene unknown type defect segmentation method based on self-supervision heterogeneous network
CN114155527A (en) Scene text recognition method and device
CN114972191A (en) Method and device for detecting farmland change
CN116342894B (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN111626090A (en) Moving target detection method based on depth frame difference convolutional neural network
CN113516126A (en) Adaptive threshold scene text detection method based on attention feature fusion
CN112907626A (en) Moving object extraction method based on satellite time-exceeding phase data multi-source information
CN115410081A (en) Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium
CN115861756A (en) Earth background small target identification method based on cascade combination network
CN112926667A (en) Method and device for detecting saliency target of depth fusion edge and high-level feature
CN115775220A (en) Method and system for detecting anomalies in images using multiple machine learning programs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant