CN116188790A - Camera shielding detection method and device, storage medium and electronic equipment


Publication number
CN116188790A
Authority
CN
China
Prior art keywords
network
feature
loss function
detection model
camera
Prior art date
Legal status
Pending
Application number
CN202211718783.8A
Other languages
Chinese (zh)
Inventor
陶江龙
胡治满
于亚洲
陶和平
王毅成
闫帅
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202211718783.8A
Publication of CN116188790A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a camera occlusion detection method and device, a storage medium and electronic equipment. The method comprises the following steps: acquiring an initial camera occlusion detection model, wherein the initial camera occlusion detection model at least comprises a feature extraction network, a hybrid spatial pyramid network and a hybrid upsampling network, and the output end of the feature extraction network is connected with the input end of the hybrid spatial pyramid network and the input end of the hybrid upsampling network; inputting target image data into the initial camera occlusion detection model for training to obtain a trained camera occlusion detection model; and, under the condition that the target loss function value meets a loss function threshold, taking the trained camera occlusion detection model as the target camera occlusion detection model. The method solves the technical problems that camera occlusion detection methods in the related art have low recognition efficiency, poor accuracy and a single output result, and cannot meet real-time video analysis requirements.

Description

Camera shielding detection method and device, storage medium and electronic equipment
Technical Field
The invention relates to the field of artificial intelligence and intelligent video analysis, in particular to a camera shielding detection method, a device, a storage medium and electronic equipment.
Background
The video monitoring field relies on video stream data transmitted by cameras; if a camera is occluded, the acquired data loses its monitoring value. The traditional approach identifies camera occlusion with machine-learning edge detection and feature extraction methods, which are effective only when the camera is fully occluded or occluded over a large area, and cannot handle a camera rendered 'blind' by partial occlusion or darkness. Attempts have been made in the industry to recognize camera occlusion with neural networks, but existing neural-network detection of camera occlusion has low accuracy and a single capability, and cannot meet real-time video analysis requirements. High-level features in a deep neural network contain rich position information and semantic information, but the related art cannot reasonably exploit these discriminative features, so model prediction accuracy is low.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the present invention provide a camera occlusion detection method and device, a storage medium and electronic equipment, which at least solve the technical problems that camera occlusion detection methods in the related art have low recognition efficiency, poor accuracy and a single output result, and cannot meet real-time video analysis requirements.
According to an aspect of an embodiment of the present invention, there is provided a camera occlusion detection method, including: acquiring a pre-constructed initial camera occlusion detection model, wherein the initial camera occlusion detection model at least comprises a feature extraction network, a mixed space pyramid network and a mixed up-sampling network, the output end of the feature extraction network is connected with the input end of the mixed space pyramid network, and the output end of the feature extraction network is connected with the input end of the mixed up-sampling network; inputting target image data into the initial camera occlusion detection model for training to obtain a trained camera occlusion detection model; and under the condition that the target loss function value corresponding to the trained camera occlusion detection model meets the loss function threshold, taking the trained camera occlusion detection model as a target camera occlusion detection model, wherein the target loss function value is determined based on a first loss function value corresponding to the mixed space pyramid network and a second loss function value corresponding to the mixed up-sampling network.
According to another aspect of the embodiment of the present invention, there is also provided a camera occlusion detection device, including: an acquisition module, a training module and a determining module, wherein the acquisition module is used for acquiring a pre-constructed initial camera occlusion detection model, the initial camera occlusion detection model at least comprises a feature extraction network, a hybrid spatial pyramid network and a hybrid upsampling network, the output end of the feature extraction network is connected with the input end of the hybrid spatial pyramid network, and the output end of the feature extraction network is connected with the input end of the hybrid upsampling network; the training module is used for inputting target image data into the initial camera occlusion detection model for training to obtain a trained camera occlusion detection model; and the determining module is configured to, when a target loss function value corresponding to the trained camera occlusion detection model meets a loss function threshold, use the trained camera occlusion detection model as a target camera occlusion detection model, where the target loss function value is determined based on a first loss function value corresponding to the hybrid spatial pyramid network and a second loss function value corresponding to the hybrid upsampling network.
According to another aspect of the embodiments of the present invention, there is further provided a non-volatile storage medium storing a plurality of instructions adapted to be loaded and executed by a processor to perform any one of the above-mentioned camera occlusion detection methods.
According to another aspect of the embodiments of the present invention, there is further provided an electronic device including one or more processors and a memory, where the memory is configured to store one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors implement any one of the methods for detecting camera occlusion.
In the embodiment of the invention, a pre-constructed initial camera occlusion detection model is acquired, wherein the initial camera occlusion detection model at least comprises a feature extraction network, a hybrid spatial pyramid network and a hybrid upsampling network, the output end of the feature extraction network is connected with the input end of the hybrid spatial pyramid network, and the output end of the feature extraction network is connected with the input end of the hybrid upsampling network; target image data is input into the initial camera occlusion detection model for training to obtain a trained camera occlusion detection model; and under the condition that the target loss function value corresponding to the trained camera occlusion detection model meets a loss function threshold, the trained camera occlusion detection model is taken as the target camera occlusion detection model, wherein the target loss function value is determined based on the first loss function value corresponding to the hybrid spatial pyramid network and the second loss function value corresponding to the hybrid upsampling network. This achieves the purpose of constructing a camera occlusion detection model with higher camera occlusion recognition accuracy and better performance, achieves the technical effects of improving camera occlusion recognition accuracy and recognition efficiency, and thereby solves the technical problems that camera occlusion detection methods in the related art have low recognition efficiency, poor accuracy and a single output result, and cannot meet real-time video analysis requirements.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a schematic diagram of a camera occlusion detection method according to an embodiment of the present invention;
FIG. 2 is an alternative hybrid spatial pyramid network generation flow diagram according to an embodiment of the present invention;
FIG. 3 is an alternative spatially hidden feature network generation flow diagram in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a model structure of an alternative initial camera occlusion detection model in accordance with an embodiment of the present invention;
FIG. 5 is an alternative schematic view of the occlusion status and occlusion ratio output of an image to be identified according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a camera occlusion detection device according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The video monitoring field relies on video stream data transmitted by cameras; if a camera is occluded, the acquired data loses its monitoring value. The traditional approach identifies camera occlusion with machine-learning edge detection and feature extraction methods, which are effective only when the camera is fully occluded or occluded over a large area, and cannot handle a camera rendered 'blind' by partial occlusion or darkness. Attempts have been made in the industry to recognize camera occlusion with neural networks, but existing neural-network detection of camera occlusion has low accuracy and a single capability, and cannot meet real-time video analysis requirements. High-level features in a deep neural network contain rich position information and semantic information, but the related art cannot reasonably exploit these discriminative features, so model prediction accuracy is low. The advantages and disadvantages of existing camera occlusion detection systems include:
(1) Advantages of identifying camera occlusion based on the traditional method: image texture features are extracted by a manually designed feature extractor and, combined with an edge detection algorithm, large-area or extremely obvious camera occlusion can be detected rapidly. Disadvantages of identifying camera occlusion based on the traditional method: the model can only handle large-area or extremely obvious occlusion scenes, cannot be used in semi-occluded or dark scenes, produces a single output result, and depends on other auxiliary devices or manual decisions.
(2) Advantages of identifying camera occlusion based on a neural network: the strong feature extraction capability of a neural network can easily cope with various occlusion scenes, and camera occlusion is detected in a fully intelligent way. Disadvantages of identifying camera occlusion based on a neural network: the model depends on a large amount of camera occlusion training data, has low accuracy and a single capability, and the existing neural networks cannot fully utilize middle- and high-level semantic features and position information, and cannot meet real-time video analysis requirements.
In view of the foregoing, embodiments of the present invention provide an embodiment of a camera occlusion detection method. It should be noted that the steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from that illustrated herein.
Fig. 1 is a flowchart of a method for detecting camera occlusion according to an embodiment of the present invention, as shown in fig. 1, the method includes the steps of:
step S102, a pre-built initial camera occlusion detection model is obtained, wherein the initial camera occlusion detection model at least comprises a feature extraction network, a hybrid space pyramid network and a hybrid up-sampling network, the output end of the feature extraction network is connected with the input end of the hybrid space pyramid network, and the output end of the feature extraction network is connected with the input end of the hybrid up-sampling network.
Optionally, the feature extraction network includes: a third number of inverted residual layers, a fourth number of channel attention layers, and a fifth number of spatial attention layers, wherein each inverted residual layer comprises two convolution layers, two batch normalization layers, two activation layers, one depthwise (DW) convolution layer and one residual connection; each channel attention layer comprises two convolution layers, an average pooling layer, a maximum pooling layer and two fully connected layers; and each spatial attention layer includes a convolution layer and a sigmoid activation function. The hybrid spatial pyramid network comprises a spatial pyramid sampling network and a spatial hidden layer feature network, wherein the spatial pyramid sampling network comprises a first sampling network and a second sampling network; the first sampling network comprises two different convolution layers and a pooling layer, and the second sampling network comprises two pooling layers: an average pooling layer and a maximum pooling layer. The hybrid upsampling network includes a deconvolution layer and two bilinear interpolation layers.
It can be understood that the initial camera occlusion detection model including the feature extraction network, the hybrid spatial pyramid network and the hybrid upsampling network is an end-to-end perception network model (named the HybNet model). The HybNet model consists of a feature extraction module, a classification module and a semantic segmentation module. The feature extraction module is composed of the feature extraction network and comprises 15 stacked inverted residual layers, 15 channel attention layers and 15 spatial attention layers. The inverted residual block consists of 2 1×1 convolution layers, 2 batch normalization (BN) layers, 2 activation layers, 1 depthwise (DW) convolution layer and 1 residual connection, connected as: 1×1 convolution layer -> BN layer -> activation layer -> DW convolution layer -> 1×1 convolution layer -> BN layer -> activation layer -> residual connection; the channel attention layer consists of 2 convolution layers, 1 average pooling layer, 1 maximum pooling layer and 2 fully connected layers; the spatial attention layer consists of 1 convolution layer and 1 sigmoid activation function. The classification module is composed of the hybrid spatial pyramid network (i.e. the HSP operator) and outputs the occlusion state of the camera (occluded or not occluded); the semantic segmentation module is composed of the hybrid upsampling network (i.e. the HUF operator), learns the camera occlusion area, and calculates the occlusion proportion according to the occlusion proportion formula. The SHF features (i.e. the spatial weight parameters) are used in both the HSP and HUF operators to assign weights in the spatial dimension and weigh the outputs.
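To make the layer composition of the feature extraction module concrete, the following is a minimal PyTorch-style sketch of one inverted residual block and the channel and spatial attention layers described above; the expansion ratio, activation function, attention kernel size and reduction ratio are assumptions rather than values from the text, and the two convolution layers mentioned for the channel attention layer are omitted for brevity.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """1x1 conv -> BN -> act -> DW conv -> 1x1 conv -> BN -> act -> residual connection."""
    def __init__(self, channels, expansion=4):                 # expansion ratio assumed
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),                             # activation function assumed
        )
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False)
        self.project = nn.Sequential(
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU6(inplace=True),
        )

    def forward(self, x):
        return x + self.project(self.dw(self.expand(x)))       # residual connection

class ChannelAttention(nn.Module):
    """Average pooling + max pooling feeding two shared fully connected layers."""
    def __init__(self, channels, reduction=16):                 # reduction ratio assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = torch.sigmoid(self.fc(x.mean(dim=(2, 3))) + self.fc(x.amax(dim=(2, 3))))
        return x * w.view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """One convolution layer followed by a sigmoid activation function."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # kernel size assumed

    def forward(self, x):
        stat = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(stat))

# One of the 15 stacked feature-extraction stages (channel count assumed):
stage = nn.Sequential(InvertedResidual(64), ChannelAttention(64), SpatialAttention())
```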
And step S104, inputting the target image data into the initial camera occlusion detection model for training, and obtaining a trained camera occlusion detection model.
Optionally, before model training, data preprocessing is required to obtain the target image data, including image smoothing, data enhancement and data normalization; the specific steps are as follows: a. Image smoothing: Gaussian blur processing is performed on the original image data to obtain first image data. b. Data enhancement: data enhancement processing is performed on the first image data to obtain second image data, specifically: the first image data is rotated at 15° intervals between -90° and 90°, and the images are flipped horizontally and vertically to expand the image data; the expanded image data is used as the second image data. c. Data normalization: the second image data is standardized to obtain the target image data, specifically: the second image data is normalized with the Z-score using the corresponding mean value μ and standard deviation σ, and the normalized image data is used as the target image data.
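As an illustration of the preprocessing pipeline described above (Gaussian blur, rotation and flip augmentation, Z-score normalization), the following is a minimal sketch using OpenCV and NumPy; the blur kernel size and image layout are assumptions.

```python
import cv2
import numpy as np

def preprocess(raw_images):
    """raw_images: list of HxWx3 uint8 frames captured from the camera."""
    # a. Image smoothing: Gaussian blur (kernel size assumed)
    first = [cv2.GaussianBlur(img, (5, 5), 0) for img in raw_images]

    # b. Data enhancement: rotations at 15-degree intervals between -90 and 90 degrees,
    #    plus horizontal and vertical flips, to expand the data set
    second = []
    for img in first:
        h, w = img.shape[:2]
        for angle in range(-90, 91, 15):
            m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
            second.append(cv2.warpAffine(img, m, (w, h)))
        second.append(cv2.flip(img, 1))   # horizontal flip
        second.append(cv2.flip(img, 0))   # vertical flip

    # c. Data normalization: Z-score with the mean and standard deviation of the expanded data
    stack = np.stack(second).astype(np.float32)
    mu, sigma = stack.mean(), stack.std()
    return (stack - mu) / (sigma + 1e-8)  # target image data
```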
In an alternative embodiment, the inputting the target image dataset into the initial camera occlusion detection model for training, to obtain a trained camera occlusion detection model, includes:
Step S1041, inputting the target image data set into the feature extraction network to obtain a feature extraction result output by the feature extraction network and a trained feature extraction network;
step S1042, inputting a first number of feature graphs in the feature extraction result to the hybrid spatial pyramid network to obtain a first shielding state output by the hybrid spatial pyramid network and a trained hybrid spatial pyramid network.
Optionally, the first number of feature maps includes one or more of a second feature map output by a last network layer of the feature extraction network and a feature map output by an intermediate network layer of the feature extraction network, and the first number of feature maps are different in size. For example, the first number of feature maps includes three size feature maps [40×40,20×20,10×10], where a feature map with a size of 10×10 is a second feature map of the last network layer output of the feature extraction network.
In an alternative embodiment, in a case where the hybrid spatial pyramid network includes a spatial pyramid sampling network and a spatial hidden layer feature network, inputting a first number of feature maps in the feature extraction result into the hybrid spatial pyramid network to obtain the first occlusion state output by the hybrid spatial pyramid network comprises: inputting the first number of feature maps output by the feature extraction network into the spatial pyramid sampling network to obtain a fused feature map corresponding to the first number of feature maps; inputting the second feature map into the spatial hidden layer feature network to obtain spatial weight parameters corresponding to the pixel positions included in the second feature map; and determining the first occlusion state output by the hybrid spatial pyramid network based on the fused feature map and the spatial weight parameters.
In this way, the feature maps of different sizes (i.e. the first number of feature maps) output by the feature extraction network are fused, so that the fused feature map aggregates semantic information from different depths and widths of the feature maps, achieving the effect of feature semantic aggregation; furthermore, the spatial weight parameters obtained through spatial mapping by the spatial hidden layer feature network dynamically adjust the values at the corresponding positions in each channel feature map. The occlusion state of the image determined based on the fused feature map and the spatial weight parameters is therefore more accurate.
In an optional embodiment, in a case where the spatial pyramid sampling network includes a first sampling network and a second sampling network, inputting the first number of feature maps output by the feature extraction network into the spatial pyramid sampling network to obtain a fused feature map corresponding to the first number of feature maps includes: inputting the first number of feature maps into the first sampling network to obtain first output feature maps respectively corresponding to the first number of feature maps, wherein the first sampling network comprises two different convolution layers and a pooling layer; inputting the first number of feature maps into the second sampling network to obtain second output feature maps respectively corresponding to the first number of feature maps, wherein the second sampling network comprises two pooling layers: an average pooling layer and a maximum pooling layer; performing splicing processing on the first output feature maps and the second output feature maps respectively corresponding to the first number of feature maps to obtain third output feature maps respectively corresponding to the first number of feature maps; and performing fusion processing on the third output feature maps respectively corresponding to the first number of feature maps to obtain the fused feature map.
Optionally, but not limited to, a self-attention mechanism is used to perform the fusion processing on the third output feature maps respectively corresponding to the first number of feature maps, so as to obtain the fused feature map.
Optionally, taking as an example that the first number of feature maps includes three feature maps of sizes [40×40, 20×20, 10×10], where the 10×10 feature map is the second feature map output by the last network layer of the feature extraction network, the spatial pyramid sampling network is composed of 2 sampling combinations, namely a first sampling network [3*3 convolution, 5*5 convolution, max pooling] and a second sampling network [average pooling, max pooling]. The three feature maps of sizes [40×40, 20×20, 10×10] are respectively input into the first sampling network and the second sampling network of the spatial pyramid sampling network, and the output feature maps of the two sampling combinations are then spliced in the channel dimension (that is, the first output feature map of the first sampling network and the second output feature map of the second sampling network of the same size are correspondingly spliced) to obtain three feature maps V1, V2 and V3 (i.e. the third output feature maps). The three obtained feature maps V1, V2 and V3 are fused with a self-attention mechanism, in which the softmax activation function softmax() is applied, to obtain the fused feature map output result F_out.
In this way, the semantic information in feature maps of different sizes is fully fused, and both local information and global information are taken into account. It should be noted that the two sampling combinations (i.e. the first sampling network and the second sampling network) respectively adopt different strides s, padding and channel numbers, so that semantic information from different depths and widths of the feature maps can be effectively extracted and gathered together by splicing in the channel dimension, achieving the effect of feature semantic aggregation.
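The following is a rough PyTorch-style sketch of the two sampling combinations and the channel-dimension splicing described above; the strides and padding, the common size the spliced maps are brought to, and the exact form of the self-attention fusion are not specified in the text, so a simple softmax-weighted fusion over the three spliced maps is used purely as an illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstSampling(nn.Module):
    """First sampling network: 3*3 convolution, 5*5 convolution, max pooling."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # stride assumed
        self.conv5 = nn.Conv2d(channels, channels, 5, stride=1, padding=2)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        return self.pool(self.conv5(self.conv3(x)))

class SecondSampling(nn.Module):
    """Second sampling network: average pooling followed by max pooling."""
    def forward(self, x):
        return F.max_pool2d(F.avg_pool2d(x, 2), 2)

def hsp_fuse(feats, first_branches, second_branch, size=(10, 10)):
    """feats: the 40x40, 20x20 and 10x10 backbone maps, each with C channels (assumed equal)."""
    vs = []
    for f, first in zip(feats, first_branches):
        v = torch.cat([first(f), second_branch(f)], dim=1)      # splice in channel dim -> V1, V2, V3
        vs.append(F.interpolate(v, size=size, mode="bilinear", align_corners=False))
    v = torch.stack(vs)                                          # (3, B, 2C, 10, 10)
    weights = torch.softmax(v.mean(dim=(2, 3, 4)), dim=0)        # illustrative softmax attention weights
    return (weights[:, :, None, None, None] * v).sum(dim=0)      # fused feature map F_out
```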
Optionally, the spatial hidden layer feature (denoted SHF) is a weight parameter for different pixel positions learned in the spatial dimension. The second feature map is input to the spatial hidden layer feature network to obtain the spatial weight parameters corresponding to the pixel positions included in the second feature map, and the calculation in the spatial hidden layer feature network is as follows:
(1) local maximization: F1 = max_c(F_in)
(2) local averaging: F2 = avg_c(F_in)
(3) hidden layer feature learning: F3 = conv(cat(F1, F2))
(4) normalization: SHF = normalize(F3)
wherein F1 represents the maximum value at each pixel position of the feature map (i.e. the second feature map) output by the last network layer of the feature extraction network, F2 represents the mean value at each pixel position of the feature map output by the last network layer of the feature extraction network, F3 is the result obtained from the splicing of F1 and F2 corresponding to the feature map output by the last network layer of the feature extraction network, cat(F1, F2) represents splicing F1 and F2, F_in refers to the feature map output by the last network layer of the feature extraction network (i.e. the second feature map, such as the feature map of size 10×10), max is the channel-wise maximum, avg is the channel-wise average, conv is a 1×1 convolution, SHF is the calculated hidden layer feature (i.e. the spatial weight parameter), C represents the channel dimension in the feature extraction network, H represents the feature map height in each channel, and W represents the feature map width in each channel.
It can be understood that the above spatial hidden layer feature is a weight parameter obtained through spatial mapping and can dynamically adjust the value at the corresponding position in each channel feature map.
Optionally, based on the fused feature map and the spatial weight parameter, the first occlusion state of the hybrid spatial pyramid network output is determined by:
F'_out = softmax(avg(F_out * SHF))
wherein F'_out is the output of the HSP operator (i.e. the first occlusion state output by the hybrid spatial pyramid network), F_out is the output of the spatial pyramid sampling network (i.e. the fused feature map), and SHF is the calculated hidden layer feature (i.e. the spatial weight parameter).
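To make the SHF computation and the HSP classification output concrete, the following is a minimal PyTorch-style sketch under stated assumptions: the normalization in step (4) is taken to be a sigmoid, and a final linear classifier (not described in the text) is assumed to map the pooled features to the occluded / not-occluded classes.

```python
import torch
import torch.nn as nn

class SpatialHiddenFeature(nn.Module):
    """SHF: spatial weight map computed from the last backbone feature map F_in."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)               # the 1x1 convolution

    def forward(self, f_in):                                      # f_in: (B, C, H, W)
        f1 = f_in.amax(dim=1, keepdim=True)                       # (1) local maximization over channels
        f2 = f_in.mean(dim=1, keepdim=True)                       # (2) local averaging over channels
        f3 = self.conv(torch.cat([f1, f2], dim=1))                # (3) hidden layer feature learning
        return torch.sigmoid(f3)                                  # (4) normalization (sigmoid assumed)

class HSPHead(nn.Module):
    """F'_out = softmax(avg(F_out * SHF)); the linear classifier is an assumption."""
    def __init__(self, channels, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, f_out, shf):
        pooled = (f_out * shf).mean(dim=(2, 3))                   # spatial re-weighting, then avg pooling
        return torch.softmax(self.fc(pooled), dim=1)              # occluded / not-occluded probabilities
```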
Step S1043, inputting the second number of feature maps in the feature extraction result to the mixed up-sampling network, to obtain a first occlusion ratio output by the mixed up-sampling network, and a trained mixed up-sampling network.
Optionally, the second number of feature maps is a plurality of feature maps among the feature maps output by the intermediate network layers of the feature extraction network, and the second number of feature maps are different in size. For example, the second number of feature maps comprises a third feature map F1 of size 20×20, a fourth feature map F2 of size 40×40 and a fifth feature map F3 of size 80×80, and the sizes of the third feature map, the fourth feature map and the fifth feature map are in a predetermined multiple (e.g. 2×) relationship.
In an alternative embodiment, the mixed up-sampling network includes a deconvolution layer and two bilinear interpolation layers, the second number of feature maps includes a third feature map, a fourth feature map, and a fifth feature map, and when the dimensions of the third feature map, the fourth feature map, and the fifth feature map are in a predetermined multiple relationship, the inputting the second number of feature maps in the feature extraction result to the mixed up-sampling network, to obtain a first occlusion ratio output by the mixed up-sampling network includes: performing first upsampling processing on the third feature map by adopting the two bilinear interpolation layers to obtain a third feature map after the first upsampling processing; splicing the third characteristic diagram after the first upsampling treatment with the fourth characteristic diagram to obtain a first spliced characteristic diagram; performing weight distribution processing on pixel point positions included in the first spliced feature map based on the space weight parameters to obtain a second spliced feature map; performing second upsampling processing on the second spliced feature map by adopting the two bilinear interpolation layers to obtain a second spliced feature map after the second upsampling processing; splicing the second spliced characteristic map after the second upsampling treatment with the fifth characteristic map to obtain a third spliced characteristic map; performing weight distribution processing on pixel point positions included in the third spliced feature map based on the space weight parameters to obtain a fourth spliced feature map; inputting the fourth spliced characteristic map into the deconvolution layer to perform third upsampling treatment to obtain a sixth characteristic map, wherein the sixth characteristic map has the same size as the target image data; and carrying out semantic segmentation processing on the sixth feature map to obtain the first shielding proportion.
Optionally, taking as an example that the second number of feature maps includes a third feature map F1 of size 20×20, a fourth feature map F2 of size 40×40 and a fifth feature map F3 of size 80×80, the sizes of the third, fourth and fifth feature maps being in a predetermined multiple (e.g. 2×) relationship, the hybrid upsampling network acts on the segmentation task and consists of 1 deconvolution layer and 2 bilinear interpolation layers, which upsample the second number of feature maps (e.g. the three feature maps [F1: 20*20, F2: 40*40, F3: 80*80]). The specific process is as follows: the third feature map F1 is upsampled by 2× with a bilinear interpolation layer and spliced with the fourth feature map F2 in the channel dimension, and weights are then assigned with the SHF features (i.e. the spatial weight parameters); the resulting feature map is upsampled by 2× with a bilinear interpolation layer and spliced with the fifth feature map F3 in the channel dimension, and weights are again assigned with the SHF features; deconvolution is then used to upsample by 8× to the original image size, semantic segmentation is performed on the image, and the proportion of the occluded area in the original image is calculated to obtain the first occlusion proportion output by the hybrid upsampling network. It has been verified that bilinear interpolation can perform better than deconvolution at low upsampling rates, while deconvolution performs better at high sampling rates. The calculation formula of the first occlusion proportion output by the hybrid upsampling network (i.e. the HUF operator) is as follows:
Bilinear{Deconv[Deconv(F1), conv(F2)], conv(F3)}
where conv refers to a 1×1 convolution, Deconv refers to deconvolution, 2× or 8× refer to upsampling by a factor of 2 or 8, and Bilinear refers to bilinear interpolation.
The occlusion proportion is calculated by the following formula:
ratio = ( Σ_i Σ_j O_ij ) / (N * N)
wherein O_ij refers to an occluded pixel, denoted as 1, N refers to the width of the original image (corresponding to the target image data), i represents the column pixel position, and j represents the row pixel position.
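As an illustration of the hybrid upsampling process and the occlusion proportion calculation described above, the following is a rough PyTorch-style sketch; the channel counts, the way the 10×10 SHF map is resized before re-weighting, and the use of a single transposed convolution for the final 8× step are assumptions made for the sketch rather than details given in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HUFHead(nn.Module):
    """Bilinear 2x upsampling for the low rates, a transposed convolution for the final 8x,
    with SHF-based spatial re-weighting after each channel-dimension splice."""
    def __init__(self, c1, c2, c3, num_classes=2):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(c1 + c2 + c3, num_classes, kernel_size=8, stride=8)

    def forward(self, f1, f2, f3, shf):
        # f1: 20x20, f2: 40x40, f3: 80x80 intermediate backbone maps; shf: 10x10 weight map
        x = F.interpolate(f1, scale_factor=2, mode="bilinear", align_corners=False)
        x = torch.cat([x, f2], dim=1)                                           # splice with F2
        x = x * F.interpolate(shf, size=x.shape[-2:], mode="bilinear", align_corners=False)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = torch.cat([x, f3], dim=1)                                           # splice with F3
        x = x * F.interpolate(shf, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return self.deconv(x)                                                   # 8x up to the input resolution

def occlusion_ratio(mask):
    """mask: (N, N) tensor with occluded pixels O_ij = 1; ratio = sum_ij(O_ij) / (N * N)."""
    n = mask.shape[-1]
    return mask.float().sum() / (n * n)
```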
Step S1044, obtaining the trained camera occlusion detection model based on the trained feature extraction network, the trained hybrid spatial pyramid network, and the trained hybrid upsampling network.
Optionally, after model training is finished, combining the obtained trained feature extraction network, the trained mixed space pyramid network and the trained mixed up-sampling network to obtain the trained camera occlusion detection model.
And step S106, taking the trained camera occlusion detection model as a target camera occlusion detection model under the condition that the target loss function value corresponding to the trained camera occlusion detection model meets a loss function threshold, wherein the target loss function value is determined based on the first loss function value corresponding to the mixed space pyramid network and the second loss function value corresponding to the mixed up-sampling network.
Optionally, in the process of model training of the initial camera occlusion detection model, the stochastic gradient descent (SGD) optimization algorithm is used, the training mini-batch size is 64, the number of iteration epochs is 200, and the loss function is the target loss function described below, which weighs the classification loss against the segmentation loss with a learnable parameter. The learning rate is decreased by a factor of 10 at the [40, 80, 120]th epochs, respectively. The model outputs the camera occlusion state and the camera occlusion proportion; the output results are compared with the real results, and the parameters of the model are iteratively updated through the SGD optimization algorithm to obtain the optimal trained camera occlusion detection model as the target camera occlusion detection model.
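The training configuration described above (SGD, mini-batch size 64, 200 epochs, learning rate divided by 10 at epochs 40, 80 and 120) can be sketched as follows; the initial learning rate, momentum and data loader layout are assumptions.

```python
import torch

def train(model, loader, loss_fn, epochs=200, device="cuda"):
    """SGD training with the schedule described above; mini-batch size 64 is set in the DataLoader."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)            # initial lr/momentum assumed
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[40, 80, 120], gamma=0.1)
    model.to(device).train()
    for _ in range(epochs):
        for images, state_gt, mask_gt in loader:                                # occlusion state + mask labels
            images, state_gt, mask_gt = images.to(device), state_gt.to(device), mask_gt.to(device)
            state_pred, mask_pred = model(images)                               # HSP and HUF branch outputs
            loss = loss_fn(state_pred, state_gt, mask_pred, mask_gt)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return model
```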
In an optional embodiment, before taking the trained camera occlusion detection model as the target camera occlusion detection model, in a case where the target loss function value corresponding to the trained camera occlusion detection model meets the loss function threshold, the method further includes: determining the first loss function value based on the first occlusion state and the actual occlusion state; determining the second loss function value based on the first occlusion ratio and the actual occlusion ratio; the target loss function value is obtained based on the first loss function value and the second loss function value.
It should be noted that, in the embodiment of the present invention, the outputs of the classification task and the semantic segmentation task are information of two different dimensions and different semantics. In order to train the two tasks simultaneously and make them complement and balance each other, the first loss function value is determined as the classification task loss based on the first occlusion state output by the hybrid spatial pyramid network and the actual occlusion state, and the second loss function value is determined as the segmentation task loss based on the first occlusion proportion output by the hybrid upsampling network and the actual occlusion proportion, so that the classification task loss (corresponding to the hybrid spatial pyramid network) and the segmentation task loss (corresponding to the hybrid upsampling network) can be dynamically weighed during training.
In an alternative embodiment, the obtaining the objective loss function value based on the first loss function value and the second loss function value includes:
the target loss function value is obtained based on the first loss function value and the second loss function value by weighing the first loss function value L1 against the second loss function value L2 with the learnable parameter λ, with L2 scaled through log and d_k;
wherein Loss represents the target loss function value, λ represents a preset learnable parameter, L1 represents the first loss function value, L2 represents the second loss function value, and d_k represents the output dimension corresponding to the hybrid upsampling network; the first loss function value is determined based on a cross entropy loss function, and the second loss function value is determined based on a binary cross entropy loss function.
It should be noted that, compared with Loss = L1 + L2, the loss calculation provided by the embodiment of the present invention can dynamically adjust the heterogeneous task losses and is more flexible. In addition, taking the log and the d_k parameter reduces the loss gap between the two different tasks, so that the two tasks promote and interact with each other in a unified dimension.
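The exact loss formula is only reproduced as an image in the original publication; the sketch below therefore implements one plausible reading of the description (a learnable λ weighing the cross entropy classification loss L1 against the binary cross entropy segmentation loss L2, with L2 scaled through log and d_k) and should not be taken as the patented formula.

```python
import torch
import torch.nn as nn

class HeterogeneousLoss(nn.Module):
    """Assumed form: Loss = lam * L1 + (1 - lam) * log(1 + L2 * d_k), with a learnable lam;
    L1 is cross entropy (classification), L2 is binary cross entropy (segmentation)."""
    def __init__(self, d_k):
        super().__init__()
        self.raw_lam = nn.Parameter(torch.tensor(0.0))     # learnable weighing parameter
        self.ce = nn.CrossEntropyLoss()
        self.bce = nn.BCEWithLogitsLoss()
        self.d_k = float(d_k)                              # output dimension of the HUF branch

    def forward(self, state_logits, state_gt, mask_logits, mask_gt):
        # state_logits: raw classification scores; mask_logits: per-pixel segmentation logits
        l1 = self.ce(state_logits, state_gt)
        l2 = self.bce(mask_logits, mask_gt.float())
        lam = torch.sigmoid(self.raw_lam)                  # keep the weight in (0, 1)
        return lam * l1 + (1.0 - lam) * torch.log1p(l2 * self.d_k)
```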
In an optional embodiment, in the case that the target loss function value corresponding to the trained camera occlusion detection model meets the loss function threshold, after taking the trained camera occlusion detection model as the target camera occlusion detection model, the method further includes: acquiring an image to be identified; and inputting the image to be identified into the target camera shielding detection model to obtain a second shielding state and a second shielding proportion corresponding to the image to be identified.
Alternatively, the image to be identified may be, but is not limited to, obtained by capturing a photograph of the image in real time through a camera.
By the method, after the target camera shielding detection model is obtained, the image to be identified obtained through real-time snapshot of the camera is input into the target camera shielding detection model for testing, and the obtained testing result comprises shielding states and shielding proportions corresponding to the image to be identified. The target camera shielding detection model can not only know whether shielding exists in the image to be recognized, but also further recognize the image shielding proportion in the image to be recognized, and better meet the requirement of video real-time analysis.
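The following is a hedged sketch of how the trained target model might be applied to a real-time camera snapshot to obtain both outputs (occlusion state and occlusion proportion); the input layout, class index meaning and thresholds are assumptions rather than details given in the text.

```python
import cv2
import numpy as np
import torch

def detect_occlusion(model, frame, mu, sigma, device="cuda", threshold=0.5):
    """Run the trained target model on one camera snapshot and return both outputs."""
    img = cv2.GaussianBlur(frame, (5, 5), 0).astype(np.float32)
    img = (img - mu) / (sigma + 1e-8)                      # same Z-score normalization as in training
    x = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0).to(device)
    model.eval()
    with torch.no_grad():
        state_prob, mask_logits = model(x)                 # HSP probabilities, HUF per-pixel logits
    occluded = bool(state_prob[0, 1] > threshold)          # class index 1 assumed to mean "occluded"
    mask = torch.sigmoid(mask_logits[0, 0]) > threshold    # binary occlusion mask
    ratio = mask.float().mean().item()                     # proportion of occluded pixels
    return occluded, ratio
```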
Through the above steps S102 to S106, the purpose of constructing a camera occlusion detection model with higher camera occlusion recognition accuracy and better performance can be achieved, thereby achieving the technical effects of improving camera occlusion recognition accuracy and recognition efficiency, and solving the technical problems that camera occlusion detection methods in the related art have low recognition efficiency, poor accuracy and a single output result, and cannot meet real-time video analysis requirements.
Based on the above embodiment and optional embodiments, the present invention proposes an optional implementation, and the method includes the following steps:
Step S1, constructing an end-to-end perception network model (recorded as a HybNet model) as an initial camera shielding detection model, wherein the method comprises the following specific steps of:
Step S11, as shown in fig. 2, generating the hybrid spatial pyramid network (denoted as the HSP operator), which specifically includes the following steps:
and S111, constructing a spatial pyramid sampling network. The spatial pyramid sampling network is composed of 2 sampling combinations, namely a first sampling network [3*3 convolution, 5*5 convolution, and maximum pooling]Second sampling network [ average pooling, max pooling ]]Respectively [40 x 40,20 x 20,10 x 10 ]]The three size feature graphs are input into a first sampling network and a second sampling network in the spatial pyramid sampling network, then two groups of sampling combined output feature graphs are spliced in the channel dimension (namely, the first feature graph output by the first sampling network and the second feature graph output by the second sampling network are correspondingly spliced in the same size) to respectively obtain V 1 ,V 2 ,V 3 Three feature maps (i.e., a third feature map). For the three feature maps V obtained 1 ,V 2 ,V 3 And fusing by adopting a self-attention mechanism to obtain a fused characteristic diagram.
In this way, the semantic information in feature maps of different sizes is fully fused, and both local information and global information are taken into account. It should be noted that the two sampling combinations (i.e. the first sampling network and the second sampling network) respectively adopt different strides s, padding and channel numbers, so that semantic information from different depths and widths of the feature maps can be effectively extracted and gathered together by splicing in the channel dimension, achieving the effect of feature semantic aggregation.
Step S112, as shown in FIG. 3, a spatial hidden layer feature network (marked as SHF) is constructed. The spatial hidden layer feature is a weight parameter of different pixel positions learned in a spatial dimension, the second feature map is input into the spatial hidden layer feature network to obtain spatial weight parameters respectively corresponding to pixel point positions included in the second feature map, the spatial hidden layer feature network is a weight parameter obtained through spatial mapping, and numerical values of corresponding positions in each channel feature map can be dynamically adjusted.
And step S113, calculating the output result of the mixed space pyramid network.
The calculation formula of the output result of the mixed space pyramid network (namely the HSP operator) is as follows:
F'_out = softmax(avg(F_out * SHF))
wherein F'_out is the output of the HSP operator (i.e. the first occlusion state output by the hybrid spatial pyramid network), F_out is the output of the spatial pyramid sampling network (i.e. the fused feature map), and SHF is the calculated hidden layer feature (i.e. the spatial weight parameter).
Step S12, generating the hybrid upsampling network (denoted as the HUF operator), with the following specific steps: the hybrid upsampling network acts on the segmentation task and consists of 1 deconvolution layer and 2 bilinear interpolation layers, which respectively upsample the second number of feature maps (e.g. the three feature maps [F1: 20*20, F2: 40*40, F3: 80*80]). The specific process is as follows: the third feature map F1 is upsampled by 2× with a bilinear interpolation layer and spliced with the fourth feature map F2 in the channel dimension, and weights are then assigned with the SHF features (i.e. the spatial weight parameters); the resulting feature map is upsampled by 2× with a bilinear interpolation layer and spliced with the fifth feature map F3 in the channel dimension, and weights are again assigned with the SHF features; deconvolution is then used to upsample by 8× to the original image size, semantic segmentation is performed on the image, and the proportion of the occluded area in the original image is calculated to obtain the first occlusion proportion output by the hybrid upsampling network. Bilinear interpolation can perform better than deconvolution at low upsampling rates, while deconvolution performs better at high sampling rates.
Step S13, constructing the end-to-end perception network model as the initial camera occlusion detection model; the model structure of the initial camera occlusion detection model is shown in fig. 4. The initial camera occlusion detection model consists of a feature extraction module, a classification module and a semantic segmentation module. The feature extraction module is composed of the feature extraction network and comprises 15 stacked inverted residual layers, 15 channel attention layers and 15 spatial attention layers. The inverted residual block consists of 2 1×1 convolution layers, 2 batch normalization (BN) layers, 2 activation layers, 1 depthwise (DW) convolution layer and 1 residual connection, connected as: 1×1 convolution layer -> BN layer -> activation layer -> DW convolution layer -> 1×1 convolution layer -> BN layer -> activation layer -> residual connection; the channel attention layer consists of 2 convolution layers, 1 average pooling layer, 1 maximum pooling layer and 2 fully connected layers; the spatial attention layer consists of 1 convolution layer and 1 sigmoid activation function. The classification module is composed of the hybrid spatial pyramid network (i.e. the HSP operator) and outputs the occlusion state of the camera (occluded or not occluded); the semantic segmentation module is composed of the hybrid upsampling network (i.e. the HUF operator), learns the camera occlusion area, and calculates the occlusion proportion according to the occlusion proportion formula. The SHF features (i.e. the spatial weight parameters) are used in both the HSP and HUF operators to assign weights in the spatial dimension and weigh the outputs.
Step S2, model training, which comprises the following specific steps:
step S21, data preprocessing, including image smoothing, data enhancement and data normalization, comprising the following specific steps: a. image smoothing: and carrying out Gaussian blur processing on the original image data to obtain first image data. b. Data enhancement: performing data enhancement processing on the first image data to obtain second image data, wherein the method specifically comprises the following steps: the first image data is rotated at 15 ° intervals between-90 ° and 90 °, the image is flipped horizontally and flipped vertically to expand the image data, and the expanded image data is used as the second image data. c. Data normalization: the second image data is standardized to obtain the target image data, which specifically comprises the following steps: the mean value mu and standard deviation sigma corresponding to the second image data are normalized by using the Z-Score, and the normalized image data are used as target image data.
Step S22, defining a loss function. The embodiment of the invention performs heterogeneous task learning: the outputs of the classification task and the semantic segmentation task are information of two different dimensions and different semantics. In order to train the two tasks simultaneously and make them complement and balance each other, the classification task loss (corresponding to the hybrid spatial pyramid network) and the segmentation task loss (corresponding to the hybrid upsampling network) can be dynamically weighed during training, so the loss function is designed such that a learnable parameter λ, which can be dynamically adjusted during training, weighs L1 against L2, wherein L1 refers to the first loss function value of the classification module (i.e. the hybrid spatial pyramid network) and is determined based on the cross entropy loss function; L2 refers to the second loss function value of the segmentation module (i.e. the hybrid upsampling network) and is determined based on the binary cross entropy (BCE) loss function; and d_k refers to the dimension of the hybrid upsampling network output and is used to reasonably scale L2.
Step S23, training the model. In the process of model training of the initial camera occlusion detection model, the stochastic gradient descent (SGD) optimization algorithm is used, the training mini-batch size is 64, the number of iteration epochs is 200, and the loss function is the one defined in step S22. The learning rate is decreased by a factor of 10 at the [40, 80, 120]th epochs, respectively. The model outputs the camera occlusion state and the camera occlusion proportion; the output results are compared with the real results, and the parameters of the model are iteratively updated through the SGD optimization algorithm to obtain the optimal trained camera occlusion detection model as the target camera occlusion detection model.
Step S3, after the target camera shielding detection model is obtained, the image to be recognized obtained through real-time snapshot of the camera is input into the target camera shielding detection model for testing, and the obtained testing result comprises shielding states corresponding to the image to be recognized and shielding proportion, and as shown in FIG. 5, shielding states and shielding proportion information of the image to be recognized, which are predicted by the target camera shielding detection model, are displayed. The target camera occlusion detection model can not only know whether occlusion exists in the image to be identified, but also further identify the image occlusion proportion in the image to be identified.
The embodiment of the invention can achieve at least the following technical effects:
(1) The problem of camera occlusion detection is solved by constructing a hybrid-feature-fusion, heterogeneous-task-learning deep neural network. The hybrid feature fusion in the model is realized by the independently designed hybrid spatial pyramid network (HSP) and hybrid upsampling network (HUF), and heterogeneous task learning means that the model can learn and predict the camera occlusion condition and the occlusion proportion at the same time. The HSP operator and the HUF operator are respectively used to fuse middle- and high-level feature semantic information and position information in camera occlusion prediction and occlusion proportion prediction. The invention designs a self-learning loss function for heterogeneous task learning, which can be iteratively updated during back propagation to balance conflicts between the heterogeneous learning tasks. The model adopts a hybrid feature fusion and heterogeneous task learning architecture, makes full use of the semantic information and position information of middle- and high-level features, balances the multi-task learning losses, produces more accurate and reliable output results with a higher inference speed, and meets the requirements of real-time video detection and analysis.
(2) The following capabilities are mainly realized by adopting technologies such as deep learning and edge detection: accurately predicting the camera occlusion condition and occlusion proportion information in real time; fully utilizing middle- and high-level semantic information and position information; realizing mutual complementation and balance of heterogeneous task learning; and greatly reducing system resource consumption and cost investment.
(3) Innovatively designing the HSP operator and the HUF operator: the HSP operator acts on the classification task, fully fuses the semantic information and position information of high-level features in the network, and reasonably assigns weights in the spatial dimension using the spatial hidden feature (SHF); the HUF operator acts on the semantic segmentation task and samples the middle- and high-level feature maps at multiple scales, which alleviates the distortion of high-level semantic features during upsampling and avoids unbalanced feature weights.
(4) Innovatively providing a hybrid feature fusion and heterogeneous task learning network (HybNet): an end-to-end trainable neural network model HybNet is built, which accurately analyses the camera occlusion condition and occlusion proportion using the HSP operator, the HUF operator and the SHF features, and adapts to various camera occlusion scenes.
(5) Designing a self-learning loss function, so that the classification task and the segmentation task complement each other, the heterogeneous task losses are balanced, and the convergence speed and prediction accuracy of the model are improved.
(6) Introducing the inverted residual module and DW convolution to improve network inference speed: the model uses the inverted residual module and DW convolution to reduce the number of parameters, improve the inference speed and greatly reduce resource consumption and cost investment.
(7) Under the fully competitive landscape of the video monitoring market, the method can be applied to camera monitoring scenes such as safe campuses, smart cities and smart security, accurately and intelligently predicts the occlusion condition and occlusion proportion of cameras, gives alarm information, greatly reduces manual monitoring costs, and drives the economic development of the corresponding scenes.
In this embodiment, a camera shielding detection device is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and is not described in detail. As used below, the terms "module," "apparatus" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
According to an embodiment of the present invention, there is further provided an apparatus embodiment for implementing the above method for detecting camera occlusion, and fig. 6 is a schematic structural diagram of a camera occlusion detecting apparatus according to an embodiment of the present invention, as shown in fig. 6, where the above camera occlusion detecting apparatus includes: an acquisition module 600, a training module 602, a determination module 604, wherein:
the obtaining module 600 is configured to obtain an initial camera occlusion detection model that is constructed in advance, where the initial camera occlusion detection model at least includes a feature extraction network, a hybrid spatial pyramid network, and a hybrid upsampling network, an output end of the feature extraction network is connected to an input end of the hybrid spatial pyramid network, and an output end of the feature extraction network is connected to an input end of the hybrid upsampling network;
The training module 602, coupled to the obtaining module 600, is configured to input target image data into the initial camera occlusion detection model for training, to obtain a trained camera occlusion detection model;
the determining module 604 is connected to the training module 602, and is configured to take the trained camera occlusion detection model as a target camera occlusion detection model when a target loss function value corresponding to the trained camera occlusion detection model meets a loss function threshold, where the target loss function value is determined based on a first loss function value corresponding to the hybrid spatial pyramid network and a second loss function value corresponding to the hybrid upsampling network.
In the embodiment of the present invention, the acquiring module 600 acquires the pre-constructed initial camera occlusion detection model, which comprises at least a feature extraction network, a hybrid spatial pyramid network and a hybrid upsampling network, with the output end of the feature extraction network connected to the input ends of both the hybrid spatial pyramid network and the hybrid upsampling network; the training module 602, connected to the acquiring module 600, trains that model on the target image data; and the determining module 604, connected to the training module 602, takes the trained model as the target camera occlusion detection model when its target loss function value, determined from the first loss function value of the hybrid spatial pyramid network and the second loss function value of the hybrid upsampling network, meets the loss function threshold. In this way a camera occlusion detection model with higher occlusion recognition accuracy and better performance is constructed, the accuracy and efficiency of camera occlusion recognition are improved, and the technical problems of the related art (low recognition efficiency, poor accuracy, and a single output result that cannot meet real-time video analysis requirements) are solved.
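As a rough, non-authoritative illustration of the module decomposition above, the three modules can be thought of as three callables wired in sequence; the class name, callable signatures and return conventions below are assumptions, not the patent's code.

```python
# Hypothetical sketch of the module wiring (names and signatures are assumptions).
class CameraOcclusionDetectionDevice:
    def __init__(self, acquire_model, train_model, loss_threshold):
        self.acquire_model = acquire_model    # acquisition module 600
        self.train_model = train_model        # training module 602
        self.loss_threshold = loss_threshold  # used by the determining module 604

    def build_target_model(self, target_image_data):
        initial_model = self.acquire_model()  # acquisition module 600
        trained_model, target_loss = self.train_model(initial_model, target_image_data)
        # Determining module 604: accept the trained model only when the target loss,
        # combined from the HSP and HUF branch losses, meets the threshold.
        if target_loss <= self.loss_threshold:
            return trained_model
        return None
```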
It should be noted that each of the above modules may be implemented in software or in hardware. In the latter case this may be achieved, for example, by placing all of the modules in the same processor, or by distributing them across different processors in any combination.
It should also be noted that the acquiring module 600, the training module 602 and the determining module 604 correspond to steps S102 to S106 of the method embodiment; the modules share the examples and application scenarios of their corresponding steps, but are not limited to the disclosure of that embodiment. The above modules may run in a computer terminal as part of the apparatus.
For optional or preferred implementations of this embodiment, reference may be made to the related descriptions in the method embodiment; they are not repeated here.
The camera occlusion detection device may further include a processor and a memory, where the acquisition module 600, the training module 602, the determination module 604, and the like are stored as program modules, and the processor executes the program modules stored in the memory to implement corresponding functions.
The processor comprises one or more kernels; a kernel accesses the memory to call the corresponding program module. The memory may include volatile memory such as random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM), among other forms of computer-readable media, and comprises at least one memory chip.
According to an embodiment of the present application, there is also provided an embodiment of a nonvolatile storage medium. Optionally, in this embodiment, the nonvolatile storage medium includes a stored program, where the device in which the nonvolatile storage medium is located is controlled to execute any one of the above-mentioned camera occlusion detection methods when the program runs.
Alternatively, in this embodiment, the above-mentioned nonvolatile storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network or in any one of the mobile terminals in the mobile terminal group, and the above-mentioned nonvolatile storage medium includes a stored program.
Optionally, the program controls the device in which the nonvolatile storage medium is located to perform the following functions when running: acquiring a pre-constructed initial camera occlusion detection model, wherein the initial camera occlusion detection model at least comprises a feature extraction network, a mixed space pyramid network and a mixed up-sampling network, the output end of the feature extraction network is connected with the input end of the mixed space pyramid network, and the output end of the feature extraction network is connected with the input end of the mixed up-sampling network; inputting target image data into the initial camera occlusion detection model for training to obtain a trained camera occlusion detection model; and under the condition that the target loss function value corresponding to the trained camera occlusion detection model meets the loss function threshold, taking the trained camera occlusion detection model as a target camera occlusion detection model, wherein the target loss function value is determined based on a first loss function value corresponding to the mixed space pyramid network and a second loss function value corresponding to the mixed up-sampling network.
According to an embodiment of the present application, there is also provided an embodiment of a processor. Optionally, in this embodiment, the processor is configured to run a program, where any one of the methods for detecting camera occlusion is executed when the program runs.
According to an embodiment of the present application, there is also provided a computer program product embodiment: when executed on a data processing device, the computer program product is adapted to execute a program initialized with the steps of any one of the camera occlusion detection methods described above.
Optionally, the computer program product mentioned above, when executed on a data processing device, is adapted to perform a program initialized with the method steps of: acquiring a pre-constructed initial camera occlusion detection model, wherein the initial camera occlusion detection model at least comprises a feature extraction network, a mixed space pyramid network and a mixed up-sampling network, the output end of the feature extraction network is connected with the input end of the mixed space pyramid network, and the output end of the feature extraction network is connected with the input end of the mixed up-sampling network; inputting target image data into the initial camera occlusion detection model for training to obtain a trained camera occlusion detection model; and under the condition that the target loss function value corresponding to the trained camera occlusion detection model meets the loss function threshold, taking the trained camera occlusion detection model as a target camera occlusion detection model, wherein the target loss function value is determined based on a first loss function value corresponding to the mixed space pyramid network and a second loss function value corresponding to the mixed up-sampling network.
The embodiment of the invention provides an electronic device, which comprises a processor, a memory and a program stored on the memory and capable of running on the processor, wherein the following steps are realized when the processor executes the program: acquiring a pre-constructed initial camera occlusion detection model, wherein the initial camera occlusion detection model at least comprises a feature extraction network, a mixed space pyramid network and a mixed up-sampling network, the output end of the feature extraction network is connected with the input end of the mixed space pyramid network, and the output end of the feature extraction network is connected with the input end of the mixed up-sampling network; inputting target image data into the initial camera occlusion detection model for training to obtain a trained camera occlusion detection model; and under the condition that the target loss function value corresponding to the trained camera occlusion detection model meets the loss function threshold, taking the trained camera occlusion detection model as a target camera occlusion detection model, wherein the target loss function value is determined based on a first loss function value corresponding to the mixed space pyramid network and a second loss function value corresponding to the mixed up-sampling network.
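A minimal training-loop sketch of the acquire, train and threshold-check steps repeated above; the optimizer, learning rate, epoch budget and data-loader format are assumptions, and the criterion is assumed to be a loss module such as the SelfLearningLoss sketched earlier.

```python
# Hedged sketch of the training and acceptance flow (hyperparameters are assumptions).
import torch

def train_until_threshold(initial_model, criterion, train_loader,
                          loss_threshold, max_epochs=50, lr=1e-3):
    # Optimize the network weights and the learnable loss weight together.
    params = list(initial_model.parameters()) + list(criterion.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for images, occ_state, occ_mask in train_loader:
            cls_logits, seg_logits = initial_model(images)   # HSP and HUF outputs
            loss = criterion(cls_logits, occ_state, seg_logits, occ_mask)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # The trained model becomes the target model once the target loss meets the threshold.
        if epoch_loss / len(train_loader) <= loss_threshold:
            break
    return initial_model
```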
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, each embodiment is described with its own emphasis; for portions not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division into modules is a division by logical function, and other divisions are possible in actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces or modules, and may be electrical or take other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
If the integrated modules are implemented in the form of software functional modules and sold or used as a stand-alone product, they may be stored in a computer-readable non-volatile storage medium. Based on this understanding, the technical solution of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a non-volatile storage medium and including several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned non-volatile storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of protection of the present invention.

Claims (12)

1. A camera occlusion detection method, characterized by comprising the following steps:
acquiring a pre-constructed initial camera occlusion detection model, wherein the initial camera occlusion detection model at least comprises a feature extraction network, a hybrid spatial pyramid network and a hybrid upsampling network, an output end of the feature extraction network is connected with an input end of the hybrid spatial pyramid network, and the output end of the feature extraction network is connected with an input end of the hybrid upsampling network;
inputting target image data into the initial camera occlusion detection model for training to obtain a trained camera occlusion detection model;
under the condition that a target loss function value corresponding to the trained camera occlusion detection model meets a loss function threshold, taking the trained camera occlusion detection model as a target camera occlusion detection model, wherein the target loss function value is determined based on a first loss function value corresponding to the hybrid spatial pyramid network and a second loss function value corresponding to the hybrid upsampling network.
2. The method of claim 1, wherein inputting the target image data into the initial camera occlusion detection model for training to obtain the trained camera occlusion detection model comprises:
inputting the target image data into the feature extraction network to obtain a feature extraction result output by the feature extraction network and a trained feature extraction network;
inputting a first number of feature maps in the feature extraction result to the hybrid spatial pyramid network to obtain a first occlusion state output by the hybrid spatial pyramid network and a trained hybrid spatial pyramid network, wherein the first number of feature maps comprise one or more of a second feature map output by a last network layer of the feature extraction network and feature maps output by intermediate network layers of the feature extraction network, and the first number of feature maps differ in size;
inputting a second number of feature maps in the feature extraction result to the hybrid upsampling network to obtain a first occlusion proportion output by the hybrid upsampling network and a trained hybrid upsampling network, wherein the second number of feature maps are a plurality of the feature maps output by intermediate network layers of the feature extraction network, and the second number of feature maps differ in size;
and obtaining the trained camera occlusion detection model based on the trained feature extraction network, the trained hybrid spatial pyramid network and the trained hybrid upsampling network.
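A rough structural sketch of the two-branch wiring described in claim 2: the feature extraction network feeds one group of feature maps (including the last-layer map) to the hybrid spatial pyramid network and another group of intermediate maps to the hybrid upsampling network. The class and method names and the backbone's return convention are assumptions.

```python
# Illustrative wiring only; module internals and names are assumptions.
import torch.nn as nn

class HybNet(nn.Module):
    def __init__(self, backbone, hsp_head, huf_head):
        super().__init__()
        self.backbone = backbone    # feature extraction network
        self.hsp_head = hsp_head    # hybrid spatial pyramid network (occlusion state)
        self.huf_head = huf_head    # hybrid upsampling network (occlusion proportion)

    def forward(self, x):
        # The backbone is assumed to return intermediate feature maps and the last-layer map.
        intermediate_maps, last_map = self.backbone(x)
        # First group (different sizes, including the last-layer map) -> occlusion state.
        occlusion_state = self.hsp_head([*intermediate_maps, last_map])
        # Second group (intermediate maps of different sizes) -> occlusion proportion.
        occlusion_proportion = self.huf_head(intermediate_maps)
        return occlusion_state, occlusion_proportion
```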
3. The method according to claim 2, wherein, before taking the trained camera occlusion detection model as the target camera occlusion detection model in the case that the target loss function value corresponding to the trained camera occlusion detection model meets the loss function threshold, the method further comprises:
determining the first loss function value based on the first occlusion state and an actual occlusion state;
determining the second loss function value based on the first occlusion proportion and the actual occlusion proportion;
and obtaining the target loss function value based on the first loss function value and the second loss function value.
4. The method of claim 3, wherein obtaining the target loss function value based on the first loss function value and the second loss function value comprises:
obtaining the target loss function value from the first loss function value and the second loss function value according to the following formula:
[formula as filed; reproduced as image FDA0004028121290000021 in the original publication]
wherein Loss represents the target loss function value, λ represents a preset learnable parameter, L_1 represents the first loss function value, L_2 represents the second loss function value, and d_k represents the output dimension corresponding to the hybrid upsampling network; the first loss function value is determined based on a cross-entropy loss function, and the second loss function value is determined based on a binary cross-entropy loss function.
5. The method of claim 2, wherein, in a case where the hybrid spatial pyramid network comprises a spatial pyramid sampling network and a spatial hidden layer feature network, inputting the first number of feature maps in the feature extraction result to the hybrid spatial pyramid network to obtain the first occlusion state output by the hybrid spatial pyramid network comprises:
inputting the first number of feature maps output by the feature extraction network to the spatial pyramid sampling network to obtain a fused feature map corresponding to the first number of feature maps;
inputting the second feature map to the spatial hidden layer feature network to obtain spatial weight parameters respectively corresponding to the pixel positions included in the second feature map;
and determining the first occlusion state output by the hybrid spatial pyramid network based on the fused feature map and the spatial weight parameters.
6. The method according to claim 5, wherein, in a case where the spatial pyramid sampling network includes a first sampling network and a second sampling network, inputting the first number of feature maps output by the feature extraction network to the spatial pyramid sampling network to obtain the fused feature map corresponding to the first number of feature maps comprises:
inputting the first number of feature maps into the first sampling network to obtain first output feature maps respectively corresponding to the first number of feature maps, wherein the first sampling network comprises two different convolution layers and a pooling layer;
inputting the first number of feature maps into the second sampling network to obtain second output feature maps respectively corresponding to the first number of feature maps, wherein the second sampling network comprises an average pooling layer and two pooling layers;
performing splicing processing on the first output feature maps and the second output feature maps respectively corresponding to the first number of feature maps to obtain third output feature maps respectively corresponding to the first number of feature maps;
and performing fusion processing on the third output feature maps respectively corresponding to the first number of feature maps to obtain the fused feature map.
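A rough sketch of the HSP branch described in claims 5 and 6. Only the two-branch sampling, concatenation, fusion and spatial-weighting structure follows the claim text; channel counts, kernel sizes and the single-input simplification are assumptions.

```python
# Structural sketch only; kernel/channel sizes and the single-input simplification
# are assumptions, the branch/concat/fuse/weight structure follows claims 5-6.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSPHead(nn.Module):
    def __init__(self, in_ch, mid_ch, num_classes=2):
        super().__init__()
        # First sampling network: two different convolution layers and a pooling layer.
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1),
            nn.MaxPool2d(kernel_size=2),
        )
        # Second sampling network: an average pooling layer followed by two pooling layers.
        self.branch2 = nn.Sequential(
            nn.AvgPool2d(kernel_size=2),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
        )
        # Spatial hidden layer feature network: per-pixel weights from the last-layer map.
        self.shf = nn.Sequential(nn.Conv2d(in_ch, 1, kernel_size=1), nn.Sigmoid())
        self.classifier = nn.Linear(in_ch + mid_ch, num_classes)

    def forward(self, feature_map):
        # A single input map stands in for the "first number" of feature maps.
        fused = torch.cat([self.branch1(feature_map), self.branch2(feature_map)], dim=1)
        weights = self.shf(feature_map)                       # spatial weight parameters
        weights = F.interpolate(weights, size=fused.shape[-2:])
        pooled = (fused * weights).mean(dim=(2, 3))           # spatially weighted pooling
        return self.classifier(pooled)                        # first occlusion state logits
```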
7. The method according to claim 5, wherein, in a case where the hybrid upsampling network includes a deconvolution layer and two bilinear interpolation layers, the second number of feature maps includes a third feature map, a fourth feature map and a fifth feature map, and the sizes of the third feature map, the fourth feature map and the fifth feature map are in a predetermined multiple relationship, inputting the second number of feature maps in the feature extraction result to the hybrid upsampling network to obtain the first occlusion proportion output by the hybrid upsampling network comprises:
performing first upsampling processing on the third feature map using the two bilinear interpolation layers to obtain a third feature map after the first upsampling processing;
performing splicing processing on the third feature map after the first upsampling processing and the fourth feature map to obtain a first spliced feature map;
performing weight distribution processing on the pixel positions included in the first spliced feature map based on the spatial weight parameters to obtain a second spliced feature map;
performing second upsampling processing on the second spliced feature map using the two bilinear interpolation layers to obtain a second spliced feature map after the second upsampling processing;
performing splicing processing on the second spliced feature map after the second upsampling processing and the fifth feature map to obtain a third spliced feature map;
performing weight distribution processing on the pixel positions included in the third spliced feature map based on the spatial weight parameters to obtain a fourth spliced feature map;
inputting the fourth spliced feature map to the deconvolution layer for third upsampling processing to obtain a sixth feature map, wherein the sixth feature map has the same size as the target image data;
and performing semantic segmentation processing on the sixth feature map to obtain the first occlusion proportion.
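A rough sketch of the HUF branch in claim 7: two bilinear up-samplings interleaved with concatenation and spatial weighting, followed by a deconvolution back to input resolution and a segmentation read-out. Channel counts, the 2x scale factors and the proportion read-out are assumptions.

```python
# Structural sketch only; channel counts, scale factors and the proportion read-out
# are assumptions, the upsample/concat/weight/deconv order follows claim 7.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HUFHead(nn.Module):
    def __init__(self, ch3, ch4, ch5, out_stride=4):
        super().__init__()
        # Two bilinear interpolation layers and one deconvolution layer.
        self.up1 = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.up2 = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.deconv = nn.ConvTranspose2d(ch3 + ch4 + ch5, 1,
                                         kernel_size=out_stride, stride=out_stride)

    def forward(self, f3, f4, f5, spatial_weights):
        x = self.up1(f3)                                 # first up-sampling (bilinear)
        x = torch.cat([x, f4], dim=1)                    # first spliced feature map
        w = F.interpolate(spatial_weights, size=x.shape[-2:])
        x = x * w                                        # weight distribution -> second spliced map
        x = self.up2(x)                                  # second up-sampling (bilinear)
        x = torch.cat([x, f5], dim=1)                    # third spliced feature map
        w = F.interpolate(spatial_weights, size=x.shape[-2:])
        x = x * w                                        # weight distribution -> fourth spliced map
        seg = torch.sigmoid(self.deconv(x))              # sixth map at input resolution
        return seg.mean(dim=(1, 2, 3))                   # first occlusion proportion per image
```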
8. The method of claim 1, wherein the feature extraction network comprises: a third number of inverted residual layers, a fourth number of channel attention layers, and a fifth number of spatial attention layers, wherein each inverted residual layer comprises two convolution layers, two batch normalization layers, two activation layers, one depthwise (DW) convolution layer, and one residual layer; each channel attention layer comprises two convolution layers, an average pooling layer, a maximum pooling layer, and two fully connected layers; and each spatial attention layer comprises a convolution layer and a sigmoid activation function.
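A rough sketch of one inverted residual layer from claim 8: two 1x1 convolutions, each followed by batch normalization and an activation, a depthwise (DW) convolution in between, and a residual connection. The expansion ratio, the activation choice, and the omission of the channel/spatial attention layers are assumptions made for brevity.

```python
# Structural sketch of an inverted residual layer with DW convolution; the
# channel/spatial attention layers of claim 8 are omitted, sizes are assumptions.
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, ch, expand=4):
        super().__init__()
        hidden = ch * expand
        self.expand = nn.Sequential(                      # conv + BN + activation (1 of 2)
            nn.Conv2d(ch, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
        )
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                            groups=hidden, bias=False)    # depthwise (DW) convolution
        self.project = nn.Sequential(                     # conv + BN + activation (2 of 2)
            nn.Conv2d(hidden, ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(ch),
            nn.ReLU6(inplace=True),
        )

    def forward(self, x):
        return x + self.project(self.dw(self.expand(x)))  # residual layer
```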
9. The method according to any one of claims 1 to 8, wherein, after taking the trained camera occlusion detection model as the target camera occlusion detection model in the case that the target loss function value corresponding to the trained camera occlusion detection model meets the loss function threshold, the method further comprises:
acquiring an image to be identified;
and inputting the image to be identified into the target camera occlusion detection model to obtain a second occlusion state and a second occlusion proportion corresponding to the image to be identified.
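A minimal inference sketch for claim 9; the pre-processing and the exact output formats of the target model are assumptions.

```python
# Minimal inference sketch (output formats and class index are assumptions).
import torch

@torch.no_grad()
def detect_occlusion(target_model, image_tensor):
    target_model.eval()
    cls_logits, proportion = target_model(image_tensor.unsqueeze(0))
    occluded = bool(cls_logits.argmax(dim=1).item())   # second occlusion state (1 = occluded, assumed)
    return occluded, float(proportion)                 # second occlusion proportion
```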
10. A camera occlusion detection device, comprising:
an acquisition module, a training module and a determining module, wherein the acquisition module is used for acquiring a pre-constructed initial camera occlusion detection model, the initial camera occlusion detection model at least comprises a feature extraction network, a hybrid spatial pyramid network and a hybrid upsampling network, an output end of the feature extraction network is connected with an input end of the hybrid spatial pyramid network, and the output end of the feature extraction network is connected with an input end of the hybrid upsampling network;
the training module is used for inputting target image data into the initial camera occlusion detection model for training to obtain a trained camera occlusion detection model;
The determining module is configured to, when a target loss function value corresponding to the trained camera occlusion detection model meets a loss function threshold, use the trained camera occlusion detection model as a target camera occlusion detection model, where the target loss function value is determined based on a first loss function value corresponding to the hybrid spatial pyramid network and a second loss function value corresponding to the hybrid upsampling network.
11. A non-volatile storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the camera occlusion detection method of any of claims 1 to 9.
12. An electronic device comprising one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the camera occlusion detection method of any of claims 1-9.
CN202211718783.8A 2022-12-29 2022-12-29 Camera shielding detection method and device, storage medium and electronic equipment Pending CN116188790A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211718783.8A CN116188790A (en) 2022-12-29 2022-12-29 Camera shielding detection method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211718783.8A CN116188790A (en) 2022-12-29 2022-12-29 Camera shielding detection method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116188790A true CN116188790A (en) 2023-05-30

Family

ID=86441525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211718783.8A Pending CN116188790A (en) 2022-12-29 2022-12-29 Camera shielding detection method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116188790A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704666A (en) * 2023-06-21 2023-09-05 合肥中科类脑智能技术有限公司 Vending method, computer readable storage medium, and vending machine
CN117095411A (en) * 2023-10-16 2023-11-21 青岛文达通科技股份有限公司 Detection method and system based on image fault recognition
CN117095411B (en) * 2023-10-16 2024-01-23 青岛文达通科技股份有限公司 Detection method and system based on image fault recognition
CN117475357A (en) * 2023-12-27 2024-01-30 北京智汇云舟科技有限公司 Monitoring video image shielding detection method and system based on deep learning
CN117475357B (en) * 2023-12-27 2024-03-26 北京智汇云舟科技有限公司 Monitoring video image shielding detection method and system based on deep learning

Similar Documents

Publication Publication Date Title
CN116188790A (en) Camera shielding detection method and device, storage medium and electronic equipment
CN112651438A (en) Multi-class image classification method and device, terminal equipment and storage medium
CN108280455B (en) Human body key point detection method and apparatus, electronic device, program, and medium
CN111832592B (en) RGBD significance detection method and related device
CN104063686B (en) Crop leaf diseases image interactive diagnostic system and method
CN113159300B (en) Image detection neural network model, training method thereof and image detection method
CN108389172B (en) Method and apparatus for generating information
CN110148088B (en) Image processing method, image rain removing method, device, terminal and medium
CN113469074B (en) Remote sensing image change detection method and system based on twin attention fusion network
CN113781510B (en) Edge detection method and device and electronic equipment
CN109815797A (en) Biopsy method and device
CN115358952B (en) Image enhancement method, system, equipment and storage medium based on meta-learning
CN111553227A (en) Lightweight face detection method based on task guidance
CN116029947A (en) Complex optical image enhancement method, device and medium for severe environment
CN114708173A (en) Image fusion method, computer program product, storage medium, and electronic device
CN112668675B (en) Image processing method and device, computer equipment and storage medium
CN114708172A (en) Image fusion method, computer program product, storage medium, and electronic device
CN112132867A (en) Remote sensing image transformation detection method and device
CN111353577B (en) Multi-task-based cascade combination model optimization method and device and terminal equipment
Li et al. Deep Learning-based Model for Automatic Salt Rock Segmentation
CN116246184A (en) Papaver intelligent identification method and system applied to unmanned aerial vehicle aerial image
CN116415019A (en) Virtual reality VR image recognition method and device, electronic equipment and storage medium
CN113256556A (en) Image selection method and device
CN112883988B (en) Training and feature extraction method of feature extraction network based on multiple data sets
CN114998990B (en) Method and device for identifying safety behaviors of personnel on construction site

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination