CN111339892B - Swimming pool drowning detection method based on end-to-end 3D convolutional neural network


Info

Publication number
CN111339892B
CN111339892B (application CN202010106457.6A)
Authority
CN
China
Prior art keywords: feature, layer, behavior, video, input
Legal status: Active
Application number: CN202010106457.6A
Other languages: Chinese (zh)
Other versions: CN111339892A (en)
Inventor
纪刚
商胜楠
周萌萌
周亚敏
周粉粉
Current Assignee: Qingdao Lianhe Chuangzhi Technology Co ltd
Original Assignee: Qingdao Lianhe Chuangzhi Technology Co ltd
Application filed by Qingdao Lianhe Chuangzhi Technology Co ltd
Priority to CN202010106457.6A
Publication of CN111339892A
Application granted
Publication of CN111339892B

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06V 2201/07: Target detection
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Combinations of networks


Abstract

The invention belongs to the technical field of video monitoring and relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network. The method comprises: S1, applying pixel-level binary labels to the original surveillance video; S2, feeding each video clip into the 3D convolutional neural network at the encoder end to obtain the feature cube F_i of the input video clip v_i; S3, using a segmentation branch to make a pixel-level background/behavior-foreground prediction for each frame of v_i; S4, cropping the features at the corresponding positions of F_i, feeding them into a recognition branch, and applying ToI pooling to obtain a predicted behavior label; S5, reading the real-time video stream of the swimming pool area, localizing the behavior positions of swimmers, predicting behavior labels, and judging whether abnormal behaviors such as drowning occur. By adopting pixel-level binary labeling, the method saves the time consumed by sample annotation; it provides pixel-level behavior localization, which is more accurate than bounding-box localization and avoids the problem of difficult bounding-box regression convergence.

Description

Swimming pool drowning detection method based on end-to-end 3D convolutional neural network
Technical field:
The invention belongs to the technical field of video monitoring, relates to a method for detecting drowning behavior in surveillance video, and particularly relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network.
Background art:
A convolutional neural network is a feedforward neural network with a deep structure whose computation includes convolution, and it is one of the representative algorithms of deep learning. A convolutional neural network has feature-learning ability and can perform shift-invariant classification of its input according to its hierarchical structure; it comprises an input layer, hidden layers, and an output layer.
The input layer of a convolutional neural network can process multidimensional data. The input layer of a one-dimensional convolutional neural network receives a one- or two-dimensional array, where a one-dimensional array is usually a time-series or spectrum sample and a two-dimensional array may contain multiple channels; the input layer of a two-dimensional convolutional neural network receives a two- or three-dimensional array; the input layer of a three-dimensional convolutional neural network receives a four-dimensional array. The hidden layers of a convolutional neural network commonly contain 3 kinds of structures, namely convolutional layers, pooling layers, and fully connected layers; some more modern architectures include more complicated structures such as Inception modules and residual blocks. A convolutional layer extracts features from the input data; it contains multiple convolution kernels, and each element of a kernel corresponds to a weight coefficient and a bias. After a convolutional layer extracts features, the output feature map is passed to a pooling layer for feature selection and information filtering. A pooling layer contains a preset pooling function, which replaces the value at a single point of the feature map with a statistic of its neighborhood. A fully connected layer in a convolutional neural network is equivalent to a hidden layer in a traditional feedforward neural network; it is located in the last part of the hidden layers and only passes signals to other fully connected layers. In a fully connected layer the feature map loses its spatial topology, being flattened into a vector and passed through an activation function. Since fully connected layers usually sit immediately upstream of the output layer, their structure and working principle are the same as those of the output layer in a traditional feedforward neural network. For image classification problems, the output layer produces classification labels using a logistic function or a normalized exponential (softmax) function.
In the prior art, Chinese patent publication CN108304806A discloses a gesture recognition method based on log path-integral features and a convolutional neural network, which comprises the steps of: labeling video data and training a hand detector based on Fast-RCNN; detecting video samples frame by frame with the hand detector to obtain the hand position in each frame; constructing two-, three-, and four-dimensional hand trajectories from the per-frame hand positions combined with time and depth; applying data augmentation to the hand trajectories; extracting the corresponding log path-integral features from the augmented trajectory samples; arranging the log path-integral features according to spatial position information to construct a corresponding feature cube; and taking the feature cube as the input of a convolutional neural network that outputs the recognition result. Chinese patent publication CN108830157A discloses a human behavior recognition method based on an attention mechanism and a 3D convolutional neural network, in which the input layer of the 3D convolutional neural network comprises two channels, namely the original grayscale images and an attention matrix. That method constructs a 3D CNN model for recognizing human behaviors in video and introduces an attention mechanism, computing the distance between two frames as the attention matrix; the attention matrix and the original human behavior video sequence form a two-channel input to the constructed 3D CNN, and convolution operations extract the key features of visually salient regions. Meanwhile, the 3D CNN structure is optimized by adding a Dropout layer to randomly freeze part of the network connection weights and using the ReLU activation function to improve network sparsity. At present, processing and analyzing surveillance video with a convolutional neural network requires a large number of training samples, needs a priori anchor boxes, and relies on selective search, which leads to the problems of difficult sample annotation, inaccurate localization, and difficult bounding-box regression convergence.
Summary of the invention:
The invention aims to overcome the defects of the prior art: aimed at the drawbacks that processing and analyzing current surveillance video requires many training samples, needs a priori anchor boxes, and suffers from difficult sample annotation, inaccurate localization, and difficult bounding-box regression, a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network is designed.
To achieve this purpose, the swimming pool drowning detection method based on an end-to-end 3D convolutional neural network according to the invention comprises the following specific process steps:
S1, pixel-level labeling of training samples
A camera covering the swimming pool area is opened to obtain surveillance video, and pixel-level binary labeling is applied to the original surveillance video: foreground pixels are marked positive and background pixels negative; the labeled surveillance video is used as input to the subsequent training network;
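As a concrete illustration of this labeling format, the sketch below builds a binary mask for one 240 × 320 frame and stacks it into a clip-level label. It is a minimal NumPy example made up for illustration; the region coordinates are hypothetical and not taken from the patent.

    # Illustrative only: pixel-level binary labels for one 8-frame clip.
    # Foreground (behavior) pixels are 1 (positive), background pixels 0 (negative).
    import numpy as np

    H, W = 240, 320                                # frame size used elsewhere in the patent
    frame_mask = np.zeros((H, W), dtype=np.uint8)  # all background
    frame_mask[100:140, 150:200] = 1               # hypothetical swimmer region

    # A clip-level label is the stack of the 8 per-frame masks.
    clip_mask = np.stack([frame_mask] * 8, axis=0)  # shape: (8, 240, 320)
    print(clip_mask.shape, clip_mask.sum())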
S2, extracting the feature cube
Given a video dataset, each video V_i in the dataset is divided into n fixed-length video segments, i.e. V_i = {v_1, v_2, …, v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between consecutive frames. Each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max-pooling layers to obtain the spatio-temporal features f_i of the video segment v_i at the encoder end. To generate an original-size pixel-level segmentation map for each frame, 3D upsampling layers are used at the decoder end to increase the resolution of f_i; after each 3D upsampling layer, the corresponding encoder-end spatio-temporal features f_i are concatenated (cascaded) so as to capture spatial and temporal information of the input video segment at different scales. After 4 upsampling and 4 concatenation operations, the feature cube F_i of the input video clip v_i is obtained;
S3, localizing the behavior target
After the feature cube F_i of the input video segment v_i is obtained, a segmentation branch makes a pixel-level prediction of background or behavior foreground for each frame of v_i. The segmentation branch comprises 2 convolutional layers with 1 × 1 kernels; after the 2 convolution operations, a segmentation map is obtained for each frame of the input clip v_i, i.e. G_i = {g_1, g_2, …, g_8}, where each segmentation map is a binary map of behavior foreground versus background. From the foreground pixels of the binary map, a bounding box within the video frame is inferred, i.e. the behavior target in the video is localized;
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of F_i are cropped out as a feature tube and fed into the recognition branch. The recognition branch applies ToI (tube of interest) pooling to the input feature tube and then performs 3 fully connected operations, finally obtaining a predicted behavior label that is used to judge whether drowning behavior occurs in the swimming pool area;
S5, detecting abnormal behaviors in swimming pool video
Based on the method of steps S1-S4 and the extraction of the feature cubes F_i of video clips, various videos of different swimming pool areas are labeled with different behaviors, including drowning, standing and swimming, in the pixel-level labeling manner: the pixels corresponding to a behavior are marked positive and the remaining background is marked negative. Training with the end-to-end 3D convolutional neural network method then yields a swimming pool abnormal-behavior detection model. The real-time video stream of the swimming pool area is read, a segmentation map of each frame is obtained through the swimming pool abnormal-behavior detection model, the behavior positions of swimmers are localized and behavior labels predicted, and it is judged whether abnormal behaviors such as drowning occur in the swimming pool area; if abnormal behavior such as drowning occurs, an alarm is raised.
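A minimal sketch of this monitoring loop follows, assuming OpenCV for reading the stream and a trained PyTorch model net that maps a (1, 3, 8, 240, 320) clip to per-frame segmentation maps and behavior logits; the RTSP URL, the label ordering, and the model interface are hypothetical stand-ins, not specified by the patent.

    # Illustrative sketch of step S5: slide an 8-frame clip over a live stream
    # and raise an alarm when the predicted behavior label is "drowning".
    import collections

    import cv2
    import numpy as np
    import torch

    LABELS = {0: "swimming", 1: "standing", 2: "drowning"}  # assumed ordering

    def monitor(net, url="rtsp://pool-camera/stream"):      # hypothetical URL
        net.eval()
        frames = collections.deque(maxlen=8)                # sliding 8-frame clip
        cap = cv2.VideoCapture(url)
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.resize(frame, (320, 240)))    # OpenCV takes (W, H)
            if len(frames) < 8:
                continue
            clip = torch.from_numpy(np.stack(frames)).float()     # (8, H, W, 3)
            clip = clip.permute(3, 0, 1, 2).unsqueeze(0) / 255.0  # (1, 3, 8, H, W)
            with torch.no_grad():
                seg_maps, logits = net(clip)                # assumed model interface
            if LABELS[int(logits.argmax(dim=1))] == "drowning":
                print("ALARM: possible drowning detected")  # alarm hook goes here
        cap.release()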
The 3D upsampling layer in step S2 of the invention recovers the low-dimensional features obtained by max pooling at the encoder end, so as to obtain higher-resolution spatio-temporal features f_i. The 3D upsampling layer processes f_i by sub-pixel convolution, specifically: the low-resolution spatio-temporal features f_i are upsampled by factors p_d, p_h and p_w in depth, height and width respectively. Let p^{LR} denote the low-resolution feature map and p^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to

    p^{HR}(c, d, h, w) = \tilde{p}^{LR}(c', d', h', w')

where c ∈ {0, …, C_H − 1}, d ∈ {0, …, D_H − 1}, h ∈ {0, …, H_H − 1}, w ∈ {0, …, W_H − 1} index the feature channels, feature depth, feature height and feature width respectively; \tilde{p}^{LR} denotes the low-resolution feature map after channel expansion; and the indices c', d', h' and w' give the number of feature channels, feature depth, feature height and feature width after expansion, defined as

    d' = ⌊d / p_d⌋,  h' = ⌊h / p_h⌋,  w' = ⌊w / p_w⌋,
    c' = c · p_d·p_h·p_w + (d mod p_d) · p_h·p_w + (h mod p_h) · p_w + (w mod p_w).
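The mapping above is the 3D analogue of sub-pixel convolution ("pixel shuffle"). A small sketch follows, assuming PyTorch tensors in (N, C, D, H, W) layout; since torch.nn.PixelShuffle is 2-D only, the 3-D rearrangement is written out with view/permute. This illustrates the index mapping and is not presented as the patent's implementation.

    # Rearranges an (N, C_L, D, H, W) tensor into
    # (N, C_L/(p_d*p_h*p_w), D*p_d, H*p_h, W*p_w), matching the mapping above.
    import torch

    def pixel_shuffle_3d(x, p_d, p_h, p_w):
        n, c, d, h, w = x.shape
        c_out = c // (p_d * p_h * p_w)             # channel count after expansion
        # split the channel axis into (c_out, p_d, p_h, p_w) ...
        x = x.view(n, c_out, p_d, p_h, p_w, d, h, w)
        # ... then interleave each upsampling factor with its spatial axis
        x = x.permute(0, 1, 5, 2, 6, 3, 7, 4)      # n, c_out, d, p_d, h, p_h, w, p_w
        return x.reshape(n, c_out, d * p_d, h * p_h, w * p_w)

    x = torch.randn(1, 16, 4, 30, 40)              # low-resolution features
    y = pixel_shuffle_3d(x, 2, 2, 2)               # -> torch.Size([1, 2, 8, 60, 80])
    print(y.shape)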
step S4 of the invention adopts ToI pooling operation to the feature tubes with different sizes to obtain the feature vectors with fixed length; the ToI pooling operation process comprises the following steps: let I i Is the ith activation of the TOI pooling layer, O j Is the jth output, each input variable I i The partial derivative of the loss function L of (a) can be expressed as:
Figure BDA0002388624010000043
each pooled output O j There is a corresponding input position i and the function f (j) represents the maximum selection from the TOI.
Compared with the prior art, the designed swimming pool drowning detection method based on the end-to-end 3D convolutional neural network detects abnormal behaviors in video by a bottom-up, end-to-end pixel-level segmentation method, without a priori anchor boxes to search for candidate regions; it adopts pixel-level binary labeling and needs only a small number of samples, saving the time consumed by sample annotation; it provides pixel-level behavior localization, which is more accurate than bounding-box localization and solves the problem of difficult bounding-box regression convergence; the model has high detection performance, good generalization capability, and good market prospects.
Description of the drawings:
Fig. 1 is a schematic diagram of the 3D convolutional neural network structure according to the present invention.
Fig. 2 is a schematic diagram of the segmentation branch structure according to the present invention.
Fig. 3 is a schematic diagram of the recognition branch structure according to the present invention.
Fig. 4 is a schematic block diagram of the process flow of the swimming pool drowning detection method based on an end-to-end 3D convolutional neural network according to the present invention.
Specific embodiments:
The invention is further illustrated by the following example in conjunction with the accompanying drawings.
Example 1:
This embodiment relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network, comprising the following specific process steps:
S1, pixel-level labeling of training samples
Pixel-level binary labeling is applied to the original surveillance video: foreground pixels are marked positive and background pixels negative; the labeled surveillance video is used as input to the subsequent training network. Compared with bounding-box labeling, the labeling in this step is more accurate, fewer training samples need to be labeled, no a priori anchor boxes are needed, and selective search is replaced, which solves the problems of difficult sample annotation, inaccurate localization, and difficult bounding-box regression convergence;
S2, extracting the feature cube
Given a video dataset, each video V_i in the dataset is divided into n fixed-length video segments, i.e. V_i = {v_1, v_2, …, v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between consecutive frames. Each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max-pooling layers to obtain the spatio-temporal features f_i of the video segment v_i at the encoder end. Since 3D max pooling reduces the resolution of the spatio-temporal features f_i, 3D upsampling layers are used at the decoder end to increase the resolution of f_i in order to generate pixel-level segmentation maps of the original size for each frame; after each 3D upsampling layer, the corresponding encoder-end spatio-temporal features f_i are concatenated (cascaded) so as to capture spatial and temporal information of the input video segment at different scales. After 4 upsampling and 4 concatenation operations, the feature cube F_i of the input video clip v_i is obtained;
The specific structure of the 3D convolutional neural network described in this embodiment is formed by interleaving the input video clip, 12 convolutional layers, 4 max-pooling layers, and 4 upsampling layers; it finally produces the feature cube of the input video clip for the subsequent behavior-target localization and behavior recognition. Specifically:
conv1 to conv12 are convolutional layers; each layer has 3 × 3 × 3 kernels, and the number of filters increases from 64 up to 512 and then decreases back to 64. The input video clip passes through 4 max-pooling operations interleaved with 8 convolution operations to obtain the conv8-layer features. The conv8 features are then upsampled by the upsample1 layer; the upsample1 features are concatenated with the conv9-layer features and fed into the upsample2 layer; the upsample2 features are concatenated with the conv10-layer features and fed into the upsample3 layer; the upsample3 features are concatenated with the conv11-layer features and fed into the upsample4 layer; finally, the upsample4 features are concatenated with the conv1-layer features to form the finally extracted features used for target localization and classification. The role of the concatenation is to fuse features carrying different temporal and spatial information, yielding features more favorable to target localization and classification;
max-pool1 to max-pool4 are max-pooling layers; the kernel of max-pool1 is 1 × 2 × 2, the kernels of the other max-pooling layers are 2 × 2 × 2, and the number of channels increases from 64 to 512. The max-pooling layers reduce the resolution of the feature maps and the number of parameters, yielding feature maps with stronger semantic information;
upsample1 to upsample4 are upsampling layers; each layer has 3 × 3 × 3 kernels, the first 3 layers have 64 filters, and the last layer has 48 filters. Since max pooling reduces the resolution of the feature maps, upsampling is used to increase the resolution in order to generate an original-size pixel-level segmentation map for each frame of the clip; a sketch of one possible implementation follows below;
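The PyTorch sketch below gives one plausible reading of this encoder-decoder. The patent text leaves some wiring ambiguous (which encoder feature feeds each cascade, and the per-stage filter counts), so the U-Net-style skip connections, channel widths, and the use of trilinear interpolation in place of the sub-pixel upsampling are all assumptions made for illustration only.

    # One plausible reading of the conv1..conv12 / max-pool1..4 / upsample1..4 network.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv3d(cin, cout):
        return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

    class EncoderDecoder3D(nn.Module):
        def __init__(self):
            super().__init__()
            # encoder: 8 convolutions interleaved with 4 max-pools, 64 -> 512 filters
            self.conv1 = conv3d(3, 64)
            self.conv2 = conv3d(64, 128)
            self.conv3, self.conv4 = conv3d(128, 256), conv3d(256, 256)
            self.conv5, self.conv6 = conv3d(256, 512), conv3d(512, 512)
            self.conv7, self.conv8 = conv3d(512, 512), conv3d(512, 512)
            # decoder: 4 upsample+concatenate stages, filters decreasing back to 64
            self.conv9 = conv3d(512 + 512, 256)
            self.conv10 = conv3d(256 + 256, 128)
            self.conv11 = conv3d(128 + 128, 64)
            self.conv12 = conv3d(64 + 64, 64)

        def forward(self, x):                      # x: (N, 3, 8, H, W)
            up = lambda t, s: F.interpolate(t, scale_factor=s, mode="trilinear",
                                            align_corners=False)
            c1 = self.conv1(x)
            c2 = self.conv2(F.max_pool3d(c1, (1, 2, 2)))     # max-pool1: 1x2x2
            c4 = self.conv4(self.conv3(F.max_pool3d(c2, 2))) # max-pool2..4: 2x2x2
            c6 = self.conv6(self.conv5(F.max_pool3d(c4, 2)))
            c8 = self.conv8(self.conv7(F.max_pool3d(c6, 2)))
            d1 = self.conv9(torch.cat([up(c8, (2, 2, 2)), c6], dim=1))
            d2 = self.conv10(torch.cat([up(d1, (2, 2, 2)), c4], dim=1))
            d3 = self.conv11(torch.cat([up(d2, (2, 2, 2)), c2], dim=1))
            d4 = self.conv12(torch.cat([up(d3, (1, 2, 2)), c1], dim=1))
            return d4                              # feature cube F_i: (N, 64, 8, H, W)

    net = EncoderDecoder3D()
    print(net(torch.randn(1, 3, 8, 64, 64)).shape)  # torch.Size([1, 64, 8, 64, 64])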
s3, positioning the behavior target
In obtaining an input video segment v i Characteristic cube F i Then, to the video segment v i Each frame in the system uses a segmentation branch to predict the background or the pixel level of the behavior foreground; the segmentation branch comprises 2 convolution layers of 1 x 1, and after 2 convolution operations, an input video clip v is obtained i In each frame, i.e. G i ={g 1 ,g 2 ,…,g 8 -the segmentation map is a binary segmentation map of the behavior foreground or background; dividing foreground pixels in the image according to the two values of the behavior foreground to deduce a frame in the video frame, namely positioning a behavior target in the video;
the segmentation branch of the present embodiment specifically includes an input feature cube and 2 convolutional layers; convolution kernels of the convolution layers are all 1 × 1, and the number of the filters is 4096 and 2 respectively; the size of the feature map obtained finally is 2 × 8 × 240 × 320, where 8 represents 8 input frames of images, 240 × 320 represents the size of the input image, 2 represents the classification result of each pixel, and the classification result of each pixel belongs to the foreground or the background; the characteristic graph is a segmentation graph of the original graph and is used for positioning the target to obtain the position information of the target;
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of F_i are cropped out as a feature tube and fed into the recognition branch. The recognition branch applies ToI (tube of interest) pooling to the input feature tube and then performs 3 fully connected operations, finally obtaining a predicted behavior label that is used to judge whether drowning behavior occurs in the swimming pool area;
the branch identification method in this embodiment specifically includes: inputting a feature cube, a ToI pooling layer and 3 full-connection layers; the input feature cube is obtained by intercepting the feature cube obtained in the step S2 according to the position information obtained in the step S3, and as the sizes of the obtained targets are different, in order to process the feature vectors with fixed length, the ToI pooling layer operation is adopted for the input feature cube to obtain the feature vectors with fixed length, then 3 full-connection layers are connected to obtain a one-dimensional feature vector for behavior label identification, and a final behavior target classification result is obtained and is used for judging whether the swimming pool area has abnormal behaviors such as drowning and the like;
S5, detecting abnormal behaviors in swimming pool video
Based on the method of steps S1-S4 and the extraction of the feature cubes of video segments, various videos of different swimming pool areas are labeled with different behaviors, including drowning, standing and swimming, in the pixel-level labeling manner: the pixels corresponding to a behavior are marked positive and the remaining background is marked negative. Training with the end-to-end 3D convolutional neural network method then yields a swimming pool abnormal-behavior detection model. The real-time video stream of the swimming pool area is read, a segmentation map of each frame is obtained through the swimming pool abnormal-behavior detection model, the behavior positions of swimmers are localized and behavior labels predicted, and it is judged whether abnormal behaviors such as drowning occur in the swimming pool area; if abnormal behavior such as drowning occurs, an alarm is raised.
In this embodiment, the 3D upsampling layer in step S2 recovers the low-dimensional features obtained after max pooling at the encoder end, so as to obtain higher-resolution spatio-temporal features f_i. This step processes f_i by sub-pixel convolution: the low-resolution spatio-temporal features f_i are upsampled by factors p_d, p_h and p_w in depth, height and width respectively. Let p^{LR} denote the low-resolution feature map and p^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to

    p^{HR}(c, d, h, w) = \tilde{p}^{LR}(c', d', h', w')

where c ∈ {0, …, C_H − 1}, d ∈ {0, …, D_H − 1}, h ∈ {0, …, H_H − 1}, w ∈ {0, …, W_H − 1} index the feature channels, feature depth, feature height and feature width respectively; \tilde{p}^{LR} denotes the low-resolution feature map after channel expansion; and the indices c', d', h' and w' give the number of feature channels, feature depth, feature height and feature width after expansion, defined as

    d' = ⌊d / p_d⌋,  h' = ⌊h / p_h⌋,  w' = ⌊w / p_w⌋,
    c' = c · p_d·p_h·p_w + (d mod p_d) · p_h·p_w + (h mod p_h) · p_w + (w mod p_w).
Since the video is processed in segments, feature cubes F_i of various spatio-temporal sizes are generated for different segments. To obtain fixed-length feature vectors, step S4 of this embodiment applies a ToI pooling operation to the feature tubes of different sizes. Because the size, aspect ratio, and position of the bounding box (i.e. the localized behavior target) may differ, pooling is realized separately in the spatial and temporal domains. Let I_i be the i-th activation input to the ToI pooling layer and O_j the j-th output; the partial derivative of the loss function L with respect to each input variable I_i can be expressed as

    ∂L/∂I_i = Σ_j [i = f(j)] · ∂L/∂O_j

where each pooled output O_j has a corresponding input position i, and the function f(j) returns the index of the maximum selected from the ToI.
The swimming pool drowning detection method based on the end-to-end 3D convolutional neural network detects abnormal behaviors in video by a bottom-up, end-to-end pixel-level segmentation method, without a priori anchor boxes to search for candidate regions; it adopts pixel-level binary labeling and needs only a small number of samples, saving the time consumed by sample annotation; it provides pixel-level behavior localization, which is more accurate than bounding-box localization and solves the problem of difficult bounding-box regression convergence; the model has high detection performance and good generalization capability.

Claims (6)

1. A swimming pool drowning detection method based on an end-to-end 3D convolutional neural network, characterized in that it comprises the following specific process steps:
S1, pixel-level labeling of training samples
A camera covering the swimming pool area is opened to obtain surveillance video, and pixel-level binary labeling is applied to the original surveillance video: foreground pixels are marked positive and background pixels negative; the labeled surveillance video is used as input to the subsequent training network;
S2, extracting the feature cube
Given a video dataset, each video V_i in the dataset is divided into n fixed-length video segments, i.e. V_i = {v_1, v_2, …, v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between consecutive frames. Each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max-pooling layers to obtain the spatio-temporal features f_i of the video segment v_i at the encoder end. To generate an original-size pixel-level segmentation map for each frame, 3D upsampling layers are used at the decoder end to increase the resolution of f_i; after each 3D upsampling layer, the corresponding encoder-end spatio-temporal features f_i are concatenated (cascaded) so as to capture spatial and temporal information of the input video segment at different scales. After 4 upsampling and 4 concatenation operations, the feature cube F_i of the input video clip v_i is obtained;
S3, localizing the behavior target
After the feature cube F_i of the input video segment v_i is obtained, a segmentation branch makes a pixel-level prediction of background or behavior foreground for each frame of v_i. The segmentation branch comprises 2 convolutional layers with 1 × 1 kernels; after the 2 convolution operations, a segmentation map is obtained for each frame of the input clip v_i, i.e. G_i = {g_1, g_2, …, g_8}, where each segmentation map is a binary map of behavior foreground versus background. From the foreground pixels of the binary map, a bounding box within the video frame is inferred, i.e. the behavior target in the video is localized;
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of F_i are cropped out as a feature tube and fed into the recognition branch. The recognition branch applies ToI (tube of interest) pooling to the input feature tube and then performs 3 fully connected operations, finally obtaining a predicted behavior label that is used to judge whether drowning behavior occurs in the swimming pool area;
S5, detecting abnormal behaviors in swimming pool video
Based on the method of steps S1-S4 and the extraction of the feature cubes F_i of video clips, various videos of different swimming pool areas are labeled with different behaviors, the behaviors comprising drowning, standing and swimming, in the pixel-level labeling manner: the pixels corresponding to a behavior are marked positive and the remaining background is marked negative; training with the end-to-end 3D convolutional neural network method yields a swimming pool abnormal-behavior detection model; the real-time video stream of the swimming pool area is read, a segmentation map of each frame is obtained through the swimming pool abnormal-behavior detection model, the behavior positions of swimmers are localized and behavior labels predicted, and it is judged whether abnormal drowning behavior occurs in the swimming pool area; if abnormal behavior such as drowning occurs, an alarm is raised.
2. The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of claim 1, characterized in that: the 3D upsampling layer in step S2 recovers the low-dimensional features obtained after max pooling at the encoder end, so as to obtain higher-resolution spatio-temporal features f_i; the 3D upsampling layer processes f_i by sub-pixel convolution, specifically: the low-resolution spatio-temporal features f_i are upsampled by factors p_d, p_h and p_w in depth, height and width respectively; let p^{LR} denote the low-resolution feature map and p^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to

    p^{HR}(c, d, h, w) = \tilde{p}^{LR}(c', d', h', w')

where c ∈ {0, …, C_H − 1}, d ∈ {0, …, D_H − 1}, h ∈ {0, …, H_H − 1}, w ∈ {0, …, W_H − 1} index the feature channels, feature depth, feature height and feature width respectively; \tilde{p}^{LR} denotes the low-resolution feature map after channel expansion; and the indices c', d', h' and w' give the number of feature channels, feature depth, feature height and feature width after expansion, defined as

    d' = ⌊d / p_d⌋,  h' = ⌊h / p_h⌋,  w' = ⌊w / p_w⌋,
    c' = c · p_d·p_h·p_w + (d mod p_d) · p_h·p_w + (h mod p_h) · p_w + (w mod p_w).
3. The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of claim 2, characterized in that: step S4 applies a ToI pooling operation to feature tubes of different sizes to obtain fixed-length feature vectors; the ToI pooling process comprises: let I_i be the i-th activation input to the ToI pooling layer and O_j the j-th output; the partial derivative of the loss function L with respect to each input variable I_i can be expressed as

    ∂L/∂I_i = Σ_j [i = f(j)] · ∂L/∂O_j

where each pooled output O_j has a corresponding input position i, and the function f(j) returns the index of the maximum selected from the ToI.
4. The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of claim 3, characterized in that: the specific structure of the 3D convolutional neural network is formed by interleaving the input video clip, 12 convolutional layers, 4 max-pooling layers and 4 upsampling layers, finally producing the feature cube of the input video clip for the subsequent behavior-target localization and behavior recognition; wherein:
conv1 to conv12 are convolutional layers; each layer has 3 × 3 × 3 kernels, and the number of filters increases from 64 up to 512 and then decreases back to 64; the input video clip passes through 4 max-pooling operations interleaved with 8 convolution operations to obtain the conv8-layer features; the conv8 features are then upsampled by the upsample1 layer; the upsample1 features are concatenated with the conv9-layer features and fed into the upsample2 layer; the upsample2 features are concatenated with the conv10-layer features and fed into the upsample3 layer; the upsample3 features are concatenated with the conv11-layer features and fed into the upsample4 layer; finally, the upsample4 features are concatenated with the conv1-layer features to form the finally extracted features used for target localization and classification; the role of the concatenation is to fuse features carrying different temporal and spatial information, yielding features more favorable to target localization and classification;
max-pool1 to max-pool4 are max-pooling layers; the kernel of max-pool1 is 1 × 2 × 2, the kernels of the other max-pooling layers are 2 × 2 × 2, and the number of channels increases from 64 to 512; the max-pooling layers reduce the resolution of the feature maps and the number of parameters, yielding feature maps with stronger semantic information;
upsample1 to upsample4 are upsampling layers; each layer has 3 × 3 × 3 kernels, the first 3 layers have 64 filters, and the last layer has 48 filters; since max pooling reduces the resolution of the feature maps, upsampling is used to increase the resolution in order to generate an original-size pixel-level segmentation map for each frame of image.
5. The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of claim 4, characterized in that: the segmentation branch specifically comprises the input feature cube and 2 convolutional layers; the kernels of the convolutional layers are all 1 × 1, and the numbers of filters are 4096 and 2 respectively; the finally obtained feature map has size 2 × 8 × 240 × 320, where 8 is the number of input frames, 240 × 320 is the size of the input images, and 2 is the per-pixel classification result, namely whether each pixel belongs to the foreground or the background; the feature map is the segmentation map of the original frames and is used to localize the target and obtain its position information.
6. The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of claim 5, characterized in that: the recognition branch specifically comprises the input feature tube, a ToI pooling layer and 3 fully connected layers; the input feature tube is cropped from the feature cube obtained in step S2 according to the position information obtained in step S3; since the localized targets differ in size, a ToI pooling operation is applied to the input feature tube to obtain fixed-length feature vectors, and 3 fully connected layers then produce a one-dimensional feature vector for behavior-label recognition, yielding the final behavior-target classification result, which is used to judge whether abnormal behaviors such as drowning occur in the swimming pool area.
CN202010106457.6A 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network Active CN111339892B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010106457.6A CN111339892B (en) 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Publications (2)

Publication Number Publication Date
CN111339892A (en) 2020-06-26
CN111339892B (en) 2023-04-18

Family

ID=71181844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010106457.6A Active CN111339892B (en) 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Country Status (1)

Country Link
CN (1) CN111339892B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613359B (en) * 2020-12-09 2024-02-02 苏州玖合智能科技有限公司 Construction method of neural network for detecting abnormal behaviors of personnel
CN113205008B (en) * 2021-04-16 2023-11-17 深圳供电局有限公司 Alarm control method for dynamic alarm window
CN114022910B (en) * 2022-01-10 2022-04-12 杭州巨岩欣成科技有限公司 Swimming pool drowning prevention supervision method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
CN108009629A (en) * 2017-11-20 2018-05-08 天津大学 A kind of station symbol dividing method based on full convolution station symbol segmentation network
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN109829443A (en) * 2019-02-23 2019-05-31 重庆邮电大学 Video behavior recognition methods based on image enhancement Yu 3D convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106097353B (en) * 2016-06-15 2018-06-22 北京市商汤科技开发有限公司 Method for segmenting objects and device, computing device based on the fusion of multi-level regional area

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
CN108009629A (en) * 2017-11-20 2018-05-08 天津大学 A kind of station symbol dividing method based on full convolution station symbol segmentation network
CN109829443A (en) * 2019-02-23 2019-05-31 重庆邮电大学 Video behavior recognition methods based on image enhancement Yu 3D convolutional neural networks

Also Published As

Publication number Publication date
CN111339892A (en) 2020-06-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant