CN111339892B - Swimming pool drowning detection method based on end-to-end 3D convolutional neural network


Info

Publication number
CN111339892B
CN111339892B (application CN202010106457.6A)
Authority
CN
China
Prior art keywords: feature, layer, behavior, video, input
Legal status: Active
Application number: CN202010106457.6A
Other languages: Chinese (zh)
Other versions: CN111339892A (en)
Inventor
纪刚
商胜楠
周萌萌
周亚敏
周粉粉
Current Assignee: Qingdao Lianhe Chuangzhi Technology Co ltd
Original Assignee: Qingdao Lianhe Chuangzhi Technology Co ltd
Application filed by Qingdao Lianhe Chuangzhi Technology Co ltd
Priority to CN202010106457.6A
Publication of CN111339892A
Application granted
Publication of CN111339892B

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06V 2201/07: Target detection
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Combinations of networks


Abstract

The invention belongs to the technical field of video monitoring and relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network. The method comprises: S1, applying pixel-level binary labels to the original surveillance video; S2, feeding each video clip into the 3D convolutional neural network at the encoder end to obtain the feature cube F_i of the input video clip v_i; S3, using a segmentation branch to make a pixel-level background/behavior-foreground prediction for each frame of v_i; S4, cropping the features at the corresponding positions of F_i, feeding them into a recognition branch, and applying ToI pooling to obtain a predicted behavior label; S5, reading the real-time video stream of the swimming pool area, localizing the behavior positions of swimmers, predicting behavior labels, and judging whether abnormal behaviors such as drowning occur. By adopting pixel-level binary labeling, the method saves the time consumed by sample annotation; it provides pixel-level behavior localization, which is more accurate than bounding-box localization and avoids the problem of difficult bounding-box regression convergence.

Description

Swimming pool drowning detection method based on end-to-end 3D convolutional neural network
Technical field:
The invention belongs to the technical field of video monitoring, relates to a method for detecting drowning behavior in surveillance video, and particularly relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network.
Background art:
A convolutional neural network is a feedforward neural network with a deep structure whose computation includes convolution, and it is one of the representative algorithms of deep learning. A convolutional neural network has feature-learning ability and can perform shift-invariant classification of its input according to its hierarchical structure; it comprises an input layer, hidden layers, and an output layer.
The input layer of a convolutional neural network can process multidimensional data. The input layer of a one-dimensional convolutional neural network receives a one- or two-dimensional array, where a one-dimensional array is usually a time-series or spectrum sample and a two-dimensional array may contain multiple channels; the input layer of a two-dimensional convolutional neural network receives a two- or three-dimensional array; the input layer of a three-dimensional convolutional neural network receives a four-dimensional array. The hidden layers of a convolutional neural network commonly contain 3 kinds of structures, namely convolutional layers, pooling layers, and fully connected layers; some more modern architectures include more complicated structures such as Inception modules and residual blocks. A convolutional layer extracts features from the input data; it contains multiple convolution kernels, and each element of a kernel corresponds to a weight coefficient and a bias. After a convolutional layer extracts features, the output feature map is passed to a pooling layer for feature selection and information filtering. A pooling layer contains a preset pooling function, which replaces the value at a single point of the feature map with a statistic of its neighborhood. A fully connected layer in a convolutional neural network is equivalent to a hidden layer in a traditional feedforward neural network; it is located in the last part of the hidden layers and only passes signals to other fully connected layers. In a fully connected layer the feature map loses its spatial topology, being flattened into a vector and passed through an activation function. Since fully connected layers usually sit immediately upstream of the output layer, their structure and working principle are the same as those of the output layer in a traditional feedforward neural network. For image classification problems, the output layer produces classification labels using a logistic function or a normalized exponential (softmax) function.
In the prior art, Chinese patent publication CN108304806A discloses a gesture recognition method based on log path-integral features and a convolutional neural network, which comprises the steps of: labeling video data and training a hand detector based on Fast-RCNN; detecting video samples frame by frame with the hand detector to obtain the hand position in each frame; constructing two-, three-, and four-dimensional hand trajectories from the per-frame hand positions combined with time and depth; applying data augmentation to the hand trajectories; extracting the corresponding log path-integral features from the augmented trajectory samples; arranging the log path-integral features according to spatial position information to construct a corresponding feature cube; and taking the feature cube as the input of a convolutional neural network that outputs the recognition result. Chinese patent publication CN108830157A discloses a human behavior recognition method based on an attention mechanism and a 3D convolutional neural network, in which the input layer of the 3D convolutional neural network comprises two channels, namely the original grayscale images and an attention matrix. That method constructs a 3D CNN model for recognizing human behaviors in video and introduces an attention mechanism, computing the distance between two frames as the attention matrix; the attention matrix and the original human behavior video sequence form a two-channel input to the constructed 3D CNN, and convolution operations extract the key features of visually salient regions. Meanwhile, the 3D CNN structure is optimized by adding a Dropout layer to randomly freeze part of the network connection weights and using the ReLU activation function to improve network sparsity. At present, processing and analyzing surveillance video with a convolutional neural network requires a large number of training samples, needs a priori anchor boxes, and relies on selective search, which leads to the problems of difficult sample annotation, inaccurate localization, and difficult bounding-box regression convergence.
Summary of the invention:
The invention aims to overcome the defects of the prior art: aimed at the drawbacks that processing and analyzing current surveillance video requires many training samples, needs a priori anchor boxes, and suffers from difficult sample annotation, inaccurate localization, and difficult bounding-box regression, a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network is designed.
To achieve this purpose, the swimming pool drowning detection method based on an end-to-end 3D convolutional neural network according to the invention comprises the following specific process steps:
S1, pixel-level labeling of training samples
A camera covering the swimming pool area is opened to obtain surveillance video, and pixel-level binary labeling is applied to the original surveillance video: foreground pixels are marked positive and background pixels negative; the labeled surveillance video is used as input to the subsequent training network;
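As a concrete illustration of this labeling format, the sketch below builds a binary mask for one 240 × 320 frame and stacks it into a clip-level label. It is a minimal NumPy example made up for illustration; the region coordinates are hypothetical and not taken from the patent.

    # Illustrative only: pixel-level binary labels for one 8-frame clip.
    # Foreground (behavior) pixels are 1 (positive), background pixels 0 (negative).
    import numpy as np

    H, W = 240, 320                                # frame size used elsewhere in the patent
    frame_mask = np.zeros((H, W), dtype=np.uint8)  # all background
    frame_mask[100:140, 150:200] = 1               # hypothetical swimmer region

    # A clip-level label is the stack of the 8 per-frame masks.
    clip_mask = np.stack([frame_mask] * 8, axis=0)  # shape: (8, 240, 320)
    print(clip_mask.shape, clip_mask.sum())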
S2, extracting the feature cube
Given a video dataset, each video V_i in the dataset is divided into n fixed-length video segments, i.e. V_i = {v_1, v_2, …, v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between consecutive frames. Each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max-pooling layers to obtain the spatio-temporal features f_i of the video segment v_i at the encoder end. To generate an original-size pixel-level segmentation map for each frame, 3D upsampling layers are used at the decoder end to increase the resolution of f_i; after each 3D upsampling layer, the corresponding encoder-end spatio-temporal features f_i are concatenated (cascaded) so as to capture spatial and temporal information of the input video segment at different scales. After 4 upsampling and 4 concatenation operations, the feature cube F_i of the input video clip v_i is obtained;
S3, localizing the behavior target
After the feature cube F_i of the input video segment v_i is obtained, a segmentation branch makes a pixel-level prediction of background or behavior foreground for each frame of v_i. The segmentation branch comprises 2 convolutional layers with 1 × 1 kernels; after the 2 convolution operations, a segmentation map is obtained for each frame of the input clip v_i, i.e. G_i = {g_1, g_2, …, g_8}, where each segmentation map is a binary map of behavior foreground versus background. From the foreground pixels of the binary map, a bounding box within the video frame is inferred, i.e. the behavior target in the video is localized;
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of F_i are cropped out as a feature tube and fed into the recognition branch. The recognition branch applies ToI (tube of interest) pooling to the input feature tube and then performs 3 fully connected operations, finally obtaining a predicted behavior label that is used to judge whether drowning behavior occurs in the swimming pool area;
S5, detecting abnormal behaviors in swimming pool video
Based on the method of steps S1-S4 and the extraction of the feature cubes F_i of video clips, various videos of different swimming pool areas are labeled with different behaviors, including drowning, standing and swimming, in the pixel-level labeling manner: the pixels corresponding to a behavior are marked positive and the remaining background is marked negative. Training with the end-to-end 3D convolutional neural network method then yields a swimming pool abnormal-behavior detection model. The real-time video stream of the swimming pool area is read, a segmentation map of each frame is obtained through the swimming pool abnormal-behavior detection model, the behavior positions of swimmers are localized and behavior labels predicted, and it is judged whether abnormal behaviors such as drowning occur in the swimming pool area; if abnormal behavior such as drowning occurs, an alarm is raised.
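A minimal sketch of this monitoring loop follows, assuming OpenCV for reading the stream and a trained PyTorch model net that maps a (1, 3, 8, 240, 320) clip to per-frame segmentation maps and behavior logits; the RTSP URL, the label ordering, and the model interface are hypothetical stand-ins, not specified by the patent.

    # Illustrative sketch of step S5: slide an 8-frame clip over a live stream
    # and raise an alarm when the predicted behavior label is "drowning".
    import collections

    import cv2
    import numpy as np
    import torch

    LABELS = {0: "swimming", 1: "standing", 2: "drowning"}  # assumed ordering

    def monitor(net, url="rtsp://pool-camera/stream"):      # hypothetical URL
        net.eval()
        frames = collections.deque(maxlen=8)                # sliding 8-frame clip
        cap = cv2.VideoCapture(url)
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.resize(frame, (320, 240)))    # OpenCV takes (W, H)
            if len(frames) < 8:
                continue
            clip = torch.from_numpy(np.stack(frames)).float()     # (8, H, W, 3)
            clip = clip.permute(3, 0, 1, 2).unsqueeze(0) / 255.0  # (1, 3, 8, H, W)
            with torch.no_grad():
                seg_maps, logits = net(clip)                # assumed model interface
            if LABELS[int(logits.argmax(dim=1))] == "drowning":
                print("ALARM: possible drowning detected")  # alarm hook goes here
        cap.release()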
The 3D upsampling layer in step S2 of the invention recovers the low-dimensional features obtained by max pooling at the encoder end, so as to obtain higher-resolution spatio-temporal features f_i. The 3D upsampling layer processes f_i by sub-pixel convolution, specifically: the low-resolution spatio-temporal features f_i are upsampled by factors p_d, p_h and p_w in depth, height and width respectively. Let p^{LR} denote the low-resolution feature map and p^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to

    p^{HR}(c, d, h, w) = \tilde{p}^{LR}(c', d', h', w')

where c ∈ {0, …, C_H − 1}, d ∈ {0, …, D_H − 1}, h ∈ {0, …, H_H − 1}, w ∈ {0, …, W_H − 1} index the feature channels, feature depth, feature height and feature width respectively; \tilde{p}^{LR} denotes the low-resolution feature map after channel expansion; and the indices c', d', h' and w' give the number of feature channels, feature depth, feature height and feature width after expansion, defined as

    d' = ⌊d / p_d⌋,  h' = ⌊h / p_h⌋,  w' = ⌊w / p_w⌋,
    c' = c · p_d·p_h·p_w + (d mod p_d) · p_h·p_w + (h mod p_h) · p_w + (w mod p_w).
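The mapping above is the 3D analogue of sub-pixel convolution ("pixel shuffle"). A small sketch follows, assuming PyTorch tensors in (N, C, D, H, W) layout; since torch.nn.PixelShuffle is 2-D only, the 3-D rearrangement is written out with view/permute. This illustrates the index mapping and is not presented as the patent's implementation.

    # Rearranges an (N, C_L, D, H, W) tensor into
    # (N, C_L/(p_d*p_h*p_w), D*p_d, H*p_h, W*p_w), matching the mapping above.
    import torch

    def pixel_shuffle_3d(x, p_d, p_h, p_w):
        n, c, d, h, w = x.shape
        c_out = c // (p_d * p_h * p_w)             # channel count after expansion
        # split the channel axis into (c_out, p_d, p_h, p_w) ...
        x = x.view(n, c_out, p_d, p_h, p_w, d, h, w)
        # ... then interleave each upsampling factor with its spatial axis
        x = x.permute(0, 1, 5, 2, 6, 3, 7, 4)      # n, c_out, d, p_d, h, p_h, w, p_w
        return x.reshape(n, c_out, d * p_d, h * p_h, w * p_w)

    x = torch.randn(1, 16, 4, 30, 40)              # low-resolution features
    y = pixel_shuffle_3d(x, 2, 2, 2)               # -> torch.Size([1, 2, 8, 60, 80])
    print(y.shape)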
step S4 of the invention adopts ToI pooling operation to the feature tubes with different sizes to obtain the feature vectors with fixed length; the ToI pooling operation process comprises the following steps: let I i Is the ith activation of the TOI pooling layer, O j Is the jth output, each input variable I i The partial derivative of the loss function L of (a) can be expressed as:
Figure BDA0002388624010000043
each pooled output O j There is a corresponding input position i and the function f (j) represents the maximum selection from the TOI.
Compared with the prior art, the designed swimming pool drowning detection method based on the end-to-end 3D convolutional neural network detects abnormal behaviors in video by a bottom-up, end-to-end pixel-level segmentation method, without a priori anchor boxes to search for candidate regions; it adopts pixel-level binary labeling and needs only a small number of samples, saving the time consumed by sample annotation; it provides pixel-level behavior localization, which is more accurate than bounding-box localization and solves the problem of difficult bounding-box regression convergence; the model has high detection performance, good generalization capability, and good market prospects.
Description of the drawings:
Fig. 1 is a schematic diagram of the 3D convolutional neural network structure according to the present invention.
Fig. 2 is a schematic diagram of the segmentation branch structure according to the present invention.
Fig. 3 is a schematic diagram of the recognition branch structure according to the present invention.
Fig. 4 is a schematic block diagram of the process flow of the swimming pool drowning detection method based on an end-to-end 3D convolutional neural network according to the present invention.
Specific embodiments:
The invention is further illustrated by the following example in conjunction with the accompanying drawings.
Example 1:
This embodiment relates to a swimming pool drowning detection method based on an end-to-end 3D convolutional neural network, comprising the following specific process steps:
S1, pixel-level labeling of training samples
Pixel-level binary labeling is applied to the original surveillance video: foreground pixels are marked positive and background pixels negative; the labeled surveillance video is used as input to the subsequent training network. Compared with bounding-box labeling, the labeling in this step is more accurate, fewer training samples need to be labeled, no a priori anchor boxes are needed, and selective search is replaced, which solves the problems of difficult sample annotation, inaccurate localization, and difficult bounding-box regression convergence;
S2, extracting the feature cube
Given a video dataset, each video V_i in the dataset is divided into n fixed-length video segments, i.e. V_i = {v_1, v_2, …, v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between consecutive frames. Each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max-pooling layers to obtain the spatio-temporal features f_i of the video segment v_i at the encoder end. Since 3D max pooling reduces the resolution of the spatio-temporal features f_i, 3D upsampling layers are used at the decoder end to increase the resolution of f_i in order to generate pixel-level segmentation maps of the original size for each frame; after each 3D upsampling layer, the corresponding encoder-end spatio-temporal features f_i are concatenated (cascaded) so as to capture spatial and temporal information of the input video segment at different scales. After 4 upsampling and 4 concatenation operations, the feature cube F_i of the input video clip v_i is obtained;
The specific structure of the 3D convolutional neural network described in this embodiment is formed by interleaving the input video clip, 12 convolutional layers, 4 max-pooling layers, and 4 upsampling layers; it finally produces the feature cube of the input video clip for the subsequent behavior-target localization and behavior recognition. Specifically:
conv1 to conv12 are convolutional layers; each layer has 3 × 3 × 3 kernels, and the number of filters increases from 64 up to 512 and then decreases back to 64. The input video clip passes through 4 max-pooling operations interleaved with 8 convolution operations to obtain the conv8-layer features. The conv8 features are then upsampled by the upsample1 layer; the upsample1 features are concatenated with the conv9-layer features and fed into the upsample2 layer; the upsample2 features are concatenated with the conv10-layer features and fed into the upsample3 layer; the upsample3 features are concatenated with the conv11-layer features and fed into the upsample4 layer; finally, the upsample4 features are concatenated with the conv1-layer features to form the finally extracted features used for target localization and classification. The role of the concatenation is to fuse features carrying different temporal and spatial information, yielding features more favorable to target localization and classification;
max-pool1 to max-pool4 are max-pooling layers; the kernel of max-pool1 is 1 × 2 × 2, the kernels of the other max-pooling layers are 2 × 2 × 2, and the number of channels increases from 64 to 512. The max-pooling layers reduce the resolution of the feature maps and the number of parameters, yielding feature maps with stronger semantic information;
upsample1 to upsample4 are upsampling layers; each layer has 3 × 3 × 3 kernels, the first 3 layers have 64 filters, and the last layer has 48 filters. Since max pooling reduces the resolution of the feature maps, upsampling is used to increase the resolution in order to generate an original-size pixel-level segmentation map for each frame of the clip; a sketch of one possible implementation follows below;
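The PyTorch sketch below gives one plausible reading of this encoder-decoder. The patent text leaves some wiring ambiguous (which encoder feature feeds each cascade, and the per-stage filter counts), so the U-Net-style skip connections, channel widths, and the use of trilinear interpolation in place of the sub-pixel upsampling are all assumptions made for illustration only.

    # One plausible reading of the conv1..conv12 / max-pool1..4 / upsample1..4 network.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv3d(cin, cout):
        return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

    class EncoderDecoder3D(nn.Module):
        def __init__(self):
            super().__init__()
            # encoder: 8 convolutions interleaved with 4 max-pools, 64 -> 512 filters
            self.conv1 = conv3d(3, 64)
            self.conv2 = conv3d(64, 128)
            self.conv3, self.conv4 = conv3d(128, 256), conv3d(256, 256)
            self.conv5, self.conv6 = conv3d(256, 512), conv3d(512, 512)
            self.conv7, self.conv8 = conv3d(512, 512), conv3d(512, 512)
            # decoder: 4 upsample+concatenate stages, filters decreasing back to 64
            self.conv9 = conv3d(512 + 512, 256)
            self.conv10 = conv3d(256 + 256, 128)
            self.conv11 = conv3d(128 + 128, 64)
            self.conv12 = conv3d(64 + 64, 64)

        def forward(self, x):                      # x: (N, 3, 8, H, W)
            up = lambda t, s: F.interpolate(t, scale_factor=s, mode="trilinear",
                                            align_corners=False)
            c1 = self.conv1(x)
            c2 = self.conv2(F.max_pool3d(c1, (1, 2, 2)))     # max-pool1: 1x2x2
            c4 = self.conv4(self.conv3(F.max_pool3d(c2, 2))) # max-pool2..4: 2x2x2
            c6 = self.conv6(self.conv5(F.max_pool3d(c4, 2)))
            c8 = self.conv8(self.conv7(F.max_pool3d(c6, 2)))
            d1 = self.conv9(torch.cat([up(c8, (2, 2, 2)), c6], dim=1))
            d2 = self.conv10(torch.cat([up(d1, (2, 2, 2)), c4], dim=1))
            d3 = self.conv11(torch.cat([up(d2, (2, 2, 2)), c2], dim=1))
            d4 = self.conv12(torch.cat([up(d3, (1, 2, 2)), c1], dim=1))
            return d4                              # feature cube F_i: (N, 64, 8, H, W)

    net = EncoderDecoder3D()
    print(net(torch.randn(1, 3, 8, 64, 64)).shape)  # torch.Size([1, 64, 8, 64, 64])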
s3, positioning the behavior target
In obtaining an input video segment v i Characteristic cube F i Then, to the video segment v i Each frame in the system uses a segmentation branch to predict the background or the pixel level of the behavior foreground; the segmentation branch comprises 2 convolution layers of 1 x 1, and after 2 convolution operations, an input video clip v is obtained i In each frame, i.e. G i ={g 1 ,g 2 ,…,g 8 -the segmentation map is a binary segmentation map of the behavior foreground or background; dividing foreground pixels in the image according to the two values of the behavior foreground to deduce a frame in the video frame, namely positioning a behavior target in the video;
the segmentation branch of the present embodiment specifically includes an input feature cube and 2 convolutional layers; convolution kernels of the convolution layers are all 1 × 1, and the number of the filters is 4096 and 2 respectively; the size of the feature map obtained finally is 2 × 8 × 240 × 320, where 8 represents 8 input frames of images, 240 × 320 represents the size of the input image, 2 represents the classification result of each pixel, and the classification result of each pixel belongs to the foreground or the background; the characteristic graph is a segmentation graph of the original graph and is used for positioning the target to obtain the position information of the target;
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of F_i are cropped out as a feature tube and fed into the recognition branch. The recognition branch applies ToI (tube of interest) pooling to the input feature tube and then performs 3 fully connected operations, finally obtaining a predicted behavior label that is used to judge whether drowning behavior occurs in the swimming pool area;
the branch identification method in this embodiment specifically includes: inputting a feature cube, a ToI pooling layer and 3 full-connection layers; the input feature cube is obtained by intercepting the feature cube obtained in the step S2 according to the position information obtained in the step S3, and as the sizes of the obtained targets are different, in order to process the feature vectors with fixed length, the ToI pooling layer operation is adopted for the input feature cube to obtain the feature vectors with fixed length, then 3 full-connection layers are connected to obtain a one-dimensional feature vector for behavior label identification, and a final behavior target classification result is obtained and is used for judging whether the swimming pool area has abnormal behaviors such as drowning and the like;
S5, detecting abnormal behaviors in swimming pool video
Based on the method of steps S1-S4 and the extraction of the feature cubes of video segments, various videos of different swimming pool areas are labeled with different behaviors, including drowning, standing and swimming, in the pixel-level labeling manner: the pixels corresponding to a behavior are marked positive and the remaining background is marked negative. Training with the end-to-end 3D convolutional neural network method then yields a swimming pool abnormal-behavior detection model. The real-time video stream of the swimming pool area is read, a segmentation map of each frame is obtained through the swimming pool abnormal-behavior detection model, the behavior positions of swimmers are localized and behavior labels predicted, and it is judged whether abnormal behaviors such as drowning occur in the swimming pool area; if abnormal behavior such as drowning occurs, an alarm is raised.
In this embodiment, the 3D upsampling layer in step S2 recovers the low-dimensional features obtained after max pooling at the encoder end, so as to obtain higher-resolution spatio-temporal features f_i. This step processes f_i by sub-pixel convolution: the low-resolution spatio-temporal features f_i are upsampled by factors p_d, p_h and p_w in depth, height and width respectively. Let p^{LR} denote the low-resolution feature map and p^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to

    p^{HR}(c, d, h, w) = \tilde{p}^{LR}(c', d', h', w')

where c ∈ {0, …, C_H − 1}, d ∈ {0, …, D_H − 1}, h ∈ {0, …, H_H − 1}, w ∈ {0, …, W_H − 1} index the feature channels, feature depth, feature height and feature width respectively; \tilde{p}^{LR} denotes the low-resolution feature map after channel expansion; and the indices c', d', h' and w' give the number of feature channels, feature depth, feature height and feature width after expansion, defined as

    d' = ⌊d / p_d⌋,  h' = ⌊h / p_h⌋,  w' = ⌊w / p_w⌋,
    c' = c · p_d·p_h·p_w + (d mod p_d) · p_h·p_w + (h mod p_h) · p_w + (w mod p_w).
Since the video is processed in segments, feature cubes F_i of various spatio-temporal sizes are generated for different segments. To obtain fixed-length feature vectors, step S4 of this embodiment applies a ToI pooling operation to the feature tubes of different sizes. Because the size, aspect ratio, and position of the bounding box (i.e. the localized behavior target) may differ, pooling is realized separately in the spatial and temporal domains. Let I_i be the i-th activation input to the ToI pooling layer and O_j the j-th output; the partial derivative of the loss function L with respect to each input variable I_i can be expressed as

    ∂L/∂I_i = Σ_j [i = f(j)] · ∂L/∂O_j

where each pooled output O_j has a corresponding input position i, and the function f(j) returns the index of the maximum selected from the ToI.
The swimming pool drowning detection method based on the end-to-end 3D convolutional neural network detects abnormal behaviors in video by a bottom-up, end-to-end pixel-level segmentation method, without a priori anchor boxes to search for candidate regions; it adopts pixel-level binary labeling and needs only a small number of samples, saving the time consumed by sample annotation; it provides pixel-level behavior localization, which is more accurate than bounding-box localization and solves the problem of difficult bounding-box regression convergence; the model has high detection performance and good generalization capability.

Claims (6)

1. A swimming pool drowning detection method based on an end-to-end 3D convolutional neural network, characterized in that it comprises the following specific process steps:
S1, pixel-level labeling of training samples
A camera covering the swimming pool area is opened to obtain surveillance video, and pixel-level binary labeling is applied to the original surveillance video: foreground pixels are marked positive and background pixels negative; the labeled surveillance video is used as input to the subsequent training network;
S2, extracting the feature cube
Given a video dataset, each video V_i in the dataset is divided into n fixed-length video segments, i.e. V_i = {v_1, v_2, …, v_n}; each video clip v_i contains 8 overlapping frames, with a time span of 1 between consecutive frames. Each video clip is fed into the 3D convolutional neural network at the encoder end and passes through 8 convolutional layers and 4 max-pooling layers to obtain the spatio-temporal features f_i of the video segment v_i at the encoder end. To generate an original-size pixel-level segmentation map for each frame, 3D upsampling layers are used at the decoder end to increase the resolution of f_i; after each 3D upsampling layer, the corresponding encoder-end spatio-temporal features f_i are concatenated (cascaded) so as to capture spatial and temporal information of the input video segment at different scales. After 4 upsampling and 4 concatenation operations, the feature cube F_i of the input video clip v_i is obtained;
S3, localizing the behavior target
After the feature cube F_i of the input video segment v_i is obtained, a segmentation branch makes a pixel-level prediction of background or behavior foreground for each frame of v_i. The segmentation branch comprises 2 convolutional layers with 1 × 1 kernels; after the 2 convolution operations, a segmentation map is obtained for each frame of the input clip v_i, i.e. G_i = {g_1, g_2, …, g_8}, where each segmentation map is a binary map of behavior foreground versus background. From the foreground pixels of the binary map, a bounding box within the video frame is inferred, i.e. the behavior target in the video is localized;
S4, behavior recognition
After the behavior target in the video is localized in step S3, its position information is applied to the feature cube F_i obtained in step S2: the features at the corresponding positions of F_i are cropped out as a feature tube and fed into the recognition branch. The recognition branch applies ToI (tube of interest) pooling to the input feature tube and then performs 3 fully connected operations, finally obtaining a predicted behavior label that is used to judge whether drowning behavior occurs in the swimming pool area;
S5, detecting abnormal behaviors in swimming pool video
Based on the method of steps S1-S4 and the extraction of the feature cubes F_i of video clips, various videos of different swimming pool areas are labeled with different behaviors, the behaviors comprising drowning, standing and swimming, in the pixel-level labeling manner: the pixels corresponding to a behavior are marked positive and the remaining background is marked negative; training with the end-to-end 3D convolutional neural network method yields a swimming pool abnormal-behavior detection model; the real-time video stream of the swimming pool area is read, a segmentation map of each frame is obtained through the swimming pool abnormal-behavior detection model, the behavior positions of swimmers are localized and behavior labels predicted, and it is judged whether abnormal drowning behavior occurs in the swimming pool area; if abnormal behavior such as drowning occurs, an alarm is raised.
2. The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of claim 1, characterized in that: the 3D upsampling layer in step S2 recovers the low-dimensional features obtained after max pooling at the encoder end, so as to obtain higher-resolution spatio-temporal features f_i; the 3D upsampling layer processes f_i by sub-pixel convolution, specifically: the low-resolution spatio-temporal features f_i are upsampled by factors p_d, p_h and p_w in depth, height and width respectively; let p^{LR} denote the low-resolution feature map and p^{HR} the high-resolution feature map; the pixels of the high-resolution feature map are mapped from the low-resolution feature map according to

    p^{HR}(c, d, h, w) = \tilde{p}^{LR}(c', d', h', w')

where c ∈ {0, …, C_H − 1}, d ∈ {0, …, D_H − 1}, h ∈ {0, …, H_H − 1}, w ∈ {0, …, W_H − 1} index the feature channels, feature depth, feature height and feature width respectively; \tilde{p}^{LR} denotes the low-resolution feature map after channel expansion; and the indices c', d', h' and w' give the number of feature channels, feature depth, feature height and feature width after expansion, defined as

    d' = ⌊d / p_d⌋,  h' = ⌊h / p_h⌋,  w' = ⌊w / p_w⌋,
    c' = c · p_d·p_h·p_w + (d mod p_d) · p_h·p_w + (h mod p_h) · p_w + (w mod p_w).
3. The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of claim 2, characterized in that: step S4 applies a ToI pooling operation to feature tubes of different sizes to obtain fixed-length feature vectors; the ToI pooling process comprises: let I_i be the i-th activation input to the ToI pooling layer and O_j the j-th output; the partial derivative of the loss function L with respect to each input variable I_i can be expressed as

    ∂L/∂I_i = Σ_j [i = f(j)] · ∂L/∂O_j

where each pooled output O_j has a corresponding input position i, and the function f(j) returns the index of the maximum selected from the ToI.
4. The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of claim 3, characterized in that: the specific structure of the 3D convolutional neural network is formed by interleaving the input video clip, 12 convolutional layers, 4 max-pooling layers and 4 upsampling layers, finally producing the feature cube of the input video clip for the subsequent behavior-target localization and behavior recognition; wherein:
conv1 to conv12 are convolutional layers; each layer has 3 × 3 × 3 kernels, and the number of filters increases from 64 up to 512 and then decreases back to 64; the input video clip passes through 4 max-pooling operations interleaved with 8 convolution operations to obtain the conv8-layer features; the conv8 features are then upsampled by the upsample1 layer; the upsample1 features are concatenated with the conv9-layer features and fed into the upsample2 layer; the upsample2 features are concatenated with the conv10-layer features and fed into the upsample3 layer; the upsample3 features are concatenated with the conv11-layer features and fed into the upsample4 layer; finally, the upsample4 features are concatenated with the conv1-layer features to form the finally extracted features used for target localization and classification; the role of the concatenation is to fuse features carrying different temporal and spatial information, yielding features more favorable to target localization and classification;
max-pool1 to max-pool4 are max-pooling layers; the kernel of max-pool1 is 1 × 2 × 2, the kernels of the other max-pooling layers are 2 × 2 × 2, and the number of channels increases from 64 to 512; the max-pooling layers reduce the resolution of the feature maps and the number of parameters, yielding feature maps with stronger semantic information;
upsample1 to upsample4 are upsampling layers; each layer has 3 × 3 × 3 kernels, the first 3 layers have 64 filters, and the last layer has 48 filters; since max pooling reduces the resolution of the feature maps, upsampling is used to increase the resolution in order to generate an original-size pixel-level segmentation map for each frame of image.
5. The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of claim 4, characterized in that: the segmentation branch specifically comprises the input feature cube and 2 convolutional layers; the kernels of the convolutional layers are all 1 × 1, and the numbers of filters are 4096 and 2 respectively; the finally obtained feature map has size 2 × 8 × 240 × 320, where 8 is the number of input frames, 240 × 320 is the size of the input images, and 2 is the per-pixel classification result, namely whether each pixel belongs to the foreground or the background; the feature map is the segmentation map of the original frames and is used to localize the target and obtain its position information.
6. The swimming pool drowning detection method based on an end-to-end 3D convolutional neural network of claim 5, characterized in that: the recognition branch specifically comprises the input feature tube, a ToI pooling layer and 3 fully connected layers; the input feature tube is cropped from the feature cube obtained in step S2 according to the position information obtained in step S3; since the localized targets differ in size, a ToI pooling operation is applied to the input feature tube to obtain fixed-length feature vectors, and 3 fully connected layers then produce a one-dimensional feature vector for behavior-label recognition, yielding the final behavior-target classification result, which is used to judge whether abnormal behaviors such as drowning occur in the swimming pool area.
CN202010106457.6A 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network Active CN111339892B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010106457.6A CN111339892B (en) 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Publications (2)

Publication Number Publication Date
CN111339892A (en) 2020-06-26
CN111339892B (en) 2023-04-18

Family

ID=71181844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010106457.6A Active CN111339892B (en) 2020-02-21 2020-02-21 Swimming pool drowning detection method based on end-to-end 3D convolutional neural network

Country Status (1)

Country Link
CN (1) CN111339892B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613359B (en) * 2020-12-09 2024-02-02 苏州玖合智能科技有限公司 Construction method of neural network for detecting abnormal behaviors of personnel
CN113205008B (en) * 2021-04-16 2023-11-17 深圳供电局有限公司 Alarm control method for dynamic alarm window
CN114022910B (en) * 2022-01-10 2022-04-12 杭州巨岩欣成科技有限公司 Swimming pool drowning prevention supervision method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
CN108009629A (en) * 2017-11-20 2018-05-08 天津大学 A kind of station symbol dividing method based on full convolution station symbol segmentation network
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN109829443A (en) * 2019-02-23 2019-05-31 重庆邮电大学 Video behavior recognition methods based on image enhancement Yu 3D convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106097353B (en) * 2016-06-15 2018-06-22 北京市商汤科技开发有限公司 Method for segmenting objects and device, computing device based on the fusion of multi-level regional area

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018171109A1 (en) * 2017-03-23 2018-09-27 北京大学深圳研究生院 Video action detection method based on convolutional neural network
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
CN108009629A (en) * 2017-11-20 2018-05-08 天津大学 A kind of station symbol dividing method based on full convolution station symbol segmentation network
CN109829443A (en) * 2019-02-23 2019-05-31 重庆邮电大学 Video behavior recognition methods based on image enhancement Yu 3D convolutional neural networks

Also Published As

Publication number Publication date
CN111339892A (en) 2020-06-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant