CN114267082A - Bridge side falling behavior identification method based on deep understanding - Google Patents

Bridge side falling behavior identification method based on deep understanding


Publication number
CN114267082A
Authority
CN
China
Prior art keywords
falling
bridge
person
channel
personnel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111088471.9A
Other languages
Chinese (zh)
Other versions
CN114267082B (en)
Inventor
朱家祥
成孝刚
张博
汪兆斌
高波
倪杰
蔡聪聪
徐风雷
Current Assignee
Nanjing Municipal Public Security Bureau
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Municipal Public Security Bureau
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing Municipal Public Security Bureau, Nanjing University of Posts and Telecommunications filed Critical Nanjing Municipal Public Security Bureau
Priority to CN202111088471.9A priority Critical patent/CN114267082B/en
Publication of CN114267082A publication Critical patent/CN114267082A/en
Application granted granted Critical
Publication of CN114267082B publication Critical patent/CN114267082B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A — TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 50/00 — TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a bridge-side falling behavior recognition method based on deep understanding, which uses a camera that continuously monitors a bridge to capture the signs of a person falling from the bridge side and to send an alarm signal, so that the fallen person can be rescued in time. A computer vision algorithm is embedded into a camera on a cross-river bridge. The system comprises a railing-climbing behavior monitoring module, a person-falling monitoring module, a falling-splash monitoring module, a person-floating detection module, and a rescue-region prediction module. Through cross-validation among the first three modules, the system judges whether a person has climbed the cross-river bridge railing and fallen from the bridge side; if so, it raises an alarm and calls for rescue in time, so that the optimal rescue window is not missed. The latter two modules predict the position of the person in the water and notify the rescue team, facilitating the rescue work.

Description

Bridge side falling behavior identification method based on deep understanding
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a bridge side falling behavior identification method based on deep understanding.
Background
In real life, accidents in which a pedestrian falls from a bridge are frequently reported, but because such events are random and sporadic, they are difficult to detect and respond to at the first moment. At present, China relies mainly on a combination of manual patrols and alarms raised by passers-by, which is inefficient. Developing a bridge-side falling behavior recognition system that monitors 24 hours a day, 7 days a week with high accuracy therefore has great social significance.
Target detection algorithms fall broadly into two categories: "two-stage" and "one-stage" methods. A two-stage method comprises separate detection and recognition stages and is based on region proposals; representative algorithms include R-CNN and Fast R-CNN. A one-stage method is based on regression and directly regresses the class probability and position coordinates of an object; representative algorithms include the YOLO series and the SSD series. One-stage methods detect faster than two-stage methods, which suits the real-time requirement of the present method, so a one-stage method is used here.
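One-stage detectors such as YOLO emit many overlapping candidate boxes per object; a standard post-processing step (not spelled out in the text, but universal in this family of detectors) is greedy non-maximum suppression. A minimal NumPy sketch, purely illustrative:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]  # indices, highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the winning box with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        # Drop boxes that overlap the winner at or above the threshold
        order = order[1:][iou < iou_threshold]
    return keep
```

Boxes that overlap a higher-scoring detection too strongly are suppressed, leaving one box per object.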
An Attention Model (AM) simulates how the brain processes information. It has become an important component of network architectures in computer vision and has been widely applied to image classification, target detection, and other fields. The Attention Mechanism is a resource-allocation scheme that screens useful information out of a large volume of information: it focuses on the required information, devotes more attention to those regions to obtain detailed information about the target, and ignores unimportant areas. For example, when reading an article, we first look at the title to see what kind of article it is, and then at the heading of each chapter to grasp its overall structure; this is how humans, with limited attention, quickly screen high-value information out of a large amount of information. In computer vision, the attention mechanism learns a weight distribution and applies it to the original features, obtaining more detailed information about the target of interest while suppressing useless information.
Attention mechanisms can be divided into three categories: channel-domain, spatial-domain, and mixed-domain. Channel-domain attention tends to ignore the local information within each channel, while spatial-domain attention tends to ignore the differences among channels at the same spatial position. A mixed-domain attention model combines the two ideas, scoring channel attention and spatial attention simultaneously and effectively integrating the advantages of both. The most representative such module is the Convolutional Block Attention Module (CBAM).
For scenes with many target classes, target detection aims to accurately determine the class and position of each target in an image, and two-stage methods can solve this problem. Researchers mainly generate candidate boxes with a region-proposal method and then perform coordinate regression from those candidates. Ross Girshick et al. adopted a CNN to extract image features, advancing feature representation from the experience-driven hand-crafted paradigms of HOG and SIFT to data-driven representation learning, and used supervised pre-training on large samples followed by fine-tuning on small samples to alleviate problems such as small-sample training difficulty and overfitting, improving detection accuracy to a certain extent. Ross Girshick et al. then proposed Fast R-CNN, a fast region-proposal-based convolutional network method for target detection. Building on earlier work, Fast R-CNN employs deep convolutional networks to classify objects more efficiently; compared with previous work, it introduces multiple innovations that improve detection precision as well as training and testing speed.
Chinese patent publication CN112487920A discloses a climbing-behavior recognition method based on a convolutional neural network, applied in the field of target recognition, addressing the low detection precision of prior-art methods for recognizing a pedestrian climbing over a railing. That patent overcomes the low real-time performance and the unusable bounding-box sizes of traditional target detection methods by drawing a bounding box of the same size as the figure; it predicts image feature classes with a YOLO target-detection network and tracks the target with a GOTURN network; finally, it uses a prior-knowledge method on the relative position of the railing and the set of track points to judge quickly whether a crossing behavior has occurred, and if so outputs a crossing label and raises a warning.
Although that patent can accurately identify the behavior of a pedestrian crossing a railing, its aim is to predict, from video frames collected in real time, whether a pedestrian may cross the railing next. If that algorithm were used to detect falling behavior at the bridge side, the false-detection rate would be high, because it does not combine detection results from the different behavioral stages of the pedestrian; moreover, it can only detect railing-crossing behavior, not a person falling or entering the water, and is therefore unsuitable for detecting bridge-side falling behavior. It is thus necessary to provide a deep-learning-based detection method that realizes automatic detection of bridge-side falling behavior, so as to win the golden five minutes for rescue.
Disclosure of Invention
The invention aims to provide a bridge-side falling behavior recognition method based on deep understanding, which can effectively detect bridge-side falling behavior and prevent the tragedies that occur when a pedestrian falls at the bridge side but rescue does not arrive in time.
In order to achieve the purpose, the invention adopts the technical scheme that:
a bridge side falling behavior identification method based on depth understanding comprises the following steps:
s1, collecting video data of a panoramic camera of the monitoring bridge beside the bridge at the edge of the river in real time, and preprocessing the video data;
s2, pre-judging whether a pedestrian falls from the side of the bridge or not by using the pre-processed video data; using the fence at the bridge edge and the periphery of the fence as an interest domain, identifying whether a person crosses the fence by using a trained YOLO-Attention model, and verifying whether the person crosses the fence in an auxiliary way by using a monitoring camera on the bridge floor and a warning region algorithm; if the bridge recognizes that the person crosses the fence, a railing boundary crossing signal is generated, and the step S3 is entered; otherwise, returning to the step S1;
s3, detecting whether a person falls in the bridge edge fence or not; taking the area under the bridge and the river as the interest areas, detecting whether a person falls by using a trained YOLO-Attention model, if the person falls, sending a person falling signal, and entering the step S4;
s4, detecting whether falling water bloom exists on the river surface under the bridge; setting the river surface under the bridge as an interest area, detecting whether falling water bloom generated after falling of a person occurs by using a trained YOLO-Attention model, if the falling water bloom occurs, judging that the person falls into water, sending a falling water bloom signal, and entering the step S5;
s5, detecting whether a person floats on the river surface by using the trained YOLO-Attention model, and if detecting that the person floats, sending the position of the floating person to a rescue worker; if the person is not detected to float, judging that the person sinks into the river, and entering step S6;
S6, construct a water-flow model from the flow speed in the river and the position of the falling splash to predict the approximate position of the person in the water, and send the predicted position information to the rescuers.
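The cascaded signal flow of steps S1–S6 can be sketched as a small Python function. The per-stage detectors are stand-ins for the trained YOLO-Attention models described above and are injected as callables; all names here are illustrative assumptions, not the patent's implementation:

```python
def bridge_fall_pipeline(frame, detect_climb, detect_fall, detect_splash,
                         detect_float, predict_position):
    """Cascade the S2-S6 checks; each later stage runs only if the
    earlier stage raised its signal (the cross-validation in the text)."""
    if not detect_climb(frame):           # S2: railing out-of-range signal?
        return "no_event"
    if not detect_fall(frame):            # S3: person-falling signal?
        return "climb_false_alarm"
    if not detect_splash(frame):          # S4: falling-splash signal?
        return "fall_false_alarm"
    pos = detect_float(frame)             # S5: floating person's position, or None
    if pos is not None:
        return ("alert_rescue", pos)
    return ("alert_rescue", predict_position(frame))  # S6: drift prediction
```

Each stage gates the next, so a single spurious detection cannot by itself trigger an alarm.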
Specifically, in step S1 the video data are preprocessed as follows: an adaptive defogging algorithm decides whether a video image needs defogging. If the total bounded variation (TBV) of the image is judged to be larger than a set threshold, the image does not need defogging; otherwise, it does.
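The patent does not define TBV precisely; assuming it is the image's total variation (sum of absolute differences between neighboring pixels, which fog lowers by reducing contrast), the adaptive decision can be sketched as:

```python
import numpy as np

def total_variation(img):
    """Sum of absolute horizontal and vertical neighbor differences
    of a grayscale image (one common total-variation definition)."""
    dx = np.abs(np.diff(img, axis=1)).sum()
    dy = np.abs(np.diff(img, axis=0)).sum()
    return float(dx + dy)

def needs_defogging(img, threshold):
    # Foggy frames are low-contrast, so their total variation is small;
    # only frames at or below the threshold go to the defogger.
    return total_variation(img) <= threshold
```

A sharp, high-contrast frame yields a large total variation and skips the (expensive) defogging network, which is what saves system resources.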
The invention provides a defogging method based on DeblurGAN-v2, which can effectively remove fog on the river surface and makes the subsequent bridge-side falling behavior detection more accurate. DeblurGAN-v2 is the core of the defogging algorithm: a Feature Pyramid Network (FPN) is adopted as the core module of the generator. The low-level features extracted by the feature pyramid contain little semantic information but localize the target accurately, while the high-level features are semantically rich but localize the target coarsely; the high-level features are fused with the low-level features through upsampling, and a prediction is made independently after the features of each level are fused.
The generator backbone selects the more complex Inception-ResNet-v2, which combines the Inception module with the ResNet structure. After the input enters an Inception module, several paths are available and the network can learn which filters to use, so sparse or non-sparse features at the same level can be captured well. ResNet is a stack of residual modules in which the neurons learn the difference between the objective function and the input; as network depth increases, this greatly accelerates the convergence of the neural network, reduces training error, and improves accuracy.
The generator loss function is a weighted sum of a pixel-level loss, a perceptual loss, and an adversarial loss:

L_G = 0.5 × L_pix + 0.006 × L_p + 0.01 × L_adv

where L_pix is the pixel-level mean-squared-error loss; L_p is the perceptual loss, computed as the Euclidean distance between feature maps extracted by a 3 × 3 convolution of the VGG19 network; L_adv is the adversarial (local) loss on patches of size 70 × 70; and L_G is the total generator loss. This combination of losses ensures that the network's convergence takes into account both local image detail and overall image style.
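The weighted combination above is straightforward to express in code. A minimal sketch, assuming the perceptual and adversarial terms are computed elsewhere (by the VGG19 feature distance and the 70×70-patch discriminator respectively):

```python
import numpy as np

def mse(a, b):
    """Pixel-level loss L_pix: mean squared error between two images."""
    return float(np.mean((a - b) ** 2))

def generator_loss(pred, target, l_perc, l_adv):
    """L_G = 0.5*L_pix + 0.006*L_p + 0.01*L_adv (weights from the text).

    l_perc and l_adv are assumed precomputed scalars, standing in for
    the VGG19 perceptual loss and the patch adversarial loss."""
    return 0.5 * mse(pred, target) + 0.006 * l_perc + 0.01 * l_adv
```

The small weights on the perceptual and adversarial terms mean the pixel loss dominates early training while the other terms shape texture and style.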
The discriminator of DeblurGAN-v2 adopts a double-discriminator structure: it keeps the PatchGAN structure as a local discriminator judging patches of size 70 × 70, and additionally introduces a global discriminator judging the whole image. The discriminator can thus find a balance between global and local image information.
The discriminator loss uses RaGAN-LS, a relativistic-average modification of the Least Squares GAN (LSGAN) loss, which helps the network converge more smoothly and efficiently:
L_D = E_{x∼p_data(x)}[(D(x) − E_{s∼p_s(s)}[D(G(s))] − 1)²] + E_{s∼p_s(s)}[(D(G(s)) − E_{x∼p_data(x)}[D(x)] + 1)²]

where D(·) is the discriminator function, G(·) is the generator function, L_D is the discriminator loss, E is the mathematical expectation, x∼p_data(x) denotes a data sample x following the data distribution p_data(x), and s∼p_s(s) denotes a noise sample s following the noise prior distribution p_s(s).
Specifically, in steps S2 and S5 the person occupies very few pixels in the monitoring image while climbing over or floating; in step S3 the person falls very fast; and in step S4 splashes occasionally appear on the river surface whose characteristics differ from those of a falling splash. To meet these challenges, a mixed-domain attention mechanism is added to the YOLO-Attention model to improve detection accuracy.
Further, the invention adopts the CBAM mixed-domain attention mechanism, which combines a channel attention mechanism with a spatial attention mechanism: the feature map passes through two sub-models in sequence, first the channel attention model and then the spatial attention model, after which a reconstructed feature map is output. Mimicking human attention, CBAM reassigns the weights of the feature map through continual self-learning so as to emphasize heavily weighted features and suppress useless ones, thereby improving network performance.
Further, the channel-domain attention mechanism in the CBAM module learns and assigns a weight distribution over the channels according to their differing importance, focusing on the important feature channels and weakening the influence of the others so as to improve network performance. The channel-wise weight distribution is assigned to the feature map through a three-step operation, implemented as follows:
In the first step, the squeeze operation (Squeeze) compresses the two-dimensional feature (H × W) of each channel into a single real number through global pooling, a feature compression along the spatial dimensions. Because this real number is computed from all values of the two-dimensional feature, it has, to some extent, a global receptive field; the number of channels remains unchanged, so the output after the squeeze operation has size 1 × 1 × C. The operation is:

z_c = F_sq(u_c) = (1 / (W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} u_c(i, j)

where F_sq(·) is the squeeze function; W and H are, respectively, the width and height of the feature map to be processed; u_c(i, j) is the element of the c-th channel of the feature map at coordinate (i, j); and z_c is the output feature of the c-th channel after squeezing. After the squeeze operation, a one-dimensional tensor whose length equals the number of channels is formed.
In the second step, the excitation operation (Excitation) generates a weight for each feature channel through a parameter W, outputting as many weights as there are input feature channels. The operation is:

s = F_ex(z, W) = σ(g(z, W)) = σ(W₂ δ(W₁ z))

where F_ex denotes the excitation operation; z is the output of the squeeze operation, a tensor of size 1 × 1 × C, with C the number of channels of the feature map; W₁ ∈ R^{(C/r)×C} and W₂ ∈ R^{C×(C/r)} are weights, with r a scaling parameter that reduces the number of channels and thus the amount of computation; δ denotes the ReLU activation function and σ the Sigmoid activation function; and s, the output of the excitation function, describes the weights of the feature map. Reading from the last equality: multiplying z by W₁ is a fully connected operation whose result has dimension 1 × 1 × (C/r); this passes through a ReLU layer with the dimension unchanged, is then multiplied by W₂ in a second fully connected operation, restoring the dimension to 1 × 1 × C, and finally passes through a Sigmoid function to obtain s. This s is the core of the SE module: it characterizes the weight of each channel of the feature map, learned through the preceding fully connected and nonlinear layers.
In the third step, feature re-weighting (Scale) applies the weights obtained from the excitation operation to the channel features, multiplying channel by channel, which completes the introduction of the attention mechanism in the channel dimension:

x̃_c = F_scale(u_c, s_c) = s_c · u_c

where F_scale(·, ·) denotes the channel-wise re-weighting function, x̃_c denotes the c-th channel of the output feature map, s_c is the weight of the c-th channel, and u_c is the c-th channel of the input feature map.
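The three steps above (squeeze, excitation, scale) can be sketched as a single NumPy forward pass. This is an illustrative sketch of the SE-style channel attention, not the patent's trained network; `w1` and `w2` stand for the two fully connected weight matrices W₁ and W₂:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_attention(feat, w1, w2):
    """Channel attention on a (C, H, W) feature map.

    w1: (C//r, C) and w2: (C, C//r) are the two FC weights."""
    c = feat.shape[0]
    # Squeeze: global average pooling -> one real number per channel (1x1xC)
    z = feat.reshape(c, -1).mean(axis=1)
    # Excitation: FC -> ReLU -> FC -> Sigmoid gives the per-channel weights s
    s = sigmoid(w2 @ np.maximum(0.0, w1 @ z))
    # Scale: re-weight each channel of the input feature map by s_c
    return feat * s[:, None, None]
```

With zero weights every channel receives the neutral weight sigmoid(0) = 0.5, which makes the behavior easy to verify.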
Further, the spatial-domain attention mechanism in the CBAM module exploits the spatial structure of the features, modeling their spatial relationships. First, channel-wise maximum pooling and average pooling are applied to the feature map, yielding two W × H × 1 descriptions, which are then concatenated along the channel dimension to form a valid feature descriptor. A convolutional layer is then applied to generate a spatial attention map M_S(F) ∈ R^{H×W}, whose weight coefficients encode the locations to attend to or suppress. The two feature maps produced by the two pooling operations represent, in turn, the average-pooled feature and the max-pooled feature, and the result is passed through an activation function to obtain the final map:

M_S(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])) = σ(f^{7×7}([F_avg^S; F_max^S]))

where σ denotes the Sigmoid activation function; f^{7×7} is a convolutional layer with kernel size 7 × 7; M_S(F) is the resulting spatial attention map; AvgPool(F) and MaxPool(F) are, respectively, the channel-wise average and maximum pooling of the feature map; F_max^S is the max-pooled feature; and F_avg^S is the average-pooled feature.
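A minimal NumPy sketch of the spatial attention path follows. To stay self-contained it replaces the 7 × 7 convolution with a 1 × 1 mixing of the two pooled maps (a simplifying assumption; the structure of pool → mix → sigmoid → re-weight is the same):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, mix_weights):
    """Spatial attention over a (C, H, W) feature map.

    mix_weights: length-2 array mixing the two pooled descriptors,
    standing in for the 7x7 convolution of the text."""
    avg_pool = feat.mean(axis=0)   # F_avg: channel-wise average, shape (H, W)
    max_pool = feat.max(axis=0)    # F_max: channel-wise maximum, shape (H, W)
    mixed = mix_weights[0] * avg_pool + mix_weights[1] * max_pool
    m_s = sigmoid(mixed)           # spatial attention map M_S(F), shape (H, W)
    return feat * m_s[None, :, :]  # broadcast the map over all channels
```

Every spatial position gets one weight, shared across channels, which is exactly the complement of the channel attention above.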
Further, in step S2, because a person occupies few pixels in the monitoring image, surrounding pedestrians interfere with the model's detection and cause false detections. An alert-zone algorithm is therefore introduced in step S2, and the detection algorithm of the bridge-side fall pre-judgment module is also embedded in a camera on the bridge roadway. The alert-zone algorithm and the bridge-deck camera detection serve as auxiliary verification and, combined with the YOLO-Attention detection, form a cross-validation that improves the accuracy of the algorithm.
Specifically, the alert-zone algorithm in step S2 is implemented as follows:
the first step is as follows: setting a background template of the alert area:
the method comprises the steps of setting an image of a bridge floor guardrail region without a person as a background template, inputting a feature map of the background template region into a deep learning detection network for training, and enabling the network to adaptively find out an alarm ring region through a certain amount of training.
Second step: set the early-warning features.
An image difference operation is performed between frames of the bridge-deck guardrail with a person present and the background-template frame; the difference information between the person-present image and the background template is extracted as an early-warning feature map and input into the detection network for training, giving the network an early-warning capability. When someone climbs the guardrail beside the bridge, the features of the alert zone change accordingly into early-warning features, and the detection model raises an early warning.
Third step: screen out the influence factors.
Owing to optical flow and rain, when garbage, debris, or flying objects such as birds thrown or passing from the bridge deck cross the alert zone, the image features change and may be taken by the network as early-warning features, causing false detections. The method therefore defines the features of optical flow, rain, garbage and other debris, and of flying objects such as birds passing through the alert zone as influence factors that must be screened out. A difference operation is first performed between the frame in which an influence-factor feature appears in the alert zone and the background-template frame; the difference information between the influence-factor image and the background template is extracted as an influence-factor feature map and input into the detection network for training, so that the network acquires the ability to recognize influence factors. When the network detects that the video features have changed but only due to an influence factor, the influence factor is screened out.
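The core of all three steps is background differencing over a masked region. A minimal sketch with NumPy, where a simple changed-pixel-area rule stands in for the detection network the patent trains on the difference maps (thresholds and names are illustrative assumptions):

```python
import numpy as np

def alert_zone_signal(frame, background, mask, diff_threshold, area_threshold):
    """Return (flag, feature_map) for an alert zone.

    frame, background: grayscale images; mask: boolean alert-zone region.
    flag is True when enough pixels inside the zone differ from the
    background template; feature_map is the masked difference image that
    the patent would feed to the detection network."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    changed = (diff > diff_threshold) & mask
    return bool(changed.sum() >= area_threshold), diff * mask
```

In the patent this difference map is classified by the trained network (early warning vs. influence factor); here the area rule only illustrates the signal path.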
Specifically, in step S3, if no person-falling signal is generated, the method returns to step S2 to detect again whether a person has crossed the fence; if the railing out-of-range signal is no longer generated, the earlier out-of-range signal is judged to be a false detection. If the out-of-range signal is still generated, step S3 is executed again to detect whether a person is falling; if no person-falling signal is then generated, the railing out-of-range signal is judged to be a false detection.
Specifically, in step S4, if no falling-splash signal is generated, the method returns to step S3 to detect again whether a person is falling; if no falling person is detected, the earlier falling signal is judged to be a false detection. If the person-falling signal is still generated, step S4 is executed again to detect whether a falling splash is present; if no splash signal is then generated, the person-falling signal is judged to be a false detection.
Specifically, in step S6, the invention adopts a rescue-region prediction algorithm to predict the approximate position of the person in the water by predicting the drift trajectory of the drifting person. Although the flow in the river appears straight, the trajectory of a person in the water is not, owing to the influence of various factors, and a simplistic prediction would be inaccurate. The method therefore takes the wind speed and direction at the point of entry and the flow speed and direction of the water into account, and establishes the equation of motion of the drifting target:

x(t + Δt) = x(t) + (V_c + V_w) · Δt

where V_c is the wind velocity field, V_w is the water-flow velocity field, x(t) is the position of the person in the water at time t, and x(t + Δt) is the position after a time Δt. The method obtains drift data under various parameter settings by performing simulated floating experiments in the river with a dummy, and fits the drift trajectory to the data obtained. The wind force and flow velocity are recorded as discrete data, and the recorded data are limited; to further improve the accuracy of the predicted trajectory, the method divides the sampling of the wind and water-flow velocity fields into smaller intervals and applies Lagrange interpolation within them to obtain the unknown data, further reducing the prediction error.
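Lagrange interpolation, used above to densify the sparse wind- and flow-speed records, is a standard construction and can be sketched in a few lines (the sample points below are illustrative, not measured data from the patent):

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate at x the Lagrange polynomial through points (xs[i], ys[i]).

    Used here to estimate wind/flow speed between recorded samples."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                # Basis polynomial: 1 at xs[i], 0 at every other sample point
                term *= (x - xj) / (xi - xj)
        total += term
    return total
```

Through three samples of y = x² the interpolant reproduces the quadratic exactly, so intermediate values such as x = 1.5 come out right.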
Because manual experiments are difficult and costly to carry out, a Monte Carlo simulation is used to simulate the drift trajectories of persons, building on the fit to the manual-experiment data and the Lagrange interpolation. The Monte Carlo method sets up a random process, continually generates a time series, and studies the distribution of the process by computing statistics of that series. Concretely, the drifting person is abstracted as a particle, and each particle is given an influence factor reflecting factors such as the wind speed over the river and the water-flow speed; the particles are then massively replicated to generate the drift of a particle swarm; finally, the drift trajectories of part of the particle swarm are taken as the prediction of the person's drift trajectory.
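The particle-swarm idea can be sketched with the standard library alone. All parameters (noise model, constant velocity fields, mean-of-endpoints summary) are illustrative assumptions, not the patent's calibrated model; the update rule is the equation of motion x(t + Δt) = x(t) + (V_c + V_w)·Δt from the text:

```python
import random

def simulate_drift(x0, y0, v_water, v_wind, dt, steps, n_particles,
                   noise, seed=0):
    """Monte Carlo drift prediction: replicate the person as particles,
    each with a random influence factor perturbing the current/wind
    effect, and advance each one by the drift equation of motion."""
    rng = random.Random(seed)
    endpoints = []
    for _ in range(n_particles):
        x, y = x0, y0
        # per-particle influence factor for river wind and current
        fx = 1.0 + rng.uniform(-noise, noise)
        fy = 1.0 + rng.uniform(-noise, noise)
        for _ in range(steps):
            x += fx * (v_water[0] + v_wind[0]) * dt
            y += fy * (v_water[1] + v_wind[1]) * dt
        endpoints.append((x, y))
    # predicted search position: mean of the particle cloud
    mx = sum(p[0] for p in endpoints) / n_particles
    my = sum(p[1] for p in endpoints) / n_particles
    return mx, my, endpoints
```

With noise set to zero the swarm collapses to the deterministic drift equation, which gives a simple sanity check; with nonzero noise the spread of `endpoints` delineates the rescue search region.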
Compared with the prior art, the invention has the following beneficial effects. (1) The DeblurGAN-v2-based defogging method improves the restoration of video image detail texture, of color differences over the river-surface area, and of local artifacts; it effectively removes fog on the river surface and makes the subsequent bridge-side falling behavior detection more accurate. The defogging algorithm is adaptive and is applied only when the river is foggy, greatly saving system resources. (2) In the detection stage of recognizing bridge-side falling behavior, warning lines are established at the edges of the bridge-deck railings to detect out-of-range persons; the many pedestrians and tourists on the bridge cause considerable interference and would otherwise generate a large number of false alarms. To reduce false alarms and effectively discover potential target persons, a large number of training samples are built from measured data and manual simulation, an attention mechanism is fused into the YOLO network, and climbing behavior is detected; persons who may fall from the bridge side are screened out through cross-validation among the alert-zone algorithm, the bridge-deck camera, and YOLO-Attention. The method improves the accuracy to 75%. (3) The invention uses the joint detection of the three modules of person out-of-range, person falling, and falling splash, and establishes a strict signal-transmission and regression mechanism among them, greatly reducing the probability of false alarms; meanwhile, using the measured data, the drift trajectory of the person in the water is analyzed with mathematical rigor to predict the person's approximate position, providing great convenience for the rescuers' search.
Drawings
Fig. 1 is a flow chart of the bridge side falling behavior recognition method based on deep understanding according to the present invention.
FIG. 2 is a schematic structural diagram of the DeblurGAN-v2 generator in the defogging algorithm according to the embodiment of the invention.
Fig. 3 is a schematic structural diagram of a CBAM module according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a backbone network structure of the YOLO algorithm with an attention mechanism added in the embodiment of the present invention.
FIG. 5 is a schematic diagram of a pedestrian crossing the railing in an embodiment of the present invention.
Fig. 6 is a schematic view of a bridge surveillance zone in an embodiment of the present invention.
FIG. 7 is a schematic diagram of auxiliary verification of a bridge deck highway camera in the embodiment of the invention.
Fig. 8 is a schematic diagram of a person falling in an embodiment of the present invention.
Fig. 9 is a schematic view of a falling splash in the embodiment of the invention.
FIG. 10 is a schematic view of the floating of a person in an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the present embodiment provides a bridge side falling behavior identification method based on deep understanding, in which a strict cross-validation method and a signal-transmission mechanism accurately identify the behavior of a person falling from the bridge side. The method includes the following steps:
s1, collecting in real time the video data of the bridge-side panoramic camera that monitors the river bridge, and preprocessing the video data;
s2, pre-judging, using the preprocessed video data, whether a pedestrian may fall from the side of the bridge: the fence at the bridge edge and its periphery are taken as the region of interest, the trained YOLO-Attention model identifies whether a person crosses the fence, and the monitoring cameras on the bridge deck together with the warning-region algorithm assist in verifying the crossing. If a person is recognized crossing the fence, a railing boundary-crossing signal is generated and the method enters step S3; otherwise it returns to step S1;
s3, detecting whether a person falls in the bridge edge fence or not; taking the area under the bridge and the river as the interest areas, detecting whether a person falls by using a trained YOLO-Attention model, if the person falls, sending a person falling signal, and entering the step S4;
s4, detecting whether there is a falling splash on the river surface under the bridge: the river surface under the bridge is set as the region of interest, and the trained YOLO-Attention model detects whether the splash generated when a person falls has occurred; if the falling splash occurs, it is judged that the person has fallen into the water, a falling splash signal is sent, and the method enters step S5;
s5, detecting whether a person floats on the river surface by using the trained YOLO-Attention model, and if detecting that the person floats, sending the position of the floating person to a rescue worker; if the person is not detected to float, judging that the person sinks into the river, and entering step S6;
and S6, constructing a water flow model to predict the approximate position of the person falling into the water according to the water flow speed in the river and the position of falling water bloom, and sending the predicted position information to the rescue personnel.
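The staged chain of steps S1-S6 can be sketched as a simple signal-passing pipeline, in which each stage runs only after the previous stage has raised its signal; the detector callables below are hypothetical stand-ins for the trained YOLO-Attention models, and the state names are illustrative.

```python
# Minimal sketch of the S1-S6 signal-passing chain. Each later stage only
# runs after the earlier stage has raised its signal, which suppresses
# false alarms; the detector functions are hypothetical stand-ins for the
# trained YOLO-Attention models described in the text.
def run_pipeline(frame, detectors):
    if not detectors["crossing"](frame):    # S2: railing boundary-crossing signal
        return "no_event"
    if not detectors["falling"](frame):     # S3: person-falling signal
        return "crossing_false_alarm"
    if not detectors["splash"](frame):      # S4: falling-splash signal
        return "falling_false_alarm"
    if detectors["floating"](frame):        # S5: floating person found
        return "report_floating_position"
    return "predict_drift_position"         # S6: person submerged, predict drift

# Usage: simulate a confirmed fall with no visible floating person.
dets = {"crossing": lambda f: True, "falling": lambda f: True,
        "splash": lambda f: True, "floating": lambda f: False}
result = run_pipeline(None, dets)
```

If any intermediate signal is missing, the pipeline reports the earlier signal as a false alarm, mirroring the regression mechanism of steps S3 and S4.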
Specifically, in step S1, the video data are preprocessed as follows: an adaptive defogging algorithm judges whether the video image needs defogging processing; if the TBV (total bounded variation) of the image is larger than a set threshold value, the image does not need defogging processing, otherwise the image needs defogging processing. In this embodiment, cameras are provided both beside the bridge and on the bridge deck, and the bridge deck cameras can assist in detecting boundary crossing, improving the validity and completeness of the detection.
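The adaptive trigger can be sketched with a simple total-variation measure: fog flattens the image, lowering its variation. The threshold value and synthetic frames below are illustrative assumptions, not the patent's actual TBV computation or setting.

```python
import numpy as np

def total_variation(img):
    # Sum of absolute vertical and horizontal gradients; a foggy frame is
    # flatter than a clear one, so its total variation is lower.
    dh = np.abs(np.diff(img, axis=0)).sum()
    dv = np.abs(np.diff(img, axis=1)).sum()
    return float(dh + dv)

def needs_defogging(img, threshold):
    # Per the method: if the variation exceeds the threshold, the frame is
    # clear enough to skip defogging, saving system resources.
    return total_variation(img) <= threshold

# Usage with synthetic frames: a high-contrast "clear" frame vs a flat "foggy" one.
clear = np.tile(np.array([0.0, 1.0] * 8), (16, 1))
foggy = np.full((16, 16), 0.5)
```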
On a river bridge with abundant water vapor, fog is a very common phenomenon; however, when fog covers the river bridge, recognition by the camera becomes difficult, so when fog occurs, the video captured by the camera needs to be defogged. The invention provides a defogging method based on DeblurGAN-v2, which can effectively remove fog on the river surface and makes the subsequent bridge side falling behavior detection more accurate. DeblurGAN-v2 is the core of the defogging algorithm and adopts a Feature Pyramid Network (FPN) structure as the core module of the generator: the low-layer features extracted by the feature pyramid contain less semantic information but locate the target accurately, while the high-layer features are semantically rich but locate the target coarsely; high-layer features are fused with low-layer features through upsampling, and a prediction is made independently after the features of each layer are fused.
As shown in fig. 2, the generator backbone selects the more complex Inception-ResNet-v2, which combines the Inception module with the structure of ResNet. After the input enters the Inception module, several paths are available and the network learns which filters to use, so that both sparse and non-sparse features on the same layer are captured well; ResNet is a stack of residual modules in which neurons learn the difference between the objective function and the input, which, as network depth increases, greatly accelerates the convergence of the neural network, reduces the training error and improves the accuracy of the network.
The generator loss function consists of a weighted sum of the pixel-level loss, the perceptual loss and the adversarial (local) loss:

L_G = 0.5 × L_pix + 0.006 × L_p + 0.01 × L_adv

wherein L_pix is the minimum mean square error; L_p is the perceptual loss, calculated as the Euclidean distance between feature maps extracted by the 3 × 3 convolution kernels of the VGG19 network; L_adv is the adversarial loss with patch size 70 × 70; L_G is the generator loss. Such a combination of losses ensures that the convergence of the network takes into account both the local image details and the overall image style.
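The weighted combination can be sketched as follows; the component losses are taken as precomputed scalars, since evaluating L_p and L_adv would require the VGG19 feature extractor and the discriminator, which are outside this sketch.

```python
# Sketch of the DeblurGAN-v2 generator loss combination used here:
# L_G = 0.5*L_pix + 0.006*L_p + 0.01*L_adv, with the three component
# losses passed in as precomputed scalar values.
def generator_loss(l_pix, l_perceptual, l_adv):
    return 0.5 * l_pix + 0.006 * l_perceptual + 0.01 * l_adv

# Usage: equal unit component losses combine to 0.516.
lg = generator_loss(1.0, 1.0, 1.0)
```

The small weights on L_p and L_adv keep the pixel-level term dominant, so training favours faithful reconstruction while the other terms refine style and realism.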
The discriminator of DeblurGAN-v2 adopts a double-discriminator structure: it keeps the PatchGAN structure as a local discriminator judging 70 × 70 patches, and also introduces a global discriminator judging the whole image. The discriminator can therefore find a balance point between global and local image information, taking both into account.
The discriminator loss function uses RaGAN-LS, a modification of the least-squares GAN (LSGAN) loss, which helps the network converge more smoothly and efficiently:

L_D = E_{x~p_data(x)}[(D(x) − E_{s~p_s(s)} D(G(s)) − 1)²] + E_{s~p_s(s)}[(D(G(s)) − E_{x~p_data(x)} D(x) + 1)²]

wherein D(·) is the discriminator function, G(·) is the generator function, L_D is the discriminator loss, E is the mathematical expectation, x~p_data(x) denotes a data sample x obeying the data distribution p_data(x), and s~p_s(s) denotes a noise sample s obeying the noise prior distribution p_s(s).
Specifically, in step S2 and step S5, the person occupies only a small part of the monitoring picture; in step S3, the person falls very quickly; in step S4, besides the splash of a falling person, the river surface may occasionally show other splashes whose characteristics differ from those of a falling-person splash. To address these challenges, a mixed-domain attention mechanism is added to the YOLO-Attention model to improve the detection accuracy.
Further, as shown in fig. 3, the invention employs the CBAM mixed-domain attention mechanism, which combines a channel attention mechanism and a spatial attention mechanism: the feature map first passes through the channel attention module, then through the spatial attention module, and the reconstructed feature map is output. CBAM mimics human attention, reassigning the weights of the feature map through continuous self-learning so as to emphasize heavily weighted features and suppress useless ones, thereby improving network performance.
Further, the channel-domain attention mechanism in the CBAM module learns and assigns a weight distribution over the different channels according to their importance, focusing on important feature channels and weakening the influence of the others so as to improve network performance; the channel-wise weight distribution of the obtained feature map is assigned through a three-step operation. The specific implementation is as follows:
In the first step, the squeeze operation (Squeeze) compresses the two-dimensional feature (H × W) of each channel into one real number through global pooling (Global Pooling), a feature compression along the spatial dimension; because the real number is calculated from all values of the two-dimensional feature, it has, to some extent, a global receptive field. The number of channels remains unchanged, so the feature map becomes 1 × 1 × C after the squeeze operation. The specific formula is:

z_c = F_sq(u_c) = (1 / (W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} u_c(i, j)

wherein F_sq(·) is the squeeze function, W and H are respectively the width and height of the feature map to be processed, u_c(i, j) is the element of the c-th channel of the feature map at coordinate (i, j), and z_c denotes the output feature of the squeezed c-th channel. After the squeeze operation, a one-dimensional tensor with length equal to the number of channels is formed;
In the second step, the excitation operation (Excitation) generates a weight value for each feature channel through the parameter W and outputs as many weight values as there are input feature channels. The specific formula is:

s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z))

wherein F_ex denotes the excitation operation; z is the output of the squeeze operation, a tensor of size 1 × 1 × C, with C the number of channels of the feature map; W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)} are weights, where r is a scaling parameter used to reduce the number of channels and hence the amount of computation, and R denotes a linear space over the real number field; δ denotes the ReLU activation function and σ the Sigmoid activation function; s is the output of the excitation function, describing the weights of the feature map. Reading from the last equality: multiplying z by W_1 is a fully connected operation whose result has dimension 1 × 1 × (C/r); the ReLU layer leaves the dimension unchanged; multiplying by W_2 is another fully connected operation, changing the dimension back to 1 × 1 × C; finally s is obtained through the Sigmoid function. This s is the core of the SE module, used to characterize the weights of the feature map; the weights are learned through the preceding fully connected and nonlinear layers.
Thirdly, feature weight calibration (Scale), weighting the weight value obtained by the excitation operation on each channel feature, multiplying the weight coefficient by the channel by channel to finish introducing an attention mechanism into the channel dimension, wherein the specific operation formula is as follows:
Figure RE-GDA0003473598370000113
wherein, Fscale(. represents)The function is identified and, in response to the identification,
Figure RE-GDA0003473598370000114
representing the output layer c channel characteristics, scRepresents the weight of the c-th channel, ucRepresenting the features of the c-th channel of the input feature map.
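The three-step squeeze-excite-scale operation can be sketched in NumPy; the weight matrices here are random stand-ins for learned parameters, and the reduction ratio r = 2 is an illustrative choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(u, w1, w2):
    """Squeeze-excite-scale channel attention as in the CBAM channel branch.
    u: feature map of shape (C, H, W); w1: (C//r, C) reduction weights;
    w2: (C, C//r) expansion weights (illustrative stand-ins for learned
    parameters)."""
    # Squeeze: global average pooling compresses each H x W channel to one real.
    z = u.mean(axis=(1, 2))                    # shape (C,)
    # Excitation: two fully connected layers, ReLU then Sigmoid.
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))  # shape (C,), values in (0, 1)
    # Scale: re-weight each channel of the input feature map.
    return u * s[:, None, None]

# Usage: 4 channels, reduction ratio r = 2, fixed random weights.
rng = np.random.default_rng(0)
u = rng.normal(size=(4, 8, 8))
w1 = rng.normal(size=(2, 4))
w2 = rng.normal(size=(4, 2))
out = channel_attention(u, w1, w2)
```

Because the Sigmoid keeps every channel weight in (0, 1), the output can only attenuate channels, never amplify them, which is exactly the suppression behaviour described above.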
Further, the spatial-domain attention mechanism in the CBAM module exploits the spatial structure of the features, modelling their spatial relationships. First, channel-wise maximum pooling and average pooling are performed on the feature map to obtain two W × H × 1 channel descriptions, which are then concatenated along the channel dimension to form a valid feature descriptor. A convolutional layer is then applied to generate a spatial attention map M_s(F) ∈ R^{H×W}, whose weight coefficients encode the locations requiring attention or suppression; the two pooling operations produce the average-pooling feature and the maximum-pooling feature, which are passed through the convolution and an activation function to obtain the final result. The specific formula is:

M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])) = σ(f^{7×7}([F_avg^s; F_max^s]))

where σ denotes the Sigmoid activation function and f^{7×7} a convolutional layer with kernel size 7 × 7; M_s(F) is the resulting spatial attention map; AvgPool(F) and MaxPool(F) are respectively the channel-wise average-pooling and maximum-pooling operations on the feature map; F_max^s is the maximum-pooling feature and F_avg^s is the average-pooling feature.
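The spatial branch can likewise be sketched in NumPy with a naive 7 × 7 convolution over the stacked average- and max-pooling descriptors; the kernel weights are random stand-ins for learned parameters.

```python
import numpy as np

def spatial_attention(f, kernel):
    """CBAM spatial-attention branch sketch. f: feature map (C, H, W);
    kernel: (7, 7, 2) convolution weights over the stacked [avg; max]
    channel descriptors (illustrative stand-ins for learned weights).
    Returns the H x W attention map M_s(F) with values in (0, 1)."""
    avg = f.mean(axis=0)                   # channel-wise average pooling, (H, W)
    mx = f.max(axis=0)                     # channel-wise max pooling, (H, W)
    desc = np.stack([avg, mx], axis=-1)    # concatenated descriptor, (H, W, 2)
    h, w = avg.shape
    pad = np.pad(desc, ((3, 3), (3, 3), (0, 0)))  # zero padding for the 7x7 kernel
    out = np.empty((h, w))
    for i in range(h):                     # naive 7x7 convolution
        for j in range(w):
            out[i, j] = np.sum(pad[i:i + 7, j:j + 7] * kernel)
    return 1.0 / (1.0 + np.exp(-out))      # Sigmoid activation

# Usage: small random feature map and kernel.
rng = np.random.default_rng(1)
f = rng.normal(size=(4, 8, 8))
kernel = rng.normal(size=(7, 7, 2)) * 0.1
m = spatial_attention(f, kernel)
```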
Further, in step S2, a pedestrian intending to end his or her life usually climbs over the guardrail at the edge of the bridge before jumping, and most people hesitate on the guardrail before jumping down, as shown in fig. 5. At this point the method performs personnel crossing detection to judge whether someone intends to attempt suicide, so that an early warning can be given; if personnel crossing is detected, the system generates a personnel crossing signal. The method uses YOLO with the added attention mechanism to detect out-of-bounds personnel. Because a person crossing the guardrail is close to the side sidewalk, a roadside pedestrian standing near the guardrail easily causes misjudgment; therefore, as shown in fig. 7, the cameras that monitor traffic on the bridge deck cooperate with the long-range camera for auxiliary verification, and a mixed-domain attention mechanism is added to the detection model. In addition, the method adds a warning-region algorithm to the personnel crossing detection: the guardrail area at the side of the bridge is found adaptively through network training and marked in the image as the warning region, as shown in fig. 6; whether a person intends to cross the guardrail is judged by feature comparison, and if someone crosses, an early warning is given. The warning-region algorithm, the bridge deck camera auxiliary verification and YOLO with the mixed-domain attention mechanism together form a cross-validation scheme, which greatly improves the accuracy of personnel crossing detection.
The invention uses the cameras monitoring the bridge 7 × 24 hours to capture personnel falling signals and send an alarm, striving for the golden five minutes of rescue and ensuring that fallen personnel can be rescued and treated in time. The system embeds computer vision algorithms in the cameras on the river bridge, comprising: 1) a personnel railing-crossing behavior monitoring module; 2) a personnel falling monitoring module; 3) a falling splash monitoring module; 4) a personnel floating detection module; 5) a rescue area prediction module.
In the invention, in a monitoring module for the behavior of people crossing a railing, three sub-modules are adopted for cross validation to judge whether the people cross the railing, and the three sub-modules are respectively as follows: a) a bridge deck roadside camera target detection submodule; b) an end-to-end detection submodule; c) and a warning region target detection submodule.
The end-to-end detection submodule and the warning-region target detection submodule collect image data with panoramic cameras able to photograph the whole bridge deck and water surface region (for example, when applied to the Nanjing Yangtze River Bridge, the panoramic cameras can be installed on the Nanbao and Beibao forts of the bridge). The end-to-end detection submodule identifies whether there is a person outside the bridge deck railing, and the warning-region target detection submodule identifies whether a person crosses the bridge deck railing; the recognition network is realized with the YOLO-Attention algorithm. The input of the end-to-end detection submodule is the whole picture and the output is whether a person is present; if so, a box is drawn to frame the person. The warning-region target detection submodule compares front and rear frames to identify whether a person climbs over the bridge railing: a box is drawn at the position where a person would stand outside the bridge railing, the area of this box is taken as the region of interest, the input of the network is an image of the whole region of interest, and the output is whether a person is detected (i.e. whether a person has crossed the railing).
The bridge deck roadside cameras are mounted on the bridge deck lamp posts (these cameras are mounted low, can see a range of about 3-5 meters, and each sees only part of the guardrail crossings); because they are closer to the bridge guardrail they capture clearer pictures, so combining the bridge deck roadside cameras with the panoramic camera reduces the recognition error rate. In the specific implementation of this embodiment, several bridge deck roadside cameras work jointly.
The invention judges whether a person climbs over the bridge railing and falls from the bridge side through cross-validation of the first three modules; if a person is detected crossing the railing and falling from the bridge side, an alarm is raised in time and rescue is called, so that the optimal rescue window is not missed. The position of the person in the water is predicted by the last two modules and reported to the rescue team, creating convenience for the rescue work, allowing the team to reach the person in the shortest time, and improving the survival rate of persons falling into the water.
Specifically, the alert zone algorithm adopted by the alert zone target detection submodule is specifically implemented as follows:
the first step is as follows: setting a background template of the alert area:
An image of the bridge deck guardrail region without any person is set as the background template; the feature map of the background template region is input into the deep learning detection network for training, and through a certain amount of training the network adaptively finds the warning region.
The second step is that: setting early warning characteristics:
An image difference operation is performed between image frames of the bridge deck guardrail with a person present and the image frames of the background template; the difference information between the image with a person and the background template is extracted as the early warning feature map and input into the detection network for training, giving the network an early warning capability. When someone climbs the guardrail beside the bridge, the features of the warning region change correspondingly into early warning features, and the detection model then gives an early warning.
The third step: screening out influence factors:
Changes of image features in the warning region can also be produced by light and rain, by occasional falling debris, by the heads or bags of tourists leaning over the railings, and by birds flying through the warning region; treating these as early warning features would cause false detections by the network. The method therefore defines features such as light, rainwater, debris, tourists' heads or bags appearing over the railings, and flying birds as influence factors that need to be screened out. First, a difference operation is performed between the image frame in which an influence-factor feature appears in the warning region and the image frame of the background template; the difference information between the influence-factor image and the background template is extracted as an influence-factor feature map and input into the detection network for training, so that the network acquires the ability to recognize influence factors. When the network detects that the video features have changed but the change is an influence factor, the influence factor is screened out.
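The background-template differencing of the first two steps can be sketched as follows; the threshold and pixel-count values are illustrative tuning assumptions, and influence-factor screening is left to the trained network as the text describes.

```python
import numpy as np

def early_warning_feature(frame, background):
    # Step two: absolute difference between the current frame and the
    # unmanned background template of the warning region.
    return np.abs(frame.astype(float) - background.astype(float))

def alert(frame, background, threshold, min_pixels):
    """Raise an early warning when enough warning-region pixels differ
    strongly from the background template. threshold and min_pixels are
    illustrative values; screening of influence factors (birds, rain,
    glare) is handled by the trained detection network in the method."""
    diff = early_warning_feature(frame, background)
    return int((diff > threshold).sum()) >= min_pixels

# Usage: a person-sized blob appears against an empty background template.
bg = np.zeros((10, 10))
person = bg.copy()
person[2:8, 4:7] = 200.0
```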
Specifically, in step S3, if the personnel falling signal is not generated, the method returns to step S2 to detect again whether a person crosses the fence; if the railing boundary-crossing signal is not generated again, the earlier railing boundary-crossing signal is judged to have been a misjudgment. If the railing boundary-crossing signal is generated again, step S3 is executed once more to detect whether a person falls; if the personnel falling signal is still not generated, the railing boundary-crossing signal is judged to be a misjudgment.
When a person jumps into the river over the guardrail, there is a free-fall process, as shown in fig. 8; if it is detected and alarmed at this point, a rescue team can usually arrive and save the person quickly. Because a falling person moves very fast, the method uses a detection model with an added spatial-domain attention mechanism for personnel falling detection; if a personnel falling event is detected, the system sends a personnel falling signal and proceeds to splash detection. If a personnel crossing signal has been received and the person has disappeared from the crossing area, but no personnel falling is detected, a missed detection must be considered: the previous video stream is read again and personnel falling detection is repeated; if the falling event is still not detected, the personnel crossing detection is considered a false detection.
Specifically, in step S4, if the falling splash signal is not generated, the method returns to step S3 to detect again whether a person falls; if personnel falling is not detected again, the earlier personnel falling signal is judged to have been a misjudgment. If the personnel falling signal is generated again, step S4 is executed once more to detect whether there is a falling splash; if the falling splash signal is still not generated, the personnel falling signal is judged to be a misjudgment.
When a person falls into the river, a large splash is usually raised, as shown in fig. 9, so the method also detects the splash generated when a person falls into the river. Splashes on the river surface can also be caused by other factors; although their characteristics clearly differ from those of a falling-person splash, the possibility of false detection still exists, so to reduce false detections and save system resources, falling splash detection is performed only on the premise that the system has already received a personnel falling signal. During splash detection, some spray may occasionally appear on the river surface, but its characteristics clearly differ from a falling-person splash, so a channel-domain attention mechanism is added to the detection model for river-surface splash detection. If a falling splash is detected after a personnel fall, the system immediately calibrates the splash position as the starting point for rescue-area prediction, immediately raises an alarm to request rescue from the rescue forces on the river, and proceeds to the next step, personnel floating detection.
Specifically, in step S5, after a person falls into the water, some sink, some float on the surface and struggle, and some float on the surface after drowning. To facilitate subsequent rescue, the method detects persons floating on the river surface to determine the position to be rescued and reports it to the rescuers. Since only a few body parts of a fallen person usually float above the surface, their features are relatively sparse, as shown in fig. 10, and the person also drifts with the current, so the system uses a detection model with a mixed-domain attention mechanism for floating detection to improve accuracy. If no floating person is detected after the fall, the person is judged to be submerged, and the next step, rescue-area prediction, is required.
Specifically, in step S6, the invention adopts a rescue-area prediction algorithm to predict the approximate position of the person in the water by predicting the drift trajectory. Although the river current appears to flow straight, the drift trajectory of a person in the water is not straight, being affected by many factors, so naive prediction has low accuracy. The method therefore takes the wind speed and direction and the water flow speed and direction at the fall location into account to establish the motion equation of the drifting target:

x(t + Δt) = x(t) + (V_w + V_c) · Δt

wherein V_c is the wind speed field and V_w is the water flow velocity field; x(t) is the position of the person in the water at time t, and x(t + Δt) is the position after time Δt. The method obtains drift data under various parameter settings from simulated floating experiments with a dummy in the river, and fits the personnel drift trajectory to the obtained data. Wind and flow speeds are recorded as discrete data, and the recorded data are limited; to further improve the accuracy of the predicted trajectory, the method divides the sampling intervals of the wind speed field and the water flow velocity field into smaller intervals and applies Lagrange interpolation within them to obtain the unrecorded values, further reducing the prediction error.
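The interpolation over a small sampling interval can be sketched as follows; the hourly flow-speed samples are hypothetical values, not measured data.

```python
def lagrange_interpolate(xs, ys, x):
    """Lagrange interpolation over a small sampling interval, used to fill
    in unrecorded wind-speed / flow-speed values between the discrete
    measurements (illustrative implementation)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if i != j:
                term *= (x - xj) / (xi - xj)  # Lagrange basis polynomial factor
        total += term
    return total

# Usage: flow speed sampled hourly; estimate the speed at t = 1.5 h.
times = [0.0, 1.0, 2.0, 3.0]
speeds = [1.2, 1.5, 1.4, 1.1]   # hypothetical measured flow speeds (m/s)
est = lagrange_interpolate(times, speeds, 1.5)
```

By construction the interpolant passes exactly through every recorded sample, so it only fills the gaps between measurements without altering them.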
Because a manual experiment is difficult and time-consuming to carry out, a Monte Carlo simulation method is used to simulate personnel drift trajectories on the basis of the fit to the manual experiment data and the Lagrange interpolation. The Monte Carlo method sets up a random process, continuously generates time series, and studies the distribution of the process by computing statistics over those series. Concretely, a drifting person is abstracted as a particle, and each particle is given influence factors reflecting the river wind speed, water flow speed and similar factors. The particle is then replicated massively to generate the drift of a particle swarm, and finally the drift trajectory of the centre of the particle swarm is taken as the prediction of the personnel drift trajectory.
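The particle-swarm simulation can be sketched as follows; the motion update follows x(t + Δt) = x(t) + (V_w + V_c)·Δt plus a random influence factor, and the noise level, particle count and velocities are illustrative assumptions rather than the calibrated values fitted from the dummy experiments.

```python
import numpy as np

def simulate_drift(x0, v_water, v_wind, dt, steps, n_particles, sigma, seed=0):
    """Monte Carlo sketch of drift prediction: each particle advances by
    (V_w + V_c)*dt plus Gaussian noise standing in for the influence
    factors, and the swarm centre is the predicted position."""
    rng = np.random.default_rng(seed)
    pos = np.tile(np.asarray(x0, dtype=float), (n_particles, 1))
    for _ in range(steps):
        noise = rng.normal(0.0, sigma, size=pos.shape)
        pos += (np.asarray(v_water) + np.asarray(v_wind) + noise) * dt
    return pos.mean(axis=0)    # centre of the particle swarm

# Usage: splash point at the origin, flow 1.0 m/s downstream, light wind,
# predicted position after 60 one-second steps.
pred = simulate_drift([0.0, 0.0], [1.0, 0.0], [0.1, 0.05], dt=1.0,
                      steps=60, n_particles=500, sigma=0.05)
```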
In this embodiment, the four modules of personnel crossing detection, personnel falling detection, falling splash detection and personnel floating detection all use the YOLO algorithm with the mixed attention mechanism; the YOLO algorithm is the core technology of the method, as shown in fig. 4, and is summarized as follows:
1) principle of operation
The input image of the detection model has a fixed size (the method uses a 608 × 608 input). Features are extracted through the DarkNet-53 network structure, detection is carried out on feature maps at three scales to obtain three prediction outputs y1, y2 and y3, the final prediction result is obtained through the Non-Maximum Suppression (NMS) algorithm, and the detected target positions and category information are output. The basic component of YOLO is the CBL, i.e. a Conv layer, a BN (Batch Normalization) layer and a Leaky ReLU activation layer; the entire network contains no pooling layers and no fully connected layers. The Res unit is a residual unit block, which alleviates the degradation problem of the network model. DarkNet-53 serves as the backbone of the detection model; its main component is ResX, composed of one CBL and X residual components, and it is also a large component of YOLO. The CBL in front of each Res module performs downsampling, so after the 5 Res stages the feature map sizes are 608 × 608 -> 304 × 304 -> 152 × 152 -> 76 × 76 -> 38 × 38 -> 19 × 19. The residual components in ResX borrow the residual structure of the ResNet network, allowing the network to be built deeper. Upsampling uses nearest-neighbour interpolation by default and serves to enlarge the feature map so as to obtain prediction feature maps of different scales. Concat is a tensor-splicing operation that splices the upsampled result of a DarkNet middle layer with a later layer for dimension expansion; add, unlike Concat, directly adds two tensors without expanding the dimension.
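The NMS step that merges the three scale outputs can be sketched as the standard greedy algorithm; the boxes, scores and IoU threshold below are illustrative.

```python
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box, drop
    all boxes that overlap it above the IoU threshold, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Usage: two heavily overlapping detections of one person and one distant box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```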
2) Feature extraction network
As the backbone network of YOLO, the network consists essentially of a series of 1 × 1 and 3 × 3 convolutional layers, each followed by a BN layer and a Leaky ReLU layer, for a total of 53 layers, hence the name DarkNet-53. The network borrows the idea of ResNet residuals and uses a large number of residual "skip-layer connections"; each residual module consists of a 1 × 1 convolutional layer, a 3 × 3 convolutional layer and a skip connection. This solves the training difficulty brought by deep networks, and, to reduce the negative effect of pooling on gradients, the YOLO used in the method abandons pooling and uses the stride of the convolutions to realize downsampling.
3) Loss function
The loss function is particularly important for target detection; the loss function of YOLO as used in the method is analyzed below.
Suppose the number of grid cells is S and the number of candidate boxes generated by each grid cell is B; each candidate box finally obtains a corresponding bounding box through the network, so the number of bounding boxes is S × B.

First, the meaning of 1_ij^obj: if the j-th anchor box of the i-th grid cell is responsible for the current object, then 1_ij^obj = 1, otherwise it is 0; correspondingly, 1_ij^noobj means that the j-th anchor box of the i-th grid cell is not responsible for the target.

The parameter confidence C_ij^* is used in training and represents the true value; the value of C_ij^* depends on whether the bounding box of the grid cell is responsible for predicting a certain object: if it is responsible, C_ij^* = 1, otherwise C_ij^* = 0.
Next, each term of the loss function is analyzed, first the center coordinate error, as follows:
β_coord Σ_{i=1}^{S} Σ_{j=1}^{B} 1_ij^obj [(x_ij − x_ij^*)² + (y_ij − y_ij^*)²]

The meaning of the formula is that when the j-th anchor box of the i-th grid cell is responsible for a certain real target, the centre coordinates of the prediction box are compared with those of the real box to obtain the centre coordinate error; wherein x_ij denotes the predicted value of the centre x coordinate and x_ij^* the true value of the centre x coordinate; y_ij denotes the predicted value of the centre y coordinate and y_ij^* the true value of the centre y coordinate.
Next is the width-height error, as follows:
$$L_{wh} = \beta_{coord} \sum_{i=0}^{S-1} \sum_{j=0}^{B-1} 1_{ij}^{obj} \left[ \left(\sqrt{w_{ij}} - \sqrt{w_{ij}^*}\right)^2 + \left(\sqrt{h_{ij}} - \sqrt{h_{ij}^*}\right)^2 \right]$$
The meaning of the formula is that when the jth anchor of the ith grid is responsible for a certain real target, the width and height of the generated prediction box are compared with those of the real box to compute the width-height error. Here $w_{ij}$ represents the predicted anchor box width and $w_{ij}^*$ the actual anchor box width; $h_{ij}$ represents the predicted anchor box height and $h_{ij}^*$ the actual anchor box height.
Next is the confidence error, which is expressed using cross entropy; whether or not the anchor is responsible for a certain target, a confidence error is calculated, as follows:
$$L_{conf}^{obj} = -\alpha_{obj} \sum_{i=0}^{S-1} \sum_{j=0}^{B-1} 1_{ij}^{obj} \left[ C_{ij}^* \log C_{ij} + \left(1 - C_{ij}^*\right) \log\left(1 - C_{ij}\right) \right]$$

$$L_{conf}^{noobj} = -\alpha_{noobj} \sum_{i=0}^{S-1} \sum_{j=0}^{B-1} 1_{ij}^{noobj} \left[ C_{ij}^* \log C_{ij} + \left(1 - C_{ij}^*\right) \log\left(1 - C_{ij}\right) \right]$$
wherein $C_{ij}$ represents the predicted confidence and $C_{ij}^*$ the true confidence; $\alpha_{noobj}$ denotes the weight when no target exists, and $\alpha_{obj}$ the weight when a target exists.
Next is the classification error, which also uses cross entropy as the loss function. When the jth anchor box of the ith grid is responsible for a real target, the bounding box generated by that anchor box contributes to the classification loss, as shown in the following formula:
$$L_{cls} = -\sum_{i=0}^{S-1} \sum_{j=0}^{B-1} 1_{ij}^{obj} \sum_{c \in classes} P_{ij}^*(c) \log P_{ij}(c)$$
wherein $c \in classes$ denotes a certain class c among the overall classes, $P_{ij}$ represents the predicted classification probability, and $P_{ij}^*$ the true classification probability;
In summary, the overall loss function of the YOLO used in this method is obtained as follows:
$$\begin{aligned} L ={}& \beta_{coord} \sum_{i=0}^{S-1} \sum_{j=0}^{B-1} 1_{ij}^{obj} \left[ \left(x_{ij} - x_{ij}^*\right)^2 + \left(y_{ij} - y_{ij}^*\right)^2 + \left(\sqrt{w_{ij}} - \sqrt{w_{ij}^*}\right)^2 + \left(\sqrt{h_{ij}} - \sqrt{h_{ij}^*}\right)^2 \right] \\ &- \alpha_{obj} \sum_{i=0}^{S-1} \sum_{j=0}^{B-1} 1_{ij}^{obj} \left[ C_{ij}^* \log C_{ij} + \left(1 - C_{ij}^*\right) \log\left(1 - C_{ij}\right) \right] \\ &- \alpha_{noobj} \sum_{i=0}^{S-1} \sum_{j=0}^{B-1} 1_{ij}^{noobj} \left[ C_{ij}^* \log C_{ij} + \left(1 - C_{ij}^*\right) \log\left(1 - C_{ij}\right) \right] \\ &- \sum_{i=0}^{S-1} \sum_{j=0}^{B-1} 1_{ij}^{obj} \sum_{c \in classes} P_{ij}^*(c) \log P_{ij}(c) \end{aligned}$$
wherein $x_{ij}$ represents the predicted center x coordinate and $x_{ij}^*$ its true value; $y_{ij}$ represents the predicted center y coordinate and $y_{ij}^*$ its true value; $w_{ij}$ represents the predicted anchor box width and $w_{ij}^*$ the actual anchor box width; $h_{ij}$ represents the predicted anchor box height and $h_{ij}^*$ the actual anchor box height; $C_{ij}$ represents the predicted confidence and $C_{ij}^*$ the true confidence; $c \in classes$ denotes a certain class c among the overall classes, $P_{ij}$ represents the predicted classification probability and $P_{ij}^*$ the true classification probability; $\beta_{coord}$ represents the coordinate weight, $\alpha_{noobj}$ the weight when there is no target, and $\alpha_{obj}$ the weight when there is a target.
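To make the combined loss concrete, here is a minimal pure-Python sketch of a YOLO-style loss over a flat list of anchors. The function name and the default weight values are illustrative assumptions, not values from the patent; the sketch only mirrors the structure of the terms above (squared coordinate errors for responsible anchors, cross-entropy for confidence and class).

```python
import math

def yolo_loss(pred, truth, obj_mask, b_coord=5.0, a_obj=1.0, a_noobj=0.5):
    """Toy YOLO-style loss over a flat list of anchor predictions.

    pred / truth: lists of dicts with keys x, y, w, h, conf, cls (cls is a
    probability list); obj_mask[k] is 1 if anchor k is responsible for a
    target.  The weights b_coord, a_obj, a_noobj are illustrative defaults.
    """
    eps = 1e-9
    loss = 0.0
    for p, t, obj in zip(pred, truth, obj_mask):
        if obj:
            # centre-coordinate error and width/height error (sqrt on w, h)
            loss += b_coord * ((p["x"] - t["x"]) ** 2 + (p["y"] - t["y"]) ** 2)
            loss += b_coord * ((math.sqrt(p["w"]) - math.sqrt(t["w"])) ** 2
                               + (math.sqrt(p["h"]) - math.sqrt(t["h"])) ** 2)
            # confidence and classification cross-entropy for responsible anchors
            loss += -a_obj * math.log(p["conf"] + eps)
            loss += -sum(tc * math.log(pc + eps)
                         for pc, tc in zip(p["cls"], t["cls"]))
        else:
            # only the no-object confidence term applies
            loss += -a_noobj * math.log(1.0 - p["conf"] + eps)
    return loss
```

A perfect prediction drives every term to (numerically) zero, while shifting the centre or lowering the confidence raises the loss, matching the behaviour the formulas above describe.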
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A bridge side falling behavior identification method based on depth understanding is characterized by comprising the following steps:
s1, collecting video data of a panoramic camera of the monitoring bridge beside the bridge at the edge of the river in real time, and preprocessing the video data;
s2, pre-judging whether a pedestrian falls from the side of the bridge or not by using the pre-processed video data; using the fence at the bridge edge and the periphery of the fence as an interest domain, identifying whether a person crosses the fence by using a trained YOLO-Attention model, and cross-verifying whether the person crosses the fence by using a monitoring camera on a street lamp of the bridge floor and a warning region algorithm; if the people are recognized to cross the fence, generating a railing boundary crossing signal, and entering the step S3; otherwise, returning to the step S1;
s3, detecting whether a person falls in the bridge edge fence or not; taking the area under the bridge and the river as the interest areas, detecting whether a person falls by using a trained YOLO-Attention model, if the person falls, sending a person falling signal, and entering the step S4;
s4, detecting whether falling water bloom exists on the river surface under the bridge; setting the river surface under the bridge as an interest area, detecting whether falling water bloom generated after falling of a person occurs by using a trained YOLO-Attention model, if the falling water bloom occurs, judging that the person falls into water, sending a falling water bloom signal, and entering the step S5;
s5, detecting whether a person floats on the river surface by using the trained YOLO-Attention model, and if detecting that the person floats, sending the position of the floating person to a rescue worker; if the person is not detected to float, judging that the person sinks into the river, and entering step S6;
and S6, constructing a water flow model to predict the approximate position of the person falling into the water according to the water flow speed in the river and the position of falling water bloom, and sending the predicted position information to the rescue personnel.
2. The method for identifying bridge side falling behavior based on depth understanding of claim 1, wherein in step S1 the video data is preprocessed as follows: an adaptive defogging algorithm judges whether each video image needs defogging; if the TBV (total bounded variation) of the image is judged to be greater than a set threshold, the image does not need defogging, otherwise defogging is applied.
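As a rough illustration of a bounded-variation criterion (the patent does not give the exact TBV formula, so the definition below, a plain sum of neighbour differences, is an assumption): hazy frames are smoother and score lower, matching the claim's logic of defogging only when the score falls below the threshold.

```python
def total_variation(img):
    """Total-variation proxy for the TBV criterion: sum of absolute
    differences between horizontally and vertically adjacent pixels.
    This is an illustrative stand-in, not the patent's exact definition."""
    h = sum(abs(row[j + 1] - row[j]) for row in img for j in range(len(row) - 1))
    v = sum(abs(img[i + 1][j] - img[i][j])
            for i in range(len(img) - 1) for j in range(len(img[0])))
    return h + v

sharp = [[0, 255], [255, 0]]     # high-contrast patch: large variation
hazy  = [[120, 130], [125, 128]] # smooth, low-contrast patch: small variation
# total_variation(sharp) > total_variation(hazy), so only `hazy` would be defogged
```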
3. The method for identifying the bridge-side falling behavior based on depth understanding of claim 1, wherein a mixed domain Attention mechanism is added to the YOLO-Attention model in steps S2 to S5 for improving the detection accuracy.
4. The method as claimed in claim 3, wherein a mixed domain attention mechanism is used, which combines a channel attention mechanism and a spatial attention mechanism, and in this mechanism, the feature map passes through a total of two models, namely the channel attention model and the spatial attention model, and then the reconstructed feature map is output.
5. The bridge side falling behavior recognition method based on depth understanding of claim 4, wherein the channel domain attention mechanism in the hybrid attention model is implemented as follows:
the first step, extrusion operation, compressing the two-dimensional features of each channel into a real number through global pooling, and the specific operation formula is as follows:
$$z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j)$$
wherein $F_{sq}(\cdot)$ is the squeeze function, W and H are respectively the width and height of the feature map to be processed, $u_c(i, j)$ is the element of the c-th channel of the feature map at coordinate (i, j), and $z_c$ denotes the output feature of the c-th channel after squeezing;
and secondly, exciting operation, namely generating a weight value for each characteristic channel through the parameter W, and outputting the weight values with the same number as the input characteristics, wherein the specific operation formula is as follows:
s=Fex(z,W)=σ(g(z,W))=σ(W2δ(W1z))
wherein $F_{ex}$ represents the excitation operation; z is the output of the squeeze operation, a tensor of size 1 × 1 × C, where C is the number of channels of the feature map; $W_1$ and $W_2$ are weights; δ denotes the ReLU activation function and σ the Sigmoid activation function; s is the output of the excitation function, describing the weight of the feature map;
thirdly, calibrating feature weight, weighting the weight value obtained by the excitation operation to each channel feature, multiplying the weight coefficient by channels one by one, and completing an attention mechanism in the channel dimension, wherein a specific operation formula is as follows:
$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$

wherein $F_{scale}(\cdot)$ represents the channel-wise re-calibration (scaling) function, $\tilde{x}_c$ represents the output feature of the c-th channel, $s_c$ represents the weight of the c-th channel, and $u_c$ represents the feature of the c-th channel of the input feature map.
6. The method for identifying bridge side falling behavior based on depth understanding of claim 4, wherein the spatial domain attention mechanism in the hybrid attention model is implemented as follows:
forming a feature map by using the spatial structure of the features, and modeling by using the relation of the features on the space; firstly, performing maximum pooling and average pooling operations based on channel dimensions on feature maps to obtain two feature maps which respectively represent maximum pooling features and average pooling features; then applying the convolution layer and the activation function to obtain a final space attention diagram; the specific operation formula is as follows:
$$M_S(F) = \sigma\left(f^{7 \times 7}\left(\left[AvgPool(F); MaxPool(F)\right]\right)\right) = \sigma\left(f^{7 \times 7}\left(\left[F_{avg}^{s}; F_{max}^{s}\right]\right)\right)$$

where σ denotes the Sigmoid activation function and $f^{7 \times 7}$ a convolutional layer with kernel size 7 × 7; $M_S(F)$ is the final spatial attention feature map; AvgPool(F) and MaxPool(F) are, respectively, the average pooling and maximum pooling operations on the feature map along the channel dimension; $F_{max}^{s}$ is the maximum-pooled feature and $F_{avg}^{s}$ the average-pooled feature.
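The two channel-wise pooling maps of this claim can be sketched in a few lines of plain Python (the 7 × 7 convolution and sigmoid that follow are omitted; the function name and the toy feature map are illustrative):

```python
def channel_pool(feat):
    """Channel-wise max and average pooling of a C x H x W feature map,
    producing the two H x W maps that are concatenated before the 7x7 conv."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    max_map = [[max(feat[c][i][j] for c in range(C)) for j in range(W)]
               for i in range(H)]
    avg_map = [[sum(feat[c][i][j] for c in range(C)) / C for j in range(W)]
               for i in range(H)]
    return max_map, avg_map

# Two 2x2 channels (toy values).
feat = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 0.0], [1.0, 2.0]]]
mx, av = channel_pool(feat)
# mx == [[5.0, 2.0], [3.0, 4.0]], av == [[3.0, 1.0], [2.0, 3.0]]
```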
7. The method for identifying the bridge side falling behavior based on depth understanding of claim 1, wherein in step S2, the warning region algorithm is as follows:
the first step is as follows: setting a background template of the alert area:
setting an image of the bridge floor guardrail region without a person as the background template, and inputting the feature map of the background template region into the deep learning detection network for training, so that the network can adaptively find the warning region;
the second step is that: setting early warning characteristics:
performing image difference operation on image frames of people on the bridge floor guardrail and image frames of the background template, extracting difference information of the image templates and the background template of people on the guardrail as an early warning characteristic diagram, and inputting the early warning characteristic diagram into a detection network for training to enable the network to obtain an early warning effect; when someone climbs the bridge guardrail, the characteristics of the warning area correspondingly change into early warning characteristics, and at the moment, the detection model gives early warning;
the third step: screening out influence factors:
due to the influence of optical flow and rainwater, when garbage or debris thrown from the bridge floor, or birds or other flying objects, pass through the warning area, the image features change, and the network may take these as early-warning features, causing false detections; therefore, the optical-flow-and-rain features, the garbage-and-debris features, and the features of birds or other flying objects crossing the warning area are defined as influence factors and need to be screened out; first, a difference operation is performed between the image frames in which the influence-factor features appear in the warning area and the background template frames, the difference information between the influence-factor images and the background template is extracted as an influence-factor feature map, and this is input into the detection network for training, so that the network gains the ability to recognize influence factors; when the network detects that the video features have changed but only as an influence factor, the influence factor is screened out.
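The image-difference step at the heart of this warning-region scheme can be sketched as a simple thresholded background subtraction. This is a hedged stand-in for the extraction of the early-warning feature map, not the patent's trained network; the function name and threshold value are illustrative.

```python
def difference_mask(frame, background, threshold=30):
    """Absolute frame-vs-background difference, thresholded to a binary mask.
    Pixels that differ strongly from the person-free template light up when
    something enters the warning region."""
    return [[1 if abs(f - b) > threshold else 0
             for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

background = [[100, 100], [100, 100]]  # person-free template (toy grayscale)
frame      = [[100, 180], [100, 100]]  # one pixel changed, e.g. a climber
mask = difference_mask(frame, background)
# mask == [[0, 1], [0, 0]]
```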
8. The bridge-side falling behavior recognition method based on depth understanding of claim 1, wherein in step S4, if no falling splash signal is generated, the method returns to step S3 to detect whether a person falls again, and if no person falls is detected, the person falling signal is determined to be misjudged; if the personnel falling signal is still generated, the step S4 is executed again to detect whether falling splash exists, and if the personnel falling signal is not generated, the personnel falling signal is judged to be misjudgment.
9. The bridge side falling behavior recognition method based on depth understanding of claim 1, wherein in step S6, the method for predicting the personnel drifting trajectory is as follows:
establishing the equation of motion of the drifting target, taking into account the wind speed and the water-flow velocity at the moment the person falls into the water:
$$x(t + \Delta t) = x(t) + \left(V_w + V_c\right) \Delta t$$
wherein $V_c$ is the wind speed field and $V_w$ is the water-flow velocity field; x(t) is the position of the person in the water at time t, and x(t + Δt) is the position after a time Δt; drift data under various parameter settings are obtained by carrying out simulated floating experiments in the river with a dummy, and the person's drift trajectory is fitted from the obtained data; unknown values of the wind speed field and the water-flow velocity field are obtained by applying Lagrange interpolation, further reducing the prediction error; on this basis, a Monte Carlo simulation method is applied to simulate the person's drift trajectory.
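The Monte Carlo drift simulation of this claim can be sketched as repeated noisy position updates driven by the water-current and wind velocity fields. Everything below (function name, constant velocity fields, noise level) is an illustrative assumption; the patent fits its trajectory from dummy-float experiments and Lagrange-interpolated field values.

```python
import random

def simulate_drift(x0, v_water, v_wind, dt, steps, sigma, trials, seed=0):
    """Monte Carlo sketch of step S6: the position is advanced by the
    water-current and wind velocities, each perturbed with Gaussian noise to
    model field uncertainty; the trial mean is the predicted position."""
    rng = random.Random(seed)
    finals = []
    for _ in range(trials):
        x, y = x0
        for _ in range(steps):
            x += (v_water[0] + v_wind[0] + rng.gauss(0.0, sigma)) * dt
            y += (v_water[1] + v_wind[1] + rng.gauss(0.0, sigma)) * dt
        finals.append((x, y))
    n = len(finals)
    return (sum(p[0] for p in finals) / n, sum(p[1] for p in finals) / n)

# Toy fields: 1.0 m/s downstream current, light cross-wind (illustrative).
pred = simulate_drift(x0=(0.0, 0.0), v_water=(1.0, 0.2), v_wind=(0.3, 0.0),
                      dt=1.0, steps=10, sigma=0.05, trials=200)
# pred is close to the noise-free drift (13.0, 2.0)
```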
CN202111088471.9A 2021-09-16 2021-09-16 Bridge side falling behavior identification method based on depth understanding Active CN114267082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111088471.9A CN114267082B (en) 2021-09-16 2021-09-16 Bridge side falling behavior identification method based on depth understanding


Publications (2)

Publication Number Publication Date
CN114267082A true CN114267082A (en) 2022-04-01
CN114267082B CN114267082B (en) 2023-08-11

Family

ID=80824625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111088471.9A Active CN114267082B (en) 2021-09-16 2021-09-16 Bridge side falling behavior identification method based on depth understanding

Country Status (1)

Country Link
CN (1) CN114267082B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115512098A (en) * 2022-09-26 2022-12-23 重庆大学 Electronic bridge inspection system and inspection method
CN115937506A (en) * 2023-03-09 2023-04-07 南京邮电大学 Method, system, device and medium for positioning bridge side falling point hole position information
CN116740649A (en) * 2023-08-07 2023-09-12 山东科技大学 Deep learning-based real-time detection method for behavior of crewman falling into water beyond boundary

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012032198A2 (en) * 2010-09-11 2012-03-15 Nieto Leon Jose Set of elements and parts for the assembly, extension and rapid modular conversion of vessels, rafts, floating gangways and bridges and temporary floating structures with multiple floats, in particular for aquatic emergencies
CN106828387A (en) * 2017-02-19 2017-06-13 谢永航 Automatic seeking help device and method are quickly positioned based on GPRS network vehicle water falling
US20180307912A1 (en) * 2017-04-20 2018-10-25 David Lee Selinger United states utility patent application system and method for monitoring virtual perimeter breaches
CN109410497A (en) * 2018-11-20 2019-03-01 江苏理工学院 A kind of monitoring of bridge opening space safety and alarm system based on deep learning
CN110148283A (en) * 2019-05-16 2019-08-20 安徽天帆智能科技有限责任公司 It is a kind of to fall water monitoring system in real time based on convolutional neural networks
CN110778265A (en) * 2019-10-08 2020-02-11 赵奕焜 Child safety protection artificial intelligence door and window system based on deep learning model
CN111626162A (en) * 2020-05-18 2020-09-04 江苏科技大学苏州理工学院 Overwater rescue system based on space-time big data analysis and drowning warning situation prediction method
CN111898514A (en) * 2020-07-24 2020-11-06 燕山大学 Multi-target visual supervision method based on target detection and action recognition
AU2020102906A4 (en) * 2020-10-20 2020-12-17 Zhan, Jinyu MISS A drowning detection method based on deep learning
CN113044184A (en) * 2021-01-12 2021-06-29 桂林电子科技大学 Deep learning-based water rescue robot and drowning detection method
WO2021139069A1 (en) * 2020-01-09 2021-07-15 南京信息工程大学 General target detection method for adaptive attention guidance mechanism
CN113449675A (en) * 2021-07-12 2021-09-28 西安科技大学 Coal mine personnel border crossing detection method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FENGYUAN JIANG: "Collision failure risk analysis of falling object on subsea pipelines based on machine learning scheme", 《ENGINEERING FAILURE ANALYSIS》, pages 1 - 22 *
成孝刚 等: "一种基于有界变分的树叶锯齿特征提取算法研究", 《数据采集与处理》, vol. 34, no. 1, pages 167 - 174 *
陈晗 等: "一种基于倒影图像检测的水域落水人员判断方法", 《电脑知识与技术》, vol. 14, no. 26, pages 175 - 180 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115512098A (en) * 2022-09-26 2022-12-23 重庆大学 Electronic bridge inspection system and inspection method
CN115512098B (en) * 2022-09-26 2023-09-01 重庆大学 Bridge electronic inspection system and inspection method
CN115937506A (en) * 2023-03-09 2023-04-07 南京邮电大学 Method, system, device and medium for positioning bridge side falling point hole position information
CN116740649A (en) * 2023-08-07 2023-09-12 山东科技大学 Deep learning-based real-time detection method for behavior of crewman falling into water beyond boundary
CN116740649B (en) * 2023-08-07 2023-11-03 山东科技大学 Deep learning-based real-time detection method for behavior of crewman falling into water beyond boundary

Also Published As

Publication number Publication date
CN114267082B (en) 2023-08-11

Similar Documents

Publication Publication Date Title
CN111368687B (en) Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation
Aboah A vision-based system for traffic anomaly detection using deep learning and decision trees
CN114267082B (en) Bridge side falling behavior identification method based on depth understanding
KR20200071799A (en) object recognition and counting method using deep learning artificial intelligence technology
CN111186379B (en) Automobile blind area dangerous object alarm method based on deep learning
CN107220603A (en) Vehicle checking method and device based on deep learning
CN111274886B (en) Deep learning-based pedestrian red light running illegal behavior analysis method and system
KR102122850B1 (en) Solution for analysis road and recognition vehicle license plate employing deep-learning
Chebrolu et al. Deep learning based pedestrian detection at all light conditions
CN110569755B (en) Intelligent accumulated water detection method based on video
CN116343077A (en) Fire detection early warning method based on attention mechanism and multi-scale characteristics
CN111178178B (en) Multi-scale pedestrian re-identification method, system, medium and terminal combined with region distribution
KR102186974B1 (en) Smart cctv system for analysis of parking
Lin Automatic recognition of image of abnormal situation in scenic spots based on Internet of things
CN112861762B (en) Railway crossing abnormal event detection method and system based on generation countermeasure network
CN113920585A (en) Behavior recognition method and device, equipment and storage medium
CN104200202B (en) A kind of upper half of human body detection method based on cumulative perceptron
Agarwal et al. Camera-based smart traffic state detection in india using deep learning models
CN109711313A (en) It is a kind of to identify the real-time video monitoring algorithm that sewage is toppled over into river
BOURJA et al. Real time vehicle detection, tracking, and inter-vehicle distance estimation based on stereovision and deep learning using YOLOv3
KR102143073B1 (en) Smart cctv apparatus for analysis of parking
CN111339934A (en) Human head detection method integrating image preprocessing and deep learning target detection
JP2019192201A (en) Learning object image extraction device and method for autonomous driving
CN113963438A (en) Behavior recognition method and device, equipment and storage medium
CN113793069A (en) Urban waterlogging intelligent identification method of deep residual error network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant