CN116453069A - Expressway casting object detection method based on cascade difference perception model

Expressway casting object detection method based on cascade difference perception model

Info

Publication number
CN116453069A
Authority
CN
China
Prior art keywords
frame
foreground
difference
image
foreground candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310274462.1A
Other languages
Chinese (zh)
Inventor
张星明 (ZHANG Xingming)
黄晓丹 (HUANG Xiaodan)
林育蓓 (LIN Yubei)
王昊翔 (WANG Haoxiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202310274462.1A
Publication of CN116453069A
Legal status: Pending (current)

Classifications

    • G06V 20/54 Surveillance or monitoring of activities, e.g. for recognising suspicious objects, of traffic, e.g. cars on the road, trains or boats
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods for neural networks
    • G06T 7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G06V 10/764 Image or video recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/82 Image or video recognition using pattern recognition or machine learning, using neural networks
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30232 Surveillance
    • G06T 2207/30236 Traffic on road, railway or crossing
    • Y02T 10/40 Engine management systems


Abstract

The invention discloses a highway casting object detection method based on a cascade difference perception model, comprising the following steps: perform background modeling on the expressway video to extract the foreground, then apply mathematical morphology processing to obtain foreground candidate frames; remove foreground candidate frames whose area is too small or too large, that are not on the road surface, or that a YOLO network model recognizes as vehicles or people; use an IOU target tracking algorithm to retain the remaining foreground candidate frames that stay stationary for more than 2 s; input the median image of the current frame and the background image into a difference perception model to obtain difference regions and generate difference frames; retain the foreground candidate frames matched to difference frames by an optimization algorithm; and input the foreground candidate frame images into a road-surface/non-road-surface binary classification network, judge the foreground candidate frames classified as non-road-surface to be casting object frames, and mark them in the image. The invention extracts the foreground through background modeling and then screens out casting objects through a series of condition judgments, and its effectiveness has been verified in real scenes.

Description

Expressway casting object detection method based on cascade difference perception model
Technical Field
The invention relates to the technical field of target detection, in particular to a highway casting object detection method based on a cascade difference perception model.
Background
Casting object detection is an important branch of the object detection field. Using digital images to detect casting objects is of great significance in production and daily life, especially in the expressway context. Using surveillance video from expressway cameras to find casting objects promptly and automatically reduces road safety hazards as far as possible and improves expressway driving safety.
At present, expressway casting object detection has been little studied at home and abroad, and most existing methods are deep learning methods. Deep-learning-based methods require large datasets for training, yet they struggle to detect casting objects of categories that never appeared in the training set. In a real environment the variety of casting objects is so large that a training set containing every kind of casting object cannot be built even in theory. Meanwhile, in the various surveillance video scenes, illumination changes, camera shake and water marks left on rainy road surfaces easily cause false detections. A highway casting object detection method that needs no data-driven training and is robust in detection is therefore needed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a highway casting object detection method based on a cascade difference perception model, so as to improve casting object detection accuracy and robustness and reduce the frequency of false detections.
The aim of the invention can be achieved by adopting the following technical scheme:
a highway casting detection method based on a cascade difference perception model comprises the following steps:
S1, perform background modeling on the expressway surveillance video, and subtract the background image generated by modeling from the current frame to obtain a foreground binary image;
S2, apply mathematical morphology processing to the foreground binary image to obtain foreground candidate frames;
S3, from the obtained foreground candidate frames, remove those whose area is too small or too large and those not on the road surface, and use a YOLO network model to detect people and vehicles and eliminate the foreground candidate frames identified as people or vehicles;
S4, track and match each foreground candidate frame with an IOU target tracking algorithm, and retain the foreground candidate frames that remain stationary for 2 s and above;
S5, generate a median image from the most recent 5 frames, and input the median image and the background image into a difference perception model to obtain a difference binary image of the foreground candidate frame regions and a difference binary image of the full-image region, respectively;
S6, perform a bitwise AND between the difference binary image of the foreground candidate frame regions and the difference binary image of the full-image region, and extract difference frames from the resulting binary image;
S7, compute the IOU values between the difference frames and each foreground candidate frame, input them into an optimization algorithm that outputs whether the foreground candidate frame matches a difference frame, and retain the foreground candidate frames matched to difference frames;
S8, input the foreground candidate frames into a road-surface/non-road-surface binary classification network, judge the foreground candidate frames classified as non-road-surface to be casting object frames, and mark them in the picture.
Further, the step S1 is as follows:
KNN-model background modeling is adopted: the expressway surveillance video is input into the KNN model to obtain the background image L generated by KNN model modeling, which contains only the background content of the expressway surveillance video;
the K-Nearest-Neighbor (KNN) algorithm is a statistics-based classification algorithm. Its principle is that, for a sample in feature space, if its k nearest samples mostly belong to one class, the sample can be judged to belong to that class and to share the properties of that class;
the basic principle of generating a background image by KNN model modeling (see: Zivkovic Z, van der Heijden F. Efficient adaptive density estimation per image pixel for the task of background subtraction [J]. Pattern Recognition Letters, 2006, 27(7): 773-780.) combines the ideas of non-parametric probability density estimation and KNN classification, and establishes a background model in scenes where foreground targets change little; after the KNN model receives the input video frame images, it generates the corresponding background image L, which contains only the background content of the video;
the current frame R and the background image L generated by KNN model modeling are subtracted using the frame difference method, giving the difference image D = R − L; a threshold T is selected to binarize the difference image, yielding the foreground binary image. The calculation of the frame difference method is:

D_m(x, y) = |f_A(x, y) − f_B(x, y)| (Equation 1)

wherein D_m(x, y) is the value of the pixel of the difference image at abscissa x and ordinate y, and f_A(x, y) and f_B(x, y) are the values of the pixels of the two differenced images at abscissa x and ordinate y. The mathematical description of image binarization is:

D_b(x, y) = 255 if D_m(x, y) ≥ T, and D_b(x, y) = 0 otherwise (Equation 2)
Differencing the current frame R against the background image L generated by KNN model modeling yields the difference image D, which contains moving foreground such as walking people, vehicles, and casting objects that fall from vehicles onto the road surface. Meanwhile, owing to interference such as video shake, illumination changes, and water-mark reflections on rainy road surfaces, some background-area noise also appears in the foreground.
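As a minimal sketch of this step, the following Python/OpenCV fragment runs KNN background subtraction and frame differencing; the video path is a placeholder, the KNN parameters are OpenCV defaults, and T = 240 follows the embodiment below rather than any code disclosed in the patent:

```python
import cv2

cap = cv2.VideoCapture("highway.mp4")       # placeholder input video
knn = cv2.createBackgroundSubtractorKNN()   # KNN background model, default parameters

T = 240  # binarization threshold used in the embodiment

while True:
    ok, frame = cap.read()                  # current frame R
    if not ok:
        break
    knn.apply(frame)                        # update the background model
    L = knn.getBackgroundImage()            # background image L (background content only)
    gray_R = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray_L = cv2.cvtColor(L, cv2.COLOR_BGR2GRAY)
    D = cv2.absdiff(gray_R, gray_L)         # difference image D = |R - L| (Equation 1)
    _, fg_binary = cv2.threshold(D, T, 255, cv2.THRESH_BINARY)  # Equation 2
```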
Further, the step S2 is as follows:
After the difference image D is obtained, 3 mathematical morphology operations are applied to it; specifically, 1 closing operation followed by 2 erosion operations. The basic morphological operations are dilation, erosion, opening, and closing: dilation expands the contour of an object; erosion erodes the edge of an object; opening erodes the image first and then dilates it, which eliminates small objects and smooths the boundaries of larger ones; closing dilates the image first and then erodes it, which mainly fills small holes and connects regions into connected areas. Designing the appropriate morphological operations and their order according to the characteristics of the image removes noise, such as noise caused by certain illumination changes, achieving the goal of interference elimination. After the morphological operations, rectangular frames are extracted from the foreground regions, giving the foreground candidate frame set O in the current frame R:
O = {o_1, o_2, ..., o_n', ..., o_n} (Equation 3)

wherein o_1, o_2, ..., o_n', ..., o_n denote the 1st, 2nd, ..., n'-th, ..., n-th foreground candidate frames, respectively.
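A sketch of this step, assuming the 3×3 kernel and the close-then-erode-twice order quoted in the embodiment; function and variable names are illustrative:

```python
import cv2
import numpy as np

kernel = np.ones((3, 3), np.uint8)  # 3x3 structuring element

def foreground_candidate_boxes(fg_binary):
    m = cv2.morphologyEx(fg_binary, cv2.MORPH_CLOSE, kernel)  # 1 closing operation
    m = cv2.erode(m, kernel, iterations=2)                    # 2 erosion operations
    contours, _ = cv2.findContours(m, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]            # set O of (x, y, w, h) frames
```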
Further, the step S3 is as follows:
After the foreground candidate frame set O is obtained, although most interference has been removed by the mathematical morphology operations, interference from some illumination changes, camera shake and water-mark reflections on rainy road surfaces still has to be removed by other means. According to the actual situation of expressway casting objects, an object that is too small does not endanger safe driving (e.g. leaves or small scraps of paper), while an excessively large casting object does not exist in practice (e.g. a car or a construction vehicle); the area of each frame in the foreground candidate frame set O is therefore computed, and frames whose area falls below a preset minimum-area threshold or above a preset maximum-area threshold are removed. Flowers, grass and trees swaying in the wind on both sides of the road and clouds moving in the sky are also easily misjudged as foreground; since the target range for casting object detection is the road surface area, parts outside the road need not be considered, so candidate frames falling outside the segmented road surface are removed. Vehicles and pedestrians moving or stopping on the road also become foreground but are not casting object detection targets and must likewise be removed: the current frame R is passed through a YOLO network model to detect people and vehicles, the detection frames containing people and vehicles are output, and the foreground candidate frames containing them are excluded, finally yielding the foreground candidate frame set Q:
YOLO (You Only Look Once) is an open-source network model commonly used for object detection tasks. The YOLO network model uses a CNN for object classification and localization in the input image, with high accuracy and speed. It first divides the input image into an s×s grid; each grid cell is responsible for detecting objects whose center falls within it. Each cell predicts Z bounding boxes and their confidences. The confidence reflects the probability that a bounding box contains an object and the accuracy of the box, expressed by the spatial overlap IOU between the bounding box and the ground-truth box. Each bounding box also carries size and location information (x, y, w, h), where (x, y) is the center coordinate of the box as an offset from the top-left corner of its grid cell, and w and h are the width and height of the box as ratios of the image width and height. Finally, the YOLO network model combines this information and outputs the class and position of each target in the input image.
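The following sketch shows the three filters of step S3; detect_person_vehicle is a hypothetical wrapper around a YOLO detector returning person/vehicle boxes, and the area limits (40 / 40000 pixels) and the 0.3 IOU threshold are taken from the embodiment:

```python
AREA_MIN, AREA_MAX = 40, 40000   # area thresholds from the embodiment, in pixels

def iou(a, b):
    """Spatial overlap IOU of two (x, y, w, h) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2 = min(a[0] + a[2], b[0] + b[2])
    iy2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def filter_candidates(boxes, road_mask, frame, detect_person_vehicle):
    pv_boxes = detect_person_vehicle(frame)        # hypothetical YOLO call
    kept = []
    for (x, y, w, h) in boxes:
        if not (AREA_MIN <= w * h <= AREA_MAX):    # drop too-small / too-large frames
            continue
        cx, cy = x + w // 2, y + h // 2
        if road_mask[cy, cx] == 0:                 # drop frames whose centre is off the road
            continue
        if any(iou((x, y, w, h), pv) > 0.3 for pv in pv_boxes):
            continue                               # drop frames overlapping a person/vehicle
        kept.append((x, y, w, h))
    return kept                                    # set Q
```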
Further, the step S4 is as follows:
Under illumination changes and slight video shake, part of the background area becomes foreground for the short span of two frames, whereas a casting object usually moves some distance after falling from a vehicle and finally comes to rest on the road surface. To further remove this type of foreground noise while keeping the casting object as foreground, each foreground target is tracked and its stationary state is judged, distinguishing such noise from a continuously stationary casting object. The motion state of each foreground candidate frame is tracked with the IOU target tracking algorithm, where the spatial overlap IOU of two frames A and B is defined as the area of their intersection divided by the area of their union:

IOU(A, B) = Area(A ∩ B) / Area(A ∪ B)

The IOU target tracking algorithm (see: Bochinski E, Eiselein V, Sikora T. High-speed tracking-by-detection without using image information [C]//2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2017: 1-6.) continuously takes the foreground candidate frames detected in each video frame and uses a target tracking threshold σ_IOU to judge the overlap of foreground candidate frames of adjacent frames, so as to associate the same foreground target across frames. If a tracked foreground target q of the previous frame and a foreground target of the current frame have a spatial overlap IOU greater than σ_IOU, they are regarded as the same target and tracking continues; if the spatial overlap IOU between a foreground target of the current frame and every foreground target of the previous frame is smaller than σ_IOU, a new foreground target is considered to have appeared, an identifier q' is allocated to it, and all currently tracked foreground candidate frames with their target identifiers are output for the current frame.
For an object in motion, the spatial overlap IOU between adjacent frames is typically small, while for a stationary object it is usually high. The invention defines a target stationary threshold σ_static, a target motion threshold σ_moving and a minimum accumulated stationary frame count static_min, and assigns each tracked foreground target q a variable static_q recording its stationary state. If the foreground target q' of the previous frame and the tracked foreground target q of the current frame satisfy IOU(q, q') ≥ σ_IOU, they are the same foreground target. If additionally IOU(q, q') ≥ σ_static, then static_q = static_q + 1, meaning the foreground target q has been stationary for one more frame; if σ_static ≥ IOU(q, q') ≥ σ_moving, the foreground target q is judged to be in motion and static_q is reset to 0. If static_q ≥ static_min, the foreground target q meets the stationary retention condition; foreground targets stationary for 2 s and above are retained, finally yielding the stationary foreground candidate frame set C.
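A sketch of the stationary-state bookkeeping, reusing iou() from the S3 sketch; the threshold values σ_IOU = 0.2, σ_static = 0.7, σ_moving = 0.2 and static_min = 2 follow the embodiment:

```python
SIGMA_IOU, SIGMA_STATIC, SIGMA_MOVING, STATIC_MIN = 0.2, 0.7, 0.2, 2

tracks = {}    # target identifier -> {"box": (x, y, w, h), "static": stationary frame count}
next_id = 0

def update_tracks(candidate_boxes):
    """Returns the candidate frames stationary for static_min frames or more (set C)."""
    global next_id
    kept_static = []
    for box in candidate_boxes:
        best_id, best_iou = None, SIGMA_IOU
        for tid, t in tracks.items():
            v = iou(box, t["box"])
            if v > best_iou:
                best_id, best_iou = tid, v
        if best_id is None:                    # no overlap above sigma_IOU: new target q'
            tracks[next_id] = {"box": box, "static": 0}
            next_id += 1
            continue
        t = tracks[best_id]                    # same target as in the previous frame
        if best_iou >= SIGMA_STATIC:
            t["static"] += 1                   # stationary for one more frame
        elif best_iou >= SIGMA_MOVING:
            t["static"] = 0                    # still moving: reset the counter
        t["box"] = box
        if t["static"] >= STATIC_MIN:          # stationary for >= 2 s at 1 frame per second
            kept_static.append(box)
    return kept_static
```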
further, the step S5 is as follows:
Under illumination changes and slight video shake, some background areas still become foreground for a long time; the stationary-state judgment removes false detections produced by background noise and retains the suspected casting object foreground, and a difference perception model is cascaded after foreground generation and the IOU target tracking algorithm's stationary judgment. A casting object is usually an object that newly appears on the road surface and moves for a while, in contrast to background areas such as the road surface, traffic cones and roadblocks that are present in the surveillance video all along. Considering the temporal and spatial differences between the background and casting objects, dual background images are input into the difference perception model to distinguish casting objects from background noise. The dual background images comprise a long background image and a short background image. The background image L generated by KNN model modeling updates slowly and can be regarded as the long background image: an object that moves for a while and finally comes to rest fuses into L only after a long time. The median image S generated from the most recent 5 frames is regarded as the short background image, which updates quickly: an object that moves for a while and finally comes to rest appears in the median image much sooner. A casting object therefore appears first in the median image S after coming to rest while not yet appearing in the background image L, whereas background areas and background objects are present simultaneously in both the median image S and the background image L;
Pixel-level image difference comparison algorithms are sensitive to fine regions of image difference; background changes produced by illumination variation and slight video shake are easily picked up by pixel-level comparison algorithms such as SIFT, SSIM and template matching. To compare differences at the level of image semantics and overall structure, recognizing image differences at the level of obvious objects while ignoring fine pixel-level changes, a difference perception model based on the VGG16 model is used to acquire the difference regions between the median image S and the background image L generated by KNN model modeling. The difference perception model is built on the VGG16 model (see: Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition [J]. arXiv preprint arXiv:1409.1556, 2014.), which has 16 weight layers in total: two groups of 2 convolutional layers each followed by a max pooling layer, three groups of 3 convolutional layers each followed by a max pooling layer, and finally 3 fully connected layers. When a picture is input into the VGG16 model, each convolutional layer extracts features of the picture in turn.
The formula of the difference perception model is expressed as follows:

P(L, S) = Σ_{i=1}^{N} (1/M_i) |F^(i)(L) − F^(i)(S)| (Equation 4)

wherein L is the background image generated by KNN model modeling, S is the median image generated from the most recent 5 frames, the function F^(i)(·) takes an image as input and outputs the i-th feature layer of the VGG16 model, the i-th feature layer contains M_i elements, and N denotes the number of network layers; the pixel values of the output difference-perceived image P(L, S) are normalized to [0, 255];
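A sketch of such a VGG16 feature-difference map in PyTorch; the choice of feature layers, the per-layer channel averaging standing in for the 1/M_i factor, and the upsampling back to image size are assumptions about the model's details, not code from the patent:

```python
import torch
import torch.nn.functional as F
import torchvision

# Pretrained VGG16 feature extractor (torchvision >= 0.13 weights API assumed)
vgg = torchvision.models.vgg16(
    weights=torchvision.models.VGG16_Weights.DEFAULT).features.eval()

def difference_map(img_L, img_S, layers=(3, 8, 15, 22)):
    """img_L, img_S: (1, 3, H, W) float tensors in [0, 1]; returns a uint8 H x W map."""
    h, w = img_L.shape[2], img_L.shape[3]
    diff = torch.zeros(1, 1, h, w)
    xL, xS = img_L, img_S
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            xL, xS = layer(xL), layer(xS)
            if i in layers:                   # i-th feature layer F^(i)
                d = (xL - xS).abs().mean(dim=1, keepdim=True)  # ~ (1/M_i)|F(L) - F(S)|
                diff += F.interpolate(d, size=(h, w), mode="bilinear",
                                      align_corners=False)
            if i >= max(layers):
                break
    p = diff.squeeze().numpy()
    p = (p - p.min()) / (p.max() - p.min() + 1e-8) * 255       # normalize to [0, 255]
    return p.astype("uint8")
```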
The median image S is calculated as follows:
S = Median(R_{t−4}, ..., R_t) (Equation 5)

wherein R_t denotes the video frame image of the current t-th frame; the Median() function takes the 5 images as input and, for each (x, y) coordinate position, takes the median of the values of the 5 images' pixels at that position as the value of the output median image S at (x, y);
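Equation 5 is a per-pixel temporal median; a short numpy sketch:

```python
import numpy as np

def median_image(frames):
    """frames: list of the 5 most recent H x W x 3 uint8 frames; returns the median image S."""
    return np.median(np.stack(frames, axis=0), axis=0).astype(np.uint8)
```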
a median image S is generated from the most recent 5 frames; the median image S and the background image L generated by KNN model modeling are input into the difference perception model, which outputs the difference-perceived image P_scene of the full-image region;
Only a rough difference region can be obtained from the difference comparison of the whole image; the sizes and positions of the objects in which differences appear cannot be determined in detail. To obtain difference regions fitting the shape of the difference objects, difference-perceived images are also acquired at the level of the foreground candidate frame regions. The formula for inputting the foreground candidate frame region images into the difference perception model and outputting the difference-perceived image set P_region of the foreground candidate frame regions is as follows:

P_region = { p_n' | p_n' = P(l_n', s_n'), n' = 1, ..., n } (Equation 6)

wherein L_C = {l_1, ..., l_n} denotes the set of background images within the foreground candidate frame regions of the background image L generated by KNN model modeling, S_C = {s_1, ..., s_n} denotes the set of median images within the foreground candidate frame regions of the median image S, l_n' denotes the background image within the region of foreground candidate frame c_n', s_n' denotes the median image within the region of c_n', and p_n' denotes the difference-perceived image of c_n'.
Further, the step S6 is as follows:
The difference-perceived image of the full-image region yields the rough regions of obvious difference in the image while remaining insensitive to slight differences; the difference-perceived image obtained at the foreground candidate frame level perceives the precise shape and position of a difference object but is more sensitive to slight differences such as background noise. To obtain the precise shape and position of difference objects while eliminating slight difference noise, the two levels are combined: the difference-perceived image of the foreground candidate frame regions and the difference-perceived image of the full-image region are each automatically binarized with Otsu's method (Otsu()), the two resulting binary images undergo a bitwise AND, and the ANDed binary image is input into the function Rect() for rectangular frame extraction, outputting the extracted difference frames DF. The formula of the difference frame DF extraction process is expressed as follows:

DF = Rect(Otsu(P_region) ∩ Otsu(P_scene)) (Equation 7)
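A sketch of Equation 7 with OpenCV, assuming both difference-perceived maps are uint8 single-channel images of the same size:

```python
import cv2

def extract_difference_frames(p_region, p_scene):
    _, b_region = cv2.threshold(p_region, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    _, b_scene = cv2.threshold(p_scene, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    b_and = cv2.bitwise_and(b_region, b_scene)       # bitwise AND of the two masks
    contours, _ = cv2.findContours(b_and, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]   # difference frames DF (Rect)
```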
Further, the step S7 is as follows:
If the spatial overlap IOU of a difference frame and a foreground candidate frame is high, the position of that foreground candidate frame shows an obvious image difference across the two time scales, and the foreground candidate frame is very likely to contain a casting object. If the spatial overlap IOU of the difference frame and the foreground candidate frame is low, the two differ considerably in shape and size, and the candidate is more likely noise produced by background change. The spatial overlap IOU of the difference frames and each foreground candidate frame is computed and input into an optimization algorithm, which outputs whether the foreground candidate frame matches a difference frame; the foreground candidate frames matched to difference frames are retained;
the formula for the spatial overlap IOU of the difference frames and a foreground candidate frame is as follows:

x_j = max_k IOU(c_l, d_k) (Equation 8)

wherein x_j is the j-th sample, concretely the maximum of the spatial overlap IOU between a foreground candidate frame c_l of the current frame and all difference frames d_k;
the optimization algorithm is defined as Equation (9); x_j is input into Equation (9), which outputs one of two classes y_j ∈ {1, −1}, where 1 denotes a match and −1 a mismatch. If y_j = 1, the foreground candidate frame overlaps a difference frame to a high degree; the foreground candidate frame c_l is retained and re-marked as dc_l. If y_j = −1, all difference frames overlap the foreground candidate frame to a low degree, and the foreground candidate frame c_l is excluded:

min_{ω,b,ξ} (1/2)‖ω‖² + ε Σ_j ξ_j
s.t. y_j(ω^T φ(x_j) + b) ≥ 1 − ξ_j, ξ_j ≥ 0 (Equation 9)

wherein ε is the penalty factor, ω and b are the first and second coefficients of the separating hyperplane ω^T φ(x) + b = 0, ξ_j is the relaxation coefficient of the j-th sample, i.e. the distance by which a sample is allowed to deviate from the correct boundary of its class, and φ(x_j) is a kernel function mapping x_j from low dimension to high dimension; the kernel adopts a Gaussian kernel function. The finally retained foreground candidate frame set is DC = {dc_l | y_j = 1}.
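Equation (9) is the soft-margin SVM primal, so a sketch can lean on sklearn's SVC with an RBF (Gaussian) kernel; note that sklearn names the penalty factor C rather than ε, and iou() is the helper from the S3 sketch:

```python
import numpy as np
from sklearn.svm import SVC

svm = SVC(kernel="rbf")   # Gaussian kernel; the penalty factor is exposed as parameter C

def train_matcher(x_values, y_labels):
    """x_values: max-IOU samples x_j; y_labels: 1 (match) / -1 (mismatch)."""
    svm.fit(np.asarray(x_values).reshape(-1, 1), y_labels)

def is_match(candidate_box, difference_boxes):
    """Equation 8 followed by the trained classifier of Equation 9."""
    x_j = max((iou(candidate_box, d) for d in difference_boxes), default=0.0)
    return svm.predict([[x_j]])[0] == 1    # keep the candidate frame only when matched
```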
further, the step S8 is as follows:
In rainy, night-time and tunnel scenes, water marks left by vehicles passing over the road, puddles formed by accumulated rainwater, and the various reflective areas that headlights create on the road surface resemble the spatio-temporal state of a casting object newly appearing in the video and become foreground areas; such false detections cannot be eliminated by the cascaded difference perception model. To remove false detections on the road surface, a road-surface/non-road-surface binary classification network is cascaded after the difference perception model. Images of the road-surface class include ordinary road surfaces, water-marked road surfaces, puddled road surfaces, reflective road surfaces, road surfaces containing lane lines, and the like; images of the non-road-surface class mainly comprise images of various casting objects, such as cartons, waste paper, traffic cones, branches, wooden sticks, plastic bags and water bottles. The foreground candidate frame images are input into the road-surface/non-road-surface binary classification network, and the foreground candidate frames finally classified as non-road-surface are judged to be casting object target frames and marked in the picture; the binary classification network is implemented with a ResNet34 network model;
The deep residual network (ResNet) solves the degradation and training difficulties that CNN network models face as network depth increases. As the number of layers grows, a network can extract more complex feature representations from an image, but degradation can occur: prediction accuracy saturates, and vanishing or exploding gradients make the model hard to train. ResNet adds residual learning units between the feature layers of a convolutional neural network through a shortcut mechanism to solve the degradation problem. The ResNet34 residual network (see: He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.) performs residual learning between every two feature layers. The ResNet34 network comprises a 7×7 convolutional layer, a max pooling layer with stride 2, and 4 residual stages, in which the first stage consists of 3 residual blocks, the second of 4, the third of 6 and the fourth of 3; each residual block is two 3×3 convolutional layers with a skip connection. An image input into the ResNet34 network has its features identified, is classified, and the classification result is output.
Expressway road surface pictures, including ordinary road surfaces, water-marked road surfaces, puddled road surfaces, reflective road surfaces and road surfaces containing lane lines, and non-road-surface pictures, including cartons, tires, branches, safety helmets, traffic cones, plastic bags, water bottles and crumpled waste paper, are collected under weather conditions covering sunny, cloudy, rainy and foggy days, and all collected images are divided into a training dataset and a test dataset at a ratio of 10:1.
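A sketch of the binary classifier, assuming torchvision's ImageNet-pretrained ResNet34 with its final layer replaced by a 2-way road/non-road head; the learning rate 5.5e-5, batch size 32 and 23 epochs quoted in the embodiment below are used as the assumed training settings:

```python
import torch
import torchvision

model = torchvision.models.resnet34(
    weights=torchvision.models.ResNet34_Weights.IMAGENET1K_V1)  # ImageNet pretraining
model.fc = torch.nn.Linear(model.fc.in_features, 2)             # road vs non-road head

optimizer = torch.optim.Adam(model.parameters(), lr=5.5e-5)
criterion = torch.nn.CrossEntropyLoss()

def fine_tune(loader, epochs=23):
    """loader yields (images, labels) batches, e.g. of size 32."""
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```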
Compared with the prior art, the invention has the following advantages and effects:
1) The invention provides a casting object detection method using a cascade difference perception model and a road-surface/non-road-surface binary classification network, which markedly reduces false detections caused by illumination change, video shake and road water marks, attains a high casting object detection accuracy, and shows a degree of detection robustness across various expressway scenes.
2) The invention detects casting objects quickly. After a casting object falls onto the expressway, it can be detected within 5 to 10 seconds by analyzing only low-frame-rate video of 1 frame per second.
3) Unlike general deep-learning-based object detection models, the method needs neither a casting object category dataset nor lengthy model training, can detect casting objects of many kinds, has stronger universality, and is better suited to actual expressway scenes.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a step diagram of a highway casting detection method based on a cascade difference perception model, disclosed by the invention;
FIG. 2 is a diagram of exemplary processing steps of a highway casting detection method based on a cascade difference perception model according to an embodiment of the present invention;
FIG. 3 is a diagram of exemplary processing steps of a cascaded differential sensing model disclosed in an embodiment of the present invention;
FIG. 4 is a graph showing the casting object detection effect of the method of the present invention;
FIG. 5 compares casting object detection data under different combinations of target tracking parameters according to an embodiment of the invention;
FIG. 6 compares casting object detection data across different scene types according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, the embodiment discloses a highway casting detection method based on a cascade difference perception model, which comprises the following steps:
S1, background modeling is carried out on the expressway surveillance video, and the background image generated by modeling is subtracted from the current frame to obtain a foreground binary image;
Because casting object detection methods are little studied at home and abroad, no public dataset currently exists for testing their effect; the expressway surveillance videos used to test the method were shot by expressway surveillance cameras in Guangdong Province. There are 62 expressway surveillance videos in total, with a frame size of 720×480 pixels and a frame rate of 1 frame per second. The casting objects in the test videos include cartons, waste paper, tires, safety helmets, water bottles, traffic cones, branches and other object types; some picture samples taken from the test videos are shown in FIG. 4;
KNN-model background modeling is adopted: the expressway surveillance video is input into the KNN model to obtain the background image L generated by KNN model modeling, which contains only the background content of the expressway surveillance video; the KNN model parameters use the default settings of the digital image processing library OpenCV;
The current frame R and the background image L generated by KNN model modeling are subtracted using the frame difference method, giving the difference image D = R − L; the threshold T = 240 is selected to binarize the difference image D, yielding the foreground binary image. The calculation of the frame difference method is:

D_m(x, y) = |f_A(x, y) − f_B(x, y)| (Equation 1)

wherein D_m(x, y) is the value of the pixel of the difference image at abscissa x and ordinate y, and f_A(x, y) and f_B(x, y) are the values of the pixels of the two differenced images at abscissa x and ordinate y, with image binarization as in Equation 2 above.
Differencing the current frame R against the background image L generated by KNN model modeling yields the difference image D, which contains moving foreground such as walking people, vehicles, and casting objects that fall from vehicles onto the road surface. Meanwhile, owing to interference such as video shake, illumination changes, and water-mark reflections on rainy road surfaces, some background-area noise also appears in the foreground.
S2, mathematical morphology processing is applied to the foreground binary image to obtain foreground candidate frames: after the difference image D is obtained, 3 mathematical morphology operations are applied to it, 1 closing operation followed by 2 erosion operations in order; the operation kernel used in the morphological operations is a 3×3 square, and the kernel moves through the difference image with a step of 1. Designing the appropriate morphological operations and their order according to the characteristics of the image removes noise, such as noise from certain illumination changes, achieving interference elimination. After the morphological operations, rectangular frames are extracted from the foreground regions using the findContours function of the OpenCV library, giving the foreground candidate frame set O in the current frame R:

O = {o_1, o_2, ..., o_n', ..., o_n} (Equation 3)

wherein o_1, o_2, ..., o_n', ..., o_n denote the 1st, 2nd, ..., n'-th, ..., n-th foreground candidate frames, respectively.
S3, from the obtained foreground candidate frames, those whose area is too small or too large and those not on the road surface are removed, and person/vehicle detection is performed on the current frame R with a YOLO network model, excluding the foreground candidate frames identified as people or vehicles. The area of each foreground candidate frame is computed, and frames with area smaller than 40 pixels or larger than 40000 pixels are excluded. A road surface area is drawn for each video of the expressway scene; foreground candidate frames whose center coordinates lie within the road surface area are retained and those whose center coordinates lie outside it are removed. The current frame image is input into the YOLOv5 network model for person/vehicle detection, giving the target detection frames of pedestrians and vehicles in the video frame image; the spatial overlap IOU of each foreground candidate frame with each person/vehicle detection frame is computed, and if it exceeds 0.3 the foreground candidate frame is judged to be a person or vehicle and excluded, finally giving the foreground candidate frame set Q:
S4, each foreground candidate frame is tracked and matched with the IOU target tracking algorithm, the foreground candidate frames stationary for 2 s and above are retained, and the motion state of each foreground candidate frame is tracked with the IOU target tracking algorithm, the spatial overlap IOU being defined as the intersection area of two frames divided by their union area: IOU(A, B) = Area(A ∩ B) / Area(A ∪ B).
The IOU target tracking algorithm uses the target tracking threshold σ_IOU to judge the overlap of foreground candidate frames of adjacent frames, so as to associate the same foreground target across frames. If a tracked foreground target q of the previous frame and a foreground target of the current frame have a spatial overlap IOU greater than σ_IOU, the foreground target of the current frame keeps the identifier q; if the spatial overlap IOU of a foreground target of the current frame with every foreground target of the previous frame is smaller than σ_IOU, a new foreground target is considered to have appeared, an identifier q' is allocated to it, and the target q' continues to be tracked in subsequent video frames; σ_IOU is set to 0.2.
For an object in motion, the spatial overlap IOU between adjacent frames is typically small, while for a stationary object it is usually high. A target tracking threshold σ_IOU is defined; if the spatial overlap IOU of the tracked foreground target q between the previous frame and the current frame is greater than σ_IOU, it is considered the same target and tracking continues. A target stationary threshold σ_static, a target motion threshold σ_moving and a minimum accumulated stationary frame count static_min are defined, and each tracked foreground target q is assigned a variable static_q recording its stationary state. If the foreground target q' of the previous frame and the tracked foreground target q of the current frame satisfy IOU(q, q') ≥ σ_IOU, they are the same foreground target; if additionally IOU(q, q') ≥ σ_static, then static_q = static_q + 1, meaning q has been stationary for one more frame; if σ_static ≥ IOU(q, q') ≥ σ_moving, q is judged to be in motion and static_q is reset to 0; if static_q ≥ static_min, the foreground target q meets the stationary retention condition and the foreground targets stationary for 2 s and above are retained. The settings are σ_static = 0.7, σ_moving = 0.2, static_min = 2. Finally the stationary foreground candidate frame set C is obtained:
S5, a median image is generated from the most recent 5 frames. The median image and the background image L are input into the difference perception model to obtain the difference binary image of the foreground candidate frame regions and the difference binary image of the full-image region, respectively; the formula of the difference perception model is expressed as follows:

P(L, S) = Σ_{i=1}^{N} (1/M_i) |F^(i)(L) − F^(i)(S)| (Equation 4)

wherein L is the background image generated by KNN model modeling, S is the median image generated from the most recent 5 frames, the function F^(i)(·) takes an image as input and outputs the i-th feature layer of the VGG16 model, the i-th feature layer contains M_i elements, and N denotes the number of network layers; the pixel values of the output difference-perceived image P(L, S) are normalized to [0, 255];
The median image S is calculated as follows:

S = Median(R_{t−4}, ..., R_t) (Equation 5)

wherein R_t denotes the video frame image of the current t-th frame; the Median() function takes the 5 images as input and, for each (x, y) coordinate position, takes the median of the values of the 5 images' pixels at that position as the value of the output median image S at (x, y); concretely, the 5 frames are input as a group to the numpy.median function, which computes the median over the temporal axis.
A median image S is generated from the most recent 5 frames; the median image S and the background image L generated by KNN model modeling are input into the difference perception model, which outputs the difference-perceived image P_scene of the full-image region.
To obtain difference regions fitting the shape of the difference objects, difference-perceived images are also acquired at the level of the foreground candidate frame regions; the difference-perceived image is acquired in an area enlarged 4 times around each foreground candidate frame. The formula for inputting the foreground candidate frame region images into the difference perception model and outputting the difference-perceived image set P_region of the foreground candidate frame regions is as follows:

P_region = { p_n' | p_n' = P(l_n', s_n'), n' = 1, ..., n } (Equation 6)

wherein L_C = {l_1, ..., l_n} denotes the set of background images within the foreground candidate frame regions of the background image L generated by KNN model modeling, S_C = {s_1, ..., s_n} denotes the set of median images within the foreground candidate frame regions of the median image S, l_n' denotes the background image within the region of foreground candidate frame c_n', s_n' denotes the median image within the region of c_n', and p_n' denotes the difference-perceived image of c_n'.
S6, the difference-perceived image of the foreground candidate frame regions and the difference-perceived image of the full-image region are each automatically binarized with Otsu's method (Otsu()); the two resulting binary images undergo a bitwise AND; the ANDed binary image is input into the function Rect() for rectangular frame extraction, and the extracted difference frames DF are output. The formula of the difference frame DF extraction process is expressed as follows:

DF = Rect(Otsu(P_region) ∩ Otsu(P_scene)) (Equation 7)
S7, solving the IOU value of the difference frame and the foreground candidate frame, inputting the IOU value into an optimization algorithm, outputting whether the foreground candidate frame is matched with the difference frame or not, and reserving the foreground candidate frame matched with the difference frame;
the formula for the spatial overlap IOU of the difference frames and a foreground candidate frame is as follows:

x_j = max_k IOU(c_l, d_k) (Equation 8)

wherein x_j is the j-th sample, concretely the maximum of the spatial overlap IOU between a foreground candidate frame c_l of the current frame and all difference frames d_k;
the optimization algorithm is defined as Equation (9); x_j is input into Equation (9), which outputs one of two classes y_j ∈ {1, −1}, where 1 denotes a match and −1 a mismatch. If y_j = 1, the foreground candidate frame overlaps a difference frame to a high degree and the foreground candidate frame c_l is retained and re-marked as dc_l; if y_j = −1, all difference frames overlap the foreground candidate frame to a low degree and the foreground candidate frame c_l is excluded:

min_{ω,b,ξ} (1/2)‖ω‖² + ε Σ_j ξ_j
s.t. y_j(ω^T φ(x_j) + b) ≥ 1 − ξ_j, ξ_j ≥ 0 (Equation 9)

wherein ε is the penalty factor, ω and b are the first and second coefficients of the separating hyperplane ω^T φ(x) + b = 0, ξ_j is the relaxation coefficient of the j-th sample, i.e. the distance by which a sample is allowed to deviate from the correct boundary of its class, and φ(x_j) is a kernel function mapping x_j from low dimension to high dimension; the kernel adopts a Gaussian kernel function. The finally retained foreground candidate frame set is DC = {dc_l | y_j = 1}.
The optimization algorithm is implemented with the SVM model in sklearn (see: Chang C C, Lin C J. LIBSVM: a library for support vector machines [J]. ACM Transactions on Intelligent Systems and Technology (TIST), 2011, 2(3): 1-27.). Expressway casting object videos from Guangdong Province were collected, the difference frames of each video frame were acquired, and the spatial overlap IOU of the difference frames and the foreground candidate frames was recorded into x_j; the IOU values under foreground candidate frames containing casting objects were labeled y_j = 1, and the IOU values under foreground candidate frames without casting objects were labeled y_j = −1. The labeled (x_j, y_j) data pairs were input into the SVM model for training; the trained SVM model is used in the final optimization algorithm, taking the spatial overlap IOU of a difference frame and a foreground candidate frame as input and outputting the judgment of whether they match.
S8, the foreground candidate frames are input into the road-surface/non-road-surface binary classification network, and those finally classified as non-road-surface are judged to be casting object target frames and marked in the picture. Images of the road-surface class include ordinary road surfaces, water-marked road surfaces, puddled road surfaces, reflective road surfaces, road surfaces containing lane lines, and the like; images of the non-road-surface class mainly comprise images of various casting objects, such as cartons, waste paper, traffic cones, branches, wooden sticks, plastic bags and water bottles. The binary classification network is implemented with a ResNet34 network model. Expressway road surface pictures (ordinary, water-marked, puddled, reflective, and lane-line-containing road surfaces) and non-road-surface pictures (cartons, tires, branches, safety helmets, traffic cones, plastic bags, water bottles and crumpled waste paper) were collected under weather conditions including sunny, cloudy, rainy and foggy days, and all collected images were divided into a training dataset and a validation dataset at a ratio of about 10:1; the dataset composition is shown in Table 1. ResNet34 is pretrained on the ImageNet dataset (see: Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge [J]. International Journal of Computer Vision, 2015, 115: 211-252.) and then fine-tuned on the road-surface/non-road-surface training set produced by the invention, with the learning rate set to 5.5×10⁻⁵, a batch training sample size of 32 and 23 epochs; the accuracy of the fine-tuned ResNet34 binary classification network on the test set is 90.52%.
TABLE 1. Composition of the road-surface/non-road-surface binary classification dataset produced by the invention

Category            Training set   Validation set   Test set
Road surface            5206            504            102
Non-road surface        4941            582            246
The method was tested on the 62 collected expressway surveillance videos. Table 2 shows the scene types and video counts of the 62 captured expressway videos. To reduce the computation load, video with a frame rate of 1 fps (one frame per second) is used as the input of the casting object detection system.
TABLE 2. Scene composition of the expressway video dataset

Scene              Sunny   Rainy (road water marks)   Illumination variation   Tunnel   Camera shake   Occlusion
Number of videos     11              16                        15                 16          2             2
Following the usual evaluation criteria for casting object detection, two kinds of metrics are adopted: frame-level metrics and pixel-level metrics, where TP, FP and FN denote correct detections, false detections and missed detections, respectively. Precision (PR), recall (RC) and F1-score (F1) are defined as follows:
PR = TP / (TP + FP) (Equation 11)
RC = TP / (TP + FN) (Equation 12)
F1 = 2 × PR × RC / (PR + RC) (Equation 13)
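These definitions transcribe directly into code; the function below is a plain restatement of Equations 11-13 rather than part of the patent itself:

```python
# Compute precision, recall and F1 from frame- or pixel-level counts.
def precision_recall_f1(tp: int, fp: int, fn: int):
    pr = tp / (tp + fp)              # precision, Equation 11
    rc = tp / (tp + fn)              # recall, Equation 12
    f1 = 2 * pr * rc / (pr + rc)     # F1-score, Equation 13
    return pr, rc, f1
```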
At the frame level, the current frame is counted as correctly detected (TP) if at least one casting in it is detected, and as a false positive (FP) if it contains at least one false detection. At the pixel level, a casting is considered correctly detected (TP) if the spatial overlap degree IOU between the ground-truth frame and the detection frame is greater than 0.2. To compute these metrics on the highway videos, the invention annotated the castings in all 62 videos. The precision (PR), recall (RC), and F1-score (F1) of the method for casting detection in highway surveillance video are shown in Table 3. On average, the method detects a casting within 5.61 s of its appearance.
TABLE 3 Precision (PR), recall (RC), and F1-score (F1) of the method for casting detection in highway surveillance video
Using the casting detection method disclosed in this embodiment, 70 of the 77 casting events occurring in the 62 surveillance videos were successfully detected, an event-level detection rate of 90.9%. The detection rate of the method for each type of casting in the highway surveillance videos is shown in Table 4.
TABLE 4 Detection rate of the method for each type of casting in the test videos

Type | Carton | Tire | Road cone | Branch | Helmet | Other objects | Total
Number of events | 20 | 12 | 25 | 3 | 2 | 15 | 77
Number detected | 18 | 11 | 23 | 3 | 2 | 12 | 70
Detection rate | 90% | 91.7% | 92% | 100% | 100% | 80% | 90.9%
Example 2
According to the highway casting detection method based on the cascade difference perception model disclosed in Embodiment 1, a test was performed on video shot by a highway surveillance camera; Fig. 2 shows the processing steps of this embodiment. First, a background image is obtained by background modeling; the difference image between the current frame and the background image is binarized and subjected to a series of mathematical morphology operations, and rectangular frames are extracted to obtain foreground candidate frames. Foreground candidate frames outside the road-surface area are removed. A YOLOv5 network model detects vehicles, and foreground candidate frames containing vehicles or pedestrians are excluded. Then, to eliminate transient background noise, the IOU target tracking algorithm and the stationary-state judgment remove part of the noise-induced false detections and preserve the stationary foreground candidate frames. Next, the difference perception model is cascaded with the road-surface/non-road-surface classification network: to eliminate background noise caused by abrupt illumination changes or video jitter, difference perception images are computed for the foreground candidate frame regions and for the full image, and difference frames are extracted by comparing the background image with the median image. The optimization algorithm adaptively matches the difference frames to the foreground candidate frames so as to exclude the foreground candidate frames that are noise. Finally, to distinguish road-surface noise such as water marks and puddles from castings, the road-surface/non-road-surface two-class network ResNet34 removes the false detections on the road, and the casting detection frames are output.
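As one illustration of the first stage of this pipeline (background modeling, differencing, binarization, mathematical morphology, rectangle extraction), the sketch below uses OpenCV's KNN background subtractor; the kernel size and binarization threshold are illustrative assumptions rather than the invention's settings:

```python
# Sketch of foreground-candidate extraction: KNN background modeling,
# binarization, one closing + two erosions, then bounding-rectangle extraction.
import cv2

bg_model = cv2.createBackgroundSubtractorKNN(history=500, detectShadows=False)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))  # illustrative size

def foreground_candidate_frames(frame):
    mask = bg_model.apply(frame)                                  # foreground mask
    _, binary = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)    # 1 closing
    binary = cv2.erode(binary, kernel, iterations=2)              # 2 erosions
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]                # (x, y, w, h)
```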
According to the highway casting detection method based on the cascade difference perception model disclosed in Embodiment 1, tests were performed on videos shot by a highway surveillance camera and by a hand-held camera; Fig. 3 shows the processing steps of the cascade difference perception model in this embodiment. After the stationary foreground candidate frames are determined by the IOU target tracking algorithm, the difference perception model is cascaded with the road-surface/non-road-surface classification network: to eliminate background noise caused by abrupt illumination changes or video jitter, difference perception images are computed for the foreground candidate frame regions and for the full image, and difference frames comparing the background image with the median image are extracted. The detection results in Fig. 4 show that the disclosed highway casting detection method accurately detects and frames all the castings, demonstrating its effectiveness.
Example 3
According to the highway casting detection method based on the cascade difference perception model disclosed in Embodiment 1, tests were performed on videos shot by a highway surveillance camera and by a hand-held camera. This embodiment sets different combinations of the target stationary threshold σ_static and the target accumulated stationary frame count static_min. Fig. 5 compares the pixel-level F1 scores obtained on the highway video dataset under the different (σ_static, static_min) parameter combinations. As Fig. 5 shows, the invention achieves the best casting detection when σ_static = 0.7 and static_min = 2, or σ_static = 0.7 and static_min = 3. To reduce the delay of casting detection, the smaller static_min is, the shorter the time to detect a casting; the invention therefore adopts the parameter setting σ_static = 0.7, static_min = 2.
Example 4
According to the highway casting detection method based on the cascade difference perception model disclosed in Embodiment 1, tests were performed on videos shot by a highway surveillance camera and by a hand-held camera. Following the video categories of the highway video dataset in Table 2 of Embodiment 1, this embodiment computes the precision, recall, and F1 score of casting detection separately for each scene, as shown in Fig. 6. The invention attains high precision and recall in sunny scenes, rainy scenes with road-surface water streaks, illumination-change scenes, tunnels, and occlusion scenes, and retains some detection capability under camera shake. It can accurately detect castings in a variety of highway scenes while reducing false alarms caused by scene noise, which demonstrates its effectiveness.
The above examples are preferred embodiments of the present invention, but embodiments of the present invention are not limited to them; any change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and falls within the protection scope of the present invention.

Claims (10)

1. The highway casting detection method based on the cascade difference perception model is characterized by comprising the following steps of:
s1, background modeling is carried out on a highway monitoring video, and a background image generated by modeling is subtracted from a current frame of the highway monitoring video to obtain a foreground binary image;
s2, carrying out mathematical morphology processing on the foreground binary image to obtain a foreground candidate frame;
s3, removing the foreground candidate frames which are too small and too large in area and are not on the road surface from the obtained foreground candidate frames, and detecting the vehicles by using a YOLO network model and eliminating the foreground candidate frames identified as the vehicles;
s4, tracking and matching each foreground candidate frame by using an IOU target tracking algorithm, and reserving the foreground candidate frames which are static for 2S and above;
s5, generating a median image by using the first 5 frames of the current frame of the highway monitoring video, and inputting the median image and the background image into a difference perception model to respectively obtain a difference binary image of a foreground candidate frame region and a difference binary image of a full-view region;
S6, performing logic operation of bitwise AND on the difference binary image of the foreground candidate frame area and the difference binary image of the full image area, and extracting a difference frame from the binary image obtained by the AND;
s7, solving the IOU value of the difference frame and the foreground candidate frame, inputting the IOU value into an optimization algorithm, outputting whether the foreground candidate frame is matched with the difference frame or not, and reserving the foreground candidate frame matched with the difference frame;
s8, inputting the foreground candidate frames into a road surface and non-road surface classification network, judging the foreground candidate frames classified into non-road surfaces as object throwing frames, and marking the object throwing frames in the picture.
2. The method for detecting the highway casting based on the cascade difference perception model according to claim 1, wherein the step S1 is as follows:
Background modeling is performed with a KNN model: the highway monitoring video is input into the KNN model to obtain the background image L generated by KNN modeling, which contains only the background content of the highway monitoring video. Using the frame-difference method, the background image L is subtracted from the current frame R to obtain the difference image D = R − L, and a threshold T is selected to binarize the difference image, yielding the foreground binary image. The frame-difference method is calculated as:

D_m(x, y) = |f_A(x, y) − f_B(x, y)| (Equation 1)

where D_m(x, y) is the value of the difference-image pixel at abscissa x and ordinate y, and f_A(x, y) and f_B(x, y) are the values of the pixels of the two compared images at that position. The binarization of the image is described mathematically as:

D′(x, y) = 255 if D_m(x, y) ≥ T, and D′(x, y) = 0 otherwise (Equation 2)
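As a non-authoritative illustration, Equations 1-2 transcribe into OpenCV as follows; the threshold value T here is an illustrative choice:

```python
# Absolute frame difference against the background image, then fixed-threshold
# binarization into a foreground binary image (Equations 1-2).
import cv2
import numpy as np

def foreground_binary(frame_gray: np.ndarray, background_gray: np.ndarray,
                      T: int = 30) -> np.ndarray:
    diff = cv2.absdiff(frame_gray, background_gray)             # D_m(x, y)
    _, binary = cv2.threshold(diff, T, 255, cv2.THRESH_BINARY)  # threshold at T
    return binary
```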
3. the method for detecting the highway casting based on the cascade difference perception model according to claim 2, wherein the step S2 is as follows:
Mathematical morphology operations are applied to the difference image D three times: one closing operation followed by two erosion operations. Rectangular frames are then extracted from the foreground regions after the morphology operations, yielding the foreground candidate frame set O in the current frame R:

O = {o_1, o_2, ..., o_n′, ..., o_n} (Equation 3)

where o_1, o_2, ..., o_n′, ..., o_n denote the 1st, 2nd, ..., n′-th, ..., n-th foreground candidate frames, respectively.
4. The method for detecting highway casts based on the cascade difference perception model according to claim 3, wherein the step S3 is as follows:
The area of each foreground candidate frame in the foreground candidate frame set O is calculated; according to preset minimum-area and maximum-area judgment thresholds, the frames whose area is too small or too large and the frames not on the road surface are removed; persons and vehicles are detected with the YOLO network model, and the foreground candidate frames containing persons or vehicles are removed, finally yielding the foreground candidate frame set Q.
5. the method for detecting the highway casting based on the cascade difference perception model according to claim 4, wherein the step S4 is as follows:
Each foreground candidate frame is tracked with the IOU target tracking algorithm. A target tracking threshold σ_IOU is defined; if the spatial overlap degree IOU of a tracked foreground target q between the previous frame and the current frame is greater than σ_IOU, it is considered the same target and tracking continues. A target stationary threshold σ_static, a target motion threshold σ_moving, and a target accumulated stationary frame count static_min are defined, and each tracked foreground target q is assigned a variable static_q that records its motion/stationary state. If a foreground target q′ in the current frame satisfies IOU(q, q′) ≥ σ_IOU with the tracked foreground target q of the previous frame, they are the same foreground target; meanwhile, if IOU(q, q′) ≥ σ_static, then static_q = static_q + 1, indicating that the foreground target q has been stationary for one more frame, whereas if σ_static ≥ IOU(q, q′) ≥ σ_moving, the foreground target q is in motion and static_q is reset to static_q = 0. If static_q ≥ static_min, the foreground target q meets the stationary retention condition; the foreground targets stationary for 2 s or longer are retained, finally yielding the stationary foreground candidate frame set C.
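A minimal sketch of this stationarity bookkeeping follows; the values of σ_IOU and σ_moving are illustrative assumptions, while σ_static = 0.7 and static_min = 2 follow the parameter study in Example 3:

```python
# Track one foreground target across frames and count its stationary frames.
def iou(a, b):
    """IOU of two frames given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

SIGMA_IOU, SIGMA_MOVING, SIGMA_STATIC, STATIC_MIN = 0.3, 0.3, 0.7, 2

def update_static_count(prev_frame, cur_frame, static_q):
    """Return the updated counter, or None when the target is lost."""
    overlap = iou(prev_frame, cur_frame)
    if overlap < SIGMA_IOU:
        return None                   # not the same target; stop tracking
    if overlap >= SIGMA_STATIC:
        static_q += 1                 # stationary for one more frame
    elif overlap >= SIGMA_MOVING:
        static_q = 0                  # target in motion: reset the counter
    return static_q                   # retain the frame once static_q >= STATIC_MIN
```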
6. the method for detecting the highway casting based on the cascade difference perception model according to claim 5, wherein the step S5 is as follows:
A median image S is generated from the current frame and the 4 frames preceding it; the median image S and the background image L generated by KNN modeling are input into the difference perception model, which outputs the difference perception image P_scene of the full-image region. The difference perception model compares feature layers of the VGG16 model extracted from its two input images (Equation 4), where the function F^(i)(·) takes an image as input and outputs the i-th feature layer of the VGG16 model, the i-th feature layer containing M_i feature maps, N denotes the number of network layers used, and the pixel values of the output difference perception image P(L, S) are normalized to [0, 255];
The median image S is calculated as follows:
S = Median(R_{t−4}, ..., R_t) (Equation 5)

where R_t denotes the video frame image of the current t-th frame; the function Median(·) takes 5 images as input and, for each (x, y) coordinate position, outputs the median of the values of the 5 pixels there as the value of the output median image S at (x, y);
The difference perception image P_region of the foreground candidate frame regions, obtained by inputting the foreground candidate frame region images into the difference perception model, is given by:

p_n′ = P(l_n′, s_n′) (Equation 6)

where L_C denotes the set of background-image patches within the foreground candidate frame regions of the background image L generated by KNN modeling, S_C denotes the set of median-image patches within the foreground candidate frame regions of the median image S, l_n′ denotes the background-image patch in the region of foreground candidate frame c_n′, s_n′ denotes the median-image patch in that region, and p_n′ denotes the difference perception image of foreground candidate frame c_n′.
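Because the exact form of the published difference perception formula is not reproduced above, the sketch below shows only one plausible reading under stated assumptions: mean absolute differences of selected VGG16 feature layers, upsampled to image size, summed, and normalized to [0, 255]; the layer selection and aggregation are assumptions, not the invention's Equation 4:

```python
# Plausible sketch of a VGG16 feature-difference perception image; the layer
# choice, aggregation, and omitted input normalization are assumptions.
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
LAYERS = [3, 8, 15]  # illustrative choice of VGG16 feature layers

@torch.no_grad()
def difference_perception(bg: torch.Tensor, med: torch.Tensor) -> torch.Tensor:
    """bg, med: (1, 3, H, W) float tensors; returns an (H, W) map in [0, 255]."""
    h, w = bg.shape[-2:]
    total = torch.zeros(1, 1, h, w)
    x, y = bg, med
    for i, layer in enumerate(vgg):
        x, y = layer(x), layer(y)
        if i in LAYERS:
            diff = (x - y).abs().mean(dim=1, keepdim=True)  # average the M_i maps
            total += F.interpolate(diff, size=(h, w), mode="bilinear",
                                   align_corners=False)
        if i == max(LAYERS):
            break
    total = total[0, 0]
    return 255 * (total - total.min()) / (total.max() - total.min() + 1e-8)
```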
7. The method for detecting the highway casting based on the cascade difference perception model according to claim 6, wherein the step S6 is as follows:
Automatic binarization is applied with the Otsu(·) operation to the difference perception image of the foreground candidate frame regions and to that of the full-image region respectively; a bitwise-AND logical operation is performed on the two resulting binary images; the ANDed binary image is input into the function Rect(·) for rectangular frame extraction, and the extracted difference frames DF are output. The extraction of the difference frames DF is expressed as:

DF = Rect(Otsu(P_region) ∩ Otsu(P_scene)) (Equation 7).
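A direct OpenCV transcription of Equation 7 might look as follows; only the use of Otsu thresholding, bitwise AND, and bounding-rectangle extraction is taken from the text:

```python
# Otsu-binarize both difference perception images, AND them, and extract the
# bounding rectangles of the surviving regions as difference frames DF.
import cv2

def difference_frames(p_region, p_scene):
    """p_region, p_scene: uint8 single-channel difference perception images."""
    _, b1 = cv2.threshold(p_region, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    _, b2 = cv2.threshold(p_scene, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    both = cv2.bitwise_and(b1, b2)                       # bitwise AND
    contours, _ = cv2.findContours(both, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours]       # difference frames DF
```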
8. The method for detecting the highway casting based on the cascade difference perception model according to claim 7, wherein the step S7 is as follows:
Solving the spatial overlapping degree IOU of the difference frame and the foreground candidate frame, inputting the spatial overlapping degree IOU value into an optimization algorithm, outputting whether the foreground candidate frame is matched with the difference frame or not, and reserving the foreground candidate frame matched with the difference frame;
The spatial overlap degree IOU of the difference frames and a foreground candidate frame is solved as:

x_j = max_k IOU(c_l, d_k) (Equation 8)

where x_j, the j-th sample, specifically denotes the maximum spatial overlap degree IOU between a foreground candidate frame c_l of the current frame and all difference frames d_k;
The optimization algorithm is defined as Equation (9); x_j is input into Equation (9), which outputs one of two classes y_j ∈ {1, −1}, where 1 denotes a match and −1 a mismatch. If y_j = 1, the foreground candidate frame overlaps a difference frame to a high degree, and the foreground candidate frame c_l is retained and re-marked as dc_l; if y_j = −1, all difference frames overlap the foreground candidate frame to a low degree, and the foreground candidate frame c_l is excluded:

min over ω, b, ξ of (1/2)‖ω‖² + ε Σ_j ξ_j, subject to y_j(ω·φ(x_j) + b) ≥ 1 − ξ_j and ξ_j ≥ 0 (Equation 9)

where ε is the penalty factor, ω and b are the first and second coefficients of the separating hyperplane, ξ_j is the slack coefficient of the j-th sample, i.e., the distance by which a sample is allowed to deviate from the correct boundary of the class in which it lies, and φ(x_j) is a kernel mapping of x_j from low dimension to high dimension; the kernel adopts a Gaussian kernel function. The finally retained foreground candidate frame set is denoted DC.
9. The method for detecting the highway casting based on the cascade difference perception model according to claim 8, wherein the step S8 is as follows:
and inputting the foreground candidate frames into a road surface and non-road surface classification network, judging the foreground candidate frames which are finally classified into non-road surfaces as object throwing frames, and marking the object throwing frames in a picture, wherein the classification network is realized by adopting a ResNet34 network model.
10. The method for detecting the highway casting based on the cascade difference perception model according to claim 8, wherein in step S8 highway road-surface pictures are collected under weather conditions including sunny, cloudy, rainy, and foggy days, together with non-road-surface pictures including cartons, tires, branches, helmets, road cones, plastic bags, water bottles, and wads of waste paper, and all the collected images are divided into a training dataset and a test dataset at a ratio of 10:1.