CN115331141A - High-altitude smoke and fire detection method based on improved YOLO v5 - Google Patents

High-altitude smoke and fire detection method based on improved YOLO v5

Info

Publication number
CN115331141A
CN115331141A (application CN202210927443.XA)
Authority
CN
China
Prior art keywords
smoke
network
detection
yolo
fire
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210927443.XA
Other languages
Chinese (zh)
Inventor
陈柯亘
张静朗
高艺
王旗龙
杨柳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210927443.XA priority Critical patent/CN115331141A/en
Publication of CN115331141A publication Critical patent/CN115331141A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/18: Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N 7/181: Closed-circuit television [CCTV] systems for receiving images from a plurality of remote sources

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Fire-Detection Mechanisms (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-altitude smoke and fire detection method based on improved YOLO v5, comprising the following steps: apply a multi-thread queue algorithm to the multiple high-altitude camera video streams, set up one frame-extraction and upload thread for each video stream, and push the extracted frames from all threads into an unbounded queue; build a smoke and fire data set at a high-altitude angle, and perform data cleaning and data set labeling; modify the backbone network of the YOLO v5 network, replacing the standard convolution layer after each CSP structure with a hybrid attention module; set the training hyper-parameters, train the improved YOLO v5 network of the third step on the smoke and fire data set built in the second step to obtain a smoke and fire detection model, and input the pictures obtained in the first step into the model for smoke and flame detection. The invention extracts single frames from the video streams and detects fire anomalies with the improved target recognition algorithm.

Description

High-altitude smoke and fire detection method based on improved YOLO v5
Technical Field
The invention relates to neural network and image recognition technology, belongs to the field of image processing and deep learning, and particularly relates to a high-altitude smoke and fire detection method based on improved YOLO v5.
Background
Application CN202110589355.9, titled "A video-based smoke and fire detection method", proceeds through image acquisition, image combination, smoke and fire target detection, and deep learning in sequence. It replaces the three-channel color input with a multi-channel image combining frames of the same camera at different time points, and tracks the smoke and fire target with a combination of several detection algorithms, mainly for safety monitoring.
Although that patent uses several detection algorithms including YOLO, it does not change their network structures; it only optimizes the channel information of the input pictures and combines multi-frame images, so it cannot fundamentally and substantially improve smoke and fire detection accuracy. Moreover, the algorithm considers only smoke and fire tracking on a single camera video stream, whereas practical deployments must handle multiple video streams. Its tracking method is therefore limited, and its effectiveness cannot be guaranteed at high-altitude angles where the smoke and fire targets are small.
Fire and smoke detection is an important part of fire prevention technology. In recent years fires have occurred more frequently in residential communities and industrial electrical equipment has grown more complex and diverse, so existing sensor-based detection equipment struggles to cope with the new fire-fighting situation. Scholars at home and abroad have therefore carried out extensive research on fire detection and made great progress.
Existing fire detection algorithms can be broadly divided into two categories: feature-based models and learning-based models [1]. The former analyzes characteristics of flame and smoke and realizes fire early warning by summarizing feature rules of the fire area; the latter obtains a classifier or feature extraction model with learning ability through an intelligent algorithm. Chen et al. statistically analyzed a large number of flames [2] and proposed flame identification rules based on a color model, but because the algorithm considers only color information, the false alarm rate of the detection result is high. Dimitropoulos et al. exploited the dynamic pulsation of flames [3] and proposed a flame boundary detection model based on wavelet time-frequency characteristics. To achieve early warning, Töreyin et al. exploited the large amount of smoke present early in a fire [4] and proposed fire early warning based on a smoke model.
With the rapid development of target detection technology, the single-feature identification of traditional methods can hardly achieve high accuracy; judging smoke and flame through high-level features of smoke and fire targets is more effective, but several challenges remain.
Because of some uncontrolled conditions, a single fixed method cannot adapt to complex and variable high-altitude camera scenes. Different illumination changes the edges and shadows of objects and affects imaging quality and image processing [9]; occlusion of the target makes global feature recognition difficult for the model and causes larger detection errors [10]; and the appearance (color, texture, etc.) of the same type of target differs greatly between viewing angles, making target identification inaccurate.
Most algorithms have low accuracy for small-target detection, with missed and false detections. Although deep-learning-based target detection has gradually replaced traditional hand-crafted feature extraction as the mainstream, there is still room for improvement on small targets [11]; the causes include low resolution, blurred images, little information, and much noise. Image data enhancement helps to some extent, but many studies show that processing the original data alone has limited effect, and changing the neural network structure addresses the problem better.
When a large amount of real-time video data is analyzed online, a large number of pictures must be detected quickly. For example, when capturing multiple RTSP video streams, the algorithm must ensure that every video stream has an extracted frame detected within a specific time interval, guaranteeing the efficiency and stability of the system.
These technical problems restrict the development of smoke and fire detection technology, and a technical scheme is urgently needed so that deep-learning-based smoke and fire detection systems can be better used for fire early warning.
Continuous development in image processing and deep learning in recent years has further upgraded fire detection technology. Advanced target detection networks of recent years include the FPN (Feature Pyramid Network) [5], the Fast R-CNN network [6], the SSD network [7], and the YOLO series [8]. The YOLO series has iterated through multiple versions; YOLO v5, the latest at present, has strong advantages in rapid model deployment, producing models with small file size, high inference speed, and short training time, while the trained target detection model maintains both high accuracy and good performance.
Using an efficient target detection algorithm improves the accuracy of smoke and fire identification; once further improved, fire early warning can be realized with high-altitude cameras. For multiple video streams, a reasonable frame-extraction interval can be set, frames extracted from each stream with multithreading, the images stored in a dedicated queue, and smoke and fire detection performed in order.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a high-altitude smoke and fire detection method based on improved YOLO v5, which extracts single frames from video streams and completes detection of fire anomalies through an improved target recognition algorithm.
The purpose of the invention is realized by the following technical scheme.
The high-altitude smoke and fire detection method based on improved YOLO v5 of the invention comprises the following processes:
The first step: apply a multi-thread queue algorithm to the multiple high-altitude camera video streams, set up one frame-extraction and upload thread for each video stream, and push the extracted frames from all threads into an unbounded queue.
The second step: build a smoke and fire data set at a high-altitude angle, and perform data cleaning and data set labeling.
The third step: modify the backbone network of the YOLO v5 network, replacing the standard convolution layer after each CSP structure with a hybrid attention module.
After feature extraction in the backbone network, the Neck part completes feature fusion, target prediction is completed in the Head part using the CIoU loss function, and the optimal target boxes are screened from the candidate boxes by the NMS (non-maximum suppression) algorithm to form the final detection result.
The fourth step: set the training hyper-parameters, train the improved YOLO v5 network of the third step on the smoke and fire data set built in the second step to obtain a smoke and fire detection model, and feed the pictures obtained in the first step into the model for smoke and flame detection.
In the first step, the number of video streams is determined and a separate thread is created for each stream so that image uploads from the streams do not interfere with each other; in addition, the frame-extraction interval of each stream is adjusted according to the number of streams and the detection time of the YOLO v5 network.
In the second step, the initially collected smoke and fire images are cleaned, the data set images matching the high-altitude angle are converted to jpg format, and the images are annotated with a labeling tool to produce corresponding xml files.
In the third step, the YOLO v5 network adopts the officially provided pre-trained model YOLO v5l. The network comprises a backbone, Neck, and Head connected in sequence. The backbone consists mainly of a Focus structure and CSP structures and extracts features from training images or images to be recognized; the standard convolutional layer behind each CSP structure is replaced by a hybrid attention module, formed by a channel attention module and a spatial attention module connected in series.
The Neck uses an FPN + PAN structure to perform feature fusion and passes the computed features to the Head. The Head's main body is a set of detectors performing gridded prediction on the feature maps; this step repeats until detection-box coordinates are generated, and finally redundant target boxes are removed by CIoU-NMS non-maximum suppression before the detection result is output.
The hyper-parameters of the fourth step include the depth and width of the network, as well as learning_rate, batch_size, epoch, and the training data set used.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
the invention provides a high-altitude smoke and fire detection method based on improved YOLO v 5. On one hand, in order to improve the accuracy of the trained model smoke and fire detection, a lightweight attention module is introduced into the original backbone network; the attention module enables the features to cover more parts of the object to be recognized, and can effectively improve the feature extraction capability of the neural network on the input image, thereby improving the performance of the model facing a small target detection scene and reducing the false alarm and missing report of high-altitude smoke and fire detection. On the other hand, the model in the practical application scene usually needs to capture multiple video streams for respective detection, the original algorithm can only identify single-path video streams, and the problems of picture backlog and data buffer overflow occurring when pictures transmitted by multiple cameras are processed are avoided by introducing the multi-thread queue algorithm, so that the stability of the system when multiple paths of video streams are transmitted simultaneously is ensured.
By means of a high-quality high-altitude firework data set, an attention mechanism is introduced into a backbone network of the YOLO v5, and the detection accuracy is improved. On the basis of optimizing each link of smoke and fire detection, a smoke and fire detection method can be designed so as to be deployed on a server to construct a smoke and fire detection system. The improved smoke and fire detection method can effectively detect smoke and flame at the initial stage of a fire at the angle of a high-altitude camera so as to achieve the early warning effect and promote the development of a fire prevention technology.
Drawings
FIG. 1 is a flow chart of the high altitude smoke and fire detection method of the present invention based on the modified YOLO v 5;
FIG. 2 is a flow chart of a multi-threaded queue algorithm;
FIG. 3 is a schematic diagram of replacing the standard convolutional layer with the hybrid attention module;
FIG. 4 is a schematic view of the channel attention module;
FIG. 5 is a schematic view of the spatial attention module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The invention starts from several links of the detection process: improve smoke and fire detection accuracy, guarantee the robustness of the detection system, let it detect small-target smoke and flame at a high-altitude angle with high accuracy, and keep it running efficiently when facing multiple video streams. The technical scheme therefore introduces an attention mechanism into the YOLO v5 backbone by adding several attention modules, and changes the video-stream processing algorithm in the smoke and fire detection flow, aiming to improve the accuracy and stability of the method on multi-video-stream tasks.
The backbone of the original YOLO v5 contains several standard convolution layers, which can be described as filters at each level; they reduce feature dimensionality by compressing the amount of data and parameters, and strengthen the nonlinearity of the network while deepening it. The attention mechanism lets the neural network dynamically focus on the parts of the input that help identify the current target [12], with those parts determined by correlation, so changing the order and length of the inputs improves the efficiency of the neural network. An attention mechanism is usually introduced into a neural network by adding an attention module; by principle, attention modules can be classified into spatial attention modules, channel attention modules, and hybrid attention modules. In improving the YOLO v5 backbone, the original standard convolutional layers are replaced with hybrid attention modules to improve the performance of the network layers; at the same time, the module is lightweight, so its influence on inference time is slight and can generally be ignored, as the subsequent experiments confirm.
When detecting smoke and fire, the original YOLO v5 can only handle a single video stream, while practical scenes require 60-80 streams, which can cause data-buffer problems and even crash the detection system. Multithreading is a software or hardware technique for executing multiple threads concurrently [13]; a computer with multithreading capability can run more than one thread at a time thanks to hardware support, improving overall processing performance. Based on the multithreading principle, an unbounded queue stores the frames extracted from the multiple video streams; considering the detection time of the YOLO v5 network and the number of streams, a reasonable frame-extraction interval is set so that the picture queue does not grow without bound in practical applications.
By modifying the structure of the original target detection network, the invention provides a high-altitude smoke and fire detection method based on improved YOLO v5 for the case of multiple high-altitude camera video streams, solving the detection of small smoke and fire targets at the initial stage of a fire. As shown in fig. 1, it comprises the following processes:
The first step: the multi-thread queue algorithm
A multi-thread queue algorithm is applied to the multiple high-altitude camera video streams: one frame-extraction and upload thread is set up for each video stream, and the extracted frames from all threads are pushed into an unbounded queue. The number of video streams must be determined and a separate thread created for each stream, so that image uploads do not interfere with each other. In addition, the frame-extraction interval of each stream is set so that the YOLO v5 network can finish detecting the extracted frames in time, and it must be adjusted according to the number of streams, the detection time of the YOLO v5 network, and so on.
The flow of the multi-thread queue algorithm is shown in fig. 2. In the invention, the pictures passed into the YOLO v5 network for detection are frames extracted from multiple high-altitude camera video streams. To guarantee concurrency across streams, a queue stores the pictures and a thread is created for each camera. Within a thread, the RTSP video stream of the corresponding camera is first determined; the frame rate of the stream is then obtained and multiplied by the frame-extraction interval to get the number of frames between captures, a picture is captured every that many frames in a loop, and the captured frame is finally pushed into the queue. Because the frames from the video-stream threads are not handed directly to the picture-detection thread but placed into the picture queue, the link between multi-stream frame capture and YOLO v5 picture detection is completed indirectly through the queue's enqueue and dequeue operations, which preserves concurrency among the high-altitude cameras.
The multi-thread queue algorithm is implemented in Python, where data is shared among threads; when several threads exchange data through one queue, safety and consistency must be managed to avoid confusion and loss. Beyond the control inherent in the multithreading method, the following concurrency control mechanism is set up with the queue library: with only two threads, while one thread pushes pictures into the queue it signals the other, and after it has pushed the specified number of pictures the other thread pushes; with three or more threads the scheme is extended, signaling all threads that are not currently pushing.
In addition, to ensure that frames from each video stream are detected in time and to avoid a long backlog of newly enqueued pictures as the queue grows, the frame-extraction interval t_i of each video stream must satisfy:

t_i = n(t_p + t_l) - t_e + t_d

where t_p is the average time the YOLO v5 network needs to detect one picture, t_l the average time of a dequeue operation on the picture queue, n the number of video streams, t_e the average time of an enqueue operation, and t_d a manually adjustable reserve time that accounts for instability in data transmission and image detection speed.
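As a concrete illustration, the following minimal Python sketch wires these pieces together: per-stream capture threads, a shared unbounded queue, and the interval formula. The stream URLs, timing constants, and helper names are illustrative assumptions, not the patent's actual code (note that Python's queue.Queue is already internally synchronized; the patent's extra signaling scheme is omitted here).

```python
import threading
import queue

import cv2  # OpenCV for RTSP capture

frame_queue = queue.Queue()  # unbounded queue shared by all capture threads


def frame_interval(n, t_p, t_l, t_e, t_d):
    """Frame-extraction interval t_i = n*(t_p + t_l) - t_e + t_d (formula above)."""
    return n * (t_p + t_l) - t_e + t_d


def capture_thread(stream_url, interval_s):
    """Push one frame from an RTSP stream into the shared queue every interval_s seconds."""
    cap = cv2.VideoCapture(stream_url)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0       # fall back if the stream reports no FPS
    frames_per_grab = max(1, int(fps * interval_s))
    count = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        count += 1
        if count % frames_per_grab == 0:          # capture every frames_per_grab-th frame
            frame_queue.put((stream_url, frame))  # enqueue for the detection thread
    cap.release()


# Hypothetical deployment: 4 streams, 50 ms detection, 1 ms queue ops, 0.5 s reserve.
urls = ["rtsp://camera-%d/stream" % i for i in range(4)]
t_i = frame_interval(n=len(urls), t_p=0.05, t_l=0.001, t_e=0.001, t_d=0.5)
for url in urls:
    threading.Thread(target=capture_thread, args=(url, t_i), daemon=True).start()
```

The detection side simply calls frame_queue.get() in its own loop, so capture and detection stay decoupled exactly as the flow in fig. 2 describes.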
The second step: building the high-altitude-angle smoke and fire data set and data enhancement
A smoke and fire data set at a high-altitude angle is built, with data cleaning and labeling. The initially collected smoke and fire images are cleaned, the data set images matching the high-altitude angle (i.e., at least 500 meters from the smoke and fire target) are converted to jpg format, and the images are annotated with a labeling tool to form corresponding xml files.
Considering that the trained model will be used in real scenes, 1800 images were collected from the Internet and public cameras; 1264 remained after further screening, forming the data set for training the model. Each image was labeled with the Genie labeling assistant with two classes, smoke and flame, producing one xml label file per image containing the four coordinates of each target box and the class of the object in the box, in PASCAL VOC format. The set was randomly divided into 948 training and 316 validation images.
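For concreteness, a minimal sketch of the random 948/316 split might look as follows; the directory layout and output file names are assumptions for illustration.

```python
import random
from pathlib import Path

images = sorted(Path("fire_dataset/images").glob("*.jpg"))  # the 1264 cleaned images

random.seed(0)
random.shuffle(images)

split = int(len(images) * 0.75)                 # roughly 948 train / 316 val
train, val = images[:split], images[split:]

for name, subset in (("train.txt", train), ("val.txt", val)):
    Path(name).write_text("\n".join(str(p) for p in subset))
```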
Data enhancement is performed before images enter network training to enrich the data distribution, significantly improving the generalization and robustness of the model and preventing overfitting. The Mosaic data enhancement method stitches images by random scaling, random cropping, and random arrangement; adaptive image scaling uniformly scales the original image to a standard size, and trimming the padded borders can improve inference speed by about 40%.
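The adaptive scaling step can be sketched as a standard letterbox resize; the target size and padding value below are common YOLO v5 defaults, assumed here for illustration rather than taken from the patent.

```python
import cv2
import numpy as np

def letterbox(img, new_size=640, pad_value=114):
    """Uniformly scale img to fit new_size x new_size, padding the borders."""
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)              # single scale factor, no distortion
    nh, nw = int(round(h * r)), int(round(w * r))
    resized = cv2.resize(img, (nw, nh))
    top, left = (new_size - nh) // 2, (new_size - nw) // 2
    out = np.full((new_size, new_size, 3), pad_value, dtype=img.dtype)
    out[top:top + nh, left:left + nw] = resized
    return out, r, (left, top)                       # scale/offset to map boxes back
```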
The third step: introducing the attention mechanism into YOLO v5
The backbone network (Backbone) of the YOLO v5 network is modified, replacing the standard convolutional layer after each CSP structure with a hybrid attention module. After feature extraction in the backbone, the Neck part completes feature fusion; finally, target prediction is completed in the Head part with the CIoU loss function, and the optimal target boxes are screened from the candidates by NMS non-maximum suppression to form the final detection result.
The YOLO v5 network can adopt the officially provided pre-trained model YOLO v5l and comprises a backbone, Neck, and Head connected in sequence. The Backbone consists mainly of a Focus structure and CSP structures and extracts the features of training images or images to be recognized. When image data enter the Backbone, the Focus module slices the image, and a convolution then yields a double-downsampled feature map without information loss. The CSP structures in the Backbone help the network realize richer gradient combinations, reduce computation, and improve inference speed and precision. The standard convolutional layer behind each CSP structure is replaced with a hybrid attention module [14], as shown in fig. 3. The hybrid attention module is formed by connecting the channel attention module and the spatial attention module in series, combining the characteristics of both; it therefore performs better while remaining lightweight, and effectively improves the network's ability to allocate learned weights.
The structure of the Neck is FPN + PAN [15,16]; it performs feature fusion and passes the computed features to the Head part. The Head's main body is a detector performing gridded prediction on the feature map; the step repeats until detection-box coordinates are generated, and finally redundant target boxes are removed with CIoU-NMS non-maximum suppression before the detection result is output.
Many attention modules are plug-and-play: network performance can be changed by replacing a standard convolutional layer with such a module in a specific way. Here the activation function is also changed from SiLU to Leaky ReLU.
SiLU: f(x) = x·σ(x), where σ denotes the sigmoid function.
Leaky ReLU: f(x) = x for x ≥ 0, and f(x) = αx for x < 0, where α is a small positive slope.
(1) Channel attention module. As shown in fig. 4, given an input feature map of size h x w x c, max pooling and average pooling (pooling size h x w) first produce two 1 x 1 x c descriptors. Each passes through two shared fully connected layers: the first reduces the dimension to c/r neurons (r is a set reduction ratio) and the second restores it to c neurons. This bottleneck adds nonlinear processing and can fit complex inter-channel correlations. A sigmoid layer then yields a 1 x 1 x c attention map, which is finally multiplied element-wise with the original h x w x c feature map, weighting the channels by importance. The parameters of the two fully connected layers are learned and updated with the final classification loss.

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max)))

where AvgPool(F) and MaxPool(F) denote average pooling and max pooling of the input feature map F, MLP denotes the shared multilayer perceptron, and σ the sigmoid activation function. W_0 and W_1 are the MLP's weight assignments: W_0 represents the first fully connected layer and W_1 the second, with Leaky ReLU activations throughout. F^c_avg and F^c_max denote the average-pooled and max-pooled feature maps in the channel attention module, and M_c(F) is the computed one-dimensional channel attention map.
(2) Spatial attention module. As shown in fig. 5, its input is the feature map output by the channel attention module. Max pooling and average pooling are first performed along the channel dimension, each producing an h x w x 1 map; the two pooled maps are then concatenated along the channel axis into an h x w x 2 map. A convolution with kernel size 7 x 7 and a single output channel follows, then a sigmoid activation, and finally an element-wise multiplication with the input. This channel concatenation assigns weights better from the perspective of the module as a whole.

M_s(F) = σ(f^{7x7}([AvgPool(F); MaxPool(F)])) = σ(f^{7x7}([F^s_avg; F^s_max]))

where AvgPool(F) and MaxPool(F) denote average pooling and max pooling of the input feature map F along the channel dimension, f^{7x7} is the 7 x 7 convolution applied to the two concatenated maps, and σ is the sigmoid activation function. F^s_avg and F^s_max denote the average-pooled and max-pooled feature maps in the spatial attention module, and M_s(F) is the computed two-dimensional spatial attention map.
(3) The channel attention module and the spatial attention module are connected in series to form the hybrid attention module:

F' = M_c(F) ⊗ F
F'' = M_s(F') ⊗ F'

where M_c(F) is the one-dimensional channel attention map of size c x 1 x 1, and M_s(F') the two-dimensional spatial attention map of size 1 x h x w. F' is the feature obtained after the input feature F passes through the channel attention module, F'' the feature obtained after F' passes through the spatial attention module, and ⊗ denotes element-wise multiplication.
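A PyTorch sketch of the hybrid attention module defined by these equations follows: channel attention and spatial attention in series, with Leaky ReLU in the shared MLP as the text specifies. The reduction ratio r and the 7 x 7 kernel are the usual CBAM defaults, assumed here for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        # Shared MLP W_1(W_0(.)) applied to both pooled descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.LeakyReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # AvgPool branch
        mx = self.mlp(x.amax(dim=(2, 3)))               # MaxPool branch
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # M_c(F)
        return x * m_c                                  # F' = M_c(F) ⊗ F

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)               # h x w x 1 average map
        mx = x.amax(dim=1, keepdim=True)                # h x w x 1 max map
        m_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s(F')
        return x * m_s                                  # F'' = M_s(F') ⊗ F'

class HybridAttention(nn.Module):
    """Channel then spatial attention in series, replacing a backbone conv layer."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```

In the modified backbone, one such module would sit after each CSP block in place of the standard convolution, which is why the lightweight design matters for inference time.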
The Neck behind the Backbone is the connecting link of the target detection framework; its network design adopts the FPN + PAN structure. FPN is the classic feature pyramid structure and constructs high-level semantic feature maps at all scales. After the multi-layer network in the middle of FPN, the bottom-level feature information becomes very blurred, and PAN helps compensate by strengthening the localization information. The Neck mixes and combines the important features extracted by the Backbone, benefiting the Head's subsequent learning of specific tasks such as classification and regression.
After fusion the features are input to the Head part, which contains several detectors; gridded prediction produces the prediction boxes, which are then combined. The loss function adopted is CIoU: compared with computing only IoU (intersection over union), it considers not only the overlap area and the center-point distance of the bounding boxes but also introduces an aspect-ratio penalty term, and tends to optimize toward increasing the overlap area.
The weight function α:

α = v / ((1 - IoU) + v)

The consistency of aspect ratio v:

v = (4/π²)(arctan(w^gt/h^gt) - arctan(w/h))²

The final CIoU loss is defined as:

L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv

where IoU is the intersection over union of the bounding boxes, ρ²(b, b^gt) is the squared distance between the center points of the two bounding boxes (ρ denotes the Euclidean distance), and c is the diagonal length of the smallest rectangle enclosing both boxes. w^gt and h^gt are the width and height of the box labeled in the original data set (the ground truth), and w and h the width and height of the detection box predicted by the model.
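A sketch of the CIoU loss assembled from the formulas above, for axis-aligned boxes given as (x1, y1, x2, y2) tensors; this is a standard formulation written for illustration, not the authors' code.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """Per-box CIoU loss for box tensors of shape (N, 4) in (x1, y1, x2, y2) form."""
    # Intersection and union -> IoU
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # rho^2: squared distance between box centers
    cxp = (pred[:, 0] + pred[:, 2]) / 2
    cyp = (pred[:, 1] + pred[:, 3]) / 2
    cxt = (target[:, 0] + target[:, 2]) / 2
    cyt = (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2

    # c^2: squared diagonal of the smallest enclosing rectangle
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency v and weight alpha
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```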
Redundant prediction boxes are removed with the NMS non-maximum suppression algorithm, which screens out the optimal target detection boxes by confidence.
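A minimal confidence-sorted NMS sketch matching this description follows; the IoU threshold is an illustrative assumption.

```python
import torch

def nms(boxes, scores, iou_thr=0.45):
    """Greedy NMS on (N, 4) boxes in (x1, y1, x2, y2) form; returns kept indices."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU of the top-scoring box against the remainder
        x1 = torch.max(boxes[i, 0], boxes[rest, 0])
        y1 = torch.max(boxes[i, 1], boxes[rest, 1])
        x2 = torch.min(boxes[i, 2], boxes[rest, 2])
        y2 = torch.min(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]  # drop boxes overlapping the kept one too much
    return keep
```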
The fourth step: set the training hyper-parameters and train the improved YOLO v5 network of the third step on the smoke and fire data set built in the second step; after training, a smoke and fire detection model is obtained, and the pictures obtained in the first step are input to it for smoke and flame detection. The hyper-parameters to set include the depth and width of the network, learning_rate, batch_size, epoch, the training data set used, and so on; if necessary, the loss function can be modified and the non-maximum suppression algorithm replaced. After setup, the newly improved network is trained, and comparison and ablation experiments are designed to verify the effect of the smoke and fire detection method.
To balance computation and detection precision within an allowable range, the depth and width of the network are kept unchanged: the pre-trained model YOLO v5l provided by the YOLO v5 officials is adopted, keeping the network depth and width parameters at 1.00. To achieve the best model performance, preliminary experiments explored better hyper-parameters. The optimal learning rate was searched within a fixed range, and learning_rate = 0.001 was finally selected; to prevent overfitting and account for hardware, batch_size was set to 4; preliminary experiments showed the loss value drops sharply over epochs 0-100 and decreases slowly until stabilizing over epochs 100-120, so epoch was set to 120.
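The resulting configuration can be summarized in a small dictionary; the key names and the training invocation in the comment are assumptions based on the public YOLO v5 repository interface, not the authors' actual script.

```python
hyperparams = {
    "weights": "yolov5l.pt",    # official pre-trained model, depth/width multipliers 1.00
    "learning_rate": 0.001,     # selected by searching a fixed numerical range
    "batch_size": 4,            # small to prevent overfitting and fit the hardware
    "epochs": 120,              # loss plateaus around epochs 100-120
    "data": "smoke_fire.yaml",  # 948 training / 316 validation images; classes: smoke, fire
}
# A possible launch with the public ultralytics/yolov5 repository might be:
#   python train.py --weights yolov5l.pt --data smoke_fire.yaml --batch-size 4 --epochs 120
```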
The training experiments run under the PyTorch framework and the CUDA parallel computing architecture, with cuDNN integrated to accelerate computation. The experiments ran on Ubuntu 20.04 with an RTX 3070 graphics card, Python 3.8, CUDA 11.1, cuDNN 8.0, and PyTorch 1.8.0.
The evaluation criterion for model training is mean average precision (mAP) [17], derived from average precision (AP). Average precision takes, for each possible recall value, the corresponding maximum precision, then averages these precision values; it measures the quality of the model on a single class. The mAP additionally averages the AP values over all classes and measures the quality of the learned model across all classes.

mAP = (1/N) Σ_{q∈Q} AP(q)

where q is a single object class in multi-object detection, Q the set of all object classes, N the number of classes, and AP(q) the AP value of a single class.
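The mAP formula reduces to averaging the per-class AP values; a minimal sketch with hypothetical AP values for the two classes in this data set:

```python
def mean_average_precision(ap_per_class):
    """mAP = (1/N) * sum of AP(q) over all classes q."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical AP@0.5 values, for illustration only:
print(mean_average_precision({"smoke": 0.80, "fire": 0.75}))  # -> 0.775
```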
To verify that adding this attention module works better, with the type and number of data sets unchanged, other attention modules (SE and ECA [18,19]) were added at the same position of the backbone and compared with the original YOLO v5 network; the results of the comparison and ablation experiments are shown in Table 1. To reduce training errors, several groups of experiments with different epoch settings (100 and 120) were run, and the detection speed of each trained model was measured to obtain the inference speed.
TABLE 1
[Table 1 is rendered as an image in the original document; it reports, for the original YOLO v5 and the SE, ECA, and hybrid attention variants, the inference time and the fire, smoke, and all-classes mAP@0.5 values described in the note below.]
Note: in the table, "inferring time" is the time the trained model used to detect 418 pictures; fire mAP@0.5 and smoke mAP@0.5 are the post-training results for flame and smoke respectively; all classes mAP@0.5 is obtained by averaging over all target classes; and the number after @ is the IoU threshold for judging positive and negative samples.
Analysis of the experimental data shows that adding an attention module improves the performance of the YOLO v5 network: every network with an attention module has a higher mAP than the original network. Model inference time increases slightly after adding a module, because the new components make the model more complex; the modules have almost the same effect on time. In convergence speed there is no significant difference among the networks on this data set, probably because the target classes are too few (only two) for the difference to show. Among the modules, the selected hybrid attention module performs best when added to YOLO v5.
While the present invention has been described in terms of its functions and operations with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise functions and operations described above, and that the above-described embodiments are illustrative rather than restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
References
[1] A. Enis Çetin, Kosmas Dimitropoulos, Benedict Gouverneur, Nikos Grammalidis, Osman Günay, Y. Hakan Habiboğlu, B. Uğur Töreyin, Steven Verstockt. Video fire detection - Review. Digital Signal Processing, Volume 23, Issue 6, 2013.
[2] Chen, T. H., Wu, P. H. and Chiou, Y. C. (2004) An Early Fire-Detection Method Based on Image Processing. International Conference on Image Processing (ICIP), Taiwan, 24-27 October 2004, 1707-1710.
[3] K. Dimitropoulos, P. Barmpoutis and N. Grammalidis. Spatio-Temporal Flame Modeling and Dynamic Texture Analysis for Automatic Video-Based Fire Detection. IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 2, pp. 339-351, Feb. 2015, doi:10.1109/TCSVT.2014.2339592.
[4] Toreyin, B. U., Dedeoglu, Y. and Cetin, A. E. (2005) Flame Detection in Video Using Hidden Markov Models. Proceedings of IEEE International Conference on Image Processing, 2, 1230-1233.
[5] Golnaz Ghiasi, Tsung-Yi Lin, Quoc V. Le. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 7036-7045.
[6] Ross Girshick. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440-1448.
[7] Liu, W. et al. (2016). SSD: Single Shot MultiBox Detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds) Computer Vision - ECCV 2016. Lecture Notes in Computer Science, vol 9905. Springer, Cham.
[8] Bochkovskiy A, Wang C Y, Liao H Y M. YOLOv4: Optimal speed and accuracy of object detection[J]. arXiv preprint arXiv:2004.10934, 2020.
[9] Arad B, Kurtser P, Barnea E, et al. Controlled lighting and illumination-independent target detection for real-time cost-efficient applications: the case study of sweet pepper robotic harvesting[J]. Sensors, 2019, 19(6): 1390.
[10] Baqué P, Fleuret F, Fua P. Deep occlusion reasoning for multi-camera multi-target detection[C]. Proceedings of the IEEE International Conference on Computer Vision. 2017: 271-279.
[11] Zhang H, Zhang L, Yuan D, et al. Infrared small target detection based on local intensity and gradient properties[J]. Infrared Physics & Technology, 2018, 89: 88-96.
[12] Rush A M, Chopra S, Weston J. A neural attention model for abstractive sentence summarization[J]. arXiv preprint arXiv:1509.00685, 2015.
[13] Zhang Z, Huang K, Tan T. Multi-thread parsing for recognizing complex events in videos[C]. European Conference on Computer Vision. Springer, Berlin, Heidelberg, 2008: 738-751.
[14] Woo S, Park J, Lee J Y, et al. CBAM: Convolutional block attention module[C]. Proceedings of the European Conference on Computer Vision (ECCV). 2018: 3-19.
[15] Xu H, Yao L, Zhang W, et al. Auto-FPN: Automatic network architecture adaptation for object detection beyond classification[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 6649-6658.
[16] Yang J, Fu X, Hu Y, et al. PanNet: A deep network architecture for pan-sharpening[C]. Proceedings of the IEEE International Conference on Computer Vision. 2017: 5449-5457.
[17] Shafiee M J, Chywl B, Li F, et al. Fast YOLO: A fast you only look once system for real-time embedded object detection in video[J]. arXiv preprint arXiv:1709.05943, 2017.
[18] Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 7132-7141.
[19] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. CVPR, 2020.

Claims (5)

1. A high-altitude smoke and fire detection method based on improved YOLO v5, characterized by comprising the following processes:
the first step: applying a multi-thread queue algorithm to the multiple high-altitude camera video streams, setting up one frame-extraction and upload thread for each video stream, and pushing the extracted frames from all threads into an unbounded queue;
the second step: building a smoke and fire data set at a high-altitude angle, and performing data cleaning and data set labeling;
the third step: modifying the backbone network of the YOLO v5 network, replacing the standard convolution layer after each CSP structure with a hybrid attention module;
after feature extraction in the backbone network, the Neck part completes feature fusion, target prediction is finally completed in the Head part using the CIoU loss function, and the optimal target boxes are screened from the candidate boxes by the NMS non-maximum suppression algorithm to form the final detection result;
the fourth step: setting the training hyper-parameters, training the improved YOLO v5 network of the third step on the smoke and fire data set built in the second step to obtain a smoke and fire detection model, and inputting the pictures obtained in the first step into the smoke and fire detection model for smoke and flame detection.
2. The high-altitude smoke and fire detection method based on improved YOLO v5 of claim 1, characterized in that in the first step, the number of video streams is determined and a separate thread is created for each stream so that image uploads from the video streams do not interfere with each other; in addition, the frame-extraction interval of each video stream is adjusted according to the number of streams and the detection time of the YOLO v5 network.
3. The high-altitude smoke and fire detection method based on improved YOLO v5 of claim 1, characterized in that in the second step, the initially collected smoke and fire data set images are cleaned, the data set images matching the high-altitude angle are converted to jpg format, and the images are annotated with a labeling tool to form corresponding xml files.
4. The high-altitude smoke and fire detection method based on improved YOLO v5 of claim 1, characterized in that in the third step, the YOLO v5 network adopts the officially provided pre-trained model YOLO v5l and comprises a backbone network, Neck, and Head connected in sequence; the backbone network consists mainly of a Focus structure and CSP structures and extracts the features of training images or images to be recognized; the standard convolutional layer behind each CSP structure is replaced by a hybrid attention module formed by a channel attention module and a spatial attention module connected in series;
the structure of the Neck is FPN + PAN, performing feature fusion and passing the computed features to the Head part; the main body of the Head part is a detector performing gridded prediction on the feature map, the step repeating until detection-box coordinates are generated, and finally redundant target boxes are removed with CIoU-NMS non-maximum suppression and the detection result is output.
5. The high-altitude smoke and fire detection method based on improved YOLO v5 of claim 1, characterized in that the hyper-parameters of the fourth step include the depth and width of the network, as well as learning_rate, batch_size, epoch, and the training data set used.
CN202210927443.XA 2022-08-03 2022-08-03 High-altitude smoke and fire detection method based on improved YOLO v5 Pending CN115331141A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210927443.XA CN115331141A (en) 2022-08-03 2022-08-03 High-altitude smoke and fire detection method based on improved YOLO v5

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210927443.XA CN115331141A (en) 2022-08-03 2022-08-03 High-altitude smoke and fire detection method based on improved YOLO v5

Publications (1)

Publication Number Publication Date
CN115331141A true CN115331141A (en) 2022-11-11

Family

ID=83922489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210927443.XA Pending CN115331141A (en) 2022-08-03 2022-08-03 High-altitude smoke and fire detection method based on improved YOLO v5

Country Status (1)

Country Link
CN (1) CN115331141A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115861922A (en) * 2022-11-23 2023-03-28 南京恩博科技有限公司 Sparse smoke and fire detection method and device, computer equipment and storage medium
CN115861922B (en) * 2022-11-23 2023-10-03 南京恩博科技有限公司 Sparse smoke detection method and device, computer equipment and storage medium
CN116229192A (en) * 2022-12-12 2023-06-06 淮阴工学院 Flame smoke detection method based on ODConvBS-YOLOv5s
CN116229192B (en) * 2022-12-12 2024-06-11 淮阴工学院 ODConvBS-YOLOv s-based flame smoke detection method
CN116310785A (en) * 2022-12-23 2023-06-23 兰州交通大学 Unmanned aerial vehicle image pavement disease detection method based on YOLO v4
CN116310785B (en) * 2022-12-23 2023-11-24 兰州交通大学 Unmanned aerial vehicle image pavement disease detection method based on YOLO v4
CN117891964A (en) * 2024-01-16 2024-04-16 安徽大学 Cross-modal image retrieval method based on feature aggregation

Similar Documents

Publication Publication Date Title
Bhatti et al. Weapon detection in real-time cctv videos using deep learning
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
Hossain et al. Crowd counting using scale-aware attention networks
CN115331141A (en) High-altitude smoke and fire detection method based on improved YOLO v5
CN112767711B (en) Multi-class multi-scale multi-target snapshot method and system
CN112257569A (en) Target detection and identification method based on real-time video stream
Wang et al. Fire detection in infrared video surveillance based on convolutional neural network and SVM
CN112949453B (en) Training method of smoke and fire detection model, smoke and fire detection method and equipment
Lee et al. An attention-based recurrent convolutional network for vehicle taillight recognition
CN111385459A (en) Automatic control, focusing and photometry method for unmanned aerial vehicle cradle head
Singh et al. Animal localization in camera-trap images with complex backgrounds
CN115719457A (en) Method for detecting small target in unmanned aerial vehicle scene based on deep learning
CN112084952A (en) Video point location tracking method based on self-supervision training
CN115661611A (en) Infrared small target detection method based on improved Yolov5 network
Garg et al. Intelligent video surveillance based on YOLO: a comparative study
CN115115973A (en) Weak and small target detection method based on multiple receptive fields and depth characteristics
CN111260687A (en) Aerial video target tracking method based on semantic perception network and related filtering
CN113255549A (en) Intelligent recognition method and system for pennisseum hunting behavior state
Guo et al. Overlapped pedestrian detection based on yolov5 in crowded scenes
CN111881803A (en) Livestock face recognition method based on improved YOLOv3
CN115661188A (en) Road panoramic target detection tracking method under edge computing platform
Bosch Deep learning for robust motion segmentation with non-static cameras
CN113627383A (en) Pedestrian loitering re-identification method for panoramic intelligent security
El-Ghaish et al. Human Action Recognition Using A Multi-Modal Hybrid Deep Learning Model.
Hu et al. Gray spot detection in surveillance video using convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination