CN112052826A - Intelligent enforcement multi-scale target detection method, device and system based on YOLOv4 algorithm and storage medium - Google Patents
Info
- Publication number
- CN112052826A (application CN202010989852.3A)
- Authority
- CN
- China
- Prior art keywords
- data
- scale
- model
- yolov4
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V20/40 — Scenes; scene-specific elements in video content
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/23213 — Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
- G06F18/24 — Classification techniques
- G06F18/253 — Fusion techniques of extracted features
- G06V20/52 — Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V2201/07 — Target detection
- G06V2201/08 — Detecting or categorising vehicles
Abstract
The invention discloses an intelligent law enforcement multi-scale target detection method, detection device, detection system and storage medium based on the YOLOv4 algorithm, for detecting and raising alarms on multi-scale target objects in intelligent law enforcement scenarios. The method comprises a data collection step, a data integration step, a data annotation step, a data division step, a multi-scale feature map allocation step, a YOLOv4 model training step, a YOLOv4 model verification step and a target detection step. The method is fast and effective, can process multi-scale and large-scale data, supports multiple languages and user-defined loss functions, and is a strong alternative for multi-scale target monitoring in intelligent law enforcement.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a multi-scale object target detection method, detection device, detection system and storage medium.
Background
With economic development, the number of mobile or unfixed illegal street-vendor motor vehicles has increased rapidly. To manage these unlicensed, mobile, unfixed vendor vehicles, a video monitoring system must be built at key locations within a jurisdiction for real-time image monitoring. Urban-management law enforcement vehicles are also equipped with mobile video monitoring, so that a command center can track and manage the positions of law enforcement vehicles through GPS, monitor vehicle interior and exterior states in real time through video, and thereby achieve mobile monitoring, monitoring without a long-term fixed site, and law enforcement team management. Intelligent early warning of a fortified area means placing key defenses over the intelligent law enforcement management area and raising intelligent early warnings for various abnormal conditions through the video monitoring system, guaranteeing the safety and stability of the law enforcement management process.
On one hand, multi-scale object detection mainly relies on either traditional image processing or deep learning. Traditional detection methods include HOG, HOG+SVM, SURF, SIFT and the like. With these methods, in a long-focal-length monitoring scene the near and far targets differ greatly, so targets of multiple scales coexist in the same scene; when selecting candidate regions with a sliding window, the window size and aspect ratio cannot be set effectively, so the exhaustive sliding-window search is slow and highly redundant. Deep-learning detection methods include R-CNN, Fast R-CNN, Faster R-CNN and the like. Most of them regress against fixed anchors, which cannot adapt to large multi-scale size differences among targets, so the detection network may fail to converge or train poorly, easily causing missed and false detections.
On the other hand, in the multi-scale target detection scenario of intelligent law enforcement monitoring management, recognition is still largely manual. Because the monitored environment involves urban road traffic, manual judgment requires long observation periods, so the labor cost of monitoring a fortified area is high, and visual fatigue easily leads to erroneous and missed judgments.
Disclosure of Invention
In order to overcome the defects of the prior art, one of the objectives of the present invention is to provide a method for detecting an intelligent law enforcement multi-scale target based on the YOLOv4 algorithm, which has the advantages of high speed, good effect, capability of processing multi-scale and large-scale data, supporting multiple languages, supporting custom loss functions, etc., and is a better alternative for monitoring the intelligent law enforcement multi-scale target.
The second objective of the present invention is to provide an intelligent law enforcement multi-scale target detection device based on the YOLOv4 algorithm.
The invention further aims to provide an intelligent law enforcement multi-scale target detection system based on the YOLOv4 algorithm.
It is a further object of the present invention to provide a computer readable storage medium.
One of the purposes of the invention is realized by adopting the following technical scheme:
a method for detecting an intelligent law enforcement multi-scale target based on the YOLOv4 algorithm comprises the following steps:
a data collection step: collecting video image data of different time points and different angles of a multi-scale object target scene;
a data integration step: integrating the collected video image data;
data labeling: marking the integrated video image data and forming source data;
a data dividing step: dividing the source data into a training data set and a verification data set according to a preset proportion;
a multi-scale feature map allocation step: for the picture data set, prior box sizes are obtained with a K-means clustering algorithm, whose flow is as follows: (1) randomly select 9 prior box center points from the data set as centroids; (2) compute the Euclidean distance from each prior box center point to each centroid, assigning each point to the set of its nearest centroid; (3) once grouped, the points fall into sets (3 per scale), and the centroid of each set is recalculated; (4) set thresholds of different sizes for the large, medium and small resolutions; if the distance between each new centroid and the original centroid is smaller than the set threshold, terminate the algorithm, otherwise iterate steps (2)-(4); finally, prior boxes of 9 sizes are clustered across the different scales;
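The clustering flow above can be sketched as follows. This is a minimal illustration under stated assumptions: `boxes` is a hypothetical list of (width, height) pairs, and the distance is plain Euclidean as the description says (many YOLO implementations use 1 − IoU as the distance instead):

```python
import random

def kmeans_anchors(boxes, k=9, threshold=1e-4, seed=0):
    """Cluster (w, h) pairs into k prior-box sizes.

    boxes: list of (w, h) tuples. Distance is squared Euclidean,
    following the description above.
    """
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)          # step 1: random initial centroids
    while True:
        clusters = [[] for _ in range(k)]
        for w, h in boxes:                    # step 2: assign to nearest centroid
            d = [(w - cw) ** 2 + (h - ch) ** 2 for cw, ch in centroids]
            clusters[d.index(min(d))].append((w, h))
        new_centroids = []                    # step 3: recompute each centroid
        for i, cl in enumerate(clusters):
            if not cl:                        # keep an empty cluster unchanged
                new_centroids.append(centroids[i])
                continue
            new_centroids.append((sum(w for w, _ in cl) / len(cl),
                                  sum(h for _, h in cl) / len(cl)))
        shift = max((nw - cw) ** 2 + (nh - ch) ** 2
                    for (nw, nh), (cw, ch) in zip(new_centroids, centroids))
        centroids = new_centroids
        if shift < threshold:                 # step 4: stop when centroids settle
            return sorted(centroids, key=lambda c: c[0] * c[1])
```

Sorting the result by area lets the 9 boxes split naturally into small/medium/large triples for the three scales.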
YOLOv4 model training step: training and learning are performed on the training data set using YOLOv4, as follows: (1) input the extracted multi-size feature maps into the CSPDarknet53 backbone network; CSPDarknet53 adds a CSPNet (Cross Stage Partial network) to the Darknet53 backbone of YOLOv3, and by integrating gradient changes into the feature map end to end it reduces computation while preserving accuracy, yielding a feature pyramid at multiple scales. (2) Input the multi-scale features into SPPNet (spatial pyramid pooling network), which enlarges the network's receptive field (i.e., the recognition area of the target); alternating convolution and pooling layers reduce the feature dimensionality, yielding dimension-reduced features. (3) Accelerate the fusion of shallow and deep features through the PANet path aggregation network, obtaining fused features at different scales. (4) Finally output the training result through the fully connected layer, including bounding box regression coordinates, target classification results and confidences. (5) Compute the loss function value from the corresponding results; the loss function comprises three parts: bounding box regression loss, classification loss and confidence loss. The parameters are set to a maximum of 50000 iterations, an initial learning rate of 0.01, a batch size of 32, a decay rate of 0.0005 and a momentum of 0.9; the learning rate and batch size are adjusted appropriately according to the downward trend of the loss value. Training stops when the loss function value output on the training data set is less than or equal to the threshold or the set maximum number of iterations is reached, and the trained network model is obtained and recorded as the prediction model;
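The stopping rule and hyperparameters above can be expressed as a small sketch. `train_one_iteration` is a hypothetical callback standing in for one YOLOv4 training iteration, not part of any real YOLOv4 API; the numeric defaults are the ones stated in the description:

```python
def train(train_one_iteration, loss_threshold=0.5,
          max_iters=50000, lr=0.01, batch_size=32,
          decay=0.0005, momentum=0.9):
    """Run training until the loss drops to the threshold or the
    iteration budget (50000 by default) is exhausted."""
    for it in range(1, max_iters + 1):
        loss = train_one_iteration(lr=lr, batch_size=batch_size,
                                   decay=decay, momentum=momentum)
        if loss <= loss_threshold:        # early stop once the loss converges
            return it, loss
    return max_iters, loss                # budget exhausted
```

The `loss_threshold` value here is a placeholder; the patent leaves the actual threshold unspecified.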
YOLOv4 model verification step: verify the prediction model on the verification data set to obtain a model score, evaluate the model, screen out the model with the best prediction performance through model evaluation, and record it as the final model;
and a target detection step: and monitoring the multi-scale object target scene by using the final model, and generating alarm information when a specific target object is monitored.
Further, in the data collection step, the multi-scale object target scene is a traffic law enforcement monitoring management area. The different time points cover at least 6 conditions: morning congested, morning unobstructed, afternoon congested, afternoon unobstructed, evening congested and evening unobstructed. Congested and unobstructed conditions are distinguished by the number of vehicle targets on the road: a scene is classed as a congested section when more than 50 vehicles appear in a monitoring-camera snapshot, which typically occurs during city-center rush hours such as 8:00-10:00, 16:00-18:00 and 18:00-20:00, while other times and suburban locations are generally unobstructed. The different time points preferably also cover different weather and different seasons at the location of the multi-scale object target scene. Objects in the video image data include any combination of: car, police car, taxi, minibus, passenger coach, single-person electric vehicle, express electric vehicle, truck, sanitation vehicle, tank truck, engineering vehicle, fire truck, ambulance, police motorcycle, other non-motor vehicles and pedestrians.
Further, in the data integration step, the collected video image data are placed in the same folder; in the YOLOv4 model verification step, model evaluation is performed by three indexes, including: recall, accuracy and average accuracy.
Further, in the data annotation step, the integrated video image data are annotated for depth model training to form the source data. The annotation scope includes: the image path, the image name, the image width and height, the image dimension, the annotated object name and the xy coordinate values of the bbox. The annotated object names include any combination of: car, police car, taxi, minibus, passenger coach, single-person electric vehicle, express electric vehicle, truck, sanitation vehicle, tank truck, engineering vehicle, fire truck, ambulance, police motorcycle, other non-motor vehicles and pedestrians. Preferably, the annotation tool LabelImg may be used; its output contains essentially all the information of an object detection scene, including the picture name, picture size, picture storage path, target position coordinates and target category name. Other labeling tools, such as Labelme, may of course be used instead.
Further, in the data partitioning step, the ratio of the training data set to the validation data set is 3:1, 7:3, 8:2 or 98:2. For a small sample set (e.g., 10000 images) the split is generally 7:3 or 8:2; for a large sample set (e.g., 1000000 images) the validation proportion can be correspondingly smaller, e.g., 98:2.
Further, in the data dividing step, samples are randomly drawn from each video, with a consistent number of random samples per scene to achieve a uniform data distribution; the train/test sampling ratio for each video is 3:1.
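The per-video split described above can be sketched as follows. A minimal illustration under stated assumptions: `frames_by_video` is a hypothetical mapping from video id to its extracted frames, and every video is split at the same 3:1 ratio so each scene contributes uniformly:

```python
import random

def split_per_video(frames_by_video, train_ratio=0.75, seed=0):
    """Randomly split each video's frames into train/validation sets
    at the same ratio, so no scene is over- or under-represented."""
    rng = random.Random(seed)
    train, val = [], []
    for video, frames in frames_by_video.items():
        frames = list(frames)
        rng.shuffle(frames)                      # random sampling within the video
        cut = int(len(frames) * train_ratio)     # 3:1 split by default
        train += [(video, f) for f in frames[:cut]]
        val += [(video, f) for f in frames[cut:]]
    return train, val
```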
Further, in the multi-scale feature map allocation step, large, medium and small multi-scale feature maps are dynamically allocated for different vehicle types.
Further, in the multi-scale feature map allocation step, the prior box sizes are obtained by K-means clustering; 3 prior boxes are set for each downsampling scale, so 9 prior box sizes are clustered in total. On the COCO dataset these 9 prior boxes are: (10x13), (16x30), (33x23), (30x61), (62x45), (59x119), (116x90), (156x198) and (373x326). In dynamic assignment, the 13x13 feature map applies prior boxes (116x90), (156x198), (373x326); the 26x26 feature map applies (30x61), (62x45), (59x119); the 52x52 feature map applies (10x13), (16x30), (33x23).
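The assignment above amounts to a fixed lookup from feature-map size to an anchor triple; a minimal sketch (the function name is illustrative, the box values are the COCO priors listed above):

```python
# COCO prior boxes from the description, ordered small -> large by area
ANCHORS = [(10, 13), (16, 30), (33, 23),       # 52x52 map: small objects
           (30, 61), (62, 45), (59, 119),      # 26x26 map: medium objects
           (116, 90), (156, 198), (373, 326)]  # 13x13 map: large objects

def anchors_for_feature_map(size):
    """Return the three prior boxes applied on a feature map of the
    given spatial size (13, 26 or 52)."""
    triples = {52: ANCHORS[0:3], 26: ANCHORS[3:6], 13: ANCHORS[6:9]}
    return triples[size]
```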
Further, in the target detection step, the traffic law enforcement monitoring management area is monitored by using the final model, a camera is used as the input of the model, and when target objects with different sizes are identified, alarm information is generated.
Further, in the YOLOv4 model prediction step, the model takes a picture, a video or a camera IP address as input and outputs a target detection result. The input modes differ, but the principle is the same: process and analyze pictures. If the input is a video or camera IP address, one frame is read from the video or camera and fed to the model, which outputs a target detection result; after the analysis finishes, the next frame is read and analyzed.
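The frame-by-frame loop described above can be sketched independently of the capture library. In practice the frame source would come from a capture API such as OpenCV's `cv2.VideoCapture` opened on a file path or camera stream URL; here `frame_source`, `detect`, `on_alarm` and `watched_classes` are all hypothetical stand-ins, not part of the patent's system:

```python
def run_detection(frame_source, detect, on_alarm, watched_classes):
    """Read frames one at a time, run the model on each, and raise an
    alarm whenever a watched target class is detected."""
    results = []
    for frame in frame_source:          # one frame per iteration
        detections = detect(frame)      # [(class_name, confidence, bbox), ...]
        for cls, conf, bbox in detections:
            if cls in watched_classes:  # specific target object -> alarm
                on_alarm(frame, cls, conf, bbox)
        results.append(detections)
    return results
```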
Further, in the YOLOv4 model verification step, model evaluation is performed through three indexes, including: recall, accuracy and average accuracy.
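The three evaluation indexes can be computed from true-positive, false-positive and false-negative counts; a minimal sketch, with the usual definitions (the patent does not spell out its AP formula — the version here averages precision at each true-positive rank over a confidence-sorted detection list, a common simplification):

```python
def precision_recall(tp, fp, fn):
    """Accuracy (precision) and recall from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(ranked_hits, n_gt):
    """AP over detections sorted by descending confidence:
    ranked_hits[i] is True if the i-th detection is a true positive,
    n_gt is the number of ground-truth objects."""
    tp = 0
    precisions = []
    for i, hit in enumerate(ranked_hits, 1):
        if hit:
            tp += 1
            precisions.append(tp / i)   # precision at this recall point
    return sum(precisions) / n_gt if n_gt else 0.0
```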
The second purpose of the invention is realized by adopting the following technical scheme:
an intelligent law enforcement multi-scale target detection device based on a YOLOv4 algorithm, comprising: one or more processors, and memory for storing one or more computer programs which, when executed by the one or more processors, perform the object detection step of one of the purposes: monitoring the multi-scale object target scene by using the final model, and generating alarm information when a specific target object is monitored;
the process of establishing the final model comprises the data collection, data integration, data annotation, data division, multi-scale feature map allocation, YOLOv4 model training and YOLOv4 model verification steps as described in the first objective.
The third purpose of the invention is realized by adopting the following technical scheme:
an intelligent law enforcement multi-scale target detection system based on a YOLOv4 algorithm, comprising:
an image acquisition device for acquiring image data to be analyzed;
a computing device coupled with the image acquisition device and comprising: one or more processors, and memory for storing one or more computer programs which, when executed by the one or more processors, perform the object detection step of one of the purposes: monitoring the multi-scale object target scene by using the final model, and generating alarm information when a specific target object is monitored;
an alert device coupled with the computing device and configured to alert the alert information;
the process of establishing the final model comprises the data collection, data integration, data annotation, data division, multi-scale feature map allocation, YOLOv4 model training and YOLOv4 model verification steps as described in the first objective.
The fourth purpose of the invention is realized by adopting the following technical scheme:
a computer readable storage medium having one or more computer programs stored thereon, wherein the one or more computer programs, when executed by one or more processors, perform the object detection step of one of the purposes: monitoring the multi-scale object target scene by using the final model, and generating alarm information when a specific target object is monitored;
the process of establishing the final model comprises the data collection, data integration, data annotation, data division, multi-scale feature map allocation, YOLOv4 model training and YOLOv4 model verification steps as described in the first objective.
Compared with the prior art, the invention has the beneficial effects that:
(1) The intelligent law enforcement multi-scale target detection method based on the YOLOv4 algorithm is suitable for complex scenes and can be applied to the field of intelligent law enforcement monitoring management, as a multi-scale object detection and alarm method for such scenarios. The invention employs the prior-detection system of YOLOv4, reusing classifiers or locators to perform the detection task, with the model applied dynamically at multiple locations and scales of the image. Unlike other object detection methods, a single neural network is applied to the entire image: the network divides the image into regions, predicts the bounding boxes and probabilities for each region, and weights the bounding boxes by the predicted probabilities. This model has advantages over classifier-based systems. Unlike R-CNN, which requires thousands of evaluations for a single image, YOLOv4 predicts with a single network evaluation, making it very fast — typically 1000 times faster than R-CNN and 100 times faster than Fast R-CNN.
(2) The method for detecting the intelligent law enforcement multi-scale target based on the YOLOv4 algorithm has the advantages of high speed, good effect, capability of processing multi-scale and large-scale data, supporting multiple languages, supporting custom loss functions and the like, and is a better alternative scheme for monitoring the intelligent law enforcement multi-scale target. The specific analysis is as follows:
High speed: softmax is discarded, and multi-scale prediction is performed with anchor bboxes;
Cross-platform: suitable for Windows, Linux, macOS and a number of cloud platforms;
Multilingual: supports C++, Python, R, Java, Scala, Julia, etc.;
Good effect: has won many data science and machine learning challenges and is used in production by many companies.
(3) The intelligent law enforcement multi-scale target detection method based on the YOLOv4 algorithm further has the following advantages:
dynamically allocating anchors: and acquiring training data, performing data fitting on a training target, dynamically analyzing the characteristics of the anchor in different scales through big data fitting, and dynamically setting the value of the anchor.
Designed network structure YOLOv4: the multi-scale detection branches in YOLOv4 are designed so that multi-scale feature input mitigates missed and false detections, effectively improving detection accuracy and the overall quality of target recognition. Relative to YOLOv3, its backbone network meets the following requirements: first, a higher input resolution, improving small-object detection accuracy; second, more layers, enlarging the receptive field to match the larger input; third, more parameters, improving the ability to detect multi-size targets in a single image. Overall, accuracy improves by nearly 10 points with a modest speed improvement.
Drawings
Fig. 1 is a flowchart of the intelligent law enforcement multi-scale target detection method based on the YOLOv4 algorithm according to embodiment 1 of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.
Example 1
As shown in fig. 1, a multi-scale object target detection method oriented to intelligent law enforcement and based on the YOLOv4 algorithm includes the following steps:
a data collection step: collecting video image data of the intelligent law enforcement monitoring scene at different time points and different angles; the specific operation mode is as follows:
First, collection time points: the scenes are divided into 6 conditions — morning congested, morning unobstructed, afternoon congested, afternoon unobstructed, evening congested and evening unobstructed. Congested and unobstructed conditions are distinguished by the number of vehicle targets on the road: a scene is classed as a congested section when more than 50 vehicles appear in a monitoring-camera snapshot, which typically occurs during city-center rush hours such as 8:00-10:00, 16:00-18:00 and 18:00-20:00, while other times and suburban locations are generally unobstructed.
Collection places: near construction areas, the city center, stations and other overpasses (e.g., areas near fire departments, environmental protection departments and hospitals), because the shooting range there best matches the height of an intelligent traffic law enforcement camera.
Collection mode and quantity: 30-second videos are shot from the overpasses at each place, covering the one-way, two-way and lateral angles of the road. Since the frame rate of video is generally about 30 frames per second, the frame-grabbing frequency is set to 3 frames per second; with at least 100 sample videos, this yields at least about 10000 samples. The data set thus contains about 10000 video-frame pictures from at least 100 sample videos.
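Grabbing 3 frames per second from a ~30 fps video means keeping every 10th frame; the sample-count arithmetic above can be checked with a short sketch (the function name is illustrative):

```python
def sampled_frame_indices(duration_s=30, fps=30, grabs_per_second=3):
    """Indices of the frames kept when grabbing 3 frames per second
    from a ~30 fps video, i.e. every 10th frame."""
    step = fps // grabs_per_second       # 30 // 3 == 10
    return list(range(0, duration_s * fps, step))

# One 30 s video yields 90 frames, so 100 videos yield about 9000
# frames, matching the "at least about 10000 samples" estimate above.
```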
Objects acquired in the video images include any combination of: car, police car, taxi, minibus, passenger coach, single-person electric vehicle, express electric vehicle, truck, sanitation vehicle, tank truck, engineering vehicle, fire truck, ambulance, police motorcycle, other non-motor vehicles and pedestrians.
A data integration step: integrate the collected video image data. A new VOC2007 folder is created under the working directory, and three folders — Annotations, ImageSets and JPEGImages — are created under VOC2007. A Main folder is then created under ImageSets, and the collected data set pictures are copied into the JPEGImages directory.
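The folder layout above can be created with a few lines of standard-library code (a sketch; the root path is whatever working directory the data set lives in):

```python
import os

def make_voc_layout(root):
    """Create the VOC2007 directory tree described above:
    Annotations, ImageSets/Main and JPEGImages."""
    for sub in ("Annotations", "ImageSets/Main", "JPEGImages"):
        os.makedirs(os.path.join(root, "VOC2007", sub), exist_ok=True)
```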
Data labeling: carrying out depth model training and labeling on the integrated video image data and forming source data, wherein the specific operation mode is as follows:
The tool: the labeling tool used is LabelImg, which generates xml annotation files;
Data set numbering: to organize the data and reduce the chance of errors, the roughly 10000 video-frame pictures are randomly numbered and given sequential identifiers, such as 000001-000999;
Data labeling: the data are labeled with LabelImg; each picture corresponds to an xml annotation file of the same name, e.g., picture 000001.jpg has annotation file 000001.xml. The annotation scope includes: the image path, the image name (e.g., 000001.jpg), the image width and height, the image dimension, the annotated object name and the xy coordinate values of the bbox. The annotated object names (class labels) are: 1. car; 2. police-car; 3. taxi; 4. van; 5. bus; 6. minibus; 7. coach (passenger transport); 8. electric-vehicle (single person); 9. express-vehicle; 10. truck; 11. sanitation-truck; 12. tanker-truck; 13. engineering-truck; 14. fire-truck; 15. ambulance; 16. police-motorcycle; 17. others (other non-motor vehicles); 18. pedestrian.
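The annotation fields listed above map directly onto a Pascal VOC-style XML file, which is the format LabelImg writes; a minimal standard-library sketch (field set reduced to the ones the description names):

```python
import xml.etree.ElementTree as ET

def voc_annotation(filename, width, height, depth, objects):
    """Build a Pascal VOC-style annotation: image name, size, and per
    object its class name plus bbox xy coordinates."""
    ann = ET.Element("annotation")
    ET.SubElement(ann, "filename").text = filename
    size = ET.SubElement(ann, "size")
    for tag, val in (("width", width), ("height", height), ("depth", depth)):
        ET.SubElement(size, tag).text = str(val)
    for name, (xmin, ymin, xmax, ymax) in objects:
        obj = ET.SubElement(ann, "object")
        ET.SubElement(obj, "name").text = name
        box = ET.SubElement(obj, "bndbox")
        for tag, val in (("xmin", xmin), ("ymin", ymin),
                         ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(box, tag).text = str(val)
    return ET.tostring(ann, encoding="unicode")
```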
A data dividing step: the source data are divided according to a preset ratio, with the training data set accounting for 75% and the validation data set for 25%;
Multi-scale feature map allocation step: K-means clustering is applied to the vehicle picture data set to obtain the prior-box sizes; 3 prior boxes are set for each downsampling scale, giving 9 clustered prior-box sizes in total. On the COCO dataset these 9 prior boxes are: (10x13), (16x30), (33x23), (30x61), (62x45), (59x119), (116x90), (156x198), (373x326). In dynamic allocation, the larger prior boxes (116x90), (156x198), (373x326) are applied on the smallest 13x13 feature map (largest receptive field), suitable for detecting larger objects; the medium prior boxes (30x61), (62x45), (59x119) are applied on the medium 26x26 feature map (medium receptive field), suitable for detecting medium-sized objects; and the smaller prior boxes (10x13), (16x30), (33x23) are applied on the larger 52x52 feature map (smaller receptive field), suitable for detecting smaller objects.
Table 1. Multi-scale allocation of feature maps

Feature map | Receptive field | Prior boxes | Suitable for
---|---|---|---
13x13 | largest | (116x90), (156x198), (373x326) | larger objects
26x26 | medium | (30x61), (62x45), (59x119) | medium objects
52x52 | smaller | (10x13), (16x30), (33x23) | smaller objects
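The scale allocation can be sketched by sorting the 9 clustered boxes by area and handing 3 to each detection scale; the sort-by-area rule is an assumption that reproduces the assignment described above.

```python
# The 9 COCO prior boxes from the text. Sorting by area puts the smallest
# anchors on the 52x52 map and the largest on the 13x13 map, per Table 1.
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

def assign_anchors(anchors):
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1])
    return {"52x52": ordered[0:3], "26x26": ordered[3:6], "13x13": ordered[6:9]}

for scale, boxes in assign_anchors(ANCHORS).items():
    print(scale, boxes)
```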
YOLOv4 model training step: YOLOv4 is trained on the training data set as follows: ① the extracted multi-size feature maps are input into the CSPDarknet53 backbone network; CSPDarknet53 adds a CSPNet (Cross Stage Partial network) to the Darknet53 backbone of YOLOv3 and, by integrating the gradient changes into the feature map from end to end, reduces computation while maintaining accuracy, yielding a feature pyramid at multiple scales; ② the multi-scale features are input into the SPPNet (Spatial Pyramid Pooling network), which enlarges the network's receptive field (i.e., the identification area of the target) and reduces feature dimensionality by alternating convolution and pooling layers; ③ information fusion between shallow and deep features is accelerated through the PANet path-aggregation network to obtain fused features at different scales; ④ the training result, comprising bounding-box regression coordinates, the target classification result and confidence scores, is finally output through the fully connected layer; ⑤ the loss function value is calculated from the corresponding result; the loss function comprises three parts: bounding-box regression loss, classification loss and confidence loss. The parameters are set to a maximum of 50,000 iterations, an initial learning rate of 0.01, a batch size of 32, a weight decay of 0.0005 and a momentum of 0.9; the learning rate and batch size are adjusted appropriately according to the downward trend of the loss value. Training stops when the loss function value output on the training data set is less than or equal to the threshold or the set maximum number of iterations is reached, and the trained network model is recorded as the prediction model;
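The stopping rule above (stop when the loss reaches the threshold or the iteration cap, lowering the learning rate when the loss stops trending down) can be sketched as follows; the simulated loss curve and the halving schedule are illustrative assumptions, not the patent's exact training procedure.

```python
def train_loop(loss_at, threshold=0.05, max_iters=50000, lr=0.01):
    """Iterate until loss <= threshold or max_iters is hit; halve the
    learning rate when the loss has plateaued for 100 iterations."""
    prev, stalled, loss = float("inf"), 0, float("inf")
    for it in range(1, max_iters + 1):
        loss = loss_at(it, lr)
        if loss <= threshold:
            return it, lr, loss
        stalled = stalled + 1 if loss >= prev * 0.999 else 0
        if stalled >= 100:            # loss no longer trending down
            lr, stalled = lr * 0.5, 0
        prev = loss
    return max_iters, lr, loss

# Simulated exponentially decaying loss stands in for real training.
it, lr, loss = train_loop(lambda it, lr: 10.0 * 0.999 ** it)
print(it, loss <= 0.05)
```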
YOLOv4 model verification step: the prediction model is verified on the validation data set to obtain a model score, and the model with the best predictive performance is selected through model evaluation. Three indexes are used to verify the quality of the model:
recall (R): i.e., how many of the positive samples are predicted correctly;
precision (P): i.e., how many of the samples predicted as positive are truly positive;
mean average precision (mAP): the mean of the AP (average precision) over all classes, measuring the average detection quality across all classes.
Their calculation formulas are respectively as follows:
R = TP/(TP+FN); P = TP/(TP+FP); AP = ∫P(R)dR; mAP = (1/N)·ΣAP_i over the N classes
where:
TP (True Positives): true positive samples (positive samples correctly predicted as positive);
TN (True Negatives): true negative samples (negative samples correctly predicted as negative);
FP (False Positives): false positive samples (negative samples incorrectly predicted as positive);
FN (False Negatives): false negative samples (positive samples incorrectly predicted as negative);
P: the precision;
R: the recall;
AP: the area under the PR curve (PR curve: Precision-Recall curve); AP measures the detection quality for one class, while mAP measures it across multiple classes.
The AP formula is: AP = ∫P(R)dR, i.e., the area under the precision-recall curve.
in this example, the accuracy was 94.23%, the recall was 93.82%, and the average accuracy value was 89.35%.
A target detection step: the final model is used to monitor multi-scale targets for intelligent law enforcement. A camera used for intelligent-enforcement monitoring serves as the input of the model; when target objects of different scales are identified among the various motor and non-motor vehicles (e.g., car, police car, taxi, van, bus, minibus, passenger coach, single-person electric vehicle, express electric vehicle, truck, sanitation vehicle, tank truck, engineering vehicle, fire truck, ambulance, police motorcycle and other non-motor vehicles), warning information is pushed, achieving the effect of monitoring multi-scale targets for intelligent law enforcement.
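The warning-push logic can be sketched as a filter over model detections; the detection tuples, class set and confidence threshold below are illustrative assumptions, not the patent's actual interface.

```python
# Hypothetical target classes and confidence threshold for the alert filter.
TARGET_CLASSES = {"car", "truck", "fire-truck", "ambulance", "pedestrian"}

def make_alerts(detections, conf_threshold=0.5):
    """Keep detections of monitored classes above the confidence threshold
    and turn each into a warning record to be pushed."""
    return [f"ALERT: {cls} ({conf:.2f})"
            for cls, conf in detections
            if cls in TARGET_CLASSES and conf >= conf_threshold]

dets = [("car", 0.92), ("pedestrian", 0.41), ("truck", 0.78)]
print(make_alerts(dets))  # the low-confidence pedestrian is filtered out
```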
According to the above YOLOv4-based multi-scale object target detection method for intelligent law enforcement, when the multi-scale feature maps are allocated, the prior boxes are sized by unsupervised k-means clustering of the vehicle pictures in the sample scene; target sizes are divided into large, medium and small, corresponding respectively to the three prior-box sizes per scale, so the method can detect targets at both near and far distances in multiple scenes. By contrast, a conventional target detection network generally detects only targets of similar sizes and easily misses small objects in a large scene.
Example 2
An intelligent law enforcement multi-scale target detection device based on the YOLOv4 algorithm, comprising: one or more processors, and a memory for storing one or more computer programs which, when executed by the one or more processors, perform the target detection step of: monitoring a multi-scale object target scene by using the final model, and generating alarm information when a specific target object is monitored;
wherein, the establishment process of the final model is as follows:
a data collection step: collecting video image data of different time points and different angles of a multi-scale object target scene;
a data integration step: integrating the collected video image data;
data labeling: marking the integrated video image data and forming source data;
a data dividing step: dividing the source data into a training data set and a verification data set according to a preset proportion;
Multi-scale feature map allocation step: for the picture data set, the prior-box sizes are obtained with the K-means clustering algorithm, whose flow is as follows: ① randomly select 9 prior-box center points from the data set as centroids; ② calculate the Euclidean distance from each sample box to each centroid and assign each box to the set of its nearest centroid; ③ after the sets are grouped, recalculate the centroid of each set; ④ set thresholds of different sizes for the large, medium and small resolutions; if the distance between each new centroid and the original centroid is smaller than the set threshold, terminate the algorithm, otherwise iterate steps ②-④; finally, 9 prior-box sizes are clustered according to the different scales.
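Steps ①-④ above can be sketched as a plain k-means over (width, height) pairs; this is a simplified sketch using Euclidean distance as the text states (anchor clustering is often done with an IoU-based distance instead), and the synthetic box sizes and k=3 are illustrative only.

```python
import random

def kmeans_boxes(sizes, k=9, iters=50, seed=0):
    """K-means on (w, h) box sizes: random init (step 1), nearest-centroid
    assignment (step 2), centroid update (step 3), convergence check (step 4)."""
    rnd = random.Random(seed)
    centroids = rnd.sample(sizes, k)                                   # step 1
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for w, h in sizes:                                             # step 2
            i = min(range(k), key=lambda j: (w - centroids[j][0]) ** 2
                                            + (h - centroids[j][1]) ** 2)
            groups[i].append((w, h))
        new = [(sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g))
               if g else centroids[i] for i, g in enumerate(groups)]   # step 3
        if all(abs(a - c) < 1e-6 and abs(b - d) < 1e-6
               for (a, b), (c, d) in zip(new, centroids)):             # step 4
            break
        centroids = new
    return sorted(centroids, key=lambda wh: wh[0] * wh[1])

# Synthetic box sizes around three scales, clustered with k=3 for brevity.
boxes = [(10 + i, 14 + i) for i in range(5)] + \
        [(60 + i, 45 + i) for i in range(5)] + \
        [(150 + i, 200 + i) for i in range(5)]
print(kmeans_boxes(boxes, k=3))
```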
YOLOv4 model training step: YOLOv4 is trained on the training data set as follows: ① the extracted multi-size feature maps are input into the CSPDarknet53 backbone network; CSPDarknet53 adds a CSPNet (Cross Stage Partial network) to the Darknet53 backbone of YOLOv3 and, by integrating the gradient changes into the feature map from end to end, reduces computation while maintaining accuracy, yielding a feature pyramid at multiple scales; ② the multi-scale features are input into the SPPNet (Spatial Pyramid Pooling network), which enlarges the network's receptive field (i.e., the identification area of the target) and reduces feature dimensionality by alternating convolution and pooling layers; ③ information fusion between shallow and deep features is accelerated through the PANet path-aggregation network to obtain fused features at different scales; ④ the training result, comprising bounding-box regression coordinates, the target classification result and confidence scores, is finally output through the fully connected layer; ⑤ the loss function value is calculated from the corresponding result; the loss function comprises three parts: bounding-box regression loss, classification loss and confidence loss. The parameters are set to a maximum of 50,000 iterations, an initial learning rate of 0.01, a batch size of 32, a weight decay of 0.0005 and a momentum of 0.9; the learning rate and batch size are adjusted appropriately according to the downward trend of the loss value. Training stops when the loss function value output on the training data set is less than or equal to the threshold or the set maximum number of iterations is reached, and the trained network model is recorded as the prediction model;
YOLOv4 model verification step: and verifying the prediction model through a verification data set, screening out a model with the optimal prediction performance through model evaluation, and marking as a final model.
In some embodiments, the processor may include various processing circuitry, such as, but not limited to, one or more of a central processor or a communications processor. The processor may perform control of at least one other component of the multi-scale object target detection apparatus, and/or perform operations or data processing related to communications. The memory may include volatile and/or non-volatile memory. The multi-scale object detection device may include, for example, a smart phone, a tablet computer, a desktop computer, an e-book reader, an MP3 player, an electronic bracelet, a smart watch, and the like.
Example 3
An intelligent law enforcement multi-scale target detection system based on a YOLOv4 algorithm, comprising:
an image acquisition device for acquiring image data to be analyzed;
a computing device coupled with the image acquisition device and comprising: one or more processors, and a memory for storing one or more computer programs which, when executed by the one or more processors, perform the target detection step of: monitoring a multi-scale object target scene by using the final model, and generating alarm information when a specific target object is monitored;
a warning device coupled to the computing device and configured to warn of the warning information.
Wherein, the establishment process of the final model is as follows:
a data collection step: collecting video image data of different time points and different angles of a multi-scale object target scene;
a data integration step: integrating the collected video image data;
data labeling: marking the integrated video image data and forming source data;
a data dividing step: dividing the source data into a training data set and a verification data set according to a preset proportion;
Multi-scale feature map allocation step: for the picture data set, the prior-box sizes are obtained with the K-means clustering algorithm, whose flow is as follows: ① randomly select 9 prior-box center points from the data set as centroids; ② calculate the Euclidean distance from each sample box to each centroid and assign each box to the set of its nearest centroid; ③ after the sets are grouped, recalculate the centroid of each set; ④ set thresholds of different sizes for the large, medium and small resolutions; if the distance between each new centroid and the original centroid is smaller than the set threshold, terminate the algorithm, otherwise iterate steps ②-④; finally, 9 prior-box sizes are clustered according to the different scales.
YOLOv4 model training step: YOLOv4 is trained on the training data set as follows: ① the extracted multi-size feature maps are input into the CSPDarknet53 backbone network; CSPDarknet53 adds a CSPNet (Cross Stage Partial network) to the Darknet53 backbone of YOLOv3 and, by integrating the gradient changes into the feature map from end to end, reduces computation while maintaining accuracy, yielding a feature pyramid at multiple scales; ② the multi-scale features are input into the SPPNet (Spatial Pyramid Pooling network), which enlarges the network's receptive field (i.e., the identification area of the target) and reduces feature dimensionality by alternating convolution and pooling layers; ③ information fusion between shallow and deep features is accelerated through the PANet path-aggregation network to obtain fused features at different scales; ④ the training result, comprising bounding-box regression coordinates, the target classification result and confidence scores, is finally output through the fully connected layer; ⑤ the loss function value is calculated from the corresponding result; the loss function comprises three parts: bounding-box regression loss, classification loss and confidence loss. The parameters are set to a maximum of 50,000 iterations, an initial learning rate of 0.01, a batch size of 32, a weight decay of 0.0005 and a momentum of 0.9; the learning rate and batch size are adjusted appropriately according to the downward trend of the loss value. Training stops when the loss function value output on the training data set is less than or equal to the threshold or the set maximum number of iterations is reached, and the trained network model is recorded as the prediction model;
YOLOv4 model verification step: and verifying the prediction model through a verification data set, screening out a model with the optimal prediction performance through model evaluation, and marking as a final model.
In some embodiments, the computing device may be in wired or wireless connection with the image acquisition device. The warning device may be integrated with the computing device, or the warning device and the computing device may be two separate components. The computing device may include, for example, a smart phone, a tablet computer, a desktop computer, an electronic book reader, an MP3 player, an electronic bracelet, a smart watch, and the like.
Example 4
A computer readable storage medium having one or more computer programs stored thereon, wherein the one or more computer programs, when executed by one or more processors, implement the target detection step of: monitoring a multi-scale object target scene by using the final model, and generating alarm information when a specific target object is monitored.
Wherein, the establishment process of the final model is as follows:
a data collection step: collecting video image data of different time points and different angles of a multi-scale object target scene;
a data integration step: integrating the collected video image data;
data labeling: marking the integrated video image data and forming source data;
a data dividing step: dividing the source data into a training data set and a verification data set according to a preset proportion;
Multi-scale feature map allocation step: for the picture data set, the prior-box sizes are obtained with the K-means clustering algorithm, whose flow is as follows: ① randomly select 9 prior-box center points from the data set as centroids; ② calculate the Euclidean distance from each sample box to each centroid and assign each box to the set of its nearest centroid; ③ after the sets are grouped, recalculate the centroid of each set; ④ set thresholds of different sizes for the large, medium and small resolutions; if the distance between each new centroid and the original centroid is smaller than the set threshold, terminate the algorithm, otherwise iterate steps ②-④; finally, 9 prior-box sizes are clustered according to the different scales.
YOLOv4 model training step: YOLOv4 is trained on the training data set as follows: ① the extracted multi-size feature maps are input into the CSPDarknet53 backbone network; CSPDarknet53 adds a CSPNet (Cross Stage Partial network) to the Darknet53 backbone of YOLOv3 and, by integrating the gradient changes into the feature map from end to end, reduces computation while maintaining accuracy, yielding a feature pyramid at multiple scales; ② the multi-scale features are input into the SPPNet (Spatial Pyramid Pooling network), which enlarges the network's receptive field (i.e., the identification area of the target) and reduces feature dimensionality by alternating convolution and pooling layers; ③ information fusion between shallow and deep features is accelerated through the PANet path-aggregation network to obtain fused features at different scales; ④ the training result, comprising bounding-box regression coordinates, the target classification result and confidence scores, is finally output through the fully connected layer; ⑤ the loss function value is calculated from the corresponding result; the loss function comprises three parts: bounding-box regression loss, classification loss and confidence loss. The parameters are set to a maximum of 50,000 iterations, an initial learning rate of 0.01, a batch size of 32, a weight decay of 0.0005 and a momentum of 0.9; the learning rate and batch size are adjusted appropriately according to the downward trend of the loss value. Training stops when the loss function value output on the training data set is less than or equal to the threshold or the set maximum number of iterations is reached, and the trained network model is recorded as the prediction model;
YOLOv4 model verification step: and verifying the prediction model through a verification data set, screening out a model with the optimal prediction performance through model evaluation, and marking as a final model.
In some embodiments, the computer readable medium may include, for example, a hard disk, a floppy disk, a magnetic medium, an optical recording medium, a DVD, a magneto-optical medium, and the like.
The above embodiments are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby, and any insubstantial changes and substitutions made by those skilled in the art based on the present invention are within the protection scope of the present invention.
Claims (10)
1. A method for detecting an intelligent law-enforcement multi-scale target based on a YOLOv4 algorithm is characterized by comprising the following steps:
a data collection step: collecting video image data of different time points and different angles of a multi-scale object target scene;
a data integration step: integrating the collected video image data;
data labeling: marking the integrated video image data and forming source data;
a data dividing step: dividing the source data into a training data set and a verification data set according to a preset proportion;
a multi-scale feature map allocation step: adopting a K-means clustering algorithm to obtain the sizes of the prior boxes for the picture data set, and clustering prior boxes of 9 sizes according to the different scales;
a YOLOv4 model training step: YOLOv4 is trained on the training data set as follows: ① the extracted feature maps of the 9 sizes are input into the CSPDarknet53 backbone network, wherein CSPDarknet53 adds a CSPNet to the Darknet53 backbone of YOLOv3; ② the multi-scale features are input into the SPPNet; ③ information fusion between shallow and deep features is accelerated through the PANet path-aggregation network to obtain fused features at different scales; ④ the training result is finally output through the fully connected layer; ⑤ the loss function value is calculated from the corresponding result, and the learning rate and batch size are adjusted according to the downward trend of the loss function value; training stops when the loss function value output on the training data set is less than or equal to a threshold or the set maximum number of iterations is reached, and the trained network model is recorded as the prediction model;
YOLOv4 model verification step: verifying the prediction model through the verification data set, screening out a model with optimal prediction performance through model evaluation, and marking as a final model;
a target detection step: monitoring the multi-scale object target scene by using the final model, and generating alarm information when a specific target object is monitored.
2. The method for intelligent law enforcement multi-scale target detection based on the YOLOv4 algorithm according to claim 1, wherein in the data collecting step, the multi-scale object scene is a traffic enforcement monitoring management area; the different time points at least comprise 6 time points of morning congestion, morning unobstructed, afternoon congestion, afternoon unobstructed, evening congestion and evening unobstructed; the objects in the video image data include: any combination of a car, police car, taxi, minibus, passenger coach, single-person electric vehicle, express electric vehicle, truck, sanitation vehicle, tank truck, engineering vehicle, fire truck, ambulance, police motorcycle, other non-motor vehicles and pedestrians.
3. The method of claim 1, wherein in the data integration step, the collected video image data are placed in the same folder; in the YOLOv4 model verification step, model evaluation is performed by three indexes, including: recall, accuracy and average accuracy.
4. The method as claimed in claim 2, wherein in the step of data annotation, the integrated video image data are annotated and source data are formed, and the annotation range includes: the position of the image, the image name, the image width and height, the image dimension, the annotated object name and the xy coordinate values of the bbox; the annotated object name comprises: any combination of a car, police car, taxi, minibus, passenger coach, single-person electric vehicle, express electric vehicle, truck, sanitation vehicle, tank truck, engineering vehicle, fire truck, ambulance, police motorcycle, other non-motor vehicles and pedestrians.
5. The method of claim 1, wherein in the step of data partitioning, the ratio of the training data set to the validation data set is 3:1, 7:3, 8:2 or 98: 2.
6. The method for intelligent law enforcement multi-scale target detection based on the YOLOv4 algorithm of claim 1, wherein in the multi-scale feature map allocation step, K-means clustering is used to obtain the sizes of the prior boxes, 3 prior boxes are set for each downsampling scale, and 9 prior boxes are clustered in total; on the COCO dataset these 9 prior boxes are: (10x13), (16x30), (33x23), (30x61), (62x45), (59x119), (116x90), (156x198) and (373x326); in dynamic allocation, the 13x13 feature map applies prior boxes (116x90), (156x198), (373x326); the 26x26 feature map applies prior boxes (30x61), (62x45), (59x119); and the 52x52 feature map applies prior boxes (10x13), (16x30), (33x23).
7. The method of claim 1, wherein in the step of detecting the target, a traffic enforcement monitoring management area is monitored by using the final model, a camera is used as an input of the model, and when various target objects with different sizes are identified, warning information is generated.
8. An intelligent law enforcement multi-scale target detection device based on a YOLOv4 algorithm, comprising: one or more processors, and memory for storing one or more computer programs which, when executed by the one or more processors, perform the object detection steps of claim 1: monitoring the multi-scale object target scene by using the final model, and generating alarm information when a specific target object is monitored;
the process of establishing the final model comprises the steps of data collection, data integration, data annotation, data division, multi-scale feature map allocation, Yolov4 model training and Yolov4 model verification as claimed in claim 1.
9. An intelligent law enforcement multi-scale target detection system based on a YOLOv4 algorithm, comprising:
an image acquisition device for acquiring image data to be analyzed;
a computing device coupled with the image acquisition device and comprising: one or more processors, and memory for storing one or more computer programs which, when executed by the one or more processors, perform the object detection steps of claim 1: monitoring the multi-scale object target scene by using the final model, and generating alarm information when a specific target object is monitored;
an alert device coupled with the computing device and configured to alert the alert information;
the process of establishing the final model comprises the steps of data collection, data integration, data annotation, data division, multi-scale feature map allocation, Yolov4 model training and Yolov4 model verification as claimed in claim 1.
10. A computer readable storage medium, having one or more computer programs stored thereon, wherein the one or more computer programs, when executed by one or more processors, implement the object detection step of claim 1: monitoring the multi-scale object target scene by using the final model, and generating alarm information when a specific target object is monitored;
the process of establishing the final model comprises the steps of data collection, data integration, data annotation, data division, multi-scale feature map allocation, Yolov4 model training and Yolov4 model verification as claimed in claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010989852.3A CN112052826A (en) | 2020-09-18 | 2020-09-18 | Intelligent enforcement multi-scale target detection method, device and system based on YOLOv4 algorithm and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010989852.3A CN112052826A (en) | 2020-09-18 | 2020-09-18 | Intelligent enforcement multi-scale target detection method, device and system based on YOLOv4 algorithm and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112052826A true CN112052826A (en) | 2020-12-08 |
Family
ID=73604140
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010989852.3A Pending CN112052826A (en) | 2020-09-18 | 2020-09-18 | Intelligent enforcement multi-scale target detection method, device and system based on YOLOv4 algorithm and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112052826A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112308054A (en) * | 2020-12-29 | 2021-02-02 | 广东科凯达智能机器人有限公司 | Automatic reading method of multifunctional digital meter based on target detection algorithm |
CN112465072A (en) * | 2020-12-22 | 2021-03-09 | 浙江工业大学 | Excavator image identification method based on YOLOv4 model |
CN112560816A (en) * | 2021-02-20 | 2021-03-26 | 北京蒙帕信创科技有限公司 | Equipment indicator lamp identification method and system based on YOLOv4 |
CN112785557A (en) * | 2020-12-31 | 2021-05-11 | 神华黄骅港务有限责任公司 | Belt material flow detection method and device and belt material flow detection system |
CN112802302A (en) * | 2020-12-31 | 2021-05-14 | 国网浙江省电力有限公司双创中心 | Electronic fence method and system based on multi-source algorithm |
CN112989606A (en) * | 2021-03-16 | 2021-06-18 | 上海哥瑞利软件股份有限公司 | Data algorithm model checking method, system and computer storage medium |
CN112990131A (en) * | 2021-04-27 | 2021-06-18 | 广东科凯达智能机器人有限公司 | Method, device, equipment and medium for acquiring working gear of voltage change-over switch |
CN113033604A (en) * | 2021-02-03 | 2021-06-25 | 淮阴工学院 | Vehicle detection method, system and storage medium based on SF-YOLOv4 network model |
CN113158962A (en) * | 2021-05-06 | 2021-07-23 | 北京工业大学 | Swimming pool drowning detection method based on YOLOv4 |
CN113221646A (en) * | 2021-04-07 | 2021-08-06 | 山东捷讯通信技术有限公司 | Method for detecting abnormal objects of urban underground comprehensive pipe gallery based on Scaled-YOLOv4 |
CN113298032A (en) * | 2021-06-16 | 2021-08-24 | 武汉卓目科技有限公司 | Unmanned aerial vehicle visual angle image vehicle target detection method based on deep learning |
CN113420607A (en) * | 2021-05-31 | 2021-09-21 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Multi-scale target detection and identification method for unmanned aerial vehicle |
CN113516643A (en) * | 2021-07-13 | 2021-10-19 | 重庆大学 | Method for detecting retinal vessel bifurcation and intersection points in OCTA image |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109615868A (en) * | 2018-12-20 | 2019-04-12 | 北京以萨技术股份有限公司 | A kind of video frequency vehicle based on deep learning is separated to stop detection method |
CN110059554A (en) * | 2019-03-13 | 2019-07-26 | 重庆邮电大学 | A kind of multiple branch circuit object detection method based on traffic scene |
CN110136449A (en) * | 2019-06-17 | 2019-08-16 | 珠海华园信息技术有限公司 | Traffic video frequency vehicle based on deep learning disobeys the method for stopping automatic identification candid photograph |
CN110889324A (en) * | 2019-10-12 | 2020-03-17 | 南京航空航天大学 | Thermal infrared image target identification method based on YOLO V3 terminal-oriented guidance |
Application Events
- 2020-09-18: Application CN202010989852.3A filed in China (CN); published as CN112052826A; status: Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109615868A (en) * | 2018-12-20 | 2019-04-12 | 北京以萨技术股份有限公司 | Deep-learning-based method for detecting illegally parked vehicles in video |
CN110059554A (en) * | 2019-03-13 | 2019-07-26 | 重庆邮电大学 | Multi-branch object detection method for traffic scenes |
CN110136449A (en) * | 2019-06-17 | 2019-08-16 | 珠海华园信息技术有限公司 | Deep-learning-based method for automatic identification and snapshot capture of illegally parked vehicles in traffic video |
CN110889324A (en) * | 2019-10-12 | 2020-03-17 | 南京航空航天大学 | Thermal infrared image target recognition method based on YOLO V3 for terminal guidance |
Non-Patent Citations (1)
Title |
---|
Alexey Bochkovskiy et al.: "YOLOv4: Optimal Speed and Accuracy of Object Detection", arXiv:2004.10934v1 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112465072A (en) * | 2020-12-22 | 2021-03-09 | 浙江工业大学 | Excavator image identification method based on YOLOv4 model |
CN112465072B (en) * | 2020-12-22 | 2024-02-13 | 浙江工业大学 | Excavator image recognition method based on YOLOv4 model |
CN112308054A (en) * | 2020-12-29 | 2021-02-02 | 广东科凯达智能机器人有限公司 | Automatic reading method of multifunctional digital meter based on target detection algorithm |
CN112785557A (en) * | 2020-12-31 | 2021-05-11 | 神华黄骅港务有限责任公司 | Belt material flow detection method and device and belt material flow detection system |
CN112802302A (en) * | 2020-12-31 | 2021-05-14 | 国网浙江省电力有限公司双创中心 | Electronic fence method and system based on multi-source algorithm |
CN113033604A (en) * | 2021-02-03 | 2021-06-25 | 淮阴工学院 | Vehicle detection method, system and storage medium based on SF-YOLOv4 network model |
CN113033604B (en) * | 2021-02-03 | 2022-11-15 | 淮阴工学院 | Vehicle detection method, system and storage medium based on SF-YOLOv4 network model |
CN112560816A (en) * | 2021-02-20 | 2021-03-26 | 北京蒙帕信创科技有限公司 | Equipment indicator lamp identification method and system based on YOLOv4 |
CN112989606A (en) * | 2021-03-16 | 2021-06-18 | 上海哥瑞利软件股份有限公司 | Data algorithm model checking method, system and computer storage medium |
CN113221646A (en) * | 2021-04-07 | 2021-08-06 | 山东捷讯通信技术有限公司 | Method for detecting abnormal objects of urban underground comprehensive pipe gallery based on Scaled-YOLOv4 |
CN112990131A (en) * | 2021-04-27 | 2021-06-18 | 广东科凯达智能机器人有限公司 | Method, device, equipment and medium for acquiring working gear of voltage change-over switch |
CN113158962A (en) * | 2021-05-06 | 2021-07-23 | 北京工业大学 | Swimming pool drowning detection method based on YOLOv4 |
CN113420607A (en) * | 2021-05-31 | 2021-09-21 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Multi-scale target detection and identification method for unmanned aerial vehicle |
CN113298032A (en) * | 2021-06-16 | 2021-08-24 | 武汉卓目科技有限公司 | Unmanned aerial vehicle visual angle image vehicle target detection method based on deep learning |
CN113516643A (en) * | 2021-07-13 | 2021-10-19 | 重庆大学 | Method for detecting retinal vessel bifurcation and intersection points in OCTA image |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112052826A (en) | Intelligent enforcement multi-scale target detection method, device and system based on YOLOv4 algorithm and storage medium | |
CN109816024B (en) | Real-time vehicle logo detection method based on multi-scale feature fusion and DCNN | |
Al-qaness et al. | An improved YOLO-based road traffic monitoring system | |
Kenk et al. | DAWN: vehicle detection in adverse weather nature dataset | |
CN109241349B (en) | Monitoring video multi-target classification retrieval method and system based on deep learning | |
CN110738857B (en) | Vehicle violation evidence obtaining method, device and equipment | |
CN113033604B (en) | Vehicle detection method, system and storage medium based on SF-YOLOv4 network model | |
CN111507989A (en) | Training generation method of semantic segmentation model, and vehicle appearance detection method and device | |
CN113822247B (en) | Method and system for identifying illegal building based on aerial image | |
CN106446150A (en) | Method and device for precise vehicle retrieval | |
CN104615986A (en) | Pedestrian detection method for scene-changing video images using multiple detectors |
CN107862072B (en) | Method for analyzing fake-plate vehicle city-entry crimes based on big data technology |
CN115424217A (en) | AI vision-based intelligent vehicle identification method and device and electronic equipment | |
CN112785610B (en) | Lane line semantic segmentation method integrating low-level features | |
Kiew et al. | Vehicle route tracking system based on vehicle registration number recognition using template matching algorithm | |
EP3244344A1 (en) | Ground object tracking system | |
CN116413740B (en) | Laser radar point cloud ground detection method and device | |
CN110765900B (en) | Method and system for automatic detection of illegal buildings based on DSSD |
Kamenetsky et al. | Aerial car detection and urban understanding | |
CN111832463A (en) | Deep learning-based traffic sign detection method | |
Amala Ruby Florence et al. | Accident Detection System Using Deep Learning | |
Caballo et al. | YOLO-based Tricycle Detection from Traffic Video | |
Hou et al. | Application of YOLO V2 in construction vehicle detection | |
CN109800685A (en) | Method and device for determining objects in a video |
CN112052824A (en) | Gas pipeline specific object target detection alarm method, device and system based on YOLOv3 algorithm and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20201208 |
|
RJ01 | Rejection of invention patent application after publication |