CN114170581A

CN114170581A - Anchor-Free traffic sign detection method based on deep supervision

Info

Publication number: CN114170581A
Application number: CN202111487756.XA
Authority: CN
Inventors: 吕卫; 梁芷茵; 褚晶辉
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-12-07
Filing date: 2021-12-07
Publication date: 2022-03-11

Abstract

An Anchor-Free traffic sign detection method based on deep supervision comprises the following steps: constructing a data set and carrying out data preprocessing to form a training set and a testing set; building an Anchor-Free traffic sign detection neural network model based on deep supervision, which comprises the following steps of sequentially connecting in series: the device comprises an input unit, a coding unit, a layer jump structure unit, a decoding unit and an output prediction unit, wherein the coding unit is also connected with the decoding unit; training an Anchor-Free traffic sign detection neural network model based on deep supervision by using the obtained training set; and testing the Anchor-Free traffic sign detection neural network model based on deep supervision by using the obtained test set. The invention realizes the traffic sign detection by using the Anchor-Free method Based on the coding-decoding structure in the traffic sign detection, avoids the problem that the Anchor-Based method needs to manually set the Anchor frame parameters, and ensures that the algorithm can adapt to various traffic sign detection scenes.

Description

Anchor-Free traffic sign detection method based on deep supervision

Technical Field

The invention relates to a traffic sign detection method. In particular to an Anchor-Free traffic sign detection method based on deep supervision.

Background

The traffic sign is one of the most critical components in a road traffic system, and provides suggestive or restrictive information such as road conditions, real-time traffic conditions and the like for vehicles. The vehicles obey the traffic rules according to the traffic signs, so that the occurrence of traffic jam and traffic accidents can be greatly reduced. In practical applications, the traffic sign detection algorithm is an integral part of an automatic driving system. In the early stage of research, scholars at home and abroad mainly solve the problem of traffic sign detection by combining a plurality of image processing methods because the traffic signs are regular in shape and bright in color. In recent years, with the continuous and deep research of scholars at home and abroad on the neural network, the detection method based on the neural network has better detection effect and higher detection speed than the traditional image processing method, is widely applied to the field of traffic sign detection, and occupies an important position.

The neural network-based traffic sign detection algorithm is high in accuracy, and can better cope with negative influences on detection caused by illumination change, shielding and the like. The current commonly used detection method Based on the neural network is mainly an Anchor-Based method, and is typified by fast-RCNN^[1]、SSD^[2]And YOLO^[3]And the like. The patent "a traffic sign detection algorithm based on the YOLOv5 network structure" (china, 202110305468.1) uses a lightweight feature extraction network to extract image feature information on the basis of YOLOv 5; the patent 'a traffic sign detection and identification method based on a residual SSD model' (China, 201810850416.0) introduces a residual network into the SSD detection model, and improves the capability of the model in extracting features. The Anchor-Based method is Based on the Anchor frame needing to be manually set with hyper-parameters, and the Anchor frame has heuristic prior information, so that the hyper-parameters of the Anchor frame can be manually set in different data sets according to the target size distribution characteristics of the data sets. Meanwhile, the preset anchor frame is sensitive to the change of the data set, and the detection effect of the detection algorithm in the real scene change is reduced. The Anchor-Based method obtains higher recall ratio by densely arranging Anchor frames in the image, but only a small part of the Anchor frames are overlapped with a target real area, and a large amount of extra calculation cost is generated during detection.

Disclosure of Invention

The invention aims to solve the technical problem of providing a method for detecting an Anchor-Free traffic sign based on deep supervision with higher detection speed to overcome the defects of the prior art.

The technical scheme adopted by the invention is as follows: an Anchor-Free traffic sign detection method based on deep supervision comprises the following steps:

step 1, constructing a data set and carrying out data preprocessing to form a training set and a test set;

step 2, building an Anchor-Free traffic sign detection neural network model based on deep supervision, which comprises the following steps of sequentially connecting in series: the device comprises an input unit, a coding unit, a layer jump structure unit, a decoding unit and an output prediction unit, wherein the coding unit is also connected with the decoding unit;

step 3, training the Anchor-Free traffic sign detection neural network model based on deep supervision by adopting the training set obtained in the step 1;

and 4, testing the Anchor-Free traffic sign detection neural network model based on deep supervision by adopting the test set obtained in the step 1.

The data set in the step 1 is used for training and testing a neural network by adopting data which contains 45 types of traffic signs and has the frequency of more than 100 in a China traffic sign data set TT100K published by Qinghua university and Tengchun; the data preprocessing is to cut the original image into 512 x 512 pixel images randomly according to the area where the traffic sign is located, the cut image contains more than one traffic sign, and the annotation of the traffic sign detection frame in the cut image is obtained according to the original annotation file; the incomplete traffic sign in the cut image is used for simulating the condition when the traffic sign is blocked.

Step 2, an Anchor-Free traffic sign detection neural network model based on deep supervision is built through a deep learning framework PyTorch, and the input unit comprises: carrying out primary extraction of shallow features on an input original image through a 7 x 7 convolutional layer to obtain a feature map, wherein the convolutional kernel size of the convolutional layer is 7 x 7, the step length is 2, and the number of output channels is 64; and sequentially passing the feature map through 1 BN layer to prevent gradient disappearance, 1 RELU activation function layer and 1 maximum pooling layer with the step length of 2 and the pooling window of 2 multiplied by 2 to form the input of the coding unit.

The encoding unit described in step 2 includes:

the device comprises a first residual error module, a second residual error module, a third residual error module, a fourth residual error module and a cavity space convolution pooling pyramid module which are sequentially connected in series, wherein a feature map output by an input unit is further subjected to feature extraction through the first residual error module, the second residual error module, the third residual error module and the fourth residual error module in sequence, feature enhancement is performed through the cavity space convolution pooling pyramid module, an enhanced feature map is obtained and is sent to a decoding unit, and the outputs of the first residual error module, the second residual error module and the third residual error module are respectively connected with a layer jump structure unit.

The void space convolution pooling pyramid module is divided into 5 branches, wherein the first branch comprises a global average pooling layer, a fourth 1 x 1 convolution layer and an upper sampling layer which are sequentially connected in series; the second branch comprises a fifth 1 × 1 convolutional layer; the third branch comprises a first 3 x 3 expanded convolutional layer, the expansion rate of the first 3 x 3 expanded convolutional layer is 6; the fourth branch comprises a second 3 × 3 expansion convolutional layer, the expansion rate of the second 3 × 3 expansion convolutional layer is 12; the fifth branch comprises a third 3 × 3 expanded convolutional layer, the third 3 × 3 expanded convolutional layer having an expansion rate of 18; and after the feature maps output by the fourth residual module respectively enter five branches, the feature maps pass through a channel dimension splicing layer, are fused with feature information under different receptive fields, and pass through a sixth 1 multiplied by 1 convolution layer to obtain the output feature map of the void space convolution pooling pyramid module.

The layer jump structure unit in step 2 comprises a first 1 × 1 convolutional layer, a second 1 × 1 convolutional layer, a third 1 × 1 convolutional layer, a first depth supervision mechanism, a second depth supervision mechanism and a third depth supervision mechanism, wherein the first 1 × 1 convolutional layer receives an output characteristic diagram of a first residual error module in a coding unit, outputs the output characteristic diagram to a decoding unit after convolution operation, and simultaneously outputs the output characteristic diagram to the first depth supervision mechanism during model training; the second 1 x 1 convolutional layer receives an output characteristic diagram of a second residual error module in the coding unit, outputs the output characteristic diagram to the decoding unit after convolutional operation, and simultaneously outputs the output characteristic diagram to a second depth supervision mechanism during model training; and the third 1 x 1 convolutional layer receives the output characteristic diagram of the third residual error module in the coding unit, outputs the output characteristic diagram to the decoding unit after convolution operation, and simultaneously outputs the output characteristic diagram to a third depth supervision mechanism during model training.

The first depth supervision mechanism, the second depth supervision mechanism and the third depth supervision mechanism are respectively and correspondingly receive output feature maps of the first 1 x 1 convolution module, the second 1 x 1 convolution module and the third 1 x 1 convolution module in a model training stage; the first depth supervision mechanism, the second depth supervision mechanism and the third depth supervision mechanism have the same structure, and respectively carry out traffic sign central point prediction, offset prediction and scale prediction on a received feature map through three branches, wherein the three branches comprise two 3 multiplied by 3 convolution modules connected in series; obtaining the prediction information of the central point after passing through a seventh 3 multiplied by 3 convolution module and an eighth 3 multiplied by 3 convolution module of the first branch, and performing cross entropy loss calculation with the real central point to obtain cross entropy loss; obtaining prediction information of the offset after passing through a ninth 33 convolution module and a tenth 33 convolution module of the second branch, and performing L1 loss calculation with the real offset to obtain a first L1 loss; obtaining prediction information about the scale after passing through an eleventh 3 × 3 convolution module and a twelfth 3 × 3 convolution module of the third branch, and performing L1 loss calculation with the real scale to obtain a second L1 loss; and adding the cross entropy loss, the first L1 loss and the second L1 loss to obtain an auxiliary loss function, and forming an output value of the deep supervision mechanism.

The decoding unit in step 2 comprises a first decoding module, a second decoding module and a third decoding module, wherein the first decoding module is used for decoding the feature map output by the hollow space convolution pooling pyramid module in the receiving and coding unit, adding the feature map to the output of a third 1 x 1 convolution layer in the layer jump structure unit, and outputting the feature map obtained by adding to the second decoding module and the output prediction unit; the second decoding module decodes the received characteristic diagram, adds the decoded characteristic diagram to the output of a second 1 multiplied by 1 convolution layer in the layer jump structure unit, and outputs the characteristic diagram obtained by the addition to a third decoding module and an output prediction unit; the third decoding module is used for decoding the received characteristic diagram, adding the decoded characteristic diagram to the output of the first 1 multiplied by 1 convolution layer in the layer jump structure unit, and outputting the characteristic diagram obtained by the addition to the output prediction unit; the first decoding module, the second decoding module and the third decoding module have the same structure and respectively comprise a bilinear interpolation layer and a 3 multiplied by 3 convolution layer which are sequentially connected in series.

The output prediction unit in step 2 comprises: a first bilinear interpolation for receiving the result of the output addition of the second decoding module in the decoding unit and the second 1 × 1 convolutional layer in the layer-skipping structural unit, a second bilinear interpolation for receiving the result of the output addition of the first decoding module in the decoding unit and the third 1 × 1 convolutional layer in the layer-skipping structural unit, a channel dimension splicing layer for respectively receiving the result of the output addition of the third decoding module in the decoding unit and the first 1 × 1 convolutional layer in the layer-skipping structural unit, the output of the first bilinear interpolation and the output of the second bilinear interpolation, the channel dimension splicing layer fuses the obtained feature maps into one feature map for output, the fused feature map respectively carries out the prediction information of the traffic sign center point category and position, the center point offset and the traffic sign scale through three branches and outputs to realize the traffic sign detection, the three branches have the same structure, there are two concatenated 3 x 3 convolution modules.

The 3 × 3 convolution module comprises a 3 × 3 convolution layer, a BN layer and a RELU layer which are sequentially connected in series.

The Anchor-Free traffic sign detection method Based on deep supervision realizes traffic sign detection by using an Anchor-Free method Based on a coding-decoding structure in traffic sign detection, avoids the problem that the Anchor-Based method needs to manually set Anchor frame parameters, and enables an algorithm to be suitable for various traffic sign detection scenes. In the invention, a cavity space convolution Pooling Pyramid module (ASPP) is added behind the coding sub-network, the module utilizes the cavity convolution with various expansion rates to capture the feature information of different space proportions, and the space representation capability of the features extracted by the decoding sub-network is enhanced, thereby improving the detection capability of the detection model on the traffic sign features of different space proportions. The invention introduces a layer jump structure between coding-decoding structures and a deep supervision mechanism. The skip-layer structure can utilize the multilevel features generated by the coding sub-network, and fully utilize the edge and detail information of the shallow features and the semantic information of the deep features. The deep supervision mechanism can optimize the training of the model and reduce the difficulty of model optimization caused by utilizing shallow features. The invention connects the multilevel decoding features in the decoding structure on the channel dimension, thereby fusing the multilevel features. After the multi-level features are fused, channel information of interest is obtained by using a channel attention mechanism, so that the output can integrate richer feature information including detail information and context information and obtain the channel information of interest. The invention has the beneficial effects that:

1. the neural network adopts a coding-decoding structure, a cavity space convolution pooling pyramid module is added after a coding sub-network, and the cavity convolutions with different expansion rates can extract semantic information with different space scales, so that the module can extract and fuse context information under a plurality of space scales, and the detection performance of a multi-scale traffic sign target is improved.

2. The neural network adds a layer jump structure between the coding-decoding structures, and the layer jump structure is from the multi-level characteristics of the coding sub-network, so that the intermediate-level characteristics can be effectively utilized, and the reuse rate of the characteristics is improved. And the middle-level features of the coding sub-network have richer detail information and edge information, and the accuracy of positioning the traffic sign can be improved by utilizing the features.

3. Deep supervision is introduced after the layer jump structure, in the training process, all levels of feature maps of the layer jump structure are added and output as a part of a loss function, and the middle level feature maps can be directly optimized in the optimization process, so that the optimization difficulty of the model is reduced.

4. The neural network connects the outputs of the multi-stage decoding modules in the channel dimension and uses the feature map of the channel attention mechanism for prediction. The multi-stage decoding module comprises multi-scale features from a layer jump structure, the obtained feature graph comprises detail information and context information after splicing on a channel dimension, interested channel weight is increased through a channel attention mechanism, and feature expression capacity is improved.

Drawings

FIG. 1 is a schematic diagram of a neural network model for detecting Anchor-Free traffic signs based on deep supervision constructed by the invention;

FIG. 2 is a schematic diagram of a hollow space convolution pooling pyramid module structure according to the present invention;

FIG. 3 is a schematic diagram of the deep supervision mechanism of the present invention;

FIG. 4 is a schematic diagram of a decoding module according to the present invention;

FIG. 5 is a schematic diagram of the structure of a 33 convolution module according to the present invention;

FIG. 6 is a graph of the effect of the test using the method of the present invention.

Detailed Description

The depth supervision-Free traffic sign detection method based on the invention is explained in detail below with reference to the embodiments and the attached drawings.

The invention discloses a depth supervision-based Anchor-Free traffic sign detection method, which comprises the following steps:

the data set is used for training and testing a neural network by adopting data which contains 45 types of traffic signs and has the frequency of occurrence of more than 100 in a Chinese traffic sign data set TT100K published by Qinghua university and Tengchong; the data preprocessing is to cut the original image into 512 x 512 pixel images randomly according to the area where the traffic sign is located, the cut image contains more than one traffic sign, and the annotation of the traffic sign detection frame in the cut image is obtained according to the original annotation file; the incomplete traffic sign in the cut image is used for simulating the condition when the traffic sign is blocked.

Step 2, building an Anchor-Free traffic sign detection neural network model based on deep supervision, as shown in the attached figure 1, and sequentially connecting in series: the device comprises an input unit 1, an encoding unit 2, a layer jump structure unit 3, a decoding unit 4 and an output prediction unit 5, wherein the encoding unit 2 is also connected with the decoding unit 4. The invention relates to an Anchor-Free traffic sign detection neural network model based on deep supervision, which is built through a deep learning framework PyTorch.

In the invention, a layer jump structure is introduced between layers with the same size of characteristic graphs of the coding sub-network and the decoding sub-network, and a deep supervision mechanism is added between the layer jump structures. The layer jump structure effectively utilizes rich detail information contained in the decoding sub-network, and more accurate target positioning information can be obtained in the output characteristic diagram through the layer jump structure. The deep surveillance mechanism helps the training phase model to be better optimized, and reduces the difficulty of optimization increased by using shallow features.

In order to obtain the multi-level characteristics in the decoding sub-network, the output obtained by each level of decoding module is spliced on the channel dimension, and the weight of the interested channel information is increased by using a channel attention mechanism.

In the invention, based on the depth supervision Anchor-Free traffic sign detection neural network model,

1. as shown in fig. 1, the input unit 1 includes:

carrying out primary extraction of shallow features on an input original image through a 7 x 7 convolutional layer to obtain a feature map, wherein the convolutional kernel size of the convolutional layer is 7 x 7, the step length is 2, and the number of output channels is 64; and the feature map sequentially passes through 1 BN layer to prevent gradient disappearance, 1 RELU activation function layer and 1 maximum pooling layer with the step length of 2 and the pooling window of 2 multiplied by 2 to form the input of the encoding unit 2.

2. The encoding unit 2 includes:

the device comprises a first residual error module, a second residual error module, a third residual error module, a fourth residual error module and a cavity space convolution pooling pyramid module which are sequentially connected in series, wherein after a feature map output by an input unit 1 is further subjected to feature extraction sequentially through the first residual error module, the second residual error module, the third residual error module and the fourth residual error module, feature enhancement is performed through the cavity space convolution pooling pyramid module, an enhanced feature map is obtained and is sent to a decoding unit 4, and the outputs of the first residual error module, the second residual error module and the third residual error module are respectively connected with a layer jump structure unit 3. The first Residual module, the second Residual module, the third Residual module and the fourth Residual module are all structured by ResNet-101(He K, Zhang X, Ren S, et al. deep reactive Learning for Image registration [ C ]//2016IEEE Conference on Computer Vision and Pattern Registration (CVPR), Las vegas, NV, USA (2016.6.27-2016.6.30),2016: 770-778.).

The void space convolution pooling pyramid module is divided into 5 branches as shown in fig. 2, wherein the first branch comprises a global average pooling layer, a fourth 1 × 1 convolution layer and an upper sampling layer which are sequentially connected in series; the second branch comprises a fifth 1 × 1 convolutional layer; the third branch comprises a first 3 x 3 expanded convolutional layer, the expansion rate of the first 3 x 3 expanded convolutional layer is 6; the fourth branch comprises a second 3 × 3 expansion convolutional layer, the expansion rate of the second 3 × 3 expansion convolutional layer is 12; the fifth branch comprises a third 3 × 3 expanded convolutional layer, the third 3 × 3 expanded convolutional layer having an expansion rate of 18; and after the feature maps output by the fourth residual module respectively enter five branches, the feature maps pass through a channel dimension splicing layer, are fused with feature information under different receptive fields, and pass through a sixth 1 multiplied by 1 convolution layer to obtain the output feature map of the void space convolution pooling pyramid module.

3. The layer jump structure unit 3 comprises a first 1 × 1 convolutional layer, a second 1 × 1 convolutional layer, a third 1 × 1 convolutional layer, a first depth supervision mechanism, a second depth supervision mechanism and a third depth supervision mechanism, wherein the first 1 × 1 convolutional layer receives an output characteristic diagram of a first residual error module in the coding unit 2, outputs the output characteristic diagram to the decoding unit 4 after convolution operation, and simultaneously outputs the output characteristic diagram to the first depth supervision mechanism when model training is carried out; the second 1 x 1 convolutional layer receives the output characteristic diagram of the second residual module in the coding unit 2, outputs the output characteristic diagram to the decoding unit 4 after convolutional operation, and simultaneously outputs the output characteristic diagram to the second deep supervision mechanism during model training; the third 1 × 1 convolutional layer receives the output feature map of the third residual module in the encoding unit 2, outputs the feature map to the decoding unit 4 after convolutional operation, and simultaneously outputs the feature map to the third deep supervision mechanism during model training.

The first depth supervision mechanism, the second depth supervision mechanism and the third depth supervision mechanism respectively and correspondingly receive output feature maps of the first 1 x 1 convolution module, the second 1 x 1 convolution module and the third 1 x 1 convolution module in a model training stage; the first, second and third deep supervision mechanisms have the same structure, and as shown in fig. 3, the received feature map is subjected to traffic sign center point prediction, offset prediction and scale prediction through three branches, wherein the three branches comprise two 3 × 3 convolution modules connected in series; obtaining the prediction information of the central point after passing through a seventh 3 multiplied by 3 convolution module and an eighth 3 multiplied by 3 convolution module of the first branch, and performing cross entropy loss calculation with the real central point to obtain cross entropy loss; obtaining prediction information of the offset after passing through a ninth 33 convolution module and a tenth 33 convolution module of the second branch, and performing L1 loss calculation with the real offset to obtain a first L1 loss; obtaining prediction information about the scale after passing through an eleventh 3 × 3 convolution module and a twelfth 3 × 3 convolution module of the third branch, and performing L1 loss calculation with the real scale to obtain a second L1 loss; and adding the cross entropy loss, the first L1 loss and the second L1 loss to obtain an auxiliary loss function, and forming an output value of the deep supervision mechanism.

4. The decoding unit 4 comprises a first decoding module, a second decoding module and a third decoding module, wherein the first decoding module is used for decoding the feature map output by the hollow space convolution pooling pyramid module in the receiving and encoding unit 2, adding the feature map to the output of a third 1 x 1 convolution layer in the layer jump structure unit 3, and outputting the feature map obtained by adding to the second decoding module and the output prediction unit 5; the second decoding module decodes the received feature map, adds the decoded feature map to the output of the second 1 × 1 convolution layer in the layer jump structure unit 3, and outputs the feature map obtained by the addition to the third decoding module and the output prediction unit 5; the third decoding module decodes the received feature map, adds the decoded feature map to the output of the first 1 × 1 convolution layer in the layer jump structure unit 3, and outputs the feature map obtained by the addition to the output prediction unit 5; the first decoding module, the second decoding module and the third decoding module have the same structure, and the structure is shown in fig. 4, and each decoding module comprises a bilinear interpolation layer and a 3 × 3 convolution layer which are sequentially connected in series.

5. The output prediction unit 5 includes: a first bilinear interpolation for receiving the result of the output addition of the second decoding module in the decoding unit 4 and the second 1 × 1 convolutional layer in the layer-skipping structural unit 3, a second bilinear interpolation for receiving the result of the output addition of the first decoding module in the decoding unit 4 and the third 1 × 1 convolutional layer in the layer-skipping structural unit 3, a channel dimension splicing layer for respectively receiving the result of the output addition of the third decoding module in the decoding unit 4 and the first 1 × 1 convolutional layer in the layer-skipping structural unit 3, the output of the first bilinear interpolation and the output of the second bilinear interpolation, the channel dimension splicing layer fuses the obtained feature maps into one feature map output, the fused feature maps respectively carry out the prediction information of the traffic sign center point category and position, the center point offset and the traffic sign scale through three branches and output, thereby realizing the traffic sign detection, the three branches have the same structure and are provided with two series-connected 3 multiplied by 3 convolution modules.

The 3 × 3 convolution module structure is shown in fig. 5, and includes a 3 × 3 convolution layer, a BN layer, and a RELU layer connected in series in sequence.

inputting the training set image cut in the step 1 into an Anchor-Free traffic sign detection neural network model based on deep supervision, and obtaining the category information of the traffic sign and the position information of the detection frame in a forward propagation stage. And calculating the error between the output result of the neural network model based on the depth-supervised Anchor-Free traffic sign detection and the position information and the category information of the real target according to the predicted result. And the error term is reversely propagated from the output layer to the hidden layer by layer, the model parameters of the depth-supervised Anchor-Free traffic sign detection neural network are updated, and the model parameters of the depth-supervised Anchor-Free traffic sign detection neural network are continuously fed back and optimized by using a random gradient descent (SGD) optimizer.

In the training of the Anchor-Free traffic sign detection neural network model based on deep supervision, one batch comprises 4 images, the iteration times are set to be 180, the initial learning rate is set to be 1.25 multiplied by 10^-4And decays to 1.25X 10 in the 90 th iteration^-5. And storing the trained model.

Inputting the cut test set image obtained in the step 1 into the trained Anchor-Free traffic sign detection neural network model based on deep supervision in the step 3, and outputting a detection result, wherein an effect graph of the result is shown in the attached figure 6.

The embodiment of the invention adopts Average Precision (Average Precision) to measure the algorithm effect. 3073 test set pictures are input for detection and calculation, and then the AP is calculated to be 95.7.

[1]Ren S,He K,Girshick R,et al.Faster r-cnn:Towards real-time object detection with region proposal networks[C]//Advances in neural information processing systems.2015:91-99.

[2]Huang J,Rathod V,Sun C,et al.Speed/accuracy trade-offs for modern convolutional object detectors[C]//Proceedings of the IEEE conference on computer vision and pattern recognition.2017:7310-7311.

[3]Redmon J,Divvala S,Girshick R,et al.You only look once:Unified,real-time object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition.2016:779-788.

Claims

1. An Anchor-Free traffic sign detection method based on deep supervision is characterized by comprising the following steps:

step 2, building an Anchor-Free traffic sign detection neural network model based on deep supervision, which comprises the following steps of sequentially connecting in series: the device comprises an input unit (1), an encoding unit (2), a layer jump structure unit (3), a decoding unit (4) and an output prediction unit (5), wherein the encoding unit (2) is also connected with the decoding unit (4);

2. The depth supervision-based Anchor-Free traffic sign detection method according to claim 1, characterized in that the data set in step 1 adopts 45-class traffic sign-containing data with the frequency of occurrence of more than 100 in China traffic sign data set TT100K published by Qinghua university and Tengchin for training and testing of neural networks; the data preprocessing is to cut the original image into 512 x 512 pixel images randomly according to the area where the traffic sign is located, the cut image contains more than one traffic sign, and the annotation of the traffic sign detection frame in the cut image is obtained according to the original annotation file; the incomplete traffic sign in the cut image is used for simulating the condition when the traffic sign is blocked.

3. The method for detecting the Anchor-Free traffic sign based on the deep supervision according to the claim 1, wherein the step 2 is to build the neural network model based on the Anchor-Free traffic sign detection based on the deep supervision through a deep learning framework PyTorch, and the input unit (1) comprises: carrying out primary extraction of shallow features on an input original image through a 7 x 7 convolutional layer to obtain a feature map, wherein the convolutional kernel size of the convolutional layer is 7 x 7, the step length is 2, and the number of output channels is 64; and the characteristic diagram sequentially passes through 1 BN layer to prevent gradient disappearance, 1 RELU activation function layer and 1 maximum pooling layer with the step length of 2 and the pooling window of 2 multiplied by 2 to form the input of the encoding unit (2).

4. The depth supervision-based Anchor-Free traffic sign detection method according to claim 1, wherein the encoding unit (2) in step 2 comprises:

the device comprises a first residual error module, a second residual error module, a third residual error module, a fourth residual error module and a cavity space convolution pooling pyramid module which are sequentially connected in series, a feature map output by an input unit (1) sequentially passes through the first residual error module, the second residual error module, the third residual error module and the fourth residual error module to further extract features, the features are enhanced through the cavity space convolution pooling pyramid module, the enhanced feature map is obtained and sent to a decoding unit (4), and the outputs of the first residual error module, the second residual error module and the third residual error module are respectively connected with a layer jump structure unit (3).

5. The depth supervision-based Anchor-Free traffic sign detection method according to claim 4, characterized in that the void space convolution pooling pyramid module is divided into 5 branches, the first branch comprises a global average pooling layer, a fourth 1 x 1 convolution layer and an upsampling layer which are connected in series in sequence; the second branch comprises a fifth 1 × 1 convolutional layer; the third branch comprises a first 3 x 3 expanded convolutional layer, the expansion rate of the first 3 x 3 expanded convolutional layer is 6; the fourth branch comprises a second 3 × 3 expansion convolutional layer, the expansion rate of the second 3 × 3 expansion convolutional layer is 12; the fifth branch comprises a third 3 × 3 expanded convolutional layer, the third 3 × 3 expanded convolutional layer having an expansion rate of 18; and after the feature maps output by the fourth residual module respectively enter five branches, the feature maps pass through a channel dimension splicing layer, are fused with feature information under different receptive fields, and pass through a sixth 1 multiplied by 1 convolution layer to obtain the output feature map of the void space convolution pooling pyramid module.

6. The depth supervision-based Anchor-Free traffic sign detection method according to claim 1, wherein the layer jump structure unit (3) in the step 2 comprises a first 1 x 1 convolutional layer, a second 1 x 1 convolutional layer, a third 1 x 1 convolutional layer, a first depth supervision mechanism, a second depth supervision mechanism and a third depth supervision mechanism, wherein the first 1 x 1 convolutional layer receives an output feature map of a first residual error module in the coding unit (2), outputs the output feature map to the decoding unit (4) after convolution operation, and simultaneously outputs the output feature map to the first depth supervision mechanism when model training is performed; the second 1 x 1 convolutional layer receives the output characteristic diagram of the second residual error module in the coding unit (2), outputs the output characteristic diagram to the decoding unit (4) after convolutional operation, and simultaneously outputs the output characteristic diagram to a second depth supervision mechanism during model training; and the third 1 x 1 convolutional layer receives the output characteristic diagram of the third residual module in the coding unit (2), outputs the output characteristic diagram to a decoding unit (4) after convolutional operation, and simultaneously outputs the output characteristic diagram to a third deep supervision mechanism during model training.

7. The depth supervision-based Anchor-Free traffic sign detection method according to claim 6, wherein the first depth supervision mechanism, the second depth supervision mechanism and the third depth supervision mechanism respectively receive output feature maps of the first 1 x 1 convolution module, the second 1 x 1 convolution module and the third 1 x 1 convolution module in a model training phase; the first depth supervision mechanism, the second depth supervision mechanism and the third depth supervision mechanism have the same structure, and respectively carry out traffic sign central point prediction, offset prediction and scale prediction on a received feature map through three branches, wherein the three branches comprise two 3 multiplied by 3 convolution modules connected in series; obtaining the prediction information of the central point after passing through a seventh 3 multiplied by 3 convolution module and an eighth 3 multiplied by 3 convolution module of the first branch, and performing cross entropy loss calculation with the real central point to obtain cross entropy loss; obtaining prediction information of the offset after passing through a ninth 33 convolution module and a tenth 33 convolution module of the second branch, and performing L1 loss calculation with the real offset to obtain a first L1 loss; obtaining prediction information about the scale after passing through an eleventh 3 × 3 convolution module and a twelfth 3 × 3 convolution module of the third branch, and performing L1 loss calculation with the real scale to obtain a second L1 loss; and adding the cross entropy loss, the first L1 loss and the second L1 loss to obtain an auxiliary loss function, and forming an output value of the deep supervision mechanism.

8. The depth supervision-based Anchor-Free traffic sign detection method according to claim 1, wherein the decoding unit (4) in step 2 comprises a first decoding module, a second decoding module and a third decoding module, wherein the first decoding module decodes the feature map output by the cavity space convolution pooling pyramid module in the receiving and encoding unit (2), adds the feature map to the output of a third 1 x 1 convolution layer in the layer-skipping structure unit (3), and outputs the feature map obtained by the addition to the second decoding module and the output prediction unit (5); the second decoding module carries out decoding processing on the received feature map, adds the feature map to the output of the second 1 x 1 convolution layer in the layer jump structure unit (3), and outputs the feature map obtained by the addition to the third decoding module and the output prediction unit (5); the third decoding module carries out decoding processing on the received feature map, adds the feature map to the output of the first 1 x 1 convolution layer in the layer jump structure unit (3), and outputs the feature map obtained by the addition to an output prediction unit (5); the first decoding module, the second decoding module and the third decoding module have the same structure and respectively comprise a bilinear interpolation layer and a 3 multiplied by 3 convolution layer which are sequentially connected in series.

9. The depth supervision-based Anchor-Free traffic sign detection method according to claim 1, characterized in that the output prediction unit (5) of step 2 comprises: receiving a first bilinear interpolation of a result of output addition of a second decoding module in a decoding unit (4) and a second 1 × 1 convolutional layer in a layer-skipping structural unit (3), receiving a second bilinear interpolation of a result of output addition of the first decoding module in the decoding unit (4) and a third 1 × 1 convolutional layer in the layer-skipping structural unit (3), respectively receiving a result of output addition of the third decoding module in the decoding unit (4) and the first 1 × 1 convolutional layer in the layer-skipping structural unit (3), a channel dimension splicing layer of an output of the first bilinear interpolation and an output of the second bilinear interpolation, respectively, fusing the obtained feature maps into one feature map for output, and respectively carrying out prediction information of the traffic sign center point category and position, the center point offset and the traffic sign scale through three branches by the channel dimension splicing layer and outputting, and the three branches have the same structure and are provided with two series-connected 3 multiplied by 3 convolution modules.

10. The depth supervision-based Anchor-Free traffic sign detection method according to claim 9, wherein the 3 x 3 convolution module comprises a 3 x 3 convolution layer, a BN layer and a RELU layer which are connected in series in sequence.