CN111738300A - Optimization algorithm for detecting and identifying traffic signs and signal lamps - Google Patents


Info

Publication number
CN111738300A
CN111738300A · Application CN202010463642.0A
Authority
CN
China
Prior art keywords
classification
feature
stage
candidate
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010463642.0A
Other languages
Chinese (zh)
Inventor
王卓曜
金城
刀坤
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202010463642.0A priority Critical patent/CN111738300A/en
Publication of CN111738300A publication Critical patent/CN111738300A/en
Pending legal-status Critical Current

Classifications

    • G06F 18/24137 Distances to cluster centroids
    • G06F 18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06F 18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 Combinations of networks
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06V 20/582 Recognition of traffic signs
    • G06V 20/584 Recognition of vehicle lights or traffic lights


Abstract

The invention discloses an optimization algorithm for detecting and identifying traffic signs and signal lamps. The algorithm is divided into a feature extraction stage, a region candidate stage and a hierarchical classification stage. In the feature extraction stage, a Ghost bottleneck module is introduced to construct a feature extraction network, big data training is carried out on ImageNet to obtain a pre-training model, image features are extracted by the pre-trained feature extraction network, and the feature map is subjected to pooling processing; in the region candidate stage, an RPN subnet is adopted to obtain candidate regions, and the feature maps corresponding to the candidate regions are cut and scaled so that the feature subgraphs to be classified have the same size; in the hierarchical classification stage, the images are classified by a hierarchical classification method, and traffic lights and traffic signs are identified. Compared with the baseline algorithm, the disclosed algorithm greatly improves performance in all aspects and can meet the real-time and reliability requirements of an automatic driving system.

Description

Optimization algorithm for detecting and identifying traffic signs and signal lamps
Technical Field
The invention belongs to the technical field of statistical pattern recognition and image processing, and particularly relates to an optimization algorithm for traffic sign and signal lamp detection and recognition.
Background
Traffic sign and signal light detection is an important component of road scene perception for autonomous vehicles. The key problem is to locate and identify the targets while meeting real-time requirements, in particular to check whether a target object exists in a complex, fast-moving image sequence and to calculate the position of the target in the image accurately and quickly; the main difficulties to be solved are target identification and positioning under complex illumination, complex backgrounds, multiple scales, multiple viewing angles, occlusion and the like.
For the identification of traffic lights and traffic signs, most current studies treat them as two distinct problems: traffic light identification and traffic sign identification. Traditional traffic signal lamp identification methods mostly depend on visual characteristics such as the color and shape of the signal lamp, so their applicability is poor; deep learning methods treat the task as a special case of target detection, for example adopting the YOLO target detection algorithm to detect and identify traffic lights. The development of traffic sign detection and identification methods is similar. However, target detection methods based on deep learning have high computational cost; if the amount of computation cannot be reduced, a high-power, high-compute GPU is required, and due to the power limitation of the vehicle-mounted supply, the real-time requirement is difficult to meet.
Disclosure of Invention
The invention aims to provide a deep-learning-based optimization algorithm for detecting and identifying traffic signs and traffic signal lamps. The algorithm is based on a hierarchical, fine-classification network structure over a lightweight backbone network; it simultaneously locates and identifies the traffic signal lamps and traffic signs in the image, can meet the requirements of real-time performance and accuracy, and can be applied in a real-time automatic driving system to give early warning about and limit the road driving behavior of an automatic driving automobile so as to avoid traffic accidents. The technical scheme of the invention is specifically introduced as follows.
An optimization algorithm for detecting and identifying traffic signs and signal lamps is divided into three stages: a feature extraction stage, a region candidate stage and a hierarchical classification stage. In the feature extraction stage, the Ghost bottleneck module of GhostNet is introduced to construct a feature extraction network; big data training is carried out on the classical image classification data set ImageNet to obtain a pre-training model, the rich semantic information contained in the image is extracted by the pre-trained feature extraction network, and the feature map is subjected to pooling processing. In the region candidate stage, the RPN subnet in Faster R-CNN is adopted to obtain the candidate regions (Region Proposals), and the feature maps corresponding to the candidate regions are cut and scaled so that all feature subgraphs to be classified have the same size. In the hierarchical classification stage, the classification subnet classifies the images by a two-stage progressive hierarchical classification method so as to identify the traffic lights and traffic signs in the images.
In the invention, the specific steps of the characteristic extraction stage are as follows:
(1) constructing a feature expression network: the most basic original features of the image are extracted from the input image by small common convolution kernels, and the output feature map is fed into a series of Ghost bottleneck modules with gradually increasing channels; the Ghost bottlenecks are divided into different stages according to the size of their input feature maps, and finally the feature map is converted into a feature vector by a global average pooling layer and a convolutional layer for final classification; big data image classification training of the feature expression network is carried out on the classical image classification data set ImageNet to obtain a pre-training model;
(2) separating the feature extraction network: layers 1-18 of the pre-trained feature expression network are taken as the feature extraction layers, the parameters, gradients and BatchNorm coefficients of the first five layers are fixed, and finally the output feature map is subjected to average pooling.
In the invention, the specific steps of the region candidate stage are as follows:
(1) acquiring candidate regions: in the region candidate stage, the RPN subnet in Faster R-CNN is adopted to acquire the candidate regions; the pooled feature map obtained in the feature extraction stage is taken as input, a convolution enriching the semantic information is performed first, and the result is then sent into the RPN network to acquire the coordinates of the candidate regions;
(2) distinguishing positive and negative candidate regions: R denotes all candidate regions located by the RPN network, G denotes all Ground Truths in the image, and the correct candidate regions R⁺ and wrong candidate regions R⁻ obtained by the RPN network are defined as:
R⁺ = { r ∈ R | 0.7 ≤ IoU(r, G) ≤ 1 }
R⁻ = { r ∈ R | 0.01 ≤ IoU(r, G) ≤ 0.3 }
where IoU(r, G) is the intersection-over-union between candidate region r and its best-matching Ground Truth in G;
(3) screening positive candidate regions: all candidate regions are sorted by confidence from large to small, and the first 300 are selected and sent to the subsequent classification stage;
(4) obtaining candidate region feature maps: the coordinate expression of each candidate region is converted into the corresponding part of the feature map, and the feature map corresponding to the candidate region is scaled to 7 × 7 before being sent into the classification subnet, so that all feature subgraphs to be classified have the same size.
In the invention, the specific steps of the hierarchical classification stage are as follows:
(1) global classification phase
In the global classification stage, the feature subgraphs are subjected to a preliminary classification, namely into traffic signal lamps, traffic signs and background classes. If the global classification result is a traffic signal lamp or a traffic sign, the next fine classification is performed;
(2) fine classification stage
In the fine classification stage, the feature map and the information of the global classification are combined, and a finer classification is carried out within the category given by the global classification: if the global classification is traffic signal lamp, it is further classified into red, yellow and green lamps; if the global classification is traffic sign, it is further classified into the various specific traffic signs. A bounding box regression layer is added in the classification subnet, which outputs the fine bounding box coordinates of the target region.
Compared with the prior art, the invention has the following beneficial effects:
(1) in the feature extraction stage, a lightweight Ghost bottleneck module, obtained through model pruning and compression optimization and suitable for edge computing, is adopted to construct the feature extraction network, and the output feature maps are subjected to pooling processing, which reduces feature redundancy, gives a fast forward inference speed and keeps memory occupation low;
(2) in the hierarchical classification stage, a two-stage progressive hierarchical classification method is adopted for each feature subgraph; the inter-class and intra-class differences between the three classes of traffic signs, signal lamps and natural scenes are fully considered, which effectively improves the classification accuracy. Compared with the baseline algorithm, the improved algorithm has greatly improved performance in all aspects and can meet the real-time and reliability requirements of an automatic driving system.
Drawings
FIG. 1: GhostNet network flow chart.
FIG. 2: flow chart of the algorithm of the invention.
Detailed Description
The technical scheme of the invention is explained in detail in the following by combining the drawings and the embodiment.
The invention provides an optimization algorithm for detecting and identifying traffic signs and traffic signal lights, which is divided into three stages: a feature extraction stage, a region candidate stage and a classification stage.
First, feature extraction stage
(1) Constructing a feature expression network
To construct the feature expression network, the size of the input image of the feature extraction network is first fixed at 512 × 512 × 3; 16 smaller common convolution kernels with stride 2 extract the most basic original features of the image, and the output feature map is fed into a series of Ghost bottleneck modules with gradually increasing channels. These Ghost bottlenecks are divided into different stages according to the size of their input feature maps. All Ghost bottlenecks use stride 1, except that the last Ghost bottleneck of each stage uses stride 2. Finally, the feature map is converted into a 1280-dimensional feature vector by a global average pooling layer and a convolutional layer for final classification; the specific structure of the feature expression network is shown in fig. 1. Big data image classification training of the feature expression network is carried out on the classical image classification data set ImageNet to obtain a pre-training model.
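The computational saving of the Ghost module at the heart of these bottlenecks can be sketched with a little arithmetic: an ordinary convolution spends H·W·c_in·c_out·k² multiply-accumulates, while a Ghost module generates only c_out/s intrinsic channels with its primary convolution and derives the rest with cheap d × d depthwise operations. A minimal sketch (the layer sizes below are illustrative, not taken from the patent):

```python
def conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulate count of an ordinary k x k convolution."""
    return h * w * c_in * c_out * k * k

def ghost_module_flops(h, w, c_in, c_out, k=1, d=3, s=2):
    """Ghost module: a primary conv makes c_out/s 'intrinsic' channels,
    then cheap d x d depthwise ops generate the remaining ghost channels."""
    intrinsic = c_out // s
    primary = h * w * c_in * intrinsic * k * k
    cheap = h * w * (s - 1) * intrinsic * d * d
    return primary + cheap

# compare on a hypothetical mid-network layer
ordinary = conv_flops(32, 32, 240, 240, 1)
ghost = ghost_module_flops(32, 32, 240, 240, k=1, d=3, s=2)
print(ordinary / ghost)  # roughly 1.93, i.e. close to s = 2
```

With s = 2 the Ghost module costs about half of the ordinary convolution, which is where the lightweight backbone's speed-up comes from.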
(2) Separation feature extraction network
The pre-trained feature expression network can already extract the rich basic semantic features contained in an image. The first 18 layers of the pre-trained feature expression network are taken as the feature extraction layers; the parameters, gradients and BatchNorm coefficients of the first five layers are fixed so that they are not affected by gradient back-propagation, and dropout is turned off, yielding a basic image feature extractor. The output feature map is then subjected to 7 × 7 average pooling to obtain the feature map F1 to be sent to the subsequent region candidate RPN.
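The average pooling step can be sketched in plain NumPy. This is a simplified stand-in that assumes the spatial sizes divide evenly by the output size, not the patent's exact pooling layer:

```python
import numpy as np

def avg_pool_to(feature_map, out_h=7, out_w=7):
    """Average-pool a (C, H, W) feature map down to (C, out_h, out_w).
    Assumes H and W are divisible by out_h / out_w for simplicity."""
    c, h, w = feature_map.shape
    assert h % out_h == 0 and w % out_w == 0
    bh, bw = h // out_h, w // out_w
    # reshape into (C, out_h, bh, out_w, bw) blocks and average each block
    blocks = feature_map.reshape(c, out_h, bh, out_w, bw)
    return blocks.mean(axis=(2, 4))

f = np.random.rand(960, 14, 14)   # hypothetical backbone output
pooled = avg_pool_to(f)
print(pooled.shape)  # (960, 7, 7)
```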
Second, extracting candidate region stage
(1) Candidate region acquisition
The region candidate stage uses the RPN subnet in Faster R-CNN to obtain the candidate regions (Region Proposals). The pooled feature map F1 obtained in the feature extraction stage is taken as input; a common convolution with kernel size 1 × 1, stride 1 and 960 convolution kernels is first performed to obtain a semantically rich feature map F2, which is then sent into the RPN network to obtain all candidate regions of interest R located by the RPN network. Each candidate region includes its coordinates and its confidence.
(2) Positive and negative candidate region differentiation
Let G denote all Ground Truths in the image. In order to distinguish the correct candidate regions R⁺ and the wrong candidate regions R⁻ obtained by the RPN network, the intersection-over-union (IoU) between each candidate region and its corresponding Ground Truth is calculated: a candidate region whose IoU with the corresponding Ground Truth is between 0.7 and 1 is listed as a positive sample in R⁺; a candidate region whose IoU is between 0.01 and 0.3 is listed as a negative sample in R⁻; candidate regions whose IoU is between 0.3 and 0.7 are ignored. The specific definition is as follows:
R⁺ = { r ∈ R | 0.7 ≤ IoU(r, G) ≤ 1 }        (1)
R⁻ = { r ∈ R | 0.01 ≤ IoU(r, G) ≤ 0.3 }        (2)
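The IoU thresholds above (0.7 to 1 positive, 0.01 to 0.3 negative, otherwise ignored) can be sketched as follows; the box format and helper names are hypothetical:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(proposal, ground_truths):
    """Return +1 (positive), -1 (negative) or 0 (ignored) using the
    patent's thresholds: IoU in [0.7, 1] -> positive, [0.01, 0.3] -> negative."""
    best = max((iou(proposal, g) for g in ground_truths), default=0.0)
    if 0.7 <= best <= 1.0:
        return +1
    if 0.01 <= best <= 0.3:
        return -1
    return 0

gts = [(10, 10, 50, 50)]
print(label_proposal((12, 12, 50, 50), gts))   # high overlap -> 1
print(label_proposal((40, 40, 90, 90), gts))   # low overlap  -> -1
```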
(3) positive candidate region screening
All candidate regions R obtained by the RPN are sorted by confidence from large to small, and the first 300 candidate regions are selected and sent to the subsequent classification stage.
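The screening step is a plain sort-and-truncate; a minimal sketch with hypothetical (confidence, box) proposal tuples:

```python
def select_top_proposals(proposals, k=300):
    """Sort candidate regions by confidence (descending) and keep the top k.
    Each proposal is a (confidence, (x1, y1, x2, y2)) tuple."""
    ranked = sorted(proposals, key=lambda p: p[0], reverse=True)
    return ranked[:k]

props = [(0.2, (0, 0, 1, 1)), (0.9, (2, 2, 3, 3)), (0.5, (4, 4, 5, 5))]
print(select_top_proposals(props, k=2))  # the 0.9 and 0.5 proposals survive
```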
(4) Candidate region feature map acquisition
The part of the feature map F1 corresponding to the coordinates of each selected candidate region is cut out to extract its corresponding feature map F1'. Before entering the classification subnet, the feature map F1' corresponding to the candidate region is scaled to 7 × 7 to obtain F1″, so that all feature subgraphs F1″ to be classified have the same size before being sent to the subsequent classification network.
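This step can be sketched as a crop followed by a resize to 7 × 7. The feature stride of 16 and the nearest-neighbor resize below are illustrative assumptions; the patent does not specify the interpolation:

```python
import numpy as np

def crop_and_resize(feature_map, box, stride=16, out=7):
    """Cut the part of a (C, H, W) feature map under an image-space box
    and nearest-neighbor resize it to (C, out, out). The feature stride
    (16 here) is an illustrative assumption, not fixed by the patent."""
    x1, y1, x2, y2 = [int(round(v / stride)) for v in box]
    crop = feature_map[:, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
    c, h, w = crop.shape
    ys = np.arange(out) * h // out   # nearest source row per output row
    xs = np.arange(out) * w // out   # nearest source column per output column
    return crop[:, ys][:, :, xs]

f1 = np.random.rand(960, 32, 32)
sub = crop_and_resize(f1, (64, 64, 320, 256))
print(sub.shape)  # (960, 7, 7)
```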
Third, classification stage
For each feature subgraph F1″, the classification subnetwork classifies the image by a hierarchical classification method to identify the traffic lights and traffic signs therein. Specifically, the hierarchical classification method is divided into two stages: a global classification stage and a fine classification stage.
(1) Global classification phase
In the global classification stage, the feature maps undergo a preliminary classification through a fully connected layer to obtain a global classification result p, covering the traffic signal lamp, traffic sign and background classes. If the global classification result is a traffic light or a traffic sign, the next fine classification is carried out.
(2) Fine classification stage
In the fine classification stage, the information of the feature map F1″ and of the global classification result p is combined, and a finer classification is carried out within the category given by the global classification, obtaining a fine classification result p̂: if the global classification is traffic signal lamp, it is further classified into red, yellow and green lamps; if the global classification is traffic sign, it is further classified into the various specific traffic signs (such as "no pedestrians" and the like). A bounding box regression layer is added in the classification subnet, whose output is the fine bounding box coordinate t of the target region.
(3) Expression of results
Finally, the whole network outputs the detection results in the input image, including the global classification result p, the fine classification result p̂ and the location information t. The output result is expressed as a triple (p, p̂, t), e.g. (traffic signal lamp, red lamp, [(0.1, 0.2), (0.2, 0.4)]).
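The two-stage decision and the output triple can be sketched as follows. The class lists are illustrative placeholders; the patent's actual fine sign classes are the Tsinghua-Tencent 100K categories:

```python
GLOBAL_CLASSES = ["traffic light", "traffic sign", "background"]
FINE_CLASSES = {
    "traffic light": ["red", "yellow", "green"],
    "traffic sign": ["no pedestrians", "speed limit", "stop"],  # illustrative subset
}

def hierarchical_decision(global_probs, fine_probs, box):
    """Two-stage decision: pick the global class first, then refine it
    within that class; background regions get no fine label."""
    g = GLOBAL_CLASSES[max(range(len(global_probs)), key=global_probs.__getitem__)]
    if g == "background":
        return None
    fine = FINE_CLASSES[g]
    f = fine[max(range(len(fine)), key=fine_probs[g].__getitem__)]
    return (g, f, box)

out = hierarchical_decision(
    [0.8, 0.15, 0.05],
    {"traffic light": [0.7, 0.2, 0.1], "traffic sign": [0.3, 0.3, 0.4]},
    [(0.1, 0.2), (0.2, 0.4)],
)
print(out)  # ('traffic light', 'red', [(0.1, 0.2), (0.2, 0.4)])
```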
Four, calculation of loss function
As can be seen from the above description of the network, the output of the network contains three items: the global classification result p, the fine classification result p̂, and the bounding box location t. The loss function of the network therefore also measures the difference between the predicted and true values in three parts: the global classification loss L_cls, the fine classification loss L′_cls, and the bounding box positioning loss L_loc. The specific definitions are as follows:
L_cls = -(1/N_tot) Σ_i u_i log(p_i)        (3)
In the global classification loss L_cls, p_i and u_i are respectively the global classification result predicted by the network and its true value, and N_tot represents the total number of candidate regions, used to normalize the global classification loss;
L′_cls = -(1/N_g) Σ_i δ_i û_i log(p̂_i)        (4)
In the fine classification loss L′_cls, p̂_i and û_i are respectively the fine classification result predicted by the network and its true value, and δ_i is 1 or 0 to indicate whether the global classification result is correct; setting δ_i to 0 avoids a globally misclassified term being penalized twice. N_g denotes the number of regions with correct global classification, used to normalize the fine classification loss;
L_loc = (1/N_+) Σ_i smooth_L1(t_i - v_i)        (5)
In the bounding box positioning loss L_loc, t_i and v_i are respectively the bounding box position predicted by the network and its true value, and N_+ represents the total number of correctly classified candidate regions, used to normalize the bounding box positioning loss;
L(p_i, t_i) = λL_cls + L′_cls + L_loc        (6)
Finally, the network loss function adopts a weighted sum: the weight parameter λ balances the bounding box positioning error against the global and fine classification losses; λ is a hyper-parameter, and its value was determined to be 2 through experiments.
Network training
The training process of the network adopts the idea of transfer learning: the whole training process is divided into two stages, a pre-training stage and a transfer learning stage.
(1) Pre-training phase
In the pre-training stage, big data image classification training of the feature expression network is carried out on the classical image classification data set ImageNet to obtain a pre-training model. The network structure of the feature expression network is shown in fig. 1. In this phase, the hyper-parameters are set as follows: batch size 1024, learning rate 0.5 with linear decay, weight decay 4e-5, momentum 0.9, label smoothing 0.1, dropout 0.1, and a total of more than 30000 iterations.
(2) Transfer learning phase
In the transfer learning stage, training is first carried out on the MS COCO target detection data set to obtain a good initial model. In the second stage, based on this initial model, fine-tuning is carried out with the traffic signal lamp and traffic sign data set Tsinghua-Tencent 100K used by the baseline authors, yielding the final detection model with strong generalization capability. Besides the two-stage training method, the following strategies are adopted in the training process: training with a larger batch size (32 in the experiment); setting a smaller loss threshold (such as 0.03) or a larger iteration-number threshold (50,000 iterations in the experiment); and doubling the learning rate every certain number of iterations (e.g., 100,000).
Example 1
Multiple sets of comparative experiments were performed using the test set of the baseline algorithm's authors, Tsinghua-Tencent 100K, which contains 1500 images with a total of 3120 traffic signs and traffic lights. Following the partition used by the baseline authors, the detection results are counted separately for targets with area less than 32² (small), area from 32² to 96² (medium) and area more than 96² (large); the intersection-over-union (IoU) threshold is set to 0.5, and the precision rate, recall rate and mAP are calculated respectively. In addition, in order to compare the computation amount and memory consumption, the average forward inference speed, memory occupation and model size of the baseline algorithm and of the invention are counted.
Table 1 comparative experiment performance reference table
[The contents of Table 1 are provided as images in the original publication.]
In contrast to the baseline method, the invention adopts a lightweight feature extraction network after model compression and pruning. The table shows that the difference between the precision and recall of the invention and of the baseline method on large, medium and small targets is no more than two percentage points, while the recall and precision on medium-target detection are improved by 1.2 to 1.46 percentage points over the baseline. Meanwhile, the memory occupation is only 906 MB, one third of the baseline method, and the forward inference speed reaches 15 frames per second, 4 times that of the baseline method, which can meet the high real-time requirement of an automatic driving system.

Claims (4)

1. An optimization algorithm for detecting and identifying traffic signs and signal lamps, characterized by comprising three stages: a feature extraction stage, a region candidate stage and a hierarchical classification stage; in the feature extraction stage, the Ghost bottleneck module of GhostNet is introduced to construct a feature extraction network, big data training is carried out on the classical image classification data set ImageNet to obtain a pre-training model, the rich semantic information contained in the image is extracted by the pre-trained feature extraction network, and the feature map is subjected to pooling processing; in the region candidate stage, the RPN subnet in Faster R-CNN is adopted to obtain the candidate regions, and the feature maps corresponding to the candidate regions are cut and scaled so that all the feature subgraphs to be classified have the same size; in the hierarchical classification stage, the classification subnet classifies the images by a two-stage progressive hierarchical classification method so as to identify the traffic lights and traffic signs in the images.
2. The optimization algorithm according to claim 1, wherein the specific steps of the feature extraction stage are as follows:
(1) constructing a feature expression network: the most basic original features of the image are extracted from the input image by small common convolution kernels, and the output feature map is fed into a series of Ghost bottleneck modules with gradually increasing channels; the Ghost bottlenecks are divided into different stages according to the size of their input feature maps, and finally the feature map is converted into a feature vector by a global average pooling layer and a convolutional layer for final classification; big data image classification training of the feature expression network is carried out on the classical image classification data set ImageNet to obtain a pre-training model;
(2) separating the feature extraction network: layers 1-18 of the pre-trained feature expression network are taken as the feature extraction layers, the parameters, gradients and BatchNorm coefficients of the first five layers are fixed, and finally the output feature map is subjected to average pooling.
3. The optimization algorithm according to claim 1, wherein the specific steps of the region candidate stage are as follows:
(1) acquiring candidate regions: in the region candidate stage, the RPN subnet in Faster R-CNN is adopted to acquire the candidate regions; the pooled feature map obtained in the feature extraction stage is taken as input, a convolution enriching the semantic information is performed first, and the result is then sent into the RPN network to acquire the coordinates of the candidate regions;
(2) distinguishing positive and negative candidate regions: R denotes all candidate regions located by the RPN network, G denotes all Ground Truths in the image, and the correct candidate regions R⁺ and wrong candidate regions R⁻ obtained by the RPN network are distinguished and defined as:
R⁺ = { r ∈ R | 0.7 ≤ IoU(r, G) ≤ 1 }
R⁻ = { r ∈ R | 0.01 ≤ IoU(r, G) ≤ 0.3 }
where IoU(r, G) is the intersection-over-union between candidate region r and its best-matching Ground Truth in G;
(3) screening positive candidate regions: all candidate regions are sorted by confidence from large to small, and the first 300 are selected and sent to the subsequent classification stage;
(4) obtaining candidate region feature maps: the coordinate expression of each candidate region is converted into the corresponding part of the feature map, and the feature map corresponding to the candidate region is scaled to 7 × 7 before being sent into the classification subnet, so that all feature subgraphs to be classified have the same size.
4. The optimization algorithm according to claim 1, wherein the specific steps of the hierarchical classification stage are as follows:
(1) global classification phase
In the global classification stage, the feature subgraphs are subjected to a preliminary classification, namely into traffic signal lamps, traffic signs and background classes; if the global classification result is a traffic signal lamp or a traffic sign, the next step of fine classification is carried out;
(2) fine classification stage
In the fine classification stage, the feature map and the information of the global classification are combined, and a finer classification is carried out within the category given by the global classification: if the global classification is traffic signal lamp, it is further classified into red, yellow and green lamps; if the global classification is traffic sign, it is further classified into the various specific traffic signs; and a bounding box regression layer is added in the classification subnet, which outputs the fine bounding box coordinates of the target region.
CN202010463642.0A 2020-05-27 2020-05-27 Optimization algorithm for detecting and identifying traffic signs and signal lamps Pending CN111738300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010463642.0A CN111738300A (en) 2020-05-27 2020-05-27 Optimization algorithm for detecting and identifying traffic signs and signal lamps


Publications (1)

Publication Number Publication Date
CN111738300A true CN111738300A (en) 2020-10-02

Family

ID=72647915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010463642.0A Pending CN111738300A (en) 2020-05-27 2020-05-27 Optimization algorithm for detecting and identifying traffic signs and signal lamps

Country Status (1)

Country Link
CN (1) CN111738300A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103903013A (en) * 2014-04-15 2014-07-02 复旦大学 Optimization algorithm of unmarked flat object recognition
US20150221116A1 (en) * 2013-05-29 2015-08-06 Capso Vision, Inc. Method of Overlap-Dependent Image Stitching for Images Captured Using a Capsule Camera
CN108009518A (en) * 2017-12-19 2018-05-08 大连理工大学 A kind of stratification traffic mark recognition methods based on quick two points of convolutional neural networks
CN108985169A (en) * 2018-06-15 2018-12-11 浙江工业大学 Across the door operation detection method in shop based on deep learning target detection and dynamic background modeling
CN110163187A (en) * 2019-06-02 2019-08-23 东北石油大学 Remote road traffic sign detection recognition methods based on F-RCNN
CN110929593A (en) * 2019-11-06 2020-03-27 哈尔滨工业大学(威海) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Kai Han et al.: "GhostNet: More Features from Cheap Operations", arXiv *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926501A (en) * 2021-03-23 2021-06-08 哈尔滨理工大学 Traffic sign detection algorithm based on YOLOv5 network structure
CN113052254A (en) * 2021-04-06 2021-06-29 安徽理工大学 Multi-attention ghost residual fusion classification model and classification method thereof
CN113378890A (en) * 2021-05-17 2021-09-10 浙江工业大学 Lightweight pedestrian and vehicle detection method based on improved YOLO v4
CN113378890B (en) * 2021-05-17 2024-03-22 浙江工业大学 Lightweight pedestrian vehicle detection method based on improved YOLO v4
CN113361643A (en) * 2021-07-02 2021-09-07 人民中科(济南)智能技术有限公司 Deep learning-based universal mark identification method, system, equipment and storage medium
CN113963350A (en) * 2021-11-08 2022-01-21 西安链科信息技术有限公司 Vehicle identification detection method, system, computer equipment, storage medium and terminal

Similar Documents

Publication Publication Date Title
Hasegawa et al. Robust Japanese road sign detection and recognition in complex scenes using convolutional neural networks
CN108171136B (en) System and method for searching images by images for vehicles at multi-task gate
CN108830188B (en) Vehicle detection method based on deep learning
CN111738300A (en) Optimization algorithm for detecting and identifying traffic signs and signal lamps
Li et al. Traffic light recognition for complex scene with fusion detections
CN108921083B (en) Illegal mobile vendor identification method based on deep learning target detection
Zhang et al. Pedestrian detection method based on Faster R-CNN
CN110263786B (en) Road multi-target identification system and method based on feature dimension fusion
CN111368754B (en) Airport runway foreign matter detection method based on global context information
CN107038442A (en) A kind of car plate detection and global recognition method based on deep learning
CN109492596A (en) A kind of pedestrian detection method and system based on K-means cluster and region recommendation network
Pei et al. Localized traffic sign detection with multi-scale deconvolution networks
CN108985145A (en) The Opposite direction connection deep neural network model method of small size road traffic sign detection identification
CN112270286A (en) Shadow interference resistant monochrome video target tracking method
CN114882351B (en) Multi-target detection and tracking method based on improved YOLO-V5s
CN112949510A (en) Human detection method based on fast R-CNN thermal infrared image
Hasegawa et al. Robust detection and recognition of japanese traffic sign in the complex scenes based on deep learning
Yun et al. Part-level convolutional neural networks for pedestrian detection using saliency and boundary box alignment
CN112418207A (en) Weak supervision character detection method based on self-attention distillation
CN112307894A (en) Pedestrian age identification method based on wrinkle features and posture features in community monitoring scene
CN110555425A (en) Video stream real-time pedestrian detection method
CN115909276A (en) Improved YOLOv 5-based small traffic sign target detection method in complex weather
Chen et al. Research on vehicle detection and tracking algorithm for intelligent driving
Pazhoumand-Dar et al. DTBSVMs: A new approach for road sign recognition
Li et al. Chinese license plate recognition algorithm based on unet3+

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201002