CN112597996B - Method for detecting traffic sign significance in natural scene based on task driving

Method for detecting traffic sign significance in natural scene based on task driving

Info

Publication number
CN112597996B
CN112597996B (application CN202011577655.7A)
Authority
CN
China
Prior art keywords
convolutional
features
images
traffic sign
global
Prior art date
Legal status
Active
Application number
CN202011577655.7A
Other languages
Chinese (zh)
Other versions
CN112597996A (en)
Inventor
李雨萌
Current Assignee
Shanxi Cloud Times R & D Innovation Center Co ltd
Original Assignee
Shanxi Cloud Times R & D Innovation Center Co ltd
Priority date
Filing date
Publication date
Application filed by Shanxi Cloud Times R & D Innovation Center Co ltd
Priority to CN202011577655.7A
Publication of CN112597996A
Application granted
Publication of CN112597996B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of computer vision and discloses a task-driven method for detecting the saliency of traffic signs in natural scenes, comprising the following steps: S1, acquire training data; S2, input the images of the training set, extract the overall global features of each image with a convolutional neural network, and extract feature information at several different resolutions; a dilated convolutional network applies multi-layer dilated convolution to the feature information at each resolution to extract features as well as contrast features; S3, upsample the features and contrast features to obtain a feature map at each resolution, then fuse the feature maps into the overall local features; S4, predict the traffic sign saliency feature map; S5, repeat steps S2-S4 to train the convolutional neural network and save the trained model; S6, input the image to be predicted and obtain its traffic sign saliency feature map. The method improves traffic sign detection accuracy and can be widely applied in the field of autonomous driving.

Description

Method for detecting traffic sign significance in natural scene based on task driving
Technical Field
The invention relates to the field of computer vision, in particular to the technical field of deep-learning-based image saliency detection, and more particularly to a task-driven method for detecting the saliency of traffic signs in natural scenes.
Background
As the number of vehicles grows, traffic problems grow with it. Recognizing traffic signs is one of the most important problems in driving, and it also matters for road maintenance, driver assistance systems, and autonomous vehicles.
Many practical factors must be considered; for example, in the development of Advanced Driver Assistance Systems (ADAS), the most fundamental task is Traffic Sign Recognition (TSR). TSR is a difficult real-scene pattern recognition problem whose main function is to detect traffic signs so as to provide road information to the driver and prompt reasonable actions. When a vehicle encounters complex conditions such as road congestion, rain or snow, or driver fatigue, TSR can help prevent the accidents that arise from inattention, fatigued driving, or severe weather.
For these reasons, enabling a computer to locate the coordinates of traffic signs precisely becomes particularly important. Image saliency detection is a well-known means of extracting the principal information in image processing; it has become a key technology in computer vision and is widely used in practical vision tasks. It works mainly by simulating the mechanisms of human vision to extract the parts of a scene that people attend to. In other words, detecting traffic signs as salient objects is an important technique for autonomous driving, and the central problems it now faces are how to make detection results match people's subjective intent in real scenes and how to keep detection efficient and robust in complex scenes.
The purpose of saliency detection is to mimic the human visual system, whose selective attention can be divided into two mechanisms: a data-driven, bottom-up attention mechanism, which is task-independent, saliency-driven, and relatively fast; and a task-driven, top-down attention mechanism, which is task-dependent, controlled by intent, and relatively slow. The prior art generally detects in the task-independent, saliency-driven manner because it is fast. In a real scene, however, when a specific target must be detected, as in autonomous driving where the traffic signs a driver needs to attend to must be found, the first approach is likely to detect only the most salient sign while failing to show all the traffic signs in the image. A task-driven method for detecting traffic signs in natural scenes is therefore needed to achieve accurate detection.
Disclosure of Invention
The invention overcomes the defects of the prior art and solves the following technical problem: providing a task-driven method for detecting the saliency of traffic signs in natural scenes, so that traffic signs in natural scenes can be located more accurately.
To solve this technical problem, the invention adopts the following technical scheme: a task-driven method for detecting the saliency of traffic signs in natural scenes, comprising the following steps:
S1, acquisition of training data: collect images containing traffic signs in natural scenes, annotate the traffic signs in the images, and unify the image resolution;
S2, input the images of the training set, extract the overall global features of each image with a convolutional neural network, and extract feature information at several different resolutions; the convolutional neural network comprises several sequentially connected convolutional blocks and a global convolutional block, where the output of each convolutional block corresponds to feature information at one image resolution and the output of the global convolutional block corresponds to the overall global features of the image; a dilated convolutional network applies multi-layer dilated convolution to the feature information at each resolution to extract features, and contrast features are extracted from those features;
S3, upsample the features and contrast features obtained at each resolution in step S2 back to the original resolution to obtain a feature map at each resolution, then fuse the feature maps from the different resolutions into the overall local features by concat fusion;
S4, predict the final traffic sign saliency map from the overall global features extracted in step S2 and the overall local features obtained in step S3;
S5, adjust the parameters of the convolutional neural network and repeat steps S2-S4 until the predicted traffic sign saliency map is consistent with the annotated traffic signs, then save the trained model;
S6, input the image to be predicted and repeat steps S2-S4 to obtain its traffic sign saliency map.
In step S2, the convolutional neural network comprises five convolutional blocks CONV1-CONV5 and a GLOBAL convolutional block; blocks CONV1-CONV2 each comprise two convolutional layers with 3×3 kernels, blocks CONV3-CONV5 each comprise three convolutional layers with 3×3 kernels, the outputs of blocks CONV1, CONV2, CONV3, CONV4 and CONV5 are each connected to a dilated convolutional network, and the GLOBAL convolutional block comprises three convolutional layers with kernel sizes of 5×5, 5×5 and 3×3.
In step S2, the dilated convolutional network comprises four dilated convolutional layers with dilation rates of 1, 3, 5 and 7, respectively.
In step S1, the image resolution is unified to 256×256, and the resolutions of the feature maps output by the convolutional blocks CONV1-CONV5 are 256×256, 128×128, 64×64, 32×32 and 16×16, respectively.
In step S2, a Gaussian pyramid algorithm is used to extract the contrast features.
In step S3, when the feature maps at different resolutions are fused by concat fusion, the convolution kernel size is 1×1.
In step S4, the specific prediction method for the final traffic sign saliency map is as follows: convolve the overall global features obtained in step S2 and the overall local features obtained in step S3 with 1×1 kernels to obtain a global score and a local score, respectively, then perform per-pixel saliency prediction with a Softmax function to obtain the final traffic sign saliency map.
Compared with the prior art, the invention has the following beneficial effects. The invention provides a task-driven method for detecting the saliency of traffic signs in natural scenes. It mainly adopts convolutions whose field of view expands layer by layer, that is, it fully exploits the growing receptive field of the dilated convolutional network to take in as much target information as possible, improving the extraction of traffic sign features, learning the semantic information of the traffic target region and its context, and minimizing the loss of relevant traffic sign information. In addition, the invention captures features by contrast: since a salient object is a foreground object distinct from the background region, a Gaussian pyramid method for computing contrast features is used to find the relatively prominent and important pixels and regions in the image. Finally, the predictions are combined effectively into the final saliency map, and experimental results show that the method locates traffic signs in natural scenes more accurately.
Drawings
Fig. 1 is a network flow chart of the task-driven method for detecting the saliency of traffic signs in natural scenes provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of the structure of the dilated convolutional network according to an embodiment of the invention;
Fig. 3 compares the detection results of the saliency detection method of the invention with those of other methods.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely. The described embodiments are plainly some, but not all, embodiments of the invention; all other embodiments obtained by those of ordinary skill in the art from these embodiments without inventive effort fall within the scope of the invention.
As shown in Figs. 1-2, an embodiment of the invention provides a task-driven method for detecting the saliency of traffic signs in natural scenes, comprising the following steps:
S1, acquisition of training data: collect images containing traffic signs in natural scenes, annotate the traffic signs in the images, and unify the image resolution.
Specifically, in this embodiment, the images are uniformly resized to 256×256 before feature extraction.
S2, input the images of the training set, extract the overall global features of each image with a convolutional neural network, and extract feature information at several different resolutions. The convolutional neural network comprises several sequentially connected convolutional blocks and a global convolutional block, where the output of each convolutional block corresponds to feature information at one image resolution and the output of the global convolutional block corresponds to the overall global features of the image. A dilated convolutional network applies multi-layer dilated convolution to the feature information at each resolution to extract features, and contrast features are then extracted from those features.
Specifically, as shown in Figs. 1-2, in this embodiment the convolutional neural network comprises five convolutional blocks CONV1-CONV5 and a GLOBAL convolutional block. Blocks CONV1-CONV2 each comprise two convolutional layers with 3×3 kernels, and blocks CONV3-CONV5 each comprise three convolutional layers with 3×3 kernels. The outputs of blocks CONV1, CONV2, CONV3, CONV4 and CONV5 are each connected to a dilated convolutional network DILACON1-DILACON5, where each dilated convolutional network comprises four dilated convolutional layers with dilation rates of 1, 3, 5 and 7, respectively. The GLOBAL convolutional block comprises three convolutional layers with kernel sizes of 5×5, 5×5 and 3×3.
As shown in Fig. 1, after the image data enter the convolutional neural network, they pass through blocks CONV1-CONV5 in sequence and are then fed to the GLOBAL block. Meanwhile, blocks CONV1-CONV5 each output feature information at a different resolution; after the dilated convolution operations of DILACON1-DILACON5, the image features at the different resolutions are learned and extracted, and the contrast feature extraction modules (contrast 1-contrast 5) then derive contrast features from them. In this embodiment, blocks CONV1, CONV2, CONV3, CONV4 and CONV5 are the first 13 layers of the VGG-16 network; to extract the overall GLOBAL features of the image, the fully connected layers after the last convolutional block of VGG-16 are removed and a global convolutional block named GLOBAL is appended, comprising three convolutional layers with kernel sizes of 5×5, 5×5 and 3×3.
Specifically, in this embodiment, the resolutions of the feature maps output by blocks CONV1-CONV5 are 256×256, 128×128, 64×64, 32×32 and 16×16, respectively.
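To make the architecture concrete, below is a minimal TensorFlow/Keras sketch of the backbone (TensorFlow being the framework named later in this embodiment). Only the block layout, kernel sizes, VGG-16 correspondence and output resolutions come from the description above; the 128-filter width of the GLOBAL block and the exact pooling placement are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, n_layers, name):
    # n_layers stacked 3x3 convolutions with ReLU, as in VGG-16.
    for i in range(n_layers):
        x = layers.Conv2D(filters, 3, padding='same', activation='relu',
                          name=f'{name}_conv{i + 1}')(x)
    return x

inputs = tf.keras.Input(shape=(256, 256, 3))

# CONV1-CONV5: the first 13 convolutional layers of VGG-16.
c1 = conv_block(inputs, 64, 2, 'conv1')                    # 256x256
c2 = conv_block(layers.MaxPool2D()(c1), 128, 2, 'conv2')   # 128x128
c3 = conv_block(layers.MaxPool2D()(c2), 256, 3, 'conv3')   # 64x64
c4 = conv_block(layers.MaxPool2D()(c3), 512, 3, 'conv4')   # 32x32
c5 = conv_block(layers.MaxPool2D()(c4), 512, 3, 'conv5')   # 16x16

# GLOBAL block: three convolutions (5x5, 5x5, 3x3) replacing the
# VGG-16 fully connected layers; the filter count is an assumption.
g = layers.Conv2D(128, 5, padding='same', activation='relu')(c5)
g = layers.Conv2D(128, 5, padding='same', activation='relu')(g)
global_features = layers.Conv2D(128, 3, padding='same', activation='relu')(g)
```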
As shown in Fig. 2, in this embodiment the dilated convolutional network performs dilated convolution learning with multiple layers of dilated convolutions (dilation rates 1, 3, 5 and 7, respectively) to extract the target's feature information, enlarging the field of view of the convolution so as to capture more informative features.
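Continuing the sketch above, each DILACON branch can be written as four stacked dilated convolutions with the stated rates; the 3×3 kernel size and 128-filter width are assumptions, since the description fixes only the number of layers and the dilation rates.

```python
def dilated_branch(x, filters=128, name='dilaconv'):
    # Four stacked dilated convolutions with dilation rates 1, 3, 5, 7;
    # each layer enlarges the receptive field without reducing resolution.
    for rate in (1, 3, 5, 7):
        x = layers.Conv2D(filters, 3, padding='same', dilation_rate=rate,
                          activation='relu', name=f'{name}_rate{rate}')(x)
    return x

# One branch per backbone block, DILACON1-DILACON5.
dila_feats = [dilated_branch(c, name=f'dilaconv{i + 1}')
              for i, c in enumerate([c1, c2, c3, c4, c5])]
```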
Specifically, in step S2 a Gaussian pyramid algorithm is used to extract the contrast features, finding the relatively prominent and important pixels and regions in the image and thereby obtaining contrast features at the different resolutions.
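The description does not spell out the Gaussian pyramid construction. One plausible reading, sketched below under that assumption, computes each feature map's contrast against its Gaussian-smoothed surround, so responses that stand out from their local background are emphasized; the kernel size and sigma are likewise assumed.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    # 2D Gaussian kernel, normalized to sum to 1.
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return (k / k.sum()).astype('float32')

def contrast_feature(x, size=5, sigma=1.0):
    # Contrast of each channel against its Gaussian-smoothed surround:
    # salient foreground responses differ from their local background.
    channels = x.shape[-1]
    k = gaussian_kernel(size, sigma)
    k = np.tile(k[:, :, None, None], (1, 1, channels, 1))  # depthwise filter
    smoothed = tf.nn.depthwise_conv2d(x, tf.constant(k),
                                      strides=[1, 1, 1, 1], padding='SAME')
    return x - smoothed

contrast_feats = [contrast_feature(f) for f in dila_feats]
```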
S3, upsample the features and contrast features obtained at each resolution in step S2 back to the original resolution to obtain a feature map at each resolution, then fuse the feature maps from the different resolutions into the overall local features by concat fusion with a 1×1 convolution kernel.
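In the running sketch this step looks as follows. The "upsampling learning" of the description suggests a learned transposed convolution; for brevity the sketch substitutes bilinear upsampling before the stated 1×1 fusion convolution, and the 128 output channels are an assumption.

```python
# Restore each resolution's features and contrast features to 256x256,
# then fuse everything into the overall local features with a 1x1 conv.
restored = []
for feat, con in zip(dila_feats, contrast_feats):
    merged = layers.Concatenate()([feat, con])
    scale = 256 // merged.shape[1]          # e.g. 16x16 needs factor 16
    if scale > 1:
        merged = layers.UpSampling2D(scale, interpolation='bilinear')(merged)
    restored.append(merged)

local_features = layers.Conv2D(
    128, 1, activation='relu',
    name='local_fusion')(layers.Concatenate()(restored))
```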
S4, predict the final traffic sign saliency map from the overall global features extracted in step S2 and the overall local features obtained in step S3. The specific prediction method is as follows: convolve the overall global features from step S2 and the overall local features from step S3 with 1×1 kernels to obtain a global score and a local score, respectively, then perform per-pixel saliency prediction with a Softmax function to obtain the final traffic sign saliency map.
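A minimal prediction head consistent with this step is sketched below. How the local and global scores are combined before the Softmax is not stated, so the element-wise addition, and the bilinear upsampling of the 16×16 global score back to image resolution, are assumptions.

```python
# 1x1 convolutions give a two-channel (salient / background) score map
# from the local and global features; Softmax predicts per pixel.
local_score = layers.Conv2D(2, 1, name='local_score')(local_features)
global_score = layers.Conv2D(2, 1, name='global_score')(global_features)
global_score = layers.UpSampling2D(16, interpolation='bilinear')(global_score)

saliency_map = layers.Softmax(axis=-1)(
    layers.Add()([local_score, global_score]))
model = tf.keras.Model(inputs, saliency_map)
```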
S5, adjust the parameters of the convolutional neural network and repeat steps S2-S4 until the predicted traffic sign saliency map is consistent with the annotated traffic signs.
S6, input the image to be predicted and repeat steps S2-S4 to obtain its traffic sign saliency map.
In the experiments, the collected images were made into a training data set and a test data set, and the network model was trained under TensorFlow. The weights of the convolutional neural network were initialized with the pretrained weights of VGG-16, the weights of all newly added convolutional layers were randomly initialized (δ = 0.01), and the biases were initialized to 0. In this embodiment, the model was trained with an Adam optimizer with an initial learning rate of 10⁻⁶, β₁ = 0.9 and β₂ = 0.999. With a batch size of one image over 20 training epochs, the entire training process took about 6 hours on an NVIDIA 1080Ti GPU.
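These hyperparameters translate directly into a configuration sketch like the following; the cross-entropy loss is an assumption, since the description does not name its loss function, and the file path in the final comment is hypothetical.

```python
# Adam optimizer with the stated hyperparameters; batch size 1, 20 epochs.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6,
                                     beta_1=0.9, beta_2=0.999)
model.compile(optimizer=optimizer,
              loss=tf.keras.losses.CategoricalCrossentropy())
# model.fit(train_images, train_masks, batch_size=1, epochs=20)
# model.save('tsr_saliency_model.h5')   # hypothetical path
```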
The trained model is saved and test images are then input for verification; the experimental results show that the method achieves higher accuracy than other saliency detection methods. Performance is first measured with the standard F-measure evaluation index (higher is better) and the mean absolute error (MAE, lower is better): the method achieves a maxF_β of 0.886 and an MAE of 0.062. The experimental effect graphs are then compared; as shown in Fig. 3, each of the three groups a, b and c of detection results shows, from left to right, the detected natural scene image containing traffic signs, the result of the wCtr algorithm, the result of the NDF algorithm, and the result of the algorithm of the invention. The results show that the invention better enlarges the field of view to learn context information, improves feature extraction, and minimizes information loss. In Fig. 3, groups a to c are natural scenes with dark weather, small targets, and occluding objects, respectively, and the method of the invention still achieves relatively accurate results under these conditions.
Finally, it should be noted that the above embodiments merely illustrate, rather than limit, the technical solution of the invention. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features replaced by equivalents, and that such modifications and substitutions do not depart from the spirit of the invention.

Claims (4)

1. A task-driven method for detecting the saliency of traffic signs in a natural scene, characterized by comprising the following steps:
S1, acquisition of training data: collecting images containing traffic signs in natural scenes, annotating the traffic signs in the images, and unifying the image resolution;
S2, inputting the images of the training set, extracting the overall global features of each image with a convolutional neural network, and extracting feature information at several different resolutions; the convolutional neural network comprises several sequentially connected convolutional blocks and a global convolutional block, where the output of each convolutional block corresponds to feature information at one image resolution and the output of the global convolutional block corresponds to the overall global features of the image; a dilated convolutional network applies multi-layer dilated convolution to the feature information at each resolution to extract features, and contrast features are extracted from those features;
S3, upsampling the features and contrast features obtained at each resolution in step S2 back to the original resolution to obtain a feature map at each resolution, then fusing the feature maps from the different resolutions into the overall local features by concat fusion;
S4, predicting the final traffic sign saliency feature map from the overall global features extracted in step S2 and the overall local features obtained in step S3;
S5, adjusting the parameters of the convolutional neural network and repeating steps S2-S4 until the predicted traffic sign saliency feature map is consistent with the annotated traffic signs, then saving the trained model;
S6, inputting the image to be predicted and repeating steps S2-S4 to obtain its traffic sign saliency feature map;
in step S2, the convolutional neural network comprises five convolutional blocks CONV1-CONV5 and a GLOBAL convolutional block connected in sequence; blocks CONV1-CONV2 each comprise two convolutional layers with 3×3 kernels, blocks CONV3-CONV5 each comprise three convolutional layers with 3×3 kernels, the outputs of blocks CONV1, CONV2, CONV3, CONV4 and CONV5 are each connected to a dilated convolutional network, and the GLOBAL convolutional block comprises three convolutional layers with kernel sizes of 5×5, 5×5 and 3×3;
in step S2, a Gaussian pyramid algorithm is used to extract the contrast features;
in step S4, the specific prediction method for the final traffic sign saliency feature map is as follows: convolving the overall global features obtained in step S2 and the overall local features obtained in step S3 with 1×1 kernels to obtain a global score and a local score, respectively, then performing per-pixel saliency prediction with a Softmax function to obtain the final traffic sign saliency feature map.
2. The task-driven method for detecting the saliency of traffic signs in a natural scene according to claim 1, characterized in that in step S2, the dilated convolutional network comprises four dilated convolutional layers with dilation rates of 1, 3, 5 and 7, respectively.
3. The task-driven method for detecting the saliency of traffic signs in a natural scene according to claim 1, characterized in that in step S1 the image resolution is unified to 256×256, and the resolutions of the feature maps output by the convolutional blocks CONV1-CONV5 are 256×256, 128×128, 64×64, 32×32 and 16×16, respectively.
4. The task-driven method for detecting the saliency of traffic signs in a natural scene according to claim 1, characterized in that in step S3, when the feature maps at different resolutions are fused by concat fusion, the convolution kernel size is 1×1.
CN202011577655.7A 2020-12-28 2020-12-28 Method for detecting traffic sign significance in natural scene based on task driving Active CN112597996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011577655.7A CN112597996B (en) 2020-12-28 2020-12-28 Method for detecting traffic sign significance in natural scene based on task driving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011577655.7A CN112597996B (en) 2020-12-28 2020-12-28 Method for detecting traffic sign significance in natural scene based on task driving

Publications (2)

Publication Number Publication Date
CN112597996A CN112597996A (en) 2021-04-02
CN112597996B true CN112597996B (en) 2024-03-29

Family

ID=75202815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011577655.7A Active CN112597996B (en) 2020-12-28 2020-12-28 Method for detecting traffic sign significance in natural scene based on task driving

Country Status (1)

Country Link
CN (1) CN112597996B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536973B (en) * 2021-06-28 2023-08-18 杭州电子科技大学 Traffic sign detection method based on saliency
CN113449667A (en) * 2021-07-08 2021-09-28 四川师范大学 Salient object detection method based on global convolution and boundary refinement

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558880A (en) * 2018-10-16 2019-04-02 杭州电子科技大学 A kind of whole profile testing method with Local Feature Fusion of view-based access control model
CN110287981A (en) * 2019-05-08 2019-09-27 中国科学院西安光学精密机械研究所 Conspicuousness detection method and system based on biological enlightening representative learning
KR20190119261A (en) * 2018-04-12 2019-10-22 가천대학교 산학협력단 Apparatus and method for segmenting of semantic image using fully convolutional neural network based on multi scale image and multi scale dilated convolution
CN110633708A (en) * 2019-06-28 2019-12-31 中国人民解放军军事科学院国防科技创新研究院 Deep network significance detection method based on global model and local optimization
CN110929735A (en) * 2019-10-17 2020-03-27 杭州电子科技大学 Rapid significance detection method based on multi-scale feature attention mechanism
CN111563418A (en) * 2020-04-14 2020-08-21 浙江科技学院 Asymmetric multi-mode fusion significance detection method based on attention mechanism
CN112070753A (en) * 2020-09-10 2020-12-11 浙江科技学院 Multi-scale information enhanced binocular convolutional neural network saliency image detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SG10202108020VA (en) * 2017-10-16 2021-09-29 Illumina Inc Deep learning-based techniques for training deep convolutional neural networks

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Morphologically dilated convolutional neural network for hyperspectral image classification; Vinod Kumar, et al.; Signal Processing: Image Communication; Article 116549 *
Underwater image enhancement algorithm fusing deep learning and an imaging model; Chen Xuelei, et al.; Computer Engineering; Vol. 48, No. 2, pp. 243-249 *
Dilated fully convolutional neural network for vehicle detection; Cheng Yahui, et al.; Computer Systems & Applications *

Also Published As

Publication number Publication date
CN112597996A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
CN111368687B (en) Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation
CN108876780B (en) Bridge crack image crack detection method under complex background
CN111488789B (en) Pedestrian detection method and device for monitoring based on image analysis
CN110136154B (en) Remote sensing image semantic segmentation method based on full convolution network and morphological processing
CN111127449B (en) Automatic crack detection method based on encoder-decoder
CN112488025B (en) Double-temporal remote sensing image semantic change detection method based on multi-modal feature fusion
CN113468967A (en) Lane line detection method, device, equipment and medium based on attention mechanism
CN111611861B (en) Image change detection method based on multi-scale feature association
CN110781980B (en) Training method of target detection model, target detection method and device
CN112597996B (en) Method for detecting traffic sign significance in natural scene based on task driving
CN111882620A (en) Road drivable area segmentation method based on multi-scale information
JP2022025008A (en) License plate recognition method based on text line recognition
CN115063786A (en) High-order distant view fuzzy license plate detection method
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN114519819A (en) Remote sensing image target detection method based on global context awareness
CN112785610B (en) Lane line semantic segmentation method integrating low-level features
Saravanarajan et al. Improving semantic segmentation under hazy weather for autonomous vehicles using explainable artificial intelligence and adaptive dehazing approach
CN116580232A (en) Automatic image labeling method and system and electronic equipment
CN117115616A (en) Real-time low-illumination image target detection method based on convolutional neural network
CN111832463A (en) Deep learning-based traffic sign detection method
CN115049836B (en) Image segmentation method, device, equipment and storage medium
CN111160274A (en) Pedestrian detection method based on binaryzation fast RCNN (radar cross-correlation neural network)
CN116052189A (en) Text recognition method, system and storage medium
CN112446292B (en) 2D image salient object detection method and system
Jia et al. Sample generation of semi‐automatic pavement crack labelling and robustness in detection of pavement diseases

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant