CN111611919B - Road scene layout analysis method based on structured learning - Google Patents

Road scene layout analysis method based on structured learning

Info

Publication number
CN111611919B
Authority
CN
China
Prior art keywords
sub
scene
hidden variable
labels
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010431561.2A
Other languages
Chinese (zh)
Other versions
CN111611919A (en
Inventor
李垚辰
袁建
董子坤
王雨潇
刘跃虎
Current Assignee
RESEARCH INSTITUTE OF XI'AN JIAOTONG UNIVERSITY IN SUZHOU
Original Assignee
RESEARCH INSTITUTE OF XI'AN JIAOTONG UNIVERSITY IN SUZHOU
Priority date
Filing date
Publication date
Application filed by RESEARCH INSTITUTE OF XI'AN JIAOTONG UNIVERSITY IN SUZHOU filed Critical RESEARCH INSTITUTE OF XI'AN JIAOTONG UNIVERSITY IN SUZHOU
Priority to CN202010431561.2A priority Critical patent/CN111611919B/en
Publication of CN111611919A publication Critical patent/CN111611919A/en
Application granted granted Critical
Publication of CN111611919B publication Critical patent/CN111611919B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

A road scene layout analysis method based on structured learning collects and expands a traffic scene image data set, and labels and preprocesses the data set according to scene platform classification. The image is segmented into sub-regions: superpixel segmentation is performed on the image, a boosted decision tree regressor is trained with the superpixel features and labels to obtain an initial segmentation result, and the initial segmentation result is then optimized with a Markov random field to obtain the final segmentation result. Next, features are extracted from the sub-regions, an SVM classifier is trained with the sub-region features and hidden variable labels, and the combination of sub-region hidden variables of each picture is predicted. Finally, a decision tree is constructed from the correspondence between sub-region hidden variable combinations and scene platform labels, and the scene platform label corresponding to a group of hidden variable labels is found through the decision tree. Based on road scene pictures and videos of simple road traffic scene environments, the method effectively predicts the traffic scene platform and is accurate, simple and effective.

Description

Road scene layout analysis method based on structured learning
Technical Field
The invention belongs to the field of image processing, computer vision and pattern recognition, and particularly relates to a road scene layout analysis method based on structured learning.
Background
Estimation of the layout of traffic scenes has very important applications in the field of unmanned driving, with broad prospects in practical problems such as three-dimensional reconstruction of road scenes. Common traffic scene layout estimation methods are based either on probabilistic graphical model inference or on convolutional neural network prediction. Methods based on probabilistic graphical model inference, such as the method proposed by Geiger et al. (Geiger A, Lauer M, Wojek C, et al. 3D Traffic Scene Understanding From Movable Platforms [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(5): 1012-), involve a large amount of computation and are not easy to handle. Prediction methods based on convolutional neural networks, such as the method proposed by Wu et al. (F.-Y. Wu, S.-Y. Yan, J. S. Smith, and B.-L. Zhang, "Traffic scene recognition based on deep CNN and VLAD spatial pyramids," in Machine Learning and Cybernetics (ICMLC), 2017 International Conference on, vol. 1, pp. 156-161, IEEE, 2017), extract CNN features from image patches generated by a region proposal algorithm, reduce their dimension, encode them with VLAD, and feed them to a classifier, finally dividing traffic scenes into 10 classes; however, such methods have complicated steps and a large amount of computation.
Disclosure of Invention
In order to solve the problems in the prior art, the present invention provides a road scene layout analysis method based on structured learning.
In order to achieve the purpose, the invention adopts the following technical scheme:
a road scene layout analysis method based on structured learning comprises the following steps:
step 1: collecting traffic scene images, forming a traffic scene image data set, and carrying out labeling and preprocessing on the traffic scene image data set according to scene platform classification;
step 2: performing subregion segmentation on the marked and preprocessed image based on supervised training and graph model optimization to obtain a subregion segmentation result;
Step 3: modeling the sub-topics of the sub-regions in the segmentation result with hidden variables, training a classifier with the cutting-plane structured SVM method with N slack variables, continuously and iteratively updating the weight until the value of the loss function is minimal, obtaining the optimized classifier parameters, and inferring hidden variable labels with the optimized classifier parameters to obtain the hidden variables of the sub-regions;
Step 4: constructing a decision tree with the CART algorithm, and inferring the scene platform label corresponding to the combination of sub-region hidden variable labels.
The further improvement of the invention is that in step 1 the specific processes of labeling and preprocessing are as follows: each pixel in the traffic scene image data set is labeled with a sub-region label, the labeled data set is cleaned, samples with missing labels are filtered out, and the pictures are then resized to 256 × 256.
The invention has the further improvement that the specific process of step 2 is as follows: the labeled and preprocessed traffic scene image data set is split into a training set and a test set; superpixel segmentation is performed on all images in the training set and the features of each superpixel are extracted; a boosted decision tree regressor is then trained with the sub-region labels and the extracted superpixel features, and its output is taken as the initial segmentation result; finally, a Markov random field is constructed on the initial segmentation result and optimized to obtain the sub-region segmentation result.
A further refinement of the invention is that the features of each superpixel include SIFT features, the RGB color mean and variance, GIST appearance features and location features;
the specific process of constructing a Markov random field to optimize the initial segmentation result into the sub-region segmentation result is to minimize the energy function J(c):

J(c) = Σ_{s_i ∈ SP} E_data(s_i, c_i) + λ Σ_{(s_i, s_j) ∈ A} E_smooth(c_i, c_j)   (1)

where SP is the set of superpixels, s_i is the i-th superpixel and c_i is its corresponding sub-region category label; s_j is the j-th superpixel and c_j is its corresponding sub-region category label; A is the set of adjacent superpixel pairs, and (s_i, s_j) is one such pair; E_data and E_smooth are the data term and the smoothing term, respectively, and λ is the weight of the smoothing term.
A further development of the invention is that the data term E_data and the smoothing term E_smooth take the following concrete form:

E_data(s_i, c_i) = −w_i σ(L(s_i, c_i))   (2)

E_smooth(c_i, c_j) = −log[(P(c_i|c_j) + P(c_j|c_i))/2] × δ[c_i ≠ c_j]   (3)

In the data term, L(s_i, c_i) is the likelihood that the i-th superpixel belongs to sub-region c_i, σ is the sigmoid function, and w_i is the weight of the i-th superpixel; δ is an indicator function.
A further improvement of the invention is that P(c_i|c_j) in the smoothing term is the conditional probability that a superpixel belongs to sub-region c_i given that its neighboring superpixel belongs to sub-region c_j; the indicator function δ is 1 when the condition c_i ≠ c_j holds and 0 when it does not.
The invention is further improved in that the specific process of step 3 is as follows: the sub-topic of a sub-region in the segmentation result is first modeled with a hidden variable, then a classifier is trained with the cutting-plane structured SVM method with N slack variables, and the weight is continuously and iteratively updated until the value of the loss function is minimal, yielding the optimized classifier parameters.
The invention is further improved in that the specific process of step 3 is as follows: when training the SVM classifier, the extracted feature vector x_i of the sub-region and the hidden variable label z_i are input for supervised training; the extracted sub-region features include HOG, Gabor, LBP and RGB, and the loss function is defined as follows:

min_{ω, ξ} (1/2)||ω||² + λ Σ_{i=1}^{M} ξ_i   (4)

s.t. ∀ i, ∀ z ∈ Z:

ξ_i ≥ Δ(z_i, z) + F(z, x_i; ω) − F(z_i, x_i; ω)

where ξ_i is the slack variable of the i-th sample, ω is the weight, λ is the penalty parameter, x_i is an L-dimensional feature vector, z_i is the hidden variable label of the i-th sample, z ranges over all labels in the hidden variable label set Z, Δ(z_i, z) is the distance between the hidden variable label of the i-th sample and a label in the set, and F is the objective function defined as follows:
F(x_i, z_i; ω) = ω · φ(x_i, z_i)   (5)

where x_i is an L-dimensional feature vector, ω is the weight, φ(x_i, z_i) is the feature mapping function, and M is the number of sub-region samples. The feature mapping function φ(x_i, z_i) has the form:

φ(x_i, z_i) = (0, …, 0, x_iᵀ, 0, …, 0)ᵀ   (6)

that is, the only non-zero segment of φ(x_i, z_i) equals x_i and occupies the z_i-th position in φ(x_i, z_i); ω* is the optimization parameter that minimizes the loss function.
The invention is further improved in that, when inferring the hidden variable label, all hidden variable labels z are enumerated and the label z* that maximizes the objective function F(x, z; ω*) is taken as the inference result:

z* = argmax_{z ∈ Z} F(x, z; ω*)   (7)

where ω* is the optimization parameter that minimizes the loss function and Z is the set of hidden variable labels.
The invention is further improved in that the specific process of step 4 is as follows: 14 scene platform labels are defined according to scene layout and structure, and the hidden variable combinations z* related to each kind of scene platform are collected as data; a decision tree is constructed with the CART algorithm, a group of hidden variable labels is input into the decision tree, and the scene platform label finally corresponding to those hidden variable labels can be found through the decision tree.
Compared with the prior art, the invention has the following beneficial effects:
Firstly, a traffic scene image data set is collected and expanded, and the data set is labeled and preprocessed according to scene platform classification; secondly, superpixel segmentation is performed on the image, a boosted decision tree regressor is trained with the superpixel features and labels to segment the image into sub-regions, and the initial segmentation result is optimized with a Markov random field to obtain the final segmentation result; then features are extracted from the segmented sub-regions, an SVM classifier is trained with the sub-region features and artificially defined hidden variable labels, and the combination of sub-region hidden variables of each picture is predicted; finally, a decision tree is constructed from the correspondence between sub-region hidden variable combinations and scene platform labels, through which the scene platform label corresponding to a group of sub-region hidden variable labels can be found simply and conveniently. The method has high accuracy and is simple and effective. In image sub-region segmentation, compared with existing methods such as unsupervised clustering, the method performs supervised training with the low-level features of the images and optimizes the result with a graph model, so the result is more accurate.
In scene platform prediction, compared with methods that analyze picture layout and structure with a neural network, the method models the sub-topics of the sub-regions with hidden variables, extracts low-level features, and mines the high-level semantics of each part of the image from the bottom up, overcoming the drawback that a holistic representation cannot model a picture; no complex network structure is needed, training consumes fewer resources, and the confidence is higher. On the other hand, the scene platform labels corresponding to the hidden variable combinations are inferred with a decision tree; compared with a supervised training method, constructing the decision tree requires little computation, and the inference accuracy can reach 100% when the input hidden variable combination is predicted correctly.
Drawings
Fig. 1 is a schematic view of a road scene platform.
Fig. 2 is a schematic diagram of road image segmentation.
FIG. 3 is a schematic road scene platform inference diagram.
FIG. 4 is a comparison graph of various SVM model tests.
Fig. 5 is a road scene platform decision tree.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
The specific method of the invention is as follows:
Step 1: collecting traffic scene images to form a traffic scene image data set, and labeling and preprocessing the traffic scene image data set according to scene platform classification. The specific processes of labeling and preprocessing are as follows: each pixel in the traffic scene image data set is labeled with a sub-region label, the labeled data set is cleaned, samples with missing labels are filtered out, and the pictures are then resized to 256 × 256.
Step 2: performing sub-region segmentation on the resized images: the images are segmented into sub-regions based on supervised training and graph model optimization to obtain the sub-region segmentation result. The specific process is as follows:
The resized traffic scene image data set is split into a training set and a test set. Superpixel segmentation is performed on all images in the training set and the features of each superpixel are extracted; the extracted features include SIFT features, the RGB color mean and variance, GIST appearance features and location features. A boosted decision tree regressor is then trained with the sub-region labels and the extracted superpixel features. The output of the regressor is the initial segmentation result, namely the likelihood that each superpixel belongs to each sub-region category; each superpixel is assigned to the sub-region with the maximum likelihood.
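As a minimal illustrative sketch of this assignment step (not the patent's implementation; the likelihood scores below are hypothetical numbers standing in for the boosted decision tree regressor's output), each superpixel is mapped to the sub-region class with the largest score:

```python
import numpy as np

def initial_segmentation(likelihoods):
    """Assign each superpixel to the sub-region class whose
    likelihood score (the regressor's output) is largest."""
    # likelihoods: array of shape (num_superpixels, num_classes)
    return np.argmax(likelihoods, axis=1)

# Toy example: 3 superpixels, 4 sub-region classes (class order is
# hypothetical, e.g. "background", "ground", "sky", ...).
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.6, 0.2, 0.1, 0.1],
                   [0.2, 0.2, 0.2, 0.4]])
labels = initial_segmentation(scores)  # one class index per superpixel
```

This per-superpixel hard assignment is only the initial result; the Markov random field optimization described next refines it by trading off these likelihoods against label smoothness between neighbours.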
Finally, a Markov random field is constructed to optimize the initial segmentation result into the final segmentation result, i.e. the sub-region segmentation result. The specific process is to minimize the following energy function J(c):

J(c) = Σ_{s_i ∈ SP} E_data(s_i, c_i) + λ Σ_{(s_i, s_j) ∈ A} E_smooth(c_i, c_j)   (1)

where SP is the set of superpixels, s_i is the i-th superpixel and c_i is its corresponding sub-region category label; s_j is the j-th superpixel and c_j is its corresponding sub-region category label; A is the set of adjacent superpixel pairs, and (s_i, s_j) is one such pair; λ is the weight of the smoothing term; E_data and E_smooth are the data term and the smoothing term, respectively. Their concrete form is as follows:

E_data(s_i, c_i) = −w_i σ(L(s_i, c_i))   (2)

E_smooth(c_i, c_j) = −log[(P(c_i|c_j) + P(c_j|c_i))/2] × δ[c_i ≠ c_j]   (3)

In the data term, L(s_i, c_i) is the likelihood that the i-th superpixel belongs to sub-region c_i, σ is the sigmoid function, and w_i is the weight of the i-th superpixel. In the smoothing term, P(c_i|c_j) is the conditional probability that a superpixel belongs to sub-region c_i given that its neighboring superpixel belongs to sub-region c_j, and vice versa; δ is an indicator function that is 1 when the condition c_i ≠ c_j holds and 0 otherwise.
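A minimal sketch of evaluating the energy J(c) of eqs. (1)-(3) for a given labeling (the likelihoods, weights and conditional probabilities here are hypothetical toy numbers; minimizing J over all labelings, e.g. by graph cuts, is assumed to happen elsewhere):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def energy(labels, likelihood, weights, cond_prob, adjacent, lam):
    """Energy J(c): data terms over superpixels plus a
    lambda-weighted smoothing term over adjacent superpixel pairs."""
    # Data term, eq. (2): E_data = -w_i * sigmoid(L(s_i, c_i))
    e_data = sum(-weights[i] * sigmoid(likelihood[i][labels[i]])
                 for i in range(len(labels)))
    # Smoothing term, eq. (3): penalize neighbours with different labels,
    # scaled by how unlikely the label pair is.
    e_smooth = 0.0
    for i, j in adjacent:
        ci, cj = labels[i], labels[j]
        if ci != cj:  # indicator delta[c_i != c_j]
            e_smooth += -math.log((cond_prob[ci][cj] + cond_prob[cj][ci]) / 2)
    return e_data + lam * e_smooth

# Toy example with hypothetical numbers: 2 adjacent superpixels, 2 classes.
likelihood = [[2.0, -1.0], [1.5, 0.5]]   # L(s_i, c)
weights = [1.0, 1.0]                      # w_i
cond_prob = [[0.9, 0.1], [0.1, 0.9]]      # P(c_i | c_j)
adjacent = [(0, 1)]
same = energy([0, 0], likelihood, weights, cond_prob, adjacent, lam=0.5)
diff = energy([0, 1], likelihood, weights, cond_prob, adjacent, lam=0.5)
```

With these numbers the labeling where both neighbours agree has lower energy than the disagreeing one, which is exactly the smoothing behaviour the MRF is meant to impose.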
Step 3: prediction of sub-region hidden variables: the sub-topics of the sub-regions in the segmentation result are modeled with hidden variables, a classifier is trained with the cutting-plane structured SVM method with N slack variables, the weight is continuously and iteratively updated until the value of the loss function is minimal, the optimized classifier parameters are obtained, and hidden variable labels are inferred with the optimized classifier parameters to obtain the hidden variables of the sub-regions. The specific process is as follows:
First, the sub-topics of the sub-regions in the segmentation result are modeled with hidden variables. The specific process is as follows: K values are used to represent the hidden variable label z corresponding to the sub-topic of a sub-region, with z ∈ {1, 2, …, K}. Hidden variables may represent sub-topics such as sky, road, left tree, right tree, etc.
Then, a classifier is trained with the cutting-plane structured SVM method with N slack variables, the weight is continuously and iteratively updated, and training stops when the value of the loss function is minimal, yielding the optimized classifier parameters. The specific process is as follows:
when training the SVM classifier, inputting the feature vector x of the extracted sub-region i And hidden variable label z i Supervised training is performed. The extracted characteristics of the sub-region include HOG, Gabor, LBP, RGB, and the like. Training a classifier by adopting a tangent plane structured SVM method with N relaxation variables, wherein a loss function is defined as follows:
Figure BDA0002500782530000071
Figure BDA0002500782530000072
ξ i ≥Δ(z i ,z)+F(z,x i ;ω)-F(z i ,x i ;ω)
where ξ is the relaxation variable and ω isWeight, λ is a penalty parameter, x i Is an L-dimensional feature vector, z i Is the hidden variable label of the ith sample, z is all labels contained in the hidden variable label set, Δ (z) i Z) is the distance value between the hidden variable label of the ith sample and a label in the set of hidden variable labels, and F is the objective function defined as follows:
Figure BDA0002500782530000081
wherein x is i Is an L-dimensional feature vector, ω is a weight, ω is a C × L-dimensional vector matrix, φ (x) i ,z i ) For the feature mapping function, M is the number of sub-region samples. Feature mapping function phi (x) i ,z i ) The form of (1) is as follows:
Figure BDA0002500782530000082
wherein the content of the first and second substances,
Figure BDA0002500782530000083
is phi (x) i ,z i ) A non-zero vector of value x i Are the same and are in phi (x) i ,z i ) The known weight ω and the feature mapping function φ (x) at the z-th position i ,z i ) The product of (c) is a constant. Omega * An optimization parameter to minimize the loss function. Continuously iterating by gradient descent method to obtain weight omega for minimizing loss function *
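The iterative weight update is described only at the level of "gradient descent on the loss". One such update can be sketched under simplifying assumptions not stated in the patent (a 0/1 distance Δ, a subgradient of the structured hinge term, and ω stored as a C × L matrix so that F(x, z; ω) = ω_z · x per eqs. (5)-(6)):

```python
import numpy as np

def loss_augmented_infer(x, z_true, omega):
    """argmax_z [ Delta(z_true, z) + F(z, x; omega) ] with 0/1 distance."""
    C = omega.shape[0]
    scores = omega.dot(x) + (np.arange(C) != z_true)  # +1 where z != z_true
    return int(np.argmax(scores))

def subgradient_step(x, z_true, omega, lr=1.0):
    """One subgradient-descent update of the hinge term
    max_z [Delta + F(z, x; omega)] - F(z_true, x; omega)."""
    z_hat = loss_augmented_infer(x, z_true, omega)
    omega = omega.copy()
    if z_hat != z_true:
        omega[z_hat] -= lr * x   # push the violating label's score down
        omega[z_true] += lr * x  # push the true label's score up
    return omega

# Toy run: 2 hidden labels, 2-dimensional features, weights start at zero.
omega = np.zeros((2, 2))
x, z_true = np.array([1.0, 0.0]), 0
omega = subgradient_step(x, z_true, omega)
prediction = int(np.argmax(omega.dot(x)))  # now predicts the true label
```

This is a sketch of one update direction only; the cutting-plane method the patent names additionally maintains a working set of the most violated constraints rather than updating on a single one.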
When inferring the hidden variable label, all hidden variable labels z are enumerated and the label z* that maximizes the objective function F(x, z; ω*) is taken as the inference result:

z* = argmax_{z ∈ Z} F(x, z; ω*)   (7)

where ω* is the optimization parameter that minimizes the loss function and Z is the set of hidden variable labels.
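The feature mapping of eq. (6) and the exhaustive inference of eq. (7) can be sketched as follows, with toy dimensions and a hypothetical learned weight ω* (not the trained model):

```python
import numpy as np

C, L = 3, 2  # number of hidden variable labels, feature dimension

def phi(x, z):
    """Eq. (6): a C*L vector whose z-th L-block is x, all else zero."""
    out = np.zeros(C * L)
    out[z * L:(z + 1) * L] = x
    return out

def F(x, z, omega):
    """Eq. (5): F(x, z; omega) = omega . phi(x, z)."""
    return float(omega.ravel().dot(phi(x, z)))

def infer(x, omega):
    """Eq. (7): enumerate every label z and keep the maximizer."""
    return max(range(C), key=lambda z: F(x, z, omega))

# Hypothetical optimized weight omega*, one L-dimensional row per label.
omega_star = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [-1.0, -1.0]])
x = np.array([0.2, 0.9])
z_star = infer(x, omega_star)
```

Because φ zeroes out every block except the z-th, the dot product ω · φ(x, z) reduces to ω_z · x, which is why exhaustive enumeration over the K labels is cheap.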
Step 4: predicting scene platform labels: a decision tree is constructed with the CART algorithm, and the scene platform label y corresponding to the combination of sub-region hidden variable labels is inferred. The specific process is as follows:
Labels of 14 scene platforms are defined according to scene layout and structure, and the hidden variable combinations related to each type of scene platform are collected as data. A decision tree is constructed with the CART algorithm: attribute selection is measured with the Gini index, the attribute with the smallest Gini index is chosen for splitting, and the Gini-index computation and splitting are invoked recursively on the two child nodes until no attribute can further subdivide the remaining data, completing the construction. A group of sub-region hidden variable labels is input into the decision tree, and the scene platform label y finally corresponding to that group of hidden variable labels is found through the decision tree.
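A minimal sketch of the Gini-index split selection at one node of the CART construction (the hidden-variable combinations and scene-platform ids below are hypothetical toy data, and the split test `attr == 1` is an assumption for these binary indicators):

```python
from collections import Counter

def gini(labels):
    """Gini index of a label set: 1 minus sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(samples, labels, attr):
    """Weighted Gini index of a binary split on attribute `attr`."""
    left = [y for x, y in zip(samples, labels) if x[attr] == 1]
    right = [y for x, y in zip(samples, labels) if x[attr] != 1]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

def best_attribute(samples, labels):
    """CART's choice: the attribute with the smallest weighted Gini index."""
    n_attrs = len(samples[0])
    return min(range(n_attrs), key=lambda a: split_gini(samples, labels, a))

# Toy data: each sample is a combination of hidden-variable indicators,
# each label a hypothetical scene-platform id.
samples = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ['A', 'A', 'B', 'B']
attr = best_attribute(samples, labels)
```

Here attribute 0 separates the two platforms perfectly (weighted Gini 0), so CART splits on it first; the same computation is then repeated recursively on each child node.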
Compared with existing methods based on probabilistic graphical model inference and prediction methods based on convolutional neural networks, the algorithm has high prediction accuracy, can effectively generate a holistic model of the traffic scene, requires little computation, and is simple and effective.
The framework of the algorithm is implemented based on the SSVM and a decision tree. For the experimental data, 1000 pictures of different scenes were selected as the data set, and the single road image data set was split into a training set and a test set in a 7:3 ratio. Fig. 1 shows 6 scene platforms, each with a real scene and the corresponding scene wireframe model. The road is divided into regions such as "background", "left wall", "right wall", "ground" and "sky" according to image content: a patch filled with four-pointed stars represents the "background", a patch filled with five-pointed stars the "left wall", a patch filled with diamonds the "right wall", a patch filled with circles the "ground", and a patch filled with straight lines the "sky".
Fig. 2 shows the road image segmentation principle, and the process can be divided into two steps of classification and optimization.
Fig. 3 shows a road scene platform inference process, and the whole process is divided into prediction of a sub-region hidden variable by a sub-region feature and prediction of a scene platform label by a combination of the hidden variables.
FIG. 4 shows an experimental comparison, when predicting the hidden variables of sub-regions, of classifiers trained on RGB, HOG, Gabor and LBP features with the cutting-plane structured SVM method with N slack variables against a single-slack-variable Structured SVM model, a LibSVM model, a Subgradient Structured SVM model, a Frank-Wolfe Block Structured SVM model and a Frank-Wolfe Batch Structured SVM model. It can be seen that the Structured SVM model with N slack variables used here has good accuracy under different features.
Fig. 5 shows a decision tree constructed according to the CART algorithm when predicting scene platform tags.
The algorithm of the invention is compared with convolutional neural networks, as shown in Table 1. The compared data come from 1000 pictures of different scenes, and the single road image data set is split into a training set and a test set in a 7:3 ratio. The algorithm of the invention is compared with three neural network models, AlexNet, VGG16 and ResNet_101; the quantitative evaluation criteria are accuracy, precision, recall and F1 score. The comparison shows that the algorithm has higher classification accuracy.
TABLE 1 quantitative evaluation of scene platform classification results
The method is based on the road scene pictures and videos of the simple road traffic scene environment, can effectively realize the prediction of the traffic scene platform, and is accurate in prediction effect and simple and effective.

Claims (7)

1. A road scene layout analysis method based on structured learning is characterized by comprising the following steps:
step 1: collecting traffic scene images, forming a traffic scene image data set, and carrying out labeling and preprocessing on the traffic scene image data set according to scene platform classification;
Step 2: performing sub-region segmentation on the labeled and preprocessed images based on supervised training and graph model optimization to obtain the sub-region segmentation result; the specific process is as follows: the labeled and preprocessed traffic scene image data set is split into a training set and a test set; superpixel segmentation is performed on all images in the training set and the features of each superpixel are extracted; a boosted decision tree regressor is then trained with the sub-region labels and the extracted superpixel features, and its output is taken as the initial segmentation result; finally, a Markov random field is constructed on the initial segmentation result and optimized to obtain the sub-region segmentation result;
the features of each superpixel include SIFT features, the RGB color mean and variance, GIST appearance features and location features;
the specific process of constructing a Markov random field to optimize the initial segmentation result into the sub-region segmentation result is to minimize the energy function J(c):

J(c) = Σ_{s_i ∈ SP} E_data(s_i, c_i) + λ Σ_{(s_i, s_j) ∈ A} E_smooth(c_i, c_j)   (1)

where SP is the set of superpixels, s_i is the i-th superpixel and c_i is its corresponding sub-region category label; s_j is the j-th superpixel and c_j is its corresponding sub-region category label; A is the set of adjacent superpixel pairs, and (s_i, s_j) is one such pair; E_data and E_smooth are the data term and the smoothing term, respectively, and λ is the weight of the smoothing term;
the data term E_data and the smoothing term E_smooth take the following concrete form:

E_data(s_i, c_i) = −w_i σ(L(s_i, c_i))   (2)

E_smooth(c_i, c_j) = −log[(P(c_i|c_j) + P(c_j|c_i))/2] × δ[c_i ≠ c_j]   (3)

in the data term, L(s_i, c_i) is the likelihood that the i-th superpixel belongs to sub-region c_i, σ is the sigmoid function, and w_i is the weight of the i-th superpixel; δ is an indicator function;
Step 3: modeling the sub-topics of the sub-regions in the segmentation result with hidden variables, training a classifier with the cutting-plane structured SVM method with N slack variables, continuously and iteratively updating the weight until the value of the loss function is minimal to obtain the optimized classifier parameters, and inferring hidden variable labels with the optimized classifier parameters to obtain the hidden variables of the sub-regions;
and step 4: constructing a decision tree with the CART algorithm and inferring the scene platform label corresponding to the combination of sub-region hidden-variable labels.
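The MRF minimization of step 2 can be sketched in a few lines of numpy. The patent does not name a solver for Eq. (1), so iterated conditional modes (ICM) is used here purely as an illustrative stand-in (graph cuts / α-expansion would be a common alternative); the function name `mrf_icm` and all array shapes are assumptions, not part of the claimed method.

```python
import numpy as np

def mrf_icm(unary, edges, cooccur, lam=1.0, iters=10):
    """Minimize J(c) = sum_i E_data + lam * sum_(i,j) E_smooth by ICM.

    unary:   (S, K) array; unary[i, k] = E_data for superpixel i, label k
             (e.g. -w_i * sigmoid(L(s_i, k)) from the regressor scores)
    edges:   list of (i, j) adjacent superpixel pairs (the set A)
    cooccur: (K, K) array of conditional probabilities P(c_i | c_j)
    """
    S, K = unary.shape
    # pairwise cost of Eq. (3); the indicator zeroes the diagonal (c_i == c_j)
    pair = -np.log((cooccur + cooccur.T) / 2.0 + 1e-12)
    np.fill_diagonal(pair, 0.0)
    labels = unary.argmin(axis=1)      # initial segmentation from the regressor
    nbrs = [[] for _ in range(S)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(iters):
        changed = False
        for i in range(S):             # greedily re-label one superpixel at a time
            cost = unary[i].copy()
            for j in nbrs[i]:
                cost += lam * pair[:, labels[j]]
            best = int(cost.argmin())
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```

ICM only finds a local minimum, but it makes the role of λ visible: with λ = 0 the result is just the per-superpixel regressor output, while larger λ smooths labels across adjacent superpixels.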
2. The road scene layout analysis method based on structured learning according to claim 1, wherein in step 1 the specific labeling and preprocessing process is as follows: labeling each pixel in the traffic-scene image data set with a sub-region label, cleaning the labeled data set by screening out samples with missing labels, and then resizing the images to 256 × 256.
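A minimal sketch of the claim-2 preprocessing, assuming images and label masks arrive as numpy arrays; the nearest-neighbour resize and the `preprocess`/`resize_nearest` names are illustrative choices, not specified by the patent.

```python
import numpy as np

def resize_nearest(arr, size=256):
    """Nearest-neighbour resize of an H x W (or H x W x C) array to size x size."""
    h, w = arr.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return arr[rows][:, cols]

def preprocess(image, label_mask, num_classes):
    """Screen out samples with missing/invalid sub-region labels (data
    cleaning), then reset both image and mask to 256 x 256."""
    if label_mask.min() < 0 or label_mask.max() >= num_classes:
        return None                      # annotation incomplete -> screened out
    # nearest-neighbour keeps label ids intact (no interpolated classes)
    return resize_nearest(image), resize_nearest(label_mask)
```

Nearest-neighbour interpolation matters for the mask: any interpolating resize would blend neighbouring sub-region ids into labels that do not exist.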
3. The road scene layout analysis method based on structured learning as claimed in claim 1, wherein P(c_i | c_j) in the smoothing term is the conditional probability that a superpixel belongs to sub-region c_i given that its neighboring superpixel belongs to sub-region c_j; the indicator function δ equals 1 when the condition c_i ≠ c_j holds and 0 when it does not.
4. The road scene layout analysis method based on structured learning according to claim 1, wherein the specific process of step 3 is as follows: first modeling the sub-topics of the sub-regions in the segmentation result with hidden variables, then training a classifier with a cutting-plane structural SVM method using N slack variables, iteratively updating the weights and stopping training when the loss function reaches its minimum to obtain the optimized classifier parameters.
5. The road scene layout analysis method based on structured learning according to claim 1, wherein the specific process of step 3 is as follows: when training the SVM classifier, inputting the extracted sub-region feature vector x_i and hidden-variable label z_i for supervised training; the extracted sub-region features include HOG, Gabor, LBP, and RGB features, and the loss function is defined as follows:
min_ω  (λ/2) ‖ω‖² + Σ_{i=1}^{M} ξ_i   (4)

subject to, for every sample i and every label z in the hidden-variable label set:

ξ_i ≥ Δ(z_i, z) + F(z, x_i; ω) − F(z_i, x_i; ω)
where ξ_i is the slack variable, ω is the weight vector, λ is the penalty parameter, x_i is an L-dimensional feature vector, z_i is the hidden-variable label of the i-th sample, z ranges over all labels contained in the hidden-variable label set, Δ(z_i, z) is the distance between the hidden-variable label of the i-th sample and a label in the hidden-variable label set, and F is the objective function defined as follows:
F(x_i, z_i; ω) = ω^T φ(x_i, z_i)   (5)
where x_i is an L-dimensional feature vector, ω is the weight vector, φ(x_i, z_i) is the feature-mapping function, and M is the number of sub-region samples; the feature-mapping function φ(x_i, z_i) has the following form:
φ(x_i, z_i) = (0, …, 0, x_i, 0, …, 0)   (6)

where x_i is the only non-zero segment of φ(x_i, z_i); the value of this segment equals x_i and it occupies the z_i-th block position of φ(x_i, z_i); ω* is the optimization parameter that minimizes the loss function.
6. The method as claimed in claim 5, wherein, when the hidden-variable label is inferred, the candidate labels z are enumerated exhaustively, and the label z* that maximizes the objective function F(x, z; ω*) is taken as the inference result:
z* = argmax_{z ∈ Z} F(x, z; ω*)   (7)
where ω* is the optimization parameter that minimizes the loss function and Z is the set of hidden-variable labels.
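Equations (5)-(7) amount to a block-structured linear model: since φ zeroes every block except the z-th, F(x, z; ω) reduces to the dot product of x with the z-th block of ω, and exhaustive inference is a linear multi-class decision rule. A minimal numpy sketch (function names `phi`, `F`, `infer` are illustrative):

```python
import numpy as np

def phi(x, z, num_labels):
    """Joint feature map of Eq. (6): a (num_labels * L) vector whose only
    non-zero segment is x, placed in the z-th block position."""
    L = x.shape[0]
    out = np.zeros(num_labels * L)
    out[z * L:(z + 1) * L] = x
    return out

def F(x, z, w, num_labels):
    """Objective F(x, z; w) = w^T phi(x, z) of Eq. (5)."""
    return float(w @ phi(x, z, num_labels))

def infer(x, w, num_labels):
    """Eq. (7): exhaustive search over the hidden-variable label set."""
    return int(np.argmax([F(x, z, w, num_labels) for z in range(num_labels)]))
```

Exhaustive enumeration is feasible here because the hidden-variable label set per sub-region is small; with |Z| labels and L-dimensional features, inference is O(|Z| · L).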
7. The road scene layout analysis method based on structured learning according to claim 1, wherein the specific process of step 4 is as follows: defining 14 scene platform labels according to the scene layout and structure, and collecting, for each type of scene platform, the data of its associated hidden-variable combinations z*; constructing a decision tree with the CART algorithm; when a group of hidden-variable labels is input into the decision tree, the scene platform label corresponding to those hidden-variable labels is found through the tree.
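The step-4 mapping can be sketched with scikit-learn, whose `DecisionTreeClassifier` implements an optimized CART variant. The training rows below are toy stand-in data: the real method derives 14 platform labels from scene layout and structure, and the `scene_platform` helper is an assumed name, not part of the patent.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data: each row is one combination of sub-region hidden-variable
# labels; each target is the scene-platform label it corresponds to.
Z = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 0],
              [1, 1, 0]])
platforms = np.array([0, 1, 2, 3])

# CART with the Gini criterion, as in scikit-learn's default configuration
tree = DecisionTreeClassifier(criterion="gini").fit(Z, platforms)

def scene_platform(hidden_labels):
    """Look up the scene-platform label for one hidden-variable combination."""
    return int(tree.predict(np.asarray(hidden_labels).reshape(1, -1))[0])
```

Because the hidden-variable combinations are discrete and low-dimensional, the tree effectively learns a compact lookup table while still generalizing to combinations not seen during training.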
CN202010431561.2A 2020-05-20 2020-05-20 Road scene layout analysis method based on structured learning Active CN111611919B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010431561.2A CN111611919B (en) 2020-05-20 2020-05-20 Road scene layout analysis method based on structured learning


Publications (2)

Publication Number Publication Date
CN111611919A CN111611919A (en) 2020-09-01
CN111611919B true CN111611919B (en) 2022-08-16

Family

ID=72200816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010431561.2A Active CN111611919B (en) 2020-05-20 2020-05-20 Road scene layout analysis method based on structured learning

Country Status (1)

Country Link
CN (1) CN111611919B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818775B (en) * 2021-01-20 2023-07-25 北京林业大学 Forest road rapid identification method and system based on regional boundary pixel exchange
CN115374498B (en) * 2022-10-24 2023-03-10 北京理工大学 Road scene reconstruction method and system considering road attribute characteristic parameters

Citations (2)

Publication number Priority date Publication date Assignee Title
CN105844292A (en) * 2016-03-18 2016-08-10 南京邮电大学 Image scene labeling method based on conditional random field and secondary dictionary study
CN110032952A (en) * 2019-03-26 2019-07-19 西安交通大学 A kind of road boundary point detecting method based on deep learning

Family Cites Families (12)

Publication number Priority date Publication date Assignee Title
CN103413307A (en) * 2013-08-02 2013-11-27 北京理工大学 Method for image co-segmentation based on hypergraph
CN103440501A (en) * 2013-09-01 2013-12-11 西安电子科技大学 Scene classification method based on nonparametric space judgment hidden Dirichlet model
US20160253816A1 (en) * 2014-02-27 2016-09-01 Andrew Niemczyk Method of Carrying Out Land Based Projects Using Aerial Imagery Programs
CN104809187B * 2015-04-20 2017-11-21 南京邮电大学 A kind of indoor scene semantic marking method based on RGB-D data
CN105389584B (en) * 2015-10-13 2018-07-10 西北工业大学 Streetscape semanteme marking method based on convolutional neural networks with semantic transfer conjunctive model
CN106446914A (en) * 2016-09-28 2017-02-22 天津工业大学 Road detection based on superpixels and convolution neural network
CN107292234B (en) * 2017-05-17 2020-06-30 南京邮电大学 Indoor scene layout estimation method based on information edge and multi-modal features
CN107292253B (en) * 2017-06-09 2019-10-18 西安交通大学 A kind of visible detection method in road driving region
CN107369158B (en) * 2017-06-13 2020-11-13 南京邮电大学 Indoor scene layout estimation and target area extraction method based on RGB-D image
CN109829449B (en) * 2019-03-08 2021-09-14 北京工业大学 RGB-D indoor scene labeling method based on super-pixel space-time context
CN109993082B (en) * 2019-03-20 2021-11-05 上海理工大学 Convolutional neural network road scene classification and road segmentation method
CN110084136A (en) * 2019-04-04 2019-08-02 北京工业大学 Context based on super-pixel CRF model optimizes indoor scene semanteme marking method


Non-Patent Citations (1)

Title
Road scene segmentation algorithm using multilayer graph model inference; Deng Yanzi et al.; Journal of Xi'an Jiaotong University; 2017-12-31 (No. 12); pp. 67-72 *


Similar Documents

Publication Publication Date Title
US11853903B2 (en) SGCNN: structural graph convolutional neural network
Ghadai et al. Learning localized features in 3D CAD models for manufacturability analysis of drilled holes
Kae et al. Augmenting CRFs with Boltzmann machine shape priors for image labeling
JP6725547B2 (en) Relevance score assignment for artificial neural networks
US20190236411A1 (en) Method and system for multi-scale cell image segmentation using multiple parallel convolutional neural networks
CN108830237B (en) Facial expression recognition method
US8331669B2 (en) Method and system for interactive segmentation using texture and intensity cues
CN109993102B (en) Similar face retrieval method, device and storage medium
CN105809672B (en) A kind of image multiple target collaboration dividing method constrained based on super-pixel and structuring
Yang et al. Multilayer graph cuts based unsupervised color–texture image segmentation using multivariate mixed student's t-distribution and regional credibility merging
Liu et al. Vehicle-related scene understanding using deep learning
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN112287839A (en) SSD infrared image pedestrian detection method based on transfer learning
JP2008217706A (en) Labeling device, labeling method and program
Zanjani et al. Cancer detection in histopathology whole-slide images using conditional random fields on deep embedded spaces
CN108595558B (en) Image annotation method based on data equalization strategy and multi-feature fusion
US11816185B1 (en) Multi-view image analysis using neural networks
CN111611919B (en) Road scene layout analysis method based on structured learning
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
Xie et al. Robust segmentation of nucleus in histopathology images via mask R-CNN
CN111325237A (en) Image identification method based on attention interaction mechanism
Mudumbi et al. An approach combined the faster RCNN and mobilenet for logo detection
Lima et al. Automatic design of deep neural networks applied to image segmentation problems
Bressan et al. Semantic segmentation with labeling uncertainty and class imbalance
CN116993760A (en) Gesture segmentation method, system, device and medium based on graph convolution and attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant