CN111598095B - Urban road scene semantic segmentation method based on deep learning - Google Patents

Urban road scene semantic segmentation method based on deep learning

Info

Publication number
CN111598095B
CN111598095B (application CN202010156966.XA)
Authority
CN
China
Prior art keywords
image
layer
residual error
network
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010156966.XA
Other languages
Chinese (zh)
Other versions
CN111598095A (en)
Inventor
宋秀兰
魏定杰
孙云坤
何德峰
余世明
卢为党
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202010156966.XA priority Critical patent/CN111598095B/en
Publication of CN111598095A publication Critical patent/CN111598095A/en
Application granted granted Critical
Publication of CN111598095B publication Critical patent/CN111598095B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/02Affine transformations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4084Scaling of whole images or parts thereof, e.g. expanding or contracting in the transform domain, e.g. fast Fourier transform [FFT] domain scaling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A deep-learning-based urban road scene semantic segmentation method comprises the following steps: 1) acquiring images ahead of the vehicle; 2) expanding the annotated images and the original input images: randomly cropping, stitching, or adding different types of noise to the images, transforming them with an image affine matrix, and finally restoring the original resolution through padding and cropping to obtain a data set; 3) training the network with the expanded images and the annotated images, where the residual U-net comprises a down-sampling part, a bridge part, an up-sampling part, and a classification part; 4) adjusting the acquisition interval T, feeding subsequently captured images to the trained deep learning model, outputting the predicted semantic segmentation images, and returning the gray levels in those images to the processor. The invention trains on a smaller data set, mitigates the problem of the gradient decreasing too quickly, and ensures that overfitting does not occur during training.

Description

Urban road scene semantic segmentation method based on deep learning
Technical Field
The invention belongs to the field of intelligent vehicles, and discloses an urban road scene semantic segmentation method based on deep learning.
Background
In recent years, with continuing urbanization, urban road conditions have become increasingly complex: pedestrians, traffic lights, zebra crossings, and different kinds of vehicles all affect the speed and obstacle-avoidance behaviour of an intelligent vehicle. Deep-learning-based semantic segmentation allows the vehicle to recognize its surroundings well and to react accordingly. Semantic segmentation assigns a predefined category to every pixel of an image, so that an intelligent vehicle can understand its surroundings in real time while driving, which reduces traffic accidents. Research on deep learning for urban road environments has therefore long been a focus of vehicle-intelligence research. Existing deep-learning semantic segmentation work studies neural networks such as SegNet, FCN, and ResNet. Although these networks need no conventional hand-designed recognition pipeline (they learn features automatically rather than relying on features designed by engineers, and a suitable model that outputs semantic segmentation results can be obtained by training on large numbers of images), the following problems arise during training: 1. too many weights cause overfitting; 2. the large number of network layers can make the gradient decrease rapidly; 3. the large data set required makes training time long. These problems make it difficult for a deep-learning network to output accurate semantic segmentation results, so an intelligent vehicle struggles to obtain real-time feedback about its surroundings under complex road conditions, which creates a safety hazard. It is therefore valuable to design a network that uses a smaller data set while preventing the gradient from decreasing too quickly and ensuring that overfitting does not occur during training.
Disclosure of Invention
To overcome the shortcomings of the prior art and to let an intelligent vehicle better recognize its surroundings in complex environments such as urban roads, the invention provides a deep-learning-based method for semantic segmentation of urban road scenes.
The technical scheme adopted by the invention for solving the technical problem is as follows:
a deep learning-based urban road scene semantic segmentation method comprises the following steps:
1) Image acquisition at the front of the vehicle: urban road images are collected at a regular time interval T, and each image with resolution h × w is passed to an image detection module to obtain valid images; the valid images are then fed to an annotation module, which uses the publicly available annotation software Labelme 3.11.2. Its scene-segmentation annotation function is used to outline vehicles, pedestrians, bicycles, traffic lights, and neon lights in the image and label them as different categories. The generated annotation image encodes the different object categories as different gray levels, from which a gray-level list and the number of object categories K are obtained;
2) Expanding the annotated images and the original input images: the images are randomly cropped, stitched, or corrupted with different types of noise, and then transformed with an image affine matrix; the affine transformation is given by equation (1):
$$\begin{pmatrix} a' \\ b' \\ 1 \end{pmatrix} = \begin{pmatrix} c_1 & c_2 & s_x \\ c_3 & c_4 & s_y \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} a \\ b \\ 1 \end{pmatrix} \qquad (1)$$

In the affine matrix, s_x is the horizontal translation and s_y the vertical translation, c_1 scales the image abscissa, c_4 scales the ordinate, and c_2 and c_3 control the shear transformation; (a, b) is the original pixel position and (a', b') the transformed position. Finally, padding and cropping restore the original image resolution, yielding the data set;
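To make the augmentation step concrete, here is a minimal sketch of one random affine transform applied jointly to an image and its annotation map; NumPy, OpenCV's warpAffine, the function name random_affine, and the parameter ranges are all illustrative assumptions, not part of the patent.

```python
import numpy as np
import cv2  # assumed here for the warp; any affine-warp routine would do

def random_affine(image, label):
    """Apply one random affine transform of equation (1) to an image and its label map."""
    h, w = image.shape[:2]
    c1, c4 = np.random.uniform(0.9, 1.1, size=2)   # scaling of abscissa / ordinate (assumed range)
    c2, c3 = np.random.uniform(-0.1, 0.1, size=2)  # shear terms (assumed range)
    sx, sy = np.random.uniform(-0.05, 0.05, size=2) * (w, h)  # translations in pixels
    M = np.array([[c1, c2, sx],
                  [c3, c4, sy]], dtype=np.float32)  # top two rows of the affine matrix in (1)
    warped_img = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_LINEAR)
    # the label map must use nearest-neighbour interpolation so gray levels stay valid class codes
    warped_lbl = cv2.warpAffine(label, M, (w, h), flags=cv2.INTER_NEAREST)
    return warped_img, warped_lbl
```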
3) Network training with the expanded images and the annotated images: the residual U-net network consists of four parts, namely a down-sampling part, a bridge part, an up-sampling part, and a classification part;
The training parameters are the image length h, image width w, loss function value L, number of network iterations epochs, batch size batch_size, and validation-set proportion rate. The data set is split into a training set and a validation set according to rate. During training, batches of size batch_size are fed into the residual U-net network; L is computed from the predicted images output by the network and the actual annotation images, and back-propagation adjusts the parameters in the network so that L tends towards a minimum. The network is trained repeatedly for the given number of iterations, with the network parameters tuned on the validation set during the iterations, finally yielding an optimal network model.
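A minimal sketch of this training procedure follows, written with PyTorch purely for illustration (the patent does not name a framework); the function name, optimizer choice, learning rate, and default hyperparameter values are assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, random_split

def train(model, dataset, epochs=30, batch_size=4, rate=0.1, lr=1e-3,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    n_val = int(len(dataset) * rate)                      # validation-set proportion `rate`
    train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)
    criterion = nn.CrossEntropyLoss()                     # the per-pixel loss L of equation (3)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:               # batches of size batch_size
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)       # L from prediction vs. annotation image
            optimizer.zero_grad()
            loss.backward()                               # back-propagate and adjust parameters
            optimizer.step()
        model.eval()
        with torch.no_grad():                             # validation pass used to tune the network
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader) / max(len(val_loader), 1)
        print(f"epoch {epoch + 1}/{epochs}  val_loss={val_loss:.4f}")
    return model
```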
4) Road condition classification: the acquisition interval T of the acquisition module is adjusted, subsequently captured images are fed into the trained deep learning model, the predicted semantic segmentation images are output, and the gray levels in those images are returned to the processor, so the vehicle can identify which categories of objects lie ahead and react accordingly.
Further, in step 3), the down-sampling part is divided into four stages, each consisting of a residual network (the first- to fourth-stage residual networks). The layers of the first-stage residual network are connected in the order: convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; the input image and the processed feature image are fused in the fusion layer through an identity connection. The second- to fourth-stage residual networks share the same form, with the connection order: batch normalization layer, softmax function layer, convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; the input feature image and the processed feature image are likewise fused in the fusion layer through an identity connection. The convolution layers use 3 × 3 kernels, and the two convolution layers of each stage have 64, 128, 256, and 512 channels respectively. The stages are connected by 2 × 2 pooling layers with stride 2, and the channel dimension of each pooling layer matches that of the convolution layers of its stage.
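Read literally, one encoder residual stage could be sketched roughly as follows (PyTorch, for illustration only). The patent lists a "softmax function layer" inside the block; it is treated here as an elementwise activation and replaced by Softplus, matching the bridge description, and a 1 × 1 convolution is used on the shortcut whenever the channel count changes, since a pure identity cannot be added across differing channel counts. Both readings, and the 3-channel RGB input, are assumptions.

```python
import torch
from torch import nn

class EncoderResidualBlock(nn.Module):
    """Second- to fourth-stage form: BN -> activation -> conv -> BN -> activation -> conv, fused with the input."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.Softplus(),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.Softplus(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        # identity connection; a 1x1 conv is assumed whenever the channel count changes
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1))

    def forward(self, x):
        return self.body(x) + self.shortcut(x)  # fusion layer: add input and processed features

# four encoder stages with 64, 128, 256 and 512 channels (input assumed to be a 3-channel RGB image),
# each stage followed by 2x2 max pooling with stride 2
encoder = nn.ModuleList([EncoderResidualBlock(c_in, c_out)
                         for c_in, c_out in [(3, 64), (64, 128), (128, 256), (256, 512)]])
pool = nn.MaxPool2d(kernel_size=2, stride=2)
```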
Further, in step 3), the bridge part prepares the network's high- and low-level feature information for splicing. It consists of two batch normalization layers, two softplus function layers, and two 3 × 3 convolution layers with 1024 channels; it has no fusion layer, so no identity connection is needed, and the connection order of its layers is otherwise the same as in the second-stage residual network. Finally, an up-sampling layer resizes the feature image to the size required for splicing.
Furthermore, in step 3), the up-sampling part also consists of four stages of residual networks (the fifth- to eighth-stage residual networks). Their form and the connection order of the layers are essentially the same as in the down-sampling stages, except that the identity connections of the fifth- to seventh-stage residual networks are replaced by 1 × 1 convolution layers, while the eighth-stage residual network is unchanged. The convolution layers of the up-sampling residual networks have 512, 256, 128, and 64 channels respectively. The stages are connected by up-sampling layers and splicing layers; each splicing layer concatenates high- and low-level features of matching size, as follows (see the sketch after this list):
(3.1) the feature image output by the fourth-stage residual network, after its pooling layer, is spliced with the feature image output by the bridge part;
(3.2) the feature image output by the third-stage residual network, after its pooling layer, is spliced with the feature image output by the fifth-stage residual network after its up-sampling layer;
(3.3) the feature image output by the second-stage residual network, after its pooling layer, is spliced with the feature image output by the sixth-stage residual network after its up-sampling layer;
(3.4) the feature image output by the first-stage residual network, after its pooling layer, is spliced with the feature image output by the seventh-stage residual network after its up-sampling layer;
the dimension of the spliced feature images changes, the dimension of the feature images is adjusted by using the 1 × 1 convolutional layers instead of the identity connection, the dimension of the four 1 × 1 convolutional layers is respectively 512, 256, 128 and 64, and finally the feature images are fused in the fusion layers.
In step 3), the classification part consists of a 1 × 1 convolution layer and a softmax layer. Since urban road image segmentation involves six classes (vehicle, pedestrian, bicycle, traffic light, neon light, and background), the 1 × 1 convolution layer produces a 6-channel feature image; because its raw pixel values are not probabilities, the softmax layer converts the output into a probability distribution. The softmax function is given by equation (2):
$$g_k(x) = \frac{\exp\left(d_k(x)\right)}{\sum_{k'=1}^{K}\exp\left(d_{k'}(x)\right)} \qquad (2)$$

where d_k(x) is the value of pixel x on channel k, K is the number of object classes, and g_k(x) ∈ [0, 1] is the probability that pixel x belongs to class k; the channel with the highest probability gives the predicted class;
the deviation of the prediction from the actual is then evaluated using a cross-entropy loss function, see equation (3):
$$L = -\sum_{x}\log g_{t(x)}(x) \qquad (3)$$

where t(x) is the true class of pixel x, so g_{t(x)}(x) is the predicted probability that pixel x belongs to the class recorded in the annotation image; the smaller the loss, the closer the predicted image is to the annotation image. Back-propagating the loss continuously optimizes the internal parameters of the neural network, so the loss keeps decreasing towards an ideal value;
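Equations (2) and (3) amount to the following per-pixel computation; this is a small NumPy sketch with illustrative function names and shapes, not the patented implementation.

```python
import numpy as np

def pixel_softmax(d):
    """Equation (2): d has shape (K, H, W); returns per-pixel class probabilities g_k(x)."""
    e = np.exp(d - d.max(axis=0, keepdims=True))   # subtract the max for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def cross_entropy(d, t):
    """Equation (3): t has shape (H, W) with entries in {0,...,K-1}, the true class t(x) of each pixel."""
    g = pixel_softmax(d)
    h_idx, w_idx = np.indices(t.shape)
    return -np.log(g[t, h_idx, w_idx] + 1e-12).sum()   # smaller L means prediction closer to labels
```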
Finally, the number of iterations epochs, the batch size batch_size, and the validation-set proportion rate are fixed before the model is trained. The acquired image set is divided into a training set and a validation set according to the validation-set proportion; the training images are fed into the network in batches of batch_size until all of them have been used, which completes one iteration; repeating this for the chosen number of iterations finally yields the optimal neural network model.
The main execution parts of the invention are the acquisition and processing of images, the training of neural networks and the recognition of the images by using a recognition model. The implementation process of the method can be divided into the following three stages:
First, image data acquisition: the time interval T of the acquisition module is set, images are collected on different urban road sections, and the images are passed through the detection module to obtain a set of valid images. The images are then annotated with the annotation software Labelme 3.11.2: the instance scene-segmentation annotation function is used to outline the target objects in each image and label their categories, and the software generates an annotation image in which different objects are marked with different gray levels. From those gray levels a gray-level list and the number of object categories K are obtained. Finally, the images and annotation images are expanded by the data-expansion module to obtain the data set.
Second, network parameters and training: the parameters are the image length h, image width w, loss function value L, number of iterations epochs, batch size batch_size, and validation-set proportion rate. The data set is split into a training set and a validation set according to rate; during training, batches of size batch_size are fed into the residual U-net network, L is computed from the predicted images output by the network and the actual annotation images, back-propagation adjusts the parameters in the network so that L tends towards a minimum, training is repeated for the chosen number of iterations, and the network parameters are tuned on the validation set during the iterations, finally yielding an optimal network model.
Third, road condition classification: the acquisition interval T is adjusted, subsequently captured images are fed into the trained deep learning model, the predicted semantic segmentation images are output, and the gray levels in those images are returned to the processor, so the vehicle can identify which categories of objects lie ahead and react accordingly.
The invention has the following beneficial effects: 1. the network design jointly considers the problems a deep-learning network may face during training, namely the gradient decreasing too quickly, an overly large required data set, and overfitting; batch normalization, residual networks, and splicing of high- and low-level information are therefore added to the network, which effectively reduces gradient decay and the loss of image information and improves the accuracy of semantic segmentation; 2. the deep-learning road-condition detection system is simple in design, easy to understand, needs only a small data set, runs in real time, and offers strong practicality and adaptability.
Drawings
Fig. 1 is a flow of implementation of an urban road scene semantic segmentation system for deep learning.
FIG. 2 is an overall model design of a residual U-net network used in a deep learning urban road scene semantic segmentation system.
FIG. 3 is a network form of second-level to fifth-level residual error networks in a residual error U-net network used by the deep learning urban road scene semantic segmentation system.
FIG. 4 is a diagram showing the semantic segmentation effect of deep learning urban road scenes.
Detailed Description
The method of the present invention is described in further detail below with reference to the accompanying drawings.
Referring to fig. 1 to 4, a deep learning-based urban road scene semantic segmentation method includes the following steps:
1) Image acquisition at the front of the vehicle: urban road images are collected at a regular time interval T, and each image with resolution h × w is passed to an image detection module to obtain valid images; the valid images are then fed to an annotation module, which uses the publicly available annotation software Labelme 3.11.2. Its scene-segmentation annotation function is used to outline vehicles, pedestrians, bicycles, traffic lights, neon lights, and similar objects in the image and label them as different categories. The generated annotation image encodes the different object categories as different gray levels, from which a gray-level list and the number of object categories K are obtained;
2) Expanding the annotated images and the original input images: the images are randomly cropped, stitched, or corrupted with different types of noise, and then transformed with an image affine matrix; the affine transformation is given by equation (1):
$$\begin{pmatrix} a' \\ b' \\ 1 \end{pmatrix} = \begin{pmatrix} c_1 & c_2 & s_x \\ c_3 & c_4 & s_y \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} a \\ b \\ 1 \end{pmatrix} \qquad (1)$$

In the affine matrix, s_x is the horizontal translation and s_y the vertical translation, c_1 scales the image abscissa, c_4 scales the ordinate, and c_2 and c_3 control the shear transformation; (a, b) is the original pixel position and (a', b') the transformed position. Finally, padding and cropping restore the original image resolution, yielding the data set;
3) Network training with the expanded images and the annotated images: the residual U-net network consists of four parts, namely a down-sampling part, a bridge part, an up-sampling part, and a classification part;
The training parameters are the image length h, image width w, loss function value L, number of network iterations epochs, batch size batch_size, and validation-set proportion rate. The data set is split into a training set and a validation set according to rate. During training, batches of size batch_size are fed into the residual U-net network; L is computed from the predicted images output by the network and the actual annotation images, and back-propagation adjusts the parameters in the network so that L tends towards a minimum. The network is trained repeatedly for the given number of iterations, with the network parameters tuned on the validation set during the iterations, finally yielding an optimal network model.
4) Road condition classification: the acquisition interval T of the acquisition module is adjusted, subsequently captured images are fed into the trained deep learning model, the predicted semantic segmentation images are output, and the gray levels in those images are returned to the processor, so the vehicle can identify which categories of objects lie ahead and react accordingly.
Further, in step 3), the down-sampling part is divided into four stages, each consisting of a residual network (the first- to fourth-stage residual networks). The layers of the first-stage residual network are connected in the order: convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; the input image and the processed feature image are fused in the fusion layer through an identity connection. The second- to fourth-stage residual networks share the same form, with the connection order: batch normalization layer, softmax function layer, convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer; the input feature image and the processed feature image are likewise fused in the fusion layer through an identity connection. The convolution layers use 3 × 3 kernels, and the two convolution layers of each stage have 64, 128, 256, and 512 channels respectively. The stages are connected by 2 × 2 pooling layers with stride 2, and the channel dimension of each pooling layer matches that of the convolution layers of its stage.
The bridge part prepares the network's high- and low-level feature information for splicing. It consists of two batch normalization layers, two softplus function layers, and two 3 × 3 convolution layers with 1024 channels; it has no fusion layer, so no identity connection is needed, and the connection order of its layers is otherwise the same as in the second-stage residual network. Finally, an up-sampling layer resizes the feature image to the size required for splicing.
The up-sampling part also consists of four stages of residual networks (the fifth- to eighth-stage residual networks). Their form and the connection order of the layers are essentially the same as in the down-sampling stages, except that the identity connections of the fifth- to seventh-stage residual networks are replaced by 1 × 1 convolution layers, while the eighth-stage residual network is unchanged. The convolution layers of the up-sampling residual networks have 512, 256, 128, and 64 channels respectively. The stages are connected by up-sampling layers and splicing layers; each splicing layer concatenates high- and low-level features of matching size, as follows:
(3.1) the feature image output by the fourth-stage residual network, after its pooling layer, is spliced with the feature image output by the bridge part;
(3.2) the feature image output by the third-stage residual network, after its pooling layer, is spliced with the feature image output by the fifth-stage residual network after its up-sampling layer;
(3.3) the feature image output by the second-stage residual network, after its pooling layer, is spliced with the feature image output by the sixth-stage residual network after its up-sampling layer;
(3.4) the feature image output by the first-stage residual network, after its pooling layer, is spliced with the feature image output by the seventh-stage residual network after its up-sampling layer.
Because splicing changes the channel dimension of the feature images, the 1 × 1 convolution layers that replace the identity connections adjust it; the four 1 × 1 convolution layers have 512, 256, 128, and 64 channels respectively, and the feature images are finally fused in the fusion layers.
The classification part consists of a 1 × 1 convolution layer and a softmax layer. Since urban road image segmentation involves six classes (vehicle, pedestrian, bicycle, traffic light, neon light, and background), the 1 × 1 convolution layer produces a 6-channel feature image; because its raw pixel values are not probabilities, the softmax layer converts the output into a probability distribution. The softmax function is given by equation (2):
$$g_k(x) = \frac{\exp\left(d_k(x)\right)}{\sum_{k'=1}^{K}\exp\left(d_{k'}(x)\right)} \qquad (2)$$

where d_k(x) is the value of pixel x on channel k, K is the number of object classes, and g_k(x) ∈ [0, 1] is the probability that pixel x belongs to class k; the channel with the highest probability gives the predicted class.
The deviation of the predicted result from the actual is then evaluated using a cross-entropy loss function, see equation (3):
$$L = -\sum_{x}\log g_{t(x)}(x) \qquad (3)$$

where t(x) is the true class of pixel x, so g_{t(x)}(x) is the predicted probability that pixel x belongs to the class recorded in the annotation image; the smaller the loss, the closer the predicted image is to the annotation image. Back-propagating the loss continuously optimizes the internal parameters of the neural network, so the loss keeps decreasing towards an ideal value.
Finally, the number of iterations epochs, the batch size batch_size, and the validation-set proportion rate are fixed before the model is trained. The acquired image set is divided into a training set and a validation set according to the validation-set proportion; the training images are fed into the network in batches of batch_size until all of them have been used, which completes one iteration; repeating this for the chosen number of iterations finally yields the optimal neural network model.
The main execution parts of the embodiment are image acquisition and processing, neural network training and image recognition by using a recognition model. The implementation process of the method can be divided into the following three stages:
First, image data acquisition: the acquisition interval is set to T = 4 s, images are collected on different urban road sections, and the images are passed through the detection module to obtain 1000 valid images. The images are then annotated with the annotation software Labelme 3.11.2: the instance scene-segmentation annotation function outlines the various targets in each image and labels their categories, and the software generates annotation images in which different target categories are marked with different gray levels. The gray-level list list = [0, 20, 80, 140, 180, 230] gives the pixel values of the different objects, namely background, neon light, traffic light, vehicle, pedestrian, and bicycle, so the total number of categories is K = 6. Finally, the images and annotation images are expanded by the data-expansion module to obtain the data set.
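The gray-level list produced by the annotation step maps directly to class indices for training; a minimal sketch of that conversion follows, using the gray values of this embodiment (the function name is an illustrative assumption).

```python
import numpy as np

GRAY_LIST = [0, 20, 80, 140, 180, 230]  # background, neon light, traffic light, vehicle, pedestrian, bicycle

def gray_to_class(label_img, gray_list=GRAY_LIST):
    """Convert a Labelme gray-level annotation image (H, W) into class indices 0..K-1."""
    class_map = np.zeros(label_img.shape, dtype=np.int64)
    for k, gray in enumerate(gray_list):
        class_map[label_img == gray] = k
    return class_map
```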
Second, the network parameters are entered on the parameter-setting interface: image length h = 224, image width w = 224, loss function L, number of iterations epochs = 30, batch size batch_size = 4, and validation-set proportion rate = 0.1. The 3000-image data set is divided into 2700 training images and 300 validation images. During training, 4 images at a time (batch_size) are fed into the residual U-net network until the whole training set has been used; the loss L is computed from the predicted images output by the network and the actual annotation images, back-propagation adjusts the parameters in the network so that L tends towards a minimum, and this completes one iteration. The network is trained for 30 iterations, with the network parameters tuned on the validation set during the iterations, finally yielding a suitable network model.
Third, the acquisition interval is changed to T = 0.2 s, subsequently captured images are fed into the trained deep learning model, the real-time semantic segmentation result is output, and the gray levels in the images are returned to the processor, so the vehicle can identify which categories of objects lie ahead and react accordingly.
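The deployment stage described above could be sketched as follows; capture_fn and send_fn stand in for the camera-acquisition and processor-feedback interfaces, which the patent does not specify, and PyTorch is again an assumption.

```python
import time
import torch

def run_inference_loop(model, capture_fn, send_fn, gray_list, interval_s=0.2,
                       device="cuda" if torch.cuda.is_available() else "cpu"):
    """Feed newly captured images to the trained model and return gray-level maps to the processor."""
    model.eval().to(device)
    gray = torch.tensor(gray_list, dtype=torch.uint8, device=device)
    while True:
        image = capture_fn()                               # (3, H, W) float tensor from the camera
        with torch.no_grad():
            logits = model(image.unsqueeze(0).to(device))  # (1, K, H, W) class scores
            classes = logits.argmax(dim=1)[0]              # predicted class index per pixel
        send_fn(gray[classes].cpu().numpy())               # map classes back to the annotation gray levels
        time.sleep(interval_s)                             # acquisition interval T = 0.2 s
```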
The actual system design form, the network establishment process and the results are shown in fig. 1, fig. 2, fig. 3 and fig. 4, and fig. 1 is a flow of implementation of the deep learning urban road scene semantic segmentation system. FIG. 2 is an overall model design of a residual U-net network used in a deep learning urban road scene semantic segmentation system. FIG. 3 is a network form of second-level to fifth-level residual error networks in a residual error U-net network used by the deep learning urban road scene semantic segmentation system. FIG. 4 is a diagram showing the semantic segmentation effect of deep learning urban road scenes.
The above illustrates the excellent deep learning urban road scene semantic segmentation effect exhibited by one embodiment of the present invention. It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that any modifications made within the spirit and scope of the appended claims are intended to be within the scope of the invention.

Claims (1)

1. A deep learning-based urban road scene semantic segmentation method is characterized by comprising the following steps:
1) Image acquisition at the front of the vehicle: urban road images are collected at a regular time interval T, and image detection is performed on the images with resolution h × w to obtain valid images; the valid images are then annotated with the publicly available annotation software Labelme 3.11.2, whose scene-segmentation annotation function is used to outline and label vehicles, pedestrians, bicycles, traffic lights, and neon lights in the image as different categories; the generated annotation image encodes the different object categories as different gray levels, from which a gray-level list and the number of object categories K are obtained;
2) Expanding the annotated images and the original input images: the images are randomly cropped, stitched, or corrupted with different types of noise, and then transformed with an image affine matrix; the affine transformation is given by equation (1):
$$\begin{pmatrix} a' \\ b' \\ 1 \end{pmatrix} = \begin{pmatrix} c_1 & c_2 & s_x \\ c_3 & c_4 & s_y \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} a \\ b \\ 1 \end{pmatrix} \qquad (1)$$

in the affine matrix, s_x is the horizontal translation and s_y the vertical translation, c_1 scales the image abscissa, c_4 scales the ordinate, and c_2 and c_3 control the shear transformation; (a, b) is the original pixel position and (a', b') the transformed position; finally, padding and cropping restore the original image resolution, yielding the data set;
3) Network training with the expanded images and the annotated images: the residual U-net network consists of four parts, namely a down-sampling part, a bridge part, an up-sampling part, and a classification part;
the method comprises the steps of image length h, image width w, loss function size L, network iteration times epochs, batch processing of batch _ size and verification set proportion rate, dividing a data set into a training set and a verification set through the rate, inputting batch _ size into a residual U-net network for training according to the batch _ size during training, calculating L through predicted images output by the network and actual label images, reversely propagating and adjusting parameters in the network to enable the output of the L to tend to be minimized, repeatedly training the network to the iteration times, adjusting network parameters through the verification set in the iteration process, and finally obtaining an optimal network model;
4) Road condition classification: the acquisition time interval T is adjusted, subsequently captured images are fed into the trained deep learning model, the predicted semantic segmentation images are output, the gray levels in those images are returned to the processor, and the vehicle identifies the categories of objects present ahead;
in step 3), the down-sampling part is divided into four stages, each consisting of a residual network, namely the first- to fourth-stage residual networks; the layers of the first-stage residual network are connected in the order: convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer, and the input image and the processed feature image are fused in the fusion layer through an identity connection; the second- to fourth-stage residual networks share the same form, with the connection order: batch normalization layer, softmax function layer, convolution layer, batch normalization layer, softmax function layer, convolution layer, fusion layer, and the input feature image and the processed feature image are fused in the fusion layer through an identity connection; the convolution layers use 3 × 3 kernels, the two convolution layers of each stage have 64, 128, 256, and 512 channels respectively, the stages are connected by 2 × 2 pooling layers with stride 2, and the channel dimension of each pooling layer matches that of the convolution layers of its stage;
in step 3), the bridge part prepares the network's high- and low-level feature information for splicing; it comprises two batch normalization layers, two softplus function layers, and two 3 × 3 convolution layers with 1024 channels, has no fusion layer, its layers are connected in the same order as in the second-stage residual network, and finally an up-sampling layer resizes the feature image to the size required for splicing;
in step 3), the up-sampling part also consists of four stages of residual networks, namely the fifth- to eighth-stage residual networks; their form and the connection order of the layers are essentially the same as in the down-sampling stages, except that the identity connections of the fifth- to seventh-stage residual networks are replaced by 1 × 1 convolution layers, the eighth-stage residual network being unchanged; the convolution layers of the up-sampling residual networks have 512, 256, 128, and 64 channels respectively; the stages are connected by up-sampling layers and splicing layers, and each splicing layer concatenates high- and low-level features of matching size, as follows:
(3.1) the feature image output by the fourth-stage residual network, after the pooling layer, is spliced with the feature image output by the bridge part;
(3.2) the feature image output by the third-stage residual network, after the pooling layer, is spliced with the feature image output by the fifth-stage residual network after the up-sampling layer;
(3.3) the feature image output by the second-stage residual network, after the pooling layer, is spliced with the feature image output by the sixth-stage residual network after the up-sampling layer;
(3.4) the feature image output by the first-stage residual network, after the pooling layer, is spliced with the feature image output by the seventh-stage residual network after the up-sampling layer;
because splicing changes the channel dimension of the feature images, the 1 × 1 convolution layers that replace the identity connections adjust it; the four 1 × 1 convolution layers have 512, 256, 128, and 64 channels respectively, and finally the feature images are fused in the fusion layers;
in step 3), the classification part consists of a 1 × 1 convolution layer and a softmax layer; since urban road image segmentation involves six classes (vehicle, pedestrian, bicycle, traffic light, neon light, and background), the 1 × 1 convolution layer produces a 6-channel feature image, but its raw pixel values are not probabilities, so the softmax layer converts the output into a probability distribution; the softmax function is given by equation (2):
$$g_k(x) = \frac{\exp\left(d_k(x)\right)}{\sum_{k'=1}^{K}\exp\left(d_{k'}(x)\right)} \qquad (2)$$

where d_k(x) is the value of pixel x on channel k, K is the number of object classes, and g_k(x) ∈ [0, 1] is the probability that pixel x belongs to class k; the channel with the highest probability gives the predicted class;
the deviation of the prediction from the actual is then evaluated using a cross-entropy loss function, see equation (3):
$$L = -\sum_{x}\log g_{t(x)}(x) \qquad (3)$$

where t(x) is the true class of pixel x, so g_{t(x)}(x) is the predicted probability that pixel x belongs to the class recorded in the annotation image; the smaller the loss, the closer the predicted image is to the annotation image, and back-propagating the loss continuously optimizes the internal parameters of the neural network, so the loss keeps decreasing towards an ideal value.
CN202010156966.XA 2020-03-09 2020-03-09 Urban road scene semantic segmentation method based on deep learning Active CN111598095B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010156966.XA CN111598095B (en) 2020-03-09 2020-03-09 Urban road scene semantic segmentation method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010156966.XA CN111598095B (en) 2020-03-09 2020-03-09 Urban road scene semantic segmentation method based on deep learning

Publications (2)

Publication Number Publication Date
CN111598095A CN111598095A (en) 2020-08-28
CN111598095B true CN111598095B (en) 2023-04-07

Family

ID=72181296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010156966.XA Active CN111598095B (en) 2020-03-09 2020-03-09 Urban road scene semantic segmentation method based on deep learning

Country Status (1)

Country Link
CN (1) CN111598095B (en)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018176000A1 (en) 2017-03-23 2018-09-27 DeepScale, Inc. Data synthesis for autonomous control systems
US11157441B2 (en) 2017-07-24 2021-10-26 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US10671349B2 (en) 2017-07-24 2020-06-02 Tesla, Inc. Accelerated mathematical engine
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11215999B2 (en) 2018-06-20 2022-01-04 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11361457B2 (en) 2018-07-20 2022-06-14 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
CN113039556B (en) 2018-10-11 2022-10-21 特斯拉公司 System and method for training machine models using augmented data
US11196678B2 (en) 2018-10-25 2021-12-07 Tesla, Inc. QOS manager for system on a chip communications
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US11150664B2 (en) 2019-02-01 2021-10-19 Tesla, Inc. Predicting three-dimensional features for autonomous driving
US10997461B2 (en) 2019-02-01 2021-05-04 Tesla, Inc. Generating ground truth for machine learning from time series elements
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US10956755B2 (en) 2019-02-19 2021-03-23 Tesla, Inc. Estimating object properties using visual image data
CN112070049B (en) * 2020-09-16 2022-08-09 福州大学 Semantic segmentation method under automatic driving scene based on BiSeNet
CN112348839B (en) * 2020-10-27 2024-03-15 重庆大学 Image segmentation method and system based on deep learning
CN112329780B (en) * 2020-11-04 2023-10-27 杭州师范大学 Depth image semantic segmentation method based on deep learning
CN112767361B (en) * 2021-01-22 2024-04-09 重庆邮电大学 Reflected light ferrograph image segmentation method based on lightweight residual U-net
CN112819688A (en) * 2021-02-01 2021-05-18 西安研硕信息技术有限公司 Conversion method and system for converting SAR (synthetic aperture radar) image into optical image
CN113076837A (en) * 2021-03-25 2021-07-06 高新兴科技集团股份有限公司 Convolutional neural network training method based on network image
CN113034598B (en) * 2021-04-13 2023-08-22 中国计量大学 Unmanned aerial vehicle power line inspection method based on deep learning
CN112949617B (en) * 2021-05-14 2021-08-06 江西农业大学 Rural road type identification method, system, terminal equipment and readable storage medium
CN113468963A (en) * 2021-05-31 2021-10-01 山东信通电子股份有限公司 Road raise dust identification method and equipment
CN113269276A (en) * 2021-06-28 2021-08-17 深圳市英威诺科技有限公司 Image recognition method, device, equipment and storage medium
CN113657174A (en) * 2021-07-21 2021-11-16 北京中科慧眼科技有限公司 Vehicle pseudo-3D information detection method and device and automatic driving system
CN113569774B (en) * 2021-08-02 2022-04-08 清华大学 Semantic segmentation method and system based on continuous learning
CN113705498B (en) * 2021-09-02 2022-05-27 山东省人工智能研究院 Wheel slip state prediction method based on distribution propagation diagram network
CN113689436B (en) * 2021-09-29 2024-02-02 平安科技(深圳)有限公司 Image semantic segmentation method, device, equipment and storage medium
CN113808128B (en) * 2021-10-14 2023-07-28 河北工业大学 Intelligent compaction whole process visualization control method based on relative coordinate positioning algorithm
CN114495236B (en) * 2022-02-11 2023-02-28 北京百度网讯科技有限公司 Image segmentation method, apparatus, device, medium, and program product

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field
CN109145983A (en) * 2018-08-21 2019-01-04 电子科技大学 A kind of real-time scene image, semantic dividing method based on lightweight network
CN110111335A (en) * 2019-05-08 2019-08-09 南昌航空大学 A kind of the urban transportation Scene Semantics dividing method and system of adaptive confrontation study
CN110147794A (en) * 2019-05-21 2019-08-20 东北大学 A kind of unmanned vehicle outdoor scene real time method for segmenting based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Research on traffic scene semantic segmentation method based on convolutional neural network; Li Linhui et al.; Journal on Communications; 2018-04-25 (No. 04); full text *
Image semantic segmentation based on multi-scale feature extraction; Xiong Zhiyong et al.; Journal of South-Central University for Nationalities (Natural Science Edition); 2017-09-15 (No. 03); full text *
Scene semantic segmentation network based on color-depth images and deep learning; Dai Juting et al.; Science Technology and Engineering; 2018-07-18 (No. 20); full text *
Semantic segmentation of newly added buildings in remote sensing images based on deep learning; Chen Yiming et al.; Computer and Digital Engineering; 2019-12-20 (No. 12); full text *

Also Published As

Publication number Publication date
CN111598095A (en) 2020-08-28

Similar Documents

Publication Publication Date Title
CN111598095B (en) Urban road scene semantic segmentation method based on deep learning
Serna et al. Classification of traffic signs: The european dataset
Alghmgham et al. Autonomous traffic sign (ATSR) detection and recognition using deep CNN
CN113506300B (en) Picture semantic segmentation method and system based on rainy day complex road scene
CN110263786B (en) Road multi-target identification system and method based on feature dimension fusion
CN112183203B (en) Real-time traffic sign detection method based on multi-scale pixel feature fusion
CN109886066A (en) Fast target detection method based on the fusion of multiple dimensioned and multilayer feature
CN112396607A (en) Streetscape image semantic segmentation method for deformable convolution fusion enhancement
CN113902915A (en) Semantic segmentation method and system based on low-illumination complex road scene
CN112990065B (en) Vehicle classification detection method based on optimized YOLOv5 model
CN111860269A (en) Multi-feature fusion tandem RNN structure and pedestrian prediction method
CN114495029A (en) Traffic target detection method and system based on improved YOLOv4
Al Mamun et al. Lane marking detection using simple encode decode deep learning technique: SegNet
Gupta et al. Image-based road pothole detection using deep learning model
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN114612883A (en) Forward vehicle distance detection method based on cascade SSD and monocular depth estimation
Naik et al. Implementation of YOLOv4 algorithm for multiple object detection in image and video dataset using deep learning and artificial intelligence for urban traffic video surveillance application
BARODI et al. Improved deep learning performance for real-time traffic sign detection and recognition applicable to intelligent transportation systems
CN114973199A (en) Rail transit train obstacle detection method based on convolutional neural network
Kadav et al. Development of Computer Vision Models for Drivable Region Detection in Snow Occluded Lane Lines
Yildiz et al. Hybrid image improving and CNN (HIICNN) stacking ensemble method for traffic sign recognition
Dong et al. Intelligent pixel-level pavement marking detection using 2D laser pavement images
CN114495050A (en) Multitask integrated detection method for automatic driving forward vision detection
CN112085001B (en) Tunnel identification model and method based on multi-scale edge feature detection
CN117058641A (en) Panoramic driving perception method based on deep learning

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant