CN111488770A - Traffic sign recognition method, and training method and device of neural network model - Google Patents

Traffic sign recognition method, and training method and device of neural network model

Info

Publication number
CN111488770A
CN111488770A (application number CN201910081841.2A)
Authority
CN
China
Prior art keywords
image
information
traffic sign
historical
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910081841.2A
Other languages
Chinese (zh)
Inventor
李亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Momenta Suzhou Technology Co Ltd
Original Assignee
Momenta Suzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Momenta Suzhou Technology Co Ltd filed Critical Momenta Suzhou Technology Co Ltd
Priority to CN201910081841.2A priority Critical patent/CN111488770A/en
Publication of CN111488770A publication Critical patent/CN111488770A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582 Recognition of traffic signs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The embodiment of the invention discloses a traffic sign recognition method, a training method of a neural network model, and corresponding devices. The traffic sign recognition method comprises the following steps: acquiring position information and category information of a current sub-image of a traffic sign area in a current road image, wherein the position information and the category information are obtained by performing feature extraction on the traffic sign image to be recognized in the current road image by using a preset target detection model; performing feature extraction on the current sub-image by using a convolutional neural network (CNN) according to the position information and the category information, to obtain a feature sequence of the current sub-image; and obtaining target semantic information corresponding to the current sub-image according to the feature sequence and a preset convolutional recurrent neural network (CRNN) model, wherein the CRNN model associates the feature sequence of an image with its corresponding semantic information. By adopting this technical scheme, the recognition accuracy of traffic signs is improved.

Description

Traffic sign recognition method, and training method and device of neural network model
Technical Field
The invention relates to the technical field of image processing, in particular to a traffic sign recognition method, a neural network model training method and a device.
Background
With the development of intelligent transportation technology, automatic driving and driver-assistance technologies are receiving increasing attention. In the fields of automatic driving and driver assistance, an accurate perception system is a precondition for all other systems to work correctly, and traffic sign recognition is an important function of a vehicle perception system. Accurate traffic sign recognition is the basis for a vehicle to react correctly to its environment, and plays a vital role in the normal and safe driving of autonomous vehicles.
At present, traffic signs are of many types and their appearance features are close to one another, so it is difficult to distinguish them accurately with existing image classification methods; moreover, these methods cannot effectively extract the semantic information in the traffic signs, which is unfavorable for vehicles making use of the signs.
Disclosure of Invention
The embodiment of the invention discloses a traffic sign identification method, a neural network model training method and a device, which improve the identification precision of a traffic sign board.
In a first aspect, an embodiment of the present invention discloses a method for identifying a traffic sign, including:
acquiring position information and category information of a current sub-image of a traffic sign area in a current road image, wherein the position information and the category information are obtained by performing feature extraction on a traffic sign image to be identified in the current road image;
according to the position information and the category information, performing feature extraction on the current sub-image by using a Convolutional Neural Network (CNN) to obtain a feature sequence of the current sub-image;
and obtaining target semantic information corresponding to the current sub-image according to the feature sequence and a preset convolutional recurrent neural network (CRNN) model, wherein the CRNN model associates the feature sequence of the sub-image of the traffic sign area in the road image with its corresponding semantic information.
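The three steps above can be sketched end to end; every name here (the detector, the category, the returned semantics) is an illustrative stand-in for the patent's models, not an implementation of them:

```python
import numpy as np

def detect_sign(road_image):
    """Stand-in for the preset target detection model: returns the
    bounding-box position (x, y, w, h) and a category label for the
    traffic-sign sub-image.  Values are illustrative placeholders."""
    return (40, 10, 64, 32), "speed_limit"

def cnn_feature_sequence(sub_image):
    """Stand-in for the CNN step: one feature vector per column of the
    sub-image, scanned left to right."""
    h, w = sub_image.shape[:2]
    return [sub_image[:, col].astype(float).mean(axis=0) for col in range(w)]

def crnn_semantics(feature_sequence, category):
    """Stand-in for the CRNN model: maps the feature sequence (plus the
    category information) to semantic text."""
    return f"{category}: limit 60"  # placeholder semantics

road_image = np.zeros((480, 640, 3), dtype=np.uint8)
(x, y, w, h), category = detect_sign(road_image)
sub_image = road_image[y:y + h, x:x + w]          # step 1: position + category
features = cnn_feature_sequence(sub_image)        # step 2: feature sequence
semantics = crnn_semantics(features, category)    # step 3: semantic information
```

The point of the cascade is that detection narrows the image to the sign region before sequence modeling, so the CRNN only has to read the sign, not find it.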
As an optional implementation manner, in the first aspect of the embodiment of the present invention, the CRNN model is constructed in the following manner:
acquiring historical sub-images and the semantic information thereof, wherein each historical sub-image only comprises the traffic sign position area in a historical road image, and the historical sub-images are obtained by performing feature extraction on the traffic sign image to be recognized in the historical road image by using a preset target detection model;
extracting features of the historical sub-images to obtain a feature sequence of each historical sub-image;
generating a training sample set based on the feature sequences of the plurality of historical sub-images and the semantic information corresponding to each feature sequence;
and training an initial neural network model based on a machine learning algorithm to obtain the CRNN model, wherein the CRNN model associates the feature sequence of each historical sub-image in the training sample set with its corresponding semantic information.
As an optional implementation manner, in the first aspect of the embodiment of the present invention, performing feature extraction on the to-be-identified traffic sign image in the current road image to obtain the position information of the current sub-image includes:
performing feature extraction on the traffic sign image to be recognized in the current road image by using a full convolution network layer in the preset target detection model, to obtain initial features of the traffic sign image, wherein the initial features at least comprise edge features of the traffic sign;
and using the feature extraction network layer in the preset target detection model, determining an optimal candidate frame containing the traffic sign based on the initial features, taking the position of the optimal candidate frame as the position of the localization bounding box of the traffic sign, and outputting a current sub-image containing the localization bounding box and the traffic sign inside it by using a region-based convolutional neural network (R-CNN).
As an alternative implementation manner, in the first aspect of the embodiment of the present invention, the determining, based on the initial features, an optimal candidate frame containing the traffic sign further includes:
performing separation convolution on the initial features to obtain a separation matrix;
and determining the optimal candidate frame containing the traffic sign based on the separation matrix.
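The patent does not define "separation convolution" further; under the common reading of a depthwise convolution followed by a 1x1 pointwise convolution (as in depthwise separable convolutions), the parameter saving it relies on can be sketched as follows — all shapes are illustrative:

```python
import numpy as np

def depthwise_separable_conv(feat, dw_kernels, pw_weights):
    """Depthwise separable convolution: each input channel is filtered
    by its own k x k kernel (depthwise step), then a 1x1 convolution
    mixes channels (pointwise step).  'Valid' padding, stride 1."""
    c, h, w = feat.shape
    k = dw_kernels.shape[-1]
    oh, ow = h - k + 1, w - k + 1
    # Depthwise: one kernel per channel, no cross-channel mixing.
    dw = np.zeros((c, oh, ow))
    for ch in range(c):
        for i in range(oh):
            for j in range(ow):
                dw[ch, i, j] = np.sum(feat[ch, i:i+k, j:j+k] * dw_kernels[ch])
    # Pointwise: a 1x1 conv mixes the c channels into out_c channels.
    return np.tensordot(pw_weights, dw, axes=([1], [0]))

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))        # 4 input channels
dw_kernels = rng.standard_normal((4, 3, 3))  # one 3x3 kernel per channel
pw_weights = rng.standard_normal((8, 4))     # 1x1 conv: 4 -> 8 channels
out = depthwise_separable_conv(feat, dw_kernels, pw_weights)
# Parameters: 4*3*3 + 8*4 = 68, versus 8*4*3*3 = 288 for a standard
# 3x3 convolution with the same input/output channels.
```

This factorization is what yields the reduced computation and parameter count the description cites, at an approximately unchanged representation capability.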
As an optional implementation manner, in the first aspect of the embodiment of the present invention, performing feature extraction on the current sub-image by using a convolutional neural network CNN to obtain a feature sequence of the current sub-image includes:
and performing feature extraction on the current sub-image in a left-to-right direction by using the convolutional neural network (CNN), so that each column of the feature map corresponds to one feature vector, and the feature vectors corresponding to all the columns of the feature map form the feature sequence of the current sub-image.
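The column-wise conversion described above — one feature vector per column of the feature map, scanned left to right — can be sketched as follows (the toy feature map is illustrative):

```python
import numpy as np

def feature_map_to_sequence(feature_map):
    """Convert a CNN feature map of shape (channels, height, width)
    into a left-to-right sequence: one vector per column, formed by
    flattening that column across channels and height."""
    c, h, w = feature_map.shape
    return [feature_map[:, :, col].reshape(c * h) for col in range(w)]

fm = np.arange(2 * 3 * 5, dtype=float).reshape(2, 3, 5)  # toy 2x3x5 map
seq = feature_map_to_sequence(fm)  # 5 vectors of length 2*3 = 6
```

The sequence length thus equals the feature-map width, which is why a wider sign image yields more time steps for the recurrent stage.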
As an optional implementation manner, in the first aspect of the embodiment of the present invention, obtaining the target semantic information corresponding to the current sub-image according to the feature sequence and the preset convolutional recurrent neural network (CRNN) model includes:
converting the characteristic sequence into a character sequence based on a Recurrent Neural Network (RNN) in the convolutional recurrent neural network model;
based on a transcription layer in the convolutional recurrent neural network model, performing redundancy removal on the character sequence according to a preset conditional probability, to obtain target semantic information corresponding to a target character sequence, wherein the log-likelihood function of the preset conditional probability is the loss function of the convolutional recurrent neural network model.
As an optional implementation manner, in the first aspect of the embodiment of the present invention, based on a transcription layer in the convolutional recurrent neural network model, and according to a preset conditional probability, performing redundancy removal processing on the character sequence to obtain target semantic information corresponding to a target character sequence, includes:
selecting the character sequence with the highest probability from a preset dictionary as the target character sequence by using Connectionist Temporal Classification (CTC), wherein the preset dictionary is the set of all legal traffic sign semantics.
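The transcription step can be illustrated with best-path (greedy) CTC decoding, a simpler variant of the lexicon-based selection the patent describes: take the most probable label at each time step, collapse consecutive repeats, and drop blanks. The symbol table and scores below are illustrative:

```python
def ctc_greedy_decode(logit_rows, blank=0):
    """Best-path CTC decoding: argmax label per time step, collapse
    consecutive repeats, then drop the blank symbol.  This is the
    'redundancy removal' the transcription layer performs."""
    path = [max(range(len(row)), key=row.__getitem__) for row in logit_rows]
    decoded, prev = [], None
    for label in path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Per-step scores over {blank=0, '6'=1, '0'=2}, e.g. a "60" speed limit.
scores = [
    [0.1, 0.8, 0.1],    # '6'
    [0.1, 0.7, 0.2],    # '6' again -> collapsed with the previous step
    [0.9, 0.05, 0.05],  # blank separates symbols
    [0.1, 0.1, 0.8],    # '0'
]
assert ctc_greedy_decode(scores) == [1, 2]  # reads as "60"
```

A lexicon-constrained decoder, as in the claim, would instead score each entry of the preset dictionary and keep the most probable legal semantics rather than the raw best path.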
In a second aspect, an embodiment of the present invention further provides an apparatus for identifying a traffic sign, where the apparatus includes:
a current sub-image information acquisition module configured to acquire position information and category information of a current sub-image of the traffic sign area in a current road image, wherein the position information and the category information are obtained by performing feature extraction on the traffic sign image to be recognized in the current road image;
a current feature sequence determination module configured to perform feature extraction on the current sub-image by using a convolutional neural network (CNN) according to the position information and the category information, to obtain a feature sequence of the current sub-image;
and a first semantic information identification module configured to obtain target semantic information corresponding to the current sub-image according to the feature sequence and a preset convolutional recurrent neural network (CRNN) model, wherein the CRNN model associates the feature sequence of the sub-image of the traffic sign area in the road image with its corresponding semantic information.
As an optional implementation manner, in the second aspect of the embodiment of the present invention, the CRNN model is constructed by the following method:
the historical sub-image information acquisition module is configured to acquire historical sub-images that only contain the traffic sign position area in historical road images, together with their category information and semantic information, wherein the historical sub-images and their category information are obtained by performing feature extraction on the traffic sign images to be recognized in the historical road images by using a preset target detection model;
the historical feature sequence determination module is configured to extract features of the historical sub-images to obtain a feature sequence of each historical sub-image;
the first training sample set generation module is configured to generate a first training sample set based on the feature sequences and category information of the plurality of historical sub-images and the corresponding semantic information;
and the neural network model training module is configured to train an initial neural network model based on a machine learning algorithm to obtain the CRNN model, wherein the CRNN model associates the feature sequence and category information of each historical sub-image in the first training sample set with the corresponding semantic information.
As an optional implementation manner, in the second aspect of the embodiment of the present invention, the feature extraction of the to-be-identified traffic sign image in the current road image to obtain the position information of the current sub-image is implemented as follows:
the initial feature determination module is configured to utilize a full convolution network layer in a preset target detection model to perform feature extraction on a to-be-identified traffic sign image in the current road image to obtain initial features of the traffic sign image, wherein the initial features at least comprise edge features of the traffic sign;
and the optimal candidate frame determination module is configured to use the feature extraction network layer in the preset target detection model to determine an optimal candidate frame containing the traffic sign based on the initial features, take the position of the optimal candidate frame as the position of the localization bounding box of the traffic sign, and output a current sub-image containing the localization bounding box and the traffic sign inside it by using a region-based convolutional neural network (R-CNN).
As an optional implementation manner, in the second aspect of the embodiment of the present invention, the best candidate extraction block determination module is specifically configured to:
performing separation convolution on the initial features to obtain a separation matrix;
and determining an optimal candidate frame containing the traffic sign based on the separation matrix, taking the position of the optimal candidate frame as the position of the localization bounding box of the traffic sign, and outputting a current sub-image containing the localization bounding box and the traffic sign inside it by using a region-based convolutional neural network (R-CNN).
As an optional implementation manner, in the second aspect of the embodiment of the present invention, the current feature sequence determining module is specifically configured to:
and performing feature extraction on the current sub-image in a left-to-right direction by using the convolutional neural network (CNN), so that each column of the feature map corresponds to one feature vector, and the feature vectors corresponding to all the columns of the feature map form the feature sequence of the current sub-image.
As an optional implementation manner, in a second aspect of the embodiment of the present invention, the first semantic information identifying module includes:
a character sequence conversion unit configured to convert the feature sequence into a character sequence based on a Recurrent Neural Network (RNN) in the convolutional recurrent neural network model;
and a semantic information determination unit configured to perform redundancy removal on the character sequence based on a transcription layer in the convolutional recurrent neural network model and according to a preset conditional probability, to obtain target semantic information corresponding to a target character sequence, wherein the log-likelihood function of the preset conditional probability is the loss function of the convolutional recurrent neural network model.
As an optional implementation manner, in a second aspect of the embodiment of the present invention, the semantic information determining unit is specifically configured to:
selecting the character sequence with the highest probability from a preset dictionary as the target character sequence by using Connectionist Temporal Classification (CTC), wherein the preset dictionary is the set of all legal traffic sign semantics.
In a third aspect, an embodiment of the present invention further discloses a training method for a neural network model, where the method includes:
acquiring historical sub-images and the category information and semantic information thereof, wherein each historical sub-image only comprises the traffic sign position area in a historical road image, the historical sub-images and their category information are obtained by performing feature extraction on the traffic sign image to be recognized in the historical road image by using a preset target detection model, and the preset target detection model associates the category information of the traffic sign with its semantic information;
extracting features of the historical sub-images to obtain a feature sequence of each historical sub-image;
generating a first training sample set based on the feature sequences and category information of the plurality of historical sub-images and the corresponding semantic information;
and training an initial neural network model based on a machine learning algorithm to obtain a CRNN model, wherein the CRNN model associates the feature sequence and category information of each historical sub-image in the first training sample set with the corresponding semantic information.
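The first three steps amount to pairing each historical sub-image's feature sequence with its labels to form the training sample set; a minimal sketch, with an illustrative stand-in for the feature extractor:

```python
import numpy as np

def extract_features(img):
    """Illustrative stand-in for the CNN feature extractor: one vector
    per column of the sub-image (see the left-to-right CNN step)."""
    return [img[:, col].astype(float) for col in range(img.shape[1])]

def build_training_set(historical_subimages, semantics, extractor):
    """Steps 1-3: pair each historical sub-image's feature sequence
    with its semantic label to form the training sample set on which
    the initial neural network is then trained."""
    return [(extractor(img), text)
            for img, text in zip(historical_subimages, semantics)]

imgs = [np.zeros((8, 16), dtype=np.uint8),  # toy historical sub-images
        np.ones((8, 12), dtype=np.uint8)]
labels = ["limit 60", "no entry"]           # illustrative semantics
train_set = build_training_set(imgs, labels, extract_features)
```

Step 4 — fitting the CRNN to these pairs — would typically minimize a CTC loss over the feature sequences, which is the log-likelihood criterion the description names.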
As an optional implementation manner, in the third aspect of the embodiment of the present invention, the preset target detection model is constructed by:
acquiring a historical road sample image marked with position information, category information and semantic information of a traffic sign;
extracting the position information of the traffic signs in the historical road sample images, and generating a second training sample set based on the plurality of pieces of position information and their corresponding category information and semantic information;
and establishing a preset target detection model based on a machine learning algorithm, wherein the preset target detection model enables each category information in the second training sample set to be associated with the corresponding semantic information.
As an optional implementation manner, in the third aspect of the embodiment of the present invention, after obtaining the CRNN model, the method further includes:
and performing end-to-end training on the cascaded preset target detection model and the CRNN model by using all historical road sample images, corresponding position information and category information thereof, and semantic information corresponding to historical sub-images in the historical road sample images.
In a fourth aspect, an embodiment of the present invention further discloses a training apparatus for a neural network model, where the apparatus includes:
the historical sub-image information acquisition module is configured to acquire historical sub-images that only contain the traffic sign position area in the historical road images, together with their category information and semantic information, wherein the historical sub-images and their category information are obtained by performing feature extraction on the traffic sign images to be recognized in the historical road images by using a preset target detection model, and the preset target detection model associates the category information of the traffic sign with its semantic information;
the historical feature sequence determination module is configured to extract features of the historical sub-images to obtain a feature sequence of each historical sub-image;
the first training sample set generation module is configured to generate a first training sample set based on the feature sequences and category information of the plurality of historical sub-images and the corresponding semantic information;
and the neural network model training module is configured to train an initial neural network model based on a machine learning algorithm to obtain the CRNN model, wherein the CRNN model associates the feature sequence and category information of each historical sub-image in the first training sample set with the corresponding semantic information.
As an optional implementation manner, in the fourth aspect of the embodiment of the present invention, the preset target detection model is constructed by:
the historical road sample image acquisition module is configured to acquire a historical road sample image marked with the position information, the category information and the semantic information of the traffic sign;
the second training sample set generation module is configured to extract the position information of the traffic signs in the historical road sample images and generate a second training sample set based on the plurality of pieces of position information and their corresponding category information and semantic information;
a preset target detection model training module configured to establish a preset target detection model based on a machine learning algorithm, the preset target detection model associating each category information in the second training sample set with its corresponding semantic information.
As an optional implementation manner, in a fourth aspect of the embodiment of the present invention, the apparatus further includes:
and the end-to-end training module is configured to perform end-to-end training on the cascaded preset target detection model and the CRNN model by utilizing all historical road sample images, the corresponding position information and the corresponding category information of the historical road sample images and the corresponding semantic information of the historical sub-images in the historical road sample images.
In a fifth aspect, an embodiment of the present invention further provides a method for identifying a traffic sign, where the method includes:
acquiring a current road image;
extracting current position information of a traffic sign in the current road image;
and determining the category information of the traffic sign board based on the current position information and a preset target detection model, and determining the semantic information of the traffic sign board according to the category information, wherein the preset target detection model enables the category information of the traffic sign board to be associated with the semantic information.
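The fifth-aspect flow reduces to a detection step followed by a category-to-semantics lookup; a minimal sketch in which the detector and the association table are illustrative stand-ins for the preset target detection model and the association it learns:

```python
def identify_sign(road_image, detector, category_to_semantics):
    """Fifth-aspect flow: locate the sign, classify it, then map the
    category directly to semantic information via the association the
    detection model learned (modeled here as a plain lookup table)."""
    position, category = detector(road_image)
    return position, category, category_to_semantics[category]

# Illustrative detector and category -> semantics association.
detector = lambda img: ((12, 20, 48, 48), "no_parking")
table = {"no_parking": "no parking at any time",
         "stop": "come to a full stop"}
pos, cat, sem = identify_sign(None, detector, table)
```

Unlike the first aspect, no per-character sequence model is needed here: the semantics are attached to the whole category, which suits signs whose meaning is fixed per class.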
In a sixth aspect, an embodiment of the present invention further provides a device for identifying a traffic sign, including:
a current road image acquisition module configured to acquire a current road image;
a current position feature extraction module configured to extract current position information of a traffic sign in the current road image;
a second semantic information recognition module configured to determine the category information of the traffic sign based on the current position information and a preset target detection model, and determine the semantic information of the traffic sign according to the category information, wherein the preset target detection model associates the category information of the traffic sign with its semantic information.
In a seventh aspect, an embodiment of the present invention further provides a training method for a neural network model, including:
acquiring a historical road sample image marked with position information, category information and semantic information of a traffic sign;
extracting the position information of the traffic signs in the historical road sample images, and generating a second training sample set based on the plurality of pieces of position information and their corresponding category information and semantic information;
and establishing a preset target detection model based on a machine learning algorithm, wherein the preset target detection model enables each category information in the second training sample set to be associated with the corresponding semantic information.
In the seventh aspect, an embodiment of the present invention further provides a corresponding training apparatus for a neural network model, including:
the historical road sample image acquisition module is configured to acquire a historical road sample image marked with the position information, the category information and the semantic information of the traffic sign;
the second training sample set generation module is configured to extract the position information of the traffic signs in the historical road sample images and generate a second training sample set based on the plurality of pieces of position information and their corresponding category information and semantic information;
a preset target detection model training module configured to establish a preset target detection model based on a machine learning algorithm, the preset target detection model associating each category information in the second training sample set with its corresponding semantic information.
In an eighth aspect, an embodiment of the present invention further provides a vehicle-mounted terminal, including:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute part or all of the steps of the method for identifying the traffic sign or part or all of the steps of the method for training the neural network model provided by any embodiment of the invention.
In a ninth aspect, the present invention further provides a computer-readable storage medium storing a computer program, where the computer program includes instructions for executing part or all of the steps of the method for recognizing a traffic sign or the method for training a neural network model provided in any of the embodiments of the present invention.
In a tenth aspect, the embodiment of the present invention further provides a computer program product, which when run on a computer, causes the computer to execute part or all of the steps of the method for identifying a traffic sign or the method for training a neural network model provided in any embodiment of the present invention.
According to the technical scheme of the embodiments of the invention, the position information and the category information of the current sub-image of the traffic sign area in the current road image can be obtained by recognizing the current road image with the preset target detection model. A feature sequence is then extracted from the sub-image according to the position information and the category information and input into the preset convolutional recurrent neural network model to obtain the target semantic information corresponding to the sub-image, so that the recognition result of the traffic sign is more accurate.
The invention points of the embodiments of the invention include:
1. The recognition of the traffic sign is realized by cascading a preset target detection model and a preset convolutional recurrent neural network model. The preset target detection model locates and classifies the traffic signs in the road image, and the preset convolutional recurrent neural network model further extracts semantic features of the traffic signs from the output of the preset target detection model, so as to improve the recognition accuracy of the traffic signs.
2. In the preset target detection model, before the optimal candidate frame of the traffic sign is determined, a separation matrix can be obtained by performing separation convolution on the output of the backbone network. This reduces the amount of computation and the number of network parameters while keeping the representation capability of the model approximately unchanged, which speeds up computation and reduces memory usage. This is one of the invention points of the embodiments of the invention.
3. The method introduces a bidirectional Short-Term Memory network (L ong Short-Term Memory, L STM) unit to convert the characteristic sequence into a character sequence, can extract higher-level semantic characteristics, and is one of the invention points of the embodiment of the invention.
4. After the training of the preset target detection model and the preset convolution cyclic neural network model is completed, the two models are cascaded and further end-to-end training and optimization are carried out, so that the recognition effect of the models can be further improved, and the method is one of the invention points of the embodiment of the invention.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a training method of a neural network model according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a method for identifying a traffic sign according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a preset target detection model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a separation convolution according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a training method of a neural network model according to another embodiment of the present invention;
FIG. 6 is a flow chart illustrating a method for identifying a traffic sign according to another embodiment of the present invention;
fig. 7 is a block diagram of a training apparatus for a neural network model according to an embodiment of the present invention;
fig. 8 is a block diagram of a traffic sign recognition apparatus according to an embodiment of the present invention;
FIG. 9 is a block diagram of a training apparatus for a neural network model according to an embodiment of the present invention;
fig. 10 is a block diagram illustrating a structure of an apparatus for recognizing a traffic sign according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of an in-vehicle terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It is to be noted that the terms "comprises" and "comprising" and any variations thereof in the embodiments and drawings of the present invention are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a method for training a neural network model according to an embodiment of the present invention. The method can be applied to automatic driving and executed by a training apparatus of a neural network model; the apparatus can be implemented in software and/or hardware, and can generally be integrated in a vehicle-mounted terminal such as a vehicle-mounted Computer or a vehicle-mounted Industrial Personal Computer (IPC), which is not limited by the embodiment of the present invention. The neural network model provided in this embodiment is mainly directed to a preset target detection model for identifying simple traffic signs, where a simple traffic sign generally refers to a sign composed only of graphics, symbols, or a small number of characters, such as a warning sign, a prohibition sign, an indication sign, a road sign, a tourist area sign, a road construction safety sign, a speed limit sign, and the like. As shown in fig. 1, the training method of the neural network model provided in this embodiment specifically includes:
110. and acquiring a historical road sample image marked with the position information, the category information and the semantic information of the traffic sign.
In this embodiment, the training model adopts a supervised training mode, so that all the used sample images need to have corresponding labels, that is, each sign in the historical road sample image needs to have corresponding position information, specific category information and semantic information labels.
In order to ensure the accuracy of the detection model, as many training sample images (i.e., historical road sample images) as possible are needed. In embodiments of the present application, the sample images may be derived from the video stream of a camera and from a network data set. For video stream data, the camera needs to undergo monocular calibration, with its intrinsic parameters and distortion parameters recorded. Each single frame image is corrected according to the camera parameters, and the picture is transformed into a distortion-free or nearly distortion-free state using relevant image post-processing techniques. After the corrected road sample images are labeled, a sample library for model training is constructed.
In the process of training the neural network model, sample images need to be input into the neural network. Before a sample is input, some transformations can be applied using image augmentation techniques, such as randomly cropping the image and randomly adjusting its contrast, brightness, color saturation and gamma value (each within a certain range), which increases the randomness of the sample space. This is equivalent to expanding the number of samples and indirectly increases the generalization ability of the model, so that the trained model can adapt to images of different illumination environments and different qualities.
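The augmentation transforms described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the jitter ranges and the 90% crop floor are assumptions chosen for the example, and color-saturation jitter is omitted for brevity.

```python
import random
import numpy as np

def augment(image, rng=None):
    """Apply a random crop plus brightness, contrast and gamma jitter
    to an HxWx3 uint8 image. Parameter ranges are illustrative."""
    rng = rng or random.Random()
    img = image.astype(np.float32) / 255.0

    # Random crop: keep at least 90% of each dimension.
    h, w, _ = img.shape
    ch, cw = int(h * rng.uniform(0.9, 1.0)), int(w * rng.uniform(0.9, 1.0))
    y0, x0 = rng.randint(0, h - ch), rng.randint(0, w - cw)
    img = img[y0:y0 + ch, x0:x0 + cw]

    # Brightness and contrast jitter within bounded ranges.
    img = img * rng.uniform(0.8, 1.2)                 # brightness
    img = (img - 0.5) * rng.uniform(0.8, 1.2) + 0.5  # contrast

    # Gamma correction within a bounded range.
    img = np.clip(img, 0.0, 1.0) ** rng.uniform(0.8, 1.25)

    return (np.clip(img, 0.0, 1.0) * 255).astype(np.uint8)
```

Passing a seeded `random.Random` makes the transform reproducible, which is useful when the same augmentation pipeline must be replayed for debugging.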
120. And extracting the position information of the traffic signs in the historical road sample images, and generating a second training sample set based on the plurality of position information items and their corresponding category information and semantic information.
130. And establishing a preset target detection model based on a machine learning algorithm, wherein the preset target detection model enables each category information in the second training sample set to be associated with the corresponding semantic information.
Since the preset target detection model is a target detection and classification model, all training image samples can be used in the training process. One or more GPUs (graphics processing units) may be used for training; the picture batch-size on each GPU may be set to 2, and the number of ROIs (regions of interest) per picture may be set to 2000 in training and 1000 in testing. As the optimization algorithm for the training model, a momentum-based stochastic gradient descent algorithm is preferably used, in which the weight decay can be set to 0.0001, the momentum to 0.9, and the learning rate is adjusted step by step according to the training progress. Alternatively, Adam (a first-order optimization algorithm replacing the conventional stochastic gradient descent process), RMSProp (a deep neural network optimization algorithm), and the like can be used.
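The momentum-based update described above can be written out explicitly. This sketch uses the stated momentum 0.9 and weight decay 0.0001; the function name and the learning rate in the example are illustrative assumptions, and the step-wise learning-rate schedule is omitted.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9,
                      weight_decay=1e-4):
    """One momentum-SGD update with L2 weight decay.
    Returns the updated weights and the updated velocity buffer."""
    grad = grad + weight_decay * w            # L2 regularization term
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity
```

In practice the velocity buffer is carried across iterations, one per parameter tensor; frameworks such as PyTorch expose the same rule through their SGD optimizer.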
In the embodiment, the preset target detection model is established, so that the one-to-one mapping relation between the simple traffic signboard category information and the semantic information can be established, the aim of directly and quickly obtaining the semantic information is fulfilled, and the further analysis process of the signboard semantic information is omitted.
After the preset target detection model is obtained, the simple traffic sign board in the current road image acquired by the camera in real time can be located and identified using the model. Referring to fig. 2, fig. 2 is a flow diagram of the traffic sign identification method provided by the embodiment of the present invention. The method can be executed by a traffic sign identification apparatus; the apparatus can be implemented in software and/or hardware, and can generally be integrated in a vehicle-mounted terminal such as a vehicle-mounted Computer or a vehicle-mounted Industrial Personal Computer (IPC), which is not limited by the embodiment of the present invention. The identification method provided by this embodiment is mainly directed to identifying simple traffic sign boards, and as shown in fig. 2, comprises the following steps:
and S210, acquiring a current road image.
The current road image can be acquired in real time through a camera in the vehicle.
And S220, extracting the current position information of the traffic sign in the current road image.
For example, the current location information of the traffic sign may be given by locating bounding boxes containing the traffic sign, and each bounding box may be specified either by the coordinates of its upper-left and lower-right corners, or by one corner coordinate together with the width and height of the box.
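The two bounding-box encodings mentioned above carry the same information, so converting between them is a pair of one-line transforms. A minimal sketch (function names are illustrative):

```python
def corners_to_xywh(box):
    """Convert (x1, y1, x2, y2) two-corner form to (x1, y1, w, h) form."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)

def xywh_to_corners(box):
    """Convert (x, y, w, h) form back to two-corner form."""
    x, y, w, h = box
    return (x, y, x + w, y + h)
```

Either encoding needs exactly four numbers, which is why the regression layer described later in this document has four output channels per box.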
And S230, determining the category information of the traffic sign board based on the current position information and a preset target detection model, and determining the semantic information of the traffic sign board according to the category information, wherein the preset target detection model enables the category information to be associated with the semantic information.
By using a preset target detection model, category information of the traffic sign at the current location can be determined. Because the category information and the semantic information have one-to-one mapping relationship, the semantic information of the traffic sign can be directly determined through the category information.
In this embodiment, the preset target detection model may be implemented by Light-Head R-CNN (a region-based convolutional neural network with a lightweight head) with an Xception network as the backbone. Existing object detection models based on convolutional neural networks can be divided into two types: single-stage detectors, such as YOLO (a real-time object detection system) and SSD (single shot multi-box detector), which are faster but less accurate; and two-stage detectors, such as Faster R-CNN (Faster Regions with Convolutional Neural Network features), which are more accurate but slower. An automatic driving system requires both high accuracy and real-time speed, and Light-Head R-CNN with an Xception backbone provides a good balance between detection accuracy and speed, which is why it is adopted as the preset target detection model in this embodiment.
As shown in FIG. 3, Light-Head R-CNN is mainly divided into three parts: the first part is a backbone network composed only of convolutional and pooling layers, mainly used to extract image features and obtain a feature map; the second part is a feature map obtained by performing separable convolution on the last layer of the backbone network; and the third part is a lightweight R-CNN (Regions with Convolutional Neural Network features) sub-network.
First, the backbone network of the model, i.e. the first part, serves as a feature extractor. It is preferably a fully convolutional network, suitable for input images of various sizes; the fully convolutional layers of a pre-trained ResNet (Residual Neural Network) or Xception network may be used.
The second part of the model, the feature map, is obtained by performing a single separable convolution on the output of the backbone network. Separable convolution decomposes a k × k convolution into a k × 1 convolution followed by a 1 × k convolution, so that the amount of calculation and the number of network parameters can be reduced while the representation capability of the model remains approximately unchanged, thereby improving the calculation speed and reducing memory usage.
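The parameter saving from the k × k to (k × 1, 1 × k) factorization can be counted directly. The sketch below compares the two layouts; the channel counts in the test are illustrative assumptions (a wide input feature map and an intermediate channel count much smaller than the input), not values taken from this document.

```python
def standard_conv_params(k, c_in, c_out):
    """Weight count of a single k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_mid, c_out):
    """Weight count when the k x k kernel is factored into a k x 1
    convolution (c_in -> c_mid) followed by a 1 x k convolution
    (c_mid -> c_out), biases ignored."""
    return k * 1 * c_in * c_mid + 1 * k * c_mid * c_out
```

The factorized form is cheaper whenever `c_mid` is small relative to `k * c_in * c_out / (c_in + c_out)`, which is the regime large-kernel "light head" designs operate in.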
In the second part, candidate regions are generated from the feature map using an RPN (Region Proposal Network). This network is a sliding-window-based fully convolutional network connected to the penultimate convolutional layer of the backbone network. It contains a series of pre-defined anchor boxes (anchors) with three different aspect ratios {1:2, 1:1, 2:1} and five different sizes {32², 64², 128², 256², 512²}, i.e. fifteen different shapes, to cover different object shapes. Since the large number of proposals contains many overlapping extraction frames, a Non-Maximum Suppression (NMS) method is used to reduce the number of overlapping proposal frames. Before the extraction frames are fed into the ROI prediction network, the NMS overlap-ratio threshold IoU (intersection over union) is set to 0.7, and training labels are then assigned according to the overlap ratio between each anchor and the ground-truth bounding boxes: if the overlap ratio of an anchor with any ground-truth bounding box is greater than 0.7, the anchor is given a positive label; if an anchor has the maximum overlap ratio with some ground-truth bounding box, it is also given a positive label; and if the overlap ratio of an anchor with all ground-truth bounding boxes is less than 0.3, it is given a negative label. From this, the best candidate extraction frames can be determined.
The third part, the R-CNN sub-network, mainly comprises a fully connected layer with 2048 channels and two small fully connected layers, which are respectively responsible for target classification and target position regression. The number of channels in the classification layer is determined by the total number of classes (500 channels in the case of 500 classes), and the number of channels in the regression layer is 4, because regardless of how the bounding box is labeled, only four numbers are needed to uniquely determine its position and size, for example: the coordinates of the top-left vertex together with the width and height of the detection frame, or the coordinates of the top-left and bottom-right vertices.
In this embodiment, the Light-Head R-CNN model is used to identify images acquired in real time, so as to determine the position information and classification information of the simple traffic signs in the road image. The image corresponding to the position information is the image inside the best candidate extraction frame described above, i.e. the image containing only the sign region; hereinafter, the image corresponding to the position information obtained by the Light-Head R-CNN model is referred to as the sub-image containing the traffic sign region in the road image.
The embodiment of the invention preferably adopts a preset CRNN (Convolutional Recurrent Neural Network) model to further extract semantic information from the sub-images obtained by the Light-Head R-CNN model together with their corresponding classification information. The training method of this model is introduced first.
Example two
Referring to fig. 5, fig. 5 is a flowchart illustrating a training method of a neural network model according to another embodiment of the present invention. The method can be applied to automatic driving and executed by a training apparatus of a neural network model; the apparatus can be implemented in software and/or hardware, and can generally be integrated in a vehicle-mounted terminal such as a vehicle-mounted Computer or a vehicle-mounted Industrial Personal Computer (IPC). The neural network model provided by this embodiment is mainly directed to a convolutional recurrent neural network model for identifying complex traffic signs. As shown in fig. 5, the training method of the neural network model provided in this embodiment specifically includes:
s310, acquiring historical sub-images only containing the position area of the traffic sign in the historical road image, and category information and semantic information of the historical sub-images.
The historical subimages and the category information thereof are obtained by extracting the characteristics of the traffic sign images to be identified in the historical road images by using the preset target detection model.
In one implementation, based on the introduction of the preset target detection model in the first embodiment, the position information of the current sub-image is obtained by performing feature extraction on the traffic sign image to be identified in the current road image with the preset target detection model, as follows:
performing feature extraction on the traffic sign image to be identified in the current road image by using a full convolution network in a preset target detection model to obtain initial features of the traffic sign image, wherein the initial features at least comprise edge features of the traffic sign;
determining the best candidate extraction frame containing the traffic sign based on the initial features using the RPN in the Light-Head R-CNN model, wherein the best candidate extraction frame can take the form of a rectangular frame, and the rectangular frame can be uniquely determined in various ways, for example by the coordinates of its top-left vertex together with its width and height, or by the coordinates of its top-left and bottom-right vertices;
and taking the position of the best candidate extraction frame as the position of a positioning boundary frame of the traffic sign board, and outputting a current sub-image containing the positioning boundary frame and the internal traffic sign board by utilizing the R-CNN network.
Preferably, before the best candidate extraction frame containing the traffic sign is determined, separable convolution can be performed on the traffic sign image based on the initial features to obtain a separation matrix; correspondingly, the best candidate extraction frame containing the traffic sign is then determined according to the separation matrix. This reduces the amount of calculation and the number of network parameters while the representation capability of the model remains approximately unchanged, thereby improving the calculation speed and reducing memory usage.
And S320, extracting the characteristics of the historical sub-images to obtain a characteristic sequence of the historical sub-images.
S330, generating a first training sample set based on the feature sequences, the category information and the corresponding semantic information of the plurality of historical sub-images.
S340, training the initial neural network model based on a machine learning algorithm to obtain a CRNN model, wherein the CRNN model enables the feature sequence and the category information of each historical sub-image in the first training sample set to be associated with the corresponding semantic information.
In this embodiment, the CRNN model mainly comprises three parts: a convolutional neural network, a recurrent neural network, and a transcription layer. The convolutional neural network extracts a fixed number of feature sequences from the input sub-images; the feature sequences are input into a bidirectional recurrent neural network composed of LSTM modules to obtain a preliminary prediction sequence; and the transcription layer removes the redundant parts of the prediction sequence to obtain the final prediction sequence. The overall structure of the model is shown in fig. 5. The parts of the CRNN model are introduced below:
extracting a characteristic sequence: in the CRNN model, the Convolutional layer is composed of a Convolutional layer and a maximum pooling layer in a standard CNN (Convolutional neural networks) model. Such components are used to extract the serialized feature representation from the input image. The heights of all images may be normalized to the same height before entering the network, and the widths of the images need not all be the same (since the length of the text sequence is not constant).
A series of feature vectors can then be extracted from the feature map as the input to the recurrent neural network. Specifically, each feature vector in the feature sequence is extracted from one column of the feature map, from left to right, meaning that the ith feature vector is obtained from the ith column of the feature map. Since convolution, max pooling, and element-wise activation functions all operate on local regions, they are all translation invariant. Each column of the final feature map therefore corresponds to a rectangular region of the original input sub-image.
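The column-wise slicing described above can be sketched in a few lines. This is an illustrative sketch assuming the feature map is stored channel-first as a (C, H, W) array; the function name is mine.

```python
import numpy as np

def feature_map_to_sequence(fmap):
    """Slice a (C, H, W) feature map into W feature vectors, one per
    column, ordered left to right: the i-th vector comes from the
    i-th column of the map."""
    c, h, w = fmap.shape
    # Each column (C x H values) is flattened into a single vector.
    return [fmap[:, :, i].reshape(c * h) for i in range(w)]
```

The resulting list has one vector per column, so its length tracks the image width, which is how the network accommodates text sequences of varying length.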
In the conventional CRNN model, the convolutional layers of VGG-16 are used as the backbone network, but the VGG network has more parameters and lower accuracy, so in this embodiment the convolutional layers of a trained MobileNet are preferably used as the backbone of the CRNN.
Sequence identification: in the CRNN model, a deep bidirectional RNN (Recurrent neural network) is located above the convolutional network and is used to convert the feature sequence into a character sequence. In the present application, a characteristic sequence X having a length T and output from a convolutional layer is represented by { X }1,…XTEach output y of the cyclic network is derived from the entire sequence X. The advantages of using the recurrent neural network are three points, first, RNN has a strong ability to utilize context information in a sequence, can effectively utilize context clues in an image-based sequence recognition problem, and is more stable than a single character recognition-based method. Second, the RNN can use a directional propagation algorithm to pass gradients (gradient refers to the gradient between each layer of the neural network in the back propagation algorithm) to the input layer, which can be trained with convolutional layers in the convolutional recurrent neural network model of this embodiment. Third, the RNN network can handle sequences of arbitrary length.
In the recurrent neural network of the model, LSTM (Long Short-Term Memory) units are used; an LSTM unit comprises a memory cell and three multiplicative gates (an input gate, an output gate, and a forget gate).
LSTM units are directional: they can use past context (in one layer of the RNN, the ith LSTM unit can use the hidden states of units 1 to i-1). For most text sequence recognition problems (including traffic sign semantic extraction), however, the contexts in both the forward and backward directions are useful and complementary. Therefore, two LSTM sequences can be combined into a bidirectional LSTM, and multiple bidirectional LSTMs can be stacked into a deep bidirectional LSTM network.
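The bidirectional combination described above (one pass over the sequence forward, one backward, hidden states concatenated per step) can be sketched as follows. To keep the sketch short it uses plain tanh RNN cells instead of gated LSTM cells; the weight shapes and function names are illustrative assumptions.

```python
import numpy as np

def simple_rnn(xs, Wx, Wh):
    """Minimal tanh RNN over a list of input vectors; returns the
    hidden state at every step."""
    h = np.zeros(Wh.shape[0])
    out = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        out.append(h)
    return out

def bidirectional_rnn(xs, Wx_f, Wh_f, Wx_b, Wh_b):
    """Run one forward and one backward pass over the sequence and
    concatenate the two hidden states at each step, mirroring the
    bidirectional layer structure described above."""
    fwd = simple_rnn(xs, Wx_f, Wh_f)
    bwd = simple_rnn(xs[::-1], Wx_b, Wh_b)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Each output vector thus sees context from both directions; stacking several such layers (feeding one layer's outputs as the next layer's inputs) gives the deep bidirectional network.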
Transcription layer: in the CRNN model, the transcription layer is implemented by a CTC (Connectionist Temporal Classification) layer, whose function is to convert the predicted sequence of the RNN into the target sequence according to a conditional probability whose log-likelihood is used as the loss function of the whole network. Thus, for the entire convolutional recurrent neural network model, only the image containing the signpost and the structured character sequence corresponding to the signpost semantics are needed.
The transcription layer operates in two modes: with a dictionary (lexicon) and without. The dictionary is a set of label sequences used to constrain and aid the prediction of the CTC; in this embodiment, the dictionary should be the set of all legal sign semantics. In the dictionary-based mode, the CTC selects the label sequence with the highest probability from the dictionary as the output sequence. In the dictionary-free mode, the prediction is not subject to any restriction, but dictionary support is lost.
In the traffic sign recognition problem, it is more appropriate to use a dictionary-based CTC module, because the content of signs is very limited; moreover, the larger the dictionary capacity, the closer its distribution is to the real data, and the closer the frequency of each element is to its real probability.
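Dictionary-based selection can be made concrete with the standard CTC forward algorithm: score each legal label sequence against the per-timestep output distributions and keep the best. A minimal sketch under stated assumptions: labels are small integer ids, label 0 is the CTC blank, and `probs` is a (T, V) array of normalized per-timestep probabilities; the function names are mine.

```python
import numpy as np

def ctc_prob(probs, labels, blank=0):
    """Probability of a label sequence under the CTC forward algorithm.
    probs: (T, V) per-timestep distributions; labels: list of label ids."""
    ext = [blank]
    for l in labels:                       # interleave labels with blanks
        ext += [l, blank]
    S, T = len(ext), len(probs)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skip transition, allowed only between distinct non-blanks.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

def lexicon_decode(probs, lexicon, blank=0):
    """Dictionary-based decoding: score every legal sequence in the
    lexicon and return the most probable one."""
    return max(lexicon, key=lambda labels: ctc_prob(probs, labels, blank))
```

Exhaustively scoring every lexicon entry is viable precisely because, as noted above, the set of legal sign semantics is small; for large lexicons a prefix-tree search would be used instead.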
Optionally, the category information of the sign board (obtained by the preset target detection model) and certain verification rules for some traffic signs can also be applied when outputting the prediction; if verification fails, appropriate feedback is given, for example "the sign content cannot be identified", so as to avoid misleading the automatic driving system with an erroneous prediction.
Compared with the preset target detection model, the CRNN model in this embodiment has relatively low complexity and is easier to train, so the Adadelta algorithm is preferably used to automatically adapt the gradient step in each dimension, making the model converge faster. Since the CRNN model is smaller and the input images occupy less memory, the batch-size on each GPU can be set to 64 or 128 (depending on the GPU video memory size) to let the model converge more quickly and stably.
Further, after the preset target detection model and the CRNN model have been trained separately, the two models can be cascaded for further tuning. Specifically, a mode of multi-task training and gradient transfer, in which all historical road sample images, their corresponding position information and category information, and the semantic information corresponding to the historical sub-images in the historical road sample images are used for end-to-end training of the cascaded preset target detection model and CRNN model, can further improve the model effect. Since both networks have already been trained, the initial learning rate has to be set lower in this stage. The training samples may use all sample pictures, noting that only complex signboards need to be input when sub-images are entered into the CRNN model. In this training process, because two neural network models are cascaded and the occupied video memory is large, the batch-size can be set to 1.
After the training of the CRNN model is completed, the complex traffic sign board in the current road image acquired by the camera in real time may be identified using the model. Referring to fig. 6, fig. 6 is a flowchart of another traffic sign identification method provided in an embodiment of the present invention. The method may be executed by a traffic sign identification apparatus; the apparatus may be implemented in software and/or hardware, and may generally be integrated in a vehicle-mounted terminal such as a vehicle-mounted Computer or a vehicle-mounted Industrial Personal Computer (IPC), which is not limited by the embodiment of the present invention. This identification method is mainly directed to identifying complex traffic sign boards, and as shown in fig. 6, comprises the following steps:
and S410, acquiring the position information and the category information of the current sub-image of the traffic sign area in the current road image.
The position information and the category information are obtained by performing feature extraction on the traffic sign image to be identified in the current road image with a preset target detection model; for the specific extraction process, reference may be made to the contents provided in the above embodiments, which are not repeated here.
And S420, extracting the features of the current sub-image by using the convolutional neural network CNN according to the position information and the category information to obtain a feature sequence of the current sub-image.
As an optional implementation manner, obtaining the feature sequence of the current sub-image by performing feature extraction on the current sub-image may include: performing feature extraction on the current sub-image from left to right using a Convolutional Neural Network (CNN), so that each column of the feature map corresponds to one feature vector, and the feature vectors corresponding to the columns form the feature sequence of the current sub-image. Since convolution, max pooling, and element-wise activation functions all operate on local regions, they are all translation invariant; each column of the final feature map corresponds to a rectangular region of the original input sub-image.
S430, obtaining target semantic information corresponding to the current sub-image according to the feature sequence and a preset convolutional recurrent neural network (CRNN) model, wherein the CRNN model associates the feature sequence of the sub-image of the traffic sign area in the road image with the semantic information corresponding to the sub-image.
As an optional implementation manner, obtaining target semantic information corresponding to the current sub-image according to the feature sequence and the preset convolutional recurrent neural network CRNN model, includes:
converting the characteristic sequence into a character sequence based on a Recurrent Neural Network (RNN) in the convolutional recurrent neural network model;
based on a transcription layer in the convolutional recurrent neural network model, performing redundancy removal processing on the character sequence according to a preset conditional probability to obtain target semantic information corresponding to the target character sequence, wherein the log-likelihood function of the preset conditional probability is the loss function of the convolutional recurrent neural network model.
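The "redundancy removal" the transcription layer performs on a raw per-timestep prediction can be sketched as the classic CTC collapse rule: merge consecutive repeated labels, then drop blanks. A minimal illustration, assuming label 0 denotes the CTC blank (the label ids in the test are illustrative):

```python
def collapse_ctc(path, blank=0):
    """Collapse a raw best-path prediction: merge consecutive repeats,
    then remove blank labels, yielding the final label sequence."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out
```

Note that a blank between two identical labels keeps them distinct, which is how CTC represents doubled characters.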
It should be noted that in the RNN of the present embodiment, two LSTM sequences are combined into a bidirectional LSTM, and multiple bidirectional LSTMs can be stacked into a deep bidirectional LSTM network.
It should be further noted that, in this embodiment, performing redundancy removal processing on the character sequence according to the preset conditional probability, based on the transcription layer in the convolutional recurrent neural network model, to obtain the target semantic information corresponding to the target character sequence, includes:
selecting the character sequence with the highest probability from a preset dictionary, using a Connectionist Temporal Classification (CTC) module, as the target character sequence, wherein the preset dictionary is the set of all legal traffic sign semantics. This is because the content of signs is quite limited, so a dictionary-based CTC module is more appropriate; the larger the dictionary capacity, the closer its distribution is to the real data, and the closer the frequency of each element is to its real probability.
According to the technical scheme provided by this embodiment, the position information and category information of the current sub-image of the traffic sign area in the current road image can be obtained by identifying the current road image with the preset target detection model. By inputting the position information and category information of the sub-image into the preset convolutional recurrent neural network model, the target semantic information corresponding to the sub-image can be obtained, making the traffic sign recognition result more accurate.
Example Three
Referring to fig. 7, fig. 7 is a block diagram illustrating a training apparatus for a neural network model according to an embodiment of the present invention. As shown in fig. 7, the apparatus includes: a historical road sample image acquisition module 510, a second training sample set generation module 520, and a preset target detection model training module 530, wherein:
a historical road sample image obtaining module 510 configured to obtain a historical road sample image labeled with the position information, the category information, and the semantic information of the traffic sign;
a second training sample set generating module 520 configured to extract the position information of the traffic signs in the historical road sample images, and generate a second training sample set based on a plurality of pieces of position information and their corresponding category information and semantic information;
a preset target detection model training module 530 configured to establish a preset target detection model based on a machine learning algorithm, the preset target detection model associating each category information in the second training sample set with its corresponding semantic information.
In this embodiment, by establishing the preset target detection model, a one-to-one mapping between the category information of simple traffic signs and their semantic information can be built, so that the semantic information is obtained directly and quickly, and a further analysis step for the sign semantics is avoided.
After the preset target detection model is obtained, the simple traffic sign board in the current road image acquired by the camera in real time can be positioned and identified by the model.
Referring to fig. 8, fig. 8 is a structural diagram of a traffic sign recognition apparatus according to an embodiment of the present invention. As shown in fig. 8, the apparatus includes: a current road image acquisition module 610, a current position feature extraction module 620, and a second semantic information identification module 630, wherein:
a current road image acquisition module 610 configured to acquire a current road image;
a current location feature extraction module 620 configured to extract current location information of the traffic sign in the current road image;
a second semantic information recognition module 630 configured to determine category information of the traffic signboard based on the location information and a preset target detection model, and determine semantic information of the traffic signboard according to the category information, wherein the preset target detection model associates the category information with the semantic information.
By using the preset target detection model, the category information of the traffic sign at the current position can be determined. Because the category information and the semantic information have a one-to-one mapping relationship, the semantic information of the traffic sign can be determined directly from the category information.
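For simple signs, the one-to-one mapping described above amounts to a direct lookup table from detected category to semantics. A minimal sketch, where the category IDs and semantic strings are hypothetical placeholders (the patent does not give the actual label set):

```python
# Hypothetical category IDs and semantics, for illustration only.
CATEGORY_TO_SEMANTIC = {
    "speed_limit_40": "maximum speed 40 km/h",
    "no_overtaking": "overtaking prohibited",
    "stop": "come to a complete stop",
}

def semantic_of(category):
    """Direct lookup: one-to-one map from detected category to semantics."""
    return CATEGORY_TO_SEMANTIC[category]
```

Because the map is one-to-one, no further parsing of the sign's content is needed once the detector emits the category.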
Example Four
Referring to fig. 9, fig. 9 is a block diagram illustrating a training apparatus for a neural network model according to another embodiment of the present invention. As shown in fig. 9, the apparatus includes: a historical subimage information acquisition module 710, a historical feature sequence determination module 720, a first training sample set generation module 730, and a neural network model training module 740, wherein:
a history subimage information obtaining module 710 configured to obtain a history subimage only including a traffic sign position area in a history road image, and category information and semantic information of the history subimage, where the history subimage and the category information are obtained by performing feature extraction on a to-be-identified traffic sign image in the history road image by using a preset target detection model, and the preset target detection model associates the category information of the traffic sign with the semantic information;
a historical feature sequence determining module 720, configured to perform feature extraction on the historical sub-images to obtain a feature sequence of the historical sub-images;
a first training sample set generating module 730 configured to generate a first training sample set based on the feature sequences, the category information, and the corresponding semantic information of the plurality of historical sub-images;
the neural network model training module 740 is configured to train an initial neural network model based on a machine learning algorithm to obtain a CRNN model, where the CRNN model associates a feature sequence and category information of each historical sub-image in the first training sample set with corresponding semantic information.
In this embodiment, by establishing the preset convolutional recurrent neural network model, a one-to-one mapping between the category information and the semantic information of complex traffic signs can be built, so that the semantic information is obtained directly and quickly and the recognition accuracy of traffic signs is improved.
On the basis of the above embodiment, the preset target detection model is constructed in the following manner:
the historical road sample image acquisition module is configured to acquire a historical road sample image marked with the position information, the category information and the semantic information of the traffic sign;
the second training sample set generating module is configured to extract the position information of the traffic signs in the historical road sample images and generate a second training sample set based on the position information and its corresponding category information and semantic information;
a preset target detection model training module configured to establish a preset target detection model based on a machine learning algorithm, the preset target detection model associating each category information in the second training sample set with its corresponding semantic information.
On the basis of the above embodiment, the apparatus further includes:
and the end-to-end training module is configured to perform end-to-end training on the cascaded preset target detection model and the CRNN model by utilizing all historical road sample images, the corresponding position information and the corresponding category information of the historical road sample images and the corresponding semantic information of the historical sub-images in the historical road sample images.
After the preset convolutional recurrent neural network is obtained, the model can be used to locate and identify complex traffic signs in the current road image acquired by the camera in real time. Referring to fig. 10, fig. 10 is a block diagram of a traffic sign recognition apparatus according to another embodiment of the present invention. As shown in fig. 10, the apparatus includes: a current sub-image information acquisition module 810, a current feature sequence determination module 820, and a first semantic information recognition module 830, wherein:
a current sub-image information obtaining module 810 configured to obtain position information and category information of a current sub-image of a traffic sign area in a current road image, wherein the position information and the category information are obtained by performing feature extraction on a traffic sign image to be identified in the current road image;
a current feature sequence determining module 820 configured to perform feature extraction on the current sub-image by using a convolutional neural network (CNN) according to the position information and the category information, to obtain a feature sequence of the current sub-image;
a first semantic information identifying module 830 configured to obtain target semantic information corresponding to the current sub-image according to the feature sequence and a preset convolutional recurrent neural network (CRNN) model, where the CRNN model associates the feature sequence of the sub-image of the traffic sign area in the road image with the corresponding semantic information.
On the basis of the above embodiment, the CRNN model is constructed in the following manner:
the historical subimage information acquisition module is configured to acquire historical subimages only containing traffic sign position areas in historical road images, and category information and semantic information of the historical subimages, wherein the historical subimages and the category information of the historical subimages are obtained by performing feature extraction on to-be-identified traffic sign images in the historical road images by using a preset target detection model;
the historical characteristic sequence determining module is configured to extract the characteristics of the historical sub-images to obtain a characteristic sequence of the historical sub-images;
the first training sample set generation module is configured to generate a first training sample set based on the feature sequences, the category information and the corresponding semantic information of the plurality of historical sub-images;
and the neural network model training module is configured to train an initial neural network model based on a machine learning algorithm to obtain a CRNN model, where the CRNN model associates the feature sequence and category information of each historical sub-image in the first training sample set with the corresponding semantic information.
On the basis of the above embodiment, the position information of the current sub-image, obtained by feature extraction on the to-be-identified traffic sign image in the current road image, is determined by the following modules:
the initial feature determination module is configured to utilize a full convolution network layer in a preset target detection model to perform feature extraction on a to-be-identified traffic sign image in the current road image to obtain initial features of the traffic sign image, wherein the initial features at least comprise edge features of the traffic sign;
and the optimal candidate extracting frame determining module is configured to extract a network layer by using the characteristics in the preset target detection model, determine an optimal candidate extracting frame comprising the traffic sign board based on the initial characteristics, use the position of the optimal candidate extracting frame as the position of a positioning boundary frame of the traffic sign board, and output a current sub-image comprising the positioning boundary frame and the internal traffic sign board by using a convolutional neural network R-CNN based on the region.
On the basis of the above embodiment, the optimal candidate extracting frame determining module is specifically configured to:
performing separation convolution on the initial features to obtain a separation matrix;
and determining an optimal candidate extracting frame containing the traffic sign board based on the separation matrix, taking the position of the optimal candidate extracting frame as the position of a positioning boundary frame of the traffic sign board, and outputting a current sub-image containing the positioning boundary frame and the internal traffic sign board by utilizing a convolutional neural network R-CNN based on the region.
On the basis of the above embodiment, the current feature sequence determination module is specifically configured to:
and performing feature extraction on the current sub-image in a left-to-right direction by using the convolutional neural network (CNN), so that each column of the feature map corresponds to one feature vector, and the feature vectors corresponding to the columns of the feature map form the feature sequence of the current sub-image.
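The column-wise slicing described above can be sketched without any deep-learning framework: given a (channels x height x width) feature map, each width position is flattened into one vector, and scanning left to right yields the sequence. The shapes and values below are illustrative only.

```python
def feature_map_to_sequence(feature_map):
    """Slice a (channels x height x width) feature map into per-column vectors.

    Scanning columns left to right yields one feature vector per column,
    matching the reading order of horizontal sign text.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    sequence = []
    for w in range(width):                      # left-to-right scan
        column = [feature_map[c][h][w]
                  for c in range(channels) for h in range(height)]
        sequence.append(column)
    return sequence

fmap = [[[1, 2, 3], [4, 5, 6]]]                 # 1 channel, 2 rows, 3 columns
seq = feature_map_to_sequence(fmap)             # 3 vectors of length 2
```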
On the basis of the above embodiment, the first semantic information identifying module includes:
a character sequence conversion unit configured to convert the feature sequence into a character sequence based on a Recurrent Neural Network (RNN) in the convolutional recurrent neural network model;
the semantic information determining unit is configured to perform redundancy removal on the character sequence based on the transcription layer in the convolutional recurrent neural network model according to a preset conditional probability, to obtain the target semantic information corresponding to the target character sequence; the log-likelihood function of the preset conditional probability is the loss function of the convolutional recurrent neural network model.
On the basis of the foregoing embodiment, the semantic information determining unit is specifically configured to:
selecting, by using a Connectionist Temporal Classification (CTC) module, the character sequence with the highest probability from a preset dictionary as the target character sequence; the preset dictionary is the set of all legal traffic sign semantics.
The apparatus provided by the embodiments of the present invention can execute the traffic sign recognition method or the neural network model training method provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the executed method. For technical details not described in detail in the above embodiments, reference may be made to the methods provided by any embodiment of the present invention.
Example Five
Referring to fig. 11, fig. 11 is a schematic structural diagram of a vehicle-mounted terminal according to an embodiment of the present invention. As shown in fig. 11, the in-vehicle terminal may include:
a memory 901 in which executable program code is stored;
a processor 902 coupled to a memory 901;
the processor 902 calls the executable program code stored in the memory 901 to execute the method for recognizing the traffic sign or the method for training the neural network model provided by any embodiment of the present invention.
Embodiments of the present invention further provide a computer-readable storage medium storing a computer program, where the computer program includes instructions for executing part or all of the steps of the method for recognizing a traffic sign or the method for training a neural network model provided in any of the embodiments of the present invention.
Embodiments of the present invention further provide a computer program product, which when run on a computer, causes the computer to execute part or all of the steps of the method for identifying a traffic sign or the method for training a neural network model provided in any of the embodiments of the present invention.
In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not imply an inevitable order of execution, and the execution order of the processes should be determined by their functions and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
In the embodiments provided herein, it should be understood that "B corresponding to A" means that B is associated with A and can be determined from A. It should also be understood, however, that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated units, if implemented as software functional units and sold or used as a stand-alone product, may be stored in a computer accessible memory. Based on such understanding, the technical solution of the present invention, which is a part of or contributes to the prior art in essence, or all or part of the technical solution, can be embodied in the form of a software product, which is stored in a memory and includes several requests for causing a computer device (which may be a personal computer, a server, a network device, or the like, and may specifically be a processor in the computer device) to execute part or all of the steps of the above-described method of each embodiment of the present invention.
It will be understood by those skilled in the art that all or part of the steps in the methods of the embodiments described above may be implemented by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, where the storage medium includes Read-Only Memory (ROM), Random Access Memory (RAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-Time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical storage, magnetic disk, magnetic tape, or any other medium that can be used to carry or store data and can be read by a computer.
The traffic sign recognition method and the neural network model training method and apparatus disclosed by the embodiments of the present invention are described in detail above. Specific examples are applied herein to explain the principles and implementations of the present invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (16)

1. A method for identifying a traffic sign, comprising:
acquiring position information and category information of a current sub-image of a traffic sign area in a current road image, wherein the position information and the category information are obtained by performing feature extraction on a traffic sign image to be identified in the current road image;
according to the position information and the category information, performing feature extraction on the current sub-image by using a Convolutional Neural Network (CNN) to obtain a feature sequence of the current sub-image;
and obtaining target semantic information corresponding to the current sub-image according to the feature sequence and a preset convolutional recurrent neural network (CRNN) model, wherein the CRNN model associates the feature sequence of the sub-image of the traffic sign area in the road image with the corresponding semantic information.
2. The method of claim 1, wherein the CRNN model is constructed by:
acquiring historical sub-images and category information and semantic information thereof, wherein the historical sub-images and the category information thereof only comprise a traffic sign position area in a historical road image, and the historical sub-images and the category information thereof are obtained by utilizing a preset target detection model to perform feature extraction on a traffic sign image to be identified in the historical road image, and the preset target detection model enables the category information of the traffic sign to be associated with the semantic information;
extracting the characteristics of the historical subimages to obtain a characteristic sequence of the historical subimages;
generating a first training sample set based on the feature sequences, the category information and the corresponding semantic information of the plurality of historical sub-images;
training an initial neural network model based on a machine learning algorithm to obtain a CRNN model, wherein the CRNN model enables the characteristic sequence and the category information of each historical sub-image in the first training sample set to be associated with the corresponding semantic information.
3. The method according to claim 1 or 2, wherein the feature extraction of the traffic sign image to be identified in the current road image to obtain the position information of the current sub-image comprises:
performing feature extraction on the traffic sign image to be identified in the current road image by using a full convolution network layer in a preset target detection model to obtain initial features of the traffic sign image, wherein the initial features at least comprise edge features of the traffic sign;
and extracting a network layer by utilizing the characteristics in a preset target detection model, determining an optimal candidate extracting frame containing the traffic sign board based on the initial characteristics, taking the position of the optimal candidate extracting frame as the position of a positioning boundary frame of the traffic sign board, and outputting a current sub-image containing the positioning boundary frame and the internal traffic sign board by utilizing a convolutional neural network R-CNN based on a region.
4. The method of claim 3, wherein determining the best candidate extraction box containing the traffic sign based on the initial features comprises:
performing separation convolution on the initial features to obtain a separation matrix;
and determining the best candidate extracting frame containing the traffic sign board based on the separation matrix.
5. The method according to any one of claims 1 to 4, wherein extracting features of the current sub-image by using a Convolutional Neural Network (CNN) to obtain a feature sequence of the current sub-image comprises:
and performing feature extraction on the current sub-image in a left-to-right direction by using the convolutional neural network (CNN), so that each column of the feature map corresponds to one feature vector, and the feature vectors corresponding to the columns of the feature map form the feature sequence of the current sub-image.
6. The method according to any one of claims 1 to 5, wherein obtaining the target semantic information corresponding to the current sub-image according to the feature sequence and a preset convolutional recurrent neural network (CRNN) model comprises:
converting the characteristic sequence into a character sequence based on a Recurrent Neural Network (RNN) in the convolutional recurrent neural network model;
based on a transcription layer in the convolutional recurrent neural network model, performing redundancy removal on the character sequence according to a preset conditional probability to obtain target semantic information corresponding to a target character sequence; and the log-likelihood function of the preset conditional probability is a loss function of the convolutional recurrent neural network model.
7. The method of claim 6, wherein the performing redundancy removal processing on the character sequence according to a preset conditional probability based on a transcription layer in the convolutional recurrent neural network model to obtain target semantic information corresponding to a target character sequence comprises:
selecting, by using a Connectionist Temporal Classification (CTC) module, the character sequence with the highest probability from a preset dictionary as the target character sequence; the preset dictionary is the set of all legal traffic sign semantics.
8. An apparatus for identifying a traffic sign, comprising:
the system comprises a current subimage information acquisition module, a current subimage information acquisition module and a traffic sign recognition module, wherein the current subimage information acquisition module is configured to acquire position information and category information of a current subimage of a traffic sign area in a current road image, and the position information and the category information are obtained by performing feature extraction on a to-be-recognized traffic sign image in the current road image;
a current feature sequence determination module configured to perform feature extraction on the current sub-image by using a convolutional recurrent neural network (CNN) according to the position information and the category information to obtain a feature sequence of the current sub-image;
and the semantic information identification module is configured to obtain target semantic information corresponding to the current sub-image according to the feature sequence and a preset convolutional recurrent neural network (CRNN) model, and the CRNN model associates the feature sequence of the image of the traffic sign area in the road image with the corresponding semantic information.
9. A training method of a neural network model is characterized by comprising the following steps:
acquiring historical sub-images and category information and semantic information thereof, wherein the historical sub-images and the category information thereof only comprise a traffic sign position area in a historical road image, and the historical sub-images and the category information thereof are obtained by utilizing a preset target detection model to perform feature extraction on a traffic sign image to be identified in the historical road image, and the preset target detection model enables the category information of the traffic sign to be associated with the semantic information;
extracting the characteristics of the historical subimages to obtain a characteristic sequence of the historical subimages;
generating a first training sample set based on the feature sequences, the category information and the corresponding semantic information of the plurality of historical sub-images;
training an initial neural network model based on a machine learning algorithm to obtain a CRNN model, wherein the CRNN model enables the characteristic sequence and the category information of each historical sub-image in the first training sample set to be associated with the corresponding semantic information.
10. The method of claim 9, wherein the preset target detection model is constructed by:
acquiring a historical road sample image marked with position information, category information and semantic information of a traffic sign;
extracting position information of a traffic sign in the historical road sample image, and generating a second training sample set based on a plurality of position information and corresponding category information and semantic information thereof;
and establishing a preset target detection model based on a machine learning algorithm, wherein the preset target detection model enables each category information in the second training sample set to be associated with the corresponding semantic information.
11. The method of claim 10, wherein after obtaining the CRNN model, the method further comprises:
and performing end-to-end training on the cascaded preset target detection model and the CRNN model by using all historical road sample images, corresponding position information and category information thereof, and semantic information corresponding to historical sub-images in the historical road sample images.
12. An apparatus for training a neural network model, comprising:
the historical subimage information acquisition module is configured to acquire historical subimages only containing traffic sign position areas in the historical road images, and category information and semantic information of the historical subimages, wherein the historical subimages and the category information of the historical subimages are obtained by performing feature extraction on to-be-identified traffic sign images in the historical road images by using a preset target detection model, and the preset target detection model enables the category information of the traffic signs to be associated with the semantic information;
the historical characteristic sequence determining module is configured to extract the characteristics of the historical sub-images to obtain a characteristic sequence of the historical sub-images;
the first training sample set generation module is configured to generate a first training sample set based on the feature sequences, the category information and the corresponding semantic information of the plurality of historical sub-images;
and the neural network model training module is configured to train an initial neural network model based on a machine learning algorithm to obtain a CRNN model, wherein the CRNN model associates the feature sequence and category information of each historical sub-image in the first training sample set with the corresponding semantic information.
13. A method for identifying a traffic sign, comprising:
acquiring a current road image;
extracting current position information of a traffic sign in the current road image;
and determining the category information of the traffic sign board based on the current position information and a preset target detection model, and determining the semantic information of the traffic sign board according to the category information, wherein the preset target detection model enables the category information of the traffic sign board to be associated with the semantic information.
14. An apparatus for identifying a traffic sign, comprising:
a current road image acquisition module configured to acquire a current road image;
a current position feature extraction module configured to extract current position information of a traffic sign in the current road image;
a semantic information recognition module configured to determine category information of the traffic signboard based on the position information and a preset target detection model, and determine semantic information of the traffic signboard according to the category information, wherein the preset target detection model associates the category information of the traffic signboard with the semantic information.
15. A training method of a neural network model is characterized by comprising the following steps:
acquiring a historical road sample image marked with position information, category information and semantic information of a traffic sign;
extracting position information of a traffic sign in the historical road sample image, and generating a second training sample set based on a plurality of position information and corresponding category information and semantic information thereof;
and establishing a preset target detection model based on a machine learning algorithm, wherein the preset target detection model enables each category information in the second training sample set to be associated with the corresponding semantic information.
16. An apparatus for training a neural network model, comprising:
the historical road sample image acquisition module is configured to acquire a historical road sample image marked with the position information, the category information and the semantic information of the traffic sign;
a second training sample set generating module configured to extract the position information of traffic signs in the historical road sample images, and to generate a second training sample set based on the position information and the corresponding category information and semantic information;
a preset target detection model training module configured to establish a preset target detection model based on a machine learning algorithm, wherein the preset target detection model associates each piece of category information in the second training sample set with its corresponding semantic information.
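The training flow of claims 15–16 can be sketched as follows. This is a minimal, hypothetical illustration: the record layout (`"signs"`, `"position"`, `"category"`, `"semantics"`) is invented for the example, since the patent does not fix an annotation format, and the "model" here is reduced to the claimed category-to-semantics association plus the (position, category) sample pairs.

```python
def build_training_set(labeled_images):
    """From historical road sample images labeled with position, category and
    semantic information, produce (position, category) training samples and
    the category -> semantics association required of the preset target
    detection model in claim 15."""
    samples = []               # second training sample set: (position, category)
    category_to_semantics = {} # association established during training
    for img in labeled_images:
        for ann in img["signs"]:
            samples.append((ann["position"], ann["category"]))
            category_to_semantics[ann["category"]] = ann["semantics"]
    return samples, category_to_semantics

# Illustrative labeled historical road sample images (format assumed).
labeled = [
    {"signs": [
        {"position": (10, 20, 30, 30), "category": "stop",
         "semantics": "stop and yield"},
        {"position": (50, 60, 28, 28), "category": "no_entry",
         "semantics": "entry prohibited"},
    ]},
]
samples, mapping = build_training_set(labeled)
```

A real implementation would additionally fit a detector on the image crops indicated by the positions; the sketch stops at the data-preparation and association steps the claims actually recite.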
CN201910081841.2A 2019-01-28 2019-01-28 Traffic sign recognition method, and training method and device of neural network model Withdrawn CN111488770A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910081841.2A CN111488770A (en) 2019-01-28 2019-01-28 Traffic sign recognition method, and training method and device of neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910081841.2A CN111488770A (en) 2019-01-28 2019-01-28 Traffic sign recognition method, and training method and device of neural network model

Publications (1)

Publication Number Publication Date
CN111488770A true CN111488770A (en) 2020-08-04

Family

ID=71791550

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910081841.2A Withdrawn CN111488770A (en) 2019-01-28 2019-01-28 Traffic sign recognition method, and training method and device of neural network model

Country Status (1)

Country Link
CN (1) CN111488770A (en)

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12020476B2 (en) 2017-03-23 2024-06-25 Tesla, Inc. Data synthesis for autonomous control systems
US11487288B2 (en) 2017-03-23 2022-11-01 Tesla, Inc. Data synthesis for autonomous control systems
US11681649B2 (en) 2017-07-24 2023-06-20 Tesla, Inc. Computational array microprocessor system using non-consecutive data formatting
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US11403069B2 (en) 2017-07-24 2022-08-02 Tesla, Inc. Accelerated mathematical engine
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US11797304B2 (en) 2018-02-01 2023-10-24 Tesla, Inc. Instruction set architecture for a vector computational unit
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11734562B2 (en) 2018-06-20 2023-08-22 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11841434B2 (en) 2018-07-20 2023-12-12 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US11983630B2 (en) 2018-09-03 2024-05-14 Tesla, Inc. Neural networks for embedded devices
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
US11893774B2 (en) 2018-10-11 2024-02-06 Tesla, Inc. Systems and methods for training machine models with augmented data
US11665108B2 (en) 2018-10-25 2023-05-30 Tesla, Inc. QoS manager for system on a chip communications
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11908171B2 (en) 2018-12-04 2024-02-20 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US12014553B2 (en) 2019-02-01 2024-06-18 Tesla, Inc. Predicting three-dimensional features for autonomous driving
US11748620B2 (en) 2019-02-01 2023-09-05 Tesla, Inc. Generating ground truth for machine learning from time series elements
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US11790664B2 (en) 2019-02-19 2023-10-17 Tesla, Inc. Estimating object properties using visual image data
CN111931683B (en) * 2020-08-25 2023-09-05 腾讯科技(深圳)有限公司 Image recognition method, device and computer readable storage medium
CN111931683A (en) * 2020-08-25 2020-11-13 腾讯科技(深圳)有限公司 Image recognition method, image recognition device and computer-readable storage medium
CN111931693A (en) * 2020-08-31 2020-11-13 平安国际智慧城市科技股份有限公司 Traffic sign recognition method, device, terminal and medium based on artificial intelligence
US11887376B2 (en) 2020-12-03 2024-01-30 Beijing Baidu Netcom Science Technology Co., Ltd. Method and apparatus of estimating road condition, and method and apparatus of establishing road condition estimation model
CN112560609A (en) * 2020-12-03 2021-03-26 北京百度网讯科技有限公司 Road condition estimation method, method for establishing road condition estimation model and corresponding device
CN112712066A (en) * 2021-01-19 2021-04-27 腾讯科技(深圳)有限公司 Image recognition method and device, computer equipment and storage medium
CN112818823A (en) * 2021-01-28 2021-05-18 建信览智科技(北京)有限公司 Text extraction method based on bill content and position information
CN112818823B (en) * 2021-01-28 2024-04-12 金科览智科技(北京)有限公司 Text extraction method based on bill content and position information
CN113348463A (en) * 2021-04-26 2021-09-03 华为技术有限公司 Information processing method and device
CN113348463B (en) * 2021-04-26 2022-05-10 华为技术有限公司 Information processing method and device
WO2022226723A1 (en) * 2021-04-26 2022-11-03 华为技术有限公司 Information processing method and apparatus
CN113762257A (en) * 2021-09-30 2021-12-07 时趣互动(北京)科技有限公司 Identification method and device for marks in makeup brand images
CN113762257B (en) * 2021-09-30 2024-07-05 时趣互动(北京)科技有限公司 Identification method and device for mark in make-up brand image
CN113963329A (en) * 2021-10-11 2022-01-21 浙江大学 Digital traffic sign detection and identification method based on double-stage convolutional neural network
CN113963329B (en) * 2021-10-11 2022-07-05 浙江大学 Digital traffic sign detection and identification method based on double-stage convolutional neural network
CN114419594A (en) * 2022-01-17 2022-04-29 智道网联科技(北京)有限公司 Method and device for identifying intelligent traffic guideboard
CN114495061A (en) * 2022-01-25 2022-05-13 青岛海信网络科技股份有限公司 Road traffic sign board identification method and device
CN114495061B (en) * 2022-01-25 2024-04-05 青岛海信网络科技股份有限公司 Road traffic sign board identification method and device
CN117523535A (en) * 2024-01-08 2024-02-06 浙江零跑科技股份有限公司 Traffic sign recognition method, terminal equipment and storage medium
CN117523535B (en) * 2024-01-08 2024-04-12 浙江零跑科技股份有限公司 Traffic sign recognition method, terminal equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111488770A (en) Traffic sign recognition method, and training method and device of neural network model
CN113688652B (en) Abnormal driving behavior processing method and device
Busta et al. Deep textspotter: An end-to-end trainable scene text localization and recognition framework
CN106960206B (en) Character recognition method and character recognition system
CN108960189B (en) Image re-identification method and device and electronic equipment
CN110276253A (en) A kind of fuzzy literal detection recognition method based on deep learning
CN107506763A (en) A kind of multiple dimensioned car plate precise positioning method based on convolutional neural networks
CN111260666A (en) Image processing method and device, electronic equipment and computer readable storage medium
CN111046971A (en) Image recognition method, device, equipment and computer readable storage medium
KR102197930B1 (en) System and method for recognizing license plates
CN112949578B (en) Vehicle lamp state identification method, device, equipment and storage medium
CN112861739A (en) End-to-end text recognition method, model training method and device
CN115131797A (en) Scene text detection method based on feature enhancement pyramid network
CN111274981A (en) Target detection network construction method and device and target detection method
CN116311279A (en) Sample image generation, model training and character recognition methods, equipment and media
CN112348028A (en) Scene text detection method, correction method, device, electronic equipment and medium
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
Ling et al. Development of vertical text interpreter for natural scene images
CN111523351A (en) Neural network training method and device and electronic equipment
CN117765485A (en) Vehicle type recognition method, device and equipment based on improved depth residual error network
CN116884003B (en) Picture automatic labeling method and device, electronic equipment and storage medium
CN111291754A (en) Text cascade detection method, device and storage medium
KR102026280B1 (en) Method and system for scene text detection using deep learning
CN111898570A (en) Method for recognizing text in image based on bidirectional feature pyramid network
CN116958512A (en) Target detection method, target detection device, computer readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211123

Address after: 215100 floor 23, Tiancheng Times Business Plaza, No. 58, qinglonggang Road, high speed rail new town, Xiangcheng District, Suzhou, Jiangsu Province

Applicant after: MOMENTA (SUZHOU) TECHNOLOGY Co.,Ltd.

Address before: Room 601-a32, Tiancheng information building, No. 88, South Tiancheng Road, high speed rail new town, Xiangcheng District, Suzhou City, Jiangsu Province

Applicant before: MOMENTA (SUZHOU) TECHNOLOGY Co.,Ltd.

WW01 Invention patent application withdrawn after publication

Application publication date: 20200804
