CN114550137B - Method and device for identifying traffic sign board and electronic equipment - Google Patents

Method and device for identifying traffic sign board and electronic equipment

Info

Publication number
CN114550137B
CN114550137B (application CN202210172954.5A)
Authority
CN
China
Prior art keywords
feature extraction
feature
traffic sign
branches
image
Prior art date
Legal status
Active
Application number
CN202210172954.5A
Other languages
Chinese (zh)
Other versions
CN114550137A (en)
Inventor
李耀萍
贾双成
朱磊
单国航
Current Assignee
Zhidao Network Technology Beijing Co Ltd
Original Assignee
Zhidao Network Technology Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhidao Network Technology Beijing Co Ltd filed Critical Zhidao Network Technology Beijing Co Ltd
Priority to CN202210172954.5A
Publication of CN114550137A
Application granted
Publication of CN114550137B
Status: Active


Classifications

    • G06F18/253 — Pattern recognition; Analysing; Fusion techniques of extracted features
    • G06N3/044 — Neural networks; Architecture; Recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Neural networks; Architecture; Combinations of networks
    • G06N3/08 — Neural networks; Learning methods


Abstract

This application relates to a method, an apparatus, and an electronic device for identifying traffic signs. The method comprises: obtaining, through an input layer, image information that includes an image of a traffic sign; processing the image information through a plurality of feature extraction branches to obtain multi-scale fused traffic sign features; and processing the multi-scale fused features through an output layer to determine and output the vertex coordinate information of the traffic sign. At least one of the feature extraction branches is connected to the input layer and at least one to the output layer; the feature maps extracted by the branches differ in resolution; each branch comprises multiple feature extraction modules connected in series; and same-stage feature extraction modules in different branches are each connected to the next stage's modules. The method and apparatus can effectively improve recognition accuracy for traffic signs of various sizes.

Description

Method and device for identifying traffic sign board and electronic equipment
Technical Field
This application relates to the technical field of artificial intelligence, and in particular to a method, an apparatus, and an electronic device for identifying traffic signs.
Background
With the rapid development of computer and artificial intelligence technology, AI techniques are being applied in more and more scenarios, such as traffic sign recognition.
However, the applicant has found that when the related art uses a single model to identify traffic signs of various sizes, recognition accuracy leaves room for improvement.
Disclosure of Invention
To solve, or at least partially solve, the problems in the related art, this application provides a method, an apparatus, and an electronic device for identifying traffic signs that can improve recognition accuracy for traffic signs of various sizes.
A first aspect of the present application provides a method of identifying traffic signs, comprising: obtaining, through an input layer, image information that includes an image of a traffic sign; processing the image information through a plurality of feature extraction branches to obtain multi-scale fused traffic sign features; and processing the multi-scale fused features through an output layer to determine and output the vertex coordinate information of the traffic sign. At least one of the feature extraction branches is connected to the input layer and at least one to the output layer; the feature maps extracted by the branches differ in resolution; each branch comprises multiple feature extraction modules connected in series; and same-stage feature extraction modules in different branches are each connected to the next stage's modules.
According to some embodiments of the present application, the feature extraction branches are each connected to the input layer, each branch contains the same number of feature extraction stages, and same-stage feature extraction modules in different branches contain different numbers of residual structures.
According to some embodiments of the present application, the feature extraction branches are each connected to the input layer, each branch contains a different number of feature extraction stages, and same-stage feature extraction modules in different branches contain different numbers of residual structures.
According to some embodiments of the present application, the number of feature extraction stages in a branch is inversely related to its resolution.
According to some embodiments of the present application, the number of residual structures in a feature extraction module is inversely related to the resolution of its branch.
According to some embodiments of the present application, the feature extraction module comprises: a feature sampling unit that copies, upsamples, and/or downsamples the feature maps output by the previous stage's feature extraction modules to obtain feature maps at the current stage's scale; a feature fusion unit that fuses the same-scale feature maps of the current stage into a fused feature map; and a feature extraction unit comprising at least one residual structure, with multiple residual structures connected in series, that processes the fused feature map to obtain an output feature map.
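As a minimal numpy sketch (hypothetical names and operations; the patent does not specify an implementation), the sampling and fusion steps above can be illustrated with nearest-neighbour upsampling, average-pool downsampling, and element-wise summation of the same-scale maps:

```python
import numpy as np

def upsample2x(fmap):
    # Nearest-neighbour upsampling: repeat rows and columns.
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

def downsample2x(fmap):
    # 2x2 average pooling as a simple stand-in for a strided convolution.
    h, w = fmap.shape[0] // 2, fmap.shape[1] // 2
    return fmap[:2 * h, :2 * w].reshape(h, 2, w, 2).mean(axis=(1, 3))

def fuse_same_stage(maps, target_hw):
    # Resample every incoming (square) feature map to the target scale, then sum.
    fused = np.zeros(target_hw)
    for m in maps:
        while m.shape[0] > target_hw[0]:
            m = downsample2x(m)
        while m.shape[0] < target_hw[0]:
            m = upsample2x(m)
        fused += m
    return fused
```

For instance, fusing an 8×8 map from a high-resolution branch with a 4×4 map from a low-resolution branch at the 8×8 scale upsamples the latter before summation.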
According to some embodiments of the present application, downsampling a feature map comprises performing a convolution operation or a pooling operation on the feature map.
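For illustration, the convolution route can be sketched as a hand-rolled single-channel strided valid convolution (names and kernel values are illustrative, not the patent's implementation); with stride 2 it halves the spatial resolution:

```python
import numpy as np

def conv_downsample(fmap, kernel, stride=2):
    # 'Valid' 2-D convolution with a stride; stride=2 halves the resolution.
    kh, kw = kernel.shape
    oh = (fmap.shape[0] - kh) // stride + 1
    ow = (fmap.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = fmap[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = (patch * kernel).sum()
    return out
```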
According to certain embodiments of the present application, the above method may further comprise: acquiring a traffic sign image from the image information based on the vertex coordinate information of the traffic sign; and recognizing the traffic sign image to obtain traffic sign information.
A second aspect of the present application provides an apparatus for identifying traffic signs, comprising: an image information input module that obtains, through an input layer, image information including an image of a traffic sign; a multi-scale fusion feature acquisition module that processes the image information through a plurality of feature extraction branches to obtain multi-scale fused traffic sign features, wherein at least one of the branches is connected to the input layer, at least one is connected to the output layer, the feature maps extracted by the branches differ in resolution, each branch comprises multiple serially connected feature extraction modules, and same-stage feature extraction modules in different branches are each connected to the next stage's modules; and a vertex coordinate output module that processes the multi-scale fused features through the output layer to determine and output the vertex coordinate information of the traffic sign.
A third aspect of the present application provides an electronic device, comprising: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method described above.
A fourth aspect of the present application also provides a computer readable storage medium having executable code stored thereon, which when executed by a processor of an electronic device, causes the processor to perform the above-described method.
A fifth aspect of the present application also provides a computer program product comprising executable code which when executed by a processor implements the above method.
According to the method, apparatus, and electronic device for identifying traffic signs of this application, feature maps at multiple resolutions are extracted by the multiple feature extraction branches and fused stage by stage to obtain multi-scale fused traffic sign features, so the feature maps contain rich high-resolution and low-resolution information. Vertex coordinates of traffic signs of different sizes can thereby be accurately extracted from the image information, improving recognition accuracy.
According to the method, apparatus, and electronic device of this application, the feature extraction module is built from a plurality of serially connected residual structures, which effectively enlarges the receptive field of the feature maps, making the approach well suited to traffic sign vertex recognition.
According to the method, apparatus, and electronic device of this application, at least some of the feature extraction modules each comprise a plurality of serially connected residual structures, which effectively mitigate a problem caused by gradient vanishing or gradient explosion: model recognition accuracy degrading when the network is too deep.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The foregoing and other objects, features and advantages of the application will be apparent from the following more particular descriptions of exemplary embodiments of the application as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the application.
Fig. 1 schematically illustrates an application scenario of identifying traffic signs according to an embodiment of the present application;
FIG. 2 schematically illustrates an exemplary system architecture to which methods, apparatuses, and electronic devices for identifying traffic signs may be applied, according to embodiments of the present application;
FIG. 3 schematically illustrates a flow chart of a method of identifying traffic signs according to an embodiment of the application;
FIG. 4 schematically illustrates a schematic structural view of a signboard recognition model according to an embodiment of the present application;
FIG. 5 schematically illustrates another structural diagram of a signboard recognition model according to an embodiment of the present application;
FIG. 6 schematically illustrates another structural diagram of a signboard recognition model according to an embodiment of the present application;
FIG. 7 schematically illustrates a structural schematic of a feature extraction module according to an embodiment of the application;
FIG. 8 schematically illustrates another structural schematic of a feature extraction module according to an embodiment of the application;
FIG. 9 schematically illustrates a block diagram of an apparatus for identifying traffic signs according to an embodiment of the application;
fig. 10 schematically shows a block diagram of an electronic device according to an embodiment of the application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The terminology used in the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
It should be understood that although the terms "first," "second," "third," etc. may be used herein to describe various information, such information should not be limited by these terms, which serve only to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly second information may be referred to as first information, without departing from the scope of the present application. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. In this application, "a plurality" means two or more unless explicitly defined otherwise.
Vertex identification of traffic signs is currently a hot research topic. The related art falls into two main categories: recognition based on conventional algorithms and recognition based on neural networks. When the related art identifies traffic sign vertices with a neural network, signs may come in various sizes, and using the same model on signs of different sizes yields widely varying recognition results and poor overall accuracy.
According to embodiments of the present application, high-resolution feature maps are extracted by high-resolution feature extraction branches and low-resolution feature maps by low-resolution branches, so that vertex coordinates of traffic signs of different sizes can each be extracted from the image information, improving recognition accuracy. In addition, same-stage fusion of the feature maps extracted by the different branches enriches the high-resolution and low-resolution features the maps contain, so high-quality features can be extracted for traffic signs of different sizes, further improving the recognition effect.
A method, an apparatus and an electronic device for identifying a traffic sign according to embodiments of the present application will be described in detail below with reference to fig. 1 to 10.
Fig. 1 schematically illustrates an application scenario of identifying traffic signs according to an embodiment of the present application.
Referring to fig. 1, a schematic view of a highway is shown. Traffic signs may be installed above or on both sides of a road to convey traffic information. A traffic sign is a road installation that conveys guidance, restriction, warning, or indication information in text or symbols. Traffic signs fall into two categories: main signs and auxiliary signs. Main signs include, but are not limited to: warning signs, prohibition signs, mandatory signs, guide signs, tourist-area signs, road construction safety signs, etc.
For example, the triangular traffic sign in fig. 1 is a warning sign indicating that the road ahead narrows on the right. The rectangular traffic sign in fig. 1 is a guide sign.
Taking as an example a vehicle with assisted or automated driving functions equipped with a camera, multiple frames of images may be captured by the camera while the vehicle travels along a road. The proportion of the image occupied by the same traffic sign may change continuously, e.g., it grows as the vehicle approaches the sign. Because the captured frames must undergo traffic sign recognition to support assisted or automated driving, a high recognition speed is required.
Further, if traffic sign recognition is performed by a computing device on the vehicle, the device's limited computing capacity raises the demands on the recognition model's processing speed, offline processing capability, and the like.
Referring to fig. 1, the rectangular traffic sign has four vertices and the triangular traffic sign has three. Vertex coordinates may be represented as (xi, yi), with i an integer greater than or equal to 1. The recognized traffic signs facilitate planning or adjusting assisted/automated driving strategies such as acceleration and deceleration, obstacle avoidance, and path planning.
Fig. 2 schematically illustrates an exemplary system architecture to which a method, apparatus, and electronic device for identifying traffic signs may be applied according to an embodiment of the application. It should be noted that fig. 2 is only an example of a system architecture to which the embodiments of the present application may be applied to help those skilled in the art understand the technical content of the present application, and does not mean that the embodiments of the present application may not be used in other devices, systems, environments, or scenarios.
Referring to fig. 2, a system architecture 200 according to this embodiment may include mobile devices 201, 202, 203, a network 204, and a cloud 205. The network 204 is the medium used to provide communication links between the mobile devices 201, 202, 203 and the cloud 205. The network 204 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with other mobile devices and cloud 205 over network 204 using mobile devices 201, 202, 203 to receive or transmit information, etc., such as transmitting model training requests, traffic sign recognition requests, and receiving trained model parameters, traffic sign information, etc. The mobile devices 201, 202, 203 may be installed with various communication client applications, such as, for example, in-car applications, web browser applications, database class applications, search class applications, instant messaging tools, mailbox clients, social platform software, and the like.
Mobile devices 201, 202, 203 include, but are not limited to, automobiles, robots, tablet computers, laptop computers, and other devices that support internet access, human-computer interaction, and the like.
The cloud 205 may receive a model training request, a traffic sign recognition request, and the like, adjust model parameters to perform model training, issue model topology structures, issue trained model parameters, issue traffic sign information obtained by recognition, and the like, and may also send real-time traffic sign information to the mobile devices 201, 202, 203. For example, cloud 205 may be a background management server, a server cluster, a car networking, or the like.
It should be noted that the numbers of mobile devices, networks, and cloud servers are merely illustrative. There may be any number of mobile devices, networks, and clouds, as required by the implementation.
Fig. 3 schematically illustrates a flow chart of a method of identifying traffic signs according to an embodiment of the application.
Referring to fig. 3, the embodiment provides a method of identifying a traffic sign, which includes operations S310 to S330, as described below.
In operation S310, image information including an image of a traffic sign is obtained through an input layer.
The image information may be captured by a camera mounted on the vehicle. The camera may be monocular; alternatively, a binocular camera may be used, in which case traffic sign recognition is performed after fusing the two images it captures. The image may also be a photograph taken by a smartphone or another device with a camera.
For example, the image may be one or more frames captured by an onboard camera, and may include at least one traffic sign. A traffic sign is a board installed above or beside a road, and its shape may be a triangle, rectangle, polygon, etc. A rectangular traffic sign, for example, has 4 vertices (the captured image may show only some of a sign's vertices). The number of traffic signs in a captured image is not limited to one; there may be several.
It should be noted that the technical solution of this application is well suited to images captured by a camera while the vehicle is in motion. In this scenario the vehicle may move at high speed and the capture frame rate is high, so traffic sign vertices must be identified quickly from each image frame. Moreover, the image size of the same traffic sign varies widely between frames captured at different times, e.g., occupying from 0.1% of the image area (when the sign is far from the camera) to 30% or more (when the sign is near the camera).
In operation S320, the image information is processed through the plurality of feature extraction branches, respectively, to obtain the multi-scale fused traffic sign features.
At least one of the plurality of feature extraction branches is connected with the input layer, at least one of the plurality of feature extraction branches is connected with the output layer, the resolutions of feature graphs extracted by the plurality of feature extraction branches are different, each feature extraction branch comprises a plurality of feature extraction modules connected in series, and the feature extraction modules of the same stage in the different feature extraction branches are respectively connected with the feature extraction modules of the next same stage.
For example, the plurality of feature extraction branches may include at least one feature extraction branch for extracting a high-resolution feature map, and may further include at least one feature extraction branch for extracting a low-resolution feature map. The feature extraction module may be formed of at least one convolution layer.
In some embodiments, traffic signs may be divided by size into small, medium, large, and so on. The image information may contain one or more traffic signs of different sizes, and as the distance between the vehicle and a sign changes, the sign's size in the captured image changes too. To better recognize the vertices of signs of various sizes (small, medium, and large) and of sign images with different image proportions, the convolution kernels and numbers of convolution channels used by the feature extraction branches may differ from one another.
For example, there are a plurality of feature extraction branches, each of which performs convolution computation on an image at least twice. For example, convolution kernels employed in convolution calculations include, but are not limited to: 1×1, 2×2, 3×3, 5×5, 7×7, and the like. The step size of the convolution calculation may be 1, 2, 3, 5, 7, etc. Each feature extraction module may output a feature map for multi-scale feature map fusion.
For example, each feature extraction branch performs n convolution calculations on the image, where n ≥ 2 and n is an integer. For example, the feature extraction module of a feature extraction branch may use a 3×3 convolution kernel with a stride of 2.
The convolution calculation process may be as follows: first, an N×N convolution kernel is initialized, where N is a positive odd number in the interval [1, 7]; then, the number of convolution channels is set; finally, the image is convolved using the N×N kernel and the set number of channels.
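The three steps can be sketched in numpy as a single-channel toy (function names and the random initialization are illustrative assumptions; a real model would use a deep-learning framework and learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_branch(n, channels):
    # Steps 1-2: initialise an n x n kernel per output channel
    # (n a positive odd number in [1, 7]).
    assert n % 2 == 1 and 1 <= n <= 7
    return rng.standard_normal((channels, n, n))

def convolve(image, kernels, stride=1):
    # Step 3: 'valid' convolution of a single-channel image with each kernel.
    c, n, _ = kernels.shape
    oh = (image.shape[0] - n) // stride + 1
    ow = (image.shape[1] - n) // stride + 1
    out = np.zeros((c, oh, ow))
    for k in range(c):
        for i in range(oh):
            for j in range(ow):
                patch = image[i * stride:i * stride + n, j * stride:j * stride + n]
                out[k, i, j] = (patch * kernels[k]).sum()
    return out
```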
For example, assume there are 3 feature extraction branches. In the first convolution calculation of the 1st branch, the image is convolved with a 1×1 kernel and channels=512; the 1×1 kernel focuses on small traffic signs in the image. In the first convolution of the 2nd branch, a 3×3 kernel with channels=256 is used, mainly for identifying medium-sized traffic signs. In the first convolution of the 3rd branch, a 5×5 kernel with channels=128 is used, mainly for identifying large traffic signs. In this embodiment, using kernels of different sizes (e.g., 5×5, 3×3, and 1×1) improves recognition of traffic signs of different sizes.
Specifically, an image containing a traffic sign element is acquired, and convolution is performed on it along the 1st, 2nd, and 3rd feature extraction branches respectively: the 1st branch convolves the image with a 1×1 kernel and channels=512; the 2nd branch uses a 3×3 kernel with channels=256; and the first convolution of the 3rd branch uses a 5×5 kernel with channels=128.
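Assuming 'valid' convolutions (padding behaviour is not specified in the text), the three-branch configuration above can be captured in a small table together with the resulting spatial sizes for a hypothetical 64×64 input:

```python
# Hypothetical branch configuration mirroring the 3-branch example above.
BRANCHES = [
    {"kernel": 1, "channels": 512},  # 1x1: focuses on small signs
    {"kernel": 3, "channels": 256},  # 3x3: medium-sized signs
    {"kernel": 5, "channels": 128},  # 5x5: large signs
]

def valid_output_size(side, kernel, stride=1):
    # Spatial size after a 'valid' convolution with the given kernel and stride.
    return (side - kernel) // stride + 1

sizes = [valid_output_size(64, b["kernel"]) for b in BRANCHES]
```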
After the feature extraction modules of each branch have extracted feature maps at different resolutions, the maps output by the last module of each branch are fused into the feature map of the highest-resolution branch, yielding the multi-scale fused feature map.
In addition, because the model contains branches that extract high-resolution feature maps and branches that extract low-resolution ones, connected in parallel with multi-scale fusion between them, each feature map from high to low resolution repeatedly receives information from the other parallel representations, producing feature maps rich in both high-resolution and low-resolution features.
In operation S330, the multi-scale fused traffic sign features are processed through the output layer, and vertex coordinate information of the traffic sign is determined and output.
For example, coordinates of pixels having pixel values greater than a preset threshold in the multi-scale fused feature map may be used as vertex coordinates.
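A minimal sketch of this thresholding step, assuming a toy fused feature map and an arbitrary preset threshold of 0.5:

```python
import numpy as np

# Toy fused feature map; values above the threshold mark candidate vertices.
feature_map = np.array([
    [0.1, 0.2, 0.1, 0.9],
    [0.2, 0.1, 0.1, 0.2],
    [0.1, 0.1, 0.2, 0.1],
    [0.8, 0.2, 0.1, 0.95],
])
threshold = 0.5  # assumed preset threshold

# (row, col) coordinates of pixels whose value exceeds the threshold
vertex_coords = [tuple(rc) for rc in np.argwhere(feature_map > threshold)]
```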
In the embodiment of the application, the input image information is distributed to the multiple feature extraction branches, propagated through the series of feature extraction modules in each branch, and the feature maps of different scales extracted by the modules are fused. The high-resolution feature map is thus maintained throughout, rather than recovered by upsampling from low-resolution feature maps, which effectively improves the recognition of traffic signs of various sizes.
Fig. 4 schematically shows a schematic structural view of a signboard recognition model according to an embodiment of the present application.
Referring to fig. 4, the plurality of feature extraction branches are respectively connected to the input layer; each branch contains the same number of stages of feature extraction modules, while feature extraction modules at the same stage of different branches contain different numbers of residual structures. Compared with plain convolution layers, extracting features with residual structures effectively reduces training difficulties caused by vanishing or exploding gradients. In addition, the residual structure allows the depth of the model to be increased, which improves the recognition rate of the model.
Referring to fig. 4, a sign recognition model including at least three feature extraction branches is described as an example, where the resolutions of the feature maps extracted by the branches decrease from high to low; for example, the first one or two branches are high-resolution feature extraction branches and the remaining branches are low-resolution feature extraction branches. The at least three branches are connected in parallel, and multi-scale feature fusion is performed between them, so that each feature map, from high resolution to low resolution, repeatedly receives information from the other parallel representations, yielding a high-resolution feature map that is also rich in low-resolution features.
For example, at least one high-resolution feature extraction branch is connected to the input layer, at least one low-resolution feature extraction branch is connected to the input layer, and the at least one high-resolution feature extraction branch and the at least one low-resolution feature extraction branch each comprise the same number of stages of feature extraction modules.
Each block in fig. 4 represents a feature extraction module. At least some of the feature extraction modules may each consist of one or more residual structures. After the input image information (img) in fig. 4 is distributed to the at least three branches, it is propagated through the feature extraction modules of each branch, and the feature maps extracted by modules at the same stage of different branches are fused. This fusion is performed multiple times; for example, the n stages in fig. 4 may be 8 stages, in which case fusion is performed 8 times. Finally, the fused feature map is passed to the first branch (for example, the high-resolution feature extraction branch), so that the vertex coordinates of the traffic sign are determined from the fused high-resolution feature map.
For example, the feature extraction module may include: the device comprises a feature sampling unit, a feature fusion unit and a feature extraction unit.
The feature sampling unit is used to copy, upsample and/or downsample the feature maps output by the feature extraction modules of the previous stage, so as to obtain feature maps of the same scale at the current stage. For example, low-resolution feature maps are upsampled, high-resolution feature maps are downsampled, and feature maps of the same level are copied through. That is, upsampling, downsampling and padding turn feature maps of different scales into feature maps of the same scale. Downsampling the feature map may include performing a convolution operation or a pooling operation on the feature map.
For example, the function α(Xi, k) represents upsampling or downsampling Xi from resolution level i to level k. Downsampling may be performed with a 3×3 convolution kernel of a particular stride, for example a 3×3 convolution with stride 2×2. Upsampling may use the simplest nearest-neighbour sampling followed by a 1×1 convolution to align the number of channels. If i = k, α(Xi, k) denotes an identity connection: α(Xi, k) = Xi.
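The resampling function α(Xi, k) can be sketched as follows, assuming each resolution level halves the spatial size. Nearest-neighbour repetition implements the upsampling; 2×2 average pooling stands in for the strided 3×3 convolution (that substitution, and the single-channel input, are assumptions for illustration):

```python
import numpy as np

def alpha(x, i, k):
    """Resample feature map x from resolution level i to level k.
    i == k: identity connection; i < k: downsample; i > k: upsample."""
    if i == k:
        return x  # identity connection
    if i < k:
        # downsample one level at a time with 2x2 average pooling
        for _ in range(k - i):
            h, w = x.shape
            x = x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        return x
    # upsample by nearest-neighbour repetition
    factor = 2 ** (i - k)
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

x = np.arange(16, dtype=float).reshape(4, 4)
down = alpha(x, 0, 1)   # 4x4 -> 2x2
up = alpha(down, 1, 0)  # 2x2 -> 4x4
```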
The feature fusion unit is used for fusing feature images with the same scale at the current stage to obtain a fused feature image.
The feature extraction unit comprises at least one residual structure; when there are multiple residual structures, they are connected in series. The unit processes the fused feature map through convolution operations to obtain the output feature map.
Fig. 5 schematically shows another structural diagram of a signboard recognition model according to an embodiment of the present application.
Referring to fig. 5, a plurality of feature extraction branches are respectively connected with an input layer, the number of stages of feature extraction modules included in each of the plurality of feature extraction branches is different, and the number of residual structures included in each of feature extraction modules of the same stage in different feature extraction branches is different.
For example, the at least one high-resolution feature extraction branch and the at least one low-resolution feature extraction branch are each connected to the input layer, and the number of stages of feature extraction modules in the low-resolution branches is larger than the number of stages of feature extraction modules in the high-resolution branches.
In some embodiments, the number of levels of feature extraction modules in the plurality of feature extraction branches is inversely related to the resolution. For example, the number of feature extraction modules in the high resolution feature extraction branch is less than the number of feature extraction modules in the low resolution feature extraction branch.
For example, feature maps are extracted through residual structures, and the more residual structures connected in series, the larger the model. Reducing the number of feature extraction modules in the high-resolution feature extraction branch therefore reduces the size of the sign recognition model and the storage space it occupies.
Fig. 6 schematically shows another structural diagram of a signboard recognition model according to an embodiment of the present application.
Referring to fig. 6, viewed by branch, the sign recognition model includes four feature extraction branches. Viewed by stage, the model includes a plurality of repeating base units, for example 8 (there may be more or fewer). After the image is input to the feature extraction branches, each branch extracts features from the image. Specifically, the number of residual structures in each feature extraction module is negatively correlated with the resolution of its branch.
Each feature extraction branch in fig. 6 includes a plurality of feature extraction modules connected in series. For clarity, not all connections are drawn; for example, the connections from the feature extraction modules of the fourth branch to the next-stage feature extraction modules of the first and second branches are not fully shown.
For example, the sign recognition model may include four feature extraction branches. The at least one high-resolution feature extraction branch comprises a first branch and a second branch, with the resolution of the extracted feature maps decreasing in that order. The at least one low-resolution feature extraction branch comprises a third branch and a fourth branch, again with decreasing resolution. The feature extraction modules in the second branch contain more residual structures than those in the first branch.
The feature extraction modules of the different feature extraction branches in fig. 6 contain different numbers of residual structures. For example, a feature extraction module of the first branch comprises 1 residual structure, a module of the second branch comprises 2 residual structures in series, a module of the third branch comprises 4 residual structures in series, and a module of the fourth branch comprises 8 residual structures in series. Multiple residual structures connected in series help enlarge the receptive field, making the model better suited to identifying the vertices of small traffic signs from multi-frame image information.
The resolutions of the feature maps extracted by the 4 feature extraction branches in fig. 6 decrease progressively and can be used to identify traffic signs of different sizes: high-resolution feature maps help identify small signs, while low-resolution feature maps help identify large ones. The 4 branches can reduce the input to a sufficiently small resolution.
The test results of the identification model of the signboard shown in fig. 6 are shown in table 1.
In this embodiment, feature extraction branches that extract feature maps at multiple resolutions are connected in parallel. During feature extraction, information is exchanged directly and repeatedly across the parallel branches for multi-scale fusion; in particular, low-resolution feature maps of the same depth and similar level are used to enhance the representation capability of the high-resolution feature maps, and vice versa. The accuracy of the traffic sign vertex coordinates determined from the multi-scale fused features is thereby effectively improved.
Fig. 7 schematically illustrates a structural schematic diagram of the feature extraction module according to an embodiment of the present application.
Referring to fig. 7, a first batch normalization (Batch Normalization, BN) layer performs batch normalization processing on the pictures to be identified.
And the first convolution layer carries out convolution calculation on the image information subjected to batch normalization processing to obtain a first feature map.
And the second batch normalization layer is used for carrying out batch normalization processing on the first feature images to obtain second feature images.
And the activation function layer is used for activating the second feature map.
And the second convolution layer carries out convolution calculation on the second feature map after the activation processing to obtain a third feature map.
And carrying out feature fusion on the third feature image and the image to be identified according to a preset residual fusion algorithm to obtain the feature image output by the feature extraction module.
When batch-normalizing an image (or feature map), the pixel values of the image are first acquired, the corresponding average and variance are calculated from those pixel values, the scaling coefficient and translation coefficient corresponding to the image are set, and the image is batch-normalized according to this average, variance, scaling coefficient and translation coefficient.
Note that, the configuration of the feature extraction module shown in fig. 7 is only exemplary, and is not limited thereto. For example, the feature extraction module may also include more batch normalization layers, convolution layers, activation layers, and the like. Further, the order between layers in the feature extraction module may be changed. For example, the image may be convolved by a convolution layer to obtain the feature map. And then, carrying out batch normalization operation and the like on the feature images through a batch normalization layer.
For example, the image (or feature map) may be batch-normalized according to equation (1):

x_i_new = γ·(x_i − μ)/√(σ² + ε) + β (1)

wherein x_i is a pixel value within the image, μ is the average calculated from the pixel values in the image, σ² is the variance calculated from the pixel values in the image, γ is the scaling coefficient corresponding to the image, β is the translation coefficient corresponding to the image, x_i_new is the pixel value after batch normalization, and ε is a constant greater than 0 whose role is to keep the denominator from being 0.

Wherein σ² can be calculated using equation (2):

σ² = (1/N)·Σᵢ(x_i − μ)² (2)

and μ can be calculated using equation (3):

μ = (1/N)·Σᵢ x_i (3)

where N is the number of pixel values in the first convolution feature map and x_i is the pixel value of the i-th pixel in the first convolution feature map.
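The batch-normalization computation described by equations (1)-(3) can be sketched in code as follows; the toy input and the chosen γ, β, ε values are illustrative:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization per equations (1)-(3): normalize by the
    mean and variance of the map, then scale by gamma and shift by beta."""
    mu = x.mean()                  # equation (3): mean of pixel values
    var = ((x - mu) ** 2).mean()   # equation (2): variance of pixel values
    return gamma * (x - mu) / np.sqrt(var + eps) + beta  # equation (1)

x = np.array([[1.0, 2.0], [3.0, 4.0]])  # toy feature map
y = batch_norm(x, gamma=1.0, beta=0.0)
```

With γ=1 and β=0 the output has (approximately) zero mean and unit variance; ε only guards against a zero denominator for constant inputs.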
Referring to fig. 6 and fig. 7 together, after the images are respectively transmitted to the four feature extraction branches, the feature extraction may be performed as described above for the feature extraction module in at least some of the feature extraction branches.
Fig. 8 schematically shows another structural schematic of the feature extraction module according to an embodiment of the present application.
Referring to fig. 8, the residual structure is a building block. Rather than having the network fit the desired mapping directly, the residual structure fits a residual mapping: if the desired underlying mapping is denoted H(x), the stacked nonlinear layers fit another mapping F(x) = H(x) − x, and the original mapping becomes F(x) + x.
In practice it is much easier to push the residual F(x) toward 0 than to fit an identity mapping with a stack of nonlinear layers, and solving with residuals converges much faster than solving without them.
The F(x) + x formulation can be implemented by a feed-forward network with shortcut connections that skip one or more layers, without adding parameters or computational complexity.
Referring to fig. 8, the input is 256-dimensional (d) and passes through two convolution operations in sequence, each with a 3×3 convolution kernel (the stride may be 2). Each convolution operation is followed by an activation function (e.g., the ReLU function). After the image information is processed by the residual structure, the output is still 256-dimensional.
It should be noted that the residual structure ends with an elementwise addition (add), not a concatenation (concat, which would increase the number of channels). Add realizes feature-map addition while keeping the number of channels unchanged.
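A minimal sketch of the F(x) + x residual block, using small matrix multiplications as stand-ins for the two 3×3 convolutions (the dimension is reduced from 256 to 8, and the random weights are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """F(x) + x: two linear transforms (stand-ins for the 3x3 convolutions)
    with a ReLU in between, plus the identity shortcut.
    The elementwise add keeps the channel count unchanged."""
    fx = relu(x @ w1) @ w2   # the residual mapping F(x)
    return relu(fx + x)      # add, not concat

d = 8  # channel count (256 in fig. 8; reduced here)
x = rng.standard_normal(d)
w1 = rng.standard_normal((d, d)) * 0.1
w2 = rng.standard_normal((d, d)) * 0.1
y = residual_block(x, w1, w2)
```

Note that the output keeps the input's dimensionality; a concatenation `np.concatenate([fx, x])` would instead double the channel count, which is exactly what the add avoids.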
In certain embodiments, the above-described methods may further comprise the following operations.
First, a traffic sign image is acquired from image information based on vertex coordinate information of a traffic sign.
And then, identifying the traffic sign image to obtain traffic sign information.
This facilitates assisted/automatic driving planning based on the traffic sign information: for example, if the identified sign indicates a school zone ahead, the vehicle speed can be controlled not to exceed 30 km/h.
It should be noted that, the model training process of the identification model of the signpost may be offline training or online training, and the model training may be performed at the cloud. The mobile device may download the model topology and model parameters of the trained traffic sign recognition model from the cloud to enable recognition of traffic sign information at the mobile device. The mobile device may also report traffic status information to the cloud so that the cloud identifies traffic sign information using the trained traffic sign identification model and the identified traffic sign information is sent (or broadcast) by the cloud to one or more mobile devices.
Another aspect of the present application also provides an apparatus for identifying traffic signs.
Fig. 9 schematically shows a block diagram of an apparatus for identifying traffic signs according to an embodiment of the application.
Referring to fig. 9, the apparatus 900 for identifying a traffic sign may include: an image information input module 910, a multi-scale fusion feature acquisition module 920, and a vertex coordinates output module 930.
The image information input module 910 is configured to obtain image information including an image of a traffic sign through an input layer.
The multi-scale fusion feature obtaining module 920 is configured to process the image information through a plurality of feature extraction branches, respectively, to obtain multi-scale fusion traffic sign features; at least one of the plurality of feature extraction branches is connected with the input layer, at least one of the plurality of feature extraction branches is connected with the output layer, the resolutions of feature graphs extracted by the plurality of feature extraction branches are different, each feature extraction branch comprises a plurality of feature extraction modules connected in series, and the feature extraction modules of the same stage in the different feature extraction branches are respectively connected with the feature extraction modules of the next same stage.
The vertex coordinate output module 930 is configured to process the multi-scale fused traffic sign features through the output layer, and determine and output vertex coordinate information of the traffic sign.
In some embodiments, the plurality of feature extraction branches are respectively connected to the input layer, each branch includes the same number of stages of feature extraction modules, and feature extraction modules at the same stage of different branches include different numbers of residual structures.
In some embodiments, the plurality of feature extraction branches are respectively connected to the input layer, the branches include different numbers of stages of feature extraction modules, and feature extraction modules at the same stage of different branches include different numbers of residual structures.
In some embodiments, the number of levels of feature extraction modules in the plurality of feature extraction branches is inversely related to the resolution.
In some embodiments, the plurality of feature extraction modules each include a negative correlation between the number of residual structures and the resolution.
In some embodiments, the feature extraction module may include: the device comprises a feature sampling unit, a feature fusion unit and a feature extraction unit.
The feature sampling unit copies, upsamples and/or downsamples the feature maps output by the feature extraction modules of the previous stage to obtain feature maps of the same scale at the current stage. The feature fusion unit fuses the same-scale feature maps of the current stage to obtain a fused feature map. The feature extraction unit comprises at least one residual structure (multiple residual structures are connected in series) and processes the fused feature map to obtain the output feature map.
In some embodiments, the feature sampling unit is specifically configured to downsample the feature map, including: and carrying out convolution operation or pooling operation on the feature map.
In some embodiments, the apparatus 900 further includes: and the traffic sign image acquisition module and the traffic sign information acquisition module.
The traffic sign image acquisition module is used for acquiring traffic sign images from the image information based on vertex coordinate information of the traffic sign.
The traffic sign information acquisition module is used for identifying the traffic sign images to obtain traffic sign information.
The specific manner in which the respective modules and units perform the operations in the apparatus 900 in the above embodiment has been described in detail in the embodiment related to the method, and will not be described in detail here.
Another aspect of the present application also provides an electronic device.
Fig. 10 schematically shows a block diagram of an electronic device according to an embodiment of the application.
Referring to fig. 10, an electronic device 1000 includes a memory 1010 and a processor 1020.
The processor 1020 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 1020 or other modules of the computer. The persistent storage may be a readable and writable storage device, that is, a non-volatile storage device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, the persistent storage employs a mass storage device (e.g., a magnetic or optical disk, or flash memory). In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data required by some or all of the processors at runtime. Furthermore, memory 1010 may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks, and/or optical disks. In some implementations, memory 1010 may include a readable and/or writable removable storage device, such as a compact disc (CD), a digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, a super-density disc, a flash memory card (e.g., an SD card, a mini SD card, a micro SD card, etc.), a magnetic floppy disk, and the like. The computer-readable storage medium does not contain carrier waves or transient electronic signals transmitted wirelessly or by wire.
The memory 1010 has stored thereon executable code that, when processed by the processor 1020, can cause the processor 1020 to perform some or all of the methods described above.
Furthermore, the method according to the present application may also be implemented as a computer program or computer program product comprising computer program code instructions for performing part or all of the steps of the above-described method of the present application.
Alternatively, the present application may also be embodied as a computer-readable storage medium (or non-transitory machine-readable storage medium or machine-readable storage medium) having stored thereon executable code (or a computer program or computer instruction code) which, when executed by a processor of an electronic device (or a server, etc.), causes the processor to perform part or all of the steps of the above-described methods according to the present application.
The embodiments of the present application have been described above, the foregoing description is exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method of identifying traffic signs, comprising:
obtaining image information through an input layer, wherein the image information comprises an image of a traffic sign board;
processing the image information through a plurality of feature extraction branches respectively to obtain the features of the multi-scale fused traffic sign board;
processing the characteristics of the multi-scale fused traffic sign board through an output layer, and determining and outputting vertex coordinate information of the traffic sign board;
at least one of the plurality of feature extraction branches is connected with the input layer, at least one of the plurality of feature extraction branches is connected with the output layer, the resolution of feature graphs extracted by the plurality of feature extraction branches is different, each feature extraction branch comprises a plurality of serial-connection multi-stage feature extraction modules, and the feature extraction modules of the same stage in different feature extraction branches are respectively connected with the feature extraction modules of the next same stage.
2. The method of claim 1, wherein a plurality of the feature extraction branches are respectively connected to the input layer, the plurality of feature extraction branches each include a same number of levels of feature extraction modules, and the number of residual structures each included in a feature extraction module of a same level in different feature extraction branches is different.
3. The method of claim 1, wherein a plurality of the feature extraction branches are respectively connected to the input layer, the plurality of feature extraction branches each include a different number of levels of feature extraction modules, and the feature extraction modules of the same level in different feature extraction branches each include a different number of residual structures.
4. A method according to claim 3, wherein the number of steps of feature extraction modules in a plurality of said feature extraction branches is inversely related to resolution.
5. A method according to claim 2 or 3, wherein a plurality of said feature extraction modules each comprise a negative correlation between the number of residual structures and the resolution.
6. The method of any one of claims 1 to 4, wherein the feature extraction module comprises:
the feature sampling unit is used for respectively copying, upsampling and/or downsampling the feature graphs output by the feature extraction modules of the last same stage of the current stage to obtain feature graphs of the same scale of the current stage;
the feature fusion unit is used for fusing feature images with the same scale at the current stage to obtain a fused feature image;
the feature extraction unit comprises at least one residual structure, and a plurality of residual structures are connected in series and used for processing the fusion feature map to obtain the feature map.
7. The method of claim 6, wherein downsampling the signature comprises: and carrying out convolution operation or pooling operation on the feature map.
8. The method as recited in claim 6, further comprising:
acquiring a traffic sign image from the image information based on the vertex coordinate information of the traffic sign;
and identifying the traffic sign image to obtain traffic sign information.
9. An apparatus for identifying traffic signs, comprising:
the image information input module is used for obtaining image information through the input layer, wherein the image information comprises images of traffic signboards;
the multi-scale fusion feature obtaining module is used for respectively processing the image information through a plurality of feature extraction branches to obtain multi-scale fusion traffic sign features; at least one of the plurality of feature extraction branches is connected with the input layer, at least one of the plurality of feature extraction branches is connected with the output layer, the resolution of feature graphs extracted by the plurality of feature extraction branches is different, each feature extraction branch comprises a plurality of serial-connection multi-stage feature extraction modules, and the feature extraction modules of the same stage in the different feature extraction branches are respectively connected with the feature extraction module of the next same stage;
And the vertex coordinate output module is used for processing the multi-scale fused traffic sign board characteristics through an output layer, and determining and outputting vertex coordinate information of the traffic sign board.
10. An electronic device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method according to any of claims 1-8.
CN202210172954.5A 2022-02-22 2022-02-22 Method and device for identifying traffic sign board and electronic equipment Active CN114550137B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210172954.5A CN114550137B (en) 2022-02-22 2022-02-22 Method and device for identifying traffic sign board and electronic equipment


Publications (2)

Publication Number Publication Date
CN114550137A CN114550137A (en) 2022-05-27
CN114550137B true CN114550137B (en) 2024-04-09

Family

ID=81677710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210172954.5A Active CN114550137B (en) 2022-02-22 2022-02-22 Method and device for identifying traffic sign board and electronic equipment

Country Status (1)

Country Link
CN (1) CN114550137B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110298791A (en) * 2019-07-08 2019-10-01 Xi'an University of Posts and Telecommunications Super-resolution reconstruction method and device for license plate images
CN110414417A (en) * 2019-07-25 2019-11-05 University of Electronic Science and Technology of China Traffic sign recognition method based on multi-level fusion and multi-scale prediction
CN111754438A (en) * 2020-06-24 2020-10-09 Anhui University of Science and Technology Underwater image restoration model based on multi-branch gated fusion, and restoration method thereof
CN112200192A (en) * 2020-12-03 2021-01-08 Nanjing Fengxing Technology Co., Ltd. License plate recognition method and device
CN112598045A (en) * 2020-12-17 2021-04-02 Industrial and Commercial Bank of China Limited Method for training a neural network, image recognition method, and image recognition device
CN113361334A (en) * 2021-05-18 2021-09-07 Shandong Normal University Convolutional pedestrian re-identification method and system based on key-point optimization and multi-hop attention
CN113537246A (en) * 2021-08-12 2021-10-22 Zhejiang University Adversarial-learning-based method for simultaneous colorization and super-resolution of grayscale images
CN113762209A (en) * 2021-09-22 2021-12-07 Chongqing University of Posts and Telecommunications YOLO-based road sign detection method with multi-scale parallel feature fusion
CN113989773A (en) * 2021-10-27 2022-01-28 Zhidao Network Technology (Beijing) Co., Ltd. BiSeNet-based traffic sign recognition method and device for autonomous driving

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109345449B (en) * 2018-07-17 2020-11-10 Xi'an Jiaotong University Image super-resolution and non-uniform deblurring method based on a fusion network
US20200311492A1 (en) * 2019-03-31 2020-10-01 Cortica Ltd Object detector having shallow neural networks


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Texture-edge-guided depth image super-resolution reconstruction; Li Yuxiang; Deng Huiping; Xiang Sen; Wu Jin; Zhu Lei; Journal of Image and Graphics; 2018-10-16 (No. 10); full text *

Also Published As

Publication number Publication date
CN114550137A (en) 2022-05-27

Similar Documents

Publication Publication Date Title
WO2019223382A1 (en) Method for estimating monocular depth, apparatus and device therefor, and storage medium
CN112465828A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN111915660B (en) Binocular disparity matching method and system based on shared features and attention up-sampling
CN109300151B (en) Image processing method and device and electronic equipment
CN112307978B (en) Target detection method and device, electronic equipment and readable storage medium
CN111310770B (en) Target detection method and device
CN107830869B (en) Information output method and apparatus for vehicle
CN109118456B (en) Image processing method and device
CN113673562B (en) Feature enhancement method, object segmentation method, device and storage medium
CN115205150A (en) Image deblurring method, device, equipment, medium and computer program product
CN110110696B (en) Method and apparatus for processing information
CN110827341A (en) Picture depth estimation method and device and storage medium
CN110852250B (en) Vehicle weight removing method and device based on maximum area method and storage medium
CN114550137B (en) Method and device for identifying traffic sign board and electronic equipment
US20240161478A1 (en) Multimodal Weakly-Supervised Three-Dimensional (3D) Object Detection Method and System, and Device
CN115393868B (en) Text detection method, device, electronic equipment and storage medium
CN111626298A (en) Real-time image semantic segmentation device and segmentation method
CN116453086A (en) Method and device for identifying traffic sign and electronic equipment
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN112487943B (en) Key frame de-duplication method and device and electronic equipment
CN111382696A (en) Method and apparatus for detecting boundary points of object
CN112085035A (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN115223134A (en) Method and device for identifying traffic sign board and electronic equipment
CN116310408B (en) Method and device for establishing data association between event camera and frame camera
CN111582376B (en) Visualization method and device for neural network, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant