CN116189191A - Variable-length license plate recognition method based on yolov5 - Google Patents

Variable-length license plate recognition method based on yolov5 Download PDF

Info

Publication number
CN116189191A
CN116189191A CN202310219782.7A
Authority
CN
China
Prior art keywords
license plate
module
picture
data set
license
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310219782.7A
Other languages
Chinese (zh)
Inventor
陈琰
谢嘉健
赵瀚霖
许冠超
邱文斌
郑路璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Agricultural University
Original Assignee
South China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Agricultural University filed Critical South China Agricultural University
Priority to CN202310219782.7A priority Critical patent/CN116189191A/en
Publication of CN116189191A publication Critical patent/CN116189191A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/24Aligning, centring, orientation detection or correction of the image
    • G06V10/245Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/15Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/158Segmentation of character regions using character size, text spacings or pitch estimation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a variable-length license plate recognition method based on yolov5, which comprises the following steps: license plate positioning, license plate correction and license plate recognition. According to the invention, a qualified training set is obtained through manual screening, a YOLOV5S neural network is trained to accurately obtain the license plate positioning result, deformed and distorted license plates are corrected through a trained STN network to obtain better license plate features, and finally the license plates are recognized through a trained LPRNet network to obtain accurate license plate information.

Description

Variable-length license plate recognition method based on yolov5
Technical Field
The invention relates to the field of license plate number detection, in particular to a variable-length license plate recognition method based on yolov5.
Background
License plate recognition is a challenging and important task in urban traffic management, video surveillance, vehicle recognition and parking management; image sharpness, lighting conditions, weather factors, image deformation and the variability of license plate characters all complicate the license plate recognition problem. License plate recognition technology serves as an important means of traffic management automation and an important link in vehicle detection systems, and plays an important role in traffic monitoring and control.
The key to license plate recognition technology lies in three parts: license plate positioning, license plate character segmentation and license plate character recognition.
Most domestic and foreign technical schemes use the commonality of text textures in vehicle images for positioning and recognition. For example, YunTao Cui proposed a license plate recognition system abroad: after the license plate is located, a Markov random field is used to extract and binarize the license plate features, with the key work placed on binarization; recognition of license plate feature samples finally reaches a relatively high recognition rate. Luo Xuechao, Liu Guixiong et al. of South China University of Technology proposed a binarization method based on license plate characteristic information, and the system's recognition rate reaches 96% on license plates with good photographing effect.
However, some difficulties in the prior art still need to be resolved, such as: (1) the photographed license plate images are interfered with by environmental factors, such as backlight, night lighting and diffraction in optical imaging, so picture quality is difficult to guarantee; (2) other character areas cause interference, for example other plates hung beside the license plate, making it difficult to position the license plate accurately; (3) the license plate is stained or dirty, the characters are blurred, faded or badly worn, with severe noise contamination; (4) the license plate is partly occluded or deformed; (5) the image background is complex and one image may contain multiple license plates; (6) the photographed license plate comes from a moving image, so the license plate photo is blurred and distorted with jagged edges; (7) the recognition model has a high cost of iterative retraining for novel license plate types.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a variable-length license plate recognition method based on yolov5.
The technical scheme of the invention is realized as follows: a variable-length license plate recognition method based on yolov5 comprises the following steps:
s1: positioning a license plate;
the step S1 comprises the following steps:
s11: data preparation, namely collecting vehicle photos under the condition of complex and changeable environments, and manufacturing a picture data set; removing pictures which do not contain complete license plates and damaged pictures in the picture data set by a manual screening mode, removing pictures with higher similarity, and manually classifying license plates of different types; labeling tag information of each vehicle picture sample in a picture data set by using a labelimg tool to generate a labeling txt file, wherein the data set format in the txt file is yolo format, and each row in the txt file represents information of one license plate and comprises a license plate category, a license plate center point abscissa, a license plate center point ordinate, a license plate width and a license plate height; dividing images in the picture data set into a training data set, a verification data set and a test data set;
s12: training a YOLOV5S model, the YOLOV5S model comprising: a ConvBNSiLU module, a BottleNeck1 module, a BottleNeck2 module, a C3 module and an SPPF module, wherein the ConvBNSiLU module network structure consists of a conv convolution layer + bn layer + silu activation layer; the BottleNeck1 module network structure comprises a 1*1 ConvBNSiLU module followed by a 3*3 ConvBNSiLU module, whose output is then added to the initial input through a residual structure; the BottleNeck2 module network structure comprises a 1*1 ConvBNSiLU module followed by a 3*3 ConvBNSiLU module; the C3 module network structure is divided into two branches, one branch passing through a plurality of stacked BottleNecks and the other branch passing through only one basic convolution module, after which the two branches undergo a concat operation and finally pass through a ConvBNSiLU module; the SPPF module comprises 3 serial maximum pooling layers, and the pooling kernels of the maximum pooling layers are 5*5 in size;
s2: correcting the license plate;
the step S2 comprises the following steps: constructing an STN-based network model structure, wherein the input of the STN-based network is U and the output is V, and the STN-based network structure comprises an input end, a localization net, a Grid generator and a Sampler output end, wherein the localization net takes U as input and outputs a change parameter Θ, the parameter Θ being used for mapping the coordinate relationship between U and V; the Grid generator calculates the coordinate point in U corresponding to each coordinate point in V according to the coordinate points in V and the change parameter Θ; the Sampler output end fills V according to the series of coordinates obtained by the Grid generator and the original picture U;
s3: license plate recognition;
the step S3 comprises the following steps: the license plate is recognized by using an LPRNet, wherein the LPRNet comprises an input end, a Backbone end, a Neck end and a Head; the input end is a 94*24 RGB picture; the Backbone end comprises Small Basic Block modules, Convolution, MaxPooling, avg pool and Dropout; the Neck end fuses four scales through the avg pool layer; the Head is a 1*1 convolution, the number of input channels is 448 plus the number of predicted alphanumeric categories, and the number of output channels is the number of predicted alphanumeric categories.
Further, the step of collecting vehicle photos under complex and changeable environments in S11 comprises: collecting license plate photos under rainy, cloudy, tilted and blurred conditions.
Further, when removing picture data with higher similarity in S11, the similarity of the pictures is calculated through a histogram.
Further, the step of manually classifying license plates of different types in S11 comprises: manually classifying the license plates into new energy license plates, yellow license plates, two-place license plates and common blue license plates.
Further, in S11, the images in the picture data set are divided into a training data set, a verification data set and a test data set, and the ratio of the training data set, the verification data set and the test data set is 8:1:1.
Further, the loss function of the YOLOV5S model in S12 is: Loss = λ1*Lcls + λ2*Lobj + λ3*Lloc, wherein λ1 is 0.5, λ2 is 1, λ3 is 0.05, Lcls is the classification loss of positive samples, Lobj is the confidence loss, and Lloc is the localization loss of positive samples.
Further, the step S12 further comprises the steps of: in the process of training the YOLOV5S model, expanding the training data set through data enhancement methods, wherein the data enhancement methods comprise picture rotation, picture translation, picture flipping and picture random cropping.
Further, in the training process of the YOLOV5S model in S12, the training period is 100 epochs, the learning_rate is 0.01, the learning rate is adjusted once every 20 epochs, the momentum is 0.9, weight files are generated, and the best model weight parameters are selected as the weights of the best YOLOV5 model, thereby obtaining the best YOLOV5 model.
Further, the variable-length license plate recognition method based on yolov5 further comprises the step of post-processing and decoding: argmax is used to find the category corresponding to the maximum probability at each position in the sequence to obtain a sequence of length 18, and blanks and duplicates are removed from the sequence to obtain the final predicted sequence.
Compared with the prior art, the beneficial effects of the invention are that a qualified training set is obtained through manual screening, a YOLOV5S neural network is trained to accurately obtain the license plate positioning result, deformed and distorted license plates are corrected through a trained STN network to obtain better license plate features, and finally the license plates are recognized through a trained LPRNet network to obtain accurate license plate information.
Drawings
FIG. 1 is a schematic flow chart of a variable-length license plate recognition method based on yolov5;
FIG. 2 is a schematic diagram of a license plate positioning process in the present invention;
FIG. 3 is a schematic diagram of a data preparation flow in the present invention;
FIG. 4 is a schematic diagram of a BottleNeck1 module in a YOLOV5S network architecture according to the present invention;
FIG. 5 is a schematic diagram of a BottleNeck2 module in a YOLOV5S network architecture according to the present invention;
FIG. 6 is a schematic diagram of a C3 module in a Yolov5S network structure according to the present invention;
FIG. 7 is a schematic diagram of the SPPF module in the YOLOV5S network structure according to the present invention;
FIG. 8 is a structural diagram of the STN network in the present invention;
FIG. 9 is a structural diagram of the LPRNet network in the present invention;
FIG. 10 is a structural diagram of the Small Basic Block network module in the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, 2 and 3, the present invention collects vehicle photographs under complex and changeable environments (rainy, cloudy, tilted, blurred) as a dataset by extracting frames at equal intervals from video images acquired from a parking lot of South China Agricultural University, wherein the resolution of each image is 720 (width) × 1160 (height) × 3 (channels). Then, damaged pictures and pictures that do not contain a complete license plate are removed through manual screening; because continuous pictures obtained through video frame extraction have high similarity, picture data with high similarity are also removed (the picture similarity is calculated through a histogram), and license plates of different types (such as new energy automobile license plates, yellow license plates, two-place license plates and common blue license plates) are classified manually.
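The histogram-based similarity filtering mentioned above can be illustrated with a minimal OpenCV sketch; the correlation threshold, the per-channel comparison and the helper names are assumptions made for illustration, not details taken from the patent.

```python
# Sketch of histogram-based similarity filtering for frames extracted from video.
# Threshold and helper names are illustrative assumptions.
import cv2


def hist_similarity(img_a, img_b, bins=64):
    """Compare two BGR images by correlation of their per-channel histograms."""
    scores = []
    for ch in range(3):
        ha = cv2.calcHist([img_a], [ch], None, [bins], [0, 256])
        hb = cv2.calcHist([img_b], [ch], None, [bins], [0, 256])
        cv2.normalize(ha, ha)
        cv2.normalize(hb, hb)
        scores.append(cv2.compareHist(ha, hb, cv2.HISTCMP_CORREL))
    return sum(scores) / 3.0


def filter_similar_frames(paths, threshold=0.95):
    """Keep a frame only if it is not too similar to the last kept frame."""
    kept, last = [], None
    for p in paths:
        img = cv2.imread(p)
        if img is None:              # skip unreadable / damaged pictures
            continue
        if last is None or hist_similarity(last, img) < threshold:
            kept.append(p)
            last = img
    return kept
```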
The tag information of each vehicle sample in every dataset picture is labeled with the labelimg tool, generating a labeled txt file, wherein the dataset format in the txt file is the yolo format, and each row in the txt file represents the information of one license plate, comprising the license plate category, the license plate center point abscissa, the license plate center point ordinate, the license plate width and the license plate height. The labeled dataset is then randomly divided into a training data set, a verification data set and a test data set at a ratio of 8:1:1.
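For illustration, a yolo-format label line and the 8:1:1 random split could look like the sketch below; the example values, seed and function name are assumptions.

```python
# Sketch of the yolo-format label line and the 8:1:1 train/val/test split.
import random

# One line per license plate in each .txt label file:
#   <class_id> <x_center> <y_center> <width> <height>   (coordinates normalized to [0, 1])
example_line = "0 0.512 0.687 0.231 0.074"
cls_id, xc, yc, w, h = example_line.split()


def split_dataset(image_paths, ratios=(0.8, 0.1, 0.1), seed=0):
    """Randomly split images into training / verification / test sets at 8:1:1."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * ratios[0])
    n_val = int(len(paths) * ratios[1])
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]
```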
In the present invention, the YOLOV5S network structure comprises:
The ConvBNSiLU module, whose network structure consists of a conv convolution layer, a bn layer and a silu activation layer, wherein in the conv convolution layer the parameter k refers to the convolution kernel and the following number to the kernel size, s refers to the stride and the following number to the stride size, p refers to the padding and the following number to the padding size, and c refers to the output channels and the following number to the channel size;
The network structure of the BottleNeck1 module, as shown in FIG. 4, comprises a 1*1 ConvBNSiLU module followed by a 3*3 ConvBNSiLU module, whose output is then added to the initial input through a residual structure;
The network structure of the BottleNeck2 module, as shown in FIG. 5, comprises a 1*1 ConvBNSiLU module followed by a 3*3 ConvBNSiLU module;
The network structure of the C3 module, as shown in FIG. 6, is the main module for learning residual features. The C3 structure is divided into two branches: one branch passes through a ConvBNSiLU module and then through a specified number of stacked Bottlenecks, the other branch passes through only one basic convolution module; the two branches then undergo a concat operation and finally pass through a ConvBNSiLU module;
The SPPF module, as shown in FIG. 7, comprises 3 serial maximum pooling layers, each with a pooling kernel size of 5*5. The SPPF module structure has four branches: the first branch passes through a 1*1 ConvBNSiLU module, the second branch is the output of the first branch passed through a maximum pooling layer, the third branch is the output of the second branch passed through a maximum pooling layer, and the fourth branch is the output of the third branch passed through a maximum pooling layer. Finally, the four branches undergo a concat operation and pass through a 1*1 ConvBNSiLU module. A code sketch of these building blocks is given below.
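These modules mirror the standard YOLOv5s building blocks, so the following is a minimal PyTorch sketch written under that assumption; class names and default channel choices are illustrative rather than quoted from the patent.

```python
# Minimal PyTorch sketch of the building blocks described above
# (ConvBNSiLU, BottleNeck1/2, C3, SPPF). Channel defaults are assumptions.
import torch
import torch.nn as nn


class ConvBNSiLU(nn.Module):
    """conv -> bn -> SiLU, the basic convolution module."""
    def __init__(self, c_in, c_out, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class Bottleneck(nn.Module):
    """1*1 ConvBNSiLU then 3*3 ConvBNSiLU; structure 1 adds the residual input
    (shortcut=True), structure 2 does not (shortcut=False)."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, k=1)
        self.cv2 = ConvBNSiLU(c, c, k=3, p=1)
        self.shortcut = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.shortcut else y


class C3(nn.Module):
    """Branch 1: ConvBNSiLU then n stacked Bottlenecks; branch 2: one
    ConvBNSiLU; concat the branches, then a final ConvBNSiLU."""
    def __init__(self, c_in, c_out, n=1, shortcut=True):
        super().__init__()
        c_hid = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, c_hid, k=1)
        self.cv2 = ConvBNSiLU(c_in, c_hid, k=1)
        self.m = nn.Sequential(*(Bottleneck(c_hid, shortcut) for _ in range(n)))
        self.cv3 = ConvBNSiLU(2 * c_hid, c_out, k=1)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))


class SPPF(nn.Module):
    """A 1*1 ConvBNSiLU, three serial 5*5 max-pooling layers, concat of the
    four branches, then a final 1*1 ConvBNSiLU."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = ConvBNSiLU(c_in, c_hid, k=1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cv2 = ConvBNSiLU(4 * c_hid, c_out, k=1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```

The shortcut flag distinguishes the two Bottleneck variants used below: structure 1 (with the residual add) in the backbone C3 modules and structure 2 (without it) in the later C3 module.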
In fig. 4 to 7, k denotes kernel, s denotes stride, p denotes padding, and c denotes channel. A 640*640*3 jpg image is input to a ConvBNSiLU (k6, s2, p2, c64) module, which outputs a 320*320*64 feature map; the output feature map is input to a ConvBNSiLU (k3, s2, p1, c128) module, which outputs a 160*160*128 feature map; the output feature map is input to a C3 module (in which BottleNeck structure 1 is selected and repeated three times), which outputs a 160*160*128 feature map; the output feature map is input to a ConvBNSiLU (k3, s2, p1, c256) module, which outputs an 80*80*256 feature map; the output feature map is input to a C3 module (in which BottleNeck structure 1 is selected and repeated six times) to obtain an 80*80*256 feature map. This feature map is divided into two branches: one branch is marked as P1, and the other branch is further input to a ConvBNSiLU (k3, s2, p1, c512) module, which outputs a 40*40*512 feature map; the output feature map is input to a C3 module (in which BottleNeck structure 1 is selected and repeated nine times) to obtain a 40*40*512 feature map. This feature map is divided into two branches: one branch is marked as P2, and the other branch is further input to a ConvBNSiLU (k3, s2, p1, c1024) module, which outputs a 20*20*1024 feature map; the output feature map is input to a C3 module (in which BottleNeck structure 1 is selected and repeated three times), which outputs a 20*20*1024 feature map; the output feature map is input to the SPPF module, which outputs a 20*20*1024 feature map; the output feature map is input to a ConvBNSiLU (k1, s1, p0, c512) module to obtain a 20*20*512 feature map. This feature map is divided into two branches: one branch is marked as F1, and the other branch is further input to an upsampling module, which outputs a 40*40*512 feature map; this output feature map is then input to a Concat module and fused with P2 to obtain a 40*40*512 feature map; the output feature map is input to a C3 module (in which BottleNeck structure 2 is selected and repeated three times) to obtain a 40*40*512 feature map.
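As a standalone check of the downsampling arithmetic in the first stage above (batch size and random values are arbitrary):

```python
# A 640*640*3 input through a k6, s2, p2 convolution yields a 320*320*64 feature map.
import torch
import torch.nn as nn

x = torch.randn(1, 3, 640, 640)
conv = nn.Conv2d(3, 64, kernel_size=6, stride=2, padding=2)
print(conv(x).shape)   # torch.Size([1, 64, 320, 320])
```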
The YOLOV5S model loss function consists of the following parts: the classification loss (cls loss), which uses BCE loss and is calculated only for positive samples; the confidence loss (obj loss), which also uses BCE loss, taking the CIoU between the network-predicted target bounding box and the GT box as its target and calculated over all samples; and the location loss (loc loss), which uses CIoU loss and is calculated only for positive samples:
Loss = λ1*Lcls + λ2*Lobj + λ3*Lloc
wherein the λ are balance coefficients, equal to 0.5, 1 and 0.05 respectively.
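A minimal sketch of this weighted combination, assuming the three per-term losses (BCE for classification and confidence, CIoU for localization) have already been computed; the tensor names are illustrative.

```python
# Weighted sum Loss = 0.5*Lcls + 1*Lobj + 0.05*Lloc with the balance coefficients from the text.
import torch


def total_loss(l_cls: torch.Tensor, l_obj: torch.Tensor, l_loc: torch.Tensor) -> torch.Tensor:
    lam_cls, lam_obj, lam_loc = 0.5, 1.0, 0.05
    return lam_cls * l_cls + lam_obj * l_obj + lam_loc * l_loc


print(total_loss(torch.tensor(0.3), torch.tensor(0.7), torch.tensor(1.2)))  # tensor(0.9100)
```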
When the YOLOV5S model is trained, the YOLOV5S pre-training weights are used as the initial weights of the whole network. During training, the training data set is expanded through data enhancement methods, which include picture rotation, picture translation, picture flipping and picture random cropping. Picture rotation means rotating the picture by an arbitrary angle; picture translation means moving the picture horizontally or vertically so that the target is positioned at the center of the picture; picture flipping means flipping in the horizontal or vertical direction; picture random cropping means randomly cropping a part of the area in the picture and resizing the cropped area to the proportions of the original picture. Expanding the training data set through these data enhancement methods reduces overfitting and improves the generalization ability of the model.
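A torchvision-based sketch of the four augmentations named above; the concrete parameter values are assumptions, and for detection training the bounding-box labels would of course have to be transformed consistently with the image (the sketch shows the image side only).

```python
# Illustrative image-side augmentations: rotation, translation, flipping, random cropping.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                      # picture rotation (angle is an assumption)
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # picture translation
    transforms.RandomHorizontalFlip(p=0.5),                     # picture flipping
    transforms.RandomResizedCrop(size=640, scale=(0.8, 1.0)),   # random cropping, resized back
])
```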
The YOLOV5S model training period is 100 epochs, and the required weight files are generated. The best trained model weight parameters are selected from the obtained weight files as the weights of the optimal yolov5 model, thereby obtaining the optimal yolov5 model.
Referring to fig. 8, in order to overcome the influence of license plate tilt and deformation and of unfixed shooting angles on license plate recognition, and to improve the robustness and precision of the license plate recognition application, the network is given translation, rotation, scaling and shearing invariance by constructing an STN-based network model structure. The STN network structure mainly comprises an input end, a localization net, a Grid generator and a Sampler output end.
Localization net: a self-defined network that takes U as input and outputs the transformation parameter Θ, which is used to map the coordinate relationship between U and V. The invention extracts image features through a CNN to predict the transformation matrix Θ.
Grid generator: from a coordinate point in V and the transformation parameter Θ, the corresponding coordinate in U is obtained; for example, multiplying the coordinate (1, 1) in V by the transformation parameter Θ yields the coordinate (1.2, 1.6), i.e. the pixel value at coordinate (1, 1) in V equals the pixel value at coordinate (1.2, 1.6) in U. The size of V is defined by the user, so all coordinate points of V can be obtained, and since the pixel value of each coordinate point in V is taken from U when filling, the coordinate relative to U is calculated from each coordinate point in V and the transformation parameter Θ.
Sampler: its task is to fill V, using the series of coordinates obtained from the Grid generator and the original image U (because the pixel values are taken from U); since the calculated coordinates may be fractional, this module fills V using bilinear interpolation.
The STN is a differentiable network, so once it is inserted in front of the subsequent LPRNet network, end-to-end training can be achieved. The recombined network has spatial invariance, which makes training the network after the STN easier and improves the overall expressiveness of the network.
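A minimal PyTorch sketch of the three STN components just described (localization net, grid generator, bilinear sampler); the layer sizes of the localization CNN are assumptions.

```python
# Spatial transformer sketch: localization net -> affine_grid -> grid_sample.
import torch
import torch.nn as nn
import torch.nn.functional as F


class STN(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        # Localization net: a small CNN that regresses the 6 affine parameters Θ.
        self.loc = nn.Sequential(
            nn.Conv2d(channels, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 6),
        )
        # Start from the identity transform so training begins with "no warp".
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, u):
        theta = self.loc(u).view(-1, 2, 3)                           # localization net: U -> Θ
        grid = F.affine_grid(theta, u.size(), align_corners=False)   # grid generator: coordinates of V in U
        v = F.grid_sample(u, grid, align_corners=False)              # sampler: fill V from U by bilinear interpolation
        return v


corrected = STN()(torch.randn(1, 3, 24, 94))   # e.g. a plate crop; output has the same shape as the input
```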
Referring to fig. 9, in order to handle license plate numbers of different lengths, and the fact that the accuracy of current license plate recognition technology is limited by multi-row license plates, novel license plates and the like, the license plate recognition module adopts LPRNet, a network structure that supports real-time, high-quality recognition of variable-length license plates.
The Backbone network in LPRNet is shown in fig. 10, wherein the input end is a 94*24 RGB picture, and the Backbone end mainly comprises Small Basic Block modules, Convolution, MaxPooling, avg pool and Dropout.
To better utilize the features extracted by the Backbone, the Neck end does not extract context information with a fully connected layer as in the LPRNet paper, but instead fuses four scales through the avg pool layer.
The Head is a 1*1 convolution; the number of input channels is 448 plus the number of predicted alphanumeric categories (68 in this method), and the number of output channels is the number of predicted alphanumeric categories. The final network output is [68, 18], where 68 represents the probability that the character at each position in the sequence belongs to the corresponding class (there are 68 license plate character classes in total) and 18 represents the character sequence length. The structure of the Small Basic Block is shown in fig. 10.
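A sketch of the Head arithmetic described above, together with a small_basic_block laid out following the commonly published LPRNet implementation (an assumption here, since the patent only references its figure); the spatial size of the fused Neck feature map is likewise assumed.

```python
# Head: 1*1 convolution from 448 + num_classes input channels to num_classes output channels.
import torch
import torch.nn as nn

NUM_CLASSES = 68      # license plate character classes
SEQ_LEN = 18          # output sequence length


def small_basic_block(c_in, c_out):
    """Layout assumed from the published LPRNet implementation, not quoted from the patent."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out // 4, kernel_size=1), nn.ReLU(),
        nn.Conv2d(c_out // 4, c_out // 4, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(),
        nn.Conv2d(c_out // 4, c_out // 4, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(),
        nn.Conv2d(c_out // 4, c_out, kernel_size=1),
    )


head = nn.Conv2d(448 + NUM_CLASSES, NUM_CLASSES, kernel_size=1)

feats = torch.randn(1, 448 + NUM_CLASSES, 4, SEQ_LEN)   # fused Neck features (spatial size assumed)
logits = head(feats).mean(dim=2)                         # average over height -> (1, 68, 18)
print(logits.shape)

blk = small_basic_block(64, 128)                         # one Backbone building block
print(blk(torch.randn(1, 64, 24, 94)).shape)             # torch.Size([1, 128, 24, 94])
```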
When the STN-based network model is trained, the weights obtained after training on the CCPD data set are used as the initial weights of the whole network. During training, the training data set is expanded through the data enhancement methods described above (picture rotation, picture translation, picture flipping and picture random cropping), which reduces overfitting and improves the generalization ability of the model.
Max_epoch is 100, the learning_rate is 0.01, the learning rate is adjusted once every 20 epochs, the momentum is 0.9, and the required weight files are generated at the same time. The best trained model weight parameters are selected from the obtained weight files as the weights of the optimal LPRNet model, thereby obtaining the optimal LPRNet model.
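A sketch of this training schedule in PyTorch; the learning-rate decay factor, the placeholder model and the validation logic are assumptions.

```python
# 100 epochs, SGD with lr=0.01 and momentum=0.9, learning rate adjusted every 20 epochs.
import torch
from torch import optim

model = torch.nn.Linear(10, 2)          # placeholder standing in for the STN + LPRNet model
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)   # gamma is an assumption

best_acc, MAX_EPOCH = 0.0, 100
for epoch in range(MAX_EPOCH):
    # ... one training pass and one validation pass would go here ...
    val_acc = 0.0                        # placeholder validation metric
    scheduler.step()                     # StepLR lowers the lr once every 20 epochs
    if val_acc >= best_acc:              # keep the best weight file
        best_acc = val_acc
        torch.save(model.state_dict(), "best_lprnet.pth")
```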
In order to deal with the alignment of sequences of varying length (the length of the obtained predicted sequence is not necessarily the same as that of the real GT sequence, so some classical loss functions such as cross entropy cannot be used), the invention adopts CTC Loss. Finally, to obtain the final predicted sequence, argmax is first used to find the category with the maximum probability at each position in the sequence (greedy search), giving a sequence of length 18; blanks are then removed (the '-' character indicates that no character is present at that position of the sequence) and adjacent duplicates are merged (characters at two adjacent positions of the sequence cannot repeat), yielding the final predicted sequence.
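A sketch of this greedy decode (argmax per position, drop blanks, merge adjacent duplicates); the character table shown is a small illustrative stand-in for the real 68-class table.

```python
# Greedy CTC decoding: argmax per position, remove blanks, merge repeated neighbours.
import numpy as np

CHARS = ["-"] + [str(d) for d in range(10)]   # '-' is the blank; the real table has 68 classes
BLANK = 0


def greedy_ctc_decode(logits: np.ndarray) -> str:
    """logits: (num_classes, seq_len) scores; returns the decoded license plate string."""
    best = logits.argmax(axis=0)               # class with maximum probability at each of the 18 positions
    out, prev = [], BLANK
    for idx in best:
        if idx != BLANK and idx != prev:       # drop blanks, merge adjacent duplicates
            out.append(CHARS[idx])
        prev = idx
    return "".join(out)


print(greedy_ctc_decode(np.random.randn(len(CHARS), 18)))
```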
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention, such changes and modifications are also intended to be within the scope of the invention.

Claims (9)

1. A variable-length license plate recognition method based on yolov5, characterized by comprising the following steps:
s1: positioning a license plate;
the step S1 comprises the following steps:
s11: data preparation, namely collecting vehicle photos under the condition of complex and changeable environments, and manufacturing a picture data set; removing pictures which do not contain complete license plates and damaged pictures in the picture data set by a manual screening mode, removing pictures with higher similarity, and manually classifying license plates of different types; labeling tag information of each vehicle picture sample in a picture data set by using a labelimg tool to generate a labeling txt file, wherein the data set format in the txt file is yolo format, and each row in the txt file represents information of one license plate and comprises a license plate category, a license plate center point abscissa, a license plate center point ordinate, a license plate width and a license plate height; dividing images in the picture data set into a training data set, a verification data set and a test data set;
s12: training a YOLOV5S model, the YOLOV5S model comprising: a ConvBNSiLU module, a BottleNeck1 module, a BottleNeck2 module, a C3 module and an SPPF module, wherein the ConvBNSiLU module network structure consists of a conv convolution layer + bn layer + silu activation layer; the BottleNeck1 module network structure comprises a 1*1 ConvBNSiLU module followed by a 3*3 ConvBNSiLU module, whose output is then added to the initial input through a residual structure; the BottleNeck2 module network structure comprises a 1*1 ConvBNSiLU module followed by a 3*3 ConvBNSiLU module; the C3 module network structure is divided into two branches, one branch passing through a plurality of stacked BottleNecks and the other branch passing through only one basic convolution module, after which the two branches undergo a concat operation and finally pass through a ConvBNSiLU module; the SPPF module comprises 3 serial maximum pooling layers, and the pooling kernels of the maximum pooling layers are 5*5 in size;
s2: correcting the license plate;
the step S2 comprises the following steps: constructing an STN-based network model structure, wherein the input of the STN-based network is U and the output is V, and the STN-based network structure comprises an input end, a localization net, a Grid generator and a Sampler output end, wherein the localization net takes U as input and outputs a change parameter Θ, the parameter Θ being used for mapping the coordinate relationship between U and V; the Grid generator calculates the coordinate point in U corresponding to each coordinate point in V according to the coordinate points in V and the change parameter Θ; the Sampler output end fills V according to the series of coordinates obtained by the Grid generator and the original picture U;
s3: license plate recognition;
the step S3 comprises the following steps: the license plate is recognized by using an LPRNet, wherein the LPRNet comprises an input end, a Backbone end, a Neck end and a Head; the input end is a 94*24 RGB picture; the Backbone end comprises Small Basic Block modules, Convolution, MaxPooling, avg pool and Dropout; the Neck end fuses four scales through the avg pool layer; the Head is a 1*1 convolution, the number of input channels is 448 plus the number of predicted alphanumeric categories, and the number of output channels is the number of predicted alphanumeric categories.
2. The yolov5-based variable-length license plate recognition method of claim 1, wherein the step of collecting vehicle photos under complex and changeable environments in S11 comprises: collecting license plate photos under rainy, cloudy, tilted and blurred conditions.
3. The yolov5-based variable-length license plate recognition method of claim 1, wherein when removing picture data with higher similarity in S11, the similarity of the pictures is calculated through a histogram.
4. The yolov5-based variable-length license plate recognition method of claim 1, wherein the step of manually classifying license plates of different types in S11 comprises: manually classifying the license plates into new energy license plates, yellow license plates, two-place license plates and common blue license plates.
5. The yolov5-based variable-length license plate recognition method of claim 1, wherein in S11 the images in the picture data set are divided into a training data set, a verification data set and a test data set at a ratio of 8:1:1.
6. The yolov5-based variable-length license plate recognition method of claim 1, wherein the loss function of the YOLOV5S model in S12 is: Loss = λ1*Lcls + λ2*Lobj + λ3*Lloc, wherein λ1 is 0.5, λ2 is 1, λ3 is 0.05, Lcls is the classification loss of positive samples, Lobj is the confidence loss, and Lloc is the localization loss of positive samples.
7. The yolov5-based variable-length license plate recognition method of claim 1, wherein S12 further comprises the steps of: in the process of training the YOLOV5S model, expanding the training data set through data enhancement methods, wherein the data enhancement methods comprise picture rotation, picture translation, picture flipping and picture random cropping.
8. The yolov5-based variable-length license plate recognition method of claim 1, wherein in the training of the YOLOV5S model, the training period is 100 epochs, the learning_rate is 0.01, the learning rate is adjusted once every 20 epochs, the momentum is 0.9, weight files are generated, and the best model weight parameters are selected as the weights of the best YOLOV5 model to obtain the best YOLOV5 model.
9. The yolov5-based variable-length license plate recognition method of claim 1, further comprising the step of: post-processing and decoding, namely using argmax to find the category corresponding to the maximum probability at each position in the sequence to obtain a sequence of length 18, and removing blanks and duplicates from the sequence to obtain the final predicted sequence.
CN202310219782.7A 2023-03-09 2023-03-09 Variable-length license plate recognition method based on yolov5 Pending CN116189191A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310219782.7A CN116189191A (en) 2023-03-09 2023-03-09 Variable-length license plate recognition method based on yolov5

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310219782.7A CN116189191A (en) 2023-03-09 2023-03-09 Variable-length license plate recognition method based on yolov5

Publications (1)

Publication Number Publication Date
CN116189191A true CN116189191A (en) 2023-05-30

Family

ID=86440373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310219782.7A Pending CN116189191A (en) 2023-03-09 2023-03-09 Variable-length license plate recognition method based on yolov5

Country Status (1)

Country Link
CN (1) CN116189191A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704487A (en) * 2023-06-12 2023-09-05 三峡大学 License plate detection and recognition method based on Yolov5s network and CRNN
CN116704487B (en) * 2023-06-12 2024-06-11 三峡大学 License plate detection and identification method based on Yolov s network and CRNN
CN117010459A (en) * 2023-10-07 2023-11-07 浙江大学 Method for automatically generating neural network based on modularization and serialization
CN117010459B (en) * 2023-10-07 2024-02-09 浙江大学 Method for automatically generating neural network based on modularization and serialization
CN117392659A (en) * 2023-12-12 2024-01-12 深圳市城市交通规划设计研究中心股份有限公司 Vehicle license plate positioning method based on parameter-free attention mechanism optimization

Similar Documents

Publication Publication Date Title
CN112287940B (en) Semantic segmentation method of attention mechanism based on deep learning
CN111325203B (en) American license plate recognition method and system based on image correction
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN111047554B (en) Composite insulator overheating defect detection method based on instance segmentation
CN116189191A (en) Variable-length license plate recognition method based on yolov5
CN113902915B (en) Semantic segmentation method and system based on low-light complex road scene
CN112837315B (en) Deep learning-based transmission line insulator defect detection method
CN114155527A (en) Scene text recognition method and device
CN109902609A (en) A kind of road traffic sign detection and recognition methods based on YOLOv3
CN112287941B (en) License plate recognition method based on automatic character region perception
CN112990065B (en) Vehicle classification detection method based on optimized YOLOv5 model
CN113255659A (en) License plate correction detection and identification method based on MSAFF-yolk 3
CN113160062A (en) Infrared image target detection method, device, equipment and storage medium
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN115424017B (en) Building inner and outer contour segmentation method, device and storage medium
CN112163508A (en) Character recognition method and system based on real scene and OCR terminal
CN112784834A (en) Automatic license plate identification method in natural scene
CN114155474A (en) Damage identification technology based on video semantic segmentation algorithm
CN114332942A (en) Night infrared pedestrian detection method and system based on improved YOLOv3
CN116630702A (en) Pavement adhesion coefficient prediction method based on semantic segmentation network
CN114066861B (en) Coal gangue identification method based on intersection algorithm edge detection theory and visual characteristics
CN113192018B (en) Water-cooled wall surface defect video identification method based on fast segmentation convolutional neural network
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN115424243A (en) Parking stall number identification method, equipment and medium based on yolov5-shufflenetv2
CN114882469A (en) Traffic sign detection method and system based on DL-SSD model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination