CN118155012A - YOLOv5s-based lightweight vehicle-mounted model training method - Google Patents
YOLOv5s-based lightweight vehicle-mounted model training method
- Publication number
- CN118155012A (application CN202311806698.1A)
- Authority
- CN
- China
- Prior art keywords
- model
- yolov5s
- network
- training
- network model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a YOLOv5s-based lightweight vehicle-mounted model training method, and relates to the technical field of lightweight vehicle-mounted models. The method comprises at least the following steps. S1: acquire an original image dataset and preprocess it to obtain a target image dataset, where the original image dataset is a large collection of images containing the targets to be detected, captured by an automobile camera, and the preprocessing comprises data labeling and data enhancement. S2: train a YOLOv5s network model with the target image dataset to obtain a first network model. S3: improve the YOLOv5s network based on the lightweight MobileOne network and the Slim-Neck structure to obtain a lightweight YOLOv5s network model. By preprocessing the original image set with data labeling and data enhancement to obtain the target image dataset, the method broadens the breadth and distribution space of the data, which in turn effectively ensures the detection precision and generalization capability of the lightweight YOLOv5s network model.
Description
Technical Field
The invention relates to the technical field of lightweight vehicle-mounted models, and in particular to a YOLOv5s-based lightweight vehicle-mounted model training method.
Background
The task of object detection is to find all objects of interest in an image and determine their category and location. It is the premise and basis of many computer vision tasks, and has long been one of the most challenging problems in the field of computer vision.
With extensive experimental research, more and more image processing and recognition technologies have continued to emerge. In particular, the application and popularization in recent years of artificial intelligence technologies represented by deep learning have provided an important new paradigm for object detection: suitable target features can be computed automatically simply by building an appropriate network model and training it on a dataset.
With the rapid development of deep learning, automatic driving and new-energy vehicles, more and more models need to run on the vehicle-mounted side. However, as network layers continue to deepen, object detection models become increasingly complex and their computational requirements keep growing, which makes them difficult to run on the vehicle-mounted side; it is also difficult to balance detection accuracy and speed. The YOLO series of networks has the advantages of high detection speed and strong real-time performance, and is widely applied in the field of real-time object detection, but existing YOLO algorithms still cannot satisfy vehicle-mounted application scenarios in terms of accuracy and speed.
There is therefore a need for a new solution to the above problems.
Disclosure of Invention
The invention aims to provide a YOLOv5s-based lightweight vehicle-mounted model training method.
In order to achieve the above purpose, the present invention provides the following technical solution: a YOLOv5s-based lightweight vehicle-mounted model training method comprising at least the following steps:
S1: acquiring an original image dataset and preprocessing it to obtain a target image dataset, wherein the original image dataset is a large collection of images containing the targets to be detected, captured by an automobile camera, and the preprocessing comprises data labeling and data enhancement;
S2: training a YOLOv5s network model with the target image dataset to obtain a first network model;
S3: improving the YOLOv5s network based on the lightweight MobileOne network and the Slim-Neck structure to obtain a lightweight YOLOv5s network model;
S4: taking the first network model as a teacher model and the lightweight YOLOv5s network model as a student model, and performing knowledge distillation to obtain a target network model.
Preferably, the data labeling tags each image in the dataset with the LabelImg tool, converts the annotations into a format that YOLOv5s can read for training, and finally divides the dataset into a training set and a validation set.
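As an illustrative sketch of the labeling step (not part of the claimed method; the function names and the 80/20 split ratio are assumptions), LabelImg pixel-coordinate annotations can be converted into the normalized `class x_center y_center width height` text lines that YOLOv5 reads, and the labeled images divided into training and validation sets:

```python
import random

def to_yolo_format(box, img_w, img_h):
    """Convert a (class_id, x_min, y_min, x_max, y_max) pixel-space box
    to the normalized 'class x_center y_center width height' line YOLOv5 reads."""
    cls, x_min, y_min, x_max, y_max = box
    x_c = (x_min + x_max) / 2 / img_w
    y_c = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

def split_dataset(image_names, val_ratio=0.2, seed=0):
    """Shuffle the labeled images reproducibly and divide them into
    a training set and a validation set."""
    names = sorted(image_names)
    random.Random(seed).shuffle(names)
    n_val = int(len(names) * val_ratio)
    return names[n_val:], names[:n_val]  # (train, val)
```

In practice each returned line would be written to a `.txt` file alongside its image, which is the layout YOLOv5's dataloader expects.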
Preferably, the data enhancement methods comprise the Mosaic algorithm and the Mixup algorithm.
The Mosaic algorithm splices 4 pictures into one large picture by random scaling, random cropping and random arrangement, thereby increasing the diversity of the training data.
The Mixup algorithm randomly selects two pictures and blends them in a certain proportion to generate a new image as enhanced data.
Data labeling is performed on the original image set to obtain a first image set, and data enhancement is performed on the first image set to obtain the target image dataset; this broadens the breadth and distribution space of the data, thereby effectively ensuring the detection precision and generalization capability of the lightweight YOLOv5s network model.
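The Mixup step above can be sketched as follows (an illustrative example, not the patent's implementation; the Beta-distribution parameter `alpha` and the convention of pooling both images' boxes are assumptions borrowed from common YOLOv5 practice):

```python
import numpy as np

def mixup(img1, img2, labels1, labels2, alpha=8.0):
    """Blend two same-sized images as in the Mixup augmentation.
    The mixing ratio lam is drawn from a Beta(alpha, alpha) distribution;
    for detection, the boxes of both source images are kept."""
    lam = np.random.beta(alpha, alpha)
    mixed = (lam * img1.astype(np.float32)
             + (1.0 - lam) * img2.astype(np.float32)).astype(img1.dtype)
    return mixed, labels1 + labels2
```

With a large `alpha`, lam concentrates near 0.5, so both source images stay clearly visible in the blend, which is why a high value is typically used for detection.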
Preferably, step S2 comprises at least the following steps: after configuring the relevant training parameters according to the computing resources of the server, the YOLOv5s network model is trained with the training set of the target image dataset; after training is completed, the trained YOLOv5s network model is evaluated with the validation set of the target image dataset until the result meets a preset index, yielding the first network model.
The evaluation uses the mainstream object-detection metrics in deep learning: mAP (mean average precision over classes), Precision, Recall and FLOPs (the number of floating-point operations performed by the model).
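The Precision and Recall metrics can be illustrated with a minimal matching routine (a sketch under the usual IoU-threshold convention; the greedy matching and the 0.5 threshold are assumptions, and real evaluations such as YOLOv5's additionally sweep confidence thresholds to compute mAP):

```python
def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def precision_recall(preds, gts, iou_thr=0.5):
    """Greedily match each predicted box to an unused ground-truth box at a
    given IoU threshold; return (precision, recall)."""
    matched = set()
    tp = 0
    for p in preds:
        best, best_iou = None, iou_thr
        for gi, g in enumerate(gts):
            if gi in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best, best_iou = gi, v
        if best is not None:
            matched.add(best)
            tp += 1
    fp = len(preds) - tp
    fn = len(gts) - tp
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall
```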
When the YOLOv5s network model is trained, the original CIoU loss function is replaced with the MPDIoU loss function. MPDIoU contains all the relevant factors considered by existing loss functions, namely the overlapping or non-overlapping area, the center-point distance and the width-height deviation, while simplifying the calculation process.
The MPDIoU loss function can be expressed as:
L_MPDIoU = 1 − IoU + d1² / (w² + h²) + d2² / (w² + h²)
where IoU denotes the conventional intersection over union, d1 denotes the distance between the top-left corner of the ground-truth box and that of the predicted box, d2 denotes the distance between the bottom-right corner of the ground-truth box and that of the predicted box, w denotes the width of the input image, and h denotes the height of the input image.
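The loss for a single predicted/ground-truth box pair can be sketched as follows (illustrative only; the (x_min, y_min, x_max, y_max) box layout is an assumption):

```python
def mpdiou_loss(pred, gt, img_w, img_h):
    """MPDIoU loss: 1 - IoU + d1^2/(w^2+h^2) + d2^2/(w^2+h^2), where d1/d2
    are the top-left / bottom-right corner distances between the two boxes
    and (w, h) is the input-image size."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # plain intersection-over-union
    ix = max(0.0, min(px2, gx2) - max(px1, gx1))
    iy = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = ix * iy
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou_val = inter / union if union else 0.0
    # squared corner distances, normalized by the squared image diagonal
    d1_sq = (px1 - gx1) ** 2 + (py1 - gy1) ** 2
    d2_sq = (px2 - gx2) ** 2 + (py2 - gy2) ** 2
    norm = img_w ** 2 + img_h ** 2
    return 1.0 - iou_val + d1_sq / norm + d2_sq / norm
```

A perfectly matching prediction gives a loss of 0, and both a shrinking overlap and a growing corner distance increase the loss, which is what drives the faster convergence claimed for MPDIoU.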
Preferably, the MobileOne network in S3 is an efficient mobile-side backbone network recently released by Apple's research team.
The lightweight MobileOne network in S3 is the MobileOne network after its original SE attention mechanism module has been replaced with the CoordAttention attention mechanism module.
The Slim-Neck structure is a Neck structure in which the lightweight convolution technique GSConv is introduced to replace standard convolution.
CoordAttention is a lightweight attention mechanism module based on coordinate attention. Replacing the SE attention mechanism modules in the MobileOne network with CoordAttention modules further reduces the computational complexity of the MobileOne network and also helps it locate and identify the position of the target object in the input image more accurately.
The computational cost of the lightweight convolution technique GSConv is about 60%-70% of that of standard convolution, so the amount of calculation can be effectively reduced while the output stays as close to that of standard convolution as possible.
The lightweight YOLOv5s network model obtained in this way not only reduces the parameter count of the model and its hardware resource consumption, but also effectively ensures the detection precision and generalization capability of the network model.
Preferably, step S4 comprises at least the following steps:
the first network model obtained in step S2 after training on the target image dataset is used as the teacher model, the lightweight YOLOv5s network model obtained by the improvement in step S3 is used as the student model, the student model is fine-tuned with knowledge distillation, and the evaluation method of step S2 is finally applied to obtain a target network model that meets expectations.
During distillation, the FGFI (fine-grained feature imitation) method is adopted. Its core idea is that the teacher model should pass more of the key effective information to the student model, rather than ineffective background information.
In general, the feature maps near the key positions of the target areas contain important information from the teacher model, so the key positions near the target areas can be estimated first, and the student model can then be made to imitate the teacher model's feature maps at these positions to obtain better performance.
The specific operation of distillation is as follows:
For each ground-truth box, the IoU between the box and each candidate location is computed, yielding a W×H×K IoU map (W is the width of the feature map, H is the height of the feature map, and K is the number of key positions), denoted M. The maximum value m = max(M) is taken, and the threshold F is computed from m:
F = ψ · m
where ψ is a filtering factor (ψ = 0.5 in the original fine-grained feature imitation method).
Positions whose IoU value is below F are filtered out, and the remaining positions are combined by an OR operation to obtain a W×H mask.
After this operation has been performed for all ground-truth boxes, the per-box masks are combined to obtain the final FGFI mask, which contains the information of the key positions of the target areas.
The distance between the teacher-model and student-model feature maps is computed as:
d(i, j) = Σ_c ( f_adap(s)_{i,j,c} − t_{i,j,c} )²
where (i, j) denotes a spatial position, c denotes a channel, s and t are the network feature maps of the student model and the teacher model respectively, and f_adap is an adaptation function that projects the student features to the channel dimension of the teacher features.
For all estimated key positions, the distillation objective is to minimize the distance between the student-model and teacher-model feature maps at these positions, i.e. to minimize:
L_imitation = (1 / (2·N_p)) · Σ_{i,j} Σ_c I_{i,j} · ( f_adap(s)_{i,j,c} − t_{i,j,c} )²
where I is the mask obtained by the above operations and N_p is the number of positive positions in the mask. The loss function of the final student model is:
L = L_MPDIoU + L_imitation
where L_MPDIoU is the MPDIoU loss function used when training the teacher model.
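The masking and imitation-loss steps above can be sketched in NumPy (an illustrative reconstruction, not the patent's code; the array shapes, the psi factor and the single-ground-truth-box handling are assumptions):

```python
import numpy as np

def fgfi_mask(iou_map, psi=0.5):
    """Build the FGFI imitation mask from a W×H×K IoU map for one
    ground-truth box: threshold F = psi * max(iou_map), then keep locations
    where any of the K candidates reaches F (OR over the last axis)."""
    thr = psi * iou_map.max()
    return (iou_map >= thr).any(axis=-1)  # W×H boolean mask

def imitation_loss(student, teacher, mask):
    """Masked squared distance between (already adapted) student and teacher
    W×H×C feature maps, normalized by twice the number of positive positions."""
    n_pos = mask.sum()
    if n_pos == 0:
        return 0.0
    diff = (student - teacher) ** 2
    return float((diff * mask[..., None]).sum() / (2 * n_pos))
```

For several ground-truth boxes, the per-box masks would be combined with a logical OR before computing the loss, matching the description above.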
By adopting the model distillation technique, knowledge is distilled from the trained YOLOv5s network model into the lightweight YOLOv5s network model, which ensures the detection precision of the target network model. The FGFI method applies fine-grained features to the model distillation process, helping the student model capture richer feature information; by introducing the supervision signal of the original object detection task during training, FGFI better preserves the detail features of the complex model, thereby improving the performance of the lightweight model.
Compared with the prior art, the invention has the following beneficial effects:
1. By preprocessing the original image set with data labeling and data enhancement to obtain the target image dataset, the method broadens the breadth and distribution space of the data, thereby effectively ensuring the detection precision and generalization capability of the lightweight YOLOv5s network model.
2. During training, the MPDIoU loss function is adopted in place of the original CIoU loss function. MPDIoU contains all the relevant factors considered by existing loss functions, namely the overlapping or non-overlapping area, the center-point distance and the width-height deviation, while simplifying the calculation process; this enables the model to reach higher precision through training while accelerating its convergence.
3. Replacing the SE attention mechanism module in the MobileOne network with the CoordAttention attention mechanism module further reduces the computational complexity of the MobileOne network and helps it locate and identify the position of the target object in the input image more accurately. The computational cost of the lightweight convolution technique GSConv is about 60%-70% of that of standard convolution, so the amount of calculation can be effectively reduced while the output stays as close to that of standard convolution as possible. The lightweight YOLOv5s network model obtained in this way not only reduces the parameter count of the model and its hardware resource consumption, but also effectively ensures the detection precision and generalization capability of the network model.
4. By adopting the model distillation technique, knowledge is distilled from the trained YOLOv5s network model into the lightweight YOLOv5s network model, which ensures the detection precision of the target network model. The FGFI method applies fine-grained features to the model distillation process, helping the student model capture richer feature information; by introducing the supervision signal of the original object detection task during training, FGFI better preserves the detail features of the complex model, thereby improving the performance of the lightweight model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the YOLOv5s-based lightweight vehicle-mounted model training method;
FIG. 2 is a schematic diagram of the structure of the CoordAttention attention mechanism module introduced in the present invention;
FIG. 3 is a schematic diagram of the structure of the MobileOne network introduced in the present invention;
FIG. 4 is a schematic diagram of the structure of the GSConv network introduced in the present invention;
FIG. 5 is a schematic diagram of the model distillation structure employing the fine-grained feature imitation method of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments.
Referring to FIGS. 1-5, a YOLOv5s-based lightweight vehicle-mounted model training method comprises at least the following steps:
S1: acquiring an original image dataset and preprocessing it to obtain a target image dataset, wherein the original image dataset refers to a large collection of images containing the targets to be detected, captured by an automobile camera, and the preprocessing comprises data labeling and data enhancement;
S2: training a YOLOv5s network model with the target image dataset to obtain a first network model;
S3: improving the YOLOv5s network based on the lightweight MobileOne network and the Slim-Neck structure to obtain a lightweight YOLOv5s network model;
S4: taking the first network model as a teacher model and the lightweight YOLOv5s network model as a student model, and performing knowledge distillation to obtain a target network model.
The data labeling tags each image in the dataset with the LabelImg tool, converts the annotations into a format that YOLOv5s can read for training, and finally divides the dataset into a training set and a validation set.
The data enhancement methods comprise the Mosaic algorithm and the Mixup algorithm.
The Mosaic algorithm splices 4 pictures into one large picture by random scaling, random cropping and random arrangement, thereby increasing the diversity of the training data.
The Mixup algorithm randomly selects two pictures and blends them in a certain proportion to generate a new image as enhanced data.
Data labeling is performed on the original image set to obtain a first image set, and data enhancement is performed on the first image set to obtain the target image dataset; this broadens the breadth and distribution space of the data, thereby effectively ensuring the detection precision and generalization capability of the lightweight YOLOv5s network model.
Step S2 comprises at least the following steps: after configuring the relevant training parameters according to the computing resources of the server, the YOLOv5s network model is trained with the training set of the target image dataset; after training is completed, the trained YOLOv5s network model is evaluated with the validation set of the target image dataset until the result meets a preset index, yielding the first network model.
During evaluation, the mainstream object-detection metrics in deep learning are used: mAP (mean average precision over classes), Precision, Recall and FLOPs (the number of floating-point operations performed by the model).
When the YOLOv5s network model is trained, the original CIoU loss function is replaced with the MPDIoU loss function. MPDIoU contains all the relevant factors considered by existing loss functions, namely the overlapping or non-overlapping area, the center-point distance and the width-height deviation, while simplifying the calculation process.
The MPDIoU loss function can be expressed as:
L_MPDIoU = 1 − IoU + d1² / (w² + h²) + d2² / (w² + h²)
where IoU denotes the conventional intersection over union, d1 denotes the distance between the top-left corner of the ground-truth box and that of the predicted box, d2 denotes the distance between the bottom-right corner of the ground-truth box and that of the predicted box, w denotes the width of the input image, and h denotes the height of the input image.
The MobileOne network in S3 is an efficient mobile-side backbone network recently released by Apple's research team.
The lightweight MobileOne network in S3 is the MobileOne network after its original SE attention mechanism module has been replaced with the CoordAttention attention mechanism module.
The Slim-Neck structure is a Neck structure in which the lightweight convolution technique GSConv is introduced to replace standard convolution.
CoordAttention is a lightweight attention mechanism module based on coordinate attention. Replacing the SE attention mechanism modules in the MobileOne network with CoordAttention modules further reduces the computational complexity of the MobileOne network and also helps it locate and identify the position of the target object in the input image more accurately.
The computational cost of the lightweight convolution technique GSConv is about 60%-70% of that of standard convolution, so the amount of calculation can be effectively reduced while the output stays as close to that of standard convolution as possible.
The lightweight YOLOv5s network model obtained in this way not only reduces the parameter count of the model and its hardware resource consumption, but also effectively ensures the detection precision and generalization capability of the network model.
Step S4 comprises at least the following steps:
the first network model obtained in step S2 after training on the target image dataset is used as the teacher model, the lightweight YOLOv5s network model obtained by the improvement in step S3 is used as the student model, the student model is fine-tuned with knowledge distillation, and the evaluation method of step S2 is finally applied to obtain a target network model that meets expectations.
During distillation, the FGFI (fine-grained feature imitation) method is adopted. Its core idea is that the teacher model should pass more of the key effective information to the student model, rather than ineffective background information.
In general, the feature maps near the key positions of the target areas contain important information from the teacher model, so the key positions near the target areas can be estimated first, and the student model can then be made to imitate the teacher model's feature maps at these positions to obtain better performance.
The specific operation of distillation is as follows:
For each ground-truth box, the IoU between the box and each candidate location is computed, yielding a W×H×K IoU map (W is the width of the feature map, H is the height of the feature map, and K is the number of key positions), denoted M. The maximum value m = max(M) is taken, and the threshold F is computed from m:
F = ψ · m
where ψ is a filtering factor (ψ = 0.5 in the original fine-grained feature imitation method).
Positions whose IoU value is below F are filtered out, and the remaining positions are combined by an OR operation to obtain a W×H mask.
After this operation has been performed for all ground-truth boxes, the per-box masks are combined to obtain the final FGFI mask, which contains the information of the key positions of the target areas.
The distance between the teacher-model and student-model feature maps is computed as:
d(i, j) = Σ_c ( f_adap(s)_{i,j,c} − t_{i,j,c} )²
where (i, j) denotes a spatial position, c denotes a channel, s and t are the network feature maps of the student model and the teacher model respectively, and f_adap is an adaptation function.
For all estimated key positions, the distillation objective is to minimize the distance between the student-model and teacher-model feature maps at these positions, i.e. to minimize:
L_imitation = (1 / (2·N_p)) · Σ_{i,j} Σ_c I_{i,j} · ( f_adap(s)_{i,j,c} − t_{i,j,c} )²
where I is the mask obtained by the above operations and N_p is the number of positive positions in the mask. The loss function of the final student model is:
L = L_MPDIoU + L_imitation
where L_MPDIoU is the MPDIoU loss function used when training the teacher model.
By adopting the model distillation technique, knowledge is distilled from the trained YOLOv5s network model into the lightweight YOLOv5s network model, which ensures the detection precision of the target network model. The FGFI method applies fine-grained features to the model distillation process, helping the student model capture richer feature information; by introducing the supervision signal of the original object detection task during training, FGFI better preserves the detail features of the complex model, thereby improving the performance of the lightweight model.
In summary, the working flow of the invention is as follows:
1. Based on the lightweight MobileOne network and the Slim-Neck structure, improve the YOLOv5s network to obtain a lightweight YOLOv5s network model;
2. Acquire an image dataset of the targets to be detected and preprocess it to obtain a target image dataset; train the YOLOv5s network model with the target image dataset, replacing the training loss function with the MPDIoU loss function, to obtain a first network model;
3. Take the first network model as the teacher model and the lightweight YOLOv5s network model as the student model, and perform knowledge distillation on the target image dataset with the fine-grained feature imitation method to obtain the target network model.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Claims (6)
1. A YOLOv5s-based lightweight vehicle-mounted model training method, characterized by comprising at least the following steps:
S1: acquiring an original image dataset and preprocessing it to obtain a target image dataset, wherein the original image dataset is a large collection of images containing the targets to be detected, captured by an automobile camera, and the preprocessing comprises data labeling and data enhancement;
S2: training a YOLOv5s network model with the target image dataset to obtain a first network model;
S3: improving the YOLOv5s network based on the lightweight MobileOne network and the Slim-Neck structure to obtain a lightweight YOLOv5s network model;
S4: taking the first network model as a teacher model and the lightweight YOLOv5s network model as a student model, and performing knowledge distillation to obtain a target network model.
2. The YOLOv5s-based lightweight vehicle-mounted model training method as claimed in claim 1, wherein: the data labeling tags each image in the dataset with the LabelImg tool, converts the annotations into a format that YOLOv5s can read for training, and finally divides the dataset into a training set and a validation set.
3. The YOLOv5s-based lightweight vehicle-mounted model training method as claimed in claim 2, wherein: the data enhancement methods comprise the Mosaic algorithm and the Mixup algorithm;
the Mosaic algorithm splices 4 pictures into one large picture by random scaling, random cropping and random arrangement, thereby increasing the diversity of the training data;
the Mixup algorithm randomly selects two pictures and blends them in a certain proportion to generate a new image as enhanced data;
data labeling is performed on the original image set to obtain a first image set, and data enhancement is performed on the first image set to obtain the target image dataset, which broadens the breadth and distribution space of the data, thereby effectively ensuring the detection precision and generalization capability of the lightweight YOLOv5s network model.
4. The YOLOv5s-based lightweight vehicle-mounted model training method as claimed in claim 1, wherein step S2 comprises at least the following steps: after configuring the relevant training parameters according to the computing resources of the server, the YOLOv5s network model is trained with the training set of the target image dataset; after training is completed, the trained YOLOv5s network model is evaluated with the validation set of the target image dataset until the result meets a preset index, yielding the first network model;
the evaluation uses the mainstream object-detection metrics in deep learning: mAP, Precision, Recall and FLOPs;
when the YOLOv5s network model is trained, the original CIoU loss function is replaced with the MPDIoU loss function; MPDIoU contains all the relevant factors considered by existing loss functions, namely the overlapping or non-overlapping area, the center-point distance and the width-height deviation, while simplifying the calculation process;
the MPDIoU loss function can be expressed as:
L_MPDIoU = 1 − IoU + d1² / (w² + h²) + d2² / (w² + h²)
where IoU denotes the conventional intersection over union, d1 denotes the distance between the top-left corner of the ground-truth box and that of the predicted box, d2 denotes the distance between the bottom-right corner of the ground-truth box and that of the predicted box, w denotes the width of the input image, and h denotes the height of the input image.
5. The YOLOv5s-based lightweight vehicle-mounted model training method as claimed in claim 1, wherein: the lightweight MobileOne network in S3 is the MobileOne network after its original SE attention mechanism module has been replaced with the CoordAttention attention mechanism module;
the Slim-Neck structure is a Neck structure in which the lightweight convolution technique GSConv is introduced to replace standard convolution.
6. The YOLOv s-based lightweight vehicle model training method as claimed in claim 1, wherein: the step S4 at least comprises the following steps:
Using a first network model obtained after training the target image dataset in the step S2 as a teacher model, using a lightweight YOLOv S network model obtained by improvement in the step S3 as a student model, performing fine tuning training on the student model by using knowledge distillation, and finally obtaining a target network model which meets expectations by adopting an evaluation method in the step S2;
the FGFI (fine-grained feature imitation) technique is adopted during distillation;
the specific operation of distillation is as follows:
For each ground-truth box, the IoU between the box and every anchor is calculated, yielding a W×H×K IoU map (W is the width of the feature map, H is its height, and K is the number of anchors per position), denoted m; the maximum value M = max(m) is taken, and the threshold F is computed from M;
positions whose IoU values are lower than F are filtered out, and the IoU maps of the remaining positions are combined by an OR operation into a W×H mask;
After this operation has been applied to all ground-truth boxes, the individual masks are merged to obtain the final FGFI mask, which encodes the key positions of the target regions;
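The mask construction above can be sketched in NumPy as follows. The claim does not give the formula relating F to M; a filter-ratio hyperparameter `psi` with F = psi · M (one common choice in fine-grained imitation methods) is assumed here for illustration:

```python
import numpy as np

def fgfi_mask(iou_maps, psi=0.5):
    """Build the W x H imitation mask from per-ground-truth-box IoU maps.

    iou_maps: list of (W, H, K) arrays, one per ground-truth box, giving the
    IoU of that box with the K anchors at each spatial position.
    psi: assumed filter ratio; the threshold is F = psi * max(m).
    """
    mask = None
    for m in iou_maps:
        f = psi * m.max()                 # threshold F from this box's maximum IoU
        keep = (m >= f).any(axis=2)       # OR over the K anchors -> (W, H) bool
        mask = keep if mask is None else (mask | keep)
    return mask
```

Each ground-truth box contributes the positions where at least one anchor overlaps it strongly, and the per-box masks are OR-merged into the final mask.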
The distance between the teacher-model and student-model feature maps is computed as:

d(i, j) = Σ_c ( f_adap(s)_{ijc} − t_{ijc} )²

where (i, j) denotes a spatial position, c a channel, s and t are the network feature maps of the student model and the teacher model respectively, and f_adap is the adaptation function that matches the student's channel dimension to the teacher's;
for all estimated key positions, the distillation objective is to minimize the distance between the student-model and teacher-model feature maps at these positions, i.e. to minimize:

L_imitation = 1/(2N_p) · Σ_{i,j} I_{ij} Σ_c ( f_adap(s)_{ijc} − t_{ijc} )²

where I is the mask obtained by the above operation and N_p is the number of positive positions in the mask; the loss function of the final student model is then:
L = L_MPDIoU + L_imitation
where L_MPDIoU is the MPDIoU loss function used in training the teacher model.
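The masked imitation term can be sketched in NumPy as below (an illustration of the masked squared-distance loss, not the patent's code; the 1×1-convolution adaptation layer f_adap is assumed to have already been applied to the student features so both maps share the channel dimension):

```python
import numpy as np

def imitation_loss(student_feat, teacher_feat, mask):
    """Fine-grained imitation loss over the masked key positions.

    student_feat, teacher_feat: (C, H, W) feature maps (student already
    channel-adapted); mask: (H, W) boolean FGFI mask.
    Returns 1/(2*N_p) * sum over masked positions of the squared
    channel-wise distance between the two feature maps.
    """
    n_p = int(mask.sum())                 # number of positive positions
    if n_p == 0:
        return 0.0
    diff_sq = (student_feat - teacher_feat) ** 2          # (C, H, W)
    return float((diff_sq * mask[None]).sum() / (2 * n_p))

# The student's total loss then adds this to the detection loss:
#   L = L_MPDIoU + imitation_loss(...)
```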
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311806698.1A CN118155012A (en) | 2023-12-26 | 2023-12-26 | YOLOv5 s-based lightweight vehicle model training method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118155012A true CN118155012A (en) | 2024-06-07 |
Family
ID=91292574
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311806698.1A Pending CN118155012A (en) | 2023-12-26 | 2023-12-26 | YOLOv5 s-based lightweight vehicle model training method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118155012A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination |