CN111415338A - Method and system for constructing target detection model - Google Patents

Method and system for constructing target detection model

Info

Publication number
CN111415338A
Authority
CN
China
Prior art keywords
network
training
constructing
target detection
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010181777.8A
Other languages
Chinese (zh)
Inventor
方思勰
毛云青
王国梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202010181777.8A priority Critical patent/CN111415338A/en
Publication of CN111415338A publication Critical patent/CN111415338A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for constructing a target detection model. The method for constructing the target detection model comprises the step of constructing a target detection network, wherein the target detection network comprises a backbone network, a connection network and a detection network, and the network structure of the backbone network is an inverted residual structure with a linear bottleneck. Through this design of the backbone network structure, the invention effectively reduces the memory required for forward computation and avoids large intermediate tensors, thereby reducing memory access and improving the detection efficiency of the resulting target detection model.

Description

Method and system for constructing target detection model
Technical Field
The invention relates to the field of image processing, in particular to a method and a system for constructing a target detection model.
Background
Target detection is a popular direction in computer vision and digital image processing, and is widely applied in fields such as robot navigation, intelligent video surveillance, industrial inspection and aerospace.
At present, the Darknet53-YOLOv3 network is commonly used for target detection, but detection based on this network is slow; in actual use, an additional GPU usually has to be purchased to increase the computation rate and meet detection requirements, so the detection cost is high.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a method and a system for constructing a target detection model.
In order to solve the technical problem, the invention is solved by the following technical scheme:
a method for constructing an object detection model comprises the step of constructing an object detection network, wherein the object detection network comprises a backbone network, a connection network and a detection network, and the network structure of the backbone network is an inverted residual error structure with a linear bottleneck.
As one possible implementation, the backbone network employs efcientnetb 4.
As an implementation, a feature pyramid network with a network structure of a residual structure is constructed as a connection network.
As an implementation manner, the connection network adopts a feature pyramid network based on MBConv and Conv2 d.
As one possible implementation, the detection network employs yolov 3.
As an implementable embodiment, the method further includes the step of training the target detection network, and the specific steps are as follows:
acquiring sample data based on a detection target and labeling the sample data to acquire training data;
training parameters of a target detection network by using an Adam optimization algorithm and the training data until a loss value generated in the training process reaches a preset termination condition, and outputting the target detection network obtained by training as a target detection model;
the loss values include a confidence loss value, a classification loss value, and a GIoU loss value.
The invention also provides a system for constructing the target detection model, which comprises a network construction module and a model training module, wherein the network construction module is used for constructing the target detection network, the target detection network comprises a backbone network, a connection network and a detection network, and the network construction module comprises a backbone network construction unit, a connection network construction unit and a detection network construction unit;
and the backbone network construction unit is used for constructing a backbone network whose structure is an inverted residual structure with a linear bottleneck.
As an implementable embodiment:
the connection network construction unit is used for constructing a characteristic pyramid network with a network structure of a residual error structure as a connection network.
As an implementable embodiment, the model training module is configured to:
acquiring sample data based on a detection target and labeling the sample data to acquire training data;
training parameters of a target detection network by using an Adam optimization algorithm and the training data until a loss value generated in the training process reaches a preset termination condition, and outputting the target detection network obtained by training as a target detection model; the loss values include a confidence loss value, a classification loss value, and a GIoU loss value.
The invention also proposes a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of any of the methods described above.
Due to the adoption of the technical scheme, the invention has the remarkable technical effects that:
1. In the invention, the backbone network adopts an inverted residual structure with a linear bottleneck, which effectively reduces the memory required for forward computation and needs no large intermediate tensors, thereby reducing memory access and improving the detection efficiency of the resulting target detection model.
2. In the invention, the design of the connection network introduces an attention mechanism, so that the performance of the backbone network is fully exploited, the detection effect of the constructed target detection model is greatly improved, and the mAP of the target detection model is increased.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic flow chart of a method for constructing a target detection model according to the present invention;
FIG. 2 is an architecture diagram of an object detection network in embodiment 2;
FIG. 3 is an architecture diagram of the FPN_block in FIG. 2;
FIG. 4 is a schematic diagram of module connections of a system for constructing an object detection model according to the present invention.
Detailed Description
The present invention will be described in further detail below with reference to embodiments, which illustrate the present invention and are not to be construed as limiting it.
Embodiment 1, a method for constructing a target detection model, as shown in fig. 1, includes the following steps:
s100, constructing a target detection network;
s200, determining a detection target, acquiring sample data based on the detection target, labeling the sample data, and acquiring training data; and training the target detection network by using the training data to obtain and output a target detection model.
Training completion can be judged in either of the following two ways:
A threshold on the number of training iterations is preset; when the number of iterations of the target detection network reaches this threshold, training is judged to be complete, and the target detection model is obtained and output.
Alternatively, the loss value of the target detection network is calculated; when the loss value stabilizes, training is judged to be complete, and the target detection model is obtained and output. In this embodiment, the loss value is used as the index for model training.
The target detection network in step S100 includes a backbone network, a connection network, and a detection network;
the backbone network is used for extracting the features of the image to be detected to obtain the image features;
the detection network is used for analyzing the image characteristics and outputting a target detection result;
the connection network is used for connecting the backbone network and the detection network.
The backbone network of the existing, publicly disclosed Darknet53-YOLOv3 network is Darknet53, its detection network is YOLOv3, and its connection network is a connection layer that links the backbone network and the detection network.
In this embodiment, a feature extraction network is constructed and used to replace the Darknet53 network in the existing Darknet53-YOLOv3 network, thereby obtaining the target detection network;
the network structure of the feature extraction network is an inverted residual structure with a linear bottleneck.
In an ordinary residual structure, the main branch comprises three convolutions, and the two point-wise (1x1) convolutions have a large number of channels; the inverted residual structure is the opposite: the middle convolution (a depthwise separable convolution is used) has a large number of channels while the two ends have few, and the inverted residual structure preserves the expressive power of the model.
The inverted residual structure with a linear bottleneck expands the low-dimensional compressed input to a high dimension, filters it with a lightweight depthwise convolution, and then projects the features back to a low-dimensional compressed representation through a linear bottleneck; in this embodiment, seven such layers are stacked to form the basic model, i.e., the backbone network.
In this embodiment, because the backbone network adopts an inverted residual structure with a linear bottleneck, the memory required for forward computation is effectively reduced and no large intermediate tensors are needed, which reduces memory access and therefore improves the detection efficiency of the resulting target detection model.
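For illustration, the following is a minimal sketch of one such inverted residual block with a linear bottleneck (MobileNetV2/MBConv style), written with TensorFlow 2.x/Keras; the expansion factor, kernel size and channel counts are assumptions for the example, not values disclosed above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual_block(x, out_channels, expansion=6, stride=1):
    """Expand -> depthwise filter -> linear (activation-free) projection."""
    in_channels = x.shape[-1]
    # 1) Expand the low-dimensional input to a higher dimension (1x1 conv).
    h = layers.Conv2D(in_channels * expansion, 1, padding="same", use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    # 2) Filter with a lightweight depthwise convolution.
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    # 3) Project back to a low-dimensional representation with a linear
    #    bottleneck: no activation after this 1x1 convolution.
    h = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    # Shortcut connection only when the spatial size and channel count match.
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])
    return h

# Assumed usage:
# inputs = tf.keras.Input((416, 416, 3))
# x = inverted_residual_block(inputs, out_channels=24, expansion=6, stride=2)
```

Keeping the final projection linear is the key design choice: a non-linearity on the narrow bottleneck would destroy information that the block has just compressed.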
Further, the backbone network adopts EfficientNet-B4.
The model parameter size of a target detection model constructed on the Darknet53-YOLOv3 network is 240 M; when EfficientNet-B4 replaces Darknet53 as the backbone network, the model parameter size of the constructed target detection model is 70 M, and the running speed is increased to three times the original.
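As an illustration only, a backbone of this kind can be instantiated with the EfficientNet-B4 implementation shipped in TensorFlow 2.x; the tapped layer names below are assumptions chosen to expose three feature scales for the connection network, not names disclosed above.

```python
import tensorflow as tf

# EfficientNet-B4 without its classification head, used as a feature extractor.
backbone = tf.keras.applications.EfficientNetB4(
    include_top=False, weights="imagenet", input_shape=(416, 416, 3))

# Tap low-, mid- and high-level feature maps for the connection network (FPN).
tap_names = ["block3b_add", "block5f_add", "top_activation"]  # assumed layer names
features = [backbone.get_layer(name).output for name in tap_names]
feature_extractor = tf.keras.Model(
    backbone.input, features, name="efficientnetb4_backbone")
```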
The specific steps of the step S200 are as follows:
s210, determining a detection target, acquiring sample data based on the detection target, labeling the sample data, and acquiring training data;
Collect relevant images containing the detection target as sample data, annotate these images in the VOC data format, store the images in a pic directory, and store the annotation data in a label directory.
Extract the annotation data from the label directory, obtain the corresponding xml information, package the extracted xml information and the related pictures into a file in TFRecord format, obtain training data in the VOC2007 data set format, and store the training data in a distributed file system.
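A minimal sketch of this packaging step is given below, assuming TensorFlow 2.x, the pic/label directory layout described above, and jpg images; the feature keys ("image", "labels", "boxes") are assumptions for the example.

```python
import os
import xml.etree.ElementTree as ET
import tensorflow as tf

def voc_to_example(img_path, xml_path):
    """Convert one image plus its VOC xml annotation into a tf.train.Example."""
    with open(img_path, "rb") as f:
        img_bytes = f.read()
    root = ET.parse(xml_path).getroot()
    labels, boxes = [], []
    for obj in root.iter("object"):
        labels.append(obj.find("name").text.encode("utf-8"))
        bb = obj.find("bndbox")
        boxes.extend(float(bb.find(k).text) for k in ("xmin", "ymin", "xmax", "ymax"))
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_bytes])),
        "labels": tf.train.Feature(bytes_list=tf.train.BytesList(value=labels)),
        "boxes": tf.train.Feature(float_list=tf.train.FloatList(value=boxes)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for xml_name in os.listdir("label"):
        img_name = xml_name.replace(".xml", ".jpg")
        example = voc_to_example(os.path.join("pic", img_name),
                                 os.path.join("label", xml_name))
        writer.write(example.SerializeToString())
```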
S220, training parameters of the target detection network by using an Adam optimization algorithm and the training data until a loss value generated in the training process reaches a preset termination condition, and outputting the target detection network obtained by training as a target detection model;
the loss values include confidence loss values, classification loss values, and GIoU loss values, and the manner in which the loss values are calculated is well known in the art and thus will not be described in detail herein.
The training steps are specifically:
s221, preprocessing training data;
the training data is randomly divided into a training set, a verification set and a test set according to the ratio of 6:2:2, and a person skilled in the art can set the ratio of the data in the training set, the verification set and the test set according to actual needs.
The data in the training set, the verification set and the test set are preprocessed, for example, the preprocessing such as horizontal inversion, random clipping, random color change, gaussian blur and the like is performed to expand the training set, the verification set and the test set.
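A minimal sketch of the split and of the image-level augmentations is shown below, assuming TensorFlow 2.x; geometric augmentations (flipping, cropping) also require the bounding boxes to be transformed, which is omitted here, and Gaussian blur would need an extra dependency such as tensorflow_addons (an assumption, not used in this sketch).

```python
import random
import tensorflow as tf

def split_samples(samples, seed=0):
    """Randomly split a list of samples into train/val/test at a 6:2:2 ratio."""
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

def augment(image):
    """Image-level augmentations; box-aware flips and crops are omitted."""
    image = tf.image.random_flip_left_right(image)   # horizontal flipping
    image = tf.image.random_brightness(image, 0.2)   # random color change
    image = tf.image.random_saturation(image, 0.8, 1.2)
    image = tf.image.random_hue(image, 0.05)
    return image
```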
S222, pre-training a model;
based on the Adam optimization algorithm and a preset loss function, the constructed target detection network is pre-trained by using a training set and a verification set in a semi-supervised learning mode to obtain an initial detection model, and a person skilled in the art can set the loss function by himself to enable the obtained loss value to include a confidence coefficient loss value, a classification loss value and a GIoU loss value.
Note: the input of the target detection network is a picture, and the output is the confidence, the category and the coordinates of the detected target.
In this embodiment, the initial learning rate of the Adam optimization algorithm is 1e-4, the learning rate is adjusted according to the training round number (for example, the learning rate can be implemented by using the cosine schedule disclosed in the prior art), the total training round number is 100 times, and the training is terminated when the validation set loss does not decrease after 10 training cycles.
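A minimal training-setup sketch matching these hyper-parameters is given below, assuming TensorFlow 2.x; the number of steps per epoch and the objects model, total_loss, train_ds and val_ds are assumptions defined elsewhere.

```python
import tensorflow as tf

EPOCHS = 100
STEPS_PER_EPOCH = 1000  # assumption; depends on dataset size and batch size

# Cosine learning-rate schedule starting from the stated initial rate of 1e-4.
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-4, decay_steps=EPOCHS * STEPS_PER_EPOCH)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

# Stop when the validation loss has not decreased for 10 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# Assumed usage:
# model.compile(optimizer=optimizer, loss=total_loss)
# model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS, callbacks=[early_stop])
```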
S223, optimizing the model;
Freeze the initial detection model while keeping only the prediction layers (namely the prediction layers of YOLOv3) trainable, retrain the prediction layers to obtain the optimal tuning parameters, and then optimize the initial detection model with these parameters to obtain the target detection model.
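A minimal sketch of this freeze-and-retrain step is shown below, assuming a Keras model in which the prediction layers can be identified by a name prefix; the prefix "yolo_head", the learning rate and the epoch count are assumptions.

```python
import tensorflow as tf

def freeze_except_prediction_layers(model, head_prefix="yolo_head"):
    """Freeze every layer except those whose name starts with head_prefix."""
    for layer in model.layers:
        layer.trainable = layer.name.startswith(head_prefix)
    return model

# Assumed usage with the pre-trained initial detection model:
# model = freeze_except_prediction_layers(initial_model)
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss=total_loss)
# model.fit(train_ds, validation_data=val_ds, epochs=20)
```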
Further, the model may be trained in a distributed manner. The distributed training is as follows: based on a data-parallel strategy, the same model is cloned onto different GPUs, each GPU is fed its own batch of data and computes the corresponding loss value, and finally the loss values are synchronously summed on the CPU.
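One way to realize this data-parallel scheme, given as an illustrative assumption rather than the disclosed implementation, is tf.distribute.MirroredStrategy in TensorFlow 2.x, with the cross-device reduction pinned to the CPU so the per-replica results are summed there.

```python
import tensorflow as tf

# Mirror the model on every visible GPU and sum per-replica results on the CPU.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice(reduce_to_device="/cpu:0"))
print("Number of replicas in sync:", strategy.num_replicas_in_sync)

# Assumed usage: variables created inside the scope are replicated per GPU,
# and model.fit() shards each global batch across the replicas.
# with strategy.scope():
#     model = build_model()   # the EfficientNet-B4 + FPN + YOLOv3 network
#     model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=total_loss)
# model.fit(train_ds, validation_data=val_ds, epochs=100)
```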
Those skilled in the art can use the constructed target detection model for surveillance recognition and industrial inspection, and can also use it to annotate images automatically.
Embodiment 2 changes the connection network of embodiment 1 to a feature pyramid network with a residual structure; the rest of the network is the same as in embodiment 1.
This embodiment adopts a feature pyramid network, which works with the backbone network to extract its high-, medium- and low-level features respectively, performs feature fusion across levels by upsampling, and uses a residual network structure to ensure that the original manifold is not damaged; the target detection network constructed in this embodiment (i.e., EfficientNet-YOLOv3) is shown in FIG. 2.
Referring to FIG. 3, in this embodiment a Feature Pyramid Network (FPN) based on MBConv and Conv2d is used as the connection network.
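For illustration, the following is a minimal sketch of a single FPN_block of this kind: a 1x1 Conv2D lateral connection, upsampling of the deeper feature map, fusion, and an MBConv-style depthwise-separable refinement with a residual shortcut. It is an assumed simplification of FIG. 3 written with TensorFlow 2.x/Keras, not the exact disclosed layout.

```python
import tensorflow as tf
from tensorflow.keras import layers

def fpn_block(deep_feat, lateral_feat, channels=256):
    """Fuse a deeper (lower-resolution) map with a lateral (higher-resolution) map."""
    # Project both inputs to a common channel count with 1x1 Conv2D.
    top = layers.Conv2D(channels, 1, padding="same")(deep_feat)
    lat = layers.Conv2D(channels, 1, padding="same")(lateral_feat)
    # Upsample the deeper map and fuse it with the lateral map.
    top = layers.UpSampling2D(2)(top)
    fused = layers.Add()([top, lat])
    # MBConv-style refinement: depthwise conv followed by a pointwise Conv2D.
    h = layers.DepthwiseConv2D(3, padding="same")(fused)
    h = layers.BatchNormalization()(h)
    h = layers.Activation("swish")(h)
    h = layers.Conv2D(channels, 1, padding="same")(h)
    h = layers.BatchNormalization()(h)
    # Residual shortcut keeps the fused features (the original manifold) intact.
    return layers.Add()([fused, h])

# Assumed usage with three backbone feature maps c3, c4, c5:
# p4 = fpn_block(c5, c4)
# p3 = fpn_block(p4, c3)
```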
In this embodiment, the design of the connection network introduces an attention mechanism (that is, attention is focused on the important points while other, unimportant factors are ignored), so that the performance of the EfficientNet-B4 backbone network is fully exploited and the detection effect of the constructed target detection model is greatly improved; on the VOC2007 data set, a 2% mAP improvement is achieved compared with the original target detection model based on the Darknet53-YOLOv3 network.
In summary, the target detection network provided by this embodiment runs faster on low-end devices while providing a better detection effect, and it supports distributed training and disaster-recovery backup of data.
Case: identifying city management events.
Collect surveillance video, convert it into video frames, and use the video frames as sample data;
annotate the video frames with respect to city management events, and generate training data in the VOC2007 data set format. Those skilled in the art can set the categories of city management events to be detected according to actual needs and annotate the sample data accordingly; such events include, but are not limited to, illegal parking, littering, out-of-store operation, mobile street vendors, scattered materials and the like.
The training data are randomly divided into a training set, a verification set and a test set at a ratio of 6:2:2; the preprocessed training, verification and test sets are used to train the target detection network constructed in this embodiment (following the training steps disclosed above), and the corresponding target detection model for detecting city management events is obtained.
In actual use, the current surveillance video is obtained and converted into video frames to be detected; each video frame to be detected is input into the target detection model, which outputs the recognition result of the corresponding city management event (its category, position and confidence).
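A minimal inference sketch for this use case is given below, assuming OpenCV, a trained Keras model saved as "detector.h5", a 416x416 input size and a (boxes, scores, classes) output format; all of these are assumptions for the example.

```python
import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("detector.h5", compile=False)
cap = cv2.VideoCapture("surveillance.mp4")

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # BGR -> RGB, resize to the network input size and add a batch dimension.
    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (416, 416)).astype(np.float32)[None, ...]
    boxes, scores, classes = model.predict(img)  # assumed output heads
    # ... map classes to city management event categories and keep detections
    # above a confidence threshold ...

cap.release()
```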
Embodiment 3, a system for constructing an object detection model, as shown in fig. 4, includes a network construction module 100 and a model training module 200;
the network construction module 100 is configured to construct an object detection network, where the object detection network includes a backbone network, a connection network, and a detection network;
the model training module 200 is configured to determine a detection target, obtain sample data based on the detection target, and label the sample data to obtain training data; and training the target detection network by using the training data to obtain and output a target detection model.
The network construction module 100 includes a backbone network construction unit 110, a connection network construction unit 120, and a detection network construction unit 130;
The backbone network constructing unit 110 is configured to construct a backbone network with an inverted residual structure with a linear bottleneck; in this embodiment, it is configured to construct EfficientNet-B4.
The connection network constructing unit 120 is configured to construct a feature pyramid network whose network structure is a residual structure as the connection network; in this embodiment, it is configured to construct a feature pyramid network based on MBConv and Conv2d.
The detection network constructing unit 130 in this embodiment is used to construct YOLOv3 as the detection network.
The model training module 200 is configured to:
acquiring sample data based on a detection target and labeling the sample data to acquire training data;
training parameters of a target detection network by using an Adam optimization algorithm and the training data until a loss value generated in the training process reaches a preset termination condition, and outputting the target detection network obtained by training as a target detection model; the loss values include a confidence loss value, a classification loss value, and a GIoU loss value.
The system further comprises a data acquisition module, a data processing module, a data warehouse and a model publishing module;
the data acquisition module is configured to receive a training request sent by a user, and instruct the model training module 200 to train the constructed target detection network based on the training request.
The data warehouse is used for storing training data, and a distributed file system is used as the data warehouse in the embodiment.
The data processing module is used for acquiring training data from a data warehouse, preprocessing the training data and inputting the preprocessed training data into the model training module 200;
the data warehouse is also used for storing the trained target detection model and feeding the target detection model back to the user through the model publishing module.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Embodiment 4, a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of embodiment 1 or 2.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that:
reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
In addition, it should be noted that the specific embodiments described in the present specification may differ in the shape of the components, the names of the components, and the like. All equivalent or simple changes of the structure, the characteristics and the principle of the invention which are described in the patent conception of the invention are included in the protection scope of the patent of the invention. Various modifications, additions and substitutions for the specific embodiments described may be made by those skilled in the art without departing from the scope of the invention as defined in the accompanying claims.

Claims (10)

1. A method for constructing a target detection model, comprising the step of constructing a target detection network, wherein the target detection network comprises a backbone network, a connection network and a detection network, characterized in that the network structure of the backbone network is an inverted residual structure with a linear bottleneck.
2. The method for constructing the target detection model according to claim 1, wherein the backbone network adopts EfficientNet-B4.
3. The method for constructing the target detection model according to claim 1 or 2, wherein a feature pyramid network whose network structure is a residual structure is constructed as the connection network.
4. The method for constructing the target detection model according to claim 3, wherein the connection network adopts a feature pyramid network based on MBConv and Conv2d.
5. The method for constructing the target detection model according to claim 1, wherein the detection network adopts YOLOv3.
6. The method for constructing the target detection model according to claim 5, further comprising the step of training the target detection network, specifically comprising the steps of:
acquiring sample data based on a detection target and labeling the sample data to acquire training data;
training parameters of a target detection network by using an Adam optimization algorithm and the training data until a loss value generated in the training process reaches a preset termination condition, and outputting the target detection network obtained by training as a target detection model;
the loss values include a confidence loss value, a classification loss value, and a GIoU loss value.
7. A system for constructing a target detection model, comprising a network construction module and a model training module, wherein the network construction module is used for constructing a target detection network, the target detection network comprises a backbone network, a connection network and a detection network, and the network construction module comprises a backbone network construction unit, a connection network construction unit and a detection network construction unit;
and the backbone network construction unit is used for constructing a backbone network whose structure is an inverted residual structure with a linear bottleneck.
8. The system for constructing the target detection model according to claim 7, wherein:
the connection network construction unit is used for constructing a feature pyramid network whose network structure is a residual structure as the connection network.
9. The system for constructing the target detection model according to claim 7, wherein the model training module is configured to:
acquiring sample data based on a detection target and labeling the sample data to acquire training data;
training parameters of a target detection network by using an Adam optimization algorithm and the training data until a loss value generated in the training process reaches a preset termination condition, and outputting the target detection network obtained by training as a target detection model; the loss values include a confidence loss value, a classification loss value, and a GIoU loss value.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202010181777.8A 2020-03-16 2020-03-16 Method and system for constructing target detection model Pending CN111415338A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010181777.8A CN111415338A (en) 2020-03-16 2020-03-16 Method and system for constructing target detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010181777.8A CN111415338A (en) 2020-03-16 2020-03-16 Method and system for constructing target detection model

Publications (1)

Publication Number Publication Date
CN111415338A true CN111415338A (en) 2020-07-14

Family

ID=71491220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010181777.8A Pending CN111415338A (en) 2020-03-16 2020-03-16 Method and system for constructing target detection model

Country Status (1)

Country Link
CN (1) CN111415338A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190147318A1 (en) * 2017-11-14 2019-05-16 Google Llc Highly Efficient Convolutional Neural Networks
CN110263819A (en) * 2019-05-28 2019-09-20 中国农业大学 A kind of object detection method and device for shellfish image
CN110415238A (en) * 2019-07-31 2019-11-05 河海大学常州校区 Diaphragm spots detection method based on reversed bottleneck structure depth convolutional network
AU2019101133A4 (en) * 2019-09-30 2019-10-31 Bo, Yaxin MISS Fast vehicle detection using augmented dataset based on RetinaNet
CN110852222A (en) * 2019-10-31 2020-02-28 上海交通大学 Campus corridor scene intelligent monitoring method based on target detection
CN110766098A (en) * 2019-11-07 2020-02-07 中国石油大学(华东) Traffic scene small target detection method based on improved YOLOv3
CN110852283A (en) * 2019-11-14 2020-02-28 南京工程学院 Helmet wearing detection and tracking method based on improved YOLOv3

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
吕石磊 et al.: "Citrus recognition method based on the improved YOLOv3-LITE lightweight neural network", 《农业工程学报》 (Transactions of the Chinese Society of Agricultural Engineering) *
徐义鎏 et al.: "Vehicle type detection algorithm based on YOLOv3 with an improved loss function", 《信息通信》 (Information & Communications) *
李旭嵘 et al.: "A Deepfakes detection technique based on a two-stream network", 《信息安全学报》 (Journal of Cyber Security) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464701A (en) * 2020-08-26 2021-03-09 北京交通大学 Method for detecting whether people wear masks or not based on light weight characteristic fusion SSD
CN112132032A (en) * 2020-09-23 2020-12-25 平安国际智慧城市科技股份有限公司 Traffic sign detection method and device, electronic equipment and storage medium
CN112233073A (en) * 2020-09-30 2021-01-15 国网山西省电力公司大同供电公司 Real-time detection method for infrared thermal imaging abnormity of power transformation equipment
CN112487911A (en) * 2020-11-24 2021-03-12 中国信息通信科技集团有限公司 Real-time pedestrian detection method and device based on improved yolov3 in intelligent monitoring environment
CN112487911B (en) * 2020-11-24 2024-05-24 中国信息通信科技集团有限公司 Real-time pedestrian detection method and device based on improvement yolov under intelligent monitoring environment
CN112580581A (en) * 2020-12-28 2021-03-30 英特灵达信息技术(深圳)有限公司 Target detection method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN111415338A (en) Method and system for constructing target detection model
WO2021238826A1 (en) Method and apparatus for training instance segmentation model, and instance segmentation method
CN111696110B (en) Scene segmentation method and system
CN110020650B (en) Inclined license plate recognition method and device based on deep learning recognition model
CN113688665A (en) Remote sensing image target detection method and system based on semi-supervised iterative learning
CN113297956B (en) Gesture recognition method and system based on vision
CN111967368A (en) Traffic light identification method and device
CN112990065A (en) Optimized YOLOv5 model-based vehicle classification detection method
CN113850135A (en) Dynamic gesture recognition method and system based on time shift frame
CN115424017A (en) Building internal and external contour segmentation method, device and storage medium
CN114529890A (en) State detection method and device, electronic equipment and storage medium
CN114387512A (en) Remote sensing image building extraction method based on multi-scale feature fusion and enhancement
CN117011819A (en) Lane line detection method, device and equipment based on feature guidance attention
CN115131826B (en) Article detection and identification method, and network model training method and device
CN116580232A (en) Automatic image labeling method and system and electronic equipment
CN112733864A (en) Model training method, target detection method, device, equipment and storage medium
CN111428589A (en) Identification method and system for transition
CN114550129B (en) Machine learning model processing method and system based on data set
CN115937492A (en) Transformer equipment infrared image identification method based on feature identification
CN112446292B (en) 2D image salient object detection method and system
CN114937239A (en) Pedestrian multi-target tracking identification method and tracking identification device
CN113673478A (en) Port large-scale equipment detection and identification method based on depth panoramic stitching
CN113239931A (en) Logistics station license plate recognition method
CN114862683B (en) Model generation method, target detection method, device, equipment and medium
CN117746066B (en) Diffusion model guided high-speed vehicle detection integrated learning method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200714)