CN114708185A - Target detection method, system and equipment based on big data enabling and model flow - Google Patents

Target detection method, system and equipment based on big data enabling and model flow

Info

Publication number
CN114708185A
Authority
CN
China
Prior art keywords
model
target detection
current scene
image
training
Prior art date
Legal status
Pending
Application number
CN202111258992.4A
Other languages
Chinese (zh)
Inventor
张兆翔
彭君然
卜兴源
常清
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202111258992.4A priority Critical patent/CN114708185A/en
Publication of CN114708185A publication Critical patent/CN114708185A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/0002 - Inspection of images, e.g. flaw detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision and pattern recognition, and particularly relates to a target detection method, system and device based on big data enabling and model flow, aiming at solving the problems that existing target detection models are limited by their training data, so that model performance is low and reusability across different application scenarios is poor. The invention comprises the following steps: integrating all publicly available target detection datasets and building a model sampling space with an arbitrary model as the reference; completing, in a single training run, a dynamic supernet covering a variety of operational requirements; in the current scene, performing model initialization and submodel screening through the semantic information vectors of the categories; and pre-training the submodels with the current-scene data to finally obtain the target detection model, which performs target detection on the images to be detected in the current scene. After the flexible dynamic supernet is constructed, a target detection model with excellent performance in the current scene can be obtained by quick fine-tuning with only a small amount of labeled data from that scene.

Description

Target detection method, system and equipment based on big data enabling and model flow
Technical Field
The invention belongs to the field of computer vision and pattern recognition, and particularly relates to a target detection method, a system and equipment based on big data enabling and model flow.
Background
Target detection is an important and challenging computer vision task with wide application in security surveillance, intelligent video analysis, autonomous driving and other fields. In each application scenario, building a dataset with annotation on the order of the COCO dataset is prohibitively expensive and difficult to realize. Generally, only a small amount of cheaply labeled data can be obtained, and this limited data makes model training very difficult and keeps the model from reaching its full performance. Meanwhile, the hardware resources available in different application scenarios differ markedly, so the models that can be deployed differ as well; in a new scenario, a new model has to be selected from scratch according to the experience of human experts and retrained, so model reusability is very poor and hardware resources are heavily wasted.
Disclosure of Invention
In order to solve the above problems in the prior art, namely that existing target detection models are limited by insufficient training data, so that model performance is low, and that reusability is poor because changing hardware requirements across application scenarios force the model to be replaced and retrained, the invention provides a target detection method based on big data enabling and model flow, which comprises the following steps:
step S10, acquiring an image for target detection in the current scene, and converting the category information of the image into semantic information vectors through a Word2vector model;
step S20, performing cosine similarity matching between these semantic information vectors and the semantic information vectors of the image category information of all public target detection datasets;
step S30, initializing the target detection model of the current scene with the classification fully connected weights of the category with the highest matching value in the supernet with variable width and depth;
step S40, pre-specifying the floating-point operations and parameter count of the model, and traversing the model sampling space one by one to obtain K target detection submodels for the current scene;
step S50, pre-training the K target detection submodels respectively with the training images of the current scene, and taking the pre-trained submodel with the largest mAP value as the final target detection model of the current scene;
and step S60, performing target detection on the image to be detected in the current scene through the final target detection model of the current scene to obtain the target detection result.
In some preferred embodiments, the model sampling space is obtained by:
step A10, normalizing the images in all the obtained public target detection data sets, and converting the category information corresponding to the images into semantic information vectors through a Word2vector model;
step A20, calculating cosine similarity among semantic information vectors of each category, merging categories with the cosine similarity larger than a preset threshold value, and then performing label remapping;
and step A30, selecting any deep learning target detection model, and setting the width and depth sampling spaces of the model's backbone network according to the model's characteristics, to obtain the model sampling space.
In some preferred embodiments, the supernet with variable width and depth is constructed and trained by the following method:
step B10, dividing the label-remapped image set into batches of a set size, extracting the features of each batch of images through the feature extraction backbone network, and randomly sampling a model from the model sampling space;
step B20, inputting the features of the current batch of images into the classification and bounding-box regression branches of the corresponding randomly sampled model for forward propagation, and calculating the global loss of the model;
and step B30, reducing the global loss through backpropagation and stochastic gradient descent to update the model parameters, and training iteratively until a set end-of-training condition is reached, obtaining the supernet with variable width and depth.
In some preferred embodiments, before the label-remapped image set is divided into batches of a set size in step B10, an image set expansion step is further provided, as follows:
performing random multi-scale scaling and multi-angle flipping operations on the label-remapped images to obtain an expanded image set.
In some preferred embodiments, the global loss is expressed as:

$$L_{all} = \lambda L_{rcnn} + L_{rpn}$$

where $L_{rcnn}$ represents the classification and bounding-box regression loss of the model, $L_{rpn}$ represents the loss of the model's region proposal network part, and $\lambda$ is a balance factor weighting the two losses.
In some preferred embodiments, the classification and bounding-box regression loss of the model is expressed as:

$$L_{rcnn} = \frac{1}{N_{cls}} \sum_{k} L_{cls}(p_{ki}, p_{ki}^{*}) + \gamma \frac{1}{N_{reg}} \sum_{k} p_{ki}^{*} L_{reg}(t_{ki}, t_{ki}^{*})$$

where $k$ represents the index of the prediction box, $p_{ki}$ represents the predicted probability that the $k$-th prediction box is predicted as class $i$, $p_{ki}^{*}$ represents the actual probability that the label corresponding to the $k$-th prediction box is the $i$-th class, $L_{cls}$ is the cross-entropy loss, and $N_{cls}$ represents the total number of categories contained in the dataset; $t_{ki}$ represents the predicted coordinates of the $k$-th prediction box, $t_{ki}^{*}$ represents the true coordinates corresponding to the $k$-th prediction box, $L_{reg}$ is the smooth L1 loss, and $N_{reg}$ represents the total number of prediction boxes; $\gamma$ is a balance factor weighting the two losses.
In some preferred embodiments, the region proposal network part loss of the model is expressed as:

$$L_{rpn} = \frac{1}{K} \sum_{k} L_{cls}(p_{k}, p_{k}^{*}) + \beta \frac{1}{K} \sum_{k} p_{k}^{*} L_{reg}(t_{k}, t_{k}^{*})$$

where $k$ represents the index of the prediction box, $p_{k}$ represents the predicted probability that the $k$-th prediction box contains an object, $p_{k}^{*}$ represents the actual probability that the $k$-th prediction box contains an object, $L_{cls}$ is the binary cross-entropy loss, and $K$ represents the number of all predicted detection boxes; $t_{k}$ represents the predicted coordinates of the $k$-th prediction box, $t_{k}^{*}$ represents the true coordinates corresponding to the $k$-th prediction box, and $L_{reg}$ is the smooth L1 loss; $\beta$ is a balance factor weighting the two losses.
In another aspect of the invention, a target detection system based on big data enabling and model flow is provided, which comprises the following modules:
the semantic information vector extraction module is configured to acquire an image to be detected of a target in a current scene and convert the category information of the image into a semantic information vector through a Word2vector model;
the matching module is configured to perform cosine similarity matching between the semantic information vector and semantic information vectors of all public target detection data set image category information;
the initialization module is configured to initialize the target detection model of the current scene with the classification fully connected weights of the category with the highest matching value in the supernet with variable width and depth;
the submodel screening module is configured to pre-specify the floating-point operations and parameter count of the model, and traverse the model sampling space one by one to obtain the K target detection submodels of the current scene;
the model training module is configured to respectively adopt the training images of the current scene to pre-train the K target detection submodels, and takes the pre-trained submodel with the maximum mAP value as the final target detection model of the current scene;
and the target detection module is configured to perform target detection on the image to be subjected to target detection in the current scene through the final target detection model in the current scene to obtain a target detection result.
In a third aspect of the present invention, an electronic device is provided, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the processor, the instructions being executed by the processor to implement the target detection method based on big data enabling and model flow described above.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, which stores computer instructions to be executed by the computer to implement the above target detection method based on big data enabling and model flow.
The invention has the beneficial effects that:
(1) According to the target detection method based on big data enabling and model flow of the invention, a supernet with dynamically variable depth and width is constructed by integrating publicly available large target detection datasets, and the supernet is empowered by big data, which greatly improves the adaptability of the model and achieves the effect of training hundreds of thousands of submodels simultaneously.
(2) According to the target detection method based on big data enabling and model flow, the trained model has extremely strong generalization ability because it is trained on a million-scale dataset, and can serve as an excellent initialization model for downstream deployment scenes; the model can reach excellent performance with only a small amount of labeled data from the specific deployment scene.
(3) The target detection method based on big data enabling and model flow exploits the fact that images from the same domain have similar feature maps at the fully connected layer of the network: for the images provided by a downstream deployment scene, the closest labeled images are matched and selected from the database used to train the supernet and added to the deployment training, further improving model performance.
(4) According to the target detection method based on big data enabling and model flow, the simultaneous training of the contained subnets is achieved through the dynamic structure designed in the initial supernet training; when deploying to different scenes, the trained subnets only need to be extracted directly according to the requirements of the corresponding scene, saving model deployment cost in all kinds of environments.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic flow diagram of the target detection method based on big data enabling and model flow of the present invention;
FIG. 2 is a block diagram of the target detection method based on big data enabling and model flow of the present invention;
FIG. 3 is a schematic diagram of the search space of an embodiment of the target detection method based on big data enabling and model flow of the present invention, which uses a residual neural network as the backbone network;
FIG. 4 is a schematic diagram of similar image extraction in a sparse data scene according to an embodiment of the target detection method based on big data enabling and model flow.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Aiming at the problems of target detection in the prior art, namely, on the one hand, the low performance of existing target detection models in few-sample scenes and, on the other hand, the complexity and resource waste of model deployment, the invention provides a target detection method based on big data enabling and model flow that trains hundreds of thousands of submodels in one pass by integrating large target detection datasets and constructing a flexible dynamic supernet. In a specific usage scene, the weights corresponding to the closest semantic categories in the training database are selected for initialization according to the word vector model, and quick fine-tuning with a small amount of labeled data from the usage scene achieves excellent target detection performance. When the labeled data of the corresponding usage scene is sparse, similar data is selected from the training database according to the distribution characteristics of the labeled data in the feature map of the network's fully connected layer to assist the deployment training, further strengthening the performance of the model.
The invention relates to a target detection method based on big data enabling and model flow, which comprises the following steps:
step S10, acquiring an image for target detection in the current scene, and converting the category information of the image into semantic information vectors through a Word2vector model;
step S20, performing cosine similarity matching between these semantic information vectors and the semantic information vectors of the image category information of all public target detection datasets;
step S30, initializing the target detection model of the current scene with the classification fully connected weights of the category with the highest matching value in the supernet with variable width and depth;
step S40, pre-specifying the floating-point operations and parameter count of the model, and traversing the model sampling space one by one to obtain K target detection submodels for the current scene;
step S50, pre-training the K target detection submodels respectively with the training images of the current scene, and taking the pre-trained submodel with the largest mAP value as the final target detection model of the current scene;
and step S60, performing target detection on the image to be detected in the current scene through the final target detection model of the current scene to obtain the target detection result.
In order to more clearly describe the object detection method based on big data enabling and model flow of the present invention, the following describes the steps in the embodiment of the present invention in detail with reference to fig. 1 and 2.
The object detection method based on big data enabling and model flow in the first embodiment of the invention has the following steps:
Step A10, normalizing the images in all the obtained public target detection datasets, and converting the category information corresponding to the images into semantic information vectors through a Word2vector model. In one embodiment of the invention, the short sides of all images are resized to 800 pixels.
Step A20, calculating the cosine similarity between the semantic information vectors of each category, merging categories whose cosine similarity is greater than a preset threshold, and then performing label remapping. In one embodiment of the present invention, the preset threshold is 0.8.
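As a minimal sketch of steps A10-A20, category names could be embedded with a pre-trained word2vec model and merged greedily at the 0.8 threshold. The model file name, the averaging of multi-word category names, and the greedy merging strategy are illustrative assumptions, not the patent's literal procedure:

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed pre-trained word2vec model file.
word2vec = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

def semantic_vector(category: str) -> np.ndarray:
    """Average the word vectors of a (possibly multi-word) category name."""
    words = [w for w in category.lower().split() if w in word2vec]
    if not words:
        raise KeyError(f"no word vectors for category {category!r}")
    return np.mean([word2vec[w] for w in words], axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def merge_categories(categories, threshold=0.8):
    """Greedily merge semantically close categories and return a
    label remapping from original category name to merged label id."""
    vectors = {c: semantic_vector(c) for c in categories}
    remap, merged = {}, []
    for c in categories:
        for gid, rep in enumerate(merged):
            if cosine(vectors[c], vectors[rep]) > threshold:
                remap[c] = gid
                break
        else:  # no existing group is close enough: open a new one
            remap[c] = len(merged)
            merged.append(c)
    return remap
```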
Step A30, selecting any deep learning target detection model, and setting the width and depth sampling spaces of the model's backbone network according to the model's characteristics, to obtain the model sampling space.
In one embodiment of the invention, the two-stage detector Faster R-CNN is selected as the target detection model, the residual network ResNet is selected as the backbone of the model, and a ResNet50, ResNet77 or ResNet101 network is selected as the model anchor.
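For orientation, torchvision's reference Faster R-CNN with a ResNet-50 FPN backbone can stand in for the anchor model described here; the class count below is an illustrative assumption for the merged label space of step A20:

```python
import torchvision

# A stand-in anchor model: two-stage Faster R-CNN with a ResNet-50 backbone.
# num_classes=1204 is an assumed size of the merged label space.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, num_classes=1204)
```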
FIG. 3 is a schematic diagram of the search space when a residual neural network is used as the backbone network in an embodiment of the target detection method based on big data enabling and model flow. AR50, AR77 and AR101 denote the ResNet50, ResNet77 and ResNet101 networks respectively. Dmin denotes the depth of the minimal network, Danchor the depth of the corresponding anchor model, Dmax the depth of the maximal network, and Dstep the step size with which the network depth can change dynamically from the minimum to the maximum. Wmin denotes the width of the minimal network, Wanchor the width of the corresponding anchor model, Wmax the width of the maximal network, and Wstep the step size with which the network width can change dynamically from the minimum to the maximum. Smin denotes the minimal input resolution of the network, Sanchor the reference input resolution, Smax the maximal input resolution, and Sstep the step size with which the accepted input resolution varies between the minimum and the maximum.
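The search space of FIG. 3 could be represented along the following lines; the field names mirror the figure's symbols and the concrete bounds are purely illustrative assumptions:

```python
import random
from dataclasses import dataclass

@dataclass
class SamplingSpace:
    d_min: int; d_max: int; d_step: int        # depth range (Dmin..Dmax, Dstep)
    w_min: float; w_max: float; w_step: float  # width-multiplier range (Wmin..Wmax)
    s_min: int; s_max: int; s_step: int        # input resolution range (Smin..Smax)

    def widths(self):
        n = int(round((self.w_max - self.w_min) / self.w_step)) + 1
        return [self.w_min + i * self.w_step for i in range(n)]

    def sample(self):
        """Randomly draw one sub-model configuration (depth, width, resolution)."""
        depth = random.randrange(self.d_min, self.d_max + 1, self.d_step)
        width = random.choice(self.widths())
        size = random.randrange(self.s_min, self.s_max + 1, self.s_step)
        return depth, width, size

# Illustrative bounds only; the patent does not publish these numbers.
space = SamplingSpace(2, 8, 1, 0.5, 1.0, 0.125, 480, 800, 32)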
Random multi-scale scaling and multi-angle flipping operations are performed on the label-remapped images to obtain the expanded image set.
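A minimal sketch of this expansion step, assuming tensor images and boxes in (x1, y1, x2, y2) format; the scale set and flip probability are assumptions:

```python
import random
import torchvision.transforms.functional as F

def expand(image, boxes):
    """Random multi-scale scaling plus a random horizontal flip."""
    scale = random.choice([0.75, 1.0, 1.25])     # assumed scale set
    h, w = image.shape[-2:]
    image = F.resize(image, [int(h * scale), int(w * scale)])
    boxes = boxes * scale
    if random.random() < 0.5:                    # assumed flip probability
        image = F.hflip(image)
        # mirror the x coordinates: new x1 = W - old x2, new x2 = W - old x1
        boxes[:, [0, 2]] = image.shape[-1] - boxes[:, [2, 0]]
    return image, boxes
```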
Step B10, dividing the image set (or the expanded image set) into batches of a set size, extracting the features of each batch of images through the feature extraction backbone network, and randomly sampling a model from the model sampling space.
Step B20, inputting the features of the current batch of images into the classification and bounding-box regression branches of the corresponding randomly sampled model for forward propagation, and calculating the global loss of the model, as shown in Equation (1):

$$L_{all} = \lambda L_{rcnn} + L_{rpn} \tag{1}$$

where $L_{rcnn}$ represents the classification and bounding-box regression loss of the model, $L_{rpn}$ represents the loss of the model's region proposal network part, and $\lambda$ is a balance factor weighting the two losses.
The classification and bounding-box regression loss of the model is expressed as shown in Equation (2):

$$L_{rcnn} = \frac{1}{N_{cls}} \sum_{k} L_{cls}(p_{ki}, p_{ki}^{*}) + \gamma \frac{1}{N_{reg}} \sum_{k} p_{ki}^{*} L_{reg}(t_{ki}, t_{ki}^{*}) \tag{2}$$

where $k$ represents the index of the prediction box, $p_{ki}$ represents the predicted probability that the $k$-th prediction box is predicted as class $i$, $p_{ki}^{*}$ represents the actual probability that the label corresponding to the $k$-th prediction box is the $i$-th class, $L_{cls}$ is the cross-entropy loss, and $N_{cls}$ represents the total number of categories contained in the dataset; $t_{ki}$ represents the predicted coordinates of the $k$-th prediction box, $t_{ki}^{*}$ represents the true coordinates corresponding to the $k$-th prediction box, $L_{reg}$ is the smooth L1 loss, and $N_{reg}$ represents the total number of prediction boxes; $\gamma$ is a balance factor weighting the two losses.
The region proposal network part loss of the model is expressed as shown in Equation (3):

$$L_{rpn} = \frac{1}{K} \sum_{k} L_{cls}(p_{k}, p_{k}^{*}) + \beta \frac{1}{K} \sum_{k} p_{k}^{*} L_{reg}(t_{k}, t_{k}^{*}) \tag{3}$$

where $k$ represents the index of the prediction box, $p_{k}$ represents the predicted probability that the $k$-th prediction box contains an object, $p_{k}^{*}$ represents the actual probability that the $k$-th prediction box contains an object, $L_{cls}$ is the binary cross-entropy loss, and $K$ represents the number of all predicted detection boxes; $t_{k}$ represents the predicted coordinates of the $k$-th prediction box, $t_{k}^{*}$ represents the true coordinates corresponding to the $k$-th prediction box, and $L_{reg}$ is the smooth L1 loss; $\beta$ is a balance factor weighting the two losses.
Step B30, reducing the global loss through backpropagation and stochastic gradient descent to update the model parameters, and training iteratively until a set end-of-training condition is reached, obtaining the supernet with variable width and depth.
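Steps B10-B30 could then be wired into a training loop such as the sketch below, reusing `space` and `global_loss` from the earlier sketches; `supernet(images, targets, config)` returning the two loss terms is an assumed interface, not the patent's literal API:

```python
import torch

def train_supernet(supernet, loader, space, epochs=12, lam=1.0):
    """One randomly sampled sub-model per batch, trained with SGD on Eq. (1)."""
    optimizer = torch.optim.SGD(supernet.parameters(), lr=0.02, momentum=0.9)
    for _ in range(epochs):
        for images, targets in loader:
            config = space.sample()                            # B10: random sub-model
            l_rcnn, l_rpn = supernet(images, targets, config)  # B20: forward pass
            loss = global_loss(l_rcnn, l_rpn, lam)             # Eq. (1)
            optimizer.zero_grad()
            loss.backward()                                    # B30: backpropagation
            optimizer.step()                                   # SGD parameter update
    return supernet
```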
Step S10, acquiring the image for target detection in the current scene (i.e., the downstream target detection scene to be deployed), and converting the category information of the image into semantic information vectors through a Word2vector model.
And step S20, performing cosine similarity matching between the semantic information vector and the semantic information vectors of all the public target detection data set image category information.
Step S30, initializing the target detection model of the current scene with the classification fully connected weights of the category with the highest matching value in the supernet with variable width and depth.
Step S40, pre-specifying the floating-point operations (FLOPs, which measure algorithm/model complexity) and the parameter count of the model, traversing the model sampling space one by one, screening out the submodels that satisfy the constraints, and randomly sampling K of the selected submodels to obtain the K target detection submodels of the current scene.
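A sketch of this screening, reusing the hypothetical SamplingSpace above; `estimate_flops` and `estimate_params` are assumed helpers (a profiler such as fvcore's flop counting could supply real numbers):

```python
import itertools
import random

def screen_submodels(space, max_flops, max_params, estimate_flops,
                     estimate_params, k=10):
    """Enumerate the sampling space, keep configurations within the
    pre-specified FLOPs/parameter budgets, and randomly keep K of them."""
    candidates = []
    for depth, width, size in itertools.product(
            range(space.d_min, space.d_max + 1, space.d_step),
            space.widths(),
            range(space.s_min, space.s_max + 1, space.s_step)):
        cfg = (depth, width, size)
        if (estimate_flops(cfg) <= max_flops
                and estimate_params(cfg) <= max_params):
            candidates.append(cfg)
    return random.sample(candidates, min(k, len(candidates)))
```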
Step S50, pre-training the K target detection submodels respectively with the training images of the current scene (i.e., the dataset of the downstream target detection scene to be deployed) to obtain their mAP values, and selecting from the K submodels the pre-trained submodel with the best mAP result as the final target detection model of the current scene (i.e., the final deployment model of the downstream target detection scene to be deployed).
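One possible way to realize this selection, with `extract_submodel`, `finetune`, and `evaluate_map` as assumed helpers:

```python
def select_final_model(supernet, configs, train_data, val_data,
                       extract_submodel, finetune, evaluate_map):
    """Briefly fine-tune each screened sub-model on the current-scene
    training images and keep the one with the highest mAP."""
    best_model, best_map = None, -1.0
    for cfg in configs:
        sub = extract_submodel(supernet, cfg)  # inherit supernet weights
        finetune(sub, train_data)              # short pre-training pass
        m = evaluate_map(sub, val_data)        # mean Average Precision
        if m > best_map:
            best_model, best_map = sub, m
    return best_model
```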
Step S60, performing target detection on the image to be detected in the current scene through the final target detection model of the current scene to obtain the target detection result.
FIG. 4 is a schematic diagram of similar image extraction in a sparse data scene according to an embodiment of the target detection method based on big data enabling and model flow: labeled images very similar to those of the deployment scene can be supplied from the huge locally maintained database, and the performance of the deep learning algorithm improves greatly as the data volume grows.
By mining the latent capability of large-scale datasets, the method gives the trained model excellent transfer ability, so that excellent target detection performance can be reached in unseen data domains with only a little related-domain data. In practical application scenes, data is often sparse; the method not only provides a good model initialization, but also selects, for the specific deployed data domain, the most similar related data from the labeled database used for training, according to the distribution characteristics of the images' feature maps in the network, to assist the specific downstream deployment training and obtain excellent target detection performance in data-sparse scenes.
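A sketch of how such retrieval might look, assuming fully-connected-layer feature matrices for the query images and for the labeled database; the feature extractor and database layout are assumptions:

```python
import numpy as np

def retrieve_similar(query_feats, db_feats, db_labels, top_n=100):
    """Cosine nearest neighbours in the FC-layer feature space:
    return the top_n labeled database images closest to any query image."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    d = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sim = q @ d.T                          # pairwise cosine similarity (Q x D)
    best = sim.max(axis=0)                 # best match per database image
    idx = np.argsort(-best)[:top_n]
    return [(db_labels[i], float(best[i])) for i in idx]
```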
Although the foregoing embodiments describe the steps in the above sequential order, those skilled in the art will understand that, in order to achieve the effect of the present embodiments, the steps may not be executed in such an order, and may be executed simultaneously (in parallel) or in an inverse order, and these simple variations are within the scope of the present invention.
A second embodiment of the present invention is a target detection system based on big data enabling and model flow, comprising the following modules:
the semantic information vector extraction module is configured to acquire an image to be detected of a target in a current scene and convert the category information of the image into a semantic information vector through a Word2vector model;
the matching module is configured to perform cosine similarity matching between the semantic information vector and semantic information vectors of all public target detection data set image category information;
the initialization module is configured to initialize the target detection model of the current scene with the classification fully connected weights of the category with the highest matching value in the supernet with variable width and depth;
the submodel screening module is configured to pre-specify the floating-point operations and parameter count of the model, and traverse the model sampling space one by one to obtain the K target detection submodels of the current scene;
the model training module is configured to respectively adopt training images of a current scene to pre-train the K target detection submodels, and takes the pre-trained submodel with the maximum mAP value as a final target detection model of the current scene;
and the target detection module is configured to perform target detection on the image to be detected in the current scene through the final target detection model of the current scene to obtain the target detection result.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that the target detection system based on big data enabling and model flow provided by the foregoing embodiment is only illustrated by the above division of functional modules; in practical applications, the above functions may be distributed to different functional modules as needed, that is, the modules or steps in the embodiment of the present invention may be further decomposed or combined. For example, the modules of the foregoing embodiment may be combined into one module, or further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps and are not to be construed as unduly limiting the present invention.
An electronic apparatus according to a third embodiment of the present invention includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the processor, the instructions being executed by the processor to implement the target detection method based on big data enabling and model flow described above.
A computer-readable storage medium of a fourth embodiment of the present invention stores computer instructions to be executed by the computer to implement the target detection method based on big data enabling and model flow described above.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Those of skill in the art will appreciate that the various illustrative modules, method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules, method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A target detection method based on big data enabling and model flow, characterized in that the target detection method comprises:
step S10, acquiring an image for target detection in the current scene, and converting the category information of the image into semantic information vectors through a Word2vector model;
step S20, performing cosine similarity matching between these semantic information vectors and the semantic information vectors of the image category information of all public target detection datasets;
step S30, initializing the target detection model of the current scene with the classification fully connected weights of the category with the highest matching value in the supernet with variable width and depth;
step S40, pre-specifying the floating-point operations and parameter count of the model, and traversing the model sampling space one by one to obtain K target detection submodels for the current scene;
step S50, pre-training the K target detection submodels respectively with the training images of the current scene, and taking the pre-trained submodel with the largest mean Average Precision (mAP) value as the final target detection model of the current scene;
and step S60, performing target detection on the image to be detected in the current scene through the final target detection model of the current scene to obtain the target detection result.
2. The target detection method based on big data enabling and model flow according to claim 1, wherein the model sampling space is obtained by:
step A10, normalizing the images in all the obtained public target detection data sets, and converting the category information corresponding to the images into semantic information vectors through a Word2vector model;
step A20, calculating cosine similarity among semantic information vectors of each category, merging categories with the cosine similarity larger than a preset threshold value, and then performing label remapping;
and step A30, selecting any deep learning target detection model, and setting the width and depth sampling spaces of the model's backbone network according to the model's characteristics, to obtain the model sampling space.
3. The target detection method based on big data enabling and model flow according to claim 2, wherein the supernet with variable width and depth is constructed and trained by the following method:
step B10, dividing the label-remapped image set into batches of a set size, extracting the features of each batch of images through the feature extraction backbone network, and randomly sampling a model from the model sampling space;
step B20, inputting the features of the current batch of images into the classification and bounding-box regression branches of the corresponding randomly sampled model for forward propagation, and calculating the global loss of the model;
and step B30, reducing the global loss through backpropagation and stochastic gradient descent to update the model parameters, and training iteratively until a set end-of-training condition is reached, obtaining the supernet with variable width and depth.
4. The target detection method based on big data enabling and model flow according to claim 3, wherein before the label-remapped image set is divided into batches of a set size in step B10, an image set expansion step is further provided, as follows:
performing random multi-scale scaling and multi-angle flipping operations on the label-remapped images to obtain an expanded image set.
5. The target detection method based on big data enabling and model flow according to claim 3, wherein the global loss is expressed as:

$$L_{all} = \lambda L_{rcnn} + L_{rpn}$$

where $L_{rcnn}$ represents the classification and bounding-box regression loss of the model, $L_{rpn}$ represents the loss of the model's region proposal network part, and $\lambda$ is a balance factor weighting the two losses.
6. The target detection method based on big data enabling and model flow according to claim 5, wherein the classification and bounding-box regression loss of the model is expressed as:

$$L_{rcnn} = \frac{1}{N_{cls}} \sum_{k} L_{cls}(p_{ki}, p_{ki}^{*}) + \gamma \frac{1}{N_{reg}} \sum_{k} p_{ki}^{*} L_{reg}(t_{ki}, t_{ki}^{*})$$

where $k$ represents the index of the prediction box, $p_{ki}$ represents the predicted probability that the $k$-th prediction box is predicted as class $i$, $p_{ki}^{*}$ represents the actual probability that the label corresponding to the $k$-th prediction box is the $i$-th class, $L_{cls}$ is the cross-entropy loss, and $N_{cls}$ represents the total number of categories contained in the dataset; $t_{ki}$ represents the predicted coordinates of the $k$-th prediction box, $t_{ki}^{*}$ represents the true coordinates corresponding to the $k$-th prediction box, $L_{reg}$ is the smooth L1 loss, and $N_{reg}$ represents the total number of prediction boxes; $\gamma$ is a balance factor weighting the two losses.
7. The target detection method based on big data enabling and model flow according to claim 5, wherein the region proposal network part loss of the model is expressed as:

$$L_{rpn} = \frac{1}{K} \sum_{k} L_{cls}(p_{k}, p_{k}^{*}) + \beta \frac{1}{K} \sum_{k} p_{k}^{*} L_{reg}(t_{k}, t_{k}^{*})$$

where $k$ represents the index of the prediction box, $p_{k}$ represents the predicted probability that the $k$-th prediction box contains an object, $p_{k}^{*}$ represents the actual probability that the $k$-th prediction box contains an object, $L_{cls}$ is the binary cross-entropy loss, and $K$ represents the number of all predicted detection boxes; $t_{k}$ represents the predicted coordinates of the $k$-th prediction box, $t_{k}^{*}$ represents the true coordinates corresponding to the $k$-th prediction box, and $L_{reg}$ is the smooth L1 loss; $\beta$ is a balance factor weighting the two losses.
8. A target detection system based on big data enabling and model flow, characterized in that the target detection system comprises the following modules:
the semantic information vector extraction module is configured to acquire an image to be detected of a target in a current scene and convert the category information of the image into a semantic information vector through a Word2vector model;
the matching module is configured to perform cosine similarity matching between the semantic information vector and semantic information vectors of all public target detection data set image category information;
the initialization module is configured to initialize the target detection model of the current scene with the classification fully connected weights of the category with the highest matching value in the supernet with variable width and depth;
the submodel screening module is configured to pre-specify the floating-point operations and parameter count of the model, and traverse the model sampling space one by one to obtain the K target detection submodels of the current scene;
the model training module is configured to respectively adopt the training images of the current scene to pre-train the K target detection submodels, and takes the pre-trained submodel with the maximum mAP value as the final target detection model of the current scene;
and the target detection module is configured to perform target detection on the image to be detected in the current scene through the final target detection model of the current scene to obtain the target detection result.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the processor, the instructions being executed by the processor to implement the target detection method based on big data enabling and model flow of any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions, the computer instructions being executed by a computer to implement the target detection method based on big data enabling and model flow of any one of claims 1-7.
CN202111258992.4A 2021-10-28 2021-10-28 Target detection method, system and equipment based on big data enabling and model flow Pending CN114708185A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111258992.4A CN114708185A (en) 2021-10-28 2021-10-28 Target detection method, system and equipment based on big data enabling and model flow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111258992.4A CN114708185A (en) 2021-10-28 2021-10-28 Target detection method, system and equipment based on big data enabling and model flow

Publications (1)

Publication Number Publication Date
CN114708185A 2022-07-05

Family

ID=82166424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111258992.4A Pending CN114708185A (en) 2021-10-28 2021-10-28 Target detection method, system and equipment based on big data enabling and model flow

Country Status (1)

Country Link
CN (1) CN114708185A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024012179A1 (en) * 2022-07-15 2024-01-18 马上消费金融股份有限公司 Model training method, target detection method and apparatuses



Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination