CN115761444A - Training method of incomplete information target recognition model and target recognition method - Google Patents

Training method of incomplete information target recognition model and target recognition method

Info

Publication number
CN115761444A
Authority
CN
China
Prior art keywords
image
target
feature vector
vector
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211480465.2A
Other languages
Chinese (zh)
Other versions
CN115761444B (en)
Inventor
张栩铭
姜舜译
闫淇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202211480465.2A priority Critical patent/CN115761444B/en
Publication of CN115761444A publication Critical patent/CN115761444A/en
Application granted granted Critical
Publication of CN115761444B publication Critical patent/CN115761444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

A training method and a target recognition method for an incomplete information target recognition model relate to the technical field of image data processing and data recognition, and address the need for a technique that can recognize targets from incomplete information. The training method comprises the following steps: establishing an image video data set comprising a first image and continuous video frames, and manually annotating the target positions and target types on the images to obtain annotation labels; and training the incomplete target detection model according to the image video data set, the first output feature vector, the first fusion feature vector and the annotation labels. The invention improves the accuracy of the target detection model on incomplete targets and reduces the false detection rate of the model; spatial context enhancement makes the algorithm more robust for incomplete information target detection; and, for video data, the method effectively exploits the temporal context and uses relational modeling of different temporal information to improve target detection accuracy.

Description

Training method of incomplete information target recognition model and target recognition method
Technical Field
The invention relates to the technical field of image data processing and data recognition, in particular to a training method and a target recognition method of an incomplete information target recognition model.
Background
Target recognition is one of the core technologies of new-generation information technology for the intelligent processing of video images, and has attracted extensive attention and application in both the civil and defense fields. Current mainstream video image target recognition techniques achieve good recognition performance only under the ideal assumption that the targets in the image are clearly visible and have distinct contour features. In real application scenarios, however, there are many uncooperative, harsh conditions, such as targets being partially occluded or deliberately camouflaged, so that only local target information, i.e. incomplete target information (abbreviated as incomplete information), can be obtained from the video image, which makes target recognition difficult.
Most existing mainstream target detection methods are based on convolutional neural networks and obtain the target position and type directly by extracting low-level and high-level features of the whole image. Their drawback is that the features of an incomplete target differ greatly from those of the complete target, so an ordinary convolutional neural network cannot recognize it accurately. Since an incomplete target contains only part of the structural information of the target, a technique for recognizing incomplete targets is needed to solve the above problem.
Disclosure of Invention
The invention provides a training method of an incomplete information target recognition model and a target recognition method, aiming at solving the problem of accurately recognizing an incomplete target.
The technical scheme adopted by the invention for solving the technical problem is as follows:
a training method of an incomplete information target recognition model comprises the following steps:
step 1, establishing an image video data set, wherein the image video data set comprises a first image and continuous video frames, and the first image and the continuous video frames both have incomplete targets;
step 2, a plurality of second images of continuous frames in the continuous video frames are obtained, and the target positions and the target types on the first images and the second images are manually marked to obtain marking labels;
step 3, training an incomplete information target recognition model according to the image video data set, the first output feature vector, the first fusion feature vector and the label to obtain the trained incomplete information target recognition model;
the first output feature vector and the first fusion feature vector are obtained by the following method:
for a first image, extracting the features of the first image according to the label of the first image to obtain a first output feature vector; and for continuous video frames in the image video data set, performing feature extraction on second images of the continuous frames according to the label tags of the second images to obtain second output feature vectors, and performing feature fusion on the second output feature vectors of the second images of the continuous frames to obtain a first fusion feature vector.
The invention has the beneficial effects that:
according to the training method and the target recognition method of the incomplete information target recognition model, the relationships between the target and the whole image and between the local characteristics of the target are effectively modeled, the accuracy of the target recognition model on the incomplete target is improved, and the false detection rate of the model is reduced; the robustness of the algorithm to incomplete information target detection is better through the enhancement of the spatial context; the method can effectively utilize the context relationship of time domains aiming at the video data, and utilizes the relational modeling of different time domain information to improve the target detection accuracy.
Drawings
FIG. 1 is a schematic diagram of the original Transformer-based target recognition algorithm.
FIG. 2 is a schematic diagram of the improved feature coding network structure of the present invention.
FIG. 3 is a schematic diagram of the Transformer-based, spatio-temporal-context-enhanced target recognition algorithm of the present invention for consecutive video frames.
FIG. 4 is a schematic diagram of the Transformer-based, spatio-temporal-context-enhanced target recognition algorithm of the present invention for single images.
FIG. 5 is a schematic diagram of the target distribution of the test data set according to the present invention.
FIG. 6 is a comparison chart of the detection metrics of the Transformer-based target recognition algorithms according to the present invention.
FIG. 7 shows a comparison of example images of the recognition results of the original Transformer method and the method of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Example one
Fig. 1 is a schematic diagram of the original Transformer-based target recognition algorithm. Because the features of an incomplete target differ greatly from those of the complete target, a general convolutional neural network cannot recognize such a target accurately, and the present embodiment is proposed to address this problem.
The embodiment provides a training method of an incomplete information target recognition model, which comprises the following steps:
step 1, establishing an image video data set, wherein the image video data set comprises a first image and continuous video frames, and the first image and the continuous video frames both have incomplete targets;
step 2, images of the continuous video frames are called second images, a plurality of continuous frame second images in the continuous video frames are obtained, and target positions and target types on the first images and the second images are manually marked to obtain marking labels;
step 3, training an incomplete information target recognition model according to the image video data set, the first output feature vector, the first fusion feature vector and the label to obtain the trained incomplete information target recognition model;
the first output feature vector and the first fusion feature vector are obtained by the following method:
for the first image, extracting the features of the first image according to the label of the first image to obtain a first output feature vector; and for continuous video frames in the image video data set, performing feature extraction on the second images of the continuous frames according to the label tags of the second images of the continuous frames to obtain second output feature vectors, and performing feature fusion on the second output feature vectors of the second images of the continuous frames to obtain a first fusion feature vector.
The incomplete information target recognition model obtained in this embodiment is not limited to recognizing only incomplete information targets.
Example two
The embodiment provides a training method of an incomplete information target recognition model, which comprises the following steps:
step 1, establishing an image video data set, wherein the image video data set comprises a first image and continuous video frames, and both the first image and the continuous video frames have incomplete targets and also have complete targets.
Step 2, manually marking the target positions and the target types on the first image and the continuous video frames to obtain marking labels, namely obtaining marked image video data sets; the images in successive video frames are referred to as second images. As an embodiment, the second images of all of the consecutive video frames are manually annotated.
For the continuous video frames in the above step 1 and step 2, the image video data set may also include a video and a first image, the continuous video frames are continuous frames extracted in the video for a certain time period or certain time periods, and the frame extraction proportion of the continuous video frames obtained by extracting the frames in the video is not less than 40% of the total number of frames contained in the video. In the first image, the continuous video frame and the second image of the marked continuous frame, the targets simultaneously comprise a complete information target and an occluded incomplete information target.
Step 3, for the first image, extracting the features of the first image according to the manually marked target type and target position to obtain a first output feature vector; for continuous video frames in the image video data set, performing feature extraction on the second images of consecutive frames within a certain period (for example, 16 frames in an adjacent time domain) according to the manually labeled target type and target position to obtain second output feature vectors, and performing feature fusion (feature splicing) on the second output feature vectors of the second images (16 frames) of the consecutive frames to obtain a first fusion feature vector, thereby realizing temporal context enhancement; the second images subjected to feature extraction in step 3 are the second images manually labeled in step 2.
Feature extraction is typically performed for each image, i.e. each first image and each second image. Preferably, the method of feature extraction for the first image is the same as the method of feature extraction for the second image;
and training the incomplete information target recognition model according to the first output feature vector, the first fusion feature vector and the label to obtain the trained incomplete information target recognition model.
And 4, training the incomplete information target recognition model according to the image video data set, the first output feature vector, the first fusion feature vector and the label to obtain the trained incomplete information target recognition model.
The trained incomplete information target identification model is used for acquiring the target type and the target position of an incomplete target in a video or an image according to the video or the image, and is also used for acquiring the target type and the target position of a complete target in the video or the image according to the video or the image.
The first output feature vector and the second output feature vector both include spatial context information, specifically the structural relationship information between the components of the target and the spatial relationship information between the target and the original image. The first image in the image video data set and the second images in the continuous video frames of the image video data set are both referred to as original images. The fusion of the second output feature vectors realizes the temporal context enhancement.
In step 1, in order to ensure the generalization ability of the method, the number of images in the image video data set is not less than 1,000,000, and the number of labeled target categories is not less than 1,000.
Fig. 2 is a schematic diagram of a feature extraction network structure, and the specific method for feature extraction in step 3 is as follows:
step 3.1, scaling the original image to a first pixel size in a unified manner, wherein the first pixel size is x p1 ×y p1 ,x p1 And y p1 Are all positive integers, and then divide the image of the first pixel size into N 1 ×M 1 An image grid, N 1 And M 1 Are all integers greater than 2, it is worth noting that for values characterizing pixel size, x is a positive integer p1 Need to be greater than N 1 、y p1 Need to be greater than N 1 And the like, and the obvious value and size relationship is not described in detail herein.
For example, with uniform scaling to 1024 × 1024 pixel size, the image is divided into 16 × 16 image grid areas (i.e., 256 grid areas with 16 rows and 16 columns).
Step 3.2, from the N_1 × M_1 image grids, randomly extracting and discarding image grids at a ratio r, leaving N_1 × M_1 × (1-r) image grids, and performing feature extraction on each of the remaining N_1 × M_1 × (1-r) image grids using a first convolutional neural network to obtain a multi-dimensional first feature vector.
For example, grids at a ratio r are randomly extracted from the 256 image grid areas and discarded, and each remaining image grid is subjected to feature extraction using a first convolutional neural network to obtain a 128-dimensional first feature vector, where 0.3 ≤ r ≤ 0.6; in this embodiment r = 0.5, and the first convolutional neural network is ResNet50.
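As an illustration only (this code does not appear in the patent), a minimal PyTorch sketch of steps 3.1–3.2 might look as follows; the class name GridFeatureExtractor and the default hyper-parameters are assumptions, and the backbone's final layer is simply replaced so that it emits a 128-dimensional vector per grid cell.

```python
import torch
import torch.nn as nn
from torchvision import models

class GridFeatureExtractor(nn.Module):
    """Illustrative sketch of steps 3.1-3.2: split the scaled image into an
    N1 x M1 grid, randomly discard a ratio r of the cells, and encode each
    remaining cell into a 128-dimensional first feature vector."""
    def __init__(self, grid=(16, 16), drop_ratio=0.5, out_dim=128):
        super().__init__()
        self.grid = grid
        self.drop_ratio = drop_ratio
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, out_dim)  # 128-d per cell
        self.backbone = backbone

    def forward(self, image):            # image: (3, 1024, 1024), already scaled
        n, m = self.grid
        c, h, w = image.shape
        cells = image.reshape(c, n, h // n, m, w // m).permute(1, 3, 0, 2, 4)
        cells = cells.reshape(n * m, c, h // n, w // m)       # (256, 3, 64, 64)
        keep = torch.randperm(n * m)[: int(n * m * (1 - self.drop_ratio))]
        return self.backbone(cells[keep])                     # (kept cells, 128)
```

Randomly dropping grid cells during training exposes the network to partial layouts, which is presumably why a drop ratio as high as 0.6 is tolerated.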
Step 3.3, extracting a target area image from the original image according to the manually marked target type and target position, and scaling the target area image to a second pixel size x_p2 × y_p2, where usually x_p2 < x_p1 and y_p2 < y_p1.
Extracting an image of a target area in the original image, namely a target area screenshot (the target screenshot in fig. 2) according to the manual annotation result, and zooming the target area screenshot to 256 × 256 pixels;
step 3.4, carrying out grid division on the zoomed target area image, and dividing the zoomed target area image into N 2 ×M 2 A grid area, N 2 And M 2 Are integers of 2 or more. For example, the division into 8 × 8 grid regions for a total of 64 grids.
Step 3.5, from the N_2 × M_2 grid areas, randomly extracting and discarding grid areas at a ratio f, and performing feature extraction on the remaining N_2 × M_2 × (1-f) grid areas using a second convolutional neural network to obtain a multi-dimensional second feature vector.
For example, a 128-dimensional second feature vector is obtained, where the extraction ratio f satisfies 0.2 ≤ f ≤ 0.7 and the number of second feature vectors is 64 × (1-f) × n, where n is the number of targets; in this embodiment f = 0.4, the number of second feature vectors is 38 × n, the second convolutional neural network is ResNet18, and the last layer of ResNet18 is replaced by a fully connected layer with a 128-dimensional output.
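A companion sketch of the target-region branch in steps 3.3–3.5, again purely illustrative (TargetRegionEncoder and its defaults are assumptions): it crops the annotated box, rescales it to 256 × 256, splits it into an 8 × 8 grid and encodes each kept cell with a ResNet18 whose last layer is replaced by a 128-dimensional fully connected layer.

```python
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF
from torchvision import models

class TargetRegionEncoder(nn.Module):
    """Illustrative sketch of steps 3.3-3.5: crop the annotated target box,
    rescale it to 256x256, split it into an 8x8 grid, keep a ratio (1 - f)
    of the cells, and encode each kept cell into a 128-d second feature vector."""
    def __init__(self, grid=(8, 8), drop_ratio=0.4, out_dim=128):
        super().__init__()
        self.grid = grid
        self.drop_ratio = drop_ratio
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, out_dim)  # last layer -> 128-d FC
        self.backbone = backbone

    def forward(self, image, box):       # box = (x1, y1, x2, y2) from the annotation
        x1, y1, x2, y2 = box
        crop = TF.resized_crop(image, top=y1, left=x1, height=y2 - y1,
                               width=x2 - x1, size=[256, 256])
        n, m = self.grid
        c, h, w = crop.shape
        cells = crop.reshape(c, n, h // n, m, w // m).permute(1, 3, 0, 2, 4)
        cells = cells.reshape(n * m, c, h // n, w // m)      # (64, 3, 32, 32)
        keep = torch.randperm(n * m)[: int(n * m * (1 - self.drop_ratio))]
        return self.backbone(cells[keep])                    # (e.g. 38 cells, 128)
```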
Step 3.6, for each target, encoding the manually marked target position using a preset encoding rule to obtain a multi-dimensional coding feature vector, namely, expressing the manually marked target position (such as the target center point) as a vector, for example a 128-dimensional coding feature vector, wherein the first feature vector, the second feature vector and the coding feature vector have the same dimension.
The preset encoding rule is as follows:
PE(pos, 2·d_index) = sin(pos / 10000^(2·d_index / d))
PE(pos, 2·d_index + 1) = cos(pos / 10000^(2·d_index / d))
wherein PE represents the position code; pos represents the number of the image grid in which the center of the current target position falls, the grid numbers being assigned according to the row-priority (row-major) criterion; d represents the dimension of the coding feature vector;
d_index = ⌊k / 2⌋
where k is the element position in the coding feature vector, i.e., d_index is the element position in the coding feature vector divided by 2 and rounded down.
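A minimal sketch of this encoding rule, assuming the sinusoidal form given above, might be:

```python
import math
import torch

def encode_target_position(pos: int, d: int = 128) -> torch.Tensor:
    """Illustrative sketch of step 3.6: sinusoidal encoding of the row-major grid
    number `pos` of the target center into a d-dimensional coding feature vector.
    Even element positions use sin, odd ones use cos, with d_index = k // 2."""
    pe = torch.zeros(d)
    for k in range(d):                    # k: element position in the coding feature vector
        d_index = k // 2                  # element position divided by 2, rounded down
        angle = pos / (10000 ** (2 * d_index / d))
        pe[k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe

# Example: a target whose center falls in row 3, column 5 of a 16 x 16 grid
# has pos = 3 * 16 + 5 = 53, so its coding feature vector is encode_target_position(53).
```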
Step 3.7, carrying out vector fusion on the obtained coding feature vector and the second feature vector to obtain a third feature vector;
and 3.8, inputting the first feature vector and the third feature vector into a transform Encoder (transform Encoder), and obtaining an output feature vector through a Self-Attention mechanism (Self-Attention). When the original image in step 3.1 is the first image, the output feature vector is the first output feature vector, and when the original image in step 3.1 is the second image, the output feature vector is the second output feature vector. The output feature vector includes spatial context information, and the spatial context information includes a structural relationship of the target component and a spatial relationship between the target whole and the image whole.
Step 3.9, when the original image in step 3.1 is the second image, performing feature fusion on the second output feature vectors of the second images of the consecutive frames to obtain a first fusion feature vector.
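Step 3.9 amounts to splicing the per-frame outputs; a one-line sketch (illustrative only) is:

```python
import torch

def fuse_temporal_context(per_frame_outputs):
    """Illustrative sketch of step 3.9: splice (concatenate) the second output
    feature vectors of the consecutive second images (e.g. 16 adjacent frames)
    along the token axis to form the first fusion feature vector."""
    # per_frame_outputs: list of tensors, one per frame, each of shape (K_t, 128)
    return torch.cat(per_frame_outputs, dim=0)      # shape (sum of K_t, 128)
```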
The training in the step 4 includes:
step 4.1, inputting the first fusion characteristic vector and the first output characteristic vector into a Transformer Decoder (Transformer Decoder), and simultaneously inputting a preset number m of Query key value vectors (Query) into the Transformer Decoder, and decoding to obtain m decoding characteristic outputs, wherein m is an integer not less than 256, and the Query key value vectors are trainable parameters;
step 4.2, calculating each decoding characteristic output through a feed forward neural network (FFN) to obtain a result vector containing a target type and a target position, and analyzing the result vector to obtain an analyzed target type and an analyzed target position;
step 4.3, comparing the analyzed target type and the analyzed target position with the label, calculating a loss function, and updating the network parameters of the incomplete information target identification model through a back propagation algorithm;
the method specifically comprises the following steps: comparing the target category obtained by analysis with a target category label manually marked, comparing the target position obtained by analysis with a target position label manually marked, calculating a loss function according to the comparison result, and updating parameters through a back propagation algorithm;
and 4.4, iteratively updating the network parameters, and finishing model training when the iteration times are finished and/or the preset optimal performance metric value is reached, namely obtaining the trained incomplete information target recognition model, and finishing training the incomplete information target recognition model.
In step 4, the first output feature vector and the first fusion feature vector are obtained according to the image video data set and the annotation labels; the specific acquisition method is the same as that of step 3.
As shown in Fig. 3, if the input data to be detected is a video, i.e. the input of the incomplete information target recognition model is continuous video frames, feature extraction is performed on the images in the continuous video frames according to the manually labeled target type and target position to obtain second output feature vectors, and feature fusion is performed on the second output feature vectors of several frames in the adjacent time domain to obtain the first fusion feature vector; the model then obtains the target type and target position of the incomplete target from the first fusion feature vector. Specifically, the first fusion feature vector is input into the Transformer decoder, which holds the preset number m of query key-value vectors; m decoding feature outputs are obtained by decoding, each decoding feature output is passed through the feed-forward neural network to obtain a result vector containing a target type and a target position, and the result vector is analyzed to obtain the analyzed target type and analyzed target position, which are the target type and target position of the incomplete target detected and output by the model.
The network structure of the algorithm for the first image is shown in Fig. 4. If the input data to be detected is a single image, i.e. the input of the incomplete information target recognition model is a first image, feature extraction is performed on the first image to obtain a first output feature vector, and the model obtains the target type and target position of the incomplete target from the first output feature vector. Specifically, the first output feature vector is input into the Transformer decoder, which holds the preset number m of query key-value vectors; m decoding feature outputs are obtained by decoding, each decoding feature output is passed through the feed-forward neural network to obtain a result vector containing a target type and a target position, and the result vector is analyzed to obtain the analyzed target type and analyzed target position, which are taken as the target type and target position of the incomplete target detected and output by the model.
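Purely for illustration, the parsing of the m result vectors at inference time might look like the following sketch; the score threshold and the "no object" column convention are assumptions carried over from the training sketch above.

```python
import torch

def parse_result_vectors(logits, boxes, score_thresh=0.5):
    """Illustrative parsing of the m result vectors into analyzed target types and
    positions; logits: (m, num_classes + 1), boxes: (m, 4) normalized (cx, cy, w, h)."""
    probs = logits.softmax(dim=-1)
    scores, labels = probs[..., :-1].max(dim=-1)    # drop the assumed "no object" column
    keep = scores > score_thresh                    # assumed confidence threshold
    return labels[keep], boxes[keep], scores[keep]  # target types, target positions, scores
```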
EXAMPLE III
The invention also provides a target identification method of the incomplete information target identification model, which comprises the following steps:
and performing incomplete target detection on the video and/or the image by using the trained incomplete information target identification model in the first embodiment or the second embodiment, outputting the target type and the target position of the incomplete target by using the trained incomplete information target identification model of the video and/or the image, and further outputting the target type and the target position of the complete target by using the model.
The model was tested and verified: 1,000 images were selected as a test data set, and the target recognition results of the method of the present invention were compared with those of the original Transformer-based target recognition method. The data set contains visible-light and infrared images captured by unmanned aerial vehicles; the scenes include mountainous regions, plains/suburbs, oceans and deserts/Gobi, and the target categories include vehicles, aircraft and ships. Complete targets account for 42%, targets with an occlusion rate below 30% account for 43%, and targets with an occlusion rate between 30% and 80% account for 15%. The target distribution is shown in Fig. 5, and the result statistics are shown in Fig. 6. Some image recognition results are shown in Fig. 7.
Compared with the prior art, the technical scheme of the invention has the following advantages:
1) The method effectively models the relationship between the target and the whole image and the relationships among the local features of the target, improves the accuracy of the target recognition model, and reduces the false detection rate of the model. In addition, spatial context enhancement makes the algorithm more robust for incomplete information target detection.
2) For video data, the method effectively exploits the temporal context and uses relational modeling of information from different time instants to improve target detection accuracy by more than 3 percent.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A training method of an incomplete information target recognition model is characterized by comprising the following steps:
step 1, establishing an image video data set, wherein the image video data set comprises a first image and continuous video frames, and the first image and the continuous video frames both have incomplete targets;
step 2, obtaining a plurality of continuous frame second images in continuous video frames, and manually marking the target positions and the target types on the first images and the second images to obtain marking labels;
step 3, training an incomplete information target recognition model according to the image video data set, the first output characteristic vector, the first fusion characteristic vector and the label to obtain the trained incomplete information target recognition model;
the first output feature vector and the first fusion feature vector are obtained by the following method:
for the first image, extracting the features of the first image according to the label of the first image to obtain a first output feature vector; and for continuous video frames in the image video data set, performing feature extraction on second images of the continuous frames according to the label tags of the second images to obtain second output feature vectors, and performing feature fusion on the second output feature vectors of the second images of the continuous frames to obtain a first fusion feature vector.
2. The method as claimed in claim 1, wherein the first image and the second image are both referred to as original images, and the method for extracting features in step 3 comprises:
step 3.1, scaling the original image to a first pixel size x_p1 × y_p1, wherein x_p1 and y_p1 are both positive integers, and then dividing the image of the first pixel size into N_1 × M_1 image grids, wherein N_1 and M_1 are both integers greater than 2;
step 3.2, from the N_1 × M_1 image grids, randomly extracting and discarding image grids at a ratio r, leaving N_1 × M_1 × (1-r) image grids, and performing feature extraction on each of the remaining N_1 × M_1 × (1-r) image grids using a first convolutional neural network to obtain a multi-dimensional first feature vector;
step 3.3, extracting a target area image from the original image according to the manually marked target type and target position, and scaling the target area image to a second pixel size x_p2 × y_p2, wherein usually x_p2 < x_p1 and y_p2 < y_p1;
step 3.4, carrying out grid division on the scaled target area image, dividing it into N_2 × M_2 grid areas, wherein N_2 and M_2 are both integers greater than or equal to 2;
step 3.5, from the N_2 × M_2 grid areas, randomly extracting and discarding grid areas at a ratio f, and performing feature extraction on the remaining N_2 × M_2 × (1-f) grid areas using a second convolutional neural network to obtain a multi-dimensional second feature vector;
3.6, according to the manually marked target positions, coding the target position of each target by adopting a preset coding rule to obtain a multidimensional coding feature vector, wherein the first feature vector, the second feature vector and the coding feature vector have the same dimension;
step 3.7, carrying out vector fusion on the coding feature vector and the second feature vector to obtain a third feature vector;
step 3.8, inputting the first feature vector and the third feature vector into a transform encoder, and obtaining an output feature vector through a self-attention mechanism; when the image in the step 3.1 is a first image, the output characteristic vector is a first output characteristic vector, and when the image in the step 3.1 is a second image, the output characteristic vector is a second output characteristic vector and the step 3.9 is carried out;
and 3.9, performing feature fusion on the second output feature vector of the second image of the continuous frame to obtain a first fusion feature vector.
3. The method as claimed in claim 2, wherein the predetermined coding rule is:
PE(pos, 2·d_index) = sin(pos / 10000^(2·d_index / d))
PE(pos, 2·d_index + 1) = cos(pos / 10000^(2·d_index / d))
wherein PE represents the position code; pos represents the number of the image grid in which the center of the current target position falls, the grid numbers being assigned according to the row-priority (row-major) criterion; d represents the dimension of the coding feature vector;
d_index = ⌊k / 2⌋
where k is the element position in the coding feature vector, i.e., d_index is the element position in the coding feature vector divided by 2 and rounded down.
4. The method as claimed in claim 2, wherein 0.3 ≤ r ≤ 0.6 and 0.2 ≤ f ≤ 0.7.
5. The method as claimed in claim 2, wherein the first convolutional neural network is ResNet50, and the second convolutional neural network is ResNet18.
6. The method as claimed in claim 1, wherein the first output feature vector and the second output feature vector each include information of a spatial relationship between the target and the original image and information of a structural relationship between components of the target.
7. The method as claimed in claim 1, wherein the training of the incomplete information object recognition model comprises:
inputting the first fusion characteristic vector and the first output characteristic vector into a Transformer decoder, inputting a preset number m of query key value vectors into the Transformer decoder, and decoding to obtain m decoding characteristic outputs, wherein m is an integer, and the query key value vectors are trainable parameters;
calculating by each decoding characteristic output through a feedforward neural network to obtain a result vector containing a target type and a target position, and analyzing the result vector to obtain an analyzed target type and an analyzed target position;
and comparing the analyzed target type and the analyzed target position with the label, calculating a loss function, and updating the network parameters of the incomplete information target identification model through a back propagation algorithm.
8. A method of object recognition, comprising:
acquiring an image or video to be identified;
inputting the image or video to be recognized into an incomplete information target recognition model for processing to obtain the target type and the target position output by the incomplete information target recognition model, wherein the incomplete information target recognition model is obtained by training through the training method of any one of claims 1 to 7.
CN202211480465.2A 2022-11-24 2022-11-24 Training method of incomplete information target recognition model and target recognition method Active CN115761444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211480465.2A CN115761444B (en) 2022-11-24 2022-11-24 Training method of incomplete information target recognition model and target recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211480465.2A CN115761444B (en) 2022-11-24 2022-11-24 Training method of incomplete information target recognition model and target recognition method

Publications (2)

Publication Number Publication Date
CN115761444A true CN115761444A (en) 2023-03-07
CN115761444B CN115761444B (en) 2023-07-25

Family

ID=85336699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211480465.2A Active CN115761444B (en) 2022-11-24 2022-11-24 Training method of incomplete information target recognition model and target recognition method

Country Status (1)

Country Link
CN (1) CN115761444B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222916A (en) * 2021-04-28 2021-08-06 北京百度网讯科技有限公司 Method, apparatus, device and medium for detecting image using target detection model
CN113191318A (en) * 2021-05-21 2021-07-30 上海商汤智能科技有限公司 Target detection method and device, electronic equipment and storage medium
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer
CN114120172A (en) * 2021-10-29 2022-03-01 北京百度网讯科技有限公司 Video-based target detection method and device, electronic equipment and storage medium
CN114973038A (en) * 2022-06-20 2022-08-30 西安微电子技术研究所 Transformer-based airport runway line detection method
CN115294501A (en) * 2022-08-11 2022-11-04 北京字跳网络技术有限公司 Video identification method, video identification model training method, medium and electronic device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Ashish Vaswani et al.: "Attention Is All You Need", arXiv:1706.03762v5 [cs.CL], pages 1-15 *
Krishna Kumar Singh et al.: "Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-supervised Object and Action Localization", arXiv:1704.04232v2 [cs.CV], pages 1-10 *
Lu He et al.: "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv:2105.10920v1 [cs.CV], pages 3-7 *
Hong Feng et al.: "Research on detection and tracking algorithms for video target vehicles based on spatio-temporal consistency constraints", Journal of Electronic Measurement and Instrumentation, vol. 36, no. 3, pages 105-112 *

Also Published As

Publication number Publication date
CN115761444B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
Bappy et al. Hybrid lstm and encoder–decoder architecture for detection of image forgeries
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN108537119B (en) Small sample video identification method
CN110751018A (en) Group pedestrian re-identification method based on mixed attention mechanism
CN110889375B (en) Hidden-double-flow cooperative learning network and method for behavior recognition
CN112150450B (en) Image tampering detection method and device based on dual-channel U-Net model
CN113822246B (en) Vehicle weight identification method based on global reference attention mechanism
CN109919032A (en) A kind of video anomaly detection method based on action prediction
CN110765841A (en) Group pedestrian re-identification system and terminal based on mixed attention mechanism
CN115082966B (en) Pedestrian re-recognition model training method, pedestrian re-recognition method, device and equipment
CN111160096A (en) Method, device and system for identifying poultry egg abnormality, storage medium and electronic device
CN109492610B (en) Pedestrian re-identification method and device and readable storage medium
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN109614933A (en) A kind of motion segmentation method based on certainty fitting
CN114067286A (en) High-order camera vehicle weight recognition method based on serialized deformable attention mechanism
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN113936175A (en) Method and system for identifying events in video
Salem et al. Semantic image inpainting using self-learning encoder-decoder and adversarial loss
CN110825916A (en) Person searching method based on body shape recognition technology
CN113642685A (en) Efficient similarity-based cross-camera target re-identification method
CN117197727A (en) Global space-time feature learning-based behavior detection method and system
CN116453102A (en) Foggy day license plate recognition method based on deep learning
CN111783570A (en) Method, device and system for re-identifying target and computer storage medium
CN111709442A (en) Multilayer dictionary learning method for image classification task

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant