CN114937231B - Target identification tracking method - Google Patents

Target identification tracking method

Info

Publication number
CN114937231B
CN114937231B
Authority
CN
China
Prior art keywords
target
target object
characteristic
layer
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210858864.1A
Other languages
Chinese (zh)
Other versions
CN114937231A (en)
Inventor
寇映 (Kou Ying)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU XIWU SECURITY SYSTEM ALLIANCE CO LTD
Original Assignee
CHENGDU XIWU SECURITY SYSTEM ALLIANCE CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU XIWU SECURITY SYSTEM ALLIANCE CO LTD filed Critical CHENGDU XIWU SECURITY SYSTEM ALLIANCE CO LTD
Priority to CN202210858864.1A priority Critical patent/CN114937231B/en
Publication of CN114937231A publication Critical patent/CN114937231A/en
Application granted granted Critical
Publication of CN114937231B publication Critical patent/CN114937231B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10016: Video; Image sequence
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30241: Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

The invention relates to a target identification and tracking method comprising the following steps: a video image is input to a target identification module, which extracts the pixel features of all target detection frames in the video image and determines the target object from the pixel features; a target tracking module locates the target object and determines the direction the target object faces in the two-dimensional image frame, the facing direction being expressed within a 360-degree angular range; and the target tracking module predicts the movement trajectory of the target object in the next frame image from the angle the target object faces. By determining the detection frames of multiple target objects in a video image from their pixel features, the invention allows target objects to be tracked and located quickly.

Description

Target identification tracking method
Technical Field
The invention relates to the technical field of automatic target identification and tracking, in particular to a target identification and tracking method.
Background
Identifying and tracking people in collected video images is a promising direction in modern tracking technology, and deep neural networks can already recognize people well. However, when a video image contains many human targets and those targets undergo large motion deviation, mutual occlusion and the like, for example in scenes with heavy foot traffic such as shopping malls, squares and stations, target tracking performance is limited, the target detection frames jump, and tracking accuracy drops. The technology for identifying and tracking targets in video images therefore needs further improvement.
Disclosure of Invention
The invention provides a target identification and tracking method that aims to determine the detection frames of multiple target objects in a video image from the pixel features of those objects, so that the target objects can be tracked and located quickly.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
A target identification and tracking method, comprising the following steps:
Step S1: a video image is input to a target identification module, which extracts the pixel features of all target detection frames in the video image and determines a target object from the pixel features;
Step S2: a target tracking module locates the target object and determines the direction the target object faces in the two-dimensional image frame, the facing direction being expressed within a 360-degree angular range;
Step S3: the target tracking module predicts the movement trajectory of the target object in the next frame image from the angle the target object faces.
In this scheme, the pixel features of the target detection frames in the image are obtained and used to determine where each detection frame moves in the next frame. This addresses the frequent jumping of target detection frames when many people occlude one another in the video image: once the detection frame of a target object is determined, the target object can be tracked and located quickly even under occlusion or large motion deviation.
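The patent does not spell out how the per-frame pixel features are matched between consecutive frames; as an assumption offered purely for illustration, a simple greedy cosine-similarity association between the detection-frame feature vectors of two consecutive frames could look like the following sketch. The function name, thresholds and greedy matching rule are not taken from the patent.

```python
import numpy as np

def match_detections(prev_feats: np.ndarray, curr_feats: np.ndarray, min_sim: float = 0.5):
    """Greedily match detection frames of consecutive frames by cosine similarity
    of their pixel-feature vectors (an assumed association rule, not the patent's).

    prev_feats: M x D feature vectors of the detection frames in the previous frame.
    curr_feats: N x D feature vectors of the detection frames in the current frame.
    Returns a list of (prev_index, curr_index) pairs.
    """
    prev = prev_feats / np.linalg.norm(prev_feats, axis=1, keepdims=True)
    curr = curr_feats / np.linalg.norm(curr_feats, axis=1, keepdims=True)
    sim = prev @ curr.T                      # M x N cosine similarities
    pairs, used = [], set()
    for m in np.argsort(-sim.max(axis=1)):   # handle the most confident matches first
        n = int(np.argmax(sim[m]))
        if n not in used and sim[m, n] >= min_sim:
            pairs.append((int(m), n))
            used.add(n)
    return pairs
```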
Extracting, by the target identification module, the pixel features of all target detection frames in the video image comprises the following steps:
the target identification module comprises a chroma extraction unit, a deep convolutional neural network and a feature fusion layer;
for each pixel of a target detection frame in the current frame image, the chroma extraction unit extracts the four chroma values of the pixel, namely the red, green, blue and transparency chroma values of the j-th pixel in the i-th target detection frame;
the deep convolutional neural network extracts features from the four chroma values respectively to obtain the chroma feature corresponding to each chroma value;
and the feature fusion layer fuses the four chroma features through softmax to obtain the pixel feature of the pixel.
The number of deep convolutional neural networks is four, and each one extracts features from one chroma value. Each deep convolutional neural network comprises an average pooling layer, a two-dimensional deformable convolution layer, a first linear layer, a second linear layer, a first batch-normalization layer, a second batch-normalization layer, a first nonlinear activation layer, a second nonlinear activation layer and a fully connected layer;
any chroma value is input to the average pooling layer, which encodes an input feature; the input feature passes through the two-dimensional deformable convolution layer to obtain a position feature and a texture feature; the position feature is passed through the first linear layer, the first batch-normalization layer and the first nonlinear activation layer in sequence to obtain the weight vector of the position feature; the texture feature is passed through the second linear layer, the second batch-normalization layer and the second nonlinear activation layer in sequence to obtain the weight vector of the texture feature; finally, the weight vector of the position feature and the weight vector of the texture feature are weighted and combined by the fully connected layer to obtain the chroma feature corresponding to that chroma value.
In this scheme, the structure of the deep convolutional neural network is improved by splitting it into two branches: position-feature extraction and texture-feature extraction. The position feature refers to the position of a pixel within the target detection frame; for example, if the j-th pixel is at position (x, y), then as the target object moves the j-th pixel should stay at (x, y) within the target detection frame, which helps prevent the detection frame from jumping. The texture feature refers to external appearance such as what the person is wearing; for example, when clothes are wrinkled, clothes of the same color under the same light can show different gray levels, so the texture feature compensates for the chroma feature.
The feature fusion layer fusing the four chroma features through softmax to obtain the pixel feature of the pixel comprises the following: the red, green, blue and transparency chroma features of the j-th pixel in the i-th target detection frame are fused with their respective weights (the weight of the red chroma feature, of the green chroma feature, of the blue chroma feature and of the transparency chroma feature), giving the pixel feature O_{i,j} of the j-th pixel in the i-th target detection frame.
The target tracking module predicting the movement trajectory of the target object in the next frame image according to the facing angle of the target object comprises the following step:
the center-point position of the target detection frame is regressed with a loss function L_center, which constrains the distance between the predicted target detection frame and the real target detection frame in the next frame image. In this loss, x_i denotes the pixels of the input i-th target detection frame that belong to the target object and y_i the pixels that belong to the background; the remaining terms are the set of target-object pixels, the set of background pixels, a balance parameter, the center offset of the target detection frame, the direction angle the predicted target object faces, the direction angle the real target object faces, a cosine boundary r, the cosine of the predicted facing-direction angle, the cosine of the real facing-direction angle, a center weight, and a scale parameter.
Compared with the prior art, the invention has the beneficial effects that:
the chromaticity characteristics of each pixel point in the target detection frame are determined through the deep convolution neural network, so that the target detection frame is determined according to the pixel characteristics of the pixel points, even if large motion deviation and mutual shielding conditions occur when the flow of people is large on site, the target object can be quickly tracked and positioned according to the pixel characteristics of the target object, and the accuracy of target identification and tracking is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic structural diagram of a target recognition module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a deep convolutional neural network according to an embodiment of the present invention;
fig. 4 is a schematic diagram of determining a facing direction of a target object in a two-dimensional image frame according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Also, in the description of the present invention, the terms "first", "second", and the like are used for distinguishing between descriptions and not necessarily for describing a relative importance or implying any actual relationship or order between such entities or operations.
The embodiment is as follows:
the invention is realized by the following technical scheme, as shown in fig. 1, a target identification tracking method comprises the following steps:
and step S1, inputting the video image into the target recognition module, extracting the pixel characteristics of all target detection frames in the video image by the target recognition module, and determining the target object according to the pixel characteristics.
Although the input takes the form of a video, a video is ultimately a sequence of image frames, so the scheme is explained for a single frame. Referring to fig. 2, the target recognition module includes a chroma extraction unit, a deep convolutional neural network and a feature fusion layer. Since detecting persons in an image is mature prior art, the details of how the person target detection frames are obtained are not repeated here; reference may be made to the prior art.
An image is composed of pixels, and each pixel is stored in four bytes: the first byte is the red chroma value, the second the green chroma value, the third the blue chroma value, and the fourth the transparency (alpha) chroma value. Red, green and blue are the three primary colors, and the other colors in nature are mixed from different proportions of these three.
For each pixel of a target detection frame in the current frame image, the chroma extraction unit extracts the four chroma values of the pixel, namely the red chroma value, the green chroma value, the blue chroma value and the transparency chroma value of the j-th pixel in the i-th target detection frame.
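As a concrete illustration of the chroma extraction unit, the following minimal Python sketch splits the pixels of one detection frame into the four chroma planes. The function name, the H x W x 4 RGBA array layout and the (x1, y1, x2, y2) frame format are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def extract_chroma_values(frame_rgba: np.ndarray, box):
    """Split the pixels of one target detection frame into four chroma planes.

    frame_rgba: H x W x 4 uint8 array, one byte each for R, G, B and alpha.
    box: (x1, y1, x2, y2) pixel coordinates of the i-th detection frame (assumed format).
    Returns four float arrays: red, green, blue and transparency chroma values.
    """
    x1, y1, x2, y2 = box
    crop = frame_rgba[y1:y2, x1:x2].astype(np.float32) / 255.0  # normalize the raw bytes
    r, g, b, a = crop[..., 0], crop[..., 1], crop[..., 2], crop[..., 3]
    return r, g, b, a
```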
The deep convolutional neural network then extracts features from the four chroma values respectively to obtain the chroma features corresponding to the four chroma values. Referring to fig. 2, there are four deep convolutional neural networks with identical structures, and each one performs feature extraction on one chroma value.
Referring to fig. 3, each deep convolutional neural network includes an average pooling layer, a two-dimensional deformable convolution layer, a first linear layer, a second linear layer, a first batch-normalization layer, a second batch-normalization layer, a first nonlinear activation layer, a second nonlinear activation layer and a fully connected layer.
Taking the red chroma value as an example: the red chroma value is input to the average pooling layer, which encodes an input feature; the input feature passes through the two-dimensional deformable convolution layer to obtain a position feature and a texture feature; the position feature is passed through the first linear layer, the first batch-normalization layer and the first nonlinear activation layer in sequence to obtain the weight vector of the position feature; the texture feature is passed through the second linear layer, the second batch-normalization layer and the second nonlinear activation layer in sequence to obtain the weight vector of the texture feature; finally, the weight vector of the position feature and the weight vector of the texture feature are weighted and combined by the fully connected layer to obtain the red chroma feature. The other chroma values are input to their own deep convolutional neural networks and processed in the same way to obtain the corresponding chroma features.
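A minimal PyTorch sketch of one such per-chroma network is given below. It assumes torchvision's DeformConv2d for the two-dimensional deformable convolution; the channel widths, the split of the deformable-convolution output into a position map and a texture map, and the concatenation inside the fully connected layer are assumptions made for illustration, not details specified by the patent.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d  # two-dimensional deformable convolution

class ChromaBranch(nn.Module):
    """One of the four per-chroma networks: average pooling, a 2-D deformable
    convolution, a position branch and a texture branch (linear + batch norm +
    nonlinear activation each), and a final fully connected fusion layer.
    Detection-frame crops are assumed resized to a fixed size so the linear
    layers see a fixed input dimension."""

    def __init__(self, feat_ch: int = 16, hidden: int = 64):
        super().__init__()
        self.pool = nn.AvgPool2d(2)                          # encodes the input feature
        self.offset = nn.Conv2d(1, 2 * 3 * 3, 3, padding=1)  # offsets for the 3x3 deformable conv
        self.deform = DeformConv2d(1, 2 * feat_ch, 3, padding=1)
        # position branch: linear -> batch normalization -> nonlinear activation
        self.pos = nn.Sequential(nn.LazyLinear(hidden), nn.BatchNorm1d(hidden), nn.ReLU())
        # texture branch: linear -> batch normalization -> nonlinear activation
        self.tex = nn.Sequential(nn.LazyLinear(hidden), nn.BatchNorm1d(hidden), nn.ReLU())
        self.fc = nn.Linear(2 * hidden, hidden)              # weighted combination of both branches

    def forward(self, chroma_plane: torch.Tensor) -> torch.Tensor:
        # chroma_plane: N x 1 x H x W map of one chroma value inside the detection frame
        x = self.pool(chroma_plane)
        feat = self.deform(x, self.offset(x))                # N x 2*feat_ch x h x w
        pos_map, tex_map = feat.chunk(2, dim=1)              # assumed split: position / texture maps
        w_pos = self.pos(pos_map.flatten(1))                 # weight vector of the position feature
        w_tex = self.tex(tex_map.flatten(1))                 # weight vector of the texture feature
        return self.fc(torch.cat([w_pos, w_tex], dim=1))     # chroma feature for this chroma value
```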
Finally, the feature fusion layer fuses the four chroma features through softmax to obtain the pixel feature of the j-th pixel: the red, green, blue and transparency chroma features of the j-th pixel in the i-th target detection frame are fused with their respective weights, giving the pixel feature O_{i,j} of the j-th pixel in the i-th target detection frame.
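The published fusion formula appears only as an image in the source, so the sketch below shows one plausible reading in which the four chroma features are combined with softmax-normalized learnable weights; the class name and the 4-vector weight parameter are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ChromaFusion(nn.Module):
    """Feature fusion layer: combines the four chroma features with learnable
    weights normalized by softmax (an assumed reading of the published formula)."""

    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(4))  # weights for the R, G, B and alpha chroma features

    def forward(self, f_r, f_g, f_b, f_a):
        w = torch.softmax(self.w, dim=0)                      # normalized weights
        stacked = torch.stack([f_r, f_g, f_b, f_a], dim=0)    # 4 x N x D chroma features
        return (w.view(4, 1, 1) * stacked).sum(dim=0)         # pixel feature O_{i,j}
```

In this reading, O_{i,j} is simply the softmax-weighted sum of the four per-chroma feature vectors of pixel j in detection frame i.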
In step S2, the target tracking module locates the target object and determines the direction the target object faces in the two-dimensional image frame, with the facing direction expressed within a 360° angular range.
The captured scene is three-dimensional but can only be displayed as a two-dimensional image. If the camera looked at the person from directly above or directly below, the facing direction would map proportionally: a rotation of the body by β degrees would appear as a rotation of β degrees in the two-dimensional image. In practice the camera is not directly above or below the person, so a real rotation of β degrees does not appear as β degrees in the image. A fit is therefore needed: a frame of the image is fixed with the head or some other body part as the center of a circle, a coordinate system is set up for the frame (see fig. 4), and, because the distance between the person and the origin of the coordinate system differs at different positions, a linear relation is fitted between the person's actual rotation angle and the rotation angle seen in the two-dimensional image. For example, when the person rotates from point b to point b', the actual rotation angle is β' while the image shows β, and β and β' are related linearly through the fit in the coordinate system shown in fig. 4.
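The fitting step described above can be illustrated with a short NumPy sketch. The calibration samples below are invented purely to show the shape of the fit; the assumption is that pairs of (actual rotation angle, angle measured in the image) are available for the camera pose in question.

```python
import numpy as np

# Hypothetical calibration samples: the person's actual rotation angle (degrees)
# and the rotation angle measured in the two-dimensional image at that position.
actual_deg = np.array([0.0, 45.0, 90.0, 135.0, 180.0, 225.0, 270.0, 315.0])
image_deg = np.array([0.0, 38.0, 77.0, 118.0, 158.0, 199.0, 240.0, 281.0])

# Fit the linear relation image_angle ≈ k * actual_angle + b for this camera pose.
k, b = np.polyfit(actual_deg, image_deg, deg=1)

def image_to_actual(beta_image: float) -> float:
    """Map an angle measured in the image back to the actual facing angle."""
    return (beta_image - b) / k
```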
In step S3, the target tracking module predicts a moving track of the target object in the next frame of image according to the angle that the target object faces.
The center-point position of the target detection frame is regressed with a loss function L_center, which constrains the distance between the predicted target detection frame and the real target detection frame in the next frame image so that the prediction accuracy keeps improving. In this loss, x_i denotes the pixels of the input i-th target detection frame that belong to the target object and y_i the pixels that belong to the background; the remaining terms are the set of target-object pixels, the set of background pixels, a balance parameter, the center offset of the target detection frame, the direction angle the predicted target object faces, the direction angle the real target object faces, a cosine boundary r, the cosine of the predicted facing-direction angle, the cosine of the real facing-direction angle, a center weight, and a scale parameter.
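The published form of L_center is given only as images in the source, so the sketch below is a loose illustration under an assumed form: a balanced center-offset regression over target-object and background pixels plus a cosine-margin penalty on the facing-direction angle, scaled by a center weight and a scale parameter. Every formula detail here is an assumption, not the patent's actual equation.

```python
import torch

def center_loss(pred_offset, true_offset, pred_angle, true_angle,
                obj_mask, alpha: float = 0.5, r: float = 0.2,
                w_center: float = 1.0, s: float = 10.0):
    """Illustrative L_center-style loss (assumed form, not the patent's formula).

    pred_offset, true_offset: N x 2 predicted / real center offsets.
    pred_angle, true_angle:   N predicted / real facing-direction angles (radians).
    obj_mask:                 N mask, 1 for target-object pixels, 0 for background.
    """
    per_pixel = (pred_offset - true_offset).pow(2).sum(dim=1)  # center-offset error per pixel
    obj = obj_mask.bool()
    balanced = alpha * per_pixel[obj].mean() + (1 - alpha) * per_pixel[~obj].mean()
    # cosine-margin term: the predicted cosine should exceed the real cosine minus the boundary r
    angle_term = torch.relu(torch.cos(true_angle) - r - torch.cos(pred_angle)).mean()
    return w_center * balanced + s * angle_term
```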
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (1)

1. A target identification and tracking method, characterized by comprising the following steps:
step S1, inputting the video image into a target identification module, extracting the pixel characteristics of all target detection frames in the video image by the target identification module, and determining a target object through the pixel characteristics;
the step of extracting the pixel characteristics of all target detection frames in the video image by the target identification module comprises the following steps:
the target identification module comprises a chroma extraction unit, a deep convolutional neural network and a feature fusion layer;
the chroma extraction unit extracts, for each pixel of a target detection frame in the current frame image, the four chroma values of the pixel, namely the red, green, blue and transparency chroma values of the j-th pixel in the i-th target detection frame;
the deep convolutional neural network extracts features from the four chroma values respectively to obtain the chroma features corresponding to the four chroma values;
the number of deep convolutional neural networks is four, and each one extracts features from one chroma value; each deep convolutional neural network comprises an average pooling layer, a two-dimensional deformable convolution layer, a first linear layer, a second linear layer, a first batch-normalization layer, a second batch-normalization layer, a first nonlinear activation layer, a second nonlinear activation layer and a fully connected layer;
any chroma value is input to the average pooling layer, which encodes an input feature; the input feature passes through the two-dimensional deformable convolution layer to obtain a position feature and a texture feature; the position feature is passed through the first linear layer, the first batch-normalization layer and the first nonlinear activation layer in sequence to obtain the weight vector of the position feature; the texture feature is passed through the second linear layer, the second batch-normalization layer and the second nonlinear activation layer in sequence to obtain the weight vector of the texture feature; and finally, the weight vector of the position feature and the weight vector of the texture feature are weighted and combined by the fully connected layer to obtain the chroma feature corresponding to the chroma value;
the feature fusion layer fuses the four chroma features through softmax to obtain the pixel feature of the pixel: the red, green, blue and transparency chroma features of the j-th pixel in the i-th target detection frame are fused with their respective weights, giving the pixel feature O_{i,j} of the j-th pixel in the i-th target detection frame;
step S2, the target tracking module locates the target object, determines the facing direction of the target object in the two-dimensional image frame, and the facing direction is represented by an angle range of 360 degrees;
step S3, the target tracking module predicts the moving track of the target object in the next frame image according to the angle faced by the target object;
the target tracking module predicts the moving track of the target object in the next frame of image according to the facing angle of the target object, and the method comprises the following steps:
by a loss function L center And regressing the position of the central point of the target detection frame to constrain the distance between the predicted target detection frame and the real target detection frame in the next frame image:
Figure 730195DEST_PATH_IMAGE013
wherein x is i Indicating the pixels belonging to the target object in the input i-th target detection box, y i Pixels representing that the input ith target detection frame belongs to the background;
Figure 970683DEST_PATH_IMAGE014
a set of pixels representing the target object is shown,
Figure 921321DEST_PATH_IMAGE015
representing a set of background pixels;
Figure 803827DEST_PATH_IMAGE016
which is indicative of a parameter of the balance,
Figure 625152DEST_PATH_IMAGE018
Figure 567701DEST_PATH_IMAGE020
Figure 740056DEST_PATH_IMAGE021
representing the center offset of the target detection frame;
Figure 144361DEST_PATH_IMAGE022
indicating the angle of direction the predicted target object is facing,
Figure 351352DEST_PATH_IMAGE023
Figure 933643DEST_PATH_IMAGE024
representing the angle of the direction that the real target object is facing,
Figure 593294DEST_PATH_IMAGE025
) (ii) a r represents a cosine boundary;
Figure 552023DEST_PATH_IMAGE026
a cosine function representing a direction angle in which the prediction target object faces,
Figure DEST_PATH_IMAGE027
a cosine function representing a direction angle that a real target object faces;
Figure DEST_PATH_IMAGE028
represents a center weight;
Figure DEST_PATH_IMAGE029
the scale parameter is indicated.
CN202210858864.1A 2022-07-21 2022-07-21 Target identification tracking method Active CN114937231B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210858864.1A CN114937231B (en) 2022-07-21 2022-07-21 Target identification tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210858864.1A CN114937231B (en) 2022-07-21 2022-07-21 Target identification tracking method

Publications (2)

Publication Number Publication Date
CN114937231A CN114937231A (en) 2022-08-23
CN114937231B true CN114937231B (en) 2022-09-30

Family

ID=82868489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210858864.1A Active CN114937231B (en) 2022-07-21 2022-07-21 Target identification tracking method

Country Status (1)

Country Link
CN (1) CN114937231B (en)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7747040B2 (en) * 2005-04-16 2010-06-29 Microsoft Corporation Machine vision system and method for estimating and tracking facial pose
WO2019140609A1 (en) * 2018-01-18 2019-07-25 深圳市道通智能航空技术有限公司 Target detection method and unmanned aerial vehicle
CN110634155A (en) * 2018-06-21 2019-12-31 北京京东尚科信息技术有限公司 Target detection method and device based on deep learning
CN109978045A (en) * 2019-03-20 2019-07-05 深圳市道通智能航空技术有限公司 A kind of method for tracking target, device and unmanned plane
CN110675428B (en) * 2019-09-06 2023-02-28 鹏城实验室 Target tracking method and device for human-computer interaction and computer equipment
CN110991397B (en) * 2019-12-17 2023-08-04 深圳市捷顺科技实业股份有限公司 Travel direction determining method and related equipment
CN111915649A (en) * 2020-07-27 2020-11-10 北京科技大学 Strip steel moving target tracking method under shielding condition
CN112101150B (en) * 2020-09-01 2022-08-12 北京航空航天大学 Multi-feature fusion pedestrian re-identification method based on orientation constraint
CN112989962B (en) * 2021-02-24 2024-01-05 上海商汤智能科技有限公司 Track generation method, track generation device, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414432A (en) * 2019-07-29 2019-11-05 腾讯科技(深圳)有限公司 Training method, object identifying method and the corresponding device of Object identifying model
CN113538585A (en) * 2021-09-17 2021-10-22 深圳火眼智能有限公司 High-precision multi-target intelligent identification, positioning and tracking method and system based on unmanned aerial vehicle

Also Published As

Publication number Publication date
CN114937231A (en) 2022-08-23

Similar Documents

Publication Publication Date Title
CN109559310B (en) Power transmission and transformation inspection image quality evaluation method and system based on significance detection
CN109934177A (en) Pedestrian recognition methods, system and computer readable storage medium again
CN110650427B (en) Indoor positioning method and system based on fusion of camera image and UWB
CN108694709B (en) Image fusion method and device
KR20160143494A (en) Saliency information acquisition apparatus and saliency information acquisition method
CN108694741A (en) A kind of three-dimensional rebuilding method and device
CN112102409A (en) Target detection method, device, equipment and storage medium
CN113222973B (en) Image processing method and device, processor, electronic equipment and storage medium
CN111491149B (en) Real-time image matting method, device, equipment and storage medium based on high-definition video
CN113052876A (en) Video relay tracking method and system based on deep learning
Bi et al. Haze removal for a single remote sensing image using low-rank and sparse prior
CN111932601A (en) Dense depth reconstruction method based on YCbCr color space light field data
CN113486697A (en) Forest smoke and fire monitoring method based on space-based multi-modal image fusion
CN109903265A (en) A kind of image change area detecting threshold value setting method, system and its electronic device
CN114937231B (en) Target identification tracking method
CN113506275B (en) Urban image processing method based on panorama
CN113298177B (en) Night image coloring method, device, medium and equipment
Du et al. Recognition of mobile robot navigation path based on K-means algorithm
CN111860378A (en) Market fire-fighting equipment inspection method based on gun-ball linkage and video event perception
CN116614715A (en) Near infrared image colorization processing method for scene monitoring
CN115100240A (en) Method and device for tracking object in video, electronic equipment and storage medium
CN112561001A (en) Video target detection method based on space-time feature deformable convolution fusion
CN116711295A (en) Image processing method and apparatus
CN113902733A (en) Spacer defect detection method based on key point detection
CN111325209B (en) License plate recognition method and system

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
PE01: Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Method for Target Recognition and Tracking

Granted publication date: 20220930

Pledgee: Bank of China Limited Chengdu pilot Free Trade Zone Branch

Pledgor: CHENGDU XIWU SECURITY SYSTEM ALLIANCE CO.,LTD.

Registration number: Y2024980020664