CN115908500A - High-performance video tracking method and system based on 3D twin convolutional network - Google Patents

High-performance video tracking method and system based on 3D twin convolutional network

Info

Publication number
CN115908500A
Authority
CN
China
Prior art keywords
search
template
target
sequence
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211720733.3A
Other languages
Chinese (zh)
Inventor
梁敏
桂彦
欧懿汝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202211720733.3A priority Critical patent/CN115908500A/en
Publication of CN115908500A publication Critical patent/CN115908500A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a high-performance video tracking method based on a 3D twin convolutional network. First, a spatio-temporal feature extractor based on a 3D twin network is designed to extract the spatio-temporal features of a template sequence and a search sequence. Second, a multi-template matching module is designed, which enhances the target features in the search block by transmitting template features to the search features; the module comprises a template feature conversion sub-module and a spatio-temporal feature matching sub-module. The template feature conversion sub-module transmits the appearance and motion information in the template frames to the search branch; the spatio-temporal feature matching sub-module consists of two depth correlation branches used for classification and regression, respectively. Then, a target prediction module, comprising classification, quality evaluation, and regression heads, is employed to accurately predict the target location and bounding box of each video frame in the search sequence. Finally, the video tracking model is optimized by minimizing the defined joint loss, and target tracking is predicted sequence by sequence.

Description

High-performance video tracking method and system based on 3D twin convolutional network
Technical Field
The invention relates to the field of computer vision, in particular to a high-performance video tracking method and system based on a 3D twin convolutional network.
Background
Video target tracking refers to the technology of using the context information of a video or image sequence to model the appearance and motion information of a target, so as to predict the target's motion state and mark its position. Generally, given a target designated in the first frame of a video, the specific target is continuously tracked in subsequent video frames, realizing target localization and scale estimation. Video target tracking has wide application value and can be used in fields such as video surveillance, autonomous driving, and precision guidance.
With the rapid development of deep learning and convolutional networks, more and more convolutional-network-based video target trackers have appeared. Researchers favor twin-network-based trackers, which not only retain an advantage in tracking speed but also achieve good accuracy. Such twin-network-based trackers view visual tracking as a similarity matching problem. In 2016, Bertinetto et al. first proposed the SiamFC tracker for visual tracking (Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H.S. Torr: Fully-Convolutional Siamese Networks for Object Tracking. ECCV Workshops (2) 2016: 850-865.), which extracts features with a twin network and computes the cross-correlation between the target template and the search region using correlation filtering.
In recent years, some twin-network-based approaches have attempted to update the template online to account for changes in target appearance during motion. Guo et al. proposed a dynamic twin network (Qing Guo, Wei Feng, Ce Zhou, Rui Huang, Liang Wan, Song Wang: Learning Dynamic Siamese Network for Visual Object Tracking. ICCV 2017.). However, such methods use historical predictions for model updates and do not explicitly model the relationship between time and space. In addition, some methods introduce attention mechanisms and Transformers into the twin network to improve the discrimination of targets in complex scenes. Chen et al. designed a feature fusion network using a Transformer (Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, Huchuan Lu: Transformer Tracking. CVPR 2021.). Du et al. designed a correlation-guided attention module (Fei Du, Peng Liu, Wei Zhao, Xianglong Tang: Correlation-Guided Attention for Corner Detection Based Visual Tracking. CVPR 2020.). Trackers with attention mechanisms improve tracking precision, but they track using only spatial information and ignore the rich context information contained in the video. In contrast, visual tracking based on a 3D twin network can effectively exploit the spatio-temporal information across frames, learn spatio-temporal appearance features for target appearance modeling, and obtain more discriminative search features, thereby achieving more accurate tracking and localization.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a high-performance video tracking method and system based on a 3D twin convolutional network, which receive a template sequence and a search sequence as input and extract spatio-temporally correlated feature information through the 3D twin network; a multi-template matching module is developed, in which an attention mechanism transmits template features to the search features to strengthen the target features in the search block and obtain more discriminative search features; in addition, a quality evaluation branch is introduced to avoid the ambiguity of localization by the classification branch alone and further improve tracking accuracy. The tracker processes video target tracking sequence by sequence, defines a multi-template matching mechanism, and enhances the search features by aggregating the spatio-temporal features of the target across the frames of a sequence, achieving a balance between speed and accuracy and greatly improving the performance of the video target tracker.
In order to achieve the above object, the present invention provides a high-performance video tracking method based on a 3D twin convolutional network, comprising the following steps:
s1, constructing a network architecture, wherein the network consists of a space-time feature extractor, a multi-template matching module and a target prediction module;
s2, respectively giving template sequence video frames and search sequence video frames, and cutting the template sequence video frames and the search sequence video frames into template sequence blocks and search sequence blocks which are used as input of the whole network architecture;
S3, constructing a spatio-temporal feature extractor, wherein the module is a 3D twin fully convolutional network comprising a template branch and a search branch that use the 3D fully convolutional network as the base network and share weights; taking the template sequence block and the search sequence block as input, the spatio-temporal feature extractor extracts template spatio-temporal features and search spatio-temporal features from them;
s4, constructing a multi-template matching module which comprises a template feature conversion submodule and a space-time feature matching submodule, and transmitting appearance and motion information in a template frame to a search branch by using the template feature conversion submodule to obtain a search feature with more discriminative power; the space-time feature matching sub-module consists of two depth related branches which are respectively used for classification and regression, and the multi-channel related filtering features are obtained by taking the template space-time features and the enhanced search features as input;
S5, constructing a target prediction module which mainly comprises a classification head, a quality evaluation head and a regression head; taking the multi-channel correlation filtering features output by the classification branch as the input of the classification head and the quality evaluation head to obtain a classification score map and a quality evaluation score map; taking the multi-channel correlation filtering features output by the regression branch as the input of the regression head to obtain a regression map;
s6, positioning the position of each video frame target in the sequence by utilizing the classification score map and the quality evaluation score map; estimating the target scale of each video frame in the sequence according to the regression graph; finally, obtaining a target prediction frame of each video frame in the search sequence;
s7, optimizing a network model through the minimized joint loss, wherein the network model comprises cross entropy loss of classification and quality evaluation and cross-over ratio loss of regression, and finally obtaining a high-performance video target tracker model;
S8, performing target tracking on the given video sequence by taking the trained network model as the visual tracker; in order to ensure stable and accurate tracking, a confidence search region estimation strategy is defined, and the search region of the next sequence is cropped according to the target states in the current video sequence, reducing error accumulation and accurately locating the target in each video frame of the search sequence.
The invention further provides an end-to-end trainable neural network architecture and system for video object tracking, which comprises: a video sequence input unit for cropping the template sequence blocks and search sequence blocks; a model training unit for training the high-performance video target tracker by minimizing the joint loss, including the cross-entropy losses of the classification and quality evaluation heads and the intersection-over-union loss of the regression head, finally realizing target tracking of video sequences; and a video target tracking unit for estimating the target state and predicting the scale in the video frames of the search sequence using the classification map, the quality evaluation score map, and the regression map output by the model, computing the target prediction boxes in the search sequence, computing the confidence search area of the next group of video sequences from the target prediction boxes of the current video sequence, and inputting it into the search branch to track the target in subsequent video sequences.
Compared with the prior art, the method has the following beneficial effects:
the invention utilizes the 3D twin full convolution network to extract the template space-time characteristics and search the space-time characteristics, and learns the abundant space-time information between a plurality of continuous video frames. And inputting the extracted template space-time characteristics and the search space-time characteristics into a multi-template matching sub-network to obtain multi-channel related filtering characteristics. Respectively processing the multi-channel related filtering characteristics of the classification branches by using the classification head and the quality evaluation head, and predicting the positioning of the target; and processing the multichannel correlation filtering characteristics of the regression branches by using the regression head to estimate the target scale. In the target tracking stage, in order to obtain a more accurate search sequence region, a confidence search region estimation strategy is defined, and a next search sequence region is estimated according to different states of a target in a current video sequence, so that the stability and the accuracy of target tracking are ensured. The method processes video target tracking in a video sequence mode, ensures the speed of video tracking, and can capture the space-time relevance between video frames; meanwhile, a multi-template matching mechanism is introduced, so that more discriminative search features are obtained, quality evaluation branches are added in classification, and the accuracy of video target tracking is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is an overall structure diagram of the network in the invention patent.
Fig. 2 is a schematic diagram of template sequence blocks and search sequence blocks in the patent of the present invention.
FIG. 3 is a schematic diagram of a spatiotemporal feature extractor according to the present invention.
FIG. 4 is a schematic diagram of a multi-template matching module according to the present invention.
Fig. 5 is a diagram of confidence search area estimation in the present patent.
Fig. 6 is a schematic diagram of a part of video frames in the patent of the present invention.
Fig. 7 is a schematic diagram of a video target tracking result in the patent of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The invention is described in detail below with reference to the drawings and the detailed description.
The invention is described in detail below with reference to the drawings of the specification and specific embodiments, and a video target tracking method with deep space-time correlation comprises steps S1 to S8;
s1, constructing a network architecture, wherein the network consists of a space-time feature extractor, a multi-template matching module and a target prediction module;
s2, respectively giving template sequence video frames and search sequence video frames, and cutting the template sequence video frames and the search sequence video frames into template sequence blocks and search sequence blocks which are used as input of the whole network architecture;
S3, constructing a spatio-temporal feature extractor, wherein the module is a 3D twin fully convolutional network comprising a template branch and a search branch that use the 3D fully convolutional network as the base network and share weights; taking the template sequence block and the search sequence block as input, the spatio-temporal feature extractor extracts template spatio-temporal features and search spatio-temporal features from them;
s4, constructing a multi-template matching module which comprises a template feature conversion submodule and a space-time feature matching submodule, and transmitting appearance and motion information in a template frame to a search branch by using the template feature conversion submodule to obtain search features with higher discriminative power; the space-time feature matching sub-module consists of two depth related branches which are respectively used for classification and regression, and the multi-channel related filtering features are obtained by taking the template space-time features and the enhanced search features as input;
S5, constructing a target prediction module which mainly comprises a classification head, a quality evaluation head and a regression head; taking the multi-channel correlation filtering features output by the classification branch as the input of the classification head and the quality evaluation head to obtain a classification score map and a quality evaluation score map; taking the multi-channel correlation filtering features output by the regression branch as the input of the regression head to obtain a regression map;
s6, positioning the position of each video frame target in the sequence by utilizing the classification score map and the quality evaluation score map; estimating the target scale of each video frame in the sequence according to the regression graph; finally, a target prediction frame of each video frame in the search sequence is obtained;
s7, optimizing a network model through the minimized joint loss, wherein the network model comprises cross entropy loss of classification and quality evaluation and cross-over ratio loss of regression, and finally obtaining a high-performance video target tracker model;
S8, performing target tracking of the given video sequence by taking the trained network model as the visual tracker; in order to ensure stable and accurate tracking, a confidence search region estimation strategy is defined, and the search region of the next sequence is cropped according to the target states in the current video sequence, reducing error accumulation and accurately locating the target in each video frame of the search sequence.
In step S1, a network architecture is constructed, as shown in fig. 1, the network is composed of a spatio-temporal feature extractor, a multi-template matching module, and a target prediction module. The method comprises the following steps:
s11, constructing a space-time feature extractor based on a 3D twin network, wherein the space-time feature extractor comprises a template branch and a search branch, takes a 3D full convolution neural network as a basic network and shares weight, and is used for extracting template space-time features and searching space-time features from an input video sequence block.
S12, constructing a multi-template matching module which comprises a template feature conversion submodule and a space-time feature matching submodule, and transmitting appearance and motion information in a template frame to a search branch by using the template feature conversion submodule to obtain search features with higher discriminative power; the space-time feature matching sub-module consists of two depth related branches which are respectively used for classification and regression, and the multi-channel related filtering features are obtained by taking the template space-time features and the enhanced search features as input.
And S13, the target prediction module comprises a classification head, a quality evaluation head and a regression head, and a classification score map, a quality evaluation score map and a regression map are respectively obtained through the classification head, the quality evaluation head and the regression head by taking the multi-channel related filtering characteristics as input.
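For illustration only, the three modules of step S1 can be wired together as one end-to-end network. The following is a minimal PyTorch sketch, assuming the sub-modules expose the interfaces described in S11–S13 (the extractor returns the two spatio-temporal features, the matcher returns the classification and regression correlation features, the predictor returns the three maps); the class and argument names are ours, not the patent's.

```python
import torch.nn as nn

class Siam3DTracker(nn.Module):
    """Minimal skeleton of the S1 architecture: spatio-temporal feature extractor,
    multi-template matching module, and target prediction module."""
    def __init__(self, extractor, matcher, predictor):
        super().__init__()
        self.extractor = extractor   # 3D twin fully convolutional network (S11)
        self.matcher = matcher       # multi-template matching module (S12)
        self.predictor = predictor   # classification / quality / regression heads (S13)

    def forward(self, template_block, search_block):
        f_z, f_x = self.extractor(template_block, search_block)   # template / search spatio-temporal features
        f_cls, f_reg = self.matcher(f_z, f_x)                     # multi-channel correlation filtering features
        return self.predictor(f_cls, f_reg)                       # classification, quality, regression maps
```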
In step 2, a template sequence video frame and a search sequence video frame are respectively given and cut into a template sequence block and a search sequence block, as shown in fig. 2, and are used as input of the whole network architecture. The method comprises the following steps:
and S21, giving a template sequence, acquiring the center position, the width and the height information of the target according to the real value information of the target in each video frame in the template sequence, and expressing as (x, y, w, h).
S211, calculating the extension value p = (w + h)/2 of the width and height of the target frame according to each piece of real target frame information given in S21, and calculating a scaling factor for scaling the expanded target frame area. If the target frame area with the added extension exceeds the boundary of the video frame, the exceeding part is filled with the average RGB value of the current video frame. Finally, each video frame in the template sequence is cropped to a template block of size 127 × 127.
S212, each video frame in the template sequence is cropped in this way to obtain the template block sequence of k frames, where k represents the total number of video frames in the template sequence.
S22, giving a search sequence, acquiring the center position, the width and the height information of the target according to the real value information of the first frame video frame target in the template sequence, and expressing as (X, Y, W, H).
S221, calculating the extension value P = (W + H)/2 of the width and height of the target frame according to the real target frame information given in S22, and calculating a scaling factor for scaling the expanded target frame area. If the target frame area with the added extension exceeds the boundary of the video frame, the exceeding part is filled with the average RGB value of the current video frame; finally, each video frame in the search sequence is cropped into a search block of size 255 × 255.
S222, each video frame in the search sequence is cropped in this way to obtain the search block sequence of k frames, where k represents the total number of video frames in the search sequence.
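A minimal sketch of the cropping in S21–S222 follows. It assumes the common twin-tracker convention that the padded target box (with extension p) defines a square crop region that is resized to the output patch; the exact scaling-factor formula appears only as an image in the original, so that step, the function name, and the parameters are assumptions.

```python
import numpy as np
import cv2

def crop_block(frame, center_xy, size_wh, out_size):
    """Sketch of S211/S221: crop a target-centered block (out_size=127 for the
    template, 255 for the search), filling with the frame's mean RGB when the
    crop crosses the image boundary, then resize to out_size x out_size."""
    (cx, cy), (w, h) = center_xy, size_wh
    p = (w + h) / 2.0                                   # extension value p = (w + h) / 2
    side = np.sqrt((w + p) * (h + p))                   # side of the padded crop region (assumed)
    half = side / 2.0
    x1, y1, x2, y2 = cx - half, cy - half, cx + half, cy + half
    pad = int(max(0, -x1, -y1, x2 - frame.shape[1], y2 - frame.shape[0])) + 1
    avg = frame.mean(axis=(0, 1)).tolist()              # average RGB value of the current frame
    padded = cv2.copyMakeBorder(frame, pad, pad, pad, pad,
                                cv2.BORDER_CONSTANT, value=avg)
    patch = padded[int(y1) + pad:int(y2) + pad, int(x1) + pad:int(x2) + pad]
    # resizing to out_size applies the scaling factor (its exact formula is an image in the source)
    return cv2.resize(patch, (out_size, out_size))
```

Applying this per frame with out_size=127 gives the template blocks of S212, and with out_size=255 the search blocks of S222.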
In step S3, the spatio-temporal feature extractor is a 3D twin full convolution network, which includes a template branch and a search branch, and uses the 3D full convolution network as a basic network and shares weights. The template sequence block and the search sequence block are used as input, and a space-time feature extractor extracts template space-time features and search space-time features from the input. The method comprises the following steps:
and S31, constructing a feature extraction network, wherein each branch is a Res2+1D network consisting of five residual blocks as shown in FIG. 3.
S32, modifying the padding attribute in the first residual block of Res2+1D to 1 × 4, modifying the output channels of the fourth residual block and the input channels of the fifth residual block to 256, and removing the down-sampling and the final classification layer of the fifth residual block. Thus, the output spatio-temporal features have the same temporal length as the input video sequence.
S33, inputting the template blocks and search blocks obtained in steps S212 and S222 into the spatio-temporal feature extractor to obtain the template spatio-temporal features F^Z and the search spatio-temporal features F^X, respectively.
In step 4, a multi-template matching module is constructed, wherein the multi-template matching module comprises a template feature conversion submodule and a space-time feature matching submodule, and appearance and motion information in a template frame is transmitted to a search branch by using the template feature conversion submodule to obtain search features with higher discriminative power; the space-time feature matching sub-module consists of two depth related branches which are respectively used for classification and regression, and the multi-channel related filtering features are obtained by taking the template space-time features and the enhanced search features as input. The method comprises the following steps:
s41, as shown in the figure 4, the template feature conversion sub-module establishes a connection between the template features and the search features by means of an interactive attention mechanism; firstly, the template feature F Z Is converted into dimensions of
Figure BDA0004028416070000073
Space-time matrix of (1), search feature conversion F X Convert to dimension->
Figure BDA0004028416070000074
The space-time matrix of (1), wherein N Z =k×h×w,N X =k×H×W。
S411, calculating the cross-attention matrix A^{Z→X} ∈ R^{N_Z × N_X} by an attention mechanism, specifically A^{Z→X} = softmax(φ(F^Z)^T φ(F^X)), where φ(·) is a 1 × 1 linear transformation operation and softmax denotes the normalization operation.
S412, given the template spatio-temporal features F^Z, a Gaussian mask m_i(y) = exp(−||y − c||²/2σ²) is calculated for each feature map F^Z_i, where y represents the position of a pixel point in each feature map and c represents the real center position of the target in each frame; this yields the Gaussian mask set M = {m_1, …, m_k}, which is reshaped into M′ aligned with the N_Z template positions.
S413, using the cross-attention matrix A^{Z→X} with M′ as the attention weight, the transferred mask is calculated and multiplied element by element with the search feature F^X to obtain the masked search feature, where ⊙ denotes element-by-element multiplication and Ins.Norm denotes the instance normalization operation, which enables the potential location of the target in the search area to be found more accurately.
And S414, simultaneously, encoding the context information in the template sequence and transmitting the context information to the search feature. Calculating mask template characteristics according to Gaussian mask set M
to highlight the target position in the search area and weaken background interference; the masked template feature is then transferred to the search branch to obtain the transferred masked template feature.
S415, the transferred masked template feature and the search feature F^X are then added to obtain the transferred search feature.
s416, searching for features by using the mask
Figure BDA00040284160700000814
And the search feature passed pick>
Figure BDA00040284160700000815
Element by element addition, an enhanced search feature is calculated>
Figure BDA00040284160700000816
And converts it into the original characteristic dimension->
Figure BDA00040284160700000817
S42, the template features F^Z obtained in S3 are input into the classification and regression branches of the spatio-temporal feature matching sub-module to obtain the classification template features and the regression template features; likewise, the enhanced search feature from S41 is input into the classification and regression branches to obtain the classification search feature and the regression search feature, as shown in Fig. 4.
S421, in order to make effective use of the template spatio-temporal features, the classification and regression template spatio-temporal features are averaged along the time dimension; the features obtained by this averaging are replicated k times to finally obtain the optimal classification template feature and the optimal regression template spatio-temporal feature, which serve as the inputs of the classification branch and the regression branch, respectively.
S422, the depth-wise correlation filtering operation is performed between the optimal template features and the enhanced search features in the classification and regression branches, respectively, matching them in a high-dimensional feature space.
S43, the classification branch and the regression branch output the multi-channel correlation filtering features F_cls and F_reg, respectively.
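The interaction in S41–S416 can be sketched as below, assuming the flattened C × N_Z / C × N_X layout and a cross-attention matrix A^{Z→X} of size N_Z × N_X as described above. Since the patent's exact formulas exist only as images, the attention, mask-transfer, and normalization details here are our assumptions, as are all names.

```python
import torch
import torch.nn as nn

class TemplateFeatureTransfer(nn.Module):
    """Sketch of the template feature conversion sub-module (S41-S416)."""
    def __init__(self, channels):
        super().__init__()
        self.phi_z = nn.Conv3d(channels, channels, kernel_size=1)   # 1x1 linear transformation φ
        self.phi_x = nn.Conv3d(channels, channels, kernel_size=1)
        self.norm = nn.InstanceNorm1d(1)                            # Ins.Norm of S413

    def forward(self, f_z, f_x, gauss_masks):
        # f_z: (B, C, k, h, w), f_x: (B, C, k, H, W), gauss_masks: (B, k, h, w)
        B, C = f_x.shape[:2]
        z = self.phi_z(f_z).flatten(2)                       # (B, C, N_Z)
        x = self.phi_x(f_x).flatten(2)                       # (B, C, N_X)
        attn = torch.softmax(z.transpose(1, 2) @ x, dim=-1)  # A^{Z->X}: (B, N_Z, N_X)
        m = gauss_masks.flatten(1).unsqueeze(1)              # M': (B, 1, N_Z)
        m_x = self.norm(m @ attn)                            # transferred mask over search positions (S413)
        x_flat = f_x.flatten(2)
        masked_search = x_flat * m_x                         # masked search feature (S413)
        masked_tmpl = f_z.flatten(2) * m                     # masked template feature (S414)
        transferred = masked_tmpl @ attn + x_flat            # transferred search feature (S415)
        enhanced = masked_search + transferred               # enhanced search feature (S416)
        return enhanced.reshape(B, C, *f_x.shape[2:])        # back to the original dimension
```

The enhanced search feature and the time-averaged, k-times-replicated template features would then pass through the depth-wise correlation filtering of S421–S43 to produce F_cls and F_reg.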
in step 5, a target prediction module is constructed, which mainly consists of a classification head, a quality evaluation head and a regression head. Taking the multi-channel related filtering features output by the classification branches as the input of a classification head and a quality evaluation head to obtain a classification score map and a quality evaluation score map; and (4) taking the multichannel correlation filtering characteristics output by the regression branches as the input of the regression head to obtain a regression graph. The method comprises the following steps:
s51, the classification head is composed of a 1 × 1 convolution layer, and the classification branch in S42 is outputMultichannel correlation filter characteristic F cls As input to the classification header, a classification score map is output:
Figure BDA0004028416070000099
s52, the quality evaluation head consists of a 1 multiplied by 1 convolution layer, and the multi-channel related filter characteristics F output by the classification branch in S42 cls As inputs to the quality assessment head, a quality assessment score map is output:
Figure BDA00040284160700000910
s53, the regression head is composed of a 1 multiplied by 1 convolution layer, and the multi-channel related filter characteristic F output by the regression branch in S42 reg As an input to the regression head, a regression graph is output:
Figure BDA00040284160700000911
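A minimal sketch of the three heads in S51–S53 follows; the input channel count and the per-frame 2D layout are assumptions, since the patent states only that each head is a 1 × 1 convolution and what it outputs.

```python
import torch.nn as nn

class TargetPredictionHeads(nn.Module):
    """Classification, quality evaluation and regression heads, each a 1x1 convolution."""
    def __init__(self, channels=256):
        super().__init__()
        self.cls_head = nn.Conv2d(channels, 1, kernel_size=1)      # classification score map
        self.quality_head = nn.Conv2d(channels, 1, kernel_size=1)  # quality evaluation score map
        self.reg_head = nn.Conv2d(channels, 4, kernel_size=1)      # regression map: (l, t, r, b) offsets

    def forward(self, f_cls, f_reg):
        # f_cls, f_reg: per-frame correlation features, e.g. shape (B*k, channels, 17, 17)
        return self.cls_head(f_cls), self.quality_head(f_cls), self.reg_head(f_reg)
```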
in step 6, the position of each video frame target in the sequence is positioned by utilizing the classification score map and the quality evaluation score map; estimating the target scale of each video frame in the sequence according to the regression graph; and finally obtaining a target prediction frame of each video frame in the search sequence. The method comprises the following steps:
s61, the size of the classification score map is as follows: k × 17 × 17, the size of the quality assessment score map is: k is multiplied by 17 and 17, the quality evaluation score map is multiplied by the corresponding position of the classification score map to obtain a confidence classification map with the size of k is multiplied by 17 and 17, and for the ith frame in the video sequence, the confidence classification score map finds the point with the maximum response value
Figure BDA00040284160700000912
Expressed in the original video frame as:
Figure BDA00040284160700000913
where s =8 is the total step size of the entire network.
S62, the regression map is a four-channel map of size k × 4 × 17 × 17; l_i, t_i, r_i, b_i denote the offsets of the regression target in the ith frame of the video sequence, from which the predicted target box coordinate information, namely the coordinates of the upper-left and lower-right corner points of the target prediction box B_i, is obtained.
In step 7, a high-performance video target tracker model is finally obtained by minimizing the joint loss optimization network model, including the cross entropy loss of classification and quality evaluation, and the cross-over ratio loss of regression. The method comprises the following steps:
s71, the total training loss is defined as:
Figure BDA0004028416070000102
wherein L is i Is the loss of the ith search frame. k is expressed as the total number of classification scores (quality assessment scores, regression scores).
Figure BDA0004028416070000109
Indicates the probability that the (x, y) position in the i-th search block belongs to the target. />
Figure BDA00040284160700001010
Indicating the probability that the (x, y) position in the ith search block is close to the center of the target. />
Figure BDA00040284160700001011
Indicates the offset of position (x, y) from the perimeter of the bounding box in the ith regression plot.
S72, the per-frame training loss L_i includes the cross-entropy losses of classification and quality evaluation and the intersection-over-union (IoU) loss of regression. In its definition, 1{·} is the indicator function of whether a position belongs to the target: it is assigned 1 if the position belongs to the target and 0 otherwise. L_cls denotes the cross-entropy loss of classification, L_quality denotes the cross-entropy loss of quality evaluation, and L_reg denotes the IoU loss of regression. If the current position (x_i, y_i) is a positive sample, i.e., it belongs to the target, its classification label is assigned 1; if it is a negative sample, the label is assigned 0. The quality evaluation target represents how close the position is to the center of the real target box in the ith search block: the closer the position is to the target center, the higher the value, and the lower the value as the position deviates. The regression target is the offset of the real target center position (x_i, y_i) in the ith search block from the perimeter of the bounding box. N_pos denotes the total number of positive samples, and λ is the weight of each loss term.
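The joint loss of S71–S72 can be sketched as below, assuming sigmoid-based cross-entropy for the classification and quality heads and a standard IoU loss on the (l, t, r, b) offsets at positive positions; the exact normalization by N_pos and the placement of the weight λ appear only as images in the original, so the combination here is an assumption.

```python
import torch
import torch.nn.functional as F

def joint_loss(cls_logits, quality_logits, reg_pred, cls_label, quality_label,
               reg_target, pos_mask, lam=1.0):
    """Cross-entropy (classification, quality) + IoU (regression) joint loss sketch.
    cls_logits, quality_logits, cls_label, quality_label, pos_mask: (k, 17, 17);
    reg_pred, reg_target: (k, 17, 17, 4); pos_mask marks the positive samples."""
    loss_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_label)
    loss_quality = F.binary_cross_entropy_with_logits(
        quality_logits[pos_mask], quality_label[pos_mask])
    # IoU loss between predicted and ground-truth (l, t, r, b) offsets at positive positions
    lp, tp, rp, bp = reg_pred[pos_mask].unbind(-1)
    lg, tg, rg, bg = reg_target[pos_mask].unbind(-1)
    inter = (torch.min(lp, lg) + torch.min(rp, rg)) * (torch.min(tp, tg) + torch.min(bp, bg))
    union = (lp + rp) * (tp + bp) + (lg + rg) * (tg + bg) - inter
    loss_reg = (1.0 - inter / union.clamp(min=1e-6)).mean()
    return loss_cls + lam * (loss_quality + loss_reg)
```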
In step 8, the trained network model is used as a visual tracker to track the target of a given video sequence by video sequence. In order to ensure stable and accurate tracking, a confidence search region estimation strategy is defined, a search region of a next sequence is cut according to different target states in a current video sequence, error accumulation is reduced, and a target of each video frame in the search sequence is accurately positioned. The method comprises the following steps:
s81, due to the fact that the target possibly has large position change in the video sequence, according to the prediction frame result { B ] of the current search sequence t-k ,..,B t-1 ,B t In which B is t Searching the target prediction frame of the t-th frame in the sequence according to the coordinates of the upper left corner point of each target frame
Figure BDA0004028416070000111
And the coordinates of the lower right corner &>
Figure BDA0004028416070000112
Calculated to give a minimum of Bao Weikuang b m As shown in fig. 4.
S82, the minimum bounding box b_m is expanded into the search area b_s used to crop the next group of video sequences, ensuring that the search area covers the target in each video frame of the search sequence. The video target tracking results are shown in Fig. 7.
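A minimal sketch of the confidence search region estimation of S81–S82: the minimum bounding box b_m over the current sequence's prediction boxes is expanded into the search region b_s for the next sequence. The expansion ratio is not given numerically in the text, so the `margin` parameter is an assumption.

```python
def confidence_search_region(pred_boxes, margin=0.5):
    """pred_boxes: list of (x1, y1, x2, y2) prediction boxes B_{t-k}..B_t of the current sequence."""
    x1 = min(b[0] for b in pred_boxes)
    y1 = min(b[1] for b in pred_boxes)
    x2 = max(b[2] for b in pred_boxes)
    y2 = max(b[3] for b in pred_boxes)          # minimum bounding box b_m (S81)
    w, h = x2 - x1, y2 - y1
    return (x1 - margin * w, y1 - margin * h,   # expanded search region b_s (S82)
            x2 + margin * w, y2 + margin * h)
```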
According to another aspect of the present application, there is also provided a depth space-time correlated video target tracking system, including the following units:
video sequence input unit: given a set of template sequence video frames and search sequence video frames, they are cut into template sequence blocks and search sequence blocks of a specified size in the form of S2.
The model training unit is used for training the video target tracker based on the 3D twin network and comprises the spatio-temporal feature extractor module, the multi-template matching module and the target prediction module. The spatio-temporal feature extractor takes the template sequence block and the search sequence block as input, extracts the template spatio-temporal features and the search spatio-temporal features from them, and inputs these features to the multi-template matching module. The multi-template matching module comprises a feature conversion sub-module and a spatio-temporal feature matching sub-module. In the feature conversion sub-module, the template spatio-temporal features are transmitted to the search features using an interactive attention mechanism to obtain more discriminative search features; the spatio-temporal feature matching sub-module computes the optimal template features, and the optimal template features and the enhanced search features are input into the classification branch and the regression branch, respectively; then, using the depth-wise correlation filtering operation, the optimal template features and the enhanced search features are matched for similarity in a high-dimensional feature space to obtain the multi-channel correlation filtering features. The output of the classification branch is input into the classification head and the quality evaluation head of the target prediction module to obtain the classification score map and the quality evaluation score map; the output of the regression branch is passed to the regression head to obtain the regression map. Target tracking is trained by minimizing the cross-entropy losses of classification and quality evaluation and the intersection-over-union loss of regression.
The video target tracking unit combines the classification score map and the quality evaluation score map output by the model to obtain a confidence classification map, estimates the position of the target by using the confidence classification map, predicts the scale of the target by using a regression map output by the model, and obtains a target prediction frame in the search sequence; and obtaining a group of confidence search areas by utilizing the group of target prediction boxes, and inputting the confidence search areas into a search branch to track the target of the subsequent video sequence.
The system is used for implementing the functions of the method in the above embodiments, and the specific implementation steps of the method related in the system module have already been described in the method, and are not described again here.
In the embodiment of the application, first, a spatio-temporal feature extractor based on a 3D twin network is designed to extract the spatio-temporal features of the template sequence and the search sequence, improving the discrimination ability of the tracker. Second, a multi-template matching module is designed, which enhances the target features in the search block by transmitting template features to the search features; the module comprises a template feature conversion sub-module and a spatio-temporal feature matching sub-module. The template feature conversion sub-module transmits the appearance and motion information in the template frames to the search branch to obtain more discriminative search features; the spatio-temporal feature matching sub-module consists of two depth correlation branches used for classification and regression respectively, realizing more efficient information association. Then, a target prediction module, including classification, quality evaluation, and regression branches, is employed to accurately predict the target location and bounding box of each video frame in the search sequence. Finally, the video tracking model is optimized by minimizing the defined joint loss, and target tracking is predicted sequence by sequence. In the target tracking test, a confidence region estimation strategy is defined, and the search region of the next video sequence is calculated from the target tracking result of the current video sequence, reducing error accumulation as much as possible, so that robust and accurate target tracking is maintained over the video sequence.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for high performance video tracking based on a 3D twin convolutional network, the method being performed by a computer and comprising the steps of:
s1, constructing a network architecture, wherein the network consists of a space-time feature extractor, a multi-template matching module and a target prediction module;
s2, respectively giving template sequence video frames and search sequence video frames, and cutting the template sequence video frames and the search sequence video frames into template sequence blocks and search sequence blocks which are used as input of the whole network architecture;
s3, constructing a space-time feature extractor, wherein the module is a 3D twin full convolution network and comprises a template branch and a search branch, the 3D full convolution network is used as a basic network, and weights are shared; taking the template sequence block and the search sequence block as input, and extracting template space-time characteristics and search space-time characteristics from the input by a space-time characteristic extractor;
s4, constructing a multi-template matching module which comprises a template feature conversion submodule and a space-time feature matching submodule, and transmitting appearance and motion information in a template frame to a search branch by using the template feature conversion submodule to obtain search features with higher discriminative power; the space-time feature matching sub-module consists of two depth related branches which are respectively used for classification and regression, and the multi-channel related filtering features are obtained by taking the template space-time features and the enhanced search features as input;
s5, constructing a target prediction module which mainly comprises a classification head, a quality evaluation head and a regression head; taking the multi-channel related filtering features output by the classification branches as the input of a classification head and a quality evaluation head to obtain a classification score map and a quality evaluation score map; taking the multi-channel related filtering features output by the regression branches as the input of a regression head to obtain a regression graph;
s6, positioning the position of each video frame target in the sequence by utilizing the classification score map and the quality evaluation score map; estimating the target scale of each video frame in the sequence according to the regression graph; finally, obtaining a target prediction frame of each video frame in the search sequence;
s7, optimizing a network model through the minimized joint loss, wherein the network model comprises cross entropy loss of classification and quality evaluation and intersection-to-parallel ratio loss of regression, and finally obtaining a high-performance video target tracker model;
s8, performing target tracking on a given video sequence by taking the trained network model as a visual tracker; in order to ensure stable and accurate tracking, a confidence search region estimation strategy is defined, a search region of a next sequence is cut according to different target states in a current video sequence, error accumulation is reduced, and a target of each video frame in the search sequence is accurately positioned.
2. The high-performance video tracking method based on the 3D twin convolutional network as claimed in claim 1, wherein the video target tracking network structure based on the 3D twin convolutional network is constructed as follows:
S11, constructing a 3D twin network-based spatio-temporal feature extractor which comprises a template branch and a search branch, wherein a 3D fully convolutional neural network is used as the base network with weight sharing, and template spatio-temporal features and search spatio-temporal features are extracted from the input video sequence blocks;
S12, constructing a multi-template matching module which comprises a template feature conversion sub-module and a spatio-temporal feature matching sub-module, and transmitting the appearance and motion information in the template frames to the search branch by using the template feature conversion sub-module to obtain more discriminative search features; the spatio-temporal feature matching sub-module consists of two depth correlation branches used for classification and regression respectively, and the multi-channel correlation filtering features are obtained by taking the template spatio-temporal features and the enhanced search features as input;
and S13, the target prediction module comprises a classification head, a quality evaluation head and a regression head, and a classification score map, a quality evaluation score map and a regression map are respectively obtained through the classification head, the quality evaluation head and the regression head by taking the multi-channel correlation filtering features as input.
3. The 3D twin convolutional network-based high-performance video tracking method of claim 1, wherein the template sequence block and the search sequence block are constructed as follows:
S21, giving a template sequence, acquiring the center position, width and height of the target according to the ground-truth information of the target in each video frame of the template sequence, expressed as (x, y, w, h);
S211, calculating the extension value p = (w + h)/2 of the width and height of the target frame according to each piece of real target frame information given in S21, and calculating a scaling factor for scaling the expanded target frame area; if the target frame area with the added extension exceeds the boundary of the video frame, the exceeding part is filled with the average RGB value of the current video frame; finally, each video frame in the template sequence is cropped into a template block of size 127 × 127;
S212, cropping each video frame in the template sequence in this way to obtain the template block sequence of k frames, wherein k represents the total number of video frames in the template sequence;
S22, giving a search sequence, acquiring the center position, width and height of the target according to the ground-truth information of the target in the first video frame of the template sequence, expressed as (X, Y, W, H);
S221, calculating the extension value P = (W + H)/2 of the width and height of the target frame according to the real target frame information given in S22, and calculating a scaling factor for scaling the expanded target frame area; if the target frame area with the added extension exceeds the boundary of the video frame, the exceeding part is filled with the average RGB value of the current video frame; finally, each video frame in the search sequence is cropped into a search block of size 255 × 255;
S222, cropping each video frame in the search sequence in this way to obtain the search block sequence of k frames, where k represents the total number of video frames in the search sequence.
4. The high-performance video tracking method based on the 3D twin convolutional network as claimed in claim 1, wherein the spatio-temporal feature extractor is constructed as follows:
S31, constructing a feature extraction network, wherein each branch is a Res2+1D network consisting of five residual blocks;
S32, modifying the padding attribute in the first residual block of Res2+1D to 1 × 4, modifying the output channels of the fourth residual block and the input channels of the fifth residual block to 256, and removing the down-sampling and the final classification layer of the fifth residual block, whereby the output spatio-temporal features and the input video sequence have the same temporal length;
S33, inputting the template blocks and search blocks obtained in steps S212 and S222 into the spatio-temporal feature extractor to obtain the template spatio-temporal features F^Z and the search spatio-temporal features F^X, respectively.
5. The 3D twin convolution network-based high-performance video tracking method as claimed in claim 1, wherein the multi-template matching module is constructed as follows:
S41, the template feature conversion sub-module establishes a connection between the template features and the search features by means of an interactive attention mechanism; first, the template feature F^Z is reshaped into a spatio-temporal matrix of dimension C × N_Z, and the search feature F^X is reshaped into a spatio-temporal matrix of dimension C × N_X, where C is the number of feature channels, N_Z = k × h × w, and N_X = k × H × W;
S411, calculating the cross-attention matrix A^{Z→X} ∈ R^{N_Z × N_X} by an attention mechanism, specifically A^{Z→X} = softmax(φ(F^Z)^T φ(F^X)), where φ(·) is a 1 × 1 linear transformation operation and softmax denotes the normalization operation;
S412, given the template spatio-temporal features F^Z, calculating for each feature map F^Z_i the Gaussian mask m_i(y) = exp(−||y − c||²/2σ²), wherein y represents the position of a pixel point in each feature map and c represents the real center position of the target in each frame, obtaining the Gaussian mask set M = {m_1, …, m_k} and reshaping it into M′ aligned with the N_Z template positions;
S413, using the cross-attention matrix A^{Z→X} with M′ as the attention weight, calculating the transferred mask and multiplying it element by element with the search feature F^X to obtain the masked search feature, wherein ⊙ denotes element-by-element multiplication and Ins.Norm denotes the instance normalization operation, which enables the potential location of the target in the search area to be found more accurately;
S414, at the same time, encoding the context information in the template sequence and transmitting it to the search feature; calculating the masked template feature from the Gaussian mask set M to highlight the target position in the search area and weaken background interference, and transferring it to the search branch to obtain the transferred masked template feature;
S415, further adding the transferred masked template feature and the search feature to obtain the transferred search feature;
S416, adding the masked search feature and the transferred search feature element by element to calculate the enhanced search feature, and converting it back into the original feature dimension;
S42, inputting the template features F^Z obtained in S3 into the classification and regression branches of the spatio-temporal feature matching sub-module to obtain the classification template features and the regression template features; likewise, inputting the enhanced search feature from S41 into the classification and regression branches to obtain the classification search feature and the regression search feature;
S421, in order to make effective use of the template spatio-temporal features, averaging the classification and regression template spatio-temporal features along the time dimension; the features obtained by this averaging are replicated k times to finally obtain the optimal classification template feature and the optimal regression template spatio-temporal feature, which serve as the inputs of the classification branch and the regression branch, respectively;
S422, performing the depth-wise correlation filtering operation between the optimal template features and the enhanced search features in the classification and regression branches, respectively;
S43, the classification branch and the regression branch output the multi-channel correlation filtering features F_cls and F_reg, respectively.
6. The method as claimed in claim 1, wherein the target prediction module for the video sequence is constructed as follows:
S51, the classification head consists of a 1 × 1 convolution layer; the multi-channel correlation filtering feature F_cls output by the classification branch in S42 is taken as the input of the classification head, and a classification score map is output;
S52, the quality evaluation head consists of a 1 × 1 convolution layer; the multi-channel correlation filtering feature F_cls output by the classification branch in S42 is taken as the input of the quality evaluation head, and a quality evaluation score map is output;
S53, the regression head consists of a 1 × 1 convolution layer; the multi-channel correlation filtering feature F_reg output by the regression branch in S42 is taken as the input of the regression head, and a regression map is output.
7. The method as claimed in claim 1, wherein the target position is predicted and the bounding box size is estimated, and the method is implemented as follows:
S61, the size of the classification score map is k × 17 × 17 and the size of the quality assessment score map is k × 17 × 17; the quality assessment score map and the classification score map are multiplied at corresponding positions to obtain a confidence classification map of size k × 17 × 17; for the ith frame in the search sequence, the point (p_i, q_i) with the maximum response value is found in the confidence classification map and is mapped back to the corresponding position (x_i, y_i) in the original video frame according to the total network stride, where s = 8 is the total stride of the entire network;
S62, the regression map is a four-channel map of size k × 4 × 17 × 17; l_i, t_i, r_i and b_i denote the offsets of the regression target from the predicted position to the four sides of the bounding box in the ith frame of the search sequence, so the predicted target box coordinates can be expressed as:
(x1_i, y1_i) = (x_i - l_i, y_i - t_i), (x2_i, y2_i) = (x_i + r_i, y_i + b_i)
wherein (x1_i, y1_i) and (x2_i, y2_i) denote the coordinates of the upper-left corner point and the lower-right corner point of the target prediction box B_i.
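The decoding in S61-S62 can be sketched as follows for a single search frame; the exact mapping from score-map cells back to image pixels (for example, any centre offset of the search region) is an assumption, as the text only specifies the stride s = 8:

```python
import numpy as np

def decode_frame(cls_map, quality_map, reg_map, stride: int = 8):
    # cls_map, quality_map: (17, 17) score maps for one search frame
    # reg_map: (4, 17, 17) offsets (l, t, r, b) for the same frame
    confidence = cls_map * quality_map                    # confidence classification map
    row, col = np.unravel_index(np.argmax(confidence), confidence.shape)
    x, y = col * stride, row * stride                     # position in the original frame
    l, t, r, b = reg_map[:, row, col]
    return (x - l, y - t), (x + r, y + b)                 # top-left and bottom-right of B_i
```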
8. The method for tracking high-performance video based on 3D twin convolutional network as claimed in claim 1, wherein the visual tracking model is trained by the following specific implementation process:
S71, the total training loss is defined over the k frames of the search sequence as the mean of the per-frame losses:
L = (1/k) Σ_{i=1..k} L_i
wherein L_i is the loss of the ith search frame, and k is the total number of classification score maps (equally, of quality assessment score maps and regression maps); p_i(x, y) denotes the probability that position (x, y) in the ith search block belongs to the target; q_i(x, y) denotes the probability that position (x, y) in the ith search block is close to the target center; t_i(x, y) denotes the offsets from position (x, y) to the four sides of the bounding box in the ith regression map;
S72, the training loss L_i consists of the cross-entropy losses of classification and quality assessment and the intersection-over-union (IoU) loss of regression;
wherein 1{·} is the indicator function indicating whether a position belongs to the target, taking the value 1 if it does and 0 otherwise; L_cls denotes the cross-entropy loss of classification; L_quality denotes the cross-entropy loss of quality assessment; L_reg denotes the intersection-over-union loss of regression; if the current position (x_i, y_i) is a positive sample, i.e., the position belongs to the target, the corresponding classification label is assigned the value 1, and if it is a negative sample, it is assigned the value 0; the quality assessment label is higher the closer the position is to the target center and lower the further the position deviates from it; the regression label denotes the offsets from the real target center position (x_i, y_i) in the ith search block to the four sides of the bounding box; N_pos denotes the total number of positive samples, and λ is the weight of each loss term.
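Since the original expression for L_i is not reproduced in the text, the following is one plausible written-out form in the style of anchor-free Siamese trackers; the starred label notation, the split of λ into separate weights, and the restriction of the quality and regression terms to positive samples are assumptions:

```latex
L_i = \frac{1}{N_{pos}} \sum_{x,y} \Big[
        L_{cls}\big(p_i(x,y),\, p_i^{*}(x,y)\big)
      + \lambda_1\, \mathbb{1}\{p_i^{*}(x,y)=1\}\, L_{quality}\big(q_i(x,y),\, q_i^{*}(x,y)\big)
      + \lambda_2\, \mathbb{1}\{p_i^{*}(x,y)=1\}\, L_{reg}\big(t_i(x,y),\, t_i^{*}(x,y)\big)
      \Big]
```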
9. The method as claimed in claim 1, wherein the confidence search area is estimated by the following steps:
s81, due to the fact that the target possibly has large position change in the video sequence, according to the prediction frame result { B ] of the current search sequence t-k ,..,B t-1 ,B t In which B is t Searching the target prediction frame of the t-th frame in the sequence according to the coordinates of the upper left corner point of each target frame
Figure FDA0004028416060000071
And the coordinate of the lower right corner point->
Figure FDA0004028416060000072
The minimum Bao Weikuang b is calculated m
S82, for the minimum bounding box b m Extended search area b for cropping a set of video sequences s The search area is guaranteed to cover objects in each video frame of the search sequence.
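A minimal sketch of S81-S82, assuming axis-aligned boxes given as (x1, y1, x2, y2); the enlargement factor used to expand b_m into b_s is an assumption, since the claim only requires the region to cover the target in every frame:

```python
import numpy as np

def confidence_search_region(boxes, margin: float = 0.5):
    # boxes: predicted boxes {B_{t-k}, ..., B_t} of the current search sequence,
    # each given as (x1, y1, x2, y2).
    boxes = np.asarray(boxes, dtype=float)
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()
    x2, y2 = boxes[:, 2].max(), boxes[:, 3].max()          # minimum bounding box b_m
    pad_w, pad_h = margin * (x2 - x1), margin * (y2 - y1)
    return (x1 - pad_w, y1 - pad_h, x2 + pad_w, y2 + pad_h)  # expanded search area b_s
```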
10. A high-performance video tracking system based on a 3D twin convolution network is characterized by comprising the following units:
Video sequence input unit: given a group of template sequence video frames and search sequence video frames, the template sequence video frames and the search sequence video frames are cut into template sequence blocks and search sequence blocks of specified sizes according to the form in S2;
Model training unit: used for training a video target tracker based on the 3D twin network, comprising a spatio-temporal feature extractor module, a multi-template matching module and a target prediction module; the spatio-temporal feature extractor takes the template sequence block and the search sequence block as input, extracts template spatio-temporal features and search spatio-temporal features from them, and inputs these features into the multi-template matching module; the multi-template matching module comprises a feature conversion sub-module and a spatio-temporal feature matching sub-module; in the feature conversion sub-module, the template spatio-temporal features are transmitted to the search features by using an interactive attention mechanism so as to obtain more discriminative search features; the spatio-temporal feature matching sub-module computes the optimal template features, and the optimal template features and the enhanced search features are respectively input into the classification branch and the regression branch; then, similarity matching between the optimal template features and the enhanced search features is performed in the high-dimensional feature space by means of the depth-wise correlation filtering operation to obtain the multi-channel correlation filter features of each branch; the output of the classification branch is input into the classification head and the quality assessment head of the target prediction module to obtain the classification score map and the quality assessment score map; the output of the regression branch is passed to the regression head to obtain the regression map; target tracking is trained by minimizing the cross-entropy losses of classification and quality assessment and the intersection-over-union loss of regression;
Video target tracking unit: combines the classification score map and the quality assessment score map output by the model to obtain the confidence classification map, estimates the position of the target by using the confidence classification map, predicts the scale of the target by using the regression map output by the model, and obtains the target prediction boxes in the search sequence; a group of confidence search areas is then obtained from the group of target prediction boxes and input into the search branch to track the target in the subsequent video sequences.
CN202211720733.3A 2022-12-30 2022-12-30 High-performance video tracking method and system based on 3D twin convolutional network Pending CN115908500A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211720733.3A CN115908500A (en) 2022-12-30 2022-12-30 High-performance video tracking method and system based on 3D twin convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211720733.3A CN115908500A (en) 2022-12-30 2022-12-30 High-performance video tracking method and system based on 3D twin convolutional network

Publications (1)

Publication Number Publication Date
CN115908500A true CN115908500A (en) 2023-04-04

Family

ID=86472977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211720733.3A Pending CN115908500A (en) 2022-12-30 2022-12-30 High-performance video tracking method and system based on 3D twin convolutional network

Country Status (1)

Country Link
CN (1) CN115908500A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402858A (en) * 2023-04-11 2023-07-07 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN116402858B (en) * 2023-04-11 2023-11-21 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN116168216A (en) * 2023-04-21 2023-05-26 中国科学技术大学 Single-target tracking method based on scene prompt
CN116168216B (en) * 2023-04-21 2023-07-18 中国科学技术大学 Single-target tracking method based on scene prompt

Similar Documents

Publication Publication Date Title
CN115908500A (en) High-performance video tracking method and system based on 3D twin convolutional network
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN112560656B (en) Pedestrian multi-target tracking method combining attention mechanism end-to-end training
CN109934846A (en) Deep integrating method for tracking target based on time and spatial network
CN113313736A (en) Online multi-target tracking method for unified target motion perception and re-identification network
CN113744311A (en) Twin neural network moving target tracking method based on full-connection attention module
CN110969648A (en) 3D target tracking method and system based on point cloud sequence data
CN111640138A (en) Target tracking method, device, equipment and storage medium
CN116563337A (en) Target tracking method based on double-attention mechanism
CN113888629A (en) RGBD camera-based rapid object three-dimensional pose estimation method
CN114038059A (en) Dynamic gesture recognition method based on double-frame rate divide and conquer behavior recognition network
CN116128944A (en) Three-dimensional point cloud registration method based on feature interaction and reliable corresponding relation estimation
CN109344712B (en) Road vehicle tracking method
CN117213470B (en) Multi-machine fragment map aggregation updating method and system
CN116051601A (en) Depth space-time associated video target tracking method and system
Agrawal et al. YOLO Algorithm Implementation for Real Time Object Detection and Tracking
Li et al. Novel Rao-Blackwellized particle filter for mobile robot SLAM using monocular vision
CN115731517B (en) Crowded Crowd detection method based on crown-RetinaNet network
CN111681264A (en) Real-time multi-target tracking method for monitoring scene
Hu et al. Loop closure detection algorithm based on attention mechanism
CN116311518A (en) Hierarchical character interaction detection method based on human interaction intention information
CN116485894A (en) Video scene mapping and positioning method and device, electronic equipment and storage medium
CN115830643A (en) Light-weight pedestrian re-identification method for posture-guided alignment
CN114022520B (en) Robot target tracking method based on Kalman filtering and twin network
CN113379794B (en) Single-target tracking system and method based on attention-key point prediction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination