CN115908500A - High-performance video tracking method and system based on 3D twin convolutional network - Google Patents

High-performance video tracking method and system based on 3D twin convolutional network

Info

Publication number
CN115908500A
Authority
CN
China
Prior art keywords
search
template
target
sequence
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211720733.3A
Other languages
Chinese (zh)
Inventor
梁敏
桂彦
欧懿汝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology filed Critical Changsha University of Science and Technology
Priority to CN202211720733.3A priority Critical patent/CN115908500A/en
Publication of CN115908500A publication Critical patent/CN115908500A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a high-performance video tracking method based on a 3D twin convolutional network. First, a spatio-temporal feature extractor based on a 3D twin network is designed to extract the spatio-temporal features of a template sequence and a search sequence. Second, a multi-template matching module is designed, which enhances the target features in the search block by transmitting template features to the search features; the module comprises a template feature conversion sub-module and a spatio-temporal feature matching sub-module. The template feature conversion sub-module transmits the appearance and motion information in the template frames to the search branch; the spatio-temporal feature matching sub-module consists of two depth correlation branches used for classification and regression, respectively. Then, a target prediction module, comprising classification, quality evaluation, and regression heads, is employed to accurately predict the target location and bounding box of each video frame in the search sequence. Finally, the video tracking model is optimized by minimizing the defined joint loss, and target tracking is predicted sequence by sequence.

Description

High-performance video tracking method and system based on 3D twin convolutional network
Technical Field
The invention relates to the field of computer vision, in particular to a high-performance video tracking method and system based on a 3D twin convolutional network.
Background
Video target tracking refers to the technology of using the context information of a video or image sequence to model the appearance and motion information of a target, so as to predict the target's motion state and mark its position. Generally, given a target designated in the first frame of a video, the specific target is continuously tracked in subsequent video frames, realizing target localization and scale estimation. Video target tracking has wide application value and can be used in fields such as video surveillance, autonomous driving, and precision guidance.
With the rapid development of deep learning and convolutional networks, more and more convolutional-network-based video target trackers have appeared. Researchers favor twin-network-based trackers, which not only retain an advantage in tracking speed but also achieve good accuracy. Such twin-network-based trackers view visual tracking as a similarity matching problem. In 2016, Bertinetto et al. first proposed the SiamFC tracker for visual tracking (Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H.S. Torr: Fully-Convolutional Siamese Networks for Object Tracking. ECCV Workshops (2) 2016: 850-865.), which extracts features with a twin network and computes the cross-correlation between the target template and the search region using correlation filtering.
In recent years, some twin-network-based approaches have attempted to update the template online to account for changes in target appearance during motion. Guo et al. proposed a dynamic twin network (Qing Guo, Wei Feng, Ce Zhou, Rui Huang, Liang Wan, Song Wang: Learning Dynamic Siamese Network for Visual Object Tracking. ICCV 2017.). However, such methods use historical predictions for model updates and do not explicitly model the relationship between time and space. In addition, some methods introduce attention mechanisms and Transformers into the twin network to improve the discrimination of targets in complex scenes. Chen et al. designed a feature fusion network using a Transformer (Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, Huchuan Lu: Transformer Tracking. CVPR 2021.). Du et al. designed a correlation-guided attention module (Fei Du, Peng Liu, Wei Zhao, Xianglong Tang: Correlation-Guided Attention for Corner Detection Based Visual Tracking. CVPR 2020.). Trackers with attention mechanisms improve tracking precision, but they track using only spatial information and ignore the rich context information contained in the video. In contrast, visual tracking based on a 3D twin network can effectively exploit the spatio-temporal information across frames, learn spatio-temporal appearance features for target appearance modeling, and obtain more discriminative search features, thereby achieving more accurate tracking and localization.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a high-performance video tracking method and system based on a 3D twin convolutional network, which receive a template sequence and a search sequence as input and extract spatio-temporally correlated feature information through the 3D twin network; a multi-template matching module is developed, in which an attention mechanism transmits template features to the search features to strengthen the target features in the search block and obtain more discriminative search features; in addition, a quality evaluation branch is introduced to avoid the ambiguity of localization by the classification branch alone and further improve tracking accuracy. The tracker processes video target tracking sequence by sequence, defines a multi-template matching mechanism, and enhances the search features by aggregating the spatio-temporal features of the target across the frames of a sequence, achieving a balance between speed and accuracy and greatly improving the performance of the video target tracker.
In order to achieve the above object, the present invention provides a high-performance video tracking method based on a 3D twin convolutional network, comprising the following steps:
s1, constructing a network architecture, wherein the network consists of a space-time feature extractor, a multi-template matching module and a target prediction module;
s2, respectively giving template sequence video frames and search sequence video frames, and cutting the template sequence video frames and the search sequence video frames into template sequence blocks and search sequence blocks which are used as input of the whole network architecture;
S3, constructing a spatio-temporal feature extractor, wherein the module is a 3D twin fully convolutional network comprising a template branch and a search branch that use the 3D fully convolutional network as the base network and share weights; taking the template sequence block and the search sequence block as input, the spatio-temporal feature extractor extracts template spatio-temporal features and search spatio-temporal features from them;
s4, constructing a multi-template matching module which comprises a template feature conversion submodule and a space-time feature matching submodule, and transmitting appearance and motion information in a template frame to a search branch by using the template feature conversion submodule to obtain a search feature with more discriminative power; the space-time feature matching sub-module consists of two depth related branches which are respectively used for classification and regression, and the multi-channel related filtering features are obtained by taking the template space-time features and the enhanced search features as input;
S5, constructing a target prediction module which mainly comprises a classification head, a quality evaluation head and a regression head; taking the multi-channel correlation filtering features output by the classification branch as the input of the classification head and the quality evaluation head to obtain a classification score map and a quality evaluation score map; taking the multi-channel correlation filtering features output by the regression branch as the input of the regression head to obtain a regression map;
s6, positioning the position of each video frame target in the sequence by utilizing the classification score map and the quality evaluation score map; estimating the target scale of each video frame in the sequence according to the regression graph; finally, obtaining a target prediction frame of each video frame in the search sequence;
s7, optimizing a network model through the minimized joint loss, wherein the network model comprises cross entropy loss of classification and quality evaluation and cross-over ratio loss of regression, and finally obtaining a high-performance video target tracker model;
S8, performing target tracking on the given video sequence by taking the trained network model as the visual tracker; in order to ensure stable and accurate tracking, a confidence search region estimation strategy is defined, and the search region of the next sequence is cropped according to the target states in the current video sequence, reducing error accumulation and accurately locating the target in each video frame of the search sequence.
The invention further provides an end-to-end trainable neural network architecture and system for video object tracking, which comprises: a video sequence input unit for cropping the template sequence blocks and search sequence blocks; a model training unit for training the high-performance video target tracker by minimizing the joint loss, including the cross-entropy losses of the classification and quality evaluation heads and the intersection-over-union loss of the regression head, finally realizing target tracking of video sequences; and a video target tracking unit for estimating the target state and predicting the scale in the video frames of the search sequence using the classification map, the quality evaluation score map, and the regression map output by the model, computing the target prediction boxes in the search sequence, computing the confidence search area of the next group of video sequences from the target prediction boxes of the current video sequence, and inputting it into the search branch to track the target in subsequent video sequences.
Compared with the prior art, the method has the following beneficial effects:
the invention utilizes the 3D twin full convolution network to extract the template space-time characteristics and search the space-time characteristics, and learns the abundant space-time information between a plurality of continuous video frames. And inputting the extracted template space-time characteristics and the search space-time characteristics into a multi-template matching sub-network to obtain multi-channel related filtering characteristics. Respectively processing the multi-channel related filtering characteristics of the classification branches by using the classification head and the quality evaluation head, and predicting the positioning of the target; and processing the multichannel correlation filtering characteristics of the regression branches by using the regression head to estimate the target scale. In the target tracking stage, in order to obtain a more accurate search sequence region, a confidence search region estimation strategy is defined, and a next search sequence region is estimated according to different states of a target in a current video sequence, so that the stability and the accuracy of target tracking are ensured. The method processes video target tracking in a video sequence mode, ensures the speed of video tracking, and can capture the space-time relevance between video frames; meanwhile, a multi-template matching mechanism is introduced, so that more discriminative search features are obtained, quality evaluation branches are added in classification, and the accuracy of video target tracking is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is an overall structure diagram of the network in the invention patent.
Fig. 2 is a schematic diagram of template sequence blocks and search sequence blocks in the patent of the present invention.
FIG. 3 is a schematic diagram of a spatiotemporal feature extractor according to the present invention.
FIG. 4 is a schematic diagram of a multi-template matching module according to the present invention.
Fig. 5 is a diagram of confidence search area estimation in the present patent.
Fig. 6 is a schematic diagram of a part of video frames in the patent of the present invention.
Fig. 7 is a schematic diagram of a video target tracking result in the patent of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The invention is described in detail below with reference to the drawings and the detailed description.
The invention is described in detail below with reference to the drawings of the specification and specific embodiments, and a video target tracking method with deep space-time correlation comprises steps S1 to S8;
s1, constructing a network architecture, wherein the network consists of a space-time feature extractor, a multi-template matching module and a target prediction module;
s2, respectively giving template sequence video frames and search sequence video frames, and cutting the template sequence video frames and the search sequence video frames into template sequence blocks and search sequence blocks which are used as input of the whole network architecture;
S3, constructing a spatio-temporal feature extractor, wherein the module is a 3D twin fully convolutional network comprising a template branch and a search branch that use the 3D fully convolutional network as the base network and share weights; taking the template sequence block and the search sequence block as input, the spatio-temporal feature extractor extracts template spatio-temporal features and search spatio-temporal features from them;
s4, constructing a multi-template matching module which comprises a template feature conversion submodule and a space-time feature matching submodule, and transmitting appearance and motion information in a template frame to a search branch by using the template feature conversion submodule to obtain search features with higher discriminative power; the space-time feature matching sub-module consists of two depth related branches which are respectively used for classification and regression, and the multi-channel related filtering features are obtained by taking the template space-time features and the enhanced search features as input;
S5, constructing a target prediction module which mainly comprises a classification head, a quality evaluation head and a regression head; taking the multi-channel correlation filtering features output by the classification branch as the input of the classification head and the quality evaluation head to obtain a classification score map and a quality evaluation score map; taking the multi-channel correlation filtering features output by the regression branch as the input of the regression head to obtain a regression map;
s6, positioning the position of each video frame target in the sequence by utilizing the classification score map and the quality evaluation score map; estimating the target scale of each video frame in the sequence according to the regression graph; finally, a target prediction frame of each video frame in the search sequence is obtained;
s7, optimizing a network model through the minimized joint loss, wherein the network model comprises cross entropy loss of classification and quality evaluation and cross-over ratio loss of regression, and finally obtaining a high-performance video target tracker model;
S8, performing target tracking of the given video sequence by taking the trained network model as the visual tracker; in order to ensure stable and accurate tracking, a confidence search region estimation strategy is defined, and the search region of the next sequence is cropped according to the target states in the current video sequence, reducing error accumulation and accurately locating the target in each video frame of the search sequence.
In step S1, a network architecture is constructed, as shown in fig. 1, the network is composed of a spatio-temporal feature extractor, a multi-template matching module, and a target prediction module. The method comprises the following steps:
s11, constructing a space-time feature extractor based on a 3D twin network, wherein the space-time feature extractor comprises a template branch and a search branch, takes a 3D full convolution neural network as a basic network and shares weight, and is used for extracting template space-time features and searching space-time features from an input video sequence block.
S12, constructing a multi-template matching module which comprises a template feature conversion submodule and a space-time feature matching submodule, and transmitting appearance and motion information in a template frame to a search branch by using the template feature conversion submodule to obtain search features with higher discriminative power; the space-time feature matching sub-module consists of two depth related branches which are respectively used for classification and regression, and the multi-channel related filtering features are obtained by taking the template space-time features and the enhanced search features as input.
And S13, the target prediction module comprises a classification head, a quality evaluation head and a regression head, and a classification score map, a quality evaluation score map and a regression map are respectively obtained through the classification head, the quality evaluation head and the regression head by taking the multi-channel related filtering characteristics as input.
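For illustration only, the three modules of step S1 can be wired together as one end-to-end network. The following is a minimal PyTorch sketch, assuming the sub-modules expose the interfaces described in S11–S13 (the extractor returns the two spatio-temporal features, the matcher returns the classification and regression correlation features, the predictor returns the three maps); the class and argument names are ours, not the patent's.

```python
import torch.nn as nn

class Siam3DTracker(nn.Module):
    """Minimal skeleton of the S1 architecture: spatio-temporal feature extractor,
    multi-template matching module, and target prediction module."""
    def __init__(self, extractor, matcher, predictor):
        super().__init__()
        self.extractor = extractor   # 3D twin fully convolutional network (S11)
        self.matcher = matcher       # multi-template matching module (S12)
        self.predictor = predictor   # classification / quality / regression heads (S13)

    def forward(self, template_block, search_block):
        f_z, f_x = self.extractor(template_block, search_block)   # template / search spatio-temporal features
        f_cls, f_reg = self.matcher(f_z, f_x)                     # multi-channel correlation filtering features
        return self.predictor(f_cls, f_reg)                       # classification, quality, regression maps
```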
In step 2, a template sequence video frame and a search sequence video frame are respectively given and cut into a template sequence block and a search sequence block, as shown in fig. 2, and are used as input of the whole network architecture. The method comprises the following steps:
and S21, giving a template sequence, acquiring the center position, the width and the height information of the target according to the real value information of the target in each video frame in the template sequence, and expressing as (x, y, w, h).
S211, calculating the extension value p = (w + h)/2 of the width and height of the target frame according to each piece of real target frame information given in S21, and calculating a scaling factor for scaling the expanded target frame area. If the target frame area with the added extension exceeds the boundary of the video frame, the exceeding part is filled with the average RGB value of the current video frame. Finally, each video frame in the template sequence is cropped to a template block of size 127 × 127.
S212, each video frame in the template sequence is cropped in this way to obtain the template block sequence of k frames, where k represents the total number of video frames in the template sequence.
S22, giving a search sequence, acquiring the center position, the width and the height information of the target according to the real value information of the first frame video frame target in the template sequence, and expressing as (X, Y, W, H).
S221, calculating the extension value P = (W + H)/2 of the width and height of the target frame according to the real target frame information given in S22, and calculating a scaling factor for scaling the expanded target frame area. If the target frame area with the added extension exceeds the boundary of the video frame, the exceeding part is filled with the average RGB value of the current video frame; finally, each video frame in the search sequence is cropped into a search block of size 255 × 255.
S222, each video frame in the search sequence is cropped in this way to obtain the search block sequence of k frames, where k represents the total number of video frames in the search sequence.
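A minimal sketch of the cropping in S21–S222 follows. It assumes the common twin-tracker convention that the padded target box (with extension p) defines a square crop region that is resized to the output patch; the exact scaling-factor formula appears only as an image in the original, so that step, the function name, and the parameters are assumptions.

```python
import numpy as np
import cv2

def crop_block(frame, center_xy, size_wh, out_size):
    """Sketch of S211/S221: crop a target-centered block (out_size=127 for the
    template, 255 for the search), filling with the frame's mean RGB when the
    crop crosses the image boundary, then resize to out_size x out_size."""
    (cx, cy), (w, h) = center_xy, size_wh
    p = (w + h) / 2.0                                   # extension value p = (w + h) / 2
    side = np.sqrt((w + p) * (h + p))                   # side of the padded crop region (assumed)
    half = side / 2.0
    x1, y1, x2, y2 = cx - half, cy - half, cx + half, cy + half
    pad = int(max(0, -x1, -y1, x2 - frame.shape[1], y2 - frame.shape[0])) + 1
    avg = frame.mean(axis=(0, 1)).tolist()              # average RGB value of the current frame
    padded = cv2.copyMakeBorder(frame, pad, pad, pad, pad,
                                cv2.BORDER_CONSTANT, value=avg)
    patch = padded[int(y1) + pad:int(y2) + pad, int(x1) + pad:int(x2) + pad]
    # resizing to out_size applies the scaling factor (its exact formula is an image in the source)
    return cv2.resize(patch, (out_size, out_size))
```

Applying this per frame with out_size=127 gives the template blocks of S212, and with out_size=255 the search blocks of S222.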
In step S3, the spatio-temporal feature extractor is a 3D twin full convolution network, which includes a template branch and a search branch, and uses the 3D full convolution network as a basic network and shares weights. The template sequence block and the search sequence block are used as input, and a space-time feature extractor extracts template space-time features and search space-time features from the input. The method comprises the following steps:
and S31, constructing a feature extraction network, wherein each branch is a Res2+1D network consisting of five residual blocks as shown in FIG. 3.
S32, modifying the padding attribute in the first residual block of Res2+1D to 1 × 4, modifying the output channels of the fourth residual block and the input channels of the fifth residual block to 256, and removing the down-sampling and the final classification layer of the fifth residual block. Thus, the output spatio-temporal features have the same temporal length as the input video sequence.
S33, inputting the template blocks and search blocks obtained in steps S212 and S222 into the spatio-temporal feature extractor to obtain the template spatio-temporal features F^Z and the search spatio-temporal features F^X, respectively.
In step 4, a multi-template matching module is constructed, wherein the multi-template matching module comprises a template feature conversion submodule and a space-time feature matching submodule, and appearance and motion information in a template frame is transmitted to a search branch by using the template feature conversion submodule to obtain search features with higher discriminative power; the space-time feature matching sub-module consists of two depth related branches which are respectively used for classification and regression, and the multi-channel related filtering features are obtained by taking the template space-time features and the enhanced search features as input. The method comprises the following steps:
s41, as shown in the figure 4, the template feature conversion sub-module establishes a connection between the template features and the search features by means of an interactive attention mechanism; firstly, the template feature F Z Is converted into dimensions of
Figure BDA0004028416070000073
Space-time matrix of (1), search feature conversion F X Convert to dimension->
Figure BDA0004028416070000074
The space-time matrix of (1), wherein N Z =k×h×w,N X =k×H×W。
S411, calculating the cross-attention matrix A^{Z→X} ∈ R^{N_Z × N_X} by an attention mechanism, specifically A^{Z→X} = softmax(φ(F^Z)^T φ(F^X)), where φ(·) is a 1 × 1 linear transformation operation and softmax denotes the normalization operation.
S412, given the template spatio-temporal features F^Z, a Gaussian mask m_i(y) = exp(−||y − c||²/2σ²) is calculated for each feature map F^Z_i, where y represents the position of a pixel point in each feature map and c represents the real center position of the target in each frame; this yields the Gaussian mask set M = {m_1, …, m_k}, which is reshaped into M′ aligned with the N_Z template positions.
S413, using the cross-attention matrix A^{Z→X} with M′ as the attention weight, the transferred mask is calculated and multiplied element by element with the search feature F^X to obtain the masked search feature, where ⊙ denotes element-by-element multiplication and Ins.Norm denotes the instance normalization operation, which enables the potential location of the target in the search area to be found more accurately.
And S414, simultaneously, encoding the context information in the template sequence and transmitting the context information to the search feature. Calculating mask template characteristics according to Gaussian mask set M
to highlight the target position in the search area and weaken background interference; the masked template feature is then transferred to the search branch to obtain the transferred masked template feature.
S415, the transferred masked template feature and the search feature F^X are then added to obtain the transferred search feature.
s416, searching for features by using the mask
Figure BDA00040284160700000814
And the search feature passed pick>
Figure BDA00040284160700000815
Element by element addition, an enhanced search feature is calculated>
Figure BDA00040284160700000816
And converts it into the original characteristic dimension->
Figure BDA00040284160700000817
S42, the template features F^Z obtained in S3 are input into the classification and regression branches of the spatio-temporal feature matching sub-module to obtain the classification template features and the regression template features; likewise, the enhanced search feature from S41 is input into the classification and regression branches to obtain the classification search feature and the regression search feature, as shown in Fig. 4.
S421, in order to make effective use of the template spatio-temporal features, the classification and regression template spatio-temporal features are averaged along the time dimension; the features obtained by this averaging are replicated k times to finally obtain the optimal classification template feature and the optimal regression template spatio-temporal feature, which serve as the inputs of the classification branch and the regression branch, respectively.
S422, the depth-wise correlation filtering operation is performed between the optimal template features and the enhanced search features in the classification and regression branches, respectively, matching them in a high-dimensional feature space.
S43, the classification branch and the regression branch output the multi-channel correlation filtering features F_cls and F_reg, respectively.
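The interaction in S41–S416 can be sketched as below, assuming the flattened C × N_Z / C × N_X layout and a cross-attention matrix A^{Z→X} of size N_Z × N_X as described above. Since the patent's exact formulas exist only as images, the attention, mask-transfer, and normalization details here are our assumptions, as are all names.

```python
import torch
import torch.nn as nn

class TemplateFeatureTransfer(nn.Module):
    """Sketch of the template feature conversion sub-module (S41-S416)."""
    def __init__(self, channels):
        super().__init__()
        self.phi_z = nn.Conv3d(channels, channels, kernel_size=1)   # 1x1 linear transformation φ
        self.phi_x = nn.Conv3d(channels, channels, kernel_size=1)
        self.norm = nn.InstanceNorm1d(1)                            # Ins.Norm of S413

    def forward(self, f_z, f_x, gauss_masks):
        # f_z: (B, C, k, h, w), f_x: (B, C, k, H, W), gauss_masks: (B, k, h, w)
        B, C = f_x.shape[:2]
        z = self.phi_z(f_z).flatten(2)                       # (B, C, N_Z)
        x = self.phi_x(f_x).flatten(2)                       # (B, C, N_X)
        attn = torch.softmax(z.transpose(1, 2) @ x, dim=-1)  # A^{Z->X}: (B, N_Z, N_X)
        m = gauss_masks.flatten(1).unsqueeze(1)              # M': (B, 1, N_Z)
        m_x = self.norm(m @ attn)                            # transferred mask over search positions (S413)
        x_flat = f_x.flatten(2)
        masked_search = x_flat * m_x                         # masked search feature (S413)
        masked_tmpl = f_z.flatten(2) * m                     # masked template feature (S414)
        transferred = masked_tmpl @ attn + x_flat            # transferred search feature (S415)
        enhanced = masked_search + transferred               # enhanced search feature (S416)
        return enhanced.reshape(B, C, *f_x.shape[2:])        # back to the original dimension
```

The enhanced search feature and the time-averaged, k-times-replicated template features would then pass through the depth-wise correlation filtering of S421–S43 to produce F_cls and F_reg.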
in step 5, a target prediction module is constructed, which mainly consists of a classification head, a quality evaluation head and a regression head. Taking the multi-channel related filtering features output by the classification branches as the input of a classification head and a quality evaluation head to obtain a classification score map and a quality evaluation score map; and (4) taking the multichannel correlation filtering characteristics output by the regression branches as the input of the regression head to obtain a regression graph. The method comprises the following steps:
s51, the classification head is composed of a 1 × 1 convolution layer, and the classification branch in S42 is outputMultichannel correlation filter characteristic F cls As input to the classification header, a classification score map is output:
Figure BDA0004028416070000099
s52, the quality evaluation head consists of a 1 multiplied by 1 convolution layer, and the multi-channel related filter characteristics F output by the classification branch in S42 cls As inputs to the quality assessment head, a quality assessment score map is output:
Figure BDA00040284160700000910
s53, the regression head is composed of a 1 multiplied by 1 convolution layer, and the multi-channel related filter characteristic F output by the regression branch in S42 reg As an input to the regression head, a regression graph is output:
Figure BDA00040284160700000911
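A minimal sketch of the three heads in S51–S53 follows; the input channel count and the per-frame 2D layout are assumptions, since the patent states only that each head is a 1 × 1 convolution and what it outputs.

```python
import torch.nn as nn

class TargetPredictionHeads(nn.Module):
    """Classification, quality evaluation and regression heads, each a 1x1 convolution."""
    def __init__(self, channels=256):
        super().__init__()
        self.cls_head = nn.Conv2d(channels, 1, kernel_size=1)      # classification score map
        self.quality_head = nn.Conv2d(channels, 1, kernel_size=1)  # quality evaluation score map
        self.reg_head = nn.Conv2d(channels, 4, kernel_size=1)      # regression map: (l, t, r, b) offsets

    def forward(self, f_cls, f_reg):
        # f_cls, f_reg: per-frame correlation features, e.g. shape (B*k, channels, 17, 17)
        return self.cls_head(f_cls), self.quality_head(f_cls), self.reg_head(f_reg)
```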
in step 6, the position of each video frame target in the sequence is positioned by utilizing the classification score map and the quality evaluation score map; estimating the target scale of each video frame in the sequence according to the regression graph; and finally obtaining a target prediction frame of each video frame in the search sequence. The method comprises the following steps:
s61, the size of the classification score map is as follows: k × 17 × 17, the size of the quality assessment score map is: k is multiplied by 17 and 17, the quality evaluation score map is multiplied by the corresponding position of the classification score map to obtain a confidence classification map with the size of k is multiplied by 17 and 17, and for the ith frame in the video sequence, the confidence classification score map finds the point with the maximum response value
Figure BDA00040284160700000912
Expressed in the original video frame as:
Figure BDA00040284160700000913
where s =8 is the total step size of the entire network.
S62, the regression map is a four-channel map of size k × 4 × 17 × 17; l_i, t_i, r_i, b_i denote the offsets of the regression target in the ith frame of the video sequence, from which the predicted target box coordinate information, namely the coordinates of the upper-left and lower-right corner points of the target prediction box B_i, is obtained.
In step 7, a high-performance video target tracker model is finally obtained by minimizing the joint loss optimization network model, including the cross entropy loss of classification and quality evaluation, and the cross-over ratio loss of regression. The method comprises the following steps:
s71, the total training loss is defined as:
Figure BDA0004028416070000102
wherein L is i Is the loss of the ith search frame. k is expressed as the total number of classification scores (quality assessment scores, regression scores).
Figure BDA0004028416070000109
Indicates the probability that the (x, y) position in the i-th search block belongs to the target. />
Figure BDA00040284160700001010
Indicating the probability that the (x, y) position in the ith search block is close to the center of the target. />
Figure BDA00040284160700001011
Indicates the offset of position (x, y) from the perimeter of the bounding box in the ith regression plot.
S72, the per-frame training loss L_i includes the cross-entropy losses of classification and quality evaluation and the intersection-over-union (IoU) loss of regression. In its definition, 1{·} is the indicator function of whether a position belongs to the target: it is assigned 1 if the position belongs to the target and 0 otherwise. L_cls denotes the cross-entropy loss of classification, L_quality denotes the cross-entropy loss of quality evaluation, and L_reg denotes the IoU loss of regression. If the current position (x_i, y_i) is a positive sample, i.e., it belongs to the target, its classification label is assigned 1; if it is a negative sample, the label is assigned 0. The quality evaluation target represents how close the position is to the center of the real target box in the ith search block: the closer the position is to the target center, the higher the value, and the lower the value as the position deviates. The regression target is the offset of the real target center position (x_i, y_i) in the ith search block from the perimeter of the bounding box. N_pos denotes the total number of positive samples, and λ is the weight of each loss term.
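The joint loss of S71–S72 can be sketched as below, assuming sigmoid-based cross-entropy for the classification and quality heads and a standard IoU loss on the (l, t, r, b) offsets at positive positions; the exact normalization by N_pos and the placement of the weight λ appear only as images in the original, so the combination here is an assumption.

```python
import torch
import torch.nn.functional as F

def joint_loss(cls_logits, quality_logits, reg_pred, cls_label, quality_label,
               reg_target, pos_mask, lam=1.0):
    """Cross-entropy (classification, quality) + IoU (regression) joint loss sketch.
    cls_logits, quality_logits, cls_label, quality_label, pos_mask: (k, 17, 17);
    reg_pred, reg_target: (k, 17, 17, 4); pos_mask marks the positive samples."""
    loss_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_label)
    loss_quality = F.binary_cross_entropy_with_logits(
        quality_logits[pos_mask], quality_label[pos_mask])
    # IoU loss between predicted and ground-truth (l, t, r, b) offsets at positive positions
    lp, tp, rp, bp = reg_pred[pos_mask].unbind(-1)
    lg, tg, rg, bg = reg_target[pos_mask].unbind(-1)
    inter = (torch.min(lp, lg) + torch.min(rp, rg)) * (torch.min(tp, tg) + torch.min(bp, bg))
    union = (lp + rp) * (tp + bp) + (lg + rg) * (tg + bg) - inter
    loss_reg = (1.0 - inter / union.clamp(min=1e-6)).mean()
    return loss_cls + lam * (loss_quality + loss_reg)
```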
In step 8, the trained network model is used as a visual tracker to track the target of a given video sequence by video sequence. In order to ensure stable and accurate tracking, a confidence search region estimation strategy is defined, a search region of a next sequence is cut according to different target states in a current video sequence, error accumulation is reduced, and a target of each video frame in the search sequence is accurately positioned. The method comprises the following steps:
s81, due to the fact that the target possibly has large position change in the video sequence, according to the prediction frame result { B ] of the current search sequence t-k ,..,B t-1 ,B t In which B is t Searching the target prediction frame of the t-th frame in the sequence according to the coordinates of the upper left corner point of each target frame
Figure BDA0004028416070000111
And the coordinates of the lower right corner &>
Figure BDA0004028416070000112
Calculated to give a minimum of Bao Weikuang b m As shown in fig. 4.
S82, the minimum bounding box b_m is expanded into the search area b_s used to crop the next group of video sequences, ensuring that the search area covers the target in each video frame of the search sequence. The video target tracking results are shown in Fig. 7.
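A minimal sketch of the confidence search region estimation of S81–S82: the minimum bounding box b_m over the current sequence's prediction boxes is expanded into the search region b_s for the next sequence. The expansion ratio is not given numerically in the text, so the `margin` parameter is an assumption.

```python
def confidence_search_region(pred_boxes, margin=0.5):
    """pred_boxes: list of (x1, y1, x2, y2) prediction boxes B_{t-k}..B_t of the current sequence."""
    x1 = min(b[0] for b in pred_boxes)
    y1 = min(b[1] for b in pred_boxes)
    x2 = max(b[2] for b in pred_boxes)
    y2 = max(b[3] for b in pred_boxes)          # minimum bounding box b_m (S81)
    w, h = x2 - x1, y2 - y1
    return (x1 - margin * w, y1 - margin * h,   # expanded search region b_s (S82)
            x2 + margin * w, y2 + margin * h)
```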
According to another aspect of the present application, there is also provided a depth space-time correlated video target tracking system, including the following units:
video sequence input unit: given a set of template sequence video frames and search sequence video frames, they are cut into template sequence blocks and search sequence blocks of a specified size in the form of S2.
The model training unit is used for training the video target tracker based on the 3D twin network and comprises the spatio-temporal feature extractor module, the multi-template matching module and the target prediction module. The spatio-temporal feature extractor takes the template sequence block and the search sequence block as input, extracts the template spatio-temporal features and the search spatio-temporal features from them, and inputs these features to the multi-template matching module. The multi-template matching module comprises a feature conversion sub-module and a spatio-temporal feature matching sub-module. In the feature conversion sub-module, the template spatio-temporal features are transmitted to the search features using an interactive attention mechanism to obtain more discriminative search features; the spatio-temporal feature matching sub-module computes the optimal template features, and the optimal template features and the enhanced search features are input into the classification branch and the regression branch, respectively; then, using the depth-wise correlation filtering operation, the optimal template features and the enhanced search features are matched for similarity in a high-dimensional feature space to obtain the multi-channel correlation filtering features. The output of the classification branch is input into the classification head and the quality evaluation head of the target prediction module to obtain the classification score map and the quality evaluation score map; the output of the regression branch is passed to the regression head to obtain the regression map. Target tracking is trained by minimizing the cross-entropy losses of classification and quality evaluation and the intersection-over-union loss of regression.
The video target tracking unit combines the classification score map and the quality evaluation score map output by the model to obtain a confidence classification map, estimates the position of the target by using the confidence classification map, predicts the scale of the target by using a regression map output by the model, and obtains a target prediction frame in the search sequence; and obtaining a group of confidence search areas by utilizing the group of target prediction boxes, and inputting the confidence search areas into a search branch to track the target of the subsequent video sequence.
The system is used for implementing the functions of the method in the above embodiments, and the specific implementation steps of the method related in the system module have already been described in the method, and are not described again here.
In the embodiment of the application, first, a spatio-temporal feature extractor based on a 3D twin network is designed to extract the spatio-temporal features of the template sequence and the search sequence, improving the discrimination ability of the tracker. Second, a multi-template matching module is designed, which enhances the target features in the search block by transmitting template features to the search features; the module comprises a template feature conversion sub-module and a spatio-temporal feature matching sub-module. The template feature conversion sub-module transmits the appearance and motion information in the template frames to the search branch to obtain more discriminative search features; the spatio-temporal feature matching sub-module consists of two depth correlation branches used for classification and regression respectively, realizing more efficient information association. Then, a target prediction module, including classification, quality evaluation, and regression branches, is employed to accurately predict the target location and bounding box of each video frame in the search sequence. Finally, the video tracking model is optimized by minimizing the defined joint loss, and target tracking is predicted sequence by sequence. In the target tracking test, a confidence region estimation strategy is defined, and the search region of the next video sequence is calculated from the target tracking result of the current video sequence, reducing error accumulation as much as possible, so that robust and accurate target tracking is maintained over the video sequence.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for high performance video tracking based on a 3D twin convolutional network, the method being performed by a computer and comprising the steps of:
s1, constructing a network architecture, wherein the network consists of a space-time feature extractor, a multi-template matching module and a target prediction module;
s2, respectively giving template sequence video frames and search sequence video frames, and cutting the template sequence video frames and the search sequence video frames into template sequence blocks and search sequence blocks which are used as input of the whole network architecture;
s3, constructing a space-time feature extractor, wherein the module is a 3D twin full convolution network and comprises a template branch and a search branch, the 3D full convolution network is used as a basic network, and weights are shared; taking the template sequence block and the search sequence block as input, and extracting template space-time characteristics and search space-time characteristics from the input by a space-time characteristic extractor;
s4, constructing a multi-template matching module which comprises a template feature conversion submodule and a space-time feature matching submodule, and transmitting appearance and motion information in a template frame to a search branch by using the template feature conversion submodule to obtain search features with higher discriminative power; the space-time feature matching sub-module consists of two depth related branches which are respectively used for classification and regression, and the multi-channel related filtering features are obtained by taking the template space-time features and the enhanced search features as input;
s5, constructing a target prediction module which mainly comprises a classification head, a quality evaluation head and a regression head; taking the multi-channel related filtering features output by the classification branches as the input of a classification head and a quality evaluation head to obtain a classification score map and a quality evaluation score map; taking the multi-channel related filtering features output by the regression branches as the input of a regression head to obtain a regression graph;
s6, positioning the position of each video frame target in the sequence by utilizing the classification score map and the quality evaluation score map; estimating the target scale of each video frame in the sequence according to the regression graph; finally, obtaining a target prediction frame of each video frame in the search sequence;
s7, optimizing a network model through the minimized joint loss, wherein the network model comprises cross entropy loss of classification and quality evaluation and intersection-to-parallel ratio loss of regression, and finally obtaining a high-performance video target tracker model;
s8, performing target tracking on a given video sequence by taking the trained network model as a visual tracker; in order to ensure stable and accurate tracking, a confidence search region estimation strategy is defined, a search region of a next sequence is cut according to different target states in a current video sequence, error accumulation is reduced, and a target of each video frame in the search sequence is accurately positioned.
2. The high-performance video tracking method based on the 3D twin convolutional network as claimed in claim 1, wherein the video target tracking network structure based on the 3D twin convolutional network is constructed as follows:
S11, constructing a 3D twin network-based spatio-temporal feature extractor which comprises a template branch and a search branch, wherein a 3D fully convolutional neural network is used as the base network with weight sharing, and template spatio-temporal features and search spatio-temporal features are extracted from the input video sequence blocks;
S12, constructing a multi-template matching module which comprises a template feature conversion sub-module and a spatio-temporal feature matching sub-module, and transmitting the appearance and motion information in the template frames to the search branch by using the template feature conversion sub-module to obtain more discriminative search features; the spatio-temporal feature matching sub-module consists of two depth correlation branches used for classification and regression respectively, and the multi-channel correlation filtering features are obtained by taking the template spatio-temporal features and the enhanced search features as input;
and S13, the target prediction module comprises a classification head, a quality evaluation head and a regression head, and a classification score map, a quality evaluation score map and a regression map are respectively obtained through the classification head, the quality evaluation head and the regression head by taking the multi-channel correlation filtering features as input.
3. The 3D twin convolutional network-based high-performance video tracking method of claim 1, wherein the template sequence block and the search sequence block are constructed as follows:
S21, giving a template sequence, acquiring the center position, width and height of the target according to the ground-truth information of the target in each video frame of the template sequence, expressed as (x, y, w, h);
S211, calculating the extension value p = (w + h)/2 of the width and height of the target frame according to each piece of real target frame information given in S21, and calculating a scaling factor for scaling the expanded target frame area; if the target frame area with the added extension exceeds the boundary of the video frame, the exceeding part is filled with the average RGB value of the current video frame; finally, each video frame in the template sequence is cropped into a template block of size 127 × 127;
S212, cropping each video frame in the template sequence in this way to obtain the template block sequence of k frames, wherein k represents the total number of video frames in the template sequence;
S22, giving a search sequence, acquiring the center position, width and height of the target according to the ground-truth information of the target in the first video frame of the template sequence, expressed as (X, Y, W, H);
S221, calculating the extension value P = (W + H)/2 of the width and height of the target frame according to the real target frame information given in S22, and calculating a scaling factor for scaling the expanded target frame area; if the target frame area with the added extension exceeds the boundary of the video frame, the exceeding part is filled with the average RGB value of the current video frame; finally, each video frame in the search sequence is cropped into a search block of size 255 × 255;
S222, cropping each video frame in the search sequence in this way to obtain the search block sequence of k frames, where k represents the total number of video frames in the search sequence.
4. The high-performance video tracking method based on the 3D twin convolutional network as claimed in claim 1, wherein the spatio-temporal feature extractor is constructed as follows:
S31, constructing a feature extraction network, wherein each branch is a Res2+1D network consisting of five residual blocks;
S32, modifying the padding attribute in the first residual block of Res2+1D to 1 × 4, modifying the output channels of the fourth residual block and the input channels of the fifth residual block to 256, and removing the down-sampling and the final classification layer of the fifth residual block, whereby the output spatio-temporal features and the input video sequence have the same temporal length;
S33, inputting the template blocks and search blocks obtained in steps S212 and S222 into the spatio-temporal feature extractor to obtain the template spatio-temporal features F^Z and the search spatio-temporal features F^X, respectively.
5. The 3D twin convolution network-based high-performance video tracking method as claimed in claim 1, wherein the multi-template matching module is constructed as follows:
S41, the template feature conversion sub-module establishes a connection between the template features and the search features by means of an interactive attention mechanism; first, the template feature F^Z is reshaped into a spatio-temporal matrix of dimension C × N_Z, and the search feature F^X is reshaped into a spatio-temporal matrix of dimension C × N_X, where C is the number of feature channels, N_Z = k × h × w, and N_X = k × H × W;
S411, calculating the cross-attention matrix A^{Z→X} ∈ R^{N_Z × N_X} by an attention mechanism, specifically A^{Z→X} = softmax(φ(F^Z)^T φ(F^X)), where φ(·) is a 1 × 1 linear transformation operation and softmax denotes the normalization operation;
S412, given the template spatio-temporal features F^Z, calculating for each feature map F^Z_i the Gaussian mask m_i(y) = exp(−||y − c||²/2σ²), wherein y represents the position of a pixel point in each feature map and c represents the real center position of the target in each frame, obtaining the Gaussian mask set M = {m_1, …, m_k} and reshaping it into M′ aligned with the N_Z template positions;
S413, using the cross-attention matrix A^{Z→X} with M′ as the attention weight, calculating the transferred mask and multiplying it element by element with the search feature F^X to obtain the masked search feature, wherein ⊙ denotes element-by-element multiplication and Ins.Norm denotes the instance normalization operation, which enables the potential location of the target in the search area to be found more accurately;
S414, at the same time, encoding the context information in the template sequence and transmitting it to the search feature; calculating the masked template feature from the Gaussian mask set M to highlight the target position in the search area and weaken background interference, and transferring it to the search branch to obtain the transferred masked template feature;
S415, further adding the transferred masked template feature and the search feature to obtain the transferred search feature;
S416, adding the masked search feature and the transferred search feature element by element to calculate the enhanced search feature, and converting it back into the original feature dimension;
S42, inputting the template features F^Z obtained in S3 into the classification and regression branches of the spatio-temporal feature matching sub-module to obtain the classification template features and the regression template features; likewise, inputting the enhanced search feature from S41 into the classification and regression branches to obtain the classification search feature and the regression search feature;
S421, in order to make effective use of the template spatio-temporal features, averaging the classification and regression template spatio-temporal features along the time dimension; the features obtained by this averaging are replicated k times to finally obtain the optimal classification template feature and the optimal regression template spatio-temporal feature, which serve as the inputs of the classification branch and the regression branch, respectively;
S422, performing the depth-wise correlation filtering operation between the optimal template features and the enhanced search features in the classification and regression branches, respectively;
S43, the classification branch and the regression branch output the multi-channel correlation filtering features F_cls and F_reg, respectively.
6. The method as claimed in claim 1, wherein the target prediction module for the video sequence is constructed as follows:
S51, the classification head consists of a 1 × 1 convolution layer; the multi-channel correlation filtering feature F_cls output by the classification branch in S42 is taken as the input of the classification head, and a classification score map is output;
S52, the quality evaluation head consists of a 1 × 1 convolution layer; the multi-channel correlation filtering feature F_cls output by the classification branch in S42 is taken as the input of the quality evaluation head, and a quality evaluation score map is output;
S53, the regression head consists of a 1 × 1 convolution layer; the multi-channel correlation filtering feature F_reg output by the regression branch in S42 is taken as the input of the regression head, and a regression map is output.
7. The method as claimed in claim 1, wherein the target position is predicted and the bounding box size is estimated, and the method is implemented as follows:
S61, the size of the classification score map is k × 17 × 17 and the size of the quality assessment score map is k × 17 × 17; the quality assessment score map and the classification score map are multiplied at corresponding positions to obtain a confidence classification map of size k × 17 × 17; for the ith frame in the search sequence, the point (p_i, q_i) with the maximum response value is found in the confidence classification map and is mapped back to the corresponding position (x_i, y_i) in the original video frame according to the total network stride, where s = 8 is the total stride of the entire network;
S62, the regression map is a four-channel map of size k × 4 × 17 × 17; l_i, t_i, r_i and b_i denote the offsets of the regression target from the predicted position to the four sides of the bounding box in the ith frame of the search sequence, so the predicted target box coordinates can be expressed as:
(x1_i, y1_i) = (x_i - l_i, y_i - t_i), (x2_i, y2_i) = (x_i + r_i, y_i + b_i)
wherein (x1_i, y1_i) and (x2_i, y2_i) denote the coordinates of the upper-left corner point and the lower-right corner point of the target prediction box B_i.
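The decoding in S61-S62 can be sketched as follows for a single search frame; the exact mapping from score-map cells back to image pixels (for example, any centre offset of the search region) is an assumption, as the text only specifies the stride s = 8:

```python
import numpy as np

def decode_frame(cls_map, quality_map, reg_map, stride: int = 8):
    # cls_map, quality_map: (17, 17) score maps for one search frame
    # reg_map: (4, 17, 17) offsets (l, t, r, b) for the same frame
    confidence = cls_map * quality_map                    # confidence classification map
    row, col = np.unravel_index(np.argmax(confidence), confidence.shape)
    x, y = col * stride, row * stride                     # position in the original frame
    l, t, r, b = reg_map[:, row, col]
    return (x - l, y - t), (x + r, y + b)                 # top-left and bottom-right of B_i
```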
8. The method for tracking high-performance video based on 3D twin convolutional network as claimed in claim 1, wherein the visual tracking model is trained by the following specific implementation process:
S71, the total training loss is defined over the k frames of the search sequence as the mean of the per-frame losses:
L = (1/k) Σ_{i=1..k} L_i
wherein L_i is the loss of the ith search frame, and k is the total number of classification score maps (equally, of quality assessment score maps and regression maps); p_i(x, y) denotes the probability that position (x, y) in the ith search block belongs to the target; q_i(x, y) denotes the probability that position (x, y) in the ith search block is close to the target center; t_i(x, y) denotes the offsets from position (x, y) to the four sides of the bounding box in the ith regression map;
S72, the training loss L_i consists of the cross-entropy losses of classification and quality assessment and the intersection-over-union (IoU) loss of regression;
wherein 1{·} is the indicator function indicating whether a position belongs to the target, taking the value 1 if it does and 0 otherwise; L_cls denotes the cross-entropy loss of classification; L_quality denotes the cross-entropy loss of quality assessment; L_reg denotes the intersection-over-union loss of regression; if the current position (x_i, y_i) is a positive sample, i.e., the position belongs to the target, the corresponding classification label is assigned the value 1, and if it is a negative sample, it is assigned the value 0; the quality assessment label is higher the closer the position is to the target center and lower the further the position deviates from it; the regression label denotes the offsets from the real target center position (x_i, y_i) in the ith search block to the four sides of the bounding box; N_pos denotes the total number of positive samples, and λ is the weight of each loss term.
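Since the original expression for L_i is not reproduced in the text, the following is one plausible written-out form in the style of anchor-free Siamese trackers; the starred label notation, the split of λ into separate weights, and the restriction of the quality and regression terms to positive samples are assumptions:

```latex
L_i = \frac{1}{N_{pos}} \sum_{x,y} \Big[
        L_{cls}\big(p_i(x,y),\, p_i^{*}(x,y)\big)
      + \lambda_1\, \mathbb{1}\{p_i^{*}(x,y)=1\}\, L_{quality}\big(q_i(x,y),\, q_i^{*}(x,y)\big)
      + \lambda_2\, \mathbb{1}\{p_i^{*}(x,y)=1\}\, L_{reg}\big(t_i(x,y),\, t_i^{*}(x,y)\big)
      \Big]
```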
9. The method as claimed in claim 1, wherein the confidence search area is estimated by the following steps:
s81, due to the fact that the target possibly has large position change in the video sequence, according to the prediction frame result { B ] of the current search sequence t-k ,..,B t-1 ,B t In which B is t Searching the target prediction frame of the t-th frame in the sequence according to the coordinates of the upper left corner point of each target frame
Figure FDA0004028416060000071
And the coordinate of the lower right corner point->
Figure FDA0004028416060000072
The minimum Bao Weikuang b is calculated m
S82, for the minimum bounding box b m Extended search area b for cropping a set of video sequences s The search area is guaranteed to cover objects in each video frame of the search sequence.
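A minimal sketch of S81-S82, assuming axis-aligned boxes given as (x1, y1, x2, y2); the enlargement factor used to expand b_m into b_s is an assumption, since the claim only requires the region to cover the target in every frame:

```python
import numpy as np

def confidence_search_region(boxes, margin: float = 0.5):
    # boxes: predicted boxes {B_{t-k}, ..., B_t} of the current search sequence,
    # each given as (x1, y1, x2, y2).
    boxes = np.asarray(boxes, dtype=float)
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()
    x2, y2 = boxes[:, 2].max(), boxes[:, 3].max()          # minimum bounding box b_m
    pad_w, pad_h = margin * (x2 - x1), margin * (y2 - y1)
    return (x1 - pad_w, y1 - pad_h, x2 + pad_w, y2 + pad_h)  # expanded search area b_s
```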
10. A high-performance video tracking system based on a 3D twin convolution network is characterized by comprising the following units:
Video sequence input unit: given a group of template sequence video frames and search sequence video frames, the template sequence video frames and the search sequence video frames are cut into template sequence blocks and search sequence blocks of specified sizes according to the form in S2;
Model training unit: used for training a video target tracker based on the 3D twin network, comprising a spatio-temporal feature extractor module, a multi-template matching module and a target prediction module; the spatio-temporal feature extractor takes the template sequence block and the search sequence block as input, extracts template spatio-temporal features and search spatio-temporal features from them, and inputs these features into the multi-template matching module; the multi-template matching module comprises a feature conversion sub-module and a spatio-temporal feature matching sub-module; in the feature conversion sub-module, the template spatio-temporal features are transmitted to the search features by using an interactive attention mechanism so as to obtain more discriminative search features; the spatio-temporal feature matching sub-module computes the optimal template features, and the optimal template features and the enhanced search features are respectively input into the classification branch and the regression branch; then, similarity matching between the optimal template features and the enhanced search features is performed in the high-dimensional feature space by means of the depth-wise correlation filtering operation to obtain the multi-channel correlation filter features of each branch; the output of the classification branch is input into the classification head and the quality assessment head of the target prediction module to obtain the classification score map and the quality assessment score map; the output of the regression branch is passed to the regression head to obtain the regression map; target tracking is trained by minimizing the cross-entropy losses of classification and quality assessment and the intersection-over-union loss of regression;
Video target tracking unit: combines the classification score map and the quality assessment score map output by the model to obtain the confidence classification map, estimates the position of the target by using the confidence classification map, predicts the scale of the target by using the regression map output by the model, and obtains the target prediction boxes in the search sequence; a group of confidence search areas is then obtained from the group of target prediction boxes and input into the search branch to track the target in the subsequent video sequences.
CN202211720733.3A 2022-12-30 2022-12-30 High-performance video tracking method and system based on 3D twin convolutional network Pending CN115908500A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211720733.3A CN115908500A (en) 2022-12-30 2022-12-30 High-performance video tracking method and system based on 3D twin convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211720733.3A CN115908500A (en) 2022-12-30 2022-12-30 High-performance video tracking method and system based on 3D twin convolutional network

Publications (1)

Publication Number Publication Date
CN115908500A true CN115908500A (en) 2023-04-04

Family

ID=86472977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211720733.3A Pending CN115908500A (en) 2022-12-30 2022-12-30 High-performance video tracking method and system based on 3D twin convolutional network

Country Status (1)

Country Link
CN (1) CN115908500A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116402858A (en) * 2023-04-11 2023-07-07 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN116402858B (en) * 2023-04-11 2023-11-21 合肥工业大学 Transformer-based space-time information fusion infrared target tracking method
CN116168216A (en) * 2023-04-21 2023-05-26 中国科学技术大学 Single-target tracking method based on scene prompt
CN116168216B (en) * 2023-04-21 2023-07-18 中国科学技术大学 Single-target tracking method based on scene prompt

Similar Documents

Publication Publication Date Title
CN115908500A (en) High-performance video tracking method and system based on 3D twin convolutional network
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN112560656B (en) Pedestrian multi-target tracking method combining attention mechanism end-to-end training
CN109934846A (en) Deep integrating method for tracking target based on time and spatial network
CN113313736A (en) Online multi-target tracking method for unified target motion perception and re-identification network
CN113744311A (en) Twin neural network moving target tracking method based on full-connection attention module
CN110969648A (en) 3D target tracking method and system based on point cloud sequence data
CN111640138A (en) Target tracking method, device, equipment and storage medium
CN116563337A (en) Target tracking method based on double-attention mechanism
CN113888629A (en) RGBD camera-based rapid object three-dimensional pose estimation method
CN114038059A (en) Dynamic gesture recognition method based on double-frame rate divide and conquer behavior recognition network
CN116128944A (en) Three-dimensional point cloud registration method based on feature interaction and reliable corresponding relation estimation
CN109344712B (en) Road vehicle tracking method
CN117213470B (en) Multi-machine fragment map aggregation updating method and system
CN116051601A (en) Depth space-time associated video target tracking method and system
Agrawal et al. YOLO Algorithm Implementation for Real Time Object Detection and Tracking
Li et al. Novel Rao-Blackwellized particle filter for mobile robot SLAM using monocular vision
CN115731517B (en) Crowded Crowd detection method based on crown-RetinaNet network
CN111681264A (en) Real-time multi-target tracking method for monitoring scene
Hu et al. Loop closure detection algorithm based on attention mechanism
CN116311518A (en) Hierarchical character interaction detection method based on human interaction intention information
CN116485894A (en) Video scene mapping and positioning method and device, electronic equipment and storage medium
CN115830643A (en) Light-weight pedestrian re-identification method for posture-guided alignment
CN114022520B (en) Robot target tracking method based on Kalman filtering and twin network
CN113379794B (en) Single-target tracking system and method based on attention-key point prediction model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination