CN113538519B

CN113538519B - Target tracking method and device, electronic equipment and storage medium

Info

Publication number: CN113538519B
Application number: CN202110826777.3A
Authority: CN
Inventors: 战赓; 庄博涵; 孙书洋; 欧阳万里
Original assignee: Beijing Sensetime Technology Development Co Ltd
Current assignee: Beijing Sensetime Technology Development Co Ltd
Priority date: 2019-06-25
Filing date: 2019-06-25
Publication date: 2024-05-17
Anticipated expiration: 2039-06-25
Also published as: CN110287874A; CN113538517B; CN110287874B; CN113538519A; CN113538517A

Abstract

The disclosure relates to a target tracking method and device, an electronic device and a storage medium, wherein the method comprises the following steps: for any current frame image after an initial frame image in a video stream, acquiring a first position of a target object in a previous frame image of the current frame image; and obtaining the position information of the target object in the current frame image based on the first position and the prediction characteristic of the target object in the current frame image, wherein the prediction characteristic of the target object in the current frame image is obtained based on the initial frame image of the video stream and the previous frame image of the current frame. The disclosed embodiments can accurately achieve target tracking.

Description

Target tracking method and device, electronic equipment and storage medium

This application relates to the Chinese patent application No. 201910555741.9, filed with the China Patent Office on June 25, 2019 and entitled "Target tracking method and device, electronic equipment and storage medium".

Technical Field

The disclosure relates to the technical field of computer vision, and in particular relates to a target tracking method and device, electronic equipment and a storage medium.

Background

Video object tracking is a key problem in computer vision that has been explored for decades. It has important applications in many computer vision sub-fields, such as video pose tracking, video image segmentation, and video object detection.

In recent years, tracking algorithms based on deep learning have achieved a certain level of performance, but conventional methods struggle to adapt quickly to drastic changes in the appearance of an object in a video, which degrades the tracking result.

Disclosure of Invention

The disclosure provides a technical scheme for target tracking.

According to a first aspect of the present disclosure, there is provided a target tracking method, comprising:

for any current frame image after an initial frame image in a video stream, acquiring a first position of a target object in a previous frame image of the current frame image;

and obtaining the position information of the target object in the current frame image based on the first position and the prediction feature of the target object in the current frame image, wherein the prediction feature of the target object in the current frame image is obtained based on the initial frame image of the video stream and the previous frame image of the current frame image.

In some possible embodiments, obtaining the prediction features of the current frame image includes:

and obtaining the prediction characteristic of the target object in the current frame image based on the first characteristic corresponding to the first position of the target object in the previous frame image of the current frame image and the second characteristic corresponding to the second position of the target object in the initial frame image.

In some possible implementations, before the acquiring, for any current frame image after the initial frame image in the video stream, a first position where the target object is located in a frame image preceding the current frame image, the method further includes:

obtaining a second position of the target object in the initial frame image and a second feature corresponding to the second position.

In some possible embodiments, the acquiring the second position of the target object in the initial frame image includes at least one of the following ways:

acquiring a position mask map for the target object in the initial frame image, and determining a second position of the target object based on the mask map;

receiving a frame selection operation aiming at the initial frame image, and determining a second position of the target object based on a position area corresponding to the frame selection operation;

and executing a target detection operation on the initial frame image, and determining a second position of the target object based on a detection result of the target detection operation.

In some possible embodiments, the obtaining the prediction feature of the target object in the current frame image based on the first feature corresponding to the first position of the target object in the previous frame image of the current frame image and the second feature corresponding to the second position of the target object in the initial frame image includes:

performing convolution processing on the first feature and the second feature respectively to obtain a first transition feature of the first feature and a second transition feature of the second feature;

performing first cross-correlation coding processing and graph convolution processing on the first transition feature and the second transition feature to obtain a third feature;

and obtaining the prediction feature based on feature fusion processing of the third feature, the first transition feature and the second feature.

In some possible implementations, the performing first cross-correlation coding processing and graph convolution processing on the first transition feature and the second transition feature to obtain a third feature includes:

performing first cross-correlation coding processing on the first transition feature and the second transition feature to obtain a first coding feature;

and inputting the first coding feature into a graph neural network to execute graph convolution processing to obtain the third feature.

In some possible embodiments, performing a first cross-correlation encoding process on the first transition feature and the second transition feature to obtain a first encoded feature includes:

and performing matrix multiplication operation on the first transition feature and the second transition feature to obtain the first coding feature.
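The matrix multiplication described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the disclosed implementation: it assumes both transition features are arrays of shape (C, H, W) that are flattened over their spatial positions before being multiplied, and the function name `cross_correlation_encode` is hypothetical.

```python
import numpy as np


def cross_correlation_encode(first_transition, second_transition):
    """Correlate every spatial position of one feature map with every position of the other.

    Both inputs are assumed to have shape (C, H, W). The result has shape
    (H*W, H*W): entry (i, j) is the dot product between the channel vector at
    position i of the first map and the channel vector at position j of the second.
    """
    c = first_transition.shape[0]
    a = first_transition.reshape(c, -1)   # (C, N) with N = H * W
    b = second_transition.reshape(c, -1)  # (C, N)
    return a.T @ b                        # (N, N) matrix product
```

Under these assumptions each entry of the result measures how similar one location of the previous-frame feature is to one location of the initial-frame feature.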

In some possible embodiments, the obtaining the predicted feature based on feature fusion processing of the third feature, the first transition feature, and the second feature includes:

performing cross-correlation decoding processing on the third feature based on the first transition feature to obtain a fourth feature;

and adding the fourth feature and the second feature to obtain the predicted feature.
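The two fusion steps above can be sketched as follows. This is only an illustration: the text here does not specify the decoding operation, so the sketch assumes it is a matrix product of the flattened first transition feature with the third feature, and it assumes the third feature is an N x N matrix with N = H * W. The function name `fuse_features` and all shapes are hypothetical.

```python
import numpy as np


def fuse_features(third, first_transition, second):
    """Decode `third` with the first transition feature, then add the second feature.

    third:            (N, N) array, N = H * W   (shape is an assumption)
    first_transition: (C, H, W) array
    second:           (C, H, W) array
    """
    c, h, w = first_transition.shape
    # Assumed decoding step: project back to (C, N) with a matrix product.
    fourth = first_transition.reshape(c, h * w) @ third
    # Addition step stated in the text: fourth feature plus second feature.
    return fourth.reshape(c, h, w) + second
```

With an identity matrix as the third feature, the sketch reduces to adding the first transition feature to the second feature, which is a quick way to sanity-check the shapes.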

In some possible embodiments, the obtaining the location information of the target object in the current frame image based on the first location and the prediction feature of the target object in the current frame image includes:

determining, based on the first position, a search area for the target object in the current frame image and a fifth feature corresponding to the search area;

taking the prediction feature as a convolution kernel, and executing second cross-correlation coding processing on the fifth feature to obtain a second coding feature;

and executing target detection processing for the target object based on the second coding feature to obtain the position information of the target object in the current frame image.

In some possible embodiments, determining a search area for the target object in the current frame image based on the first position includes:

and amplifying the first position by a preset multiple by taking the first position as a center to obtain a search area aiming at the target object in the current frame image.
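The enlargement step above can be sketched as follows, assuming the first position is an axis-aligned box given by two diagonal vertices (x1, y1, x2, y2). Clipping the result to the image bounds is an added assumption, and the name `search_area` and the parameter `scale` (the preset multiple) are hypothetical.

```python
def search_area(box, scale, image_w, image_h):
    """Enlarge `box` about its own center by `scale`, clipped to the image.

    box is (x1, y1, x2, y2); the return value uses the same layout.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * scale / 2.0
    half_h = (y2 - y1) * scale / 2.0
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(image_w), cx + half_w),
        min(float(image_h), cy + half_h),
    )
```

For example, with `scale=2` a 20 x 40 box centered at (50, 60) becomes a 40 x 80 region with the same center.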

In some possible implementations, the second cross-correlation encoding process of the fifth feature is performed with the prediction feature as a convolution kernel, including:

and taking the prediction feature as a convolution kernel, and executing convolution processing on the fifth feature to obtain the second coding feature.
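Using a feature as a convolution kernel can be sketched as follows. The sketch assumes the fifth feature and the prediction feature are arrays of shape (C, H, W) and (C, kh, kw), with no padding and a stride of 1, and it applies the kernel without flipping, as is usual for convolution layers in deep learning frameworks. The function name `correlate_with_kernel` is hypothetical.

```python
import numpy as np


def correlate_with_kernel(search_feature, kernel):
    """Slide `kernel` (C, kh, kw) over `search_feature` (C, H, W) without padding.

    Returns a response map of shape (H - kh + 1, W - kw + 1). Each entry is the
    sum of elementwise products between the kernel and the window under it.
    """
    c, h, w = search_feature.shape
    kc, kh, kw = kernel.shape
    if c != kc:
        raise ValueError("channel counts must match")
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search_feature[:, i:i + kh, j:j + kw] * kernel)
    return out
```

High values in the response map indicate windows of the search area that resemble the prediction feature, which is the information a detection stage could then use to locate the target.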

In some possible implementations, the performing, based on the second coding feature, a target detection process of the target object, to obtain location information of the target object in the current frame image includes:

inputting the second coding feature into a target detection network to obtain the position information of the target object in the search area.

In some possible embodiments, the target tracking method is applied in a twin (Siamese) neural network comprising a first branch network, a second branch network, a feature update network and a target detection network, wherein the first branch network and the second branch network are identical;

the first branch network is used for detecting a second position of the target object in the initial frame image and a second characteristic corresponding to the second position;

the second branch network is used for detecting a first position of the target object in the previous frame image of any current frame image after the initial frame image, and a first feature corresponding to the first position;

the feature updating network is used for obtaining prediction features based on the initial frame image and the previous frame image of the current frame image;

the target detection network is used for obtaining the position information of the target object in the current frame image based on the first position and the prediction characteristic of the current frame image.

In some possible embodiments, the method further comprises:

the position information of the target object is highlighted in an image frame of the video stream.

According to a second aspect of the present disclosure, there is provided a target tracking apparatus comprising:

the detection module is used for acquiring a first position of a target object in a previous frame image of a current frame image aiming at any current frame image after an initial frame image in a video stream;

The tracking module is used for obtaining the position information of the target object in the current frame image based on the first position and the prediction characteristic of the target object in the current frame image, wherein the prediction characteristic of the target object in the current frame image is obtained based on the initial frame image of the video stream and the previous frame image of the current frame.

In some possible embodiments, the tracking module comprises:

The prediction unit is used for obtaining the prediction characteristic of the current frame image, and the prediction characteristic of the target object in the current frame image is obtained based on the first characteristic corresponding to the first position of the target object in the previous frame image of the current frame image and the second characteristic corresponding to the second position of the target object in the initial frame image.

In some possible implementations, the detection module is further configured to obtain a second location of the target object in the initial frame image, and a second feature corresponding to the second location.

In some possible embodiments, the detecting module obtains a second position of the target object in the initial frame image, including at least one of the following modes:

In some possible implementations, the prediction unit is further configured to perform convolution processing on the first feature and the second feature, to obtain a first transition feature of the first feature, and to obtain a second transition feature of the second feature;

In some possible implementations, the prediction unit is further configured to perform a first cross-correlation encoding process on the first transition feature and the second transition feature to obtain a first encoding feature;

In some possible implementations, the prediction unit is further configured to perform a matrix multiplication operation on the first transition feature and the second transition feature to obtain the first coding feature.

In some possible implementations, the prediction unit is further configured to perform a cross-correlation decoding process of the third feature based on the first transition feature, to obtain a fourth feature;

In some possible implementations, the tracking module further includes a tracking unit configured to determine, based on the first location, a search area for the target object in the current frame image, and a fifth feature corresponding to the search area;

In some possible implementations, the tracking unit is further configured to amplify the first position by a preset multiple with the first position as a center, to obtain a search area for the target object in the current frame image.

In some possible implementations, the tracking unit is further configured to perform convolution processing on the fifth feature with the prediction feature as a convolution kernel, to obtain the second encoded feature.

In some possible implementations, the tracking unit is further configured to input the second encoding feature to a target detection network, and obtain location information for the target object in the search area.

In some possible embodiments, the target tracking device comprises a twin neural network, the detection module comprises a first branch network and a second branch network of the twin neural network, the tracking module comprises a feature update network and a target detection network of the twin neural network, and the first branch network and the second branch network are the same;

In some possible implementations, a display module is used to highlight the location information of the target object in an image frame of the video stream.

According to a third aspect of the present disclosure, there is provided an electronic device comprising:

A processor;

a memory for storing processor-executable instructions;

Wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of the first aspects.

According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of the first aspects.

In the embodiments of the present disclosure, the positions of the target object in subsequent images may be obtained in sequence, starting from the position information of the target object in the initial frame image. For any current frame image, the prediction feature of the target object may be obtained from the initial frame image and the previous frame image of the current frame image, and the position of the target object in the current frame image may then be determined from the first position in the previous frame image and the obtained prediction feature. The target object can thus be tracked accurately through effective forward propagation, while drastic changes in the appearance of the object can be adapted to quickly.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.

FIG. 1 illustrates a flow chart of a target tracking method according to an embodiment of the present disclosure;

FIG. 2 illustrates a flowchart of obtaining the prediction feature of a target object for step S20 of a target tracking method according to an embodiment of the present disclosure;

FIG. 3 shows a flowchart of step S32 in a target tracking method according to an embodiment of the present disclosure;

FIG. 4 shows a schematic structural diagram of deriving predictive features in accordance with an embodiment of the present disclosure;

FIG. 5 shows a flowchart of step S20 in a target tracking method according to an embodiment of the present disclosure;

FIG. 6 shows a schematic diagram of a process for implementing target tracking in accordance with an embodiment of the present disclosure;

FIG. 7 illustrates a block diagram of a target tracking device, according to an embodiment of the present disclosure;

FIG. 8 illustrates a block diagram of an electronic device, according to an embodiment of the present disclosure;

FIG. 9 illustrates another block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may represent: A exists alone, A and B exist together, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality. For example, "including at least one of A, B and C" may mean including any one or more elements selected from the group consisting of A, B and C.

Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.

Embodiments of the present disclosure provide a target tracking method that may be used to track a target object in successive image frames. The method may be applied to any image processing apparatus. For example, the target tracking method may be performed by a terminal device, a server, or another processing device, where the terminal device may be user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like. In some possible implementations, the method may be implemented by a processor invoking computer-readable instructions stored in a memory.

Fig. 1 shows a flowchart of a target tracking method according to an embodiment of the present disclosure. As shown in Fig. 1, the target tracking method includes:

S10: for any current frame image after the initial frame image in the video stream, acquiring a first position of a target object in the previous frame image of the current frame image.

Embodiments of the present disclosure may be used to track a target object in a video stream, which may be any type of object, such as a particular person, animal, or any other object appearing in an image, and the type of target object is not particularly limited by the present disclosure, and may be determined according to a particular application purpose.

In some possible implementations, target object tracking may be performed on multiple frames of images selected from the video stream, or all images in the video stream may be used directly as the multiple frames for target object tracking. The frames may be ordered by time.

In some possible embodiments, the position of the target object in one frame image may be used to predict its position in the next frame image. Therefore, when detecting the target object in any current frame image after the initial frame image in the video stream, the detection result for the previous frame image of the current frame image, that is, the first position of the target object in the previous frame image, may be obtained first, and the position of the target object in the current frame image may then be predicted from the first position.

The position of the target object in the initial frame image of the video stream may be obtained first, either through target detection or from user input, which is not limited in the present disclosure. The position of the target object in the second frame image is then predicted from its position in the initial frame image, and so on, to obtain the positions of the target object in the remaining frame images.

S20: obtaining the position information of the target object in the current frame image based on the first position and the prediction feature of the target object in the current frame image, wherein the prediction feature of the target object in the current frame image is obtained based on the initial frame image of the video stream and the previous frame image of the current frame image.

In some possible embodiments, the search area corresponding to the first position in the current frame image may be determined based on the first position of the target object in the previous frame image of the current frame image, and the position area matched with the prediction feature may be determined in the search area according to the prediction feature, that is, the position of the target object in the current frame image.

Based on the above configuration, the embodiments of the present disclosure may predict the position of the target object in the current frame image from the position of the target object in the previous frame image and the obtained prediction feature of the target object in the current frame image, so that the target object can be tracked quickly in the current frame through forward propagation.

Embodiments of the present disclosure are described in detail below with reference to the attached drawings.

Embodiments of the present disclosure may first obtain the position of the target object in the initial frame image (the second position) and the second feature corresponding to the second position. In some possible embodiments, the second position may be expressed as the coordinates of two diagonal vertices of a rectangular frame corresponding to the position area of the target object, or as the coordinates of one vertex together with length and width information. Either form determines the position area corresponding to the target object in the initial frame image. In other embodiments, the second position, and position information about the target object in general, may be represented in other forms, which is not specifically limited in the present disclosure.
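The two rectangle representations mentioned above carry the same information and can be converted into each other. The following sketch illustrates this, assuming image coordinates in which the stored vertex is the top-left corner; the names `Box`, `box_from_corners` and `box_to_corners` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle stored as one vertex (top-left) plus width and height."""
    x: float
    y: float
    w: float
    h: float


def box_from_corners(x1: float, y1: float, x2: float, y2: float) -> Box:
    """Build a Box from two diagonal vertices given in either order."""
    left, top = min(x1, x2), min(y1, y2)
    return Box(left, top, abs(x2 - x1), abs(y2 - y1))


def box_to_corners(box: Box) -> tuple:
    """Return (x1, y1, x2, y2): the top-left and bottom-right vertices."""
    return (box.x, box.y, box.x + box.w, box.y + box.h)
```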

The method for acquiring the second position of the target object in the initial frame image of the video stream may include at least one of the following:

a) Acquiring a position mask map for the target object in the initial frame image, and determining a second position of the target object based on the mask map;

In some possible embodiments, the mask map may be represented as a matrix with the same dimensions as the initial frame image, with each mask value corresponding one-to-one to a pixel in the initial frame image. Each mask value is either a first code value or a second code value, where the first code value marks the area where the target object is located. For example, the first code value may be "1" and the second code value may be "0", in which case the set of pixels with mask value "1" is the position area of the second position. The mask map may be input by the user or obtained by a target object detection operation.

b) Receiving a frame selection operation for the initial frame image, and determining a second position of the target object based on a position area corresponding to the frame selection operation;

In some possible embodiments, a frame selection operation on the initial frame image may be received through an input component, such as a mouse, a touch pad, a keyboard, or another device capable of receiving the frame selection operation. The frame selection operation selects the area where the target object is located in the initial frame image; it yields a selection area, and the position corresponding to the selection area is the second position.

The selection area obtained by the frame selection operation may be a regular rectangle, in which case the second position is the position of that selection area. Alternatively, the selection area may be an irregular shape, in which case the smallest rectangular area containing the irregular shape may be determined, and the second position is the position of that rectangular area.

c) Executing a target detection operation on the initial frame image, and determining a second position of the target object based on a detection result of the target detection operation.

In some possible embodiments, the initial frame image may be input to a neural network capable of detecting the target object, for example Mask R-CNN (a convolutional neural network that predicts object masks), to obtain a mask map of the position of the target object, from which the second position is determined.
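For the mask-based options above, one way to turn a mask map into a rectangular second position is to take the bounding box of all pixels marked "1". The sketch below assumes the mask is a list of rows with one value per pixel and that a rectangular position is wanted; the function name `mask_to_box` is hypothetical.

```python
def mask_to_box(mask):
    """Return (x_min, y_min, x_max, y_max) covering every pixel whose mask value is 1.

    `mask` is a list of rows, one value per pixel. Returns None when no pixel
    is marked, i.e. the mask contains no target object.
    """
    rows = [r for r, row in enumerate(mask) if any(v == 1 for v in row)]
    cols = [c for row in mask for c, v in enumerate(row) if v == 1]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))
```

The same reduction applies to an irregular frame selection: the smallest rectangle containing the selected pixels is the box returned here.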

After the second position is obtained, the second feature corresponding to the second position of the target object in the initial frame image may be obtained. The image region corresponding to the second position may be cropped from the initial frame image and feature extraction performed on that region to obtain the second feature. Alternatively, feature extraction may be performed on the whole initial frame image to obtain its image features, and the second feature corresponding to the second position may then be taken from those image features. The first feature, the second feature, the first transition feature, the second transition feature, the third feature, the fourth feature and the fifth feature in the embodiments of the present disclosure each represent image features of the target object. By detecting these features and applying fusion, optimization and other processing to them, more accurate feature information can be obtained, so that the position of the target object in each frame image can be detected more accurately.

Further, in the case of obtaining the second position of the target object and the corresponding second feature in the initial frame image, the positions of the target objects in the remaining frame images may be sequentially obtained in the order of the image frames. The prediction characteristics of the target object in the current frame image can be obtained according to the initial frame image and the previous frame image of the current frame image.

Fig. 2 shows a flow chart of obtaining predicted features of a target object in a target tracking method according to an embodiment of the disclosure. As shown in fig. 2, the obtaining the prediction feature of the current frame image includes:

S31: obtaining a second feature corresponding to the second position in the initial frame image and a first feature corresponding to the first position in the previous frame image of the current frame image;

In some possible embodiments, when predicting the characteristics of the target object in any current frame image after the initial frame image in the video stream, the characteristics of the target object in any current frame image may be predicted according to the detection result (i.e. the second position) of the target object in the initial frame and the detection result (i.e. the first position) of the target object in the previous frame image before the current frame image.

For example, feature extraction may be performed on the image area corresponding to the second position in the initial frame image and on the image area corresponding to the first position in the previous frame image, to obtain the corresponding feature information, that is, the second feature and the first feature. Alternatively, feature extraction may be performed on the initial frame image and the previous frame image respectively, after which the second feature corresponding to the second position is taken from the image features of the initial frame image and the first feature corresponding to the first position is taken from the image features of the previous frame image. The prediction feature of the current frame image is then obtained from the first feature and the second feature.

The feature extraction may be performed through a residual network, resulting in a first feature and a second feature, respectively. In other embodiments the feature extraction process may also be performed by other feature extraction networks.

When the current frame image is the second frame image, the first position of the target object in the previous frame image of the current frame image is the second position of the target object in the initial frame image, and correspondingly the first feature of the first position is the second feature of the second position. That is, for the second frame image in the video stream (the frame immediately after the initial frame image), the previous frame image is the initial frame image, so the first position is the second position and the first feature is the second feature. The prediction feature of the target object in the second frame image may therefore be determined from the second position of the target object in the initial frame image and its corresponding second feature. For the n-th frame image after the second frame image, the prediction feature may be obtained from the second feature corresponding to the second position of the target object in the initial frame image and the first feature corresponding to the first position of the target object in the (n-1)-th frame image, where n is an integer greater than 2 denoting the index of the current frame.

S32: and obtaining the prediction characteristic of the target object in the current frame image based on the first characteristic of the target object in the previous frame image of the current frame image and the second characteristic corresponding to the second position of the target object in the initial frame image.

In some possible embodiments, the predicted feature of the target object in the current frame image may be predicted based on the second feature of the target object in the initial frame image and the first feature of the target object in the previous frame image of the current frame image. The predicted feature may be obtained, for example, by means of a cross-correlation process, a convolution process, or the like for the first feature and the second feature.

Fig. 3 shows a flowchart of step S32 in a target tracking method according to an embodiment of the present disclosure. The obtaining the prediction feature of the target object in the current frame image based on the first feature of the target object in the previous frame image of the current frame image and the second feature corresponding to the second position of the target object in the initial frame image includes:

S321: respectively performing convolution processing on the first feature and the second feature, and respectively and correspondingly obtaining a first transition feature of the first feature and a second transition feature of the second feature;

In some possible embodiments, convolution processing may be performed on the first feature and on the second feature to obtain a first transition feature corresponding to the first feature and a second transition feature corresponding to the second feature. The convolution processing allows the first transition feature to carry more accurate feature information about the target object than the first feature, and allows the second transition feature to carry more accurate feature information about the target object than the second feature. The convolution kernels used for the first feature and the second feature may be the same or different; for example, they may be 1×1 convolution kernels or other convolution kernels.

Fig. 4 shows a schematic structural diagram of deriving predictive features according to an embodiment of the disclosure. The second feature, corresponding to the second position of the target object in the initial frame image, may be denoted F_0, and the first feature, corresponding to the first position of the target object in the previous frame image of any frame image, may be denoted F_{t-1}, where t is the index of the image frame and is a positive integer.

In the embodiment of the present disclosure, the dimensions of the first feature and the second feature are the same and may be expressed as C×W×H, where C represents the number of channels, W represents the width of the feature, and H represents the height of the feature. The first feature and the second feature are each in the form of a matrix. Correspondingly, the first transition feature obtained by the convolution processing may have dimensions C_1×W×H, and the second transition feature may have dimensions C_2×W×H, where C_1 and C_2 denote the number of channels of the corresponding transition feature and may be the same value or different values, and W and H denote the width and height of the transition feature, respectively.
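As a rough illustration of how a 1×1 convolution changes only the channel dimension, the sketch below maps a C×W×H feature to a C_1×W×H or C_2×W×H transition feature using plain Python lists. The helper names `conv1x1` and `shape`, the toy inputs, and the weight values are assumptions made for the example; a real implementation would use learned weights in a deep learning framework.

```python
def conv1x1(feature, weight):
    """Apply a 1x1 convolution: mix channels independently at each location.

    feature: nested list with shape C x W x H.
    weight:  nested list with shape C_out x C (illustrative values only).
    Returns a nested list with shape C_out x W x H.
    """
    channels = len(feature)
    width, height = len(feature[0]), len(feature[0][0])
    return [
        [
            [sum(row[c] * feature[c][i][j] for c in range(channels))
             for j in range(height)]
            for i in range(width)
        ]
        for row in weight
    ]


def shape(tensor):
    """Return (channels, width, height) of a nested-list tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


# Toy features with C = 2, W = 2, H = 3.
first_feature = [[[1, 2, 3], [4, 5, 6]], [[1, 0, 1], [0, 1, 0]]]   # F_{t-1}
second_feature = [[[2, 2, 2], [2, 2, 2]], [[0, 1, 0], [1, 0, 1]]]  # F_0

w1 = [[1, 0], [0, 1], [1, 1]]  # C_1 = 3 output channels
w2 = [[1, 1], [1, -1]]         # C_2 = 2 output channels

first_transition = conv1x1(first_feature, w1)    # shape C_1 x W x H
second_transition = conv1x1(second_feature, w2)  # shape C_2 x W x H
```

The spatial size W×H is unchanged; only the channel count moves from C to C_1 or C_2.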

S322: performing first cross-correlation coding processing and graph convolution processing on the first transition feature and the second transition feature to obtain a third feature;

Once the first transition feature and the second transition feature are obtained, a first cross-correlation encoding process (cross correlation) and a graph convolution process (conv1d and conv2d) may be performed on them to fuse their feature information and obtain a third feature, where the dimension of the third feature may be denoted as C_2×C_1.

In some possible embodiments, the first cross-correlation encoding process may be expressed as a matrix multiplication, that is, the first cross-correlation encoding may be performed by multiplying the first transition feature and the second transition feature as matrices to obtain a corresponding third transition feature E, whose dimension is C_2×C_1. The third transition feature is then input into the graph neural network, which performs the graph convolution processing to obtain the third feature. The third feature also has a dimension of C_2×C_1. The graph neural network of the embodiment of the present disclosure may perform the graph convolution processing (conv1d and conv2d) on the third transition feature twice to obtain the third feature E^ref; in other embodiments the graph convolution may be performed a different number of times, which is not specifically limited in this disclosure.
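The encoding step can be sketched as a matrix product of the spatially flattened transition features, which yields a C_2×C_1 matrix. The graph convolution in the sketch is only an assumed stand-in: it mixes the matrix once along each axis with a weight matrix, since the disclosure does not give the exact form of conv1d and conv2d. All function names and weight values are hypothetical.

```python
def flatten_spatial(feature):
    """Reshape C x W x H nested lists into C rows of length W*H."""
    return [[v for column in channel for v in column] for channel in feature]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cross_correlation_encode(first_transition, second_transition):
    """Third transition feature E with shape C_2 x C_1."""
    t1 = flatten_spatial(first_transition)   # C_1 x (W*H)
    t2 = flatten_spatial(second_transition)  # C_2 x (W*H)
    return matmul(t2, transpose(t1))


def graph_convolution(e, node_weight, channel_weight):
    """Assumed two-step update: one mixing step per axis of E."""
    mixed = matmul(e, node_weight)        # (C_2 x C_1) . (C_1 x C_1)
    return matmul(channel_weight, mixed)  # (C_2 x C_2) . (C_2 x C_1)


# Toy transition features: C_1 = 3, C_2 = 2, W = 1, H = 2.
first_transition = [[[1, 0]], [[0, 1]], [[1, 1]]]
second_transition = [[[1, 2]], [[3, 4]]]

encoded = cross_correlation_encode(first_transition, second_transition)
identity3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
third_feature = graph_convolution(encoded, identity3, [[1, 0], [0, 2]])
```

Both `encoded` and `third_feature` keep the C_2×C_1 shape, matching the dimensions stated in the text.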

S323: and obtaining the prediction feature based on feature fusion processing of the third feature, the first transition feature and the second feature.

In some possible embodiments, feature fusion may first be performed on the third feature and the first transition feature. That is, cross-correlation decoding may be performed on the third feature using the first transition feature to obtain a decoded feature, and convolution processing may then be performed on the decoded feature to obtain a fourth feature M', in which the feature information of the first transition feature and the third feature is fused. The dimensions of the fourth feature are the same as the dimensions of the second feature. The cross-correlation decoding of the third feature using the first transition feature may be carried out by performing convolution processing on the first transition feature and the third feature to obtain the decoded feature.

The fourth feature and the second feature are then added, for example by adding the feature values of corresponding elements, to obtain the predicted feature F^final, which thereby further incorporates the feature information of the second feature. The predicted feature can be used to represent the feature information of the target object in the current frame image.

The dimension of the fourth feature may be the same as the dimension of the second feature, that is, C×W×H. Alternatively, in some embodiments, the feature obtained by the feature fusion of the third feature and the first transition feature, that is, by the convolution processing of the third feature and the first transition feature, may be an intermediate feature with dimension C_2×W×H; further convolution processing of the intermediate feature then yields the fourth feature, that is, a feature with dimension C×W×H.

The fourth feature and the second feature are then added to obtain the predicted feature. The dimension of the predicted feature is also C×W×H.
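A minimal sketch of the decoding and residual addition, under the assumption that decoding is a matrix product of the third feature (C_2×C_1) with the flattened first transition feature (C_1×WH) and that the following convolution is a 1×1 convolution written as a C×C_2 matrix. The names `predict_feature` and `output_weight` and all numbers are illustrative only.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def predict_feature(third_feature, first_transition_flat, second_feature_flat,
                    output_weight):
    """Decode, project back to C channels, and add the second feature.

    third_feature:         C_2 x C_1
    first_transition_flat: C_1 x (W*H)
    second_feature_flat:   C   x (W*H)
    output_weight:         C   x C_2 (assumed 1x1 convolution weights)
    """
    intermediate = matmul(third_feature, first_transition_flat)  # C_2 x (W*H)
    fourth = matmul(output_weight, intermediate)                 # C   x (W*H)
    # Element-wise addition with the second feature gives the predicted feature.
    return [[f + s for f, s in zip(f_row, s_row)]
            for f_row, s_row in zip(fourth, second_feature_flat)]


third_feature = [[1, 2, 3], [3, 4, 7]]              # C_2 = 2, C_1 = 3
first_transition_flat = [[1, 0], [0, 1], [1, 1]]    # C_1 = 3, W*H = 2
second_feature_flat = [[1, 1], [2, 2]]              # C = 2,  W*H = 2
output_weight = [[1, 0], [0, 1]]                    # identity, for readability

predicted = predict_feature(third_feature, first_transition_flat,
                            second_feature_flat, output_weight)
```

The result has the same C×(W·H) layout as the second feature, consistent with the stated C×W×H dimension of the predicted feature.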

In case of obtaining the prediction feature of the current frame, the position of the target object in the current frame image can be detected according to the prediction feature.

Fig. 5 shows a flowchart of step S20 in a target tracking method according to an embodiment of the present disclosure. The obtaining the position information of the target object in the current frame image based on the first position and the prediction feature of the target object in the current frame image includes:

S201: determining a search area for the target object in the current frame image and a fifth feature corresponding to the search area based on the first position;

In some possible embodiments, the search area in the current frame image with respect to the target object may be determined according to a first position of the target object in a frame image previous to the current frame image. The position area corresponding to the first position can be amplified according to a preset multiple, and the amplified position area can be used as a search area in the current frame image. By this configuration, it is possible to ensure that the target object is within the search area.

The preset multiple may be a preset value, which may be determined according to the type of the target object or the application scenario, for example, may be 2, and in other embodiments, may also be other values.
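The enlargement of the first position into a search area can be sketched as below. The box format `(center_x, center_y, width, height)`, the function name `search_area`, and the optional clamping to the image bounds are assumptions added for the example.

```python
def search_area(first_position, multiple=2.0, image_size=None):
    """Enlarge the previous-frame box about its center by a preset multiple.

    first_position: (center_x, center_y, width, height) in the previous frame.
    image_size:     optional (image_width, image_height) used for clamping.
    Returns the search area as (x_min, y_min, x_max, y_max).
    """
    cx, cy, w, h = first_position
    half_w, half_h = w * multiple / 2, h * multiple / 2
    x_min, y_min, x_max, y_max = cx - half_w, cy - half_h, cx + half_w, cy + half_h
    if image_size is not None:
        image_w, image_h = image_size
        x_min, y_min = max(0.0, x_min), max(0.0, y_min)
        x_max, y_max = min(float(image_w), x_max), min(float(image_h), y_max)
    return (x_min, y_min, x_max, y_max)
```

For example, `search_area((50, 40, 20, 10))` returns `(30.0, 30.0, 70.0, 50.0)`, a region twice as wide and twice as tall as the original box.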

In some possible embodiments, after determining the search area, the fifth feature corresponding to the search area may be obtained, where feature extraction processing of an image of the search area may be performed by using the feature extraction network to obtain the fifth feature of the target object, or feature extraction processing may be performed on the current frame image, so that feature information corresponding to the search area, that is, the fifth feature, is selected from image features of the current frame image.

After the feature information of the search area is obtained, the embodiment of the disclosure may match the predicted feature against the feature information of the search area. The feature extraction here may also be implemented through a residual network, which is not specifically limited in this disclosure.

S202: taking the prediction feature as a convolution kernel, and executing second cross-correlation coding processing of the fifth feature to obtain a second coding feature;

Once the fifth feature and the predicted feature are obtained, cross-correlation encoding may be performed on the fifth feature and the predicted feature to obtain the second coding feature. Specifically, the cross-correlation encoding may be performed by using the predicted feature as a convolution kernel and convolving it over the fifth feature, thereby obtaining the second coding feature. The dimensions of the second coding feature are the same as the dimensions of the fifth feature. The second coding feature of the embodiments of the present disclosure may represent the degree of matching between the predicted feature and each pixel in the search area.
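The matching step can be illustrated with a small channel-wise cross-correlation in which the predicted feature slides over the search-area feature. Zero padding is assumed here so that the output keeps the size of the fifth feature, as the text states; the per-channel (depthwise) form and the function name are also assumptions, not details given in the disclosure.

```python
def depthwise_cross_correlation(search, kernel):
    """Slide each kernel channel over the matching search channel.

    search: C x rows x cols nested list (fifth feature).
    kernel: C x k_rows x k_cols nested list (predicted feature, odd sizes).
    Zero padding keeps the output the same size as `search`.
    """
    output = []
    for s_ch, k_ch in zip(search, kernel):
        rows, cols = len(s_ch), len(s_ch[0])
        k_rows, k_cols = len(k_ch), len(k_ch[0])
        pad_r, pad_c = k_rows // 2, k_cols // 2
        channel_out = []
        for i in range(rows):
            row_out = []
            for j in range(cols):
                total = 0
                for u in range(k_rows):
                    for v in range(k_cols):
                        r, c = i + u - pad_r, j + v - pad_c
                        if 0 <= r < rows and 0 <= c < cols:
                            total += k_ch[u][v] * s_ch[r][c]
                row_out.append(total)
            channel_out.append(row_out)
        output.append(channel_out)
    return output


# One-channel toy example: the search feature contains the template pattern.
plus = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
encoded = depthwise_cross_correlation([plus], [plus])
```

The response peaks at the center, where the template and the search feature align, which is the sense in which the output expresses a degree of matching at each location.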

S203: and executing target detection processing of the target object based on the second coding feature to obtain the position information of the target object in the current image.

Once the second coding feature is obtained, target detection processing may be performed on it. The embodiment of the disclosure may perform the target detection processing by using a region candidate network, so as to obtain a candidate frame of the target object corresponding to the second coding feature, that is, the position of the target object.

In some possible implementations, multiple candidate boxes may be available for the target object, and embodiments of the present disclosure may determine the location of the target object based on the candidate box with the highest confidence. The target detection processing can be realized through a region candidate network, and the position of a candidate frame aiming at a target object is obtained.
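Selecting the position from several candidate boxes reduces to taking the one with the highest confidence, as in the sketch below. The pair format `(box, confidence)` and the function name are assumptions for illustration.

```python
def select_target_box(candidates):
    """Return the candidate box with the highest confidence.

    candidates: list of (box, confidence) pairs, where box is any
                representation of a candidate frame, e.g. (x, y, w, h).
    """
    if not candidates:
        raise ValueError("at least one candidate box is required")
    box, _confidence = max(candidates, key=lambda item: item[1])
    return box
```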

According to the embodiment of the disclosure, the position of the target object in each image frame of the video stream can be obtained in a forward propagation mode, so that the target object can be tracked rapidly and accurately.

In some possible embodiments, when the position of the target object in the image is detected, the position information of the target object may be further highlighted, for example, the position area where the target object is located is marked in a detection frame manner, so that the area where the target object is located can be conveniently known, and the highlighting manner is not specifically limited in this disclosure.

The following illustrates the process of object tracking for a clearer display of the disclosed embodiments. Fig. 6 shows a schematic diagram of a process for implementing target tracking according to an embodiment of the present disclosure.

The target tracking method of the embodiments of the present disclosure may be implemented through a twin (Siamese) neural network; Fig. 6 is a schematic diagram of the network architecture. The twin neural network may include a first branch network, a second branch network, a feature update network, and a target detection network, where the first branch network and the second branch network are the same. The first branch network is used to obtain the second position of the target object in the initial frame image and the second feature corresponding to the second position; the second branch network is used to obtain the first position of the target object in the previous frame image of any current frame image after the initial frame image, and the first feature corresponding to the first position; the feature update network is used to obtain the prediction feature based on the initial frame image and the previous frame image of the current frame image; and the target detection network is used to obtain the position information of the target object in the current frame image based on the first position and the prediction feature of the current frame image. A third branch network may further be included, which is used to obtain the fifth feature corresponding to the search area of the current frame image and which may be identical to the first branch network and the second branch network. For any frame image after the initial frame image of the video stream (hereinafter referred to as the current frame image), the position information of the target object in the current frame image may be determined based on the position of the target object in the initial frame image and the position of the target object in the previous frame image of the current frame image.

Specifically, feature extraction is first performed, through the first branch network and the second branch network respectively, on the image area corresponding to the second position in the initial frame image and on the image area corresponding to the first position in the previous frame image, to obtain the corresponding second feature and first feature. The first branch network and the second branch network may each be implemented as a feature extraction network for the target object. The feature extraction network may include a residual module (Res) and a convolution module (T). The residual module may be formed by a residual neural network, for example ResNet-18; residual processing of the image area at the second position and of the image area at the first position is performed through the two residual neural networks respectively, and the convolution module then performs a convolution operation on the result of the residual processing, so as to obtain more accurate second and first features of the target object. Through the residual processing and the convolution processing, the feature information of the target object in the image areas corresponding to the first position and the second position can be extracted more accurately. In other embodiments, feature extraction may also be implemented through the residual network alone or through other feature extraction networks.

Once the first feature and the second feature are obtained, they are processed by the feature update network (such as the Template Update Module shown in fig. 6) to obtain the predicted feature of the target object in the current frame image. Here, the convolution processing, the first cross-correlation encoding, and the graph convolution processing may be performed on the first feature and the second feature, followed by feature fusion to obtain the predicted feature (refer to the embodiment shown in fig. 4). The specific process is described in the above embodiment and is not repeated here.

Once the predicted feature is obtained, a search area of the current frame image can be determined based on the first position; feature extraction is performed on the image of the search area through the third branch network to obtain the feature corresponding to the search area; and the final position of the target object in the current frame image is obtained through the cross-correlation encoding and target detection of the target detection network, based on the predicted feature and the feature information corresponding to the search area. In this way, the twin network of the embodiments of the present disclosure updates the template feature while taking changes in the appearance of the object into account, and thereby obtains the prediction feature.

The embodiments of the present disclosure mainly comprise the following parts: extracting the target feature template of the initial frame (second feature extraction); extracting the target feature template of the previous frame (first feature extraction); updating the template online (obtaining the predicted feature); extracting the feature of the search area of the current frame; and obtaining the position of the tracked target in the current frame by template matching. The following subsections illustrate the implementation of each module.

Target feature template extraction for initial frame (first branch network):

Input: the coordinate position of the object in the initial frame, and the image of the initial frame;

Output: a target feature template (second feature) of the initial frame;

Specific steps: an image block (the image area corresponding to the second position) is acquired with the object position as the center, and feature extraction is performed on it through a neural network to obtain the feature template (second feature) of the object in the initial frame.

Target feature template extraction (second branch network) of the previous frame:

Input: the coordinate position of the object in the previous frame, and the image of the previous frame;

Output: a target feature template (first feature) of the previous frame;

Specific steps: an image block (the image area corresponding to the first position) is acquired with the object position as the center, and feature extraction is performed on it through a neural network to obtain the feature template (first feature) of the object in the previous frame.
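The image-block acquisition used by both template branches can be sketched as a centered crop. The image is represented as rows of pixel values, and positions outside the image are filled with a constant; the function name `crop_centered`, the `(row, col)` center convention, and the fill behavior are assumptions made for this example.

```python
def crop_centered(image, center, size, fill=0):
    """Cut a block of `size` = (height, width) centered on `center` = (row, col).

    image: list of rows of pixel values.
    Pixels that fall outside the image are replaced by `fill`.
    """
    rows, cols = len(image), len(image[0])
    top = center[0] - size[0] // 2
    left = center[1] - size[1] // 2
    return [
        [image[r][c] if 0 <= r < rows and 0 <= c < cols else fill
         for c in range(left, left + size[1])]
        for r in range(top, top + size[0])
    ]


# Toy 4 x 5 image whose pixel value encodes its own (row, col) as row*10 + col.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
block = crop_centered(image, center=(1, 2), size=(3, 3))
```

The resulting block would then be passed to the feature extraction network to produce the template feature.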

Online template update, obtaining the predicted feature (feature update network):

Input: a feature template (second feature) of the initial frame target, and a feature template (first feature) of the previous frame target;

Output: a feature template (predicted feature) applicable to the current frame;

Specific steps: a convolution layer is applied to the feature template of the initial frame target (second feature) and to the feature template of the previous frame target (first feature), respectively, to perform feature transition (obtaining the second transition feature and the first transition feature). The first cross-correlation encoding is then performed on the two transitioned feature templates by a cross-correlation operation. The resulting first coding feature can be understood as a graph; a graph neural network is then used to realize feature interaction and feature update for each node, with each of the two graph convolution steps implemented by a convolution (obtaining the third feature). The updated third feature is then decoded by a cross-correlation operation into the same feature space as the input of this module, yielding the updated feature information (fourth feature). This information is added to the template feature of the initial frame target to form the output of the module, that is, the updated template feature (predicted feature).

Search area feature extraction of current frame (third branch network):

Input: the coordinate position of the object in the previous frame (the first position), and the image of the current frame;

Output: the search area feature of the current frame.

Specific steps: an image block, that is, the search area, is acquired with the first position as the center (for example, the search area is twice the size of the template), and feature extraction is performed on it through a neural network to obtain the feature of the search area of the current frame.

Template matching to obtain the position of the tracked target in the current frame (target detection network):

Input: the updated feature template (predicted feature), and the feature of the current frame search area;

Output: the position of the target object in the current frame image.

Specific steps: the similarity between the feature template and each position of the search area is computed through a cross-correlation operation (convolution processing). The similarity result is taken as input to two neural network modules, one for classification and one for regression; the area with the highest classification score is taken as the position of the object in the current frame, and this position is corrected through the regression network.
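The final selection and correction can be sketched as follows: the location with the highest classification score is chosen and its box is adjusted by the regression output at that location. Treating the regression output as simple additive offsets `(dx, dy, dw, dh)` is an assumption for illustration; the disclosure does not specify the parameterization, and the function name `locate_target` is hypothetical.

```python
def locate_target(scores, boxes, offsets):
    """Pick the highest-scoring box and apply its regression correction.

    scores:  classification score per candidate location.
    boxes:   (x, y, w, h) per candidate location.
    offsets: (dx, dy, dw, dh) per candidate location (assumed additive).
    """
    best = max(range(len(scores)), key=lambda index: scores[index])
    x, y, w, h = boxes[best]
    dx, dy, dw, dh = offsets[best]
    return (x + dx, y + dy, w + dw, h + dh)


scores = [0.2, 0.9, 0.5]
boxes = [(0, 0, 10, 10), (20, 20, 10, 10), (40, 40, 10, 10)]
offsets = [(0, 0, 0, 0), (1, -2, 2, 0), (0, 0, 0, 0)]
position = locate_target(scores, boxes, offsets)
```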

In the embodiment of the disclosure, the position of the target object in each subsequent image may be obtained in sequence from the position information of the target object in the initial frame image. The predicted feature of the target object in the current frame image is obtained from the previous frame image of the current frame image and the initial frame image, and the position of the target object in the current frame image is determined from the first position in the previous frame image and the obtained predicted feature. The target object can thus be tracked through efficient forward propagation, while drastic changes in the appearance of the object can be adapted to rapidly.
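The overall frame-by-frame flow can be summarized in a short loop. The three callables `extract`, `update_template`, and `detect` are hypothetical stand-ins for the branch networks, the feature update network, and the target detection network; the toy usage replaces them with trivial functions purely to show the data flow.

```python
def track(frames, initial_position, extract, update_template, detect, multiple=2.0):
    """Return the target position in every frame, starting from the initial one.

    extract(frame, position)                    -> template feature
    update_template(second_feature, first_feature) -> predicted feature
    detect(frame, first_position, predicted, multiple) -> position in frame
    """
    positions = [initial_position]
    second_feature = extract(frames[0], initial_position)  # initial-frame template
    first_feature = second_feature  # for the second frame both templates coincide
    for frame in frames[1:]:
        first_position = positions[-1]
        predicted = update_template(second_feature, first_feature)
        position = detect(frame, first_position, predicted, multiple)
        positions.append(position)
        # The template of this frame becomes the "previous frame" template next.
        first_feature = extract(frame, position)
    return positions


# Toy usage: each "frame" is just the true coordinate of the object.
calls = []

def record_update(second_feature, first_feature):
    calls.append((second_feature, first_feature))
    return (second_feature, first_feature)

positions = track(
    [0, 3, 5],
    0,
    extract=lambda frame, position: ("feat", position),
    update_template=record_update,
    detect=lambda frame, first_position, predicted, multiple: frame,
)
```

The recorded calls show that the initial-frame template is reused at every step while the previous-frame template changes from frame to frame.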

It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.

It will be appreciated that the above method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from their principles and logic; for brevity, these combinations are not described further in this disclosure.

In addition, the disclosure further provides a target tracking device, an electronic device, a computer readable storage medium, and a program, each of which may be used to implement any one of the target tracking methods provided in the disclosure. The corresponding technical solutions and descriptions can be found in the corresponding descriptions in the method section and are not repeated here.

Fig. 7 shows a block diagram of a target tracking device according to an embodiment of the present disclosure, as shown in fig. 7, the target tracking device includes:

The detection module 10 is configured to acquire, for any current frame image after an initial frame image in a video stream, a first position where a target object is located in a frame image previous to the current frame image;

The tracking module 20 is configured to obtain position information of the target object in the current frame image based on the first position and a predicted feature of the target object in the current frame image, where the predicted feature of the target object in the current frame image is obtained based on the initial frame image of the video stream and the previous frame image of the current frame image.

In some possible embodiments, the tracking module 20 includes:

In some possible embodiments, the detection module 10 is further configured to obtain a second location of the target object in the initial frame image, and a second feature corresponding to the second location.

In some possible embodiments, the detecting module 10 obtains the second position of the target object in the initial frame image, including at least one of the following manners:

In some possible embodiments, the apparatus further comprises: and a display module for highlighting the position information of the target object in the image frame of the video stream.

In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.

The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a non-volatile computer readable storage medium.

The embodiment of the disclosure also provides an electronic device, which comprises: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the method described above.

The electronic device may be provided as a terminal, server or other form of device.

Fig. 8 shows a block diagram of an electronic device, according to an embodiment of the disclosure. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.

Referring to fig. 8, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.

The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen between the electronic device 800 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.

The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.

In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including computer program instructions executable by processor 820 of electronic device 800 to perform the above-described methods.

Fig. 9 illustrates another block diagram of an electronic device according to an embodiment of the present disclosure. For example, electronic device 1900 may be provided as a server. Referring to FIG. 9, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.

The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.

The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: portable computer disks, hard disks, Random Access Memory (RAM), Read-Only Memory (ROM), Erasable Programmable Read-Only Memory (EPROM or flash memory), Static Random Access Memory (SRAM), portable Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.

The computer program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer readable program instructions, and the electronic circuitry can execute the computer readable program instructions.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
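By way of non-limiting illustration, the data flow among the branch networks, the feature update network, and the target detection network may be sketched as follows. This is only a toy sketch under stated assumptions and is not the implementation of the disclosure: every function name and parameter (`branch_network`, `feature_update`, `detect`, `track`, `size`, `margin`, `alpha`) is hypothetical, the "feature" is simply a raw image patch rather than a learned feature, the feature update is replaced by a plain weighted blend, and the detection step is reduced to taking the peak of a cross-correlation in which the prediction feature acts as the kernel over a search area centred on the first position.

```python
def crop(image, top, left, size):
    """Return a size x size patch; pixels outside the image read as 0."""
    h, w = len(image), len(image[0])
    return [[image[r][c] if 0 <= r < h and 0 <= c < w else 0.0
             for c in range(left, left + size)]
            for r in range(top, top + size)]


def branch_network(image, position, size):
    """Stand-in for a branch network (assumption: feature = raw patch).
    Both branches call this same function, mirroring shared weights."""
    top, left = position
    return crop(image, top, left, size)


def feature_update(first_feature, second_feature, alpha=0.5):
    """Stand-in for the feature update network (assumption: a weighted
    blend of the previous-frame and initial-frame features)."""
    return [[alpha * a + (1 - alpha) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first_feature, second_feature)]


def cross_correlate(search, kernel):
    """Valid-mode 2-D cross-correlation of `search` with a square kernel."""
    k = len(kernel)
    out_h = len(search) - k + 1
    out_w = len(search[0]) - k + 1
    return [[sum(search[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_w)]
            for r in range(out_h)]


def detect(image, first_position, predicted_feature, margin):
    """Stand-in for the target detection network: search around the first
    position and return the location of the strongest response."""
    k = len(predicted_feature)
    top, left = first_position
    search = crop(image, top - margin, left - margin, k + 2 * margin)
    response = cross_correlate(search, predicted_feature)
    _, r, c = max((v, r, c)
                  for r, row in enumerate(response)
                  for c, v in enumerate(row))
    return (top - margin + r, left - margin + c)


def track(frames, initial_position, size=2, margin=2):
    """Track a target through `frames`, given its position in frames[0]."""
    second_feature = branch_network(frames[0], initial_position, size)
    positions = [initial_position]
    for t in range(1, len(frames)):
        first_position = positions[-1]
        first_feature = branch_network(frames[t - 1], first_position, size)
        predicted = feature_update(first_feature, second_feature)
        positions.append(detect(frames[t], first_position, predicted, margin))
    return positions


def make_frame(top, left, n=8, size=2):
    """Synthetic n x n frame: a bright size x size square on a dark background."""
    return [[1.0 if top <= r < top + size and left <= c < left + size else 0.0
             for c in range(n)]
            for r in range(n)]


frames = [make_frame(1, 1), make_frame(2, 2), make_frame(3, 4)]
positions = track(frames, (1, 1))
print(positions)  # [(1, 1), (2, 2), (3, 4)]
```

In this sketch the first position for each current frame is simply the position found for the previous frame, and the initial-frame feature is computed once and reused for every later frame, so the prediction feature always depends on both the initial frame image and the previous frame image.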

Claims

1. A target tracking method, wherein the target tracking method is applied in a twin neural network, the twin neural network comprising a first branch network, a second branch network, a feature update network, and a target detection network, wherein the first branch network and the second branch network are the same, the method comprising:

detecting a second position of a target object in an initial frame image in a video stream and a second feature corresponding to the second position through the first branch network;

for any current frame image after the initial frame image, detecting a first position of the target object in a previous frame image of the current frame image and a first feature corresponding to the first position through the second branch network;

obtaining a prediction feature of the target object in the current frame image based on the first feature corresponding to the first position of the target object in the previous frame image of the current frame image and the second feature corresponding to the second position of the target object in the initial frame image through the feature update network;

and obtaining the position information of the target object in the current frame image based on the first position and the prediction feature through the target detection network.

2. The method of claim 1, wherein detecting a second location of a target object within the initial frame image comprises at least one of:

3. The method according to claim 1, wherein the obtaining the predicted feature of the target object in the current frame image based on the first feature corresponding to the first position of the target object in the previous frame image of the current frame image and the second feature corresponding to the second position of the target object in the initial frame image includes:

4. The method according to claim 3, wherein performing a first cross-correlation encoding process and a graph convolution process on the first and second transition features to obtain a third feature comprises:

5. The method of claim 4, wherein performing a first cross-correlation encoding process on the first transition feature and the second transition feature results in a first encoded feature, comprising:

6. A method according to claim 3, wherein deriving the predicted feature based on a feature fusion process of the third feature, the first transition feature, and the second feature comprises:

7. The method according to any one of claims 1-6, wherein the obtaining location information of the target object in the current frame image based on the first location and a predicted feature of the target object in the current frame image includes:

8. The method of claim 7, wherein determining a search area for the target object in the current frame image based on the first location comprises:

9. The method of claim 7, wherein performing the second cross-correlation encoding process of the fifth feature using the predicted feature as a convolution kernel comprises:

10. The method according to claim 7, wherein the performing the object detection process of the object based on the second encoding feature to obtain the position information of the object in the current frame image includes:

11. The method according to any one of claims 1-6, further comprising:

12. A target tracking device, comprising a detection module and a tracking module, wherein the detection module comprises a first branch network and a second branch network of a twin neural network, the first branch network and the second branch network are the same, and the tracking module comprises a feature update network and a target detection network of the twin neural network;

the first branch network is used for detecting a second position of a target object in an initial frame image in a video stream and a second feature corresponding to the second position;

the second branch network is used for detecting, for any current frame image after the initial frame image, a first position of the target object in a previous frame image of the current frame image and a first feature corresponding to the first position;

the feature update network is used for obtaining the prediction feature of the target object in the current frame image based on the first feature corresponding to the first position of the target object in the previous frame image of the current frame image and the second feature corresponding to the second position of the target object in the initial frame image;

the feature update network is used for obtaining the prediction feature of the target object in the current frame image based on the initial frame image and the previous frame image of the current frame image;

the target detection network is used for obtaining the position information of the target object in the current frame image based on the first position and the prediction feature.

13. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1 to 11.

14. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 11.