CN112241967A - Target tracking method, device, medium and equipment

Info

Publication number
CN112241967A
Authority
CN
China
Prior art keywords
image
feature
image features
features
blocks
Prior art date
Legal status
Granted
Application number
CN201910640796.XA
Other languages
Chinese (zh)
Other versions
CN112241967B (en)
Inventor
胡涛
申晗
黄李超
Current Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Original Assignee
Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Horizon Robotics Technology Research and Development Co Ltd filed Critical Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority to CN201910640796.XA priority Critical patent/CN112241967B/en
Publication of CN112241967A publication Critical patent/CN112241967A/en
Application granted granted Critical
Publication of CN112241967B publication Critical patent/CN112241967B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06T 7/223 Analysis of motion using block-matching
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform


Abstract

A target tracking method, apparatus, medium, and device are disclosed. The method comprises the following steps: determining a plurality of image areas in a current frame according to position information of a target in a historical frame before the current frame, to obtain a plurality of first image blocks; respectively extracting multi-level image features of a second image block in at least one historical frame before the current frame and of the plurality of first image blocks, to obtain multi-level second image features corresponding to the second image block and multi-level first image features corresponding to each first image block; performing feature aggregation processing according to the second image features and the first image features to obtain aggregated image features, of multiple levels, corresponding to each first image block; and determining the position information of the target in the current frame according to the correlation between the image features of a reference image block in a reference frame and each aggregated image feature. The method and the device are beneficial to improving the accuracy of target tracking.

Description

Target tracking method, device, medium and equipment
Technical Field
The present disclosure relates to computer vision technologies, and in particular, to a target tracking method, a target tracking apparatus, a storage medium, and an electronic device.
Background
The target tracking technology has been applied in various fields such as unmanned driving, navigation, and security. The task of target tracking is generally: given an object and its position in an initial video frame, identify the given object in subsequent video frames of the video sequence and locate its position.
In practical application scenarios, the target tracking technology faces challenges, such as target appearance change, target deformation, target occlusion, image blur caused by target motion, low image resolution, illumination change, and the like, which all affect the accuracy of target tracking. How to quickly and accurately realize target tracking is a technical problem worthy of attention.
Disclosure of Invention
The present disclosure is proposed to solve the above technical problems. Embodiments of the present disclosure provide a target tracking method, a target tracking apparatus, a storage medium, and an electronic device.
According to an aspect of the embodiments of the present disclosure, there is provided a target tracking method, including: determining a plurality of image areas in a current frame according to position information of a target in a historical frame before the current frame to obtain a plurality of first image blocks; respectively extracting multi-level image features of a second image block in at least one historical frame before the current frame and of the plurality of first image blocks, to obtain multi-level second image features corresponding to the second image block and multi-level first image features corresponding to each first image block; performing feature aggregation processing according to the second image features and the first image features to obtain aggregated image features, of multiple levels, corresponding to each first image block; and determining position information of the target in the current frame according to the correlation between image features of a reference image block in a reference frame and each aggregated image feature.
According to another aspect of the embodiments of the present disclosure, there is provided a target tracking apparatus including: an image block processing module, configured to determine a plurality of image areas in the current frame according to the position information of a target in a historical frame before the current frame to obtain a plurality of first image blocks; a trunk feature extraction module, configured to respectively extract multi-level image features of a second image block in at least one historical frame before the current frame and of the first image blocks obtained by the image block processing module, to obtain multi-level second image features corresponding to the second image block and multi-level first image features corresponding to each first image block; a feature processing module, configured to perform feature aggregation processing according to the second image features and the first image features obtained by the trunk feature extraction module to obtain aggregated image features, of multiple levels, corresponding to each first image block; and a target positioning module, configured to determine the position information of the target in the current frame according to the correlation between the image features of a reference image block in a reference frame and the aggregated image features obtained by the feature processing module.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-described object tracking method.
According to still another aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; and the processor is used for reading the executable instruction from the memory and executing the instruction to realize the target tracking method.
Based on the target tracking method and the target tracking device provided by the above embodiments of the present disclosure, feature aggregation processing is performed on the basis of the second image features of multiple levels of the second image block in the historical frame and the first image features of multiple levels of the first image blocks in the current frame, and since the second image features and the first image features can reflect temporal changes of the image features, and the multiple levels of the image features can reflect characteristics of the image features on different scales, the aggregate image features obtained by the present disclosure can be regarded as image features with spatial perception and temporal perception, and the aggregate image features have better expression capability. By utilizing the correlation of the aggregate image characteristics and the image characteristics of the reference image block, the position information of the target in the current frame can be accurately determined. Therefore, the technical scheme provided by the disclosure is beneficial to improving the accuracy of target tracking.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a schematic view of a scenario in which the present disclosure is applicable;
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a target tracking method of the present disclosure;
FIG. 3 is a flowchart illustrating an embodiment of obtaining a plurality of first image blocks according to the present disclosure;
fig. 4 is a schematic flowchart of an embodiment of obtaining an aggregated image feature corresponding to each first image block according to the present disclosure;
FIG. 5 is a flowchart illustrating an embodiment of an aggregation process using weight values according to the present disclosure;
FIG. 6 is a flowchart illustrating an embodiment of the present disclosure for obtaining a first weight value for any first pyramid image feature and a second weight value for any deformed second pyramid image feature;
FIG. 7 is a flowchart illustrating an embodiment of determining position information of a target in a current frame according to correlations between image features of reference image blocks in a reference frame and respective aggregated image features according to the present disclosure;
FIG. 8 is a schematic diagram illustrating one embodiment of the present disclosure for utilizing a neural network to achieve target tracking;
FIG. 9 is a schematic diagram illustrating the structure of one embodiment of the target tracking apparatus of the present disclosure;
fig. 10 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning, nor is the necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more than two and "at least one" may refer to one, two or more than two.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Embodiments of the present disclosure may be implemented in electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with an electronic device, such as a terminal device, computer system, or server, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network pcs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment. In a distributed cloud computing environment, tasks may be performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Overview of the application
In carrying out the present disclosure, the inventors found that: target tracking techniques typically require the use of image features extracted from the current frame to locate the position of a target in the video frame. The shallow feature extracted from the current frame can effectively keep the edge, texture, position information and the like of the target, while the deep feature extracted from the current frame can better perform abstract modeling on the target, and has the characteristic of consistency on class-internal transformation.
In an actual application scene, phenomena such as target appearance change, target deformation, target occlusion, image blurring caused by target motion, low image resolution, and illumination change may occur. Although these phenomena affect the edge, texture, position information, and the like of an object in the shallow features, if the shallow features and deep features of the current frame, together with inter-frame information (such as temporal information and motion information) between the current frame and its historical frames, can be effectively utilized in the target tracking process, the influence of these phenomena on target tracking can be avoided to a certain extent, which is beneficial to improving the accuracy of target tracking.
Exemplary overview
An application scenario of the object tracking technology provided by the present disclosure is described below with reference to fig. 1.
Fig. 1 is a video frame of a video; the picture of the video frame is a motorcycle racing field, and a motorcycle and its driver in the picture are the targets to be tracked. The illumination on the racing field is not ideal, which affects the clarity of the whole picture of the video frame, and the motorcycle is moving very fast, so that part of the picture content in the video frame is blurred.
The position information of the target in the video frame determined using the target tracking technique provided by the present disclosure is shown as target detection frame 100 in fig. 1, and the position information of the target in the video frame determined by other target tracking techniques is shown as target detection frame 101 and target detection frame 102 in fig. 1. Comparing target detection frame 100, target detection frame 101, and target detection frame 102, it can be seen that the target tracking technique provided by the present disclosure has better target tracking accuracy.
Exemplary method
Fig. 2 is a schematic flow chart diagram of an embodiment of a target tracking method according to the present disclosure. As shown in fig. 2, the method of this embodiment includes: s200, S201, S202, and S203. The following describes each step.
S200, determining a plurality of image areas in the current frame according to the position information of the target in the historical frame before the current frame, and obtaining a plurality of first image blocks.
The current frame in the present disclosure refers to a video frame of a current target to be searched in a video, and may also be referred to as a current video frame, a video frame to be processed, or a search frame. Historical frames in this disclosure generally refer to video frames in the video that are earlier in time than the current frame. The position information of the target in the history frame before the current frame in the present disclosure may be generally: position information of the target in a frame previous to the current frame. Of course, the position information of the target in the historical frame before the current frame in the present disclosure may also be: position information of the target in a second frame or a third frame preceding the current frame.
The object in the present disclosure may refer to an object such as a human, an animal, or a vehicle that needs to be subjected to position tracking.
The position information of the target in the history frame in the present disclosure is known information. For example, the position information of the target in the history frame in the present disclosure may be position information formed based on the initialization setting. For another example, the position information of the target in the history frame in the present disclosure may be position information obtained by using the target tracking method of the present disclosure.
The plurality of image areas in the present disclosure are each determined based on the position information of the target in the history frame, that is, each of the plurality of image areas is generally within a certain area range around the target position in the history frame. The first image block in the present disclosure may be referred to as a search image block.
S201, respectively extracting multi-level image features of a second image block in at least one historical frame before the current frame and of the plurality of first image blocks, to obtain multi-level second image features corresponding to the second image block and multi-level first image features corresponding to each first image block.
The second image block in the present disclosure may refer to: and image blocks segmented from the historical frames based on the position information of the target in the historical frames. The second image block in this disclosure may be referred to as a historical image block.
The image features of the various levels in the present disclosure refer to: in the case where image blocks are provided to a neural network (e.g., a trunk feature extraction module) for extracting image features, the image features are formed of multiple layers of image features that are output by different convolutional layers in the trunk feature extraction module, respectively. That is to say, the image features of the multiple levels corresponding to any image block include: and multiple layers of image features, wherein each layer of image features corresponds to one convolution layer in the trunk feature extraction module, and different layers of image features correspond to different convolution layers. In addition, spatial resolutions of image features of different layers in the image features of multiple layers corresponding to any image block are usually different, and channel numbers of image features of different layers in the image features of multiple layers corresponding to any image block are usually different. The image features of multiple layers corresponding to any image block in the present disclosure may be referred to as pyramid image features of the image block. The number of layers of image features included in the pyramid image features may be set according to the actual situation.
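As an illustration of how such multi-level (pyramid) image features can be collected from different convolutional layers, the following is a minimal PyTorch sketch; the backbone architecture, layer counts, and channel widths are assumptions for illustration and are not taken from the patent.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Illustrative trunk (backbone) that returns the feature maps of several
    convolutional stages; each returned tensor is one level of the pyramid
    image features, with deeper stages having lower spatial resolution and
    more channels."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.stage4 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        return [f1, f2, f3, f4]  # a 4-level pyramid of image features

# A 125 x 125 image block yields four feature maps of decreasing size.
pyramid = TinyBackbone()(torch.randn(1, 3, 125, 125))
print([tuple(f.shape) for f in pyramid])
```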
The number of history frames in S201 and the number of history frames in S200 may not be the same. The number of history frames in S200 is generally 1, and the number of history frames in S201 may be one or more. In the case that the number of the history frames in S201 is 1, the present disclosure may obtain second image features (i.e., pyramid image features) of multiple levels corresponding to one history image block in the history frame. In the case that the number of the history frames in S201 is greater than 1, each history frame corresponds to one history image block, and the present disclosure may obtain second image features of multiple layers corresponding to the history image blocks. The second image features of the multiple levels corresponding to the historical image blocks may be referred to as second pyramid image features. The first image features of the plurality of levels corresponding to the search image block may be referred to as first pyramid image features.
And S202, performing feature aggregation according to the second image features and the first image features to obtain aggregated image features corresponding to the first image blocks.
The feature aggregation processing in the present disclosure means: a process of aggregating the image features of different image blocks to form new image features. When feature aggregation processing is performed on the second pyramid image features and the first pyramid image features, the feature aggregation processing is typically performed separately for each layer in the pyramid image features. For example, feature aggregation processing is performed on a first layer of image features in the second pyramid image features and a first layer of image features in the first pyramid image features to obtain first-layer aggregated image features, feature aggregation processing is performed on a second layer of image features in the second pyramid image features and a second layer of image features in the first pyramid image features to obtain second-layer aggregated image features, and so on until a last layer of aggregated image features is obtained. As can be seen, the aggregated image feature in the present disclosure is also a pyramid image feature.
And S203, determining the position information of the target in the current frame according to the correlation between the image features of the reference image block in the reference frame and the features of the aggregated images.
The reference frame in this disclosure typically belongs to the same video as the current frame and the historical frame. The reference frame may be the first frame in the video. The reference frame may also be referred to as a template frame. The reference image block in the reference frame generally refers to an image block including an object, for example, the reference image block may be an image block cut out from the reference frame according to an artificially labeled object bounding box. The image features of the reference image block in this disclosure may not be pyramid image features. The correlation between image features in the present disclosure may be regarded as consistency or correlation between image features, or the like.
According to the method, the feature aggregation processing is carried out on the basis of the second image features of multiple layers of the second image block in the historical frame and the first image features of multiple layers of the first image blocks in the current frame, and the second image features and the first image features can reflect the change of the image features in time, and the multiple layers of the image features can reflect the characteristics of the image features in different scales, so that the aggregated image features obtained by the method can be regarded as image features with space perception and time perception, and have good expression capacity. The present disclosure facilitates accurate determination of location information of an object in a current frame by utilizing the correlation of aggregate image features and image features of reference image blocks. Therefore, the technical scheme provided by the disclosure is beneficial to improving the accuracy of target tracking.
In an alternative example, one implementation of the present disclosure to obtain multiple first image blocks is shown in fig. 3.
In fig. 3, S300, in the current frame, the target detection frame is enlarged around the center point of the target detection frame in the history frame to obtain an enlarged region.
Optionally, the specific position of the target detection frame in the historical frame in the current frame is determined, then the target detection frame is amplified by n times by taking the central point of the target detection frame in the current frame as a center, and the area of the amplified target detection frame is the amplification area. Where n is typically greater than 1, e.g., n is 3.
When the enlargement area exceeds the size of the history frame, the pixel values in the excess area may be padded, for example, by 0.
S301, determining a plurality of sub-regions in the amplified region according to a preset scale factor.
The preset scale factor in the present disclosure may be expressed in the form of the following equation (1):
a^S    (1)
In the above formula (1), a^S is the preset scale factor, S (capital S) takes values in an arithmetic progression, and a and s (lowercase s) may be hyper-parameters of the neural network, that is, a and s are preset known values; for example, a may take a value of 1.03, and s may be a positive integer greater than 2.
Optionally, since S is an arithmetic progression, the present disclosure may obtain a plurality of values a^S using the above formula (1). The present disclosure may scale the length and width of the target detection frame in the history frame by the different values of a^S, take the center point of the enlarged region as the center point of each scaled target detection frame, and determine the position of each scaled target detection frame in the enlarged region, so that each scaled target detection frame corresponds to one sub-region, thereby obtaining a plurality of sub-regions.
S302, the sizes of the sub-areas are respectively adjusted to be preset sizes, and a plurality of first image blocks are obtained.
Optionally, the predetermined size in the present disclosure is a known value set in advance and may be set according to actual requirements; for example, the predetermined size may be 125 × 125.
The target detection frame in the historical frame is amplified, so that the target in the current frame is in the amplification area. By determining a plurality of sub-regions from within the magnified region using the preset scale factor, a plurality of possible target detection frames are predicted for a target in the current frame. By adjusting the size of each sub-region to a predetermined size, subsequent processing of each sub-region is facilitated.
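For illustration, the following sketch shows one way steps S300 to S302 could be implemented; the values n = 3, a = 1.03, and the 125 × 125 output size follow the examples in the text, while the number of scales s and the cropping details are assumptions.

```python
import cv2  # assumed available for cropping and resizing

def first_image_blocks(frame, prev_box, n=3.0, a=1.03, s=5, out_size=125):
    """Sketch of S300-S302: enlarge the previous target box n times around its
    center, derive sub-regions scaled by a**S for S in an arithmetic
    progression, then crop and resize each sub-region to out_size x out_size.

    frame:    H x W x 3 image array of the current frame
    prev_box: (cx, cy, w, h) of the target in the historical frame
    """
    cx, cy, w, h = prev_box
    # Pad the frame so that regions exceeding its borders are filled with 0 (S300).
    pad = int(max(w, h) * n)
    padded = cv2.copyMakeBorder(frame, pad, pad, pad, pad,
                                cv2.BORDER_CONSTANT, value=0)
    blocks = []
    for k in range(-(s // 2), s // 2 + 1):      # arithmetic progression S (S301)
        scale = a ** k
        sw, sh = max(1, int(w * scale)), max(1, int(h * scale))
        x0, y0 = int(cx - sw / 2) + pad, int(cy - sh / 2) + pad
        crop = padded[y0:y0 + sh, x0:x0 + sw]
        blocks.append(cv2.resize(crop, (out_size, out_size)))   # S302
    return blocks
```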
In an optional example, the manner of obtaining the multiple levels of second image features corresponding to each second image block may be: according to the target detection frame (i.e., the position information of the target) in at least one historical frame before the current frame, the second image blocks are cut out from the corresponding historical frames, and the cut-out second image blocks are respectively provided to a neural network (such as the trunk feature extraction module described in the following embodiments) for extracting image features, which performs an image feature extraction operation on each second image block. For any second image block, the present disclosure may obtain the image features output by each of a plurality of layers (e.g., a plurality of convolutional layers) in the trunk feature extraction module; since the spatial resolutions and the numbers of channels of the image features output by these layers are different, the image features output by the plurality of layers form pyramid image features, that is, the second pyramid image features corresponding to the second image block. The second pyramid image features may comprise image features with a number of layers of 3, 4, or more.
In an optional example, the manner of obtaining the multiple levels of first image features corresponding to the first image blocks may be: the first image blocks are respectively provided to a neural network (such as a trunk feature extraction module described in the following embodiments) for extracting image features, and the image feature extraction operation is respectively performed on each first image block through the trunk feature extraction module. Likewise, the first pyramid image feature may contain image features with a number of layers of 3, 4, or more.
Optionally, the number of layers of the first pyramid image feature is the same as the number of layers of the second pyramid image feature, that is, the number of layers of the image feature included in the first pyramid image feature is generally the same as the number of layers of the image feature included in the second pyramid image feature. In addition, the number of channels and the spatial resolution of image features of the same layer in the first pyramid image feature and the second pyramid image feature are generally the same. That is, the channel number and spatial resolution of the top-level image feature in the first pyramid image feature are the same as the channel number and spatial resolution of the top-level image feature in the second pyramid image feature, the channel number and spatial resolution of the next-to-top-level image feature in the first pyramid image feature are the same as the channel number and spatial resolution of the next-to-top-level image feature in the second pyramid image feature, and so on, the channel number and spatial resolution of the bottom-level image feature in the first pyramid image feature are the same as the channel number and spatial resolution of the bottom-level image feature in the second pyramid image feature.
The present disclosure facilitates subsequent feature aggregation processing for the first pyramid image feature and the second pyramid image feature by making the number of layers of image features included in the first pyramid image feature the same as the number of layers of image features included in the second pyramid image feature, and making the number of channels and spatial resolution of image features of the same layer in the first pyramid image feature and the second pyramid image feature the same.
In an alternative example, one embodiment of the present disclosure to obtain the aggregate image feature corresponding to each first image block is shown in fig. 4.
In fig. 4, in S400, a plurality of deformed second image features (i.e., deformed second pyramid image features) are obtained by performing deformation processing on each second image feature (i.e., second pyramid image feature) based on each first image feature (i.e., first pyramid image feature).
Optionally, the deformation process in the present disclosure may refer to: a process of mapping the second image feature into the first image feature. The morphing process may also be considered as feature aligning the first image feature and the second image feature. The deformation process in the present disclosure may also be referred to as an alignment process, a mapping process, or the like.
Optionally, for any second pyramid image feature, the present disclosure may perform deformation processing on each layer of image feature in the second pyramid image feature respectively. Since the deformed second image features in this disclosure are still pyramid image features, the deformed second image features may be referred to as deformed second pyramid image features.
Optionally, the present disclosure may implement the deformation processing of the image features using motion information. That is, the present disclosure may first acquire motion information between each first image feature and each of the plurality of second image features, and then perform deformation processing on each of the plurality of second image features according to the acquired motion information, so as to obtain a plurality of deformed second image features. Performing deformation processing on the second image features by utilizing the motion information allows the deformed second image features to be obtained accurately and conveniently.
Optionally, the motion information in the present disclosure may refer to: inter-frame motion information between two video frames in a video. The present disclosure may acquire inter-frame motion information in various ways; for example, for any first pyramid image feature and any second pyramid image feature, feature splicing is first performed on the first pyramid image feature and the second pyramid image feature based on channel connection to obtain a spliced image feature, that is, a spliced pyramid image feature, and convolution processing is then performed on the spliced pyramid image feature to obtain the inter-frame motion information between the first pyramid image feature and the second pyramid image feature.
Specifically, for any first pyramid image feature and any second pyramid image feature, the spatial resolution and the number of channels of any layer of image features in the first pyramid image feature are respectively the same as those of the same-layer image features in the second pyramid image feature. For any layer of image features (such as the n-th layer of image features), feature splicing is first performed, based on channel connection, on the n-th layer of image features in the first pyramid image feature and the n-th layer of image features in the second pyramid image feature to obtain the spliced image features of the n-th layer; the spliced image features of the n-th layer are then subjected to convolution processing, for example, provided to a convolution processing unit including at least one convolutional layer, and according to the output of the convolution processing unit, the inter-frame motion information between the n-th layer image features in the first pyramid image feature and the n-th layer image features in the second pyramid image feature can be obtained. By adopting this method, the inter-frame motion information between each layer of image features in the first pyramid image feature and the corresponding layer of image features in the second pyramid image feature can be obtained. Obtaining the inter-frame motion information by means of feature splicing and convolution processing makes it possible to obtain the inter-frame motion information conveniently and rapidly, which helps to improve the efficiency of the deformation processing.
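A minimal per-level sketch of the concatenation-and-convolution step, together with one possible way of using the resulting motion information to deform (warp) the second image feature, is given below; the offset-map representation of the motion information and the use of grid_sample for warping are assumptions, since the patent only specifies feature splicing followed by convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEstimator(nn.Module):
    """Concatenate the n-th level first and second image features along the
    channel axis and run a small convolution unit that predicts a 2-channel
    (dx, dy) offset map as the inter-frame motion information."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 2, 3, padding=1),
        )

    def forward(self, first_feat, second_feat):
        return self.conv(torch.cat([first_feat, second_feat], dim=1))

def deform(second_feat, motion):
    """Warp the second image feature with the predicted per-pixel offsets
    (one possible form of the deformation processing)."""
    b, _, h, w = second_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().expand(b, h, w, 2)
    grid = base + motion.permute(0, 2, 3, 1)          # add (dx, dy) offsets
    gx = 2.0 * grid[..., 0] / (w - 1) - 1.0           # normalize to [-1, 1]
    gy = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(second_feat, torch.stack((gx, gy), dim=-1),
                         align_corners=True)
```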
Optionally, the present disclosure may also obtain inter-frame motion information in other manners, for example, obtaining inter-frame motion information based on optical flow; as another example, inter-frame motion information is obtained based on inter-pixel correlation. The present disclosure does not limit the manner in which inter-frame motion information is obtained.
S401, feature aggregation processing is carried out on each first pyramid image feature and the plurality of deformed second pyramid image features respectively, and aggregation image features corresponding to each first image block are obtained.
In the present disclosure, the second pyramid image features are subjected to deformation processing so that they are feature-aligned with the first pyramid image features; therefore, when the first pyramid image features and the deformed second pyramid image features are aggregated, the feature expression of the aggregated image features is more accurate, which is beneficial to improving the accuracy of target tracking.
Optionally, the present disclosure may perform aggregation processing according to respective weight values corresponding to the first image feature and the second image feature. An example of the aggregation processing using the weight value is shown in fig. 5.
In fig. 5, S500, a first weight value of each first pyramid image feature and a second weight value of each deformed second pyramid image feature are obtained according to each first image feature (i.e., the first pyramid image feature) and a plurality of deformed second image features (i.e., the deformed second pyramid image features).
Optionally, the first weight value in this disclosure includes a weight value layer corresponding to each layer of image features of the first pyramid image feature. That is to say, the first weight value in the present disclosure includes a plurality of weight value layers, the number of layers of the weight value layer included in the first weight value is the same as the number of layers of the image feature included in the first pyramid image feature, and the size of each weight layer in the first weight value is generally related to the spatial resolution of the image feature of the corresponding layer in the first pyramid image feature corresponding to the weight layer, so that the plurality of layers of weight values included in the first weight value are pyramid-shaped.
Likewise, the second weight value in the present disclosure includes that each layer of image features of the deformed second pyramid image features corresponds to a weight value layer. That is to say, the second weight value in the present disclosure includes a plurality of weight value layers, the number of layers of the weight value layer included in the second weight value is the same as the number of layers of the image feature included in the deformed second pyramid image feature, and the size of each weight layer in the second weight value is generally related to the spatial resolution of the image feature of the corresponding layer in the deformed second pyramid image feature corresponding to the weight layer, so that the plurality of layers of weight values included in the second weight value are pyramid-shaped.
Optionally, an example of obtaining the first weight value of any first pyramid image feature and the second weight value of any deformed second pyramid image feature by the present disclosure is shown in fig. 6 below.
S501, determining the aggregation image characteristics corresponding to the first image blocks according to the first pyramid image characteristics, the first weight values, the deformed second pyramid image characteristics and the second weight values.
Optionally, the present disclosure may perform feature aggregation processing on the nth layer of image features in the first pyramid image features and the nth layer of image features in the second pyramid image features according to the nth weight layer of the first weight values and the nth weight layer of the second weight values, so as to obtain aggregated nth layer of image features. The spatial resolution and the number of channels of the aggregated nth layer image features are respectively the same as the spatial resolution and the number of channels of the nth layer image features of the first pyramid image features. After feature aggregation processing is performed on all layer image features in the first pyramid image feature and all layer image features in the second pyramid image feature respectively in the above manner, aggregated multilayer image features can be obtained. The aggregated multi-layer image features are also pyramid-shaped, and thus may be referred to as aggregated pyramid image features.
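As an illustrative sketch of the per-level weighted aggregation described above (the tensor shapes and the elementwise weighted-sum form are assumptions; the weights are taken to sum to 1 at every spatial position):

```python
import torch

def aggregate_level(first_feat, deformed_second_feats, w_first, w_seconds):
    """Sketch of aggregating the n-th level image features of one first image
    block with the n-th level deformed second image features, weighted by the
    n-th weight layers of the first and second weight values.

    first_feat:            (C, H, W) n-th level first image feature
    deformed_second_feats: list of (C, H, W) deformed n-th level second features
    w_first:               (H, W) n-th weight layer of the first weight value
    w_seconds:             list of (H, W) n-th weight layers of the second weight values
    """
    agg = first_feat * w_first.unsqueeze(0)
    for feat, w in zip(deformed_second_feats, w_seconds):
        agg = agg + feat * w.unsqueeze(0)
    # Same channel count and spatial resolution as the n-th level first feature.
    return agg

# The aggregated pyramid image feature is obtained by repeating this per level:
# aggregated = [aggregate_level(f, d, wf, ws) for f, d, wf, ws in zip(...)]
```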
According to the method and the device, aggregation processing operation is executed by utilizing the weight value, so that the feature expression of the aggregated image features is more accurate, and the accuracy of target tracking is improved.
In addition, the present disclosure may also perform feature aggregation processing in other ways. For example, the present disclosure may employ an RNN (Recurrent Neural Networks) to perform feature aggregation processing on the first pyramid image features and the deformed second pyramid image features. For another example, the present disclosure may perform feature aggregation processing on the first pyramid image features and the deformed second pyramid image features in a 3D convolution manner. The present disclosure is not limited thereto.
In an alternative example, the present disclosure obtains an example of a first weight value for a first image feature (i.e., a first pyramid image feature) and a second weight value for a warped second image feature (i.e., a warped second pyramid image feature) as shown in fig. 6.
In fig. 6, S600, a first embedded feature of each first image feature (i.e., a first pyramid image feature) and a second embedded feature of each deformed second image feature (i.e., a deformed second pyramid image feature) are obtained.
Alternatively, the embedded Feature in the present disclosure may refer to a Feature obtained by performing a Feature Embedding (Feature Embedding) process on an input Feature. The number of channels of the feature before the feature embedding process is performed is generally different from the number of channels of the feature after the feature embedding process is performed, and the feature embedding process may be considered as a dimension reduction process performed on the feature.
Alternatively, the present disclosure may obtain the first embedded feature and the second embedded feature using an embedded neural network unit. Specifically, the first pyramid image feature is input into the embedded neural network unit, and the first embedded feature is obtained according to the output of the embedded neural network unit. And inputting the second pyramid image features into the embedded neural network unit, and obtaining second embedded features according to the output of the embedded neural network unit.
Optionally, the network structure adopted by the embedded neural network unit includes but is not limited to: a bottle neck structure. For example, the embedded neural network unit may include three convolutional layers, and the convolutional kernels of the three convolutional layers may be: 1 × 1 × 32, 3 × 2 × 32, and 1 × 1 × 64.
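A minimal sketch of such a bottleneck-style embedded neural network unit is shown below; the input channel count is an assumption, and the 3 × 2 kernel mentioned above is read here as a 3 × 3 kernel.

```python
import torch.nn as nn

class EmbeddingUnit(nn.Module):
    """Bottleneck embedding unit: three convolutions producing 32, 32 and 64
    channels, used to map an image feature to a lower-dimensional embedded
    feature for similarity computation."""
    def __init__(self, in_channels):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=1),
        )

    def forward(self, x):
        return self.layers(x)  # embedded feature with 64 channels
```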
S601, respectively calculating the similarity between each first embedding feature and each second embedding feature to obtain a plurality of similarities.
Optionally, the present disclosure may calculate a cosine distance between the first embedded feature and the second embedded feature, and represent a similarity between the first embedded feature and the second embedded feature using the cosine distance.
S602, according to the plurality of similarities, a first weight value of each first pyramid image feature and a second weight value of each second pyramid image feature are determined.
Optionally, the present disclosure may perform normalization processing on the obtained similarity, so as to obtain a first weight value and a second weight value. For example, the first weight and the second weight may be calculated using the following equation (2):
w_{t-τ→t} = exp(s(e_{t-τ→t}, e_t)) / Σ_{τ'=0..T} exp(s(e_{t-τ'→t}, e_t))    (2)
In the above formula (2), s(·,·) denotes the similarity (e.g., the cosine distance) between two embedded features, and exp(x) denotes the exponential of x. When τ is 0, w_{t-τ→t} is w_{t→t} and represents a first weight value; when τ is not 0, w_{t-τ→t} represents a second weight value. Likewise, when τ is 0, e_{t-τ→t} is e_t and denotes a first embedded feature; when τ is not 0, e_{t-τ→t} represents a second embedded feature. τ ranges from 0 to T, and T denotes the number of second image blocks.
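A sketch of S601-S602 consistent with formula (2) follows; the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def aggregation_weights(e_t, e_hist):
    """Compute the first and second weight values from embedded features.

    e_t:    (C, H, W) first embedded feature (tau = 0)
    e_hist: list of T (C, H, W) deformed second embedded features (tau = 1..T)
    Returns T + 1 weight maps of shape (H, W); index 0 is the first weight
    value, the remaining entries are the second weight values.
    """
    embeddings = [e_t] + list(e_hist)
    # Cosine similarity between each embedded feature and e_t (S601).
    sims = [F.cosine_similarity(e, e_t, dim=0) for e in embeddings]
    # Exponential normalization over tau, as in formula (2) (S602).
    weights = torch.softmax(torch.stack(sims, dim=0), dim=0)
    return list(weights)
```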
By utilizing the similarity between the embedded features, the present disclosure can conveniently obtain the first weight value of each first pyramid image feature and the second weight value of each deformed second pyramid image feature, which is beneficial to improving the efficiency of the feature aggregation processing.
In an alternative example, the present disclosure provides an example of determining position information of a target in a current frame according to correlations between image features of reference image blocks in a reference frame and respective aggregated image features, as shown in fig. 7.
In fig. 7, in S700, the image features of the reference image block are extracted to obtain the reference image features.
Optionally, an example of the reference image feature obtaining manner in the present disclosure may be:
firstly, extracting multi-level image features of the reference image block to obtain first reference image features of multiple levels. For example, the reference image block is provided to a neural network (e.g., the trunk feature extraction module) for extracting image features, which performs an image feature extraction operation on the reference image block; the present disclosure may obtain the image features output by each of a plurality of layers (e.g., convolutional layers) in the trunk feature extraction module. Since the spatial resolutions and the numbers of channels of the image features output by these layers are different, the image features output by the plurality of layers form pyramid image features; that is, the first reference image feature is a pyramid image feature. The first reference image feature may comprise image features with a number of layers of 3, 4, or more. Typically, the number of layers of image features included in the first reference image feature is the same as the number of layers of image features included in the first pyramid image features and the second pyramid image features. In addition, the first reference image feature may have the same number of channels and spatial resolution as the image features of the same layer in the first pyramid image features and the second pyramid image features.
Secondly, feature fusion processing is carried out on the first reference image features of multiple levels, and the image features after fusion processing are determined. For example, the image feature after the fusion process may be directly used as the reference image feature.
Optionally, the performing of the feature fusion processing on the first reference image feature in the present disclosure may specifically be: firstly, performing 1 × 1 convolution processing on all layer image features in the first reference image feature respectively to make each layer image feature have the same channel number. Secondly, performing upsampling processing on the uppermost layer image feature after the convolution processing, wherein the spatial resolution and the channel number of the image feature obtained after the upsampling processing and the next upper layer image feature after the convolution processing are the same, and the image feature obtained after the upsampling processing and the next upper layer image feature after the convolution processing are overlapped. And thirdly, performing up-sampling processing on the image features after the previous superposition, wherein the spatial resolution and the channel number of the image features obtained after the up-sampling processing are the same as those of the image features of the next upper layer after the convolution processing, and the image features obtained after the up-sampling processing are superposed with the image features of the next upper layer after the convolution processing. And repeating the steps until the image features are overlapped with the image features of the last layer, and finally obtaining the reference image features.
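The fusion procedure described above resembles a top-down feature-pyramid fusion; a minimal sketch is given below, where the common channel count is an assumption. The same procedure applies to the fusion of the aggregated image features described in the following steps.

```python
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    """Fuse a pyramid of image features: 1x1 convolutions equalize the channel
    counts, then the coarsest level is repeatedly upsampled and superposed on
    the next level down until the bottom level is reached."""
    def __init__(self, in_channels_per_level, out_channels=64):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_per_level
        )

    def forward(self, feats):
        # feats: list of (N, C_i, H_i, W_i) tensors ordered from bottom (finest)
        # to top (coarsest) level of the pyramid.
        lats = [conv(f) for conv, f in zip(self.laterals, feats)]
        fused = lats[-1]                          # start from the uppermost level
        for lat in reversed(lats[:-1]):           # walk down the pyramid
            fused = F.interpolate(fused, size=lat.shape[-2:],
                                  mode="bilinear", align_corners=False) + lat
        return fused                              # single fused feature map
```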
The method and the device have the advantages that the feature fusion processing is carried out on the image features of multiple layers of the reference image block, so that the feature expression of the reference image features is more accurate, and the position information of the subsequently determined target in the current frame is more accurate.
And S701, respectively calculating Gaussian responses between the reference image features and the aggregated image features.
Optionally, the present disclosure may perform feature fusion processing on the aggregate image feature, and calculate a gaussian response by using the obtained reference image feature and the aggregate image feature after the feature fusion processing.
Optionally, the process of performing feature fusion processing on the aggregated image features in the present disclosure may be: firstly, performing 1 × 1 convolution processing on all layer image features in the aggregated image features respectively to enable the image features of each layer to have the same channel number. Secondly, performing upsampling processing on the uppermost layer image feature after the convolution processing, wherein the spatial resolution and the channel number of the image feature obtained after the upsampling processing at this time are the same as those of the next upper layer image feature after the convolution processing, and the image feature obtained after the upsampling processing at this time is superposed with the next upper layer image feature after the convolution processing. And thirdly, performing up-sampling processing on the image features after the previous superposition, wherein the spatial resolution and the channel number of the image features obtained after the up-sampling processing are the same as those of the image features of the next upper layer after the convolution processing, and the image features obtained after the up-sampling processing are superposed with the image features of the next upper layer after the convolution processing. And repeating the steps until the image features are overlapped with the image features of the last layer, and finally obtaining the aggregated image features after feature fusion processing.
Optionally, the present disclosure may use a correlation filtering method to obtain a gaussian response between the reference image feature and the aggregated image feature after the feature fusion processing. The operation performed by the correlation filtering method can be expressed by the following formula (3):
g = F⁻¹( Σ_d ŵ^{d*} ⊙ ẑ^d )    (3)
In the above formula (3), g represents the Gaussian response; F⁻¹(·) denotes the inverse Fourier transform; ŵ^{d*} represents the complex conjugate of ŵ^d; ŵ^d denotes the correlation filtering parameter of channel d, which can be obtained by calculation using the following formula (4); ⊙ indicates Hadamard (element-wise) multiplication; ẑ^d represents the Fourier transform of z^d; and z^d represents the feature value of channel d in the aggregated image feature of the current frame after the feature fusion processing.
ŵ^d = ( ŷ* ⊙ x̂^d ) / ( Σ_{d'} x̂^{d'} ⊙ x̂^{d'*} + λ )    (4)
In the above formula (4), λ represents a penalty coefficient, which is a known value; for example, λ may be 0.0001. x̂^d represents the Fourier transform of x^d, and x^d represents channel d of the reference image feature; y* represents the complex conjugate of y, and ŷ* represents the Fourier transform of y*; y represents a standard Gaussian distribution formed based on the reference image block in the reference frame; and ⊙ indicates Hadamard multiplication.
The present disclosure can obtain a gaussian response between the reference image feature and the aggregated image feature after each feature fusion process using the above equation (3) and equation (4).
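A compact sketch of formulas (3) and (4) using a Fourier-domain correlation filter is given below; the per-channel formulation with a channel sum in the denominator is an assumption following standard discriminative correlation filters, not the patent's exact implementation.

```python
import torch

def gaussian_response(ref_feat, agg_feat, y, lam=1e-4):
    """Compute the Gaussian response between the fused reference image feature
    and the fused aggregated image feature of one first image block.

    ref_feat: (D, H, W) fused reference image feature (x in formula (4))
    agg_feat: (D, H, W) fused aggregated image feature of the current frame (z)
    y:        (H, W) standard Gaussian label built from the reference image block
    """
    X = torch.fft.fft2(ref_feat)                 # Fourier transform of x^d
    Z = torch.fft.fft2(agg_feat)                 # Fourier transform of z^d
    Y = torch.fft.fft2(y)
    W = (Y.conj() * X) / ((X * X.conj()).sum(dim=0) + lam)   # formula (4)
    G = (W.conj() * Z).sum(dim=0)                            # formula (3)
    return torch.fft.ifft2(G).real

# The first image block whose response map has the largest peak is selected:
# best = max(range(len(aggs)), key=lambda i: gaussian_response(ref, aggs[i], y).max())
```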
S702, determining the position information of the target in the current frame according to the first image block corresponding to the Gaussian response maximum value.
Determining the position information of the target in the current frame by utilizing the Gaussian responses helps to quickly select one first image block from the plurality of first image blocks; because the Gaussian response corresponding to that first image block is the largest, its position information is most likely to be the position information of the target in the current frame, so that the position information of the target in the current frame can be determined accurately, which is beneficial to improving the accuracy of target tracking.
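Purely as an illustration of how the selected first image block could be mapped back to position information (this specific scale-update rule is an assumption and is not stated in the patent):

```python
def locate_target(prev_box, best_index, a=1.03, s=5):
    """Update the target box using the scale factor of the selected sub-region,
    keeping the previous center; a peak offset in the response map could
    additionally shift the center, which is omitted here for brevity."""
    cx, cy, w, h = prev_box
    k = best_index - s // 2          # map block index back to the progression S
    scale = a ** k
    return (cx, cy, w * scale, h * scale)
```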
In one optional example, the present disclosure may utilize a neural network to implement the above-described target tracking method. The neural network of the present disclosure may be referred to as a Spatial-Aware Temporal Aggregation Network (SATAN).
An example of target tracking using SATAN is shown in FIG. 8.
In fig. 8, the t-th frame in the video (i.e., the search frame t in fig. 8) is assumed to be the current frame 800. Assume that the number of history frames used to provide the second image blocks is k (k is an integer greater than 0), namely the (t-k)-th history frame 801 (i.e., frame t-k in fig. 8), the (t-k+1)-th history frame (not shown in fig. 8), ..., and the (t-1)-th history frame (not shown in fig. 8). The reference frame 802 is the bottom image on the left side of fig. 8 (i.e., the template frame in fig. 8). A target detection frame is set in the reference frame 802. The reference frame 802, the current frame 800 and each of the historical frames embody temporal information.
The present disclosure may segment a reference image block from the reference frame 802 according to the target detection frame in the reference frame 802 and adjust the reference image block to a predetermined size (e.g., 125 × 125). The reference image block may be provided to a trunk feature extraction module (not shown in fig. 8) in the SATAN, which extracts image features of multiple levels from the input reference image block to obtain first reference image features of multiple levels. After 1 × 1 convolution processing is performed on all layer image features in the first reference image features of multiple levels, each layer image feature has the same number of channels. The four layers of image features having the same number of channels after the convolution processing are shown as the four small boxes on the left in the lowermost box 803 of fig. 8. These four layers of image features are sequentially subjected to upsampling and superposition processing to finally form the rightmost image feature in the lowermost box 803 in fig. 8, which is the reference image feature 804.
The present disclosure may perform an enlargement process (e.g., 3 times enlargement, etc.) on a target detection frame in a history frame (e.g., t-1 th history frame) adjacent to the current frame 800, and determine an area of the enlarged target detection frame in the current frame 800, i.e., an enlarged area. The position of the center point of the target detection frame in the history frame is the position of the amplification area in the current frame 800. If the partial area of the enlarged target detection frame is beyond the range of the current frame 800, the partial area may be processed in a conventional manner, for example, zero padding or the like. Then, the present disclosure may determine a plurality of sub-regions in the enlarged region according to a preset scale factor, and adjust the sizes of the plurality of sub-regions to predetermined sizes (e.g., 125 × 125), respectively, so as to obtain a plurality of first image blocks. Then, each first image block is provided to a trunk feature extraction module in the SATAN, and the trunk feature extraction module extracts the image features of the first image block at multiple levels to obtain first image features at multiple levels, such as the first image features 805 at four levels shown at the left side of fig. 8.
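The candidate-region construction described above could look roughly like the following sketch, assuming OpenCV-style images; the concrete scale-factor values, the (cx, cy, w, h) box format and the function name `make_first_image_blocks` are illustrative assumptions (the disclosure only specifies that a preset scale factor, an enlargement such as 3×, and a predetermined size such as 125 × 125 are used).

```python
import cv2

def make_first_image_blocks(frame, prev_box, scale_factors=(0.96, 1.0, 1.04),
                            enlarge=3.0, out_size=125):
    """Crop candidate sub-regions in the current frame around the previous
    target box, one per scale factor, each resized to out_size x out_size.
    prev_box: (cx, cy, w, h) of the target detection frame in the last history frame."""
    cx, cy, w, h = prev_box
    H, W = frame.shape[:2]
    blocks = []
    for s in scale_factors:
        bw, bh = w * enlarge * s, h * enlarge * s            # enlarged sub-region
        x0, y0 = int(round(cx - bw / 2)), int(round(cy - bh / 2))
        x1, y1 = int(round(cx + bw / 2)), int(round(cy + bh / 2))
        # Zero-pad when the enlarged region exceeds the frame boundary.
        pad_x0, pad_y0 = max(0, -x0), max(0, -y0)
        pad_x1, pad_y1 = max(0, x1 - W), max(0, y1 - H)
        padded = cv2.copyMakeBorder(frame, pad_y0, pad_y1, pad_x0, pad_x1,
                                    cv2.BORDER_CONSTANT, value=0)
        crop = padded[y0 + pad_y0:y1 + pad_y0, x0 + pad_x0:x1 + pad_x0]
        blocks.append(cv2.resize(crop, (out_size, out_size)))
    return blocks
```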
The present disclosure may cut out the second image block from the corresponding historical frame according to the target detection frame in the historical frame, and adjust the second image block to a predetermined size (e.g., 125 × 125). The second image block may be provided to a trunk feature extraction module in the SATAN, and the trunk feature extraction module extracts the image features of the input second image block at multiple levels to obtain the second image features at multiple levels. The present disclosure performs the above-described processing on each of the k history frames, so that k second image features can be obtained. An example of the multiple levels of the second image features corresponding to the t-k th history frame 801 is a four-level second image feature 806 shown on the left side of fig. 8.
For each layer of image features in each second image feature, the present disclosure may utilize an alignment unit 807 in the SATAN to perform alignment processing with the same layer of image features in the first image feature, so that deformed second image features may be obtained by the alignment unit 807; one layer of image features in a deformed second image feature is shown as the rightmost small block 809 in the uppermost block 808 of fig. 8. Specifically, for any first image feature and any second image feature, the alignment unit 807 may perform feature stitching on the same layer of image features in the first image feature and the second image feature based on channel connection to obtain a layer of stitched image features, and perform convolution processing on this layer of stitched image features (for example, the alignment unit 807 performs the convolution processing using a convolution processing unit included therein) to obtain inter-frame motion information (which may also be referred to as an offset) between that layer of image features in the first image feature and in the second image feature. The alignment unit 807 may obtain one layer of image features of the deformed second image feature by performing deformation processing on the corresponding layer of image features in the second image feature using the inter-frame motion information; for example, the alignment unit 807 may perform deformable convolution processing on the corresponding layer of image features of the second image feature according to the inter-frame motion information. With the above method, the present disclosure can obtain each layer of image features in all deformed second image features.
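A minimal sketch of one level of this alignment step is given below, using `torchvision.ops.DeformConv2d` as one possible way to realize the deformable convolution; the class name `AlignUnit` and the kernel sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AlignUnit(nn.Module):
    """Align one level of a historical (second) feature to the corresponding
    level of a candidate (first) feature: predict an offset field from the
    channel-concatenated pair, then deform the historical feature with it."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # The offset field carries 2 values (dx, dy) per kernel sampling location.
        self.offset_conv = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size,
                                        padding=kernel_size // 2)

    def forward(self, first_feat, second_feat):
        stitched = torch.cat([first_feat, second_feat], dim=1)  # channel connection
        offsets = self.offset_conv(stitched)                    # inter-frame motion info
        aligned = self.deform_conv(second_feat, offsets)        # deformed second feature
        return aligned
```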
For each layer of image features in the first image features, the present disclosure may utilize the aggregation unit 810 in the SATAN to perform feature aggregation processing on one layer of image features in the first image features and the same layer of image features in the deformed second image features, so as to obtain an aggregated image feature of the first image block corresponding to the first image features. For example, the aggregation unit 810 may include an embedded neural network unit, which may include at least one embedded layer (e.g., three convolutional layers with convolution kernels of 1 × 1 × 32, 3 × 2 × 32, and 1 × 1 × 64, respectively); the aggregation unit 810 may obtain the first embedded feature of each first image feature and the second embedded feature of each deformed second image feature by using the embedded neural network unit. Then, the aggregation unit 810 calculates cosine similarities between the first embedded features and the second embedded features respectively, so as to obtain a plurality of cosine similarities. Then, the aggregation unit 810 obtains the aggregated image feature of the first image block corresponding to the first image feature according to the plurality of similarities and the corresponding first image feature and deformed second image features. With the above method, the present disclosure may obtain an aggregated image feature of each first image block. An example of the aggregated image feature of a first image block is the group of four blocks 811 in fig. 8, i.e., an aggregated image feature comprising four layers of image features. The present disclosure may implement the feature aggregation processing by using weights, as described above with reference to fig. 5. Since the multiple levels of the second image features and the first image features reflect the characteristics of the image features at different scales, the spatial information of the image features is embodied along the x-axis direction of fig. 8.
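The following is a minimal sketch of the aggregation for a single pyramid level, assuming the weights are obtained by normalizing the per-pixel cosine similarities with a softmax (the softmax normalization and the embedding layer sizes are assumptions for illustration; the disclosure only states that weight values are derived from the similarities). `AggregationUnit` is an illustrative name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AggregationUnit(nn.Module):
    """Weight the candidate (first) feature and the aligned historical (second)
    features by the cosine similarity of their embeddings, then sum."""
    def __init__(self, channels, embed_dim=64):
        super().__init__()
        # Small embedding head; layer widths here are illustrative only.
        self.embed = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=1),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.Conv2d(32, embed_dim, kernel_size=1),
        )

    def forward(self, first_feat, aligned_second_feats):
        feats = [first_feat] + list(aligned_second_feats)       # (N, C, H, W) each
        embeds = [self.embed(f) for f in feats]
        ref = embeds[0]
        # Per-pixel cosine similarity against the candidate feature's embedding.
        sims = torch.stack([F.cosine_similarity(ref, e, dim=1) for e in embeds], dim=0)
        weights = torch.softmax(sims, dim=0).unsqueeze(2)       # normalize over frames
        stacked = torch.stack(feats, dim=0)                     # (T, N, C, H, W)
        return (weights * stacked).sum(dim=0)                   # aggregated feature
```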
The present disclosure may utilize the up-sampling unit to perform feature fusion processing on the aggregated image feature of each first image block, so as to obtain the aggregated image feature of each first image block after the feature fusion processing. The present disclosure may also utilize the up-sampling unit to perform feature fusion processing on the first reference image features of multiple levels, and the image features after the fusion processing are used as the reference image feature of the reference image block.
The present disclosure may utilize a correlation filtering unit to obtain a Gaussian response between the reference image feature and each aggregated image feature after the feature fusion processing. The correlation filtering unit may include a correlation filtering layer. The present disclosure selects, from the plurality of first image blocks, the first image block corresponding to the maximum Gaussian response value output by the correlation filtering unit, and the coordinate information of the selected first image block can indicate the position of the target in the current frame.
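Selecting the target position from the per-block responses can then be as simple as the sketch below; the function name `locate_target` and the box format are assumptions for illustration.

```python
import torch

def locate_target(responses, block_boxes):
    """Pick the candidate block whose correlation-filter response peaks highest;
    its coordinates indicate the target position in the current frame.
    responses:   list of (H, W) Gaussian response maps, one per first image block.
    block_boxes: list of (x, y, w, h) boxes of the first image blocks."""
    peaks = torch.stack([r.max() for r in responses])
    best = int(torch.argmax(peaks))
    return block_boxes[best], best
```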
The training process of the trunk feature extraction module in the SATAN of the present disclosure precedes the training process of the other modules; that is, the present disclosure may first train the trunk feature extraction module alone by using image samples, and the network parameters of the successfully trained trunk feature extraction module remain unchanged during the subsequent training of the other modules and units.
The present disclosure may train other modules and units in the SATAN except for the trunk feature extraction module, and the loss function used for training may be shown in the following formula (5):
$$loss = \|g - y\|_{2} \qquad \text{formula (5)}$$

In the above formula (5), g may be obtained by calculation using the above formula (3) and formula (4); y represents a standard Gaussian distribution formed based on the labeling information of the image sample, that is, the position information of the target detection frame in the image sample; and $\|\cdot\|_{2}$ denotes the Euclidean distance.
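A minimal sketch of the loss in formula (5), assuming PyTorch tensors; `tracking_loss` is an illustrative name.

```python
import torch

def tracking_loss(g, y):
    """Euclidean-distance loss of formula (5) between the predicted Gaussian
    response g and the standard Gaussian label y built from the annotation."""
    return torch.linalg.vector_norm(g - y, ord=2)
```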
Exemplary devices
Fig. 9 is a schematic structural diagram of an embodiment of the target tracking device of the present disclosure. The apparatus of this embodiment may be used to implement the method embodiments of the present disclosure described above. As shown in fig. 9, the apparatus of this embodiment includes: an image block processing module 900, a main feature extraction module 901, a feature processing module 902, and an object positioning module 903.
The image block processing module 900 is configured to determine a plurality of image areas in the current frame according to the position information of the target in the historical frame before the current frame, so as to obtain a plurality of first image blocks.
Optionally, the image block processing module 900 in the present disclosure may first amplify the target detection frame in the current frame with the central point of the target detection frame in the historical frame as the center, so as to obtain an amplified region; then, the image block processing module 900 determines a plurality of sub-areas in the enlarged area according to the preset scale factor; the image block processing module 900 adjusts the sizes of the sub-areas to predetermined sizes, respectively, to obtain a plurality of first image blocks.
The main feature extraction module 901 is configured to extract a second image block in at least one historical frame before a current frame and multiple levels of image features of multiple first image blocks obtained by the image block processing module 900, respectively, to obtain multiple levels of second image features corresponding to each second image block and multiple levels of first image features corresponding to each first image block.
Optionally, the trunk feature extraction module 901 may extract image features of multiple layers of the second image block in each historical frame, to obtain second pyramid image features corresponding to each second image block; the trunk feature extraction module 901 extracts image features of multiple layers of each first image block to obtain first pyramid image features corresponding to each first image block; the number of layers of the first pyramid image feature is the same as that of the second pyramid image feature, and the number of channels and the spatial resolution of the same layer in the first pyramid image feature and the second pyramid image feature are the same.
Optionally, the trunk feature extraction module 901 in this disclosure is further configured to extract image features of multiple levels of the reference image block, so as to obtain first reference image features of multiple levels.
The feature processing module 902 is configured to perform feature aggregation processing according to the second image features and the first image features obtained by the trunk feature extraction module 901, so as to obtain aggregated image features corresponding to the first image blocks. Wherein the aggregate image features have multiple levels.
Optionally, the feature processing module 902 may include: an alignment unit 9021 and an aggregation unit 9022. The alignment unit 9021 is configured to perform deformation processing on each second image feature according to each first image feature obtained by the main feature extraction module 901, so as to obtain a plurality of deformed second image features. For example, the alignment unit 9021 may first acquire motion information between each of the first image features and the plurality of second image features; then, the alignment unit 9021 performs deformation processing on the plurality of second image features according to the obtained motion information, thereby obtaining a plurality of deformed second image features. The manner of acquiring the motion information between each first image feature and the plurality of second image features by the alignment unit 9021 may be: the alignment unit 9021 performs feature splicing on each first image feature and each second image feature respectively based on channel connection to obtain a plurality of spliced image features; the alignment unit 9021 performs convolution processing on the multiple stitched image features respectively to obtain motion information between each first image feature and the multiple second image features. The aggregation unit 9022 is configured to perform feature aggregation processing on each first image feature and the plurality of deformed second image features obtained by the alignment unit 9021, respectively, to obtain an aggregated image feature corresponding to each first image block. For example, the aggregating unit 9022 obtains a first weight value of each first image feature and a second weight value of each deformed second image feature according to each first image feature and the plurality of deformed second image features; the first weight value comprises a weight value layer corresponding to each level of image features of the first image features, and the second weight value comprises a weight value layer corresponding to each level of image features of the second image features; the aggregation unit 9022 determines, according to each first image feature, each first weight value, each deformed second image feature, and each second weight value, an aggregated image feature corresponding to each first image block. One way for the aggregation unit 9022 to obtain the first weight value of each first image feature and the second weight value of each deformed second image feature according to each first image feature and the plurality of deformed second image features may be: the aggregation unit 9022 first obtains (e.g., obtains by using an embedded neural network unit included in the aggregation unit) a first embedded feature of each first image feature and a second embedded feature of each deformed second image feature; then, the aggregation unit 9022 calculates similarities (such as cosine similarities, that is, cosine distances) between the first embedded features and the second embedded features, respectively, so as to obtain a plurality of similarities; then, the aggregating unit 9022 determines a first weight value of each first image feature and a second weight value of each deformed second image feature according to the plurality of similarities.
The target positioning module 903 is configured to determine position information of a target in a current frame according to correlations between image features of reference image blocks in a reference frame and the aggregate image features obtained by the feature processing module 902.
Optionally, the target positioning module 903 may include: a correlation filtering unit 9031 and a position determining unit 9032. The correlation filtering unit 9031 is configured to calculate Gaussian responses between the reference image features and the aggregated image features respectively. For example, the correlation filtering unit 9031 may be specifically configured to calculate Gaussian responses between the reference image feature and the aggregated image features after the fusion processing respectively. The correlation filtering unit 9031 may include a correlation filtering layer. The operation specifically performed by the correlation filtering unit 9031 may be as described in the foregoing method embodiment with respect to S701. The position determining unit 9032 is configured to determine, according to the first image block corresponding to the maximum Gaussian response value, position information of the target in the current frame.
Optionally, the apparatus in the present disclosure further includes: an upsampling unit 904, configured to perform feature fusion processing on the first reference image features of multiple levels, and determine image features after fusion processing; the reference image features are generated from the image features after the fusion process. For example, the image feature after the fusion process may be directly used as the reference image feature. In addition, the upsampling unit 904 may be further configured to perform feature fusion processing on each aggregate image feature, and determine an aggregate image feature after the fusion processing. The operation specifically performed by the upsampling unit 904 can be referred to the description of S700 and S701 in the above method embodiment.
Exemplary electronic device
An electronic device according to an embodiment of the present disclosure is described below with reference to fig. 10. FIG. 10 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 10, the electronic device 101 includes one or more processors 1011 and memory 1012.
The processor 1011 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 101 to perform desired functions.
Memory 1012 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory, for example, may include: random Access Memory (RAM) and/or cache memory (cache), etc. The nonvolatile memory, for example, may include: read Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 1011 to implement the target tracking methods of the various embodiments of the present disclosure described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 101 may further include: an input device 1013, an output device 1014, etc., which are interconnected by a bus system and/or other form of connection mechanism (not shown). Further, the input device 1013 may include, for example, a keyboard, a mouse, and the like. The output device 1014 can output various kinds of information to the outside. The output devices 1014 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device 101 relevant to the present disclosure are shown in fig. 10, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 101 may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the target tracking method according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in a target tracking method according to various embodiments of the present disclosure described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium may include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as, but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects, and the like, will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (15)

1. A target tracking method, comprising:
determining a plurality of image areas in a current frame according to position information of a target in a historical frame before the current frame to obtain a plurality of first image blocks;
respectively extracting second image blocks in at least one historical frame before the current frame and the multi-layer image features of the plurality of first image blocks to obtain the multi-layer second image features corresponding to the second image blocks and the multi-layer first image features corresponding to the first image blocks;
performing feature aggregation processing according to the second image features and the first image features to obtain aggregated image features corresponding to the first image blocks, wherein the aggregated image features have multiple levels;
and determining the position information of the target in the current frame according to the correlation between the image features of the reference image blocks in the reference frame and the features of the aggregated images.
2. The method of claim 1, wherein the determining a plurality of image areas in the current frame according to the position information of the target in the historical frame before the current frame to obtain a plurality of first image blocks comprises:
in the current frame, the center point of a target detection frame in the historical frame is taken as the center, and the target detection frame is amplified to obtain an amplified area;
determining a plurality of sub-regions in the amplification region according to a preset scale factor;
and respectively adjusting the sizes of the plurality of sub-areas to preset sizes to obtain a plurality of first image blocks.
3. The method according to claim 1 or 2, wherein the extracting the second image blocks and the multiple levels of image features of the first image blocks in at least one historical frame before the current frame respectively to obtain the multiple levels of second image features corresponding to the second image blocks and the multiple levels of first image features corresponding to the first image blocks respectively comprises:
extracting the image features of multiple layers of the second image blocks in each historical frame to obtain second pyramid image features corresponding to the second image blocks respectively;
extracting the image features of multiple layers of each first image block to obtain first pyramid image features corresponding to the first image blocks;
the number of layers of the first pyramid image feature is the same as that of the second pyramid image feature, and the number of channels and the spatial resolution of the same layer in the first pyramid image feature and the second pyramid image feature are the same.
4. The method according to any one of claims 1 to 3, wherein the performing feature aggregation processing according to the second image feature and the first image feature to obtain an aggregated image feature corresponding to each first image block includes:
according to the first image features, respectively carrying out deformation processing on the second image features to obtain a plurality of deformed second image features;
and respectively carrying out feature aggregation processing on each first image feature and the plurality of deformed second image features to obtain an aggregated image feature corresponding to each first image block.
5. The method according to claim 4, wherein the performing deformation processing on each second image feature according to each first image feature to obtain a plurality of deformed second image features comprises:
acquiring motion information between each first image feature and a plurality of second image features;
and according to the motion information, respectively carrying out deformation processing on the plurality of second image characteristics to obtain a plurality of deformed second image characteristics.
6. The method of claim 5, wherein the obtaining motion information between each first image feature and each of the plurality of second image features comprises:
respectively performing feature splicing on each first image feature and each second image feature based on channel connection to obtain a plurality of spliced image features;
and performing convolution processing on the plurality of spliced image features respectively to obtain motion information between each first image feature and the plurality of second image features respectively.
7. The method according to any one of claims 4 to 6, wherein the performing feature aggregation processing on each first image feature and the plurality of deformed second image features respectively to obtain an aggregated image feature corresponding to each first image block includes:
obtaining a first weight value of each first image feature and a second weight value of each deformed second image feature according to each first image feature and the plurality of deformed second image features; the first weight value comprises a weight value layer corresponding to each level of image features of the first image features, and the second weight value comprises a weight value layer corresponding to each level of image features of the second image features;
and determining the aggregation image characteristics corresponding to the first image blocks according to the first image characteristics, the first weight values, the deformed second image characteristics and the second weight values.
8. The method of claim 7, wherein the obtaining a first weight value for each first image feature and a second weight value for each warped second image feature from each first image feature and the plurality of warped second image features comprises:
acquiring first embedded features of the first image features and second embedded features of the deformed second image features;
respectively calculating the similarity between each first embedding feature and each second embedding feature to obtain a plurality of similarities;
and determining a first weight value of each first image feature and a second weight value of each deformed second image feature according to the plurality of similarities.
9. The method according to any one of claims 1 to 8, wherein the determining the position information of the target in the current frame according to the correlation between the image features of the reference image blocks in the reference frame and the respective aggregated image features comprises:
extracting image features of the reference image block to obtain reference image features;
respectively calculating Gaussian responses between the reference image features and the aggregated image features;
and determining the position information of the target in the current frame according to the first image block corresponding to the Gaussian response maximum value.
10. The method according to claim 9, wherein the extracting image features of the reference image block to obtain reference image features comprises:
extracting image features of multiple levels of a reference image block to obtain first reference image features of multiple levels;
performing feature fusion processing on the first reference image features of the multiple levels, and determining image features after fusion processing; the reference image features are generated by the image features after the fusion processing;
the separately calculating the gaussian responses between the reference image features and the aggregated image features comprises:
performing feature fusion processing on the feature of each aggregated image, and determining the feature of the aggregated image after fusion processing;
and respectively calculating Gaussian responses between the reference image features and the fused aggregated image features.
11. An object tracking device, comprising:
the image block processing module is used for determining a plurality of image areas in the current frame according to the position information of a target in a historical frame before the current frame to obtain a plurality of first image blocks;
the main feature extraction module is used for respectively extracting a second image block in at least one historical frame before a current frame and the multi-layer image features of the first image blocks obtained by the image block processing module, and obtaining the multi-layer second image features corresponding to the second image blocks and the multi-layer first image features corresponding to the first image blocks;
the feature processing module is used for performing feature aggregation processing according to the second image features and the first image features obtained by the trunk feature extraction module to obtain aggregated image features corresponding to the first image blocks, wherein the aggregated image features have multiple levels;
and the target positioning module is used for determining the position information of the target in the current frame according to the correlation between the image characteristics of the reference image block in the reference frame and the aggregated image characteristics obtained by the characteristic processing module.
12. The apparatus of claim 11, wherein the feature processing module comprises:
the alignment unit is used for respectively carrying out deformation processing on each second image feature according to each first image feature obtained by the main feature extraction module to obtain a plurality of deformed second image features;
and the aggregation unit is used for respectively carrying out feature aggregation processing on each first image feature and the plurality of deformed second image features obtained by the alignment unit to obtain an aggregated image feature corresponding to each first image block.
13. The apparatus of claim 11 or 12, wherein the stem feature extraction module is further configured to:
extracting image features of the reference image block to obtain reference image features;
the target location module, comprising:
a correlation filtering unit, which is used for respectively calculating Gaussian responses between the reference image features and the aggregated image features;
and the position determining unit is used for determining the position information of the target in the current frame according to the first image block corresponding to the Gaussian response maximum value.
14. A computer-readable storage medium, the storage medium storing a computer program for performing the method of any of the preceding claims 1-10.
15. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-10.