CN112686928A - Moving target visual tracking method based on multi-source information fusion - Google Patents
- Publication number: CN112686928A (application CN202110015551.5A)
- Authority
- CN
- China
- Prior art keywords
- event
- frame
- domain
- information
- data set
- Prior art date
- Legal status: Granted
Landscapes
- Image Analysis (AREA)
Abstract
The invention belongs to the technical field of computer vision and provides a moving target visual tracking method based on multi-source information fusion. Aiming at the task of visually tracking a moving target in scenes with rapid motion and poor illumination, the invention first constructs a moving target tracking data set based on an event camera, and further provides a cross-domain-attention-based visual target tracking algorithm for accurately tracking a visual target on this data set. The invention combines the complementary strengths of frame images and event data: the frame image provides rich texture information, while the event data still provides clear object edge information in challenging scenes. By separately setting the weights of the two kinds of domain information in different scenes, the method effectively integrates the advantages of the two kinds of sensors to solve the problem of target tracking under complex conditions.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a method for visually tracking a moving target by utilizing a frame image and an event stream output by an event camera based on deep learning.
Background
Moving object tracking is an important topic in computer vision: given the size and position of an object in the first frame of a video, the tracker must locate the object in the remaining frames. Methods based on convolutional neural networks (CNNs) have shown excellent performance in this field. Most of them rely on traditional frame images (RGB or grayscale) to perform tracking, but the accuracy of frame-based trackers drops sharply under harsh conditions (e.g., low light or fast motion). To improve the robustness of trackers in such environments, multi-modal methods have gradually been proposed, e.g., using depth sensors and thermal infrared sensors, which can provide valuable additional information. However, like conventional frame-based sensors, depth and thermal infrared sensors also have limited frame rates and suffer from motion blur.
An event camera is a bionic sensor that asynchronously measures light intensity changes in a scene and outputs events. It therefore provides very high temporal resolution (up to 1 MHz) with very low power consumption. Since the intensity change is computed on a logarithmic scale, it can operate over a very high dynamic range (140 dB). An event camera triggers an "ON" or "OFF" event when the log-scale pixel intensity change rises above or falls below a threshold. Event cameras also have limitations, one of which is the inability to provide texture and color information of objects. A frame image, in contrast, easily captures rich texture and semantic information under normal conditions. Fusing the data of the two domains therefore offers a way to solve visual target tracking in challenging scenes. The related background art is described in detail below.
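The ON/OFF trigger principle described above can be sketched as follows. This is a minimal illustration, not the sensor's actual circuit model; the contrast threshold value C = 0.2 is an assumed example.

```python
import numpy as np

def trigger_events(log_prev, log_curr, C=0.2):
    """Return a polarity map: +1 (ON), -1 (OFF), 0 (no event).

    An event fires when the log-intensity change since the last
    event crosses the contrast threshold C (assumed value).
    """
    diff = log_curr - log_prev
    polarity = np.zeros_like(diff, dtype=int)
    polarity[diff >= C] = 1    # brightness increased -> ON event
    polarity[diff <= -C] = -1  # brightness decreased -> OFF event
    return polarity

# Three pixels: brighter, unchanged, darker
prev = np.log(np.array([[100.0, 100.0, 100.0]]))
curr = np.log(np.array([[130.0, 100.0, 75.0]]))
pol = trigger_events(prev, curr)  # one ON, no event, one OFF
```

Because the comparison is made in log space, the same relative contrast change triggers an event regardless of absolute brightness, which is what gives the sensor its high dynamic range.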
(1) Single domain tracking based on frame images
At present, mainstream single-domain tracking methods on frame images are mainly based on deep learning, including tracking models built on pre-trained deep features, tracking models with offline-trained features, and tracking models that fuse correlation filtering into a neural network. Existing algorithms extract image features either through designed convolution modules or through hand-crafted descriptors that exploit cues such as the texture and color of the target in the frame image, and finally derive observation features to output the target position. Although single-domain tracking algorithms have achieved good results on the relevant data sets, for challenging scenes such as low illumination, high dynamic contrast, and fast motion, frame-image-based single-domain target tracking algorithms still struggle to achieve satisfactory results.
(2) Multi-domain tracking
Currently mainstream multi-domain tracking algorithms fuse the frame image with depth information or with thermal infrared information. Depth information of the environment can effectively resolve mutual occlusion between targets. Thermal infrared imaging works in adverse conditions such as low illumination, rain, and fog, so combining it with frame images can improve tracking accuracy. Because the output data form of an event camera is completely different from that of the two sensors above, research on fusing the event domain and the frame domain for target tracking is still at an early exploratory stage. Considering that an event camera can provide contour information of a moving object in complex scenes, how to combine this contour cue with the abundant texture information in the frame image to improve the accuracy of target tracking is a problem worth studying.
(3) Event tracking data set
Because event cameras capture moving objects well, some studies have attempted to track using event data; however, since event data is stored in a form very different from conventional frame images, the events are usually superimposed into event frames and annotated on those frames. Hu et al. collected a large event-based tracking data set by placing an event camera in front of a display screen, recording labeled grayscale frames; but since the screen plays discrete frames, the data set cannot represent events between those frames. Mitrokhin et al. collected two event-based tracking data sets: EED and EV-IMO. EED contains only two tracking target classes, with 179 frames (7.8 seconds) of grayscale images and corresponding labels; EV-IMO additionally provides object masks and raises the annotation frequency of the event data to 200 Hz, but contains only three tracking target classes.
Disclosure of Invention
Aiming at the task of visually tracking a moving target in scenes with rapid motion and poor illumination, the invention first constructs a moving target tracking data set based on an event camera, and further provides a cross-domain-attention-based visual target tracking algorithm for accurately tracking a visual target on this data set. The invention combines the complementary strengths of frame images and event data: the frame image provides rich texture information, while the event data still provides clear object edge information in challenging scenes.
The technical scheme of the invention is as follows:
a moving target visual tracking method based on multi-source information fusion comprises the following steps:
(1) building a data set
The data set comprises 108 sequences with 21 object categories in total and covers various complex scenes such as high exposure, motion blur, and HDR. The data set provides synchronously captured event data and frame images, and target annotation boxes at up to 240 Hz. By target category, the data set is divided into animals, vehicles, and everyday items (e.g., bottles and boxes). By scene, it is divided into low light, high dynamic range (HDR), fast motion with motion blur, and fast motion without motion blur. By camera motion and number of objects, it is divided into four scenes: single object with a stationary camera, single object with a moving camera, multiple objects with a stationary camera, and multiple objects with a moving camera;
(2) frame image feature extractor (FFE)
The frame image feature extractor extracts features from the frame images. ResNet18 is adopted as the extractor, and the outputs of Block 4 and Block 5 of ResNet18 are taken as the low-level and high-level frame features, respectively;
(3) event image feature extraction module (EFE)
The event image feature extraction module extracts features from event images stacked from the event data. The event data is expressed as

E = {e_k}, e_k = (x_k, y_k, t_k, p_k)

where (x_k, y_k) is the pixel coordinate of the event, t_k is the timestamp of the event, and p_k = ±1 is the polarity of the event. To feed the captured asynchronous events into the event data feature extraction module, the events are first superimposed to form event images as follows: (a) the events between two adjacent frame images are aggregated into N three-dimensional slices, which discretizes the events; (b) for each slice of events, an event image is generated as

E_i(x, y) = Σ_k p_k · δ(x − x_k, y − y_k), t_k ∈ [T_j + (i − 1)B, T_j + iB]

where i denotes the i-th branch, δ denotes the Dirac function, T_j is the timestamp corresponding to the j-th frame image in the frame domain, and B is the slice size in the time domain, defined as B = (T_{j+1} − T_j)/2;
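The stacking procedure in steps (a) and (b) can be sketched as follows. This is a numpy sketch under stated assumptions: events are split into N equal time slices between the two frame timestamps, and the sensor resolution (346×260, as on a DAVIS346) and the event-tuple layout (x, y, t, p) are assumptions for illustration.

```python
import numpy as np

def stack_events(events, t_start, t_end, N=3, H=260, W=346):
    """events: array of (x, y, t, p) rows; returns N stacked event images.

    Each slice accumulates event polarities at their pixel locations,
    a discrete version of the Dirac-delta sum in the text.
    """
    images = np.zeros((N, H, W), dtype=np.float32)
    edges = np.linspace(t_start, t_end, N + 1)  # slice boundaries
    for x, y, t, p in events:
        i = min(int(np.searchsorted(edges, t, side="right")) - 1, N - 1)
        if 0 <= i < N:
            images[i, int(y), int(x)] += p  # add polarity at (x, y)
    return images

# Three events spread over the inter-frame interval [0, 1]
events = np.array([[10, 20, 0.1, 1], [10, 20, 0.5, -1], [30, 40, 0.9, 1]])
imgs = stack_events(events, 0.0, 1.0)
```

Each of the N images then feeds one branch of the EFE, so earlier and later events between the two frames are kept in separate temporal channels rather than being blurred together.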
The event image feature extraction module likewise extracts low-level and high-level features of the event images. Meanwhile, in order to effectively aggregate event images from different time slices, the method fuses the N feature maps through learnable parameters w = (w_1, …, w_N):

e = ⊕_{i=1}^{N} (w_i · e_i)

where ⊕ denotes pixel-wise addition, e_i is the output of the i-th branch, and w_i is its weight, obtained through network training;
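The weighted aggregation of the N branch outputs can be sketched as follows. The weight values here are fixed stand-ins for parameters that would be learned during training; they are assumptions for illustration.

```python
import numpy as np

def fuse_branches(branch_outputs, weights):
    """Pixel-wise weighted addition of N branch feature maps."""
    assert len(branch_outputs) == len(weights)
    fused = np.zeros_like(branch_outputs[0])
    for e_i, w_i in zip(branch_outputs, weights):
        fused += w_i * e_i  # each branch contributes with its learned weight
    return fused

# Three branch outputs (constant maps for clarity) and stand-in weights
e = [np.full((2, 2), v, dtype=np.float32) for v in (1.0, 2.0, 3.0)]
w = [0.5, 0.3, 0.2]
out = fuse_branches(e, w)  # 0.5*1 + 0.3*2 + 0.2*3 = 1.7 everywhere
```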
Meanwhile, in order to better extract the effective information of the events, the method provides an edge attention module (EAB), which completes the efficient extraction of edge information through an adaptive attention mechanism:

κ_i = (σ(ψ_{1×1}(P(e_i))) ⊗ e_i) ⊕_c e_i

where σ is the Sigmoid activation function, ψ_{1×1} denotes a 1×1 convolution, κ_i denotes the output feature of the edge attention module in the i-th branch, ⊗ denotes element-wise multiplication, and ⊕_c and P(·) denote channel-wise addition and adaptive average pooling, respectively;
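Since the EAB's exact wiring is not reproduced in this text, the following is a simplified channel-attention sketch built only from the ingredients the symbol list names: adaptive average pooling, a 1×1 convolution (here a per-channel scalar weight stands in for it), a Sigmoid gate, and element-wise modulation of the feature map. All concrete values are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_attention(feature, conv_weight):
    """feature: (C, H, W); conv_weight: (C,) stands in for a 1x1 conv."""
    pooled = feature.mean(axis=(1, 2))       # adaptive average pooling -> (C,)
    gate = sigmoid(conv_weight * pooled)     # 1x1 conv + Sigmoid gate
    return feature * gate[:, None, None]     # element-wise modulation

# Two channels: the gate suppresses channel 0 (gate 0.5), passes channel 1
feat = np.ones((2, 4, 4), dtype=np.float32)
out = edge_attention(feat, np.array([0.0, 100.0]))
```

The gate rescales each channel by a learned reliability score, so channels carrying strong edge responses can be emphasized over noisy ones.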
(4) cross-domain information modulation selection module (CDMS)
The cross-domain information modulation selection module fuses information between the event domain and the frame domain and is designed based on the following observations: (1) a traditional frame camera effectively captures texture and semantic information, while an event camera extracts object edge information well, so the advantages of both can be exploited simultaneously; (2) for a traditional frame camera, imaging quality degrades sharply in scenes with low illumination or fast object motion, whereas an event camera is unaffected, in which case the event information is more reliable than the frame information; (3) in scenes where multiple objects move simultaneously, objects are hard to distinguish from edge information alone, and the texture differences between objects then provide effective assistance. The cross-domain information modulation selection module fuses features of the two domains through a cross-domain attention mechanism. Specifically, for two kinds of information D_1 and D_2 from different domains, D_1 refers to features from the frame domain, D_2 refers to features from the event domain, and the two are fused through the channel splicing operation [·], the domain-specific feature extraction processes F_f(·) and F_e(·) for the frame domain and the event domain, respectively, and the 3×3 and 5×5 convolutions ψ_{3×3} and ψ_{5×5}.
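The cross-domain fusion idea can be sketched as follows: concatenate frame features D1 and event features D2 along the channel axis, derive a reliability weight with a Sigmoid gate, and blend the two domains accordingly. The scalar gating function here is a deliberately simplified stand-in for the 3×3/5×5 convolution branches of the actual CDMS; `score_w` is a hypothetical learned parameter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cdms_fuse(D1, D2, score_w):
    """D1, D2: (C, H, W) frame/event features; score_w: learned (2C,) vector."""
    cat = np.concatenate([D1, D2], axis=0)        # channel splicing [.]
    s = sigmoid(score_w @ cat.mean(axis=(1, 2)))  # reliability of the frame domain
    return s * D1 + (1.0 - s) * D2                # modulated cross-domain selection

D1 = np.ones((2, 4, 4))   # frame-domain features
D2 = np.zeros((2, 4, 4))  # event-domain features
fused = cdms_fuse(D1, D2, np.zeros(4))  # zero weights -> s = 0.5, equal blend
```

In a low-light scene the learned gate would push s toward 0, favoring the event features; in a multi-object textured scene it would push s toward 1, favoring the frame features.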
The invention has the beneficial effects that:
(1) event data set in large challenging scenarios
The target tracking task based on deep learning relies on large annotated data sets; however, since the output of an event camera is an asynchronous stream, event data is difficult to label. This patent uses the VICON motion capture system to capture the motion of objects, obtaining high-frequency bounding-box annotations of the moving target, and in this way creates a single object tracking data set based on the DAVIS346 event camera. This data set will facilitate subsequent research on event-camera-based tracking algorithms.
(2) Fusion of frame and event fields
Due to the asynchrony of event data, the method differs from current RGB-D and RGB-T feature fusion approaches and, for the first time, explores fusing RGB data with event data. The CDFI module proposed in this patent effectively extracts features from the frame domain and the event domain through a mutual-assistance mechanism and fuses them according to their reliability. By separately setting the weights of the two kinds of domain information in different scenes, the method effectively integrates the advantages of the two kinds of sensors to solve the problem of target tracking under complex conditions.
Drawings
Fig. 1 is a structural diagram of the cross-domain feature integration module (CDFI) according to the present invention, which comprises the frame image feature extraction module (FFE), the event image feature extraction module (EFE), and the cross-domain information modulation selection module (CDMS).
FIG. 2 is a flow diagram of target tracking based on event and frame image fusion.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments, but the present invention is not limited to the specific embodiments.
A visual tracking method for a moving target based on fused event-domain and frame-domain information features comprises data set construction, network model training, and testing.
(1) Training data set generation
In order to annotate the moving target in the grayscale frames and event stream of the event camera, two steps need to be completed: camera coordinate system transformation and target anchor point coordinate transformation.
To achieve coordinate system conversion between the event camera and the VICON system, we first determine the event camera matrix K and the distortion coefficients d using a calibration plate. The rotation vector r and translation vector t of the DAVIS346 are then obtained by

r, t = S(K, d, p_i, P_i),  i = 1, 2, …, 25    (10)

where S represents the SolvePnP method, p_i is a set of 2D points on the grayscale image from the APS, and P_i is the corresponding set of 3D points on the target from Vicon. To obtain p_i and P_i, we use a T-shaped calibration frame that both Vicon and the APS can track. The frame carries 5 infrared markers; placing it at 5 different positions yields a total of 25 paired p_i and P_i. The 3D points captured by Vicon are converted to 2D point coordinates in the grayscale frame image by

s·[x_j y_j 1]^T = K·(R(r)·[X_j Y_j Z_j]^T + t)    (11)

where [x_j y_j 1]^T and [X_j Y_j Z_j]^T represent the 2D and 3D coordinates of the j-th marker on the object, respectively, R(r) is the rotation matrix corresponding to r, and s is a scale factor. From these projections, the annotation bounding box of the target is obtained from the maxima and minima of all 2D points:

x_l = min_j x_j,  y_l = min_j y_j,  x_r = max_j x_j,  y_r = max_j y_j    (12)

where (x_l, y_l) is the upper-left corner of the annotation box and (x_r, y_r) is the lower-right corner. The width w and height h of the bounding box are then

w = x_r − x_l,  h = y_r − y_l    (13)
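The projection and box computation described above can be sketched as follows. The pinhole projection follows equation (11) and the box follows (12)-(13); the camera matrix, pose, and marker coordinates below are invented numbers for illustration, not calibration results from the patent.

```python
import numpy as np

def project_points(K, R, t, pts3d):
    """Project 3D marker points into the image: s*[x y 1]^T = K(R X + t)."""
    cam = R @ pts3d.T + t[:, None]   # marker points in the camera frame
    uv = K @ cam
    return (uv[:2] / uv[2]).T        # divide by depth -> 2D pixel coordinates

def bbox_from_points(pts2d):
    """Axis-aligned box over all projected markers, eqs. (12)-(13)."""
    xl, yl = pts2d.min(axis=0)       # upper-left corner
    xr, yr = pts2d.max(axis=0)       # lower-right corner
    return xl, yl, xr - xl, yr - yl  # (x, y, w, h)

# Assumed intrinsics and identity pose; two markers 2 m in front of the camera
K = np.array([[300.0, 0.0, 173.0],
              [0.0, 300.0, 130.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
pts3d = np.array([[0.0, 0.0, 2.0], [0.2, 0.1, 2.0]])
pts2d = project_points(K, R, t, pts3d)
x, y, w, h = bbox_from_points(pts2d)
```

In practice R and t would come from SolvePnP as in equation (10), and the box would be recomputed at the Vicon rate to give the high-frequency annotations.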
(2) network training
For the FFE, the parameters are initialized with a ResNet18 model pre-trained on the ImageNet data set. For the EFE, N is set to 3. The batch size is set to 26. The network is trained with the Adam optimizer; the number of iterations is set to 50, and the learning rate is decayed by a factor of 0.2 every 15 iterations. The learning rates of the classifier, the bounding-box regressor, and the CDFI are set to 0.001, 0.001, and 0.0001, respectively.
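The stated step schedule (start at 0.001, multiply by 0.2 every 15 iterations, 50 iterations total) works out as follows; shown for the classifier learning rate.

```python
def lr_at(iteration, base_lr=0.001, gamma=0.2, step=15):
    """Step learning-rate decay: multiply by gamma every `step` iterations."""
    return base_lr * gamma ** (iteration // step)

# Full 50-iteration schedule: 0.001 -> 0.0002 -> 4e-05 -> 8e-06
schedule = [lr_at(i) for i in range(50)]
```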
Claims (1)
1. A moving target visual tracking method based on multi-source information fusion is characterized by comprising the following steps:
(1) building a data set
The data set comprises 108 sequences with 21 object categories in total and covers various complex scenes such as high exposure, motion blur, and HDR. The data set provides synchronously captured event data and frame images, and target annotation boxes at up to 240 Hz. By target category, the data set is divided into animals, vehicles, and everyday goods. By scene, it is divided into low light, high dynamic range, fast motion with motion blur, and fast motion without motion blur. By camera motion and number of objects, it is divided into four scenes: single object with a stationary camera, single object with a moving camera, multiple objects with a stationary camera, and multiple objects with a moving camera;
(2) frame image feature extractor FFE
The frame image feature extractor extracts features from the frame images. ResNet18 is adopted as the extractor, and the outputs of Block 4 and Block 5 of ResNet18 are taken as the low-level and high-level frame features, respectively;
(3) event image feature extraction module EFE
The event image feature extraction module extracts features from event images stacked from the event data. The event data is expressed as

E = {e_k}, e_k = (x_k, y_k, t_k, p_k)

where (x_k, y_k) is the pixel coordinate of the event, t_k is the timestamp of the event, and p_k = ±1 is the polarity of the event. To feed the captured asynchronous events into the event data feature extraction module, the events are first superimposed to form event images as follows: (a) the events between two adjacent frame images are aggregated into N three-dimensional slices, which discretizes the events; (b) for each slice of events, an event image is generated as

E_i(x, y) = Σ_k p_k · δ(x − x_k, y − y_k), t_k ∈ [T_j + (i − 1)B, T_j + iB]

where i denotes the i-th branch, δ denotes the Dirac function, T_j is the timestamp corresponding to the j-th frame image in the frame domain, and B is the slice size in the time domain, defined as B = (T_{j+1} − T_j)/2;
The event image feature extraction module likewise extracts low-level and high-level features of the event images. Meanwhile, in order to effectively aggregate event images from different time slices, the method fuses the N feature maps through learnable parameters w = (w_1, …, w_N):

e = ⊕_{i=1}^{N} (w_i · e_i)

where ⊕ denotes pixel-wise addition, e_i is the output of the i-th branch, and w_i is its weight, obtained through network training;
Meanwhile, in order to better extract the effective information of the events, the method provides an edge attention module EAB, which completes the efficient extraction of edge information through an adaptive attention mechanism:

κ_i = (σ(ψ_{1×1}(P(e_i))) ⊗ e_i) ⊕_c e_i

where σ is the Sigmoid activation function, ψ_{1×1} denotes a 1×1 convolution, κ_i denotes the output feature of the edge attention module in the i-th branch, ⊗ denotes element-wise multiplication, and ⊕_c and P(·) denote channel-wise addition and adaptive average pooling, respectively;
(4) cross-domain information modulation selection module CDMS
The cross-domain information modulation selection module fuses information between the event domain and the frame domain through a cross-domain attention mechanism. Specifically, for two kinds of information D_1 and D_2 from different domains, D_1 refers to features from the frame domain, D_2 refers to features from the event domain, and the two are fused through the channel splicing operation [·], the domain-specific feature extraction processes F_f(·) and F_e(·) for the frame domain and the event domain, respectively, and the 3×3 and 5×5 convolutions ψ_{3×3} and ψ_{5×5}.
Priority Applications (1)
- CN202110015551.5A (granted as CN112686928B), priority/filing date 2021-01-07: Moving target visual tracking method based on multi-source information fusion
Publications (2)
- CN112686928A (publication): 2021-04-20
- CN112686928B (grant): 2022-10-14
Family
- ID=75456139; CN112686928B (CN202110015551.5A), status: Active
Cited By (7)
- CN113269699A (2021-08-17): Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image
- CN113378917A (2021-09-10): Event camera target identification method based on self-attention mechanism
- CN115631407A (2023-01-20): Underwater transparent biological detection based on event camera and color frame image fusion
- CN116188533A (2023-05-30): Feature point tracking method and device and electronic equipment
- CN116206196A (2023-06-02): Ocean low-light environment multi-target detection method and detection system thereof
- CN116309781A (2023-06-23): Cross-modal fusion-based underwater visual target ranging method and device
- CN117808847A (2024-04-02): Space non-cooperative target feature tracking method integrating bionic dynamic vision
Citations (2)
- CN110148159A (2019-08-20, Xiamen University): Asynchronous target tracking method based on an event camera
- CN112037269A (2020-12-04, Dalian University of Technology): Visual moving target tracking method based on multi-domain collaborative feature expression
Non-Patent Citations (1)
- Wang Yi et al., "Multi-target adaptive visual tracking method based on DST-PCR5", Application Research of Computers (《计算机应用研究》)
Also Published As
- CN112686928B: 2022-10-14
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant