CN112686928A - Moving target visual tracking method based on multi-source information fusion - Google Patents
- Publication number: CN112686928A (application CN202110015551.5A)
- Authority
- CN
- China
- Prior art keywords
- event
- frame
- domain
- information
- data set
- Prior art date
- Legal status: Granted
Landscapes
- Image Analysis (AREA)
Abstract
The invention belongs to the technical field of computer vision and provides a moving target visual tracking method based on multi-source information fusion. Aiming at the task of visually tracking a moving target in scenes with rapid motion and poor illumination, the invention first constructs a moving target tracking data set based on an event camera, and further provides a cross-domain-attention-based visual target tracking algorithm for accurately tracking a visual target on this data set. The invention combines the complementary strengths of frame images and event data: the frame image provides rich texture information, while the event data still provides clear object edge information in challenging scenes. By separately setting the weights of the two kinds of domain information in different scenes, the method effectively integrates the advantages of the two kinds of sensors to solve the problem of target tracking under complex conditions.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a method for visually tracking a moving target by utilizing a frame image and an event stream output by an event camera based on deep learning.
Background
Moving object tracking is an important topic in computer vision: given the size and position of an object in the first frame of a video, the tracker must locate the object in the remaining frames. Methods based on convolutional neural networks (CNNs) have shown excellent performance in this field. Most of them rely on traditional frame images (RGB or grayscale) to perform tracking, but the accuracy of frame-based trackers drops sharply under harsh conditions (e.g., low light or fast motion). To improve the robustness of trackers in such environments, multi-modal methods have gradually been proposed, e.g., using depth sensors and thermal infrared sensors, which can provide valuable additional information. However, like conventional frame-based sensors, depth and thermal infrared sensors also have limited frame rates and suffer from motion blur.
An event camera is a bionic sensor that asynchronously measures light intensity changes in a scene and outputs events. It therefore provides very high temporal resolution (up to 1 MHz) with very low power consumption. Since the intensity change is computed on a logarithmic scale, it can operate over a very high dynamic range (140 dB). An event camera triggers an "ON" or "OFF" event when the log-scale pixel intensity change rises above or falls below a threshold. Event cameras also have limitations, one of which is the inability to provide texture and color information of objects. A frame image, in contrast, easily captures rich texture and semantic information under normal conditions. Fusing the data of the two domains therefore offers a way to solve visual target tracking in challenging scenes. The related background art is described in detail below.
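The ON/OFF trigger principle described above can be sketched as follows. This is a minimal illustration, not the sensor's actual circuit model; the contrast threshold value C = 0.2 is an assumed example.

```python
import numpy as np

def trigger_events(log_prev, log_curr, C=0.2):
    """Return a polarity map: +1 (ON), -1 (OFF), 0 (no event).

    An event fires when the log-intensity change since the last
    event crosses the contrast threshold C (assumed value).
    """
    diff = log_curr - log_prev
    polarity = np.zeros_like(diff, dtype=int)
    polarity[diff >= C] = 1    # brightness increased -> ON event
    polarity[diff <= -C] = -1  # brightness decreased -> OFF event
    return polarity

# Three pixels: brighter, unchanged, darker
prev = np.log(np.array([[100.0, 100.0, 100.0]]))
curr = np.log(np.array([[130.0, 100.0, 75.0]]))
pol = trigger_events(prev, curr)  # one ON, no event, one OFF
```

Because the comparison is made in log space, the same relative contrast change triggers an event regardless of absolute brightness, which is what gives the sensor its high dynamic range.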
(1) Single domain tracking based on frame images
At present, mainstream single-domain tracking methods on frame images are mainly based on deep learning, including tracking models built on pre-trained deep features, tracking models with offline-trained features, and tracking models that fuse correlation filtering into a neural network. Existing algorithms extract image features either through designed convolution modules or through hand-crafted descriptors that exploit cues such as the texture and color of the target in the frame image, and finally derive observation features to output the target position. Although single-domain tracking algorithms have achieved good results on the relevant data sets, for challenging scenes such as low illumination, high dynamic contrast, and fast motion, frame-image-based single-domain target tracking algorithms still struggle to achieve satisfactory results.
(2) Multi-domain tracking
Currently mainstream multi-domain tracking algorithms fuse the frame image with depth information or with thermal infrared information. Depth information of the environment can effectively resolve mutual occlusion between targets. Thermal infrared imaging works in adverse conditions such as low illumination, rain, and fog, so combining it with frame images can improve tracking accuracy. Because the output data form of an event camera is completely different from that of the two sensors above, research on fusing the event domain and the frame domain for target tracking is still at an early exploratory stage. Considering that an event camera can provide contour information of a moving object in complex scenes, how to combine this contour cue with the abundant texture information in the frame image to improve the accuracy of target tracking is a problem worth studying.
(3) Event tracking data set
Because event cameras capture moving objects well, some studies have attempted to track using event data; however, since event data is stored in a form very different from conventional frame images, the events are usually superimposed into event frames and annotated on those frames. Hu et al. collected a large event-based tracking data set by placing an event camera in front of a display screen, recording labeled grayscale frames; but since the screen plays discrete frames, the data set cannot represent events between those frames. Mitrokhin et al. collected two event-based tracking data sets: EED and EV-IMO. EED contains only two tracking target classes, with 179 frames (7.8 seconds) of grayscale images and corresponding labels; EV-IMO additionally provides object masks and raises the annotation frequency of the event data to 200 Hz, but contains only three tracking target classes.
Disclosure of Invention
Aiming at the task of visually tracking a moving target in scenes with rapid motion and poor illumination, the invention first constructs a moving target tracking data set based on an event camera, and further provides a cross-domain-attention-based visual target tracking algorithm for accurately tracking a visual target on this data set. The invention combines the complementary strengths of frame images and event data: the frame image provides rich texture information, while the event data still provides clear object edge information in challenging scenes.
The technical scheme of the invention is as follows:
a moving target visual tracking method based on multi-source information fusion comprises the following steps:
(1) building a data set
The data set comprises 108 sequences with 21 object categories in total and covers various complex scenes such as high exposure, motion blur, and HDR. The data set provides synchronously captured event data and frame images, and target annotation boxes at up to 240 Hz. By target category, the data set is divided into animals, vehicles, and everyday items (e.g., bottles and boxes). By scene, it is divided into low light, high dynamic range (HDR), fast motion with motion blur, and fast motion without motion blur. By camera motion and number of objects, it is divided into four scenes: single object with a stationary camera, single object with a moving camera, multiple objects with a stationary camera, and multiple objects with a moving camera;
(2) frame image feature extractor (FFE)
The frame image feature extractor extracts features from the frame images. ResNet18 is adopted as the extractor, and the outputs of Block 4 and Block 5 of ResNet18 are taken as the low-level and high-level frame features, respectively;
(3) event image feature extraction module (EFE)
The event image feature extraction module extracts features from event images stacked from the event data. The event data is expressed as

E = {e_k}, e_k = (x_k, y_k, t_k, p_k)

where (x_k, y_k) is the pixel coordinate of the event, t_k is the timestamp of the event, and p_k = ±1 is the polarity of the event. To feed the captured asynchronous events into the event data feature extraction module, the events are first superimposed to form event images as follows: (a) the events between two adjacent frame images are aggregated into N three-dimensional slices, which discretizes the events; (b) for each slice of events, an event image is generated as

E_i(x, y) = Σ_k p_k · δ(x − x_k, y − y_k), t_k ∈ [T_j + (i − 1)B, T_j + iB]

where i denotes the i-th branch, δ denotes the Dirac function, T_j is the timestamp corresponding to the j-th frame image in the frame domain, and B is the slice size in the time domain, defined as B = (T_{j+1} − T_j)/2;
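The stacking procedure in steps (a) and (b) can be sketched as follows. This is a numpy sketch under stated assumptions: events are split into N equal time slices between the two frame timestamps, and the sensor resolution (346×260, as on a DAVIS346) and the event-tuple layout (x, y, t, p) are assumptions for illustration.

```python
import numpy as np

def stack_events(events, t_start, t_end, N=3, H=260, W=346):
    """events: array of (x, y, t, p) rows; returns N stacked event images.

    Each slice accumulates event polarities at their pixel locations,
    a discrete version of the Dirac-delta sum in the text.
    """
    images = np.zeros((N, H, W), dtype=np.float32)
    edges = np.linspace(t_start, t_end, N + 1)  # slice boundaries
    for x, y, t, p in events:
        i = min(int(np.searchsorted(edges, t, side="right")) - 1, N - 1)
        if 0 <= i < N:
            images[i, int(y), int(x)] += p  # add polarity at (x, y)
    return images

# Three events spread over the inter-frame interval [0, 1]
events = np.array([[10, 20, 0.1, 1], [10, 20, 0.5, -1], [30, 40, 0.9, 1]])
imgs = stack_events(events, 0.0, 1.0)
```

Each of the N images then feeds one branch of the EFE, so earlier and later events between the two frames are kept in separate temporal channels rather than being blurred together.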
The event image feature extraction module likewise extracts low-level and high-level features of the event images. Meanwhile, in order to effectively aggregate event images from different time slices, the method fuses the N feature maps through learnable parameters w = (w_1, …, w_N):

e = ⊕_{i=1}^{N} (w_i · e_i)

where ⊕ denotes pixel-wise addition, e_i is the output of the i-th branch, and w_i is its weight, obtained through network training;
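The weighted aggregation of the N branch outputs can be sketched as follows. The weight values here are fixed stand-ins for parameters that would be learned during training; they are assumptions for illustration.

```python
import numpy as np

def fuse_branches(branch_outputs, weights):
    """Pixel-wise weighted addition of N branch feature maps."""
    assert len(branch_outputs) == len(weights)
    fused = np.zeros_like(branch_outputs[0])
    for e_i, w_i in zip(branch_outputs, weights):
        fused += w_i * e_i  # each branch contributes with its learned weight
    return fused

# Three branch outputs (constant maps for clarity) and stand-in weights
e = [np.full((2, 2), v, dtype=np.float32) for v in (1.0, 2.0, 3.0)]
w = [0.5, 0.3, 0.2]
out = fuse_branches(e, w)  # 0.5*1 + 0.3*2 + 0.2*3 = 1.7 everywhere
```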
Meanwhile, in order to better extract the effective information of the events, the method provides an edge attention module (EAB), which completes the efficient extraction of edge information through an adaptive attention mechanism:

κ_i = (σ(ψ_{1×1}(P(e_i))) ⊗ e_i) ⊕_c e_i

where σ is the Sigmoid activation function, ψ_{1×1} denotes a 1×1 convolution, κ_i denotes the output feature of the edge attention module in the i-th branch, ⊗ denotes element-wise multiplication, and ⊕_c and P(·) denote channel-wise addition and adaptive average pooling, respectively;
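Since the EAB's exact wiring is not reproduced in this text, the following is a simplified channel-attention sketch built only from the ingredients the symbol list names: adaptive average pooling, a 1×1 convolution (here a per-channel scalar weight stands in for it), a Sigmoid gate, and element-wise modulation of the feature map. All concrete values are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_attention(feature, conv_weight):
    """feature: (C, H, W); conv_weight: (C,) stands in for a 1x1 conv."""
    pooled = feature.mean(axis=(1, 2))       # adaptive average pooling -> (C,)
    gate = sigmoid(conv_weight * pooled)     # 1x1 conv + Sigmoid gate
    return feature * gate[:, None, None]     # element-wise modulation

# Two channels: the gate suppresses channel 0 (gate 0.5), passes channel 1
feat = np.ones((2, 4, 4), dtype=np.float32)
out = edge_attention(feat, np.array([0.0, 100.0]))
```

The gate rescales each channel by a learned reliability score, so channels carrying strong edge responses can be emphasized over noisy ones.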
(4) cross-domain information modulation selection module (CDMS)
The cross-domain information modulation selection module fuses information between the event domain and the frame domain and is designed based on the following observations: (1) a traditional frame camera effectively captures texture and semantic information, while an event camera extracts object edge information well, so the advantages of both can be exploited simultaneously; (2) for a traditional frame camera, imaging quality degrades sharply in scenes with low illumination or fast object motion, whereas an event camera is unaffected, in which case the event information is more reliable than the frame information; (3) in scenes where multiple objects move simultaneously, objects are hard to distinguish from edge information alone, and the texture differences between objects then provide effective assistance. The cross-domain information modulation selection module fuses features of the two domains through a cross-domain attention mechanism. Specifically, for two kinds of information D_1 and D_2 from different domains, D_1 refers to features from the frame domain, D_2 refers to features from the event domain, and the two are fused through the channel splicing operation [·], the domain-specific feature extraction processes F_f(·) and F_e(·) for the frame domain and the event domain, respectively, and the 3×3 and 5×5 convolutions ψ_{3×3} and ψ_{5×5}.
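The cross-domain fusion idea can be sketched as follows: concatenate frame features D1 and event features D2 along the channel axis, derive a reliability weight with a Sigmoid gate, and blend the two domains accordingly. The scalar gating function here is a deliberately simplified stand-in for the 3×3/5×5 convolution branches of the actual CDMS; `score_w` is a hypothetical learned parameter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cdms_fuse(D1, D2, score_w):
    """D1, D2: (C, H, W) frame/event features; score_w: learned (2C,) vector."""
    cat = np.concatenate([D1, D2], axis=0)        # channel splicing [.]
    s = sigmoid(score_w @ cat.mean(axis=(1, 2)))  # reliability of the frame domain
    return s * D1 + (1.0 - s) * D2                # modulated cross-domain selection

D1 = np.ones((2, 4, 4))   # frame-domain features
D2 = np.zeros((2, 4, 4))  # event-domain features
fused = cdms_fuse(D1, D2, np.zeros(4))  # zero weights -> s = 0.5, equal blend
```

In a low-light scene the learned gate would push s toward 0, favoring the event features; in a multi-object textured scene it would push s toward 1, favoring the frame features.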
The invention has the beneficial effects that:
(1) event data set in large challenging scenarios
The target tracking task based on deep learning relies on large annotated data sets; however, since the output of an event camera is an asynchronous stream, event data is difficult to label. This patent uses the VICON motion capture system to capture the motion of objects, obtaining high-frequency bounding-box annotations of the moving target, and in this way creates a single object tracking data set based on the DAVIS346 event camera. This data set will facilitate subsequent research on event-camera-based tracking algorithms.
(2) Fusion of frame and event fields
Due to the asynchrony of event data, the method differs from current RGB-D and RGB-T feature fusion approaches and, for the first time, explores fusing RGB data with event data. The CDFI module proposed in this patent effectively extracts features from the frame domain and the event domain through a mutual-assistance mechanism and fuses them according to their reliability. By separately setting the weights of the two kinds of domain information in different scenes, the method effectively integrates the advantages of the two kinds of sensors to solve the problem of target tracking under complex conditions.
Drawings
Fig. 1 is a structural diagram of the cross-domain feature integration module (CDFI) according to the present invention, which comprises the frame image feature extraction module (FFE), the event image feature extraction module (EFE), and the cross-domain information modulation selection module (CDMS).
FIG. 2 is a flow diagram of target tracking based on event and frame image fusion.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments, but the present invention is not limited to the specific embodiments.
A visual tracking method for a moving target based on fused event-domain and frame-domain information features comprises data set construction, network model training, and testing.
(1) Training data set generation
In order to annotate the moving target in the grayscale frames and event stream of the event camera, two steps need to be completed: camera coordinate system transformation and target anchor point coordinate transformation.
To achieve coordinate system conversion between the event camera and the VICON system, we first determine the event camera matrix K and the distortion coefficients d using a calibration plate. The rotation vector r and translation vector t of the DAVIS346 are then obtained by

r, t = S(K, d, p_i, P_i),  i = 1, 2, …, 25    (10)

where S represents the SolvePnP method, p_i is a set of 2D points on the grayscale image from the APS, and P_i is the corresponding set of 3D points on the target from Vicon. To obtain p_i and P_i, we use a T-shaped calibration frame that both Vicon and the APS can track. The frame carries 5 infrared markers; placing it at 5 different positions yields a total of 25 paired p_i and P_i. The 3D points captured by Vicon are converted to 2D point coordinates in the grayscale frame image by

s·[x_j y_j 1]^T = K·(R(r)·[X_j Y_j Z_j]^T + t)    (11)

where [x_j y_j 1]^T and [X_j Y_j Z_j]^T represent the 2D and 3D coordinates of the j-th marker on the object, respectively, R(r) is the rotation matrix corresponding to r, and s is a scale factor. From these projections, the annotation bounding box of the target is obtained from the maxima and minima of all 2D points:

x_l = min_j x_j,  y_l = min_j y_j,  x_r = max_j x_j,  y_r = max_j y_j    (12)

where (x_l, y_l) is the upper-left corner of the annotation box and (x_r, y_r) is the lower-right corner. The width w and height h of the bounding box are then

w = x_r − x_l,  h = y_r − y_l    (13)
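The projection and box computation described above can be sketched as follows. The pinhole projection follows equation (11) and the box follows (12)-(13); the camera matrix, pose, and marker coordinates below are invented numbers for illustration, not calibration results from the patent.

```python
import numpy as np

def project_points(K, R, t, pts3d):
    """Project 3D marker points into the image: s*[x y 1]^T = K(R X + t)."""
    cam = R @ pts3d.T + t[:, None]   # marker points in the camera frame
    uv = K @ cam
    return (uv[:2] / uv[2]).T        # divide by depth -> 2D pixel coordinates

def bbox_from_points(pts2d):
    """Axis-aligned box over all projected markers, eqs. (12)-(13)."""
    xl, yl = pts2d.min(axis=0)       # upper-left corner
    xr, yr = pts2d.max(axis=0)       # lower-right corner
    return xl, yl, xr - xl, yr - yl  # (x, y, w, h)

# Assumed intrinsics and identity pose; two markers 2 m in front of the camera
K = np.array([[300.0, 0.0, 173.0],
              [0.0, 300.0, 130.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
pts3d = np.array([[0.0, 0.0, 2.0], [0.2, 0.1, 2.0]])
pts2d = project_points(K, R, t, pts3d)
x, y, w, h = bbox_from_points(pts2d)
```

In practice R and t would come from SolvePnP as in equation (10), and the box would be recomputed at the Vicon rate to give the high-frequency annotations.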
(2) network training
For the FFE, the parameters are initialized with a ResNet18 model pre-trained on the ImageNet data set. For the EFE, N is set to 3. The batch size is set to 26. The network is trained with the Adam optimizer; the number of iterations is set to 50, and the learning rate is decayed by a factor of 0.2 every 15 iterations. The learning rates of the classifier, the bounding-box regressor, and the CDFI are set to 0.001, 0.001, and 0.0001, respectively.
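The stated step schedule (start at 0.001, multiply by 0.2 every 15 iterations, 50 iterations total) works out as follows; shown for the classifier learning rate.

```python
def lr_at(iteration, base_lr=0.001, gamma=0.2, step=15):
    """Step learning-rate decay: multiply by gamma every `step` iterations."""
    return base_lr * gamma ** (iteration // step)

# Full 50-iteration schedule: 0.001 -> 0.0002 -> 4e-05 -> 8e-06
schedule = [lr_at(i) for i in range(50)]
```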
Claims (1)
1. A moving target visual tracking method based on multi-source information fusion is characterized by comprising the following steps:
(1) building a data set
The data set comprises 108 sequences with 21 object categories in total and covers various complex scenes such as high exposure, motion blur, and HDR. The data set provides synchronously captured event data and frame images, and target annotation boxes at up to 240 Hz. By target category, the data set is divided into animals, vehicles, and everyday goods. By scene, it is divided into low light, high dynamic range, fast motion with motion blur, and fast motion without motion blur. By camera motion and number of objects, it is divided into four scenes: single object with a stationary camera, single object with a moving camera, multiple objects with a stationary camera, and multiple objects with a moving camera;
(2) frame image feature extractor FFE
The frame image feature extractor extracts features from the frame images. ResNet18 is adopted as the extractor, and the outputs of Block 4 and Block 5 of ResNet18 are taken as the low-level and high-level frame features, respectively;
(3) event image feature extraction module EFE
The event image feature extraction module extracts features from event images stacked from the event data. The event data is expressed as

E = {e_k}, e_k = (x_k, y_k, t_k, p_k)

where (x_k, y_k) is the pixel coordinate of the event, t_k is the timestamp of the event, and p_k = ±1 is the polarity of the event. To feed the captured asynchronous events into the event data feature extraction module, the events are first superimposed to form event images as follows: (a) the events between two adjacent frame images are aggregated into N three-dimensional slices, which discretizes the events; (b) for each slice of events, an event image is generated as

E_i(x, y) = Σ_k p_k · δ(x − x_k, y − y_k), t_k ∈ [T_j + (i − 1)B, T_j + iB]

where i denotes the i-th branch, δ denotes the Dirac function, T_j is the timestamp corresponding to the j-th frame image in the frame domain, and B is the slice size in the time domain, defined as B = (T_{j+1} − T_j)/2;
The event image feature extraction module likewise extracts low-level and high-level features of the event images. Meanwhile, in order to effectively aggregate event images from different time slices, the method fuses the N feature maps through learnable parameters w = (w_1, …, w_N):

e = ⊕_{i=1}^{N} (w_i · e_i)

where ⊕ denotes pixel-wise addition, e_i is the output of the i-th branch, and w_i is its weight, obtained through network training;
Meanwhile, in order to better extract the effective information of the events, the method provides an edge attention module EAB, which completes the efficient extraction of edge information through an adaptive attention mechanism:

κ_i = (σ(ψ_{1×1}(P(e_i))) ⊗ e_i) ⊕_c e_i

where σ is the Sigmoid activation function, ψ_{1×1} denotes a 1×1 convolution, κ_i denotes the output feature of the edge attention module in the i-th branch, ⊗ denotes element-wise multiplication, and ⊕_c and P(·) denote channel-wise addition and adaptive average pooling, respectively;
(4) cross-domain information modulation selection module CDMS
The cross-domain information modulation selection module fuses information between the event domain and the frame domain through a cross-domain attention mechanism. Specifically, for two kinds of information D_1 and D_2 from different domains, D_1 refers to features from the frame domain, D_2 refers to features from the event domain, and the two are fused through the channel splicing operation [·], the domain-specific feature extraction processes F_f(·) and F_e(·) for the frame domain and the event domain, respectively, and the 3×3 and 5×5 convolutions ψ_{3×3} and ψ_{5×5}.
Priority Applications (1)
- CN202110015551.5A (granted as CN112686928B), priority/filing date 2021-01-07: Moving target visual tracking method based on multi-source information fusion
Publications (2)
- CN112686928A (publication): 2021-04-20
- CN112686928B (grant): 2022-10-14
Family
- ID=75456139; CN112686928B (CN202110015551.5A), status: Active
Cited By (7)
- CN113269699A (2021-08-17): Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image
- CN113378917A (2021-09-10): Event camera target identification method based on self-attention mechanism
- CN115631407A (2023-01-20): Underwater transparent biological detection based on event camera and color frame image fusion
- CN116188533A (2023-05-30): Feature point tracking method and device and electronic equipment
- CN116206196A (2023-06-02): Ocean low-light environment multi-target detection method and detection system thereof
- CN116309781A (2023-06-23): Cross-modal fusion-based underwater visual target ranging method and device
- CN117808847A (2024-04-02): Space non-cooperative target feature tracking method integrating bionic dynamic vision
Citations (2)
- CN110148159A (2019-08-20, Xiamen University): Asynchronous target tracking method based on an event camera
- CN112037269A (2020-12-04, Dalian University of Technology): Visual moving target tracking method based on multi-domain collaborative feature expression
Non-Patent Citations (1)
- Wang Yi et al., "Multi-target adaptive visual tracking method based on DST-PCR5", Application Research of Computers (《计算机应用研究》)
Also Published As
- CN112686928B: 2022-10-14
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant