CN115484410B - Event camera video reconstruction method based on deep learning - Google Patents

Event camera video reconstruction method based on deep learning

Info

Publication number
CN115484410B
Authority
CN
China
Prior art keywords
event
features
feature
scale
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211121596.1A
Other languages
Chinese (zh)
Other versions
CN115484410A (en)
Inventor
杨敬钰
潘锦蔚
岳焕景
李坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202211121596.1A priority Critical patent/CN115484410B/en
Publication of CN115484410A publication Critical patent/CN115484410A/en
Application granted granted Critical
Publication of CN115484410B publication Critical patent/CN115484410B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/44Event detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8146Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Graphics (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an event camera video reconstruction method based on deep learning, belonging to the field of digital image processing. The event camera is a novel neuromorphic sensor characterized by a high dynamic range, high temporal resolution and low power consumption. Because the output of an event camera is not the visually friendly image that people normally observe, video reconstruction is one of its key visual applications. Existing event camera video reconstruction methods produce poor image quality in the early stage of reconstruction and require a long initialization time. To address this, the invention designs a new deep neural network comprising spatio-temporal Transformer, ConvLSTM and CNN modules, which extracts information from unaligned spatio-temporal events in several adjacent time periods and jointly generates one gray-scale frame through the complementary advantages of these modules. The method initializes quickly, reconstructs high-quality images both in the early stage of the video and throughout the video, and the reconstructed video has a good visual effect.

Description

Event camera video reconstruction method based on deep learning
Technical Field
The invention belongs to the field of digital image processing, and particularly relates to an event camera video reconstruction method based on deep learning and computer vision.
Background
The event camera is a novel sensor inspired by the biological retina; its working principle and underlying circuit design paradigm are completely different from those of a traditional camera. The event camera features a high dynamic range, high temporal resolution and low power consumption, and has broad application prospects in fields involving high-speed motion or extreme illumination, such as autonomous driving, visual navigation of unmanned aerial vehicles, and security monitoring.
Unlike an intensity camera, which synchronously outputs intensity values for every pixel of the pixel plane during an exposure time, the event camera outputs data at each pixel location asynchronously and records only the relative change in brightness. Because the output of an event camera is not the gray-scale or color image that people normally observe, reconstructing the event points output by an event camera into visually friendly images and videos is one of its visual applications. The asynchronous triggering and transmission of events makes them non-Euclidean data, so existing image reconstruction methods are difficult to apply directly to event camera image reconstruction, and new image and video reconstruction algorithms need to be studied for the characteristics of the event camera.
Current event camera image or video reconstruction algorithms fall into two main categories: methods based on conventional image processing and methods based on deep learning. Conventional image-processing methods mainly model the differential characteristics of the event camera and estimate the intensity value at every pixel position through integration or filtering. Deep-learning-based methods achieve better results than conventional image methods. Such methods typically use ConvLSTM to introduce a long-term estimate of image intensity and thereby better model the differential nature of events. However, the introduction of ConvLSTM and the spatial sparsity of events mean that event camera video reconstruction algorithms require an initialization time of several frames to several tens of frames; that is, in the initial stage of video reconstruction the image quality is poor, typically with weak texture detail and low global contrast.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by enhancing the initial imaging quality of the video and reducing the initialization time while ensuring the overall imaging quality of the video, and therefore provides an event camera video reconstruction method based on deep learning.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the event camera video reconstruction method based on deep learning is realized with a deep neural network: an encoder-decoder deep neural network is designed by combining ConvLSTM, a spatio-temporal Transformer, a spatio-temporal feature alignment unit, 2D/3D convolutions and other modules, and one gray-scale frame is jointly generated using the information of unaligned spatio-temporal events in several adjacent time periods (corresponding to multiple exposure times of an intensity camera); the specific method comprises the following steps:
S1, acquiring events and preprocessing them into event frames;
S2, inputting the preprocessed original-scale event frames into a shared feature extraction module to extract main features and sub-features;
S3, inputting the main features into a feature offset estimation module to obtain the feature offsets of adjacent frames;
S4, inputting the main features into a spatio-temporal Transformer module embedded with ConvLSTM for feature encoding and decoding;
S5, resetting the encoded main features according to the feature offsets to realize feature alignment;
S6, inputting the reset main features into a SPADE Normalization module;
S7, inputting the main features into a 3D CNN module for feature decoding, adding the sub-features to compensate for lost information;
S8, downsampling the features to 1/2 and 1/4 scales to obtain main features at the 1/2 and 1/4 scales, extracting sub-features of the 1/2-scale and 1/4-scale events through the shared feature extraction module, and repeating operations S3-S7;
S9, upsampling the decoded 1/2-scale and 1/4-scale main features to the original scale via pixel shuffle and fusing them to obtain the reconstructed image;
S10, computing the network loss function between the reconstruction result obtained in S9 and the ground-truth image of the original scene, and performing back-propagation.
Preferably, the event preprocessing mentioned in S1 specifically comprises the following: event points between two frames of reference images are selected and stacked into a spatio-temporal voxel grid; for an event stream containing k events within a time period Δt = t_k − t_0, each event point is mapped into the corresponding spatio-temporal voxel grid according to the following formula:
where t_i denotes the timestamp of the i-th event; C denotes the number of channels of the spatio-temporal voxel grid; (x_i, y_i) denotes the coordinates of the i-th event; p_i = ±1 denotes the event polarity; and t_n denotes the channel index time of the spatio-temporal voxel grid;
event streams from an odd number T of adjacent time periods are selected and stacked into spatio-temporal voxel grids I′ ∈ R^(B×T×C×H×W), where T comprises 1 intermediate frame at the central position and T−1 adjacent frames around it.
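As an illustration of this preprocessing step, the following is a minimal PyTorch-style sketch of stacking one event stream into a spatio-temporal voxel grid, assuming the common bilinear weighting along the channel (time) axis; the formula referenced above is not reproduced in the text, so the weighting kernel and the event tensor layout used here are assumptions.

import torch

def events_to_voxel_grid(events, num_channels, height, width):
    # events: float tensor of shape (k, 4) with columns (t_i, x_i, y_i, p_i),
    # sorted by timestamp, p_i in {-1, +1}.  The bilinear weighting along the
    # channel (time) axis is an assumption.
    grid = torch.zeros(num_channels, height, width)
    t = events[:, 0]
    x = events[:, 1].long()
    y = events[:, 2].long()
    p = events[:, 3]

    # Normalize timestamps to the channel axis [0, C-1] over dt = t_k - t_0.
    t_norm = (num_channels - 1) * (t - t[0]) / (t[-1] - t[0] + 1e-9)
    t_low = t_norm.floor().long().clamp(0, num_channels - 1)
    t_high = (t_low + 1).clamp(0, num_channels - 1)
    w_high = t_norm - t_low.float()
    w_low = 1.0 - w_high

    # Scatter each event's polarity into its two neighbouring channels.
    flat = grid.view(num_channels, -1)
    idx = y * width + x
    flat.index_put_((t_low, idx), p * w_low, accumulate=True)
    flat.index_put_((t_high, idx), p * w_high, accumulate=True)
    return grid

For an odd number T of adjacent time periods, T such grids would be stacked along an extra axis to obtain I′ ∈ R^(B×T×C×H×W).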
Preferably, the shared feature extraction module is formed by combining two ordinary convolution modules; four shared feature extraction modules respectively extract the main features at the original scale, the sub-features at the original scale, the 1/2-scale sub-features and the 1/4-scale sub-features.
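A minimal sketch of such a shared feature extraction module is given below; the kernel sizes, activation and channel widths are assumptions, since the text only states that two ordinary convolution modules are combined and that the weights are shared across time periods.

import torch.nn as nn

class SharedFeatureExtractor(nn.Module):
    # Two ordinary convolution blocks; weights are shared across the T time
    # periods by folding the time axis into the batch axis.  Kernel sizes,
    # activation and channel widths are assumptions.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.LeakyReLU(0.1),
        )

    def forward(self, x):                       # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        y = self.body(x.reshape(b * t, c, h, w))
        return y.view(b, t, -1, h, w)

Four instances of this module would respectively produce the original-scale main features and the sub-features at the original, 1/2 and 1/4 scales.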
Preferably, the feature offset estimation module uses an optical flow estimation model pre-trained on real scenes, and the module updates its parameters during network training, so that after training the module is migrated to the feature offset estimation space, thereby realizing feature offset estimation for events.
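A hedged sketch of how such a module could be wrapped is shown below; flow_net stands in for the pre-trained optical flow model (its call signature is an assumption), and its parameters are left trainable so that it can migrate to the feature offset estimation space during training.

import torch.nn as nn

class FeatureOffsetEstimator(nn.Module):
    # `flow_net` stands in for an optical flow model pre-trained on real
    # scenes; its call signature (two feature maps in, a 2-channel offset
    # field out) is an assumption.  Parameters are left trainable so the
    # module can migrate to the feature offset estimation space.
    def __init__(self, flow_net):
        super().__init__()
        self.flow_net = flow_net

    def forward(self, feat_neighbor, feat_center):
        return self.flow_net(feat_neighbor, feat_center)   # (B, 2, H, W) offsets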
Preferably, the S4 specifically includes the following:
S4.1, applying two group convolutions to the extracted main features to extract the Q value and the K value respectively, with the following formulas:
Q = W_Q * F_m (2)
K = W_K * F_m (3)
where * denotes the convolution operation;
S4.2, extracting the V-value features from the main features using ConvLSTM, with the following formula:
where σ denotes the sigmoid function, and [·, ·] denotes the concatenation of two features;
S4.3, after the extracted Q, K and V values are expanded, feature encoding is performed sequentially through the self-attention module and the multi-layer perceptron module, with the following formulas:
F_O = MLP(SA) (6)
where D denotes the feature dimension; MLP denotes a multi-layer perceptron containing multiple convolutional or fully connected layers.
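The following is a minimal PyTorch-style sketch of the ConvLSTM-embedded spatio-temporal Transformer block described in S4.1–S4.3; the ConvLSTM cell interface, the group count and the number of attention heads are assumptions, while the group convolutions for Q and K, the ConvLSTM-derived V, and the self-attention followed by an MLP follow the text.

import torch.nn as nn

class SpatioTemporalTransformerBlock(nn.Module):
    # Q and K come from two group convolutions, V from a ConvLSTM cell,
    # followed by self-attention and an MLP (S4.1-S4.3).  The ConvLSTM cell
    # interface, group count and head count are assumptions.
    def __init__(self, channels, conv_lstm_cell, groups=4, heads=4):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.to_k = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.to_v = conv_lstm_cell                      # keeps its own hidden state
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(channels, channels), nn.GELU(),
                                 nn.Linear(channels, channels))

    def forward(self, f_m, state):                      # f_m: (B, C, H, W)
        b, c, h, w = f_m.shape
        q = self.to_q(f_m).flatten(2).transpose(1, 2)   # (B, H*W, C)
        k = self.to_k(f_m).flatten(2).transpose(1, 2)
        v, state = self.to_v(f_m, state)                # ConvLSTM provides V
        v = v.flatten(2).transpose(1, 2)
        sa, _ = self.attn(q, k, v)                      # softmax(Q K^T / sqrt(D)) V
        f_o = self.mlp(sa)                              # F_O = MLP(SA)
        return f_o.transpose(1, 2).reshape(b, c, h, w), state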
Preferably, the feature alignment mentioned in S5 is specifically: the position features corresponding to the adjacent frames are reset to the positions corresponding to the intermediate frame according to the feature offsets of the adjacent frames obtained in S3, thereby realizing feature alignment.
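A sketch of this alignment step is given below, assuming the "reset" of neighbouring-frame features is realized as a warp driven by the per-pixel feature offsets; the use of grid_sample is an assumption.

import torch
import torch.nn.functional as F

def align_to_center(feat_neighbor, offset):
    # Warp a neighbouring frame's features to the intermediate frame's
    # positions using the per-pixel offsets (dx, dy) in pixels; the use of
    # grid_sample for the "reset" operation is an assumption.
    b, _, h, w = feat_neighbor.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=offset.device),
                            torch.arange(w, device=offset.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + offset[:, 0]
    grid_y = ys.unsqueeze(0) + offset[:, 1]
    grid = torch.stack([2 * grid_x / (w - 1) - 1,       # normalize to [-1, 1]
                        2 * grid_y / (h - 1) - 1], dim=-1)
    return F.grid_sample(feat_neighbor, grid, align_corners=True)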
Preferably, the step S6 specifically includes the following:
S6.1, performing parameter-free batch normalization on the input reset main features, with the following formula:
where μ denotes the mean and σ denotes the standard deviation;
S6.2, the reconstruction result of the previous frame is dimension-expanded to the main-feature dimension using a convolution W_s, with the following calculation formula:
S6.3, applying convolutions W_γ and W_β to generate the normalization coefficient and bias term for the main features, with the following formula:
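Putting S6.1–S6.3 together, a minimal sketch of the SPADE Normalization module could look as follows; the hidden channel width and the assumption that the previous reconstruction is a single-channel gray-scale image are illustrative choices.

import torch.nn as nn

class SpadeNormalization(nn.Module):
    # S6.1: parameter-free batch normalization of the aligned main features;
    # S6.2: one convolution expands the previous frame's reconstruction;
    # S6.3: two convolutions generate the scaling coefficient and bias term.
    # The hidden width and the single-channel previous frame are assumptions.
    def __init__(self, feat_channels, hidden_channels=64):
        super().__init__()
        self.bn = nn.BatchNorm2d(feat_channels, affine=False)
        self.expand = nn.Sequential(nn.Conv2d(1, hidden_channels, 3, padding=1),
                                    nn.ReLU())                                    # W_s
        self.to_gamma = nn.Conv2d(hidden_channels, feat_channels, 3, padding=1)   # W_gamma
        self.to_beta = nn.Conv2d(hidden_channels, feat_channels, 3, padding=1)    # W_beta

    def forward(self, feat, prev_recon):        # prev_recon: (B, 1, H, W)
        normed = self.bn(feat)
        s = self.expand(prev_recon)
        return normed * self.to_gamma(s) + self.to_beta(s)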
Preferably, S7 specifically comprises the following: the previously extracted sub-features are concatenated with the output features; a 2D convolution module first fuses them, several stacked 3D convolutions then form a module that decodes the features, and a Leaky ReLU provides the nonlinearity between the 3D convolution modules, with the following formula:
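An illustrative sketch of this decoding step is given below; the number of 3D convolution blocks and the channel widths are assumptions, while the 2D fusion convolution, the stacked 3D convolutions and the Leaky ReLU nonlinearity follow the text.

import torch
import torch.nn as nn

class Decoder3D(nn.Module):
    # Concatenate sub-features with the decoded features, fuse with a 2D
    # convolution, then decode with stacked 3D convolutions separated by
    # Leaky ReLU (S7).  Block count and channel widths are assumptions.
    def __init__(self, main_channels, sub_channels, n_blocks=3):
        super().__init__()
        self.fuse2d = nn.Conv2d(main_channels + sub_channels, main_channels, 3, padding=1)
        self.blocks3d = nn.ModuleList(
            nn.Conv3d(main_channels, main_channels, 3, padding=1) for _ in range(n_blocks))
        self.act = nn.LeakyReLU(0.1)            # the embodiment uses alpha = 0.1

    def forward(self, feat, sub_feat):          # both: (B, T, C, H, W)
        b, t, _, h, w = feat.shape
        x = torch.cat([feat, sub_feat], dim=2).reshape(b * t, -1, h, w)
        x = self.fuse2d(x).view(b, t, -1, h, w)
        x = x.permute(0, 2, 1, 3, 4)            # (B, C, T, H, W) for Conv3d
        for conv in self.blocks3d:
            x = self.act(conv(x))
        return x.permute(0, 2, 1, 3, 4)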
Preferably, S9 specifically comprises the following: the main features output at the 1/2 and 1/4 scales through S2-S7 are upsampled to the original scale using the pixel shuffle operation, and all main features are fused using one ConvLSTM and several convolution layers to finally obtain the reconstructed gray-scale image.
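A minimal sketch of this multi-scale fusion is shown below; the ConvLSTM cell interface and the fusion head depth are assumptions, while the pixel-shuffle upsampling of the 1/2- and 1/4-scale features and the ConvLSTM-plus-convolution fusion follow the text.

import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    # Upsample the 1/2- and 1/4-scale main features with pixel shuffle, then
    # fuse all scales with one ConvLSTM and a few 2D convolutions to produce
    # the gray-scale frame (S9).  The ConvLSTM cell interface and the fusion
    # head depth are assumptions.
    def __init__(self, channels, conv_lstm_cell):
        super().__init__()
        self.up_half = nn.Sequential(nn.Conv2d(channels, channels * 4, 3, padding=1),
                                     nn.PixelShuffle(2))     # 1/2 scale -> original
        self.up_quarter = nn.Sequential(nn.Conv2d(channels, channels * 16, 3, padding=1),
                                        nn.PixelShuffle(4))   # 1/4 scale -> original
        self.lstm = conv_lstm_cell
        self.head = nn.Sequential(nn.Conv2d(3 * channels, channels, 3, padding=1),
                                  nn.LeakyReLU(0.1),
                                  nn.Conv2d(channels, 1, 3, padding=1))

    def forward(self, f_full, f_half, f_quarter, state):
        fused = torch.cat([f_full, self.up_half(f_half),
                           self.up_quarter(f_quarter)], dim=1)
        fused, state = self.lstm(fused, state)                # assumed to keep 3*channels
        return self.head(fused), state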
Preferably, the loss function mentioned in S10 includes an L1 loss function, a perceptual loss function and a time consistency loss function, and the sum of the three loss functions is taken as the final loss of the network, and the specific formula is as follows:
wherein, gamma TC Set to 5; the perception loss selects the first five hidden layer outputs of VGG-19 after the pre-training of the ImageNet data set to calculate the L1 distance, i.e. L is taken 0 =5; the weight of each hidden layer is set to 1, i.e. w l =1, Representing the result of aligning the previous frame to the current frame by two according to the feature offset.
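A hedged sketch of this combined loss is given below; the exact VGG-19 layer cut points, the omitted input normalization and the warped previous frame supplied by the caller are assumptions, while the L1 term, the five perceptual stages with unit weights and the γ_TC-weighted temporal consistency term follow the text.

import torch.nn as nn
import torchvision.models as models

class ReconstructionLoss(nn.Module):
    # L1 + perceptual (first five VGG-19 stages, ImageNet-pretrained, unit
    # weights) + temporal consistency weighted by gamma_tc.  The VGG layer
    # cut points and the warped previous frame passed by the caller are
    # assumptions.
    def __init__(self, gamma_tc=5.0, num_stages=5):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        cuts = [4, 9, 18, 27, 36][:num_stages]
        self.stages = nn.ModuleList(vgg[a:b] for a, b in zip([0] + cuts[:-1], cuts))
        self.gamma_tc = gamma_tc
        self.l1 = nn.L1Loss()

    def forward(self, pred, target, pred_prev_warped=None):
        loss = self.l1(pred, target)                               # L1 term
        x, y = pred.repeat(1, 3, 1, 1), target.repeat(1, 3, 1, 1)  # gray -> 3 channels
        for stage in self.stages:                                  # perceptual term, w_l = 1
            x, y = stage(x), stage(y)
            loss = loss + self.l1(x, y)
        if pred_prev_warped is not None:                           # temporal consistency term
            loss = loss + self.gamma_tc * self.l1(pred, pred_prev_warped)
        return loss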
Compared with the prior art, the event camera video reconstruction method based on deep learning has the following beneficial effects:
the method avoids any change to the imaging hardware and adopts a post-processing approach, performing video reconstruction for the event camera through the complementary advantages of several modules; specifically:
1. The invention combines the ability of ConvLSTM to model long-term dependencies, the ability of 2D/3D convolutions to aggregate local features, and the ability of the spatio-temporal Transformer to model mid-term dependencies within adjacent time periods and global image information; with these complementary advantages it reconstructs events into gray-scale images consistent with the real scene.
2. The invention realizes the video reconstruction of the event camera, and the reconstructed video has better effect on the whole.
3. The invention has shorter imaging initialization time and higher imaging quality in the initialization period.
Drawings
FIG. 1 is a general flow chart of an event camera video reconstruction method based on deep learning according to the present invention;
FIG. 2 is a detailed flowchart of the event camera video reconstruction method based on deep learning according to the present invention;
Fig. 3 compares the reconstruction results of the first four or five frames of the video reconstructed by the event camera video reconstruction method based on deep learning with other methods, where 1) is the reconstruction result of the method proposed by Henri Rebecq et al. in the paper "High Speed and High Dynamic Range Video with an Event Camera", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020; 2) is the reconstruction result of the method proposed by Pablo Rodrigo Gantier Cadena et al. in the paper "SPADE-E2VID: Spatially-Adaptive Denormalization for Event-Based Video Reconstruction", IEEE Transactions on Image Processing, 2021; and 3) is the real image of the original scene captured by an ordinary camera.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings; obviously, the described embodiments are only some, not all, of the embodiments of the present invention.
The invention designs a spatio-temporal Transformer module with an embedded ConvLSTM: long-term dependencies over the whole video sequence are captured by ConvLSTM, mid-term dependencies within adjacent time periods and global image information are captured by the spatio-temporal Transformer, and local image features are learned by convolution; the complementary advantages of these modules jointly complete the image reconstruction. The specific method comprises the following steps:
example 1:
referring to fig. 1 and 2, the present invention specifically includes the following steps:
1) Constructing input data:
11) Real event camera data published on the website of the Robotics and Perception Group at the University of Zurich are used as experimental data; the simulation dataset therein is used to train and validate the network, and the real dataset therein is used to test network performance. Both the simulation data and the real data contain the event streams output by the event camera and the corresponding ground-truth images of the original scene.
12) To apply deep learning to event camera image reconstruction, the non-Euclidean event data must first be structured. Specifically, the event points between two frames of reference images (corresponding to the exposure time of an ordinary camera) are stacked into a spatio-temporal voxel grid. For an event stream containing k events within a time period Δt = t_k − t_0, we map each event point into the corresponding spatio-temporal voxel grid as follows:
where t_i denotes the timestamp of the i-th event; C denotes the number of channels of the spatio-temporal voxel grid; (x_i, y_i) denotes the coordinates of the i-th event; p_i = ±1 denotes the event polarity; and t_n denotes the channel index time of the spatio-temporal voxel grid. The invention selects event streams from T (T odd) adjacent time periods and stacks them into spatio-temporal voxel grids I′ ∈ R^(B×T×C×H×W), where T comprises 1 intermediate frame at the central position and T−1 adjacent frames around it.
2) We use a shared feature extractor to map the representations of successive frames into the same feature space through multiple convolution layers with shared weights.
21) The primary shared feature extractor is first applied to I′ at the original scale to extract the main features, which circulate as the primary information in the subsequent network.
22) A shared feature extractor is used to extract multi-scale sub-features at three scales, which supplement the main information at the corresponding scales in subsequent processing.
3) The features of all time periods, now in the same feature space, are input into the feature offset estimation module. This module uses an optical flow estimation model pre-trained on real scenes and updates its parameters during network training, so that after training it is migrated to the feature offset estimation space, thereby realizing feature offset estimation for events.
4) The extracted main features are input into the spatio-temporal Transformer module with embedded ConvLSTM.
41) Two group convolutions are applied to the extracted main features to extract the Q and K values, and ConvLSTM is applied to extract the V value, with the following formulas:
Q = W_Q * F_m (2)
K = W_K * F_m (3)
where * denotes the convolution operation, σ denotes the sigmoid function, and [·, ·] denotes the concatenation of two features;
42) After being expanded, the extracted Q, K and V values are passed sequentially through the self-attention module and the multi-layer perceptron module, with the following formulas:
F_O = MLP(SA) (6)
where D denotes the feature dimension; MLP denotes a multi-layer perceptron containing multiple convolutional or fully connected layers.
5)
51) The position features corresponding to the adjacent frames are reset to the positions corresponding to the intermediate frame according to the feature offsets of the adjacent frames obtained in step 3), realizing feature alignment.
52) The aligned features are input into the SPADE Normalization module. Specifically, the input features are first normalized by parameter-free batch normalization, with the following formula:
where μ is the mean and σ is the standard deviation.
Then, using the reconstruction result of the previous frame, one convolution first performs dimension expansion, and two convolutions respectively generate the normalization coefficient and bias term, with the following formulas:
53) The features are then input into the 3D CNN module. Specifically, the previously extracted sub-features are first concatenated with the output features; a 2D convolution module fuses them, several stacked 3D convolutions then form a module that decodes the features, and a Leaky ReLU provides the nonlinearity between the 3D convolution modules, with the following formula:
6) The outputs obtained in the above steps are downsampled to obtain feature maps at different scales, and the same series of operations is then performed again.
7) The two low-scale feature maps are upsampled to the original scale by the pixel shuffle operation, and the feature maps from different scales are fused using one ConvLSTM and several 2D convolution layers to obtain the final reconstruction result.
8) The loss function of the network is computed from the network reconstruction result and the ground-truth image of the original scene. The invention uses an L1 loss function, a perceptual loss function and a temporal consistency loss function, and the sum of the three is taken as the final loss of the network. The specific formula is as follows:
example 2:
referring to fig. 1-2, the following detailed description is based on embodiment 1, but differs therefrom, in conjunction with the accompanying drawings and specific example data:
the invention provides an event camera video reconstruction method based on deep learning (as shown in the flow of fig. 1 and 2), which designs a space-time transducer module embedded with ConvLSTM, captures long-term dependence in the whole video sequence period through ConvLSTM, captures medium-term dependence in adjacent period and global information of an image through space-time transducer, and completes image reconstruction work by utilizing advantage complementation among a plurality of modules through local characteristics of convolution learning images. The specific method comprises the following steps:
1) Constructing input data:
11) Real event camera data published on the website of the Robotics and Perception Group at the University of Zurich are used as experimental data; the simulation dataset therein is used to train and validate the network, and the real dataset therein is used to test network performance. Both the simulation data and the real data contain the event streams output by the event camera and the corresponding ground-truth images of the original scene. The training set uses 100 video sequences and the validation set uses 25 video sequences; in each training iteration, a segment of 40 frames is randomly taken from each video sequence and the corresponding 40-frame video is reconstructed; the real dataset for testing is captured with a DAVIS240C event camera with a resolution of 240×180, which outputs aligned scene event streams and gray-scale images, and 7 video sequences are selected for testing.
12) To apply deep learning to event camera image reconstruction, the non-Euclidean event data must first be structured. Specifically, the event points between two frames of reference images (corresponding to the exposure time of an ordinary camera) are stacked into a spatio-temporal voxel grid. For an event stream containing k events within a time period Δt = t_k − t_0, we map each event point into the corresponding spatio-temporal voxel grid as follows:
where t_i denotes the timestamp of the i-th event; C denotes the number of channels of the spatio-temporal voxel grid; (x_i, y_i) denotes the coordinates of the i-th event; p_i = ±1 denotes the event polarity; and t_n denotes the channel index time of the spatio-temporal voxel grid. The invention selects event streams from T (T odd) adjacent time periods and stacks them into spatio-temporal voxel grids I′ ∈ R^(B×T×C×H×W), where T comprises 1 intermediate frame at the central position and T−1 adjacent frames around it.
13) We choose 5 spatio-temporal voxel grids as the network input, each containing 5 channels; the batch size of the network is set to 2 and the resolution of the spatio-temporal voxel grid to 128×128, and one gray-scale frame is finally reconstructed, i.e. T = 5, C = 5, B = 2 and H = W = 128; the input of the network is I′ ∈ R^(B×T×C×H×W), and the network outputs the reconstructed gray-scale frame.
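For clarity, the shapes implied by this configuration can be checked with a few lines of PyTorch; the output shape of one gray-scale frame per sample is an assumption.

import torch

# Shape check for the configuration in 13): T = 5 voxel grids of C = 5 channels,
# batch size B = 2, resolution 128 x 128.
B, T, C, H, W = 2, 5, 5, 128, 128
event_input = torch.randn(B, T, C, H, W)      # I' in R^(B x T x C x H x W)
print(event_input.shape)                      # torch.Size([2, 5, 5, 128, 128])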
2) We use a shared feature extractor to map the representations of successive frames into the same feature space through multiple convolution layers with shared weights.
21) The primary shared feature extractor is first applied to I′ at the original scale to extract the main features, which circulate as the primary information in the subsequent network, where C_m = 64.
22) A shared feature extractor is used to extract multi-scale sub-features at three scales, which supplement the main information at the corresponding scales in subsequent processing, where C_s = 6.
3) The features of all time periods, now in the same feature space, are input into the feature offset estimation module. This module uses an optical flow estimation model pre-trained on real scenes and updates its parameters during network training, so that after training it is migrated to the feature offset estimation space, thereby realizing feature offset estimation for events. The invention uses a Spatial Pyramid Network as the optical flow estimation model.
4) The extracted main features are input into the spatio-temporal Transformer module with embedded ConvLSTM.
41) Two group convolutions are applied to the extracted main features to extract the Q and K values, and ConvLSTM is applied to extract the V value, with the following formulas:
Q = W_Q * F_m (2)
K = W_K * F_m (3)
where * denotes the convolution operation, σ denotes the sigmoid function, and [·, ·] denotes the concatenation of two features;
42) After being expanded, the extracted Q, K and V values are passed sequentially through the self-attention module and the multi-layer perceptron module, with the following formulas:
F_O = MLP(SA) (6)
5) The following operations are then performed:
51) The position features corresponding to the adjacent frames are reset to the positions corresponding to the intermediate frame according to the feature offsets of the adjacent frames obtained in step 3), realizing feature alignment.
52) The aligned features are input into the SPADE Normalization module. Specifically, the input features are first normalized by parameter-free batch normalization, with the following formula:
where μ is the mean and σ is the standard deviation. Then, using the reconstruction result of the previous frame, one convolution first performs dimension expansion, and two convolutions respectively generate the normalization coefficient and bias term, with the following formulas:
53) The features are then input into the 3D CNN module. Specifically, the previously extracted sub-features are first concatenated with the output features; a 2D convolution module fuses them, several stacked 3D convolutions then form a module that decodes the features, and a Leaky ReLU provides the nonlinearity between the 3D convolution modules, with the following formula:
where α=0.1.
6) The outputs obtained in the above steps are downsampled to obtain feature maps at different scales, and the same series of operations is then performed again.
7) The two low-scale feature maps are upsampled to the original scale by the pixel shuffle operation, and the feature maps from different scales are fused using one ConvLSTM and several 2D convolution layers to obtain the final reconstruction result.
8) The loss function of the network is computed from the network reconstruction result and the ground-truth image of the original scene. The invention uses an L1 loss function, a perceptual loss function and a temporal consistency loss function, and the sum of the three is taken as the final loss of the network. The specific formula is as follows:
where γ_TC is set to 50; the perceptual loss computes the L1 distance on the outputs of the first five hidden layers of a VGG-19 pre-trained on the ImageNet dataset, and the weight of each hidden layer is set to 1.
9) During training of the deep neural network, the initial learning rate is 0.00002, training runs for 450 iterations, and the Adam optimizer is selected to optimize the network.
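A minimal training-loop sketch matching this configuration is given below; model, train_loader and criterion are hypothetical placeholders for the network, the data pipeline and the combined loss described above.

import torch

def train(model, train_loader, criterion, epochs=450, lr=2e-5):
    # Adam optimizer with an initial learning rate of 0.00002 and 450 training
    # iterations, as stated in 9); `model`, `train_loader` and `criterion` are
    # hypothetical placeholders for the network, data pipeline and combined loss.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for voxel_grids, gt_frames in train_loader:
            optimizer.zero_grad()
            pred = model(voxel_grids)             # reconstructed gray-scale frames
            loss = criterion(pred, gt_frames)     # L1 + perceptual + temporal loss
            loss.backward()
            optimizer.step()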
The foregoing is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention, according to the technical scheme of the present invention and its inventive concept, shall fall within the protection scope of the present invention.

Claims (10)

1. An event camera video reconstruction method based on deep learning, characterized in that the method is realized based on a deep neural network and specifically comprises the following steps:
S1, acquiring events and preprocessing them into event frames;
S2, inputting the preprocessed original-scale event frames into a shared feature extraction module to extract main features and sub-features;
S3, inputting the main features into a feature offset estimation module to obtain the feature offsets of adjacent frames;
S4, inputting the main features into a spatio-temporal Transformer module embedded with ConvLSTM for feature encoding and decoding;
S5, resetting the encoded main features according to the feature offsets to realize feature alignment;
S6, inputting the reset main features into a SPADE Normalization module;
S7, inputting the main features into a 3D CNN module for feature decoding, adding the sub-features to compensate for lost information;
S8, downsampling the features to 1/2 and 1/4 scales to obtain main features at the 1/2 and 1/4 scales, extracting sub-features of the 1/2-scale and 1/4-scale events through the shared feature extraction module, and repeating operations S3-S7;
S9, upsampling the decoded 1/2-scale and 1/4-scale main features to the original scale via pixel shuffle and fusing them to obtain the reconstructed image;
S10, computing the network loss function between the reconstruction result obtained in S9 and the ground-truth image of the original scene, and performing back-propagation.
2. The event camera video reconstruction method based on deep learning according to claim 1, characterized in that the event preprocessing mentioned in S1 specifically comprises the following: event points between two frames of reference images are selected and stacked into a spatio-temporal voxel grid; for an event stream containing k events within a time period Δt = t_k − t_0, each event point is mapped into the corresponding spatio-temporal voxel grid with the following formula:
(1)
where t_i denotes the timestamp of the i-th event; C denotes the number of channels of the spatio-temporal voxel grid; (x_i, y_i) denotes the coordinates of the i-th event; p_i = ±1 denotes the event polarity; and t_n denotes the channel index time of the spatio-temporal voxel grid;
event streams from an odd number T of adjacent time periods are selected and stacked into spatio-temporal voxel grids I′ ∈ R^(B×T×C×H×W), where T comprises 1 intermediate frame at the central position and T−1 adjacent frames around it.
3. The event camera video reconstruction method based on deep learning according to claim 1, wherein the shared feature extraction module is formed by combining two common convolution modules, and four shared feature extraction modules extract primary features of an original scale, sub-features of the original scale, sub-features of 1/2 scale and sub-features of 1/4 scale respectively.
4. The deep-learning-based event camera video reconstruction method according to claim 1, characterized in that the feature offset estimation module uses an optical flow estimation model pre-trained on real scenes, and the module updates its parameters during network training, so that after training the module is migrated to the feature offset estimation space, thereby realizing feature offset estimation for events.
5. The event camera video reconstruction method based on deep learning according to claim 1, wherein S4 specifically comprises the following:
S4.1, applying two group convolutions to the extracted main features to extract the Q value and the K value respectively, with the following formulas:
Q = W_Q * F_m (2)
K = W_K * F_m (3)
where * denotes the convolution operation;
S4.2, extracting the V-value features from the main features using ConvLSTM, with the following formula:
(4)
where σ denotes the sigmoid function, and [·, ·] denotes that the two features are concatenated;
S4.3, after the extracted Q, K and V values are expanded, feature encoding is performed sequentially through the self-attention module and the multi-layer perceptron module, with the following formulas:
(5)
F_O = MLP(SA) (6)
where D denotes the feature dimension; MLP denotes a multi-layer perceptron containing multiple convolutional or fully connected layers.
6. The event camera video reconstruction method based on deep learning according to claim 1, characterized in that the feature alignment mentioned in S5 is specifically: the position features corresponding to the adjacent frames are reset to the positions corresponding to the intermediate frame according to the feature offsets of the adjacent frames obtained in S3, thereby realizing feature alignment.
7. The event camera video reconstruction method based on deep learning according to claim 1, wherein S6 specifically comprises the following:
S6.1, performing parameter-free batch normalization on the input reset main features, with the following formula:
(7)
where μ denotes the mean and σ denotes the standard deviation;
S6.2, the reconstruction result of the previous frame is dimension-expanded to the main-feature dimension using a convolution W_s, with the following calculation formula:
(8)
S6.3, applying convolutions W_γ and W_β to generate the normalization coefficient and bias term for the main features, with the following formula:
(9)。
8. The event camera video reconstruction method based on deep learning according to claim 1, characterized in that S7 specifically comprises the following: the previously extracted sub-features are concatenated with the output features; a 2D convolution module first fuses them, several stacked 3D convolutions then form a module that decodes the features, and a Leaky ReLU provides the nonlinearity between the 3D convolution modules, with the following formula:
(10)。
9. The event camera video reconstruction method based on deep learning according to claim 1, characterized in that S9 specifically comprises the following: the main features output at the 1/2 and 1/4 scales through S2-S7 are upsampled to the original scale using the pixel shuffle operation, and all main features are fused using one ConvLSTM and several convolution layers to finally obtain the reconstructed gray-scale image.
10. The event camera video reconstruction method based on deep learning according to claim 1, characterized in that the loss function mentioned in S10 includes an L1 loss function, a perceptual loss function and a temporal consistency loss function, and the sum of the three is taken as the final loss of the network, with the following formula:
(11)
where γ_TC is set to 5; the perceptual loss computes the L1 distance on the outputs of the first five hidden layers of a VGG-19 pre-trained on the ImageNet dataset, i.e. L_0 = 5; the weight of each hidden layer is set to 1, i.e. w_l = 1; and the warped frame denotes the result of aligning the previous frame to the current frame according to the feature offsets.
CN202211121596.1A 2022-09-15 2022-09-15 Event camera video reconstruction method based on deep learning Active CN115484410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211121596.1A CN115484410B (en) 2022-09-15 2022-09-15 Event camera video reconstruction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211121596.1A CN115484410B (en) 2022-09-15 2022-09-15 Event camera video reconstruction method based on deep learning

Publications (2)

Publication Number Publication Date
CN115484410A CN115484410A (en) 2022-12-16
CN115484410B true CN115484410B (en) 2023-11-24

Family

ID=84424074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211121596.1A Active CN115484410B (en) 2022-09-15 2022-09-15 Event camera video reconstruction method based on deep learning

Country Status (1)

Country Link
CN (1) CN115484410B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116456183B (en) * 2023-04-20 2023-09-26 北京大学 High dynamic range video generation method and system under guidance of event camera
CN116309781B (en) * 2023-05-18 2023-08-22 吉林大学 Cross-modal fusion-based underwater visual target ranging method and device
CN117097876B (en) * 2023-07-07 2024-03-08 天津大学 Event camera image reconstruction method based on neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN113034380A (en) * 2021-02-09 2021-06-25 浙江大学 Video space-time super-resolution method and device based on improved deformable convolution correction
CN113837938A (en) * 2021-07-28 2021-12-24 北京大学 Super-resolution method for reconstructing potential image based on dynamic vision sensor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3933673A1 (en) * 2020-07-01 2022-01-05 Tata Consultancy Services Limited System and method to capture spatio-temporal representation for video reconstruction and analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN113034380A (en) * 2021-02-09 2021-06-25 浙江大学 Video space-time super-resolution method and device based on improved deformable convolution correction
CN113837938A (en) * 2021-07-28 2021-12-24 北京大学 Super-resolution method for reconstructing potential image based on dynamic vision sensor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video super-resolution algorithm based on spatio-temporal features and neural networks; Li Linghui; Du Junping; Liang Meiyu; Ren Nan; Lee Jang Myung; Journal of Beijing University of Posts and Telecommunications, Vol. 39, No. 4; full text *

Also Published As

Publication number Publication date
CN115484410A (en) 2022-12-16

Similar Documents

Publication Publication Date Title
CN115484410B (en) Event camera video reconstruction method based on deep learning
CN111582483B (en) Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism
CN112465718B (en) Two-stage image restoration method based on generation of countermeasure network
CN110288555B (en) Low-illumination enhancement method based on improved capsule network
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN112733950A (en) Power equipment fault diagnosis method based on combination of image fusion and target detection
CN111819568A (en) Method and device for generating face rotation image
CN110570363A (en) Image defogging method based on Cycle-GAN with pyramid pooling and multi-scale discriminator
CN113283444B (en) Heterogeneous image migration method based on generation countermeasure network
CN114445292A (en) Multi-stage progressive underwater image enhancement method
CN116168067B (en) Supervised multi-modal light field depth estimation method based on deep learning
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
CN113793261A (en) Spectrum reconstruction method based on 3D attention mechanism full-channel fusion network
CN115131214A (en) Indoor aged person image super-resolution reconstruction method and system based on self-attention
CN110335299A (en) A kind of monocular depth estimating system implementation method based on confrontation network
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
CN115115685A (en) Monocular image depth estimation algorithm based on self-attention neural network
CN116205962A (en) Monocular depth estimation method and system based on complete context information
CN117391938B (en) Infrared image super-resolution reconstruction method, system, equipment and terminal
CN114119356A (en) Method for converting thermal infrared image into visible light color image based on cycleGAN
CN113379606A (en) Face super-resolution method based on pre-training generation model
CN111401209B (en) Action recognition method based on deep learning
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
Huang et al. CS-VQA: visual question answering with compressively sensed images
CN115439849B (en) Instrument digital identification method and system based on dynamic multi-strategy GAN network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant