CN115484410B - Event camera video reconstruction method based on deep learning - Google Patents

Event camera video reconstruction method based on deep learning

Info

Publication number
CN115484410B
Authority
CN
China
Prior art keywords
event
features
feature
scale
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211121596.1A
Other languages
Chinese (zh)
Other versions
CN115484410A (en)
Inventor
杨敬钰
潘锦蔚
岳焕景
李坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202211121596.1A priority Critical patent/CN115484410B/en
Publication of CN115484410A publication Critical patent/CN115484410A/en
Application granted granted Critical
Publication of CN115484410B publication Critical patent/CN115484410B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/44Event detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8146Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Graphics (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an event camera video reconstruction method based on deep learning, belonging to the field of digital image processing. The event camera is a novel neuromorphic sensor characterized by a high dynamic range, high temporal resolution and low power consumption. Because the output of an event camera is not the visually friendly image that people normally observe, video reconstruction is one of its key visual applications. Existing event camera video reconstruction methods produce poor image quality in the early stage of reconstruction and require a long initialization time. To address this, the invention designs a new deep neural network comprising spatio-temporal Transformer, ConvLSTM and CNN modules, which extracts information from unaligned spatio-temporal events in several adjacent time periods and jointly generates one gray-scale frame through the complementary advantages of these modules. The method initializes quickly, reconstructs high-quality images both in the early stage of the video and throughout the video, and the reconstructed video has a good visual effect.

Description

Event camera video reconstruction method based on deep learning
Technical Field
The invention belongs to the field of digital image processing, and particularly relates to an event camera video reconstruction method based on deep learning and computer vision.
Background
The event camera is a novel sensor inspired by the biological retina; its working principle and underlying circuit design paradigm are completely different from those of a traditional camera. The event camera features a high dynamic range, high temporal resolution and low power consumption, and has broad application prospects in fields involving high-speed motion or extreme illumination, such as autonomous driving, visual navigation of unmanned aerial vehicles, and security monitoring.
Unlike an intensity camera, which synchronously outputs intensity values for every pixel of the pixel plane during an exposure time, the event camera outputs data at each pixel location asynchronously and records only the relative change in brightness. Because the output of an event camera is not the gray-scale or color image that people normally observe, reconstructing the event points output by an event camera into visually friendly images and videos is one of its visual applications. The asynchronous triggering and transmission of events makes them non-Euclidean data, so existing image reconstruction methods are difficult to apply directly to event camera image reconstruction, and new image and video reconstruction algorithms need to be studied for the characteristics of the event camera.
Current event camera image or video reconstruction algorithms fall into two main categories: methods based on conventional image processing and methods based on deep learning. Conventional image-processing methods mainly model the differential characteristics of the event camera and estimate the intensity value at every pixel position through integration or filtering. Deep-learning-based methods achieve better results than conventional image methods. Such methods typically use ConvLSTM to introduce a long-term estimate of image intensity and thereby better model the differential nature of events. However, the introduction of ConvLSTM and the spatial sparsity of events mean that event camera video reconstruction algorithms require an initialization time of several frames to several tens of frames; that is, in the initial stage of video reconstruction the image quality is poor, typically with weak texture detail and low global contrast.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by enhancing the initial imaging quality of the video and reducing the initialization time while ensuring the overall imaging quality of the video, and therefore provides an event camera video reconstruction method based on deep learning.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the event camera video reconstruction method based on deep learning is realized with a deep neural network: an encoder-decoder deep neural network is designed by combining ConvLSTM, a spatio-temporal Transformer, a spatio-temporal feature alignment unit, 2D/3D convolutions and other modules, and one gray-scale frame is jointly generated using the information of unaligned spatio-temporal events in several adjacent time periods (corresponding to multiple exposure times of an intensity camera); the specific method comprises the following steps:
S1, acquiring events and preprocessing them into event frames;
S2, inputting the preprocessed original-scale event frames into a shared feature extraction module to extract main features and sub-features;
S3, inputting the main features into a feature offset estimation module to obtain the feature offsets of adjacent frames;
S4, inputting the main features into a spatio-temporal Transformer module embedded with ConvLSTM for feature encoding and decoding;
S5, resetting the encoded main features according to the feature offsets to realize feature alignment;
S6, inputting the reset main features into a SPADE Normalization module;
S7, inputting the main features into a 3D CNN module for feature decoding, adding the sub-features to compensate for lost information;
S8, downsampling the features to 1/2 and 1/4 scales to obtain main features at the 1/2 and 1/4 scales, extracting sub-features of the 1/2-scale and 1/4-scale events through the shared feature extraction module, and repeating operations S3-S7;
S9, upsampling the decoded 1/2-scale and 1/4-scale main features to the original scale via pixel shuffle and fusing them to obtain the reconstructed image;
S10, computing the network loss function between the reconstruction result obtained in S9 and the ground-truth image of the original scene, and performing back-propagation.
Preferably, the event preprocessing mentioned in S1 specifically comprises the following: event points between two frames of reference images are selected and stacked into a spatio-temporal voxel grid; for an event stream containing k events within a time period Δt = t_k − t_0, each event point is mapped into the corresponding spatio-temporal voxel grid according to the following formula:
where t_i denotes the timestamp of the i-th event; C denotes the number of channels of the spatio-temporal voxel grid; (x_i, y_i) denotes the coordinates of the i-th event; p_i = ±1 denotes the event polarity; and t_n denotes the channel index time of the spatio-temporal voxel grid;
event streams from an odd number T of adjacent time periods are selected and stacked into spatio-temporal voxel grids I′ ∈ R^(B×T×C×H×W), where T comprises 1 intermediate frame at the central position and T−1 adjacent frames around it.
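As an illustration of this preprocessing step, the following is a minimal PyTorch-style sketch of stacking one event stream into a spatio-temporal voxel grid, assuming the common bilinear weighting along the channel (time) axis; the formula referenced above is not reproduced in the text, so the weighting kernel and the event tensor layout used here are assumptions.

import torch

def events_to_voxel_grid(events, num_channels, height, width):
    # events: float tensor of shape (k, 4) with columns (t_i, x_i, y_i, p_i),
    # sorted by timestamp, p_i in {-1, +1}.  The bilinear weighting along the
    # channel (time) axis is an assumption.
    grid = torch.zeros(num_channels, height, width)
    t = events[:, 0]
    x = events[:, 1].long()
    y = events[:, 2].long()
    p = events[:, 3]

    # Normalize timestamps to the channel axis [0, C-1] over dt = t_k - t_0.
    t_norm = (num_channels - 1) * (t - t[0]) / (t[-1] - t[0] + 1e-9)
    t_low = t_norm.floor().long().clamp(0, num_channels - 1)
    t_high = (t_low + 1).clamp(0, num_channels - 1)
    w_high = t_norm - t_low.float()
    w_low = 1.0 - w_high

    # Scatter each event's polarity into its two neighbouring channels.
    flat = grid.view(num_channels, -1)
    idx = y * width + x
    flat.index_put_((t_low, idx), p * w_low, accumulate=True)
    flat.index_put_((t_high, idx), p * w_high, accumulate=True)
    return grid

For an odd number T of adjacent time periods, T such grids would be stacked along an extra axis to obtain I′ ∈ R^(B×T×C×H×W).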
Preferably, the shared feature extraction module is formed by combining two ordinary convolution modules; four shared feature extraction modules respectively extract the main features at the original scale, the sub-features at the original scale, the 1/2-scale sub-features and the 1/4-scale sub-features.
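A minimal sketch of such a shared feature extraction module is given below; the kernel sizes, activation and channel widths are assumptions, since the text only states that two ordinary convolution modules are combined and that the weights are shared across time periods.

import torch.nn as nn

class SharedFeatureExtractor(nn.Module):
    # Two ordinary convolution blocks; weights are shared across the T time
    # periods by folding the time axis into the batch axis.  Kernel sizes,
    # activation and channel widths are assumptions.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.LeakyReLU(0.1),
        )

    def forward(self, x):                       # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        y = self.body(x.reshape(b * t, c, h, w))
        return y.view(b, t, -1, h, w)

Four instances of this module would respectively produce the original-scale main features and the sub-features at the original, 1/2 and 1/4 scales.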
Preferably, the feature offset estimation module uses an optical flow estimation model pre-trained on real scenes, and the module updates its parameters during network training, so that after training the module is migrated to the feature offset estimation space, thereby realizing feature offset estimation for events.
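A hedged sketch of how such a module could be wrapped is shown below; flow_net stands in for the pre-trained optical flow model (its call signature is an assumption), and its parameters are left trainable so that it can migrate to the feature offset estimation space during training.

import torch.nn as nn

class FeatureOffsetEstimator(nn.Module):
    # `flow_net` stands in for an optical flow model pre-trained on real
    # scenes; its call signature (two feature maps in, a 2-channel offset
    # field out) is an assumption.  Parameters are left trainable so the
    # module can migrate to the feature offset estimation space.
    def __init__(self, flow_net):
        super().__init__()
        self.flow_net = flow_net

    def forward(self, feat_neighbor, feat_center):
        return self.flow_net(feat_neighbor, feat_center)   # (B, 2, H, W) offsets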
Preferably, the S4 specifically includes the following:
S4.1, applying two group convolutions to the extracted main features to extract the Q value and the K value respectively, with the following formulas:
Q = W_Q * F_m (2)
K = W_K * F_m (3)
where * denotes the convolution operation;
S4.2, extracting the V-value features from the main features using ConvLSTM, with the following formula:
where σ denotes the sigmoid function, and [·, ·] denotes the concatenation of two features;
S4.3, after the extracted Q, K and V values are expanded, feature encoding is performed sequentially through the self-attention module and the multi-layer perceptron module, with the following formulas:
F_O = MLP(SA) (6)
where D denotes the feature dimension; MLP denotes a multi-layer perceptron containing multiple convolutional or fully connected layers.
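The following is a minimal PyTorch-style sketch of the ConvLSTM-embedded spatio-temporal Transformer block described in S4.1–S4.3; the ConvLSTM cell interface, the group count and the number of attention heads are assumptions, while the group convolutions for Q and K, the ConvLSTM-derived V, and the self-attention followed by an MLP follow the text.

import torch.nn as nn

class SpatioTemporalTransformerBlock(nn.Module):
    # Q and K come from two group convolutions, V from a ConvLSTM cell,
    # followed by self-attention and an MLP (S4.1-S4.3).  The ConvLSTM cell
    # interface, group count and head count are assumptions.
    def __init__(self, channels, conv_lstm_cell, groups=4, heads=4):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.to_k = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.to_v = conv_lstm_cell                      # keeps its own hidden state
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(channels, channels), nn.GELU(),
                                 nn.Linear(channels, channels))

    def forward(self, f_m, state):                      # f_m: (B, C, H, W)
        b, c, h, w = f_m.shape
        q = self.to_q(f_m).flatten(2).transpose(1, 2)   # (B, H*W, C)
        k = self.to_k(f_m).flatten(2).transpose(1, 2)
        v, state = self.to_v(f_m, state)                # ConvLSTM provides V
        v = v.flatten(2).transpose(1, 2)
        sa, _ = self.attn(q, k, v)                      # softmax(Q K^T / sqrt(D)) V
        f_o = self.mlp(sa)                              # F_O = MLP(SA)
        return f_o.transpose(1, 2).reshape(b, c, h, w), state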
Preferably, the feature alignment mentioned in S5 is specifically: the position features corresponding to the adjacent frames are reset to the positions corresponding to the intermediate frame according to the feature offsets of the adjacent frames obtained in S3, thereby realizing feature alignment.
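A sketch of this alignment step is given below, assuming the "reset" of neighbouring-frame features is realized as a warp driven by the per-pixel feature offsets; the use of grid_sample is an assumption.

import torch
import torch.nn.functional as F

def align_to_center(feat_neighbor, offset):
    # Warp a neighbouring frame's features to the intermediate frame's
    # positions using the per-pixel offsets (dx, dy) in pixels; the use of
    # grid_sample for the "reset" operation is an assumption.
    b, _, h, w = feat_neighbor.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=offset.device),
                            torch.arange(w, device=offset.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + offset[:, 0]
    grid_y = ys.unsqueeze(0) + offset[:, 1]
    grid = torch.stack([2 * grid_x / (w - 1) - 1,       # normalize to [-1, 1]
                        2 * grid_y / (h - 1) - 1], dim=-1)
    return F.grid_sample(feat_neighbor, grid, align_corners=True)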
Preferably, the step S6 specifically includes the following:
S6.1, performing parameter-free batch normalization on the input reset main features, with the following formula:
where μ denotes the mean and σ denotes the standard deviation;
S6.2, the reconstruction result of the previous frame is dimension-expanded to the main-feature dimension using a convolution W_s, with the following calculation formula:
S6.3, applying convolutions W_γ and W_β to generate the normalization coefficient and bias term for the main features, with the following formula:
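Putting S6.1–S6.3 together, a minimal sketch of the SPADE Normalization module could look as follows; the hidden channel width and the assumption that the previous reconstruction is a single-channel gray-scale image are illustrative choices.

import torch.nn as nn

class SpadeNormalization(nn.Module):
    # S6.1: parameter-free batch normalization of the aligned main features;
    # S6.2: one convolution expands the previous frame's reconstruction;
    # S6.3: two convolutions generate the scaling coefficient and bias term.
    # The hidden width and the single-channel previous frame are assumptions.
    def __init__(self, feat_channels, hidden_channels=64):
        super().__init__()
        self.bn = nn.BatchNorm2d(feat_channels, affine=False)
        self.expand = nn.Sequential(nn.Conv2d(1, hidden_channels, 3, padding=1),
                                    nn.ReLU())                                    # W_s
        self.to_gamma = nn.Conv2d(hidden_channels, feat_channels, 3, padding=1)   # W_gamma
        self.to_beta = nn.Conv2d(hidden_channels, feat_channels, 3, padding=1)    # W_beta

    def forward(self, feat, prev_recon):        # prev_recon: (B, 1, H, W)
        normed = self.bn(feat)
        s = self.expand(prev_recon)
        return normed * self.to_gamma(s) + self.to_beta(s)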
Preferably, S7 specifically comprises the following: the previously extracted sub-features are concatenated with the output features; a 2D convolution module first fuses them, several stacked 3D convolutions then form a module that decodes the features, and a Leaky ReLU provides the nonlinearity between the 3D convolution modules, with the following formula:
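An illustrative sketch of this decoding step is given below; the number of 3D convolution blocks and the channel widths are assumptions, while the 2D fusion convolution, the stacked 3D convolutions and the Leaky ReLU nonlinearity follow the text.

import torch
import torch.nn as nn

class Decoder3D(nn.Module):
    # Concatenate sub-features with the decoded features, fuse with a 2D
    # convolution, then decode with stacked 3D convolutions separated by
    # Leaky ReLU (S7).  Block count and channel widths are assumptions.
    def __init__(self, main_channels, sub_channels, n_blocks=3):
        super().__init__()
        self.fuse2d = nn.Conv2d(main_channels + sub_channels, main_channels, 3, padding=1)
        self.blocks3d = nn.ModuleList(
            nn.Conv3d(main_channels, main_channels, 3, padding=1) for _ in range(n_blocks))
        self.act = nn.LeakyReLU(0.1)            # the embodiment uses alpha = 0.1

    def forward(self, feat, sub_feat):          # both: (B, T, C, H, W)
        b, t, _, h, w = feat.shape
        x = torch.cat([feat, sub_feat], dim=2).reshape(b * t, -1, h, w)
        x = self.fuse2d(x).view(b, t, -1, h, w)
        x = x.permute(0, 2, 1, 3, 4)            # (B, C, T, H, W) for Conv3d
        for conv in self.blocks3d:
            x = self.act(conv(x))
        return x.permute(0, 2, 1, 3, 4)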
Preferably, S9 specifically comprises the following: the main features output at the 1/2 and 1/4 scales through S2-S7 are upsampled to the original scale using the pixel shuffle operation, and all main features are fused using one ConvLSTM and several convolution layers to finally obtain the reconstructed gray-scale image.
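A minimal sketch of this multi-scale fusion is shown below; the ConvLSTM cell interface and the fusion head depth are assumptions, while the pixel-shuffle upsampling of the 1/2- and 1/4-scale features and the ConvLSTM-plus-convolution fusion follow the text.

import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    # Upsample the 1/2- and 1/4-scale main features with pixel shuffle, then
    # fuse all scales with one ConvLSTM and a few 2D convolutions to produce
    # the gray-scale frame (S9).  The ConvLSTM cell interface and the fusion
    # head depth are assumptions.
    def __init__(self, channels, conv_lstm_cell):
        super().__init__()
        self.up_half = nn.Sequential(nn.Conv2d(channels, channels * 4, 3, padding=1),
                                     nn.PixelShuffle(2))     # 1/2 scale -> original
        self.up_quarter = nn.Sequential(nn.Conv2d(channels, channels * 16, 3, padding=1),
                                        nn.PixelShuffle(4))   # 1/4 scale -> original
        self.lstm = conv_lstm_cell
        self.head = nn.Sequential(nn.Conv2d(3 * channels, channels, 3, padding=1),
                                  nn.LeakyReLU(0.1),
                                  nn.Conv2d(channels, 1, 3, padding=1))

    def forward(self, f_full, f_half, f_quarter, state):
        fused = torch.cat([f_full, self.up_half(f_half),
                           self.up_quarter(f_quarter)], dim=1)
        fused, state = self.lstm(fused, state)                # assumed to keep 3*channels
        return self.head(fused), state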
Preferably, the loss function mentioned in S10 includes an L1 loss function, a perceptual loss function and a time consistency loss function, and the sum of the three loss functions is taken as the final loss of the network, and the specific formula is as follows:
wherein, gamma TC Set to 5; the perception loss selects the first five hidden layer outputs of VGG-19 after the pre-training of the ImageNet data set to calculate the L1 distance, i.e. L is taken 0 =5; the weight of each hidden layer is set to 1, i.e. w l =1, Representing the result of aligning the previous frame to the current frame by two according to the feature offset.
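A hedged sketch of this combined loss is given below; the exact VGG-19 layer cut points, the omitted input normalization and the warped previous frame supplied by the caller are assumptions, while the L1 term, the five perceptual stages with unit weights and the γ_TC-weighted temporal consistency term follow the text.

import torch.nn as nn
import torchvision.models as models

class ReconstructionLoss(nn.Module):
    # L1 + perceptual (first five VGG-19 stages, ImageNet-pretrained, unit
    # weights) + temporal consistency weighted by gamma_tc.  The VGG layer
    # cut points and the warped previous frame passed by the caller are
    # assumptions.
    def __init__(self, gamma_tc=5.0, num_stages=5):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        cuts = [4, 9, 18, 27, 36][:num_stages]
        self.stages = nn.ModuleList(vgg[a:b] for a, b in zip([0] + cuts[:-1], cuts))
        self.gamma_tc = gamma_tc
        self.l1 = nn.L1Loss()

    def forward(self, pred, target, pred_prev_warped=None):
        loss = self.l1(pred, target)                               # L1 term
        x, y = pred.repeat(1, 3, 1, 1), target.repeat(1, 3, 1, 1)  # gray -> 3 channels
        for stage in self.stages:                                  # perceptual term, w_l = 1
            x, y = stage(x), stage(y)
            loss = loss + self.l1(x, y)
        if pred_prev_warped is not None:                           # temporal consistency term
            loss = loss + self.gamma_tc * self.l1(pred, pred_prev_warped)
        return loss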
Compared with the prior art, the event camera video reconstruction method based on deep learning has the following beneficial effects:
the method avoids any change to the imaging hardware and adopts a post-processing approach, performing video reconstruction for the event camera through the complementary advantages of several modules; specifically:
1. The invention combines the ability of ConvLSTM to model long-term dependencies, the ability of 2D/3D convolutions to aggregate local features, and the ability of the spatio-temporal Transformer to model mid-term dependencies within adjacent time periods and global image information; with these complementary advantages it reconstructs events into gray-scale images consistent with the real scene.
2. The invention realizes the video reconstruction of the event camera, and the reconstructed video has better effect on the whole.
3. The invention has shorter imaging initialization time and higher imaging quality in the initialization period.
Drawings
FIG. 1 is a general flow chart of an event camera video reconstruction method based on deep learning according to the present invention;
FIG. 2 is a detailed flowchart of the event camera video reconstruction method based on deep learning according to the present invention;
Fig. 3 compares the reconstruction results of the first four or five frames of the video reconstructed by the event camera video reconstruction method based on deep learning with other methods, where 1) is the reconstruction result of the method proposed by Henri Rebecq et al. in the paper "High Speed and High Dynamic Range Video with an Event Camera", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020; 2) is the reconstruction result of the method proposed by Pablo Rodrigo Gantier Cadena et al. in the paper "SPADE-E2VID: Spatially-Adaptive Denormalization for Event-Based Video Reconstruction", IEEE Transactions on Image Processing, 2021; and 3) is the real image of the original scene captured by an ordinary camera.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings; obviously, the described embodiments are only some, not all, of the embodiments of the present invention.
The invention designs a spatio-temporal Transformer module with an embedded ConvLSTM: long-term dependencies over the whole video sequence are captured by ConvLSTM, mid-term dependencies within adjacent time periods and global image information are captured by the spatio-temporal Transformer, and local image features are learned by convolution; the complementary advantages of these modules jointly complete the image reconstruction. The specific method comprises the following steps:
example 1:
referring to fig. 1 and 2, the present invention specifically includes the following steps:
1) Constructing input data:
11) Real event camera data published on the website of the Robotics and Perception Group at the University of Zurich are used as experimental data; the simulation dataset therein is used to train and validate the network, and the real dataset therein is used to test network performance. Both the simulation data and the real data contain the event streams output by the event camera and the corresponding ground-truth images of the original scene.
12) To apply deep learning to event camera image reconstruction, the non-Euclidean event data must first be structured. Specifically, the event points between two frames of reference images (corresponding to the exposure time of an ordinary camera) are stacked into a spatio-temporal voxel grid. For an event stream containing k events within a time period Δt = t_k − t_0, we map each event point into the corresponding spatio-temporal voxel grid as follows:
where t_i denotes the timestamp of the i-th event; C denotes the number of channels of the spatio-temporal voxel grid; (x_i, y_i) denotes the coordinates of the i-th event; p_i = ±1 denotes the event polarity; and t_n denotes the channel index time of the spatio-temporal voxel grid. The invention selects event streams from T (T odd) adjacent time periods and stacks them into spatio-temporal voxel grids I′ ∈ R^(B×T×C×H×W), where T comprises 1 intermediate frame at the central position and T−1 adjacent frames around it.
2) We use a shared feature extractor to map the representations of successive frames into the same feature space through multiple convolution layers with shared weights.
21) The primary shared feature extractor is first applied to I′ at the original scale to extract the main features, which circulate as the primary information in the subsequent network.
22) A shared feature extractor is used to extract multi-scale sub-features at three scales, which supplement the main information at the corresponding scales in subsequent processing.
3) The features of all time periods, now in the same feature space, are input into the feature offset estimation module. This module uses an optical flow estimation model pre-trained on real scenes and updates its parameters during network training, so that after training it is migrated to the feature offset estimation space, thereby realizing feature offset estimation for events.
4) The extracted main features are input into the spatio-temporal Transformer module with embedded ConvLSTM.
41) Two group convolutions are applied to the extracted main features to extract the Q and K values, and ConvLSTM is applied to extract the V value, with the following formulas:
Q = W_Q * F_m (2)
K = W_K * F_m (3)
where * denotes the convolution operation, σ denotes the sigmoid function, and [·, ·] denotes the concatenation of two features;
42) After being expanded, the extracted Q, K and V values are passed sequentially through the self-attention module and the multi-layer perceptron module, with the following formulas:
F_O = MLP(SA) (6)
where D denotes the feature dimension; MLP denotes a multi-layer perceptron containing multiple convolutional or fully connected layers.
5)
51) The position features corresponding to the adjacent frames are reset to the positions corresponding to the intermediate frame according to the feature offsets of the adjacent frames obtained in step 3), realizing feature alignment.
52) The aligned features are input into the SPADE Normalization module. Specifically, the input features are first normalized by parameter-free batch normalization, with the following formula:
where μ is the mean and σ is the standard deviation.
Then, using the reconstruction result of the previous frame, one convolution first performs dimension expansion, and two convolutions respectively generate the normalization coefficient and bias term, with the following formulas:
53) The features are then input into the 3D CNN module. Specifically, the previously extracted sub-features are first concatenated with the output features; a 2D convolution module fuses them, several stacked 3D convolutions then form a module that decodes the features, and a Leaky ReLU provides the nonlinearity between the 3D convolution modules, with the following formula:
6) The outputs obtained in the above steps are downsampled to obtain feature maps at different scales, and the same series of operations is then performed again.
7) The two low-scale feature maps are upsampled to the original scale by the pixel shuffle operation, and the feature maps from different scales are fused using one ConvLSTM and several 2D convolution layers to obtain the final reconstruction result.
8) The loss function of the network is computed from the network reconstruction result and the ground-truth image of the original scene. The invention uses an L1 loss function, a perceptual loss function and a temporal consistency loss function, and the sum of the three is taken as the final loss of the network. The specific formula is as follows:
example 2:
referring to fig. 1-2, the following detailed description is based on embodiment 1, but differs therefrom, in conjunction with the accompanying drawings and specific example data:
the invention provides an event camera video reconstruction method based on deep learning (as shown in the flow of fig. 1 and 2), which designs a space-time transducer module embedded with ConvLSTM, captures long-term dependence in the whole video sequence period through ConvLSTM, captures medium-term dependence in adjacent period and global information of an image through space-time transducer, and completes image reconstruction work by utilizing advantage complementation among a plurality of modules through local characteristics of convolution learning images. The specific method comprises the following steps:
1) Constructing input data:
11) Real event camera data published on the website of the Robotics and Perception Group at the University of Zurich are used as experimental data; the simulation dataset therein is used to train and validate the network, and the real dataset therein is used to test network performance. Both the simulation data and the real data contain the event streams output by the event camera and the corresponding ground-truth images of the original scene. The training set uses 100 video sequences and the validation set uses 25 video sequences; in each training iteration, a segment of 40 frames is randomly taken from each video sequence and the corresponding 40-frame video is reconstructed; the real dataset for testing is captured with a DAVIS240C event camera with a resolution of 240×180, which outputs aligned scene event streams and gray-scale images, and 7 video sequences are selected for testing.
12) To apply deep learning to event camera image reconstruction, the non-Euclidean event data must first be structured. Specifically, the event points between two frames of reference images (corresponding to the exposure time of an ordinary camera) are stacked into a spatio-temporal voxel grid. For an event stream containing k events within a time period Δt = t_k − t_0, we map each event point into the corresponding spatio-temporal voxel grid as follows:
where t_i denotes the timestamp of the i-th event; C denotes the number of channels of the spatio-temporal voxel grid; (x_i, y_i) denotes the coordinates of the i-th event; p_i = ±1 denotes the event polarity; and t_n denotes the channel index time of the spatio-temporal voxel grid. The invention selects event streams from T (T odd) adjacent time periods and stacks them into spatio-temporal voxel grids I′ ∈ R^(B×T×C×H×W), where T comprises 1 intermediate frame at the central position and T−1 adjacent frames around it.
13) We choose 5 spatio-temporal voxel grids as the network input, each containing 5 channels; the batch size of the network is set to 2 and the resolution of the spatio-temporal voxel grid to 128×128, and one gray-scale frame is finally reconstructed, i.e. T = 5, C = 5, B = 2 and H = W = 128; the input of the network is I′ ∈ R^(B×T×C×H×W), and the network outputs the reconstructed gray-scale frame.
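For clarity, the shapes implied by this configuration can be checked with a few lines of PyTorch; the output shape of one gray-scale frame per sample is an assumption.

import torch

# Shape check for the configuration in 13): T = 5 voxel grids of C = 5 channels,
# batch size B = 2, resolution 128 x 128.
B, T, C, H, W = 2, 5, 5, 128, 128
event_input = torch.randn(B, T, C, H, W)      # I' in R^(B x T x C x H x W)
print(event_input.shape)                      # torch.Size([2, 5, 5, 128, 128])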
2) We use a shared feature extractor to map the representations of successive frames into the same feature space through multiple convolution layers with shared weights.
21) The primary shared feature extractor is first applied to I′ at the original scale to extract the main features, which circulate as the primary information in the subsequent network, where C_m = 64.
22) A shared feature extractor is used to extract multi-scale sub-features at three scales, which supplement the main information at the corresponding scales in subsequent processing, where C_s = 6.
3) The features of all time periods, now in the same feature space, are input into the feature offset estimation module. This module uses an optical flow estimation model pre-trained on real scenes and updates its parameters during network training, so that after training it is migrated to the feature offset estimation space, thereby realizing feature offset estimation for events. The invention uses a Spatial Pyramid Network as the optical flow estimation model.
4) The extracted main features are input into the spatio-temporal Transformer module with embedded ConvLSTM.
41) Two group convolutions are applied to the extracted main features to extract the Q and K values, and ConvLSTM is applied to extract the V value, with the following formulas:
Q = W_Q * F_m (2)
K = W_K * F_m (3)
where * denotes the convolution operation, σ denotes the sigmoid function, and [·, ·] denotes the concatenation of two features;
42) After being expanded, the extracted Q, K and V values are passed sequentially through the self-attention module and the multi-layer perceptron module, with the following formulas:
F_O = MLP(SA) (6)
5) The following operations are then performed:
51) The position features corresponding to the adjacent frames are reset to the positions corresponding to the intermediate frame according to the feature offsets of the adjacent frames obtained in step 3), realizing feature alignment.
52) The aligned features are input into the SPADE Normalization module. Specifically, the input features are first normalized by parameter-free batch normalization, with the following formula:
where μ is the mean and σ is the standard deviation. Then, using the reconstruction result of the previous frame, one convolution first performs dimension expansion, and two convolutions respectively generate the normalization coefficient and bias term, with the following formulas:
53) The features are then input into the 3D CNN module. Specifically, the previously extracted sub-features are first concatenated with the output features; a 2D convolution module fuses them, several stacked 3D convolutions then form a module that decodes the features, and a Leaky ReLU provides the nonlinearity between the 3D convolution modules, with the following formula:
where α=0.1.
6) The outputs obtained in the above steps are downsampled to obtain feature maps at different scales, and the same series of operations is then performed again.
7) The two low-scale feature maps are upsampled to the original scale by the pixel shuffle operation, and the feature maps from different scales are fused using one ConvLSTM and several 2D convolution layers to obtain the final reconstruction result.
8) The loss function of the network is computed from the network reconstruction result and the ground-truth image of the original scene. The invention uses an L1 loss function, a perceptual loss function and a temporal consistency loss function, and the sum of the three is taken as the final loss of the network. The specific formula is as follows:
where γ_TC is set to 50; the perceptual loss computes the L1 distance on the outputs of the first five hidden layers of a VGG-19 pre-trained on the ImageNet dataset, and the weight of each hidden layer is set to 1.
9) During training of the deep neural network, the initial learning rate is 0.00002, training runs for 450 iterations, and the Adam optimizer is selected to optimize the network.
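A minimal training-loop sketch matching this configuration is given below; model, train_loader and criterion are hypothetical placeholders for the network, the data pipeline and the combined loss described above.

import torch

def train(model, train_loader, criterion, epochs=450, lr=2e-5):
    # Adam optimizer with an initial learning rate of 0.00002 and 450 training
    # iterations, as stated in 9); `model`, `train_loader` and `criterion` are
    # hypothetical placeholders for the network, data pipeline and combined loss.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for voxel_grids, gt_frames in train_loader:
            optimizer.zero_grad()
            pred = model(voxel_grids)             # reconstructed gray-scale frames
            loss = criterion(pred, gt_frames)     # L1 + perceptual + temporal loss
            loss.backward()
            optimizer.step()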
The foregoing is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention, according to the technical scheme of the present invention and its inventive concept, shall fall within the protection scope of the present invention.

Claims (10)

1. An event camera video reconstruction method based on deep learning, characterized in that the method is realized based on a deep neural network and specifically comprises the following steps:
S1, acquiring events and preprocessing them into event frames;
S2, inputting the preprocessed original-scale event frames into a shared feature extraction module to extract main features and sub-features;
S3, inputting the main features into a feature offset estimation module to obtain the feature offsets of adjacent frames;
S4, inputting the main features into a spatio-temporal Transformer module embedded with ConvLSTM for feature encoding and decoding;
S5, resetting the encoded main features according to the feature offsets to realize feature alignment;
S6, inputting the reset main features into a SPADE Normalization module;
S7, inputting the main features into a 3D CNN module for feature decoding, adding the sub-features to compensate for lost information;
S8, downsampling the features to 1/2 and 1/4 scales to obtain main features at the 1/2 and 1/4 scales, extracting sub-features of the 1/2-scale and 1/4-scale events through the shared feature extraction module, and repeating operations S3-S7;
S9, upsampling the decoded 1/2-scale and 1/4-scale main features to the original scale via pixel shuffle and fusing them to obtain the reconstructed image;
S10, computing the network loss function between the reconstruction result obtained in S9 and the ground-truth image of the original scene, and performing back-propagation.
2. The event camera video reconstruction method based on deep learning according to claim 1, characterized in that the event preprocessing mentioned in S1 specifically comprises the following: event points between two frames of reference images are selected and stacked into a spatio-temporal voxel grid; for an event stream containing k events within a time period Δt = t_k − t_0, each event point is mapped into the corresponding spatio-temporal voxel grid with the following formula:
(1)
where t_i denotes the timestamp of the i-th event; C denotes the number of channels of the spatio-temporal voxel grid; (x_i, y_i) denotes the coordinates of the i-th event; p_i = ±1 denotes the event polarity; and t_n denotes the channel index time of the spatio-temporal voxel grid;
event streams from an odd number T of adjacent time periods are selected and stacked into spatio-temporal voxel grids I′ ∈ R^(B×T×C×H×W), where T comprises 1 intermediate frame at the central position and T−1 adjacent frames around it.
3. The event camera video reconstruction method based on deep learning according to claim 1, wherein the shared feature extraction module is formed by combining two common convolution modules, and four shared feature extraction modules extract primary features of an original scale, sub-features of the original scale, sub-features of 1/2 scale and sub-features of 1/4 scale respectively.
4. The deep-learning-based event camera video reconstruction method according to claim 1, characterized in that the feature offset estimation module uses an optical flow estimation model pre-trained on real scenes, and the module updates its parameters during network training, so that after training the module is migrated to the feature offset estimation space, thereby realizing feature offset estimation for events.
5. The event camera video reconstruction method based on deep learning according to claim 1, wherein S4 specifically comprises the following:
S4.1, applying two group convolutions to the extracted main features to extract the Q value and the K value respectively, with the following formulas:
Q = W_Q * F_m (2)
K = W_K * F_m (3)
where * denotes the convolution operation;
S4.2, extracting the V-value features from the main features using ConvLSTM, with the following formula:
(4)
where σ denotes the sigmoid function, and [·, ·] denotes that the two features are concatenated;
S4.3, after the extracted Q, K and V values are expanded, feature encoding is performed sequentially through the self-attention module and the multi-layer perceptron module, with the following formulas:
(5)
F_O = MLP(SA) (6)
where D denotes the feature dimension; MLP denotes a multi-layer perceptron containing multiple convolutional or fully connected layers.
6. The event camera video reconstruction method based on deep learning according to claim 1, characterized in that the feature alignment mentioned in S5 is specifically: the position features corresponding to the adjacent frames are reset to the positions corresponding to the intermediate frame according to the feature offsets of the adjacent frames obtained in S3, thereby realizing feature alignment.
7. The event camera video reconstruction method based on deep learning according to claim 1, wherein S6 specifically comprises the following:
S6.1, performing parameter-free batch normalization on the input reset main features, with the following formula:
(7)
where μ denotes the mean and σ denotes the standard deviation;
S6.2, the reconstruction result of the previous frame is dimension-expanded to the main-feature dimension using a convolution W_s, with the following calculation formula:
(8)
S6.3, applying convolutions W_γ and W_β to generate the normalization coefficient and bias term for the main features, with the following formula:
(9)。
8. The event camera video reconstruction method based on deep learning according to claim 1, characterized in that S7 specifically comprises the following: the previously extracted sub-features are concatenated with the output features; a 2D convolution module first fuses them, several stacked 3D convolutions then form a module that decodes the features, and a Leaky ReLU provides the nonlinearity between the 3D convolution modules, with the following formula:
(10)。
9. The event camera video reconstruction method based on deep learning according to claim 1, characterized in that S9 specifically comprises the following: the main features output at the 1/2 and 1/4 scales through S2-S7 are upsampled to the original scale using the pixel shuffle operation, and all main features are fused using one ConvLSTM and several convolution layers to finally obtain the reconstructed gray-scale image.
10. The event camera video reconstruction method based on deep learning according to claim 1, characterized in that the loss function mentioned in S10 includes an L1 loss function, a perceptual loss function and a temporal consistency loss function, and the sum of the three is taken as the final loss of the network, with the following formula:
(11)
where γ_TC is set to 5; the perceptual loss computes the L1 distance on the outputs of the first five hidden layers of a VGG-19 pre-trained on the ImageNet dataset, i.e. L_0 = 5; the weight of each hidden layer is set to 1, i.e. w_l = 1; and the warped frame denotes the result of aligning the previous frame to the current frame according to the feature offsets.
CN202211121596.1A 2022-09-15 2022-09-15 Event camera video reconstruction method based on deep learning Active CN115484410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211121596.1A CN115484410B (en) 2022-09-15 2022-09-15 Event camera video reconstruction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211121596.1A CN115484410B (en) 2022-09-15 2022-09-15 Event camera video reconstruction method based on deep learning

Publications (2)

Publication Number Publication Date
CN115484410A CN115484410A (en) 2022-12-16
CN115484410B true CN115484410B (en) 2023-11-24

Family

ID=84424074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211121596.1A Active CN115484410B (en) 2022-09-15 2022-09-15 Event camera video reconstruction method based on deep learning

Country Status (1)

Country Link
CN (1) CN115484410B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116456183B (en) * 2023-04-20 2023-09-26 北京大学 High dynamic range video generation method and system under guidance of event camera
CN116309781B (en) * 2023-05-18 2023-08-22 吉林大学 Cross-modal fusion-based underwater visual target ranging method and device
CN117097876B (en) * 2023-07-07 2024-03-08 天津大学 Event camera image reconstruction method based on neural network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN113034380A (en) * 2021-02-09 2021-06-25 浙江大学 Video space-time super-resolution method and device based on improved deformable convolution correction
CN113837938A (en) * 2021-07-28 2021-12-24 北京大学 Super-resolution method for reconstructing potential image based on dynamic vision sensor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3933673A1 (en) * 2020-07-01 2022-01-05 Tata Consultancy Services Limited System and method to capture spatio-temporal representation for video reconstruction and analysis

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN113034380A (en) * 2021-02-09 2021-06-25 浙江大学 Video space-time super-resolution method and device based on improved deformable convolution correction
CN113837938A (en) * 2021-07-28 2021-12-24 北京大学 Super-resolution method for reconstructing potential image based on dynamic vision sensor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video super-resolution algorithm based on spatio-temporal features and neural networks; Li Linghui; Du Junping; Liang Meiyu; Ren Nan; Lee Jang Myung; Journal of Beijing University of Posts and Telecommunications, Vol. 39, No. 4; full text *

Also Published As

Publication number Publication date
CN115484410A (en) 2022-12-16

Similar Documents

Publication Publication Date Title
CN115484410B (en) Event camera video reconstruction method based on deep learning
CN111582483B (en) Unsupervised learning optical flow estimation method based on space and channel combined attention mechanism
CN112465718B (en) Two-stage image restoration method based on generation of countermeasure network
CN110288555B (en) Low-illumination enhancement method based on improved capsule network
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN112733950A (en) Power equipment fault diagnosis method based on combination of image fusion and target detection
CN111819568A (en) Method and device for generating face rotation image
CN110570363A (en) Image defogging method based on Cycle-GAN with pyramid pooling and multi-scale discriminator
CN113283444B (en) Heterogeneous image migration method based on generation countermeasure network
CN114445292A (en) Multi-stage progressive underwater image enhancement method
CN116168067B (en) Supervised multi-modal light field depth estimation method based on deep learning
CN113610046B (en) Behavior recognition method based on depth video linkage characteristics
CN113793261A (en) Spectrum reconstruction method based on 3D attention mechanism full-channel fusion network
CN115131214A (en) Indoor aged person image super-resolution reconstruction method and system based on self-attention
CN110335299A (en) A kind of monocular depth estimating system implementation method based on confrontation network
CN116258757A (en) Monocular image depth estimation method based on multi-scale cross attention
CN115115685A (en) Monocular image depth estimation algorithm based on self-attention neural network
CN116205962A (en) Monocular depth estimation method and system based on complete context information
CN117391938B (en) Infrared image super-resolution reconstruction method, system, equipment and terminal
CN114119356A (en) Method for converting thermal infrared image into visible light color image based on cycleGAN
CN113379606A (en) Face super-resolution method based on pre-training generation model
CN111401209B (en) Action recognition method based on deep learning
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
Huang et al. CS-VQA: visual question answering with compressively sensed images
CN115439849B (en) Instrument digital identification method and system based on dynamic multi-strategy GAN network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant