CN112819858A - Target tracking method, device and equipment based on video enhancement and storage medium - Google Patents

Target tracking method, device and equipment based on video enhancement and storage medium

Info

Publication number
CN112819858A
CN112819858A (application CN202110129674.1A)
Authority
CN
China
Prior art keywords
video
enhanced
target tracking
video data
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110129674.1A
Other languages
Chinese (zh)
Other versions
CN112819858B (en)
Inventor
向国庆
文映博
严韫瑶
张鹏
贾惠柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Boya Huishi Intelligent Technology Research Institute Co ltd
Original Assignee
Beijing Boya Huishi Intelligent Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Boya Huishi Intelligent Technology Research Institute Co ltd filed Critical Beijing Boya Huishi Intelligent Technology Research Institute Co ltd
Priority to CN202110129674.1A priority Critical patent/CN112819858B/en
Publication of CN112819858A publication Critical patent/CN112819858A/en
Application granted granted Critical
Publication of CN112819858B publication Critical patent/CN112819858B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/73 Deblurring; Sharpening
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/90 Dynamic range modification of images or parts thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a target tracking method, apparatus, device and storage medium based on video enhancement. The method comprises the following steps: acquiring video data to be enhanced; enhancing the video data to be enhanced through a pre-trained low-light image enhancement network to obtain enhanced video data; and performing target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced. By constructing and training the low-light image enhancement network and enhancing the video data to be enhanced through it, the contrast and chroma of each video frame in the video data are improved and the noise in each frame is reduced, so that details in the video data become clearer and the target to be tracked is easier to identify. Performing target tracking on the video data on the basis of this video enhancement greatly improves the accuracy of target tracking.

Description

Target tracking method, device and equipment based on video enhancement and storage medium
Technical Field
The application belongs to the technical field of video processing, and particularly relates to a target tracking method, device, equipment and storage medium based on video enhancement.
Background
A captured image or video is often limited by the shooting environment and therefore suffers from defects such as insufficient brightness, low contrast and severe noise. For example, surveillance video shot at night is limited by severely insufficient light: the footage is extremely dark, details are blurred and noise is heavy, so the target categories in the video are hard to distinguish, which greatly hinders target detection and tracking. Such images and videos therefore need enhancement processing.
At present, the related art provides some target tracking methods based on video enhancement, for example methods built on the multi-scale Retinex low-light enhancement technique. Although enhancement by this technique yields a high-brightness image, the contrast, chroma and texture details of that image are damaged to some extent, and the underlying noise is amplified along with the brightness. The resulting video image therefore fails to satisfy the visual expectations of the human eye, the targets in it remain hard to distinguish, and the task of target detection and tracking remains difficult to accomplish.
Disclosure of Invention
The application provides a target tracking method, apparatus, device and storage medium based on video enhancement, which enhance the video data to be enhanced through a pre-trained low-light image enhancement network, improving the contrast and chroma of each video frame in the video data, reducing the noise in each frame, making the details in the video data clearer, and facilitating identification of the target to be tracked. Target tracking is then performed on the video data on the basis of this video enhancement, greatly improving the accuracy of target tracking.
An embodiment of a first aspect of the present application provides a target tracking method based on video enhancement, including:
acquiring video data to be enhanced;
enhancing the video data to be enhanced through a pre-trained low-light image enhancement network to obtain enhanced video data;
and carrying out target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced.
In some embodiments of the present application, before the video data to be enhanced is enhanced through the pre-trained low-light image enhancement network to obtain the enhanced video data, the method further includes:
constructing a network structure of a low-light image enhancement network;
acquiring a training set, wherein the training set comprises night video images;
and training the constructed low-light image enhancement network according to the training set to obtain the trained low-light image enhancement network.
In some embodiments of the present application, the network structure for constructing the low-light image enhancement network includes:
connecting the first convolution layer and the activation layer in series to obtain a feature extraction module;
sequentially connecting the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in series to obtain an image enhancement module;
sequentially connecting a preset number of the feature extraction modules in series;
connecting each feature extraction module with one image enhancement module respectively;
and connecting each image enhancement module with a full connection layer to obtain the network structure of the low-light image enhancement network.
In some embodiments of the present application, the training the constructed low-light image enhancement network according to the training set to obtain a trained low-light image enhancement network includes:
acquiring the night video images from the training set;
inputting the night video images into a preset number of feature extraction modules which are sequentially connected in series to obtain a preset number of feature graphs;
inputting the preset number of feature maps into the image enhancement module connected with each feature extraction module respectively to obtain an enhanced feature map corresponding to each feature map;
connecting each enhanced feature map through the full-connection layer to obtain an enhanced video image corresponding to the night video image;
calculating a spatial consistency loss value, a perception loss value and a color loss value corresponding to the current training period according to the night video image and the enhanced video image corresponding to the night video image;
and when the spatial consistency loss value, the perception loss value and the color loss value meet a preset convergence condition, obtaining a trained low-light image enhancement network.
In some embodiments of the present application, before the inputting the night video image into a preset number of the feature extraction modules connected in series in sequence, the method further includes:
and performing regularization processing on the night video image, and compressing the pixel value of each color channel in the night video image to a preset interval.
In some embodiments of the present application, the performing target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced includes:
respectively carrying out target detection on each video frame in the enhanced video data through a preset target tracking network, and positioning each target to be tracked in each video frame;
tracking each target to be tracked through a preset target tracking algorithm to obtain a target tracking result corresponding to each target to be tracked;
respectively carrying out smooth interpolation processing on target tracking results corresponding to each target to be tracked;
and generating a target track video according to the target tracking result corresponding to each target to be tracked after the smooth interpolation processing.
In some embodiments of the present application, the convolution kernels of the first convolution layer and the seventh convolution layer each have a size of 3 × 3, and are configured to output a 256 × 256 × 32 feature map;
the sizes of convolution kernels of the second convolution layer and the sixth convolution layer are both 3 × 3, and the convolution kernels are used for outputting a 128 × 128 × 8 feature map;
the sizes of convolution kernels of the third convolution layer and the fifth convolution layer are both 5 × 5, and the convolution kernels are used for outputting a feature map of 64 × 64 × 16;
the convolution kernel of the fourth convolution layer has a size of 5 × 5, and is used to output a 32 × 32 × 32 feature map.
An embodiment of a second aspect of the present application provides a target tracking apparatus based on video enhancement, including:
the video acquisition module is used for acquiring video data to be enhanced;
the enhancement processing module is used for enhancing the video data to be enhanced through a low-light image enhancement network trained in advance to obtain enhanced video data;
and the target tracking module is used for carrying out target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced.
Embodiments of the third aspect of the present application provide an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method of the first aspect.
An embodiment of a fourth aspect of the present application provides a computer-readable storage medium having a computer program stored thereon, the program being executable by a processor to implement the method of the first aspect.
The technical scheme provided in the embodiment of the application at least has the following technical effects or advantages:
in the embodiment of the application, the low-light image enhancement network is constructed and trained, and the video data to be enhanced is enhanced through the low-light image enhancement network, so that the contrast and the chroma of each video frame in the video data to be enhanced are improved, the noise in each video frame is reduced, the details in the video data to be enhanced are clearer, and the target to be tracked is convenient to identify. Target tracking is carried out on the video data on the basis of the video enhancement, and the accuracy of target tracking is greatly improved.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic network structure diagram of a low-light image enhancement network provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a target tracking method based on video enhancement according to an embodiment of the present application;
fig. 3 is a schematic diagram illustrating a process of enhancing a low-light image by using a low-light image enhancement network according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram illustrating a target tracking device based on video enhancement according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 6 shows a schematic diagram of a storage medium provided in an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which this application belongs.
A target tracking method, an apparatus, a device and a storage medium based on video enhancement according to embodiments of the present application are described below with reference to the accompanying drawings.
The embodiment of the application provides a target tracking method based on video enhancement, which trains a low-light image enhancement network, enhances the video data to be enhanced with that network, and performs target tracking on each enhanced video frame to finally obtain a tracking video sequence, effectively improving the tracking accuracy. Compared with performing target tracking directly on the original low-illumination video, the method improves the tracking accuracy by 105% on average.
In the embodiment of the present application, the low-light image enhancement network is trained through the following steps S1 to S3:
s1: and constructing a network structure of the low-light image enhancement network.
Specifically, the first convolution layer and the activation layer are connected in series to obtain a feature extraction module; the second, third, fourth, fifth, sixth and seventh convolution layers are sequentially connected in series to obtain an image enhancement module; a preset number of feature extraction modules are sequentially connected in series; each feature extraction module is connected with one image enhancement module; and each image enhancement module is connected with the full connection layer, yielding the network structure of the low-light image enhancement network.
The convolution kernels of the first convolution layer and the seventh convolution layer may both be 3 × 3 in size, and the step size is 1, so as to output a 256 × 256 × 32 feature map. The convolution kernels of the second convolutional layer and the sixth convolutional layer may each have a size of 3 × 3 with a step size of 1, for outputting a 128 × 128 × 8 feature map. The convolution kernels of the third convolution layer and the fifth convolution layer may each have a size of 5 × 5 with a step size of 1, for outputting a feature map of 64 × 64 × 16. The convolution kernel of the fourth convolution layer has a size of 5 × 5 and a step size of 1, and is used to output a 32 × 32 × 32 feature map. The predetermined number may be 8, 9, 10, etc.
The embodiment of the present application does not limit the values of the convolution kernels and the step lengths of the first convolution layer, the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer, and the seventh convolution layer, and does not limit the size of the feature map output by each convolution layer, and the values can be set according to requirements in practical applications. The embodiment of the present application also does not limit the specific values of the preset number, and the specific values can be set according to requirements in practical application.
Fig. 1 shows the network structure of a low-light image enhancement network whose preset number is 8; there, the convolution kernels of the first and seventh convolution layers are 3 × 3 in size, and the convolution kernels of the second through sixth convolution layers are all 5 × 5.
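For concreteness, the following is a minimal PyTorch sketch of the structure described in S1, not the patent's reference implementation. The "full connection layer" that merges the enhanced feature maps is modeled here as a 1 × 1 convolution over their concatenation, and every convolution keeps the spatial resolution; the per-layer output sizes quoted above would additionally require down- and up-sampling steps that the text does not specify, so the class names and those two choices are assumptions.

```python
import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    """First convolution layer connected in series with an activation layer."""
    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class ImageEnhancement(nn.Module):
    """Second through seventh convolution layers connected in series;
    channel counts follow the text (32 -> 8 -> 16 -> 32 -> 16 -> 8 -> 32)."""
    def __init__(self, in_ch=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, 8, 3, 1, 1),   # second convolution layer
            nn.Conv2d(8, 16, 5, 1, 2),      # third
            nn.Conv2d(16, 32, 5, 1, 2),     # fourth
            nn.Conv2d(32, 16, 5, 1, 2),     # fifth
            nn.Conv2d(16, 8, 3, 1, 1),      # sixth
            nn.Conv2d(8, 32, 3, 1, 1),      # seventh
        )

    def forward(self, x):
        return self.block(x)

class LowLightEnhanceNet(nn.Module):
    def __init__(self, n_modules=8):                  # "preset number" = 8
        super().__init__()
        self.extractors = nn.ModuleList(
            [FeatureExtraction(3 if i == 0 else 32) for i in range(n_modules)])
        self.enhancers = nn.ModuleList(
            [ImageEnhancement() for _ in range(n_modules)])
        # stand-in for the "full connection layer" joining the enhanced maps
        self.fuse = nn.Conv2d(32 * n_modules, 3, kernel_size=1)

    def forward(self, x):
        feats, out = [], x
        for extractor in self.extractors:              # serial chain of extractors
            out = extractor(out)
            feats.append(out)
        enhanced = [enh(f) for enh, f in zip(self.enhancers, feats)]
        return torch.sigmoid(self.fuse(torch.cat(enhanced, dim=1)))
```

With the default preset number of 8, the fusion layer sees 8 × 32 = 256 concatenated channels and reduces them back to a 3-channel enhanced image.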
S2: a training set is obtained, the training set including night video images.
In the embodiment of the application, the training set is established from a public standard video data set. Specific sequences from the public data set CDW2014 are used as training data. The data set contains 11 groups of video sequences, covering severe weather, night video, thermal imaging video and heavily shadowed video, with each group containing 4 to 6 videos. The embodiment of the application selects the night video group, 6 videos in total containing low-light and high-light vehicle driving scenes, as the training videos; this group of the data set covers most low-light enhancement application scenarios.
S3: and training the constructed low-light image enhancement network according to the training set to obtain the trained low-light image enhancement network.
First, a certain number of night video images are acquired from the training set. The training set contains multiple night videos shot in low-light night scenes, each night video comprising multiple frames of night video images; a batch of batch-size night video images is acquired from the training set according to the batch size (batch processing quantity) corresponding to the constructed low-light image enhancement network.
Each acquired night video image is compressed to a preset size, which may be 512 × 512 × 3. Each night video image is then regularized, compressing the pixel value of each color channel to a preset interval, which may be [0, 1]. Each night video image is then input into the preset number of feature extraction modules connected in series to obtain the preset number of feature maps. For any night video image, the convolution layer of the first feature extraction module performs a convolution operation on the image, and the activation layer activates the convolution result with the ReLU activation function, giving the first feature map corresponding to the night video image. That feature map is input into the second feature extraction module for convolution and activation, giving the second feature map; the second feature map is input into the third feature extraction module, and so on through the preset number of feature extraction modules, yielding the preset number of feature maps corresponding to the night video image.
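The compression and regularization step can be sketched as follows; the use of OpenCV for resizing and the float32 conversion are assumptions rather than the patent's specification.

```python
import cv2
import numpy as np

def preprocess(frame, size=(512, 512)):
    """Compress a frame to the preset size and regularize each colour
    channel's pixel values to the preset interval [0, 1]."""
    frame = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
    return frame.astype(np.float32) / 255.0
```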
The obtained preset number of feature maps are respectively input into the image enhancement module connected with each feature extraction module to obtain an enhanced feature map corresponding to each feature map. Each image enhancement module receives the feature map generated by the feature extraction module connected to it and consists of the second through seventh convolution layers connected in series: the second convolution layer convolves the feature map and outputs a 128 × 128 × 8 feature map; that map is input into the third convolution layer, which outputs a 64 × 64 × 16 feature map; the fourth convolution layer outputs a 32 × 32 × 32 feature map; the fifth convolution layer outputs a 64 × 64 × 16 feature map; the sixth convolution layer outputs a 128 × 128 × 8 feature map; and the seventh convolution layer outputs a 256 × 256 × 32 feature map.
And performing image enhancement processing in each image enhancement module through the process to obtain a preset number of enhancement feature maps corresponding to the night video image. And finally, connecting each enhanced feature map through a full connection layer to obtain an enhanced video image corresponding to the night video image.
For each of the batch size night video images input into the low light enhancement network, an enhanced video image corresponding to each night video image is obtained in the above manner. And training and learning the batch size night video images, which is called as a training period of the low light enhancement network. In the current training period, after obtaining an enhanced video image corresponding to each night video image in the batch size night video images, calculating a spatial consistency loss value, a perception loss value and a color loss value corresponding to the current training period according to the night video images and the enhanced video images corresponding to the night video images.
The spatial consistency loss value is calculated by the following formula (1):

L_{spa} = \frac{1}{K} \sum_{i=1}^{K} \sum_{j \in \Omega(i)} \left( \left| Y_i - Y_j \right| - \left| I_i - I_j \right| \right)^2        (1)

In formula (1), L_{spa} is the spatial consistency loss value; K is the number of local regions in the enhanced video image; \Omega(i) is the set of four neighbouring regions of the current local region i (with the region size set to 8 × 8); Y is the average intensity of the current local region in the enhanced video image; and I is the average intensity of the corresponding local region in the night video image.
The perceptual loss value is calculated by the following formula (2):

L_{per} = \frac{1}{W_{i,j} H_{i,j} C_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \sum_{z=1}^{C_{i,j}} \left( F_{i,j}(I)_{x,y,z} - F_{i,j}(O)_{x,y,z} \right)^2        (2)

In formula (2), L_{per} is the perceptual loss value; i and j denote the i-th max-pooling layer of the VGG-16 (Visual Geometry Group Network-16) network and the j-th convolution layer before that pooling layer; W_{i,j}, H_{i,j} and C_{i,j} are the width, height and number of channels of the corresponding feature map, i.e. the size of the feature map; and F_{i,j}(I)_{x,y,z} and F_{i,j}(O)_{x,y,z} are the feature maps of the night video image I and of the corresponding enhanced video image O at those layers.
The color loss value is calculated by the following formula (3):

L_{col} = \sum_{(p,q) \in \varepsilon} \left( J^p - J^q \right)^2, \quad \varepsilon = \{ (R,G), (R,B), (G,B) \}        (3)

In formula (3), L_{col} is the color loss value; J^p denotes the average intensity of color channel p of the enhanced video image; and (p, q) denotes a pair of channels taken from the three RGB color channels, the set of such pairs being \varepsilon.
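Under the definitions in formulas (1) to (3), the three loss values can be sketched in PyTorch as follows. The wrap-around border handling of the four-neighbourhood via torch.roll and the truncation point of the VGG-16 feature extractor are simplifying assumptions, not the patent's specification.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

def spatial_consistency_loss(enhanced, original, region=8):
    # Y, I: average intensities of 8x8 local regions of the enhanced and
    # night images (formula 1); Omega(i) is approximated by shifting the
    # region-pooled maps up/down/left/right (borders wrap around).
    y = F.avg_pool2d(enhanced.mean(1, keepdim=True), region)
    i = F.avg_pool2d(original.mean(1, keepdim=True), region)
    loss = 0.0
    for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        y_n = torch.roll(y, shifts=(dy, dx), dims=(2, 3))
        i_n = torch.roll(i, shifts=(dy, dx), dims=(2, 3))
        loss = loss + (((y - y_n).abs() - (i - i_n).abs()) ** 2).mean()
    return loss

class PerceptualLoss(torch.nn.Module):
    # F_{i,j}: VGG-16 feature map around the i-th max-pooling layer
    # (formula 2); truncating at features[:16] (relu3_3) is an assumption.
    def __init__(self, cut=16):
        super().__init__()
        self.vgg = vgg16(pretrained=True).features[:cut].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False

    def forward(self, enhanced, original):
        # mean squared error over the W x H x C feature volume
        return F.mse_loss(self.vgg(enhanced), self.vgg(original))

def color_loss(enhanced):
    # J^p: average intensity of colour channel p of the enhanced image;
    # epsilon = {(R,G), (R,B), (G,B)} (formula 3)
    j = enhanced.mean(dim=(2, 3))
    r, g, b = j[:, 0], j[:, 1], j[:, 2]
    return ((r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2).mean()
```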
In the current training period, after the spatial consistency loss value, perceptual loss value and color loss value corresponding to each night video image are calculated through formulas (1) to (3), it is judged whether the calculated values meet the preset convergence condition. The preset convergence condition stipulates that, for training to end, the spatial consistency loss value, perceptual loss value and color loss value must be smaller than a preset spatial consistency loss threshold, a preset perceptual loss threshold and a preset color loss threshold, respectively. If the three loss values of the current training period are each smaller than the corresponding threshold, the preset convergence condition is met: training stops, and the low-light image enhancement network of the current training period, together with its parameters, is taken as the trained low-light image enhancement network.
If any one of the spatial consistency loss value, perceptual loss value and color loss value of the current training period fails the preset convergence condition, the parameters of the low-light image enhancement network are adjusted through back-propagation, another batch-size set of night video images is acquired from the training set, and the next training period proceeds in the manner described above, until all three loss values meet the preset convergence condition and the trained low-light image enhancement network is obtained.
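A hedged sketch of one such training loop follows; the optimizer, learning rate and threshold values are hypothetical, since the patent only states that preset thresholds exist.

```python
def train(net, loader, thresholds=(0.1, 0.1, 0.1), lr=1e-4, max_periods=10000):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    perceptual = PerceptualLoss()
    for _, batch in zip(range(max_periods), loader):   # one training period per batch
        enhanced = net(batch)
        losses = (spatial_consistency_loss(enhanced, batch),
                  perceptual(enhanced, batch),
                  color_loss(enhanced))
        # preset convergence condition: every loss below its preset threshold
        if all(l.item() < t for l, t in zip(losses, thresholds)):
            break
        opt.zero_grad()
        sum(losses).backward()                          # adjust parameters by back-propagation
        opt.step()
    return net
```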
After the trained low-light image enhancement network is obtained in the above manner, as shown in fig. 2, enhancement and target tracking are performed on video data to be enhanced through the following steps.
Step 101: and acquiring video data to be enhanced.
The video data to be enhanced may be video data shot by a camera in a low-light scene, or video data which needs to be enhanced and is acquired from a network.
Step 102: and carrying out enhancement processing on the video data to be enhanced through a pre-trained low-light image enhancement network to obtain enhanced video data.
Each frame image in the acquired video data to be enhanced is compressed to the preset size; after compression, each frame image is regularized so that the pixel values of each color channel are compressed to the preset interval. A batch-size set of images corresponding to the low-light image enhancement network is then taken from the frame images of the video data to be enhanced and input into the trained low-light image enhancement network, which outputs the enhanced image corresponding to each image in the batch. As shown in fig. 3, a frame image is preprocessed and then input into the low-light image enhancement network to obtain the corresponding enhanced image. The low-light image enhancement network shown in fig. 3 has 8 feature extraction modules connected in series and 8 image enhancement modules; it generates 8 enhanced feature maps for an input image, and finally the 8 enhanced feature maps are connected through the full connection layer to obtain the final enhanced image.
For each frame image in the video data to be enhanced, the corresponding enhanced image is obtained through the trained low-light image enhancement network in the above manner, yielding the enhanced video data corresponding to the video data to be enhanced.
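An end-to-end sketch of step 102 for a single video file follows; the file name is hypothetical, and colour-order handling (OpenCV reads BGR) is omitted for brevity.

```python
cap = cv2.VideoCapture("to_be_enhanced.mp4")   # hypothetical input video
net = LowLightEnhanceNet().eval()

enhanced_frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    x = torch.from_numpy(preprocess(frame)).permute(2, 0, 1).unsqueeze(0)
    with torch.no_grad():
        y = net(x)[0].permute(1, 2, 0).numpy()  # enhanced frame in [0, 1]
    enhanced_frames.append(y)
cap.release()
```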
Step 103: and carrying out target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced.
After the enhanced video data corresponding to the video data to be enhanced is obtained, it is input into a preset target tracking network, which performs target detection on each video frame in the enhanced video data and locates each target to be tracked in each frame. Specifically, the preset target tracking network performs Fast R-CNN target detection on each input video frame. First, candidate regions in which targets to be tracked may lie are extracted from the input image using the Selective Search algorithm and are mapped onto the convolutional feature layer of the preset target tracking network according to their spatial position relationship. Then region normalization is performed: an ROI (region of interest) Pooling operation is applied to each candidate region on the convolutional feature layer to obtain the extracted features. Finally, the extracted features are input into a full connection layer, classified with Softmax (a logistic regression model), and the positions of the candidate regions are regressed, giving the target detection result, i.e. each target to be tracked is located in each video frame.
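The patent's detector is Fast R-CNN with Selective Search proposals. As a runnable stand-in, the sketch below uses torchvision's pretrained Faster R-CNN, which folds region proposal into the network itself; the score threshold is a hypothetical parameter.

```python
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(pretrained=True).eval()

@torch.no_grad()
def detect_targets(frame_tensor, score_thresh=0.5):
    """Locate targets to be tracked in one enhanced frame (C x H x W, [0, 1])."""
    out = detector([frame_tensor])[0]
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["labels"][keep], out["scores"][keep]
```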
Each target to be tracked is then tracked through a preset target tracking algorithm to obtain the target tracking result corresponding to each target. Specifically, tracking proceeds from the targets detected by the Fast R-CNN algorithm, using a preset target tracking algorithm such as Deep Sort to obtain the tracking result. Deep Sort is a multi-target tracking algorithm that performs data association using a motion model and appearance information; its running speed is mainly determined by the detection algorithm. The algorithm performs target detection on each frame and then matches the previously obtained motion trajectories with the currently detected objects through a weighted Hungarian matching algorithm, forming each object's motion trajectory. The matching weight is obtained as a weighted sum of the Mahalanobis distance between a detection and the motion trajectory and the similarity of the image patches.
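The weighted matching step can be sketched with SciPy's Hungarian solver; the weighting factor lam and the cost construction are assumptions consistent with the description above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(maha_dist, appearance_sim, lam=0.5):
    """Associate existing trajectories (rows) with current detections
    (columns) by a weighted sum of Mahalanobis distance and appearance
    dissimilarity, solved with the Hungarian algorithm."""
    cost = lam * maha_dist + (1.0 - lam) * (1.0 - appearance_sim)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```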
According to the target tracking results obtained for each target to be tracked in images at different times, smooth interpolation processing is performed on the tracking result of each target, and a target trajectory video is generated from the smoothly interpolated tracking results. In the finally generated target trajectory video, each target to be tracked is marked by the minimum bounding rectangle identifying it, and the movement trajectory of each target is marked by a curve.
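One plausible reading of the smooth interpolation step is linear interpolation of each box coordinate over the frames in which a target was missed, sketched below; frame indices are assumed to be increasing.

```python
def smooth_track(track_frames, track_boxes, all_frames):
    """Fill in a target's tracking result on missing frames by linear
    interpolation of each box coordinate (x1, y1, x2, y2)."""
    boxes = np.asarray(track_boxes, dtype=np.float32)
    return np.stack([np.interp(all_frames, track_frames, boxes[:, k])
                     for k in range(boxes.shape[1])], axis=1)
```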
In the embodiment of the application, the quality of the processing results of the low-light image enhancement network and of the preset target tracking network can be evaluated. The evaluation parameters used for the low-light image enhancement network may be PSNR (peak signal-to-noise ratio), SSIM (structural similarity) and MAE (mean absolute error). The evaluation parameters used for the preset target tracking network may be MOTA (Multiple Object Tracking Accuracy), MOTP (Multiple Object Tracking Precision) and IDP (Identification Precision). The evaluation results show that the final target tracking accuracy is effectively improved.
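Of the image-quality metrics mentioned, PSNR and MAE reduce to one-liners; a sketch follows, where a peak value of 1.0 assumes images regularized to [0, 1].

```python
def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    mse = np.mean((np.asarray(a) - np.asarray(b)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def mae(a, b):
    """Mean absolute error between two same-shaped images."""
    return np.mean(np.abs(np.asarray(a) - np.asarray(b)))
```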
In the embodiment of the application, the low-light image enhancement network is constructed and trained, and the video data to be enhanced is enhanced through the low-light image enhancement network, so that the contrast and the chroma of each video frame in the video data to be enhanced are improved, the noise in each video frame is reduced, the details in the video data to be enhanced are clearer, and the target to be tracked is convenient to identify. Target tracking is carried out on the video data on the basis of the video enhancement, and the accuracy of target tracking is greatly improved.
The embodiment of the application also provides a target tracking device based on video enhancement, which is used for executing the target tracking method based on video enhancement provided by any one of the above embodiments. Referring to fig. 4, the apparatus includes:
a video obtaining module 401, configured to obtain video data to be enhanced;
an enhancement processing module 402, configured to perform enhancement processing on video data to be enhanced through a low-light image enhancement network trained in advance, so as to obtain enhanced video data;
and a target tracking module 403, configured to perform target tracking processing on the enhanced video data through a preset target tracking network, so as to obtain a target tracking video sequence corresponding to the video data to be enhanced.
The device also includes: the network training module is used for constructing a network structure of the low-light image enhancement network; acquiring a training set, wherein the training set comprises night video images; and training the constructed low-light image enhancement network according to the training set to obtain the trained low-light image enhancement network.
The network training module is used for connecting the first convolution layer and the activation layer in series to obtain a feature extraction module; sequentially connecting the second, third, fourth, fifth, sixth and seventh convolution layers in series to obtain an image enhancement module; sequentially connecting a preset number of feature extraction modules in series; connecting each feature extraction module with an image enhancement module respectively; and connecting each image enhancement module with the full connection layer to obtain the network structure of the low-light image enhancement network. The convolution kernels of the first and seventh convolution layers are both 3 × 3 in size and output a 256 × 256 × 32 feature map; the convolution kernels of the second and sixth convolution layers are both 3 × 3 and output a 128 × 128 × 8 feature map; the convolution kernels of the third and fifth convolution layers are both 5 × 5 and output a 64 × 64 × 16 feature map; and the convolution kernel of the fourth convolution layer is 5 × 5 and outputs a 32 × 32 × 32 feature map.
The network training module is used for acquiring night video images from the training set; inputting the night video images into the preset number of feature extraction modules connected in series to obtain the preset number of feature maps; respectively inputting the preset number of feature maps into the image enhancement module connected with each feature extraction module to obtain an enhanced feature map corresponding to each feature map; connecting each enhanced feature map through the full connection layer to obtain the enhanced video image corresponding to the night video image; calculating the spatial consistency loss value, perceptual loss value and color loss value corresponding to the current training period according to the night video image and its corresponding enhanced video image; and obtaining the trained low-light image enhancement network when the spatial consistency loss value, perceptual loss value and color loss value meet the preset convergence condition.
The network training module is further used for, before inputting the night video images into the preset number of feature extraction modules connected in series, performing regularization processing on the night video images so as to compress the pixel value of each color channel in the night video images to a preset interval.
A target tracking module 403, configured to perform target detection on each video frame in the enhanced video data through a preset target tracking network, and locate each target to be tracked in each video frame; tracking each target to be tracked through a preset target tracking algorithm to obtain a target tracking result corresponding to each target to be tracked; respectively carrying out smooth interpolation processing on target tracking results corresponding to each target to be tracked; and generating a target track video according to the target tracking result corresponding to each target to be tracked after the smooth interpolation processing.
The target tracking device based on video enhancement provided by the above embodiment of the present application and the target tracking method based on video enhancement provided by the embodiment of the present application have the same inventive concept and have the same beneficial effects as the method adopted, operated or realized by the application program stored in the target tracking device.
The embodiment of the application also provides electronic equipment for executing the target tracking method based on video enhancement. Please refer to fig. 5, which illustrates a schematic diagram of an electronic device according to some embodiments of the present application. As shown in fig. 5, the electronic apparatus 5 includes: the system comprises a processor 500, a memory 501, a bus 502 and a communication interface 503, wherein the processor 500, the communication interface 503 and the memory 501 are connected through the bus 502; the memory 501 stores a computer program that can be executed on the processor 500, and the processor 500 executes the computer program to execute the target tracking method based on video enhancement provided by any of the foregoing embodiments of the present application.
The Memory 501 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between this network element of the system and at least one other network element is realized through at least one communication interface 503 (which may be wired or wireless); the internet, a wide area network, a local area network, a metropolitan area network and the like can be used.
Bus 502 can be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The memory 501 is used for storing a program, and the processor 500 executes the program after receiving an execution instruction, and the video enhancement-based target tracking method disclosed in any embodiment of the foregoing application may be applied to the processor 500, or implemented by the processor 500.
The processor 500 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 500 or by instructions in the form of software. The processor 500 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP) and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the present application may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM or EEPROM, or a register. The storage medium is located in the memory 501, and the processor 500 reads the information in the memory 501 and completes the steps of the method in combination with its hardware.
The electronic device provided by the embodiment of the application and the target tracking method based on video enhancement provided by the embodiment of the application have the same beneficial effects as the method adopted, operated or realized by the electronic device.
The embodiment of the present application further provides a computer-readable storage medium corresponding to the target tracking method based on video enhancement provided in the foregoing embodiments. Referring to fig. 6, the illustrated computer-readable storage medium is an optical disc 30 storing a computer program (i.e. a program product); when executed by a processor, the computer program performs the target tracking method based on video enhancement provided in any of the foregoing embodiments.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.
The computer-readable storage medium provided by the above-mentioned embodiment of the present application and the target tracking method based on video enhancement provided by the embodiment of the present application have the same beneficial effects as the method adopted, run or implemented by the application program stored in the computer-readable storage medium.
It should be noted that:
in the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A target tracking method based on video enhancement is characterized by comprising the following steps:
acquiring video data to be enhanced;
enhancing the video data to be enhanced through a pre-trained low-light image enhancement network to obtain enhanced video data;
and carrying out target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced.
2. The method according to claim 1, wherein before the video data to be enhanced is enhanced through the pre-trained low-light image enhancement network to obtain the enhanced video data, the method further comprises:
constructing a network structure of a low-light image enhancement network;
acquiring a training set, wherein the training set comprises night video images;
and training the constructed low-light image enhancement network according to the training set to obtain the trained low-light image enhancement network.
3. The method of claim 2, wherein the constructing the network structure of the low-light image enhancement network comprises:
connecting the first convolution layer and the activation layer in series to obtain a feature extraction module;
sequentially connecting the second convolution layer, the third convolution layer, the fourth convolution layer, the fifth convolution layer, the sixth convolution layer and the seventh convolution layer in series to obtain an image enhancement module;
sequentially connecting a preset number of the feature extraction modules in series;
connecting each feature extraction module with one image enhancement module respectively;
and connecting each image enhancement module with a full connection layer to obtain the network structure of the low-light image enhancement network.
4. The method of claim 3, wherein the training the constructed low-light image enhancement network according to the training set to obtain a trained low-light image enhancement network comprises:
acquiring the night video images from the training set;
inputting the night video images into a preset number of feature extraction modules which are sequentially connected in series to obtain a preset number of feature graphs;
inputting the preset number of feature maps into the image enhancement module connected with each feature extraction module respectively to obtain an enhanced feature map corresponding to each feature map;
connecting each enhanced feature map through the full-connection layer to obtain an enhanced video image corresponding to the night video image;
calculating a spatial consistency loss value, a perception loss value and a color loss value corresponding to the current training period according to the night video image and the enhanced video image corresponding to the night video image;
and when the spatial consistency loss value, the perception loss value and the color loss value meet a preset convergence condition, obtaining a trained low-light image enhancement network.
5. The method according to claim 4, wherein before inputting the night video image into a preset number of the feature extraction modules connected in series in sequence, the method further comprises:
and performing regularization processing on the night video image, and compressing the pixel value of each color channel in the night video image to a preset interval.
6. The method according to any one of claims 1 to 5, wherein the performing target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced comprises:
respectively carrying out target detection on each video frame in the enhanced video data through a preset target tracking network, and positioning each target to be tracked in each video frame;
tracking each target to be tracked through a preset target tracking algorithm to obtain a target tracking result corresponding to each target to be tracked;
respectively carrying out smooth interpolation processing on target tracking results corresponding to each target to be tracked;
and generating a target track video according to the target tracking result corresponding to each target to be tracked after the smooth interpolation processing.
7. The method according to any one of claims 3 to 5,
the convolution kernels of the first convolution layer and the seventh convolution layer are both 3 × 3 in size and are used for outputting a 256 × 256 × 32 feature map;
the sizes of convolution kernels of the second convolution layer and the sixth convolution layer are both 3 × 3, and the convolution kernels are used for outputting a 128 × 128 × 8 feature map;
the sizes of convolution kernels of the third convolution layer and the fifth convolution layer are both 5 × 5, and the convolution kernels are used for outputting a feature map of 64 × 64 × 16;
the convolution kernel of the fourth convolution layer has a size of 5 × 5, and is used to output a 32 × 32 × 32 feature map.
8. An apparatus for tracking a target based on video enhancement, comprising:
the video acquisition module is used for acquiring video data to be enhanced;
the enhancement processing module is used for enhancing the video data to be enhanced through a low-light image enhancement network trained in advance to obtain enhanced video data;
and the target tracking module is used for carrying out target tracking processing on the enhanced video data through a preset target tracking network to obtain a target tracking video sequence corresponding to the video data to be enhanced.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program is executed by a processor to implement the method according to any of claims 1-7.
CN202110129674.1A 2021-01-29 2021-01-29 Target tracking method, device, equipment and storage medium based on video enhancement Active CN112819858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110129674.1A CN112819858B (en) 2021-01-29 2021-01-29 Target tracking method, device, equipment and storage medium based on video enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110129674.1A CN112819858B (en) 2021-01-29 2021-01-29 Target tracking method, device, equipment and storage medium based on video enhancement

Publications (2)

Publication Number Publication Date
CN112819858A true CN112819858A (en) 2021-05-18
CN112819858B CN112819858B (en) 2024-03-22

Family

ID=75860465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110129674.1A Active CN112819858B (en) 2021-01-29 2021-01-29 Target tracking method, device, equipment and storage medium based on video enhancement

Country Status (1)

Country Link
CN (1) CN112819858B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065533A (en) * 2021-06-01 2021-07-02 北京达佳互联信息技术有限公司 Feature extraction model generation method and device, electronic equipment and storage medium
CN113744164A (en) * 2021-11-05 2021-12-03 深圳市安软慧视科技有限公司 Method, system and related equipment for enhancing low-illumination image at night quickly
CN114827567A (en) * 2022-03-23 2022-07-29 阿里巴巴(中国)有限公司 Video quality analysis method, apparatus and readable medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200234414A1 (en) * 2019-01-23 2020-07-23 Inception Institute of Artificial Intelligence, Ltd. Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures
CN111460968A (en) * 2020-03-27 2020-07-28 上海大学 Video-based unmanned aerial vehicle identification and tracking method and device
CN111814755A (en) * 2020-08-18 2020-10-23 深延科技(北京)有限公司 Multi-frame image pedestrian detection method and device for night motion scene
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200234414A1 (en) * 2019-01-23 2020-07-23 Inception Institute of Artificial Intelligence, Ltd. Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures
CN111460968A (en) * 2020-03-27 2020-07-28 上海大学 Video-based unmanned aerial vehicle identification and tracking method and device
CN111814755A (en) * 2020-08-18 2020-10-23 深延科技(北京)有限公司 Multi-frame image pedestrian detection method and device for night motion scene
CN112085088A (en) * 2020-09-03 2020-12-15 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MEIFANG YANG; XIN NIE; RYAN WEN LIU: "Coarse-to-Fine Luminance Estimation for Low-Light Image Enhancement in Maritime Video Surveillance", 2019 IEEE INTELLIGENT TRANSPORTATION SYSTEMS CONFERENCE (ITSC) *
方路平; 翁佩强; 周国民: "Low-light color code image enhancement based on deep learning" (基于深度学习的低光彩码图像增强), Journal of Zhejiang University of Technology, No. 04

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113065533A (en) * 2021-06-01 2021-07-02 北京达佳互联信息技术有限公司 Feature extraction model generation method and device, electronic equipment and storage medium
CN113744164A (en) * 2021-11-05 2021-12-03 深圳市安软慧视科技有限公司 Method, system and related equipment for enhancing low-illumination image at night quickly
CN114827567A (en) * 2022-03-23 2022-07-29 阿里巴巴(中国)有限公司 Video quality analysis method, apparatus and readable medium
CN114827567B (en) * 2022-03-23 2024-05-28 阿里巴巴(中国)有限公司 Video quality analysis method, apparatus and readable medium

Also Published As

Publication number Publication date
CN112819858B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN111741211B (en) Image display method and apparatus
CN112819858B (en) Target tracking method, device, equipment and storage medium based on video enhancement
CN107274445B (en) Image depth estimation method and system
Onzon et al. Neural auto-exposure for high-dynamic range object detection
CN109472191B (en) Pedestrian re-identification and tracking method based on space-time context
US8582915B2 (en) Image enhancement for challenging lighting conditions
WO2021063341A1 (en) Image enhancement method and apparatus
CN110580428A (en) image processing method, image processing device, computer-readable storage medium and electronic equipment
CN108897786A (en) Recommended method, device, storage medium and the mobile terminal of application program
CN112348747A (en) Image enhancement method, device and storage medium
CN113065645A (en) Twin attention network, image processing method and device
CN114708437B (en) Training method of target detection model, target detection method, device and medium
Cai et al. Guided attention network for object detection and counting on drones
CN113378775B (en) Video shadow detection and elimination method based on deep learning
CN116681636B (en) Light infrared and visible light image fusion method based on convolutional neural network
CN114881871A (en) Attention-fused single image rain removing method
CN114708615B (en) Human body detection method based on image enhancement in low-illumination environment, electronic equipment and storage medium
CN113177438A (en) Image processing method, apparatus and storage medium
Hao et al. Low-light image enhancement based on retinex and saliency theories
CN116757986A (en) Infrared and visible light image fusion method and device
CN116977208A (en) Low-illumination image enhancement method for double-branch fusion
CN116263942A (en) Method for adjusting image contrast, storage medium and computer program product
CN113658197B (en) Image processing method, device, electronic equipment and computer readable storage medium
CN114821086A (en) Video prediction method and system
WO2022120996A1 (en) Visual position recognition method and apparatus, and computer device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant