CN110111351B - Pedestrian contour tracking method fusing RGBD multi-modal information - Google Patents

Pedestrian contour tracking method fusing RGBD multi-modal information

Info

Publication number
CN110111351B
CN110111351B (application CN201910389276.6A)
Authority
CN
China
Prior art keywords
pedestrian
rgb
image
confidence map
target segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910389276.6A
Other languages
Chinese (zh)
Other versions
CN110111351A (en)
Inventor
周雪
周政
邹见效
徐红兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910389276.6A priority Critical patent/CN110111351B/en
Publication of CN110111351A publication Critical patent/CN110111351A/en
Application granted granted Critical
Publication of CN110111351B publication Critical patent/CN110111351B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/12Edge-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20116Active contour; Active surface; Snakes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian contour tracking method fusing RGBD multi-modal information. A pedestrian target confidence map acquisition model is constructed, comprising an RGB target segmentation convolution network module, a Depth target segmentation convolution network module, a feature map superposition module, a convolution module and a confidence map fusion module, and is trained with training samples. In actual application, the RGB image and the corresponding Depth image are input into the pedestrian target confidence map acquisition model, and the resulting fused confidence map is used as the external energy guiding active contour evolution to obtain the target contour. By introducing the Depth image as an additional input, the invention can effectively improve pedestrian contour tracking in tracking scenes with insufficient ambient light or where the target appearance is similar to the background.

Description

Pedestrian contour tracking method fusing RGBD multi-modal information
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a pedestrian contour tracking method fusing RGBD multi-modal information.
Background
As one of the most significant visual tasks in computer vision research, target tracking involves advanced techniques such as pattern recognition and image processing. The target tracking task is, given the position of a target of interest in a designated frame of a video sequence, to find that target and give its position in each subsequent frame. In recent years, with the continuous development of computer science and technology and the growing practical demands in life and work, research on target tracking algorithms has become a major hot spot in the field of computer vision.
Most mature tracking algorithms today are based on RGB color images. First, RGB imaging technology is the most mature: large numbers of good-quality images can easily be obtained with a mobile phone or a camera, and visual datasets based on RGB color images are correspondingly rich. Second, RGB images carry rich information; various features such as color histograms, HOG and LBP can be extracted from them, which is one of the main reasons they are the primary data source for visual tasks. Despite these advantages, there are problems in real scenes that cannot be solved by relying on RGB images alone, for example when the target is similar in appearance to the background or the ambient light is dark. To compensate for the limited tracking performance caused by relying on RGB information alone, many tracking algorithms have introduced, for example, optical flow maps or Depth images to supplement the RGB image and improve tracking.
Unlike an RGB color image, a Depth image visually records distance data: each pixel value records the distance from the corresponding position of the imaged object to the camera. A Depth image is generally a 16-bit single-channel image whose display resembles an ordinary single-channel grayscale image: the closer the imaged object is to the camera, the smaller the pixel value and the darker the gray level; the farther from the camera, the larger the pixel value and the brighter the gray level. Because the Depth image is unaffected by illumination changes and depends only on distance, fusing the Depth image can help solve the poor tracking caused by the failure of RGB appearance information in certain tracking scenes, such as weak ambient light or a background similar in appearance to the target.
Target tracking forms divide mainly into bounding-box-based tracking and contour-based tracking. Bounding-box-based tracking represents the tracked target with a rectangular or elliptical box; it is simple, but its fixed form is inflexible for expressing non-rigid targets. Contour tracking follows the edge of the target body; it requires more complex mathematical computation, but it is very flexible in expressing non-rigid targets and can provide detailed information about the target shape and its changes, making it a very meaningful tracking approach. How to integrate the Depth image into contour tracking, however, is still a technical problem to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a pedestrian contour tracking method fusing RGBD multi-modal information, which combines an RGB image and a Depth image to improve the pedestrian contour tracking effect in a tracking scene with insufficient background light or similar target appearance to background appearance.
In order to achieve the above purpose, the pedestrian contour tracking method fusing RGBD multi-modal information of the present invention includes the following steps:
s1: the pedestrian target confidence map acquisition model is constructed and comprises an RGB target segmentation convolution network module, a Depth target segmentation convolution network module, a feature map superposition module, a convolution module and a confidence map fusion module, wherein:
the RGB target segmentation convolution network module takes the RGB image as input and outputs a confidence map T_RGB for segmenting the pedestrian target; its last layer feature map F_RGB is output to the feature map superposition module and the confidence map T_RGB is output to the confidence map fusion module;
the Depth target segmentation convolution network module takes the Depth image corresponding to the RGB image as input and outputs a confidence map T_Depth for segmenting the pedestrian target; its last layer feature map F_Depth is output to the feature map superposition module and the confidence map T_Depth is output to the confidence map fusion module;
the feature map superposition module superposes the feature map F_RGB and the feature map F_Depth to obtain a feature map F_SW and outputs it to the convolution module;
the convolution module performs a convolution operation on the feature map F_SW to obtain a probability conversion map SW and outputs it to the confidence map fusion module, where the convolution operation is expressed as:
SW = h(W_SW * F_SW + b_SW)
where W_SW and b_SW are the weight parameters of the 1 × 1 convolution kernel and the bias term respectively, and h(·) denotes an activation function;
the confidence map fusion module fuses the received confidence map T_RGB and confidence map T_Depth according to the probability conversion map SW to obtain a fused confidence map T_fused, where the fusion operation is expressed as:
T_fused = SW ⊙ T_RGB + (1 - SW) ⊙ T_Depth
where ⊙ denotes element-wise multiplication;
s2: acquiring a plurality of training samples, wherein each training sample comprises an RGB (red, green and blue) image containing a pedestrian target, a corresponding Depth image and a pedestrian target segmentation label, and the pedestrian target segmentation label is used for representing whether each pixel belongs to a pedestrian target or a background;
s3: training a pedestrian target confidence map acquisition model by adopting the following method:
s3.1: the method comprises the steps that an RGB image in a training sample is used as input, an RGB target segmentation convolution network module is trained, a pedestrian target segmentation result is obtained by an output confidence map through a threshold value method in the training process, and error calculation is carried out on the pedestrian target segmentation result and a pedestrian target segmentation label;
s3.2: taking a Depth image in a training sample as input, training a Depth target segmentation convolution network module, acquiring a pedestrian target segmentation result from an output confidence image by a threshold value method in the training process, and performing error calculation with a pedestrian target segmentation label;
s3.3: initializing the RGB target segmentation convolution network module in the pedestrian target confidence map acquisition model according to the training result of step S3.1, initializing the Depth target segmentation convolution network module according to the training result of step S3.2, taking the RGB image in the training sample as the input of the RGB target segmentation convolution network module and the corresponding Depth image as the input of the Depth target segmentation convolution network module, and training the pedestrian target confidence map acquisition model, wherein in the training process a pedestrian target segmentation result is obtained from the output fused confidence map by the threshold method and the error against the pedestrian target segmentation label is calculated;
s4: acquiring a pedestrian video sequence and manually calibrating the pedestrian contour in the first frame image; for each subsequent image, inputting the RGB image and the corresponding Depth image into the pedestrian target confidence map acquisition model to obtain a fused confidence map, and using the fused confidence map as the external energy guiding contour evolution to obtain the pedestrian contour tracking result of the current image.
The pedestrian contour tracking method fusing RGBD multi-modal information of the invention thus constructs a pedestrian target confidence map acquisition model comprising an RGB target segmentation convolution network module, a Depth target segmentation convolution network module, a feature map superposition module, a convolution module and a confidence map fusion module, trains the model with training samples, and in actual application inputs the RGB image and the corresponding Depth image into the model; the resulting fused confidence map serves as the external energy guiding active contour evolution to obtain the target contour. By introducing the Depth image as an additional input, the invention can effectively improve pedestrian contour tracking in tracking scenes with insufficient ambient light or where the target appearance is similar to the background.
Drawings
FIG. 1 is a flow chart of an embodiment of a pedestrian contour tracking method fusing RGBD multi-modal information according to the present invention;
FIG. 2 is a block diagram of a pedestrian object confidence map acquisition model in accordance with the present invention;
fig. 3 is a diagram of the OSVOS network structure in the present embodiment;
FIG. 4 is a structural diagram of a pedestrian target confidence map acquisition model based on an OSVOS network in the embodiment;
fig. 5 is an exemplary diagram of conversion of a Depth image into an RGB image in the present embodiment;
FIG. 6 is a flowchart of the online update of the pedestrian target confidence map acquisition model in this embodiment;
FIG. 7 is a comparison chart of the pedestrian contour tracking results of the pedestrian video sequence 1 according to the present invention and three comparison methods in this embodiment;
FIG. 8 is a comparison graph of the regional similarity curves of the pedestrian target and the real pedestrian target in the pedestrian video sequence 1 according to the present invention and three comparison methods in this embodiment;
FIG. 9 is a comparison graph of the accuracy curves of the pedestrian contour tracking result of the pedestrian video sequence 1 and the real pedestrian contour according to the present invention and three comparison methods in this embodiment;
FIG. 10 is a comparison chart of the pedestrian contour tracking results of the pedestrian video sequence 2 according to the present invention and three comparison methods in this embodiment;
FIG. 11 is a comparison graph of the regional similarity curves of the pedestrian target and the real pedestrian target of the pedestrian video sequence 2 according to the present invention and three comparison methods in this embodiment;
fig. 12 is a comparison graph of the accuracy curves of the pedestrian contour tracking result of the pedestrian video sequence 2 and the real pedestrian contour by the invention and three comparison methods in the embodiment.
Detailed Description
The following describes embodiments of the present invention with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It is expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the invention.
Examples
FIG. 1 is a flow chart of an embodiment of the pedestrian contour tracking method fusing RGBD multi-modal information. As shown in fig. 1, the pedestrian contour tracking method fusing RGBD multi-modal information of the present invention specifically includes the steps of:
s101: constructing a pedestrian target confidence map acquisition model:
fig. 2 is a structural diagram of a pedestrian target confidence map acquisition model in the present invention. As shown in fig. 2, the pedestrian target confidence map acquisition model in the invention includes an RGB target segmentation convolution network module, a Depth target segmentation convolution network module, a feature map superposition module, a convolution module and a confidence map fusion module, which are respectively described in detail below.
The RGB target segmentation convolution network module takes the RGB image as input and outputs a confidence map T_RGB for segmenting the pedestrian target; its last layer feature map F_RGB is output to the feature map superposition module and the confidence map T_RGB is output to the confidence map fusion module.
The Depth target segmentation convolution network module takes the Depth image corresponding to the RGB image as input and outputs a confidence map T_Depth for segmenting the pedestrian target; its last layer feature map F_Depth is output to the feature map superposition module and the confidence map T_Depth is output to the confidence map fusion module.
The feature map superposition module superposes the feature map F_RGB and the feature map F_Depth to obtain a feature map F_SW and outputs it to the convolution module.
The convolution module performs a convolution operation on the feature map F_SW to obtain a probability conversion map SW and outputs it to the confidence map fusion module, where the convolution operation is expressed as:
SW = h(W_SW * F_SW + b_SW)
where W_SW and b_SW are the weight parameters of the 1 × 1 convolution kernel and the bias term respectively, and h(·) denotes an activation function; in this embodiment a Sigmoid function is used.
In this embodiment, the probability conversion map SW is a single-channel image with pixel values in [0,1]; it decides the weights given to the confidence map of the RGB image and the confidence map of the Depth image during fusion.
The confidence map fusion module fuses the received confidence map T_RGB and confidence map T_Depth according to the probability conversion map SW to obtain a fused confidence map T_fused, where the fusion operation is expressed as:
T_fused = SW ⊙ T_RGB + (1 - SW) ⊙ T_Depth
where ⊙ denotes element-wise multiplication.
In this embodiment the target segmentation result is obtained from the confidence map T_fused by a threshold method: a pixel whose value in T_fused is greater than a preset threshold is regarded as belonging to the pedestrian target, otherwise it is regarded as background.
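For illustration, a minimal PyTorch-style sketch of the fusion head and the threshold step described above is given below; the module and function names are illustrative assumptions, and the two segmentation branches are assumed to already provide their confidence maps and last layer feature maps.

```python
import torch
import torch.nn as nn

class ConfidenceFusionHead(nn.Module):
    """Sketch of the SW convolution module plus the confidence map fusion module."""
    def __init__(self, feat_channels: int):
        super().__init__()
        # 1x1 convolution producing the single-channel probability conversion map SW
        self.conv_sw = nn.Conv2d(2 * feat_channels, 1, kernel_size=1)
        self.act = nn.Sigmoid()  # h(.) in the text; Sigmoid keeps SW in [0, 1]

    def forward(self, t_rgb, f_rgb, t_depth, f_depth):
        f_sw = torch.cat([f_rgb, f_depth], dim=1)    # feature map superposition -> F_SW
        sw = self.act(self.conv_sw(f_sw))            # SW = h(W_SW * F_SW + b_SW)
        t_fused = sw * t_rgb + (1.0 - sw) * t_depth  # T_fused = SW.T_RGB + (1-SW).T_Depth
        return t_fused, sw

def segment_from_confidence(t_fused, threshold: float = 0.5):
    """Threshold method: pixels above the (assumed) threshold are pedestrian."""
    return (t_fused > threshold).float()
```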
In the invention, the performance of the RGB target segmentation convolutional network module and the Depth target segmentation convolutional network module has a great influence on the quality of the pedestrian contour tracking result, so in practical application the target segmentation convolutional network can be chosen as needed. After comparison of several alternatives, an OSVOS (One-Shot Video Object Segmentation) network is preferably used as the target segmentation convolutional network in this embodiment. OSVOS is a deep-learning-based target segmentation network built on the FCN (Fully Convolutional Network) framework; the learned information can be transferred to general semantic information to realize foreground segmentation. Fig. 3 is a diagram of the OSVOS network structure in this embodiment. As shown in fig. 3, the OSVOS network includes 5 convolution groups (conv1 to conv5) for extracting features and a fusion convolution layer (conv final) for fusing all the superimposed feature maps. The feature maps output by convolution groups conv2 to conv5 are up-sampled, superimposed and fed into the fusion convolution layer; the 51-channel feature map obtained by this superposition is the last layer feature map, so the feature map superposition module forms a 102-channel feature map by stacking the two 51-channel feature maps, and the output of the fusion convolution layer (conv final) is the confidence map. The specific structure and operation of the OSVOS network can be found in "One-Shot Video Object Segmentation, S. Caelles, K.K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, Computer Vision and Pattern Recognition (CVPR), 2017".
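The side-output handling described above (up-sampling the feature maps of conv2 to conv5 to a common resolution and stacking them channel-wise) can be sketched as follows; the function name and the use of bilinear interpolation are assumptions for illustration, not a reproduction of the original OSVOS code.

```python
import torch
import torch.nn.functional as F

def stack_side_outputs(side_feats, out_size):
    """side_feats: list of feature maps from conv2..conv5 (decreasing resolution)."""
    upsampled = [F.interpolate(f, size=out_size, mode='bilinear', align_corners=False)
                 for f in side_feats]
    # Channel-wise superposition; in the text this per-branch stack is the
    # 51-channel "last layer feature map", and stacking the RGB and Depth
    # branches' stacks later yields the 102-channel F_SW.
    return torch.cat(upsampled, dim=1)
```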
Fig. 4 is a structural diagram of the pedestrian target confidence map acquisition model based on the OSVOS network in this embodiment. As shown in fig. 4, the RGB image and the Depth image are each input into an OSVOS network; the two last layer feature maps are superimposed and passed to the convolution module Conv-SW to obtain the probability conversion map SW, the two confidence maps are input to the confidence map fusion module fuse, and the fusion module performs the fusion operation to obtain the fused confidence map.
Because the existing OSVOS network is designed for three-channel RGB images while the input of the Depth target segmentation convolution network module is a single-channel Depth image, in this embodiment the single-channel Depth image needs to be converted into a three-channel image. The specific method is as follows:
the Depth image is first hole filled. This is because the imaging principle of a Depth image capturing device such as Kinect, etc. causes that the acquired Depth image has a pixel value of an invalid distance value 0, and is a black hollow region in visual perception, and a large area of the hollow region has a serious influence on the use of the Depth image, so that the filling of the hollow needs to be performed. In this embodiment, a hole filling method provided in an NYU Depth segmentation data set toolbox is used for hole filling processing of a Depth image, and the principle is based on a coloring method (coloring using optimization) proposed by Anat Levin et al, and the specific process is not described herein again.
The hole-filled Depth image is then encoded to generate a three-channel RGB image. In this embodiment the Depth image is encoded with the Jet colormap and converted into an RGB image. A colormap is a real m × 3 matrix in which each row defines the RGB vector of one color; the Jet colormap runs from blue to red, passing through cyan, yellow and orange. Different depth values in the Depth image are therefore mapped to gradually changing colors in the Jet colormap that represent gradually changing distances: blue represents the nearest range, the colors change progressively as the depth increases, and red represents the farthest range. This encoding balances the effectiveness of the depth information and computational efficiency, and is a relatively well-balanced choice overall.
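A minimal preprocessing sketch for the Depth branch input is given below, assuming OpenCV is available; the patent uses the colorization-based hole filling of the NYU Depth toolbox, and the cv2.inpaint call here is only a simple stand-in for that step.

```python
import cv2
import numpy as np

def depth_to_jet_rgb(depth_u16: np.ndarray) -> np.ndarray:
    # Holes are pixels with the invalid distance value 0.
    hole_mask = (depth_u16 == 0).astype(np.uint8)

    # Normalize the 16-bit depth to 8 bit before inpainting / colormapping.
    valid = depth_u16[depth_u16 > 0]
    d_min, d_max = (valid.min(), valid.max()) if valid.size else (0, 1)
    depth_u8 = np.clip((depth_u16.astype(np.float32) - d_min) /
                       max(d_max - d_min, 1) * 255.0, 0, 255).astype(np.uint8)

    # Stand-in hole filling (assumption: Telea inpainting instead of the NYU toolbox).
    filled = cv2.inpaint(depth_u8, hole_mask, 3, cv2.INPAINT_TELEA)

    # Jet colormap: near (small values) -> blue, far (large values) -> red.
    bgr = cv2.applyColorMap(filled, cv2.COLORMAP_JET)
    return cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
```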
Fig. 5 is an example of the conversion of a Depth image into an RGB image in this embodiment. As shown in fig. 5, after hole filling and encoding the Depth image is converted into an RGB image; unlike the original RGB image in the training sample, it merely uses RGB encoding to represent depth information.
S102: obtaining a training sample:
the method comprises the steps of obtaining a plurality of training samples, wherein each training sample comprises an RGB (red, green and blue) image containing a pedestrian target, a corresponding Depth image and a pedestrian target segmentation label, and the pedestrian target segmentation label is used for representing whether each pixel belongs to a pedestrian target or a background and is a binary image. The training samples can be selected from the existing general sample library or can be obtained by self.
S103: training a pedestrian target confidence map acquisition model:
next, training of the pedestrian target confidence map acquisition model is required. Because the pedestrian target confidence map acquisition model adopts a dual-channel structure model, in order to acquire better performance, end-to-end training is not suitable to be directly carried out, but needs to be carried out step by step, namely, a training mode of 'part first and whole second' is adopted, and the specific method comprises the following steps:
s3.1: training an RGB target segmentation convolution network module:
the method comprises the steps of training an RGB target segmentation convolution network module by using RGB images in a training sample as input, obtaining a pedestrian target segmentation result by using a threshold value method for an output confidence map in the training process, and carrying out error calculation on the pedestrian target segmentation result and a pedestrian target segmentation label.
S3.2: training a Depth target segmentation convolution network module:
and (3) training a Depth image in a training sample as an input, and similarly, acquiring a pedestrian target segmentation result from an output confidence map by using a threshold method in the training process, and performing error calculation with a pedestrian target segmentation label.
Because both target segmentation convolution networks in this embodiment are OSVOS networks, the parameters of the base network can be obtained directly; the base-network training stage can therefore be omitted and the two single-branch network models using different input information are trained directly and separately, with the convolution layer parameters migrated from the trained base network. Each OSVOS branch goes through three training stages with the following parameters: the learning strategy is "step", i.e. the learning rate is reduced by a given gamma value after a certain number of iterations; the base learning rate is set to 10^-8; the gamma decay rate is 0.1; stepsize (the decay step length) is set to 10000; the weight decay rate is set to 0.0002; iter_size is 10, i.e. 10 images are fed in together for each training iteration.
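For reference, a hedged PyTorch equivalent of the solver settings listed above might look as follows; the original implementation is Caffe-style, so the optimizer choice and the gradient-accumulation emulation of iter_size are assumptions.

```python
import torch

def make_branch_optimizer(model: torch.nn.Module):
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=1e-8,            # base learning rate 10^-8
                                weight_decay=2e-4)  # weight decay 0.0002
    # "step" policy: multiply the learning rate by gamma = 0.1 every 10000 iterations
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.1)
    return optimizer, scheduler

# iter_size = 10 can be emulated by accumulating gradients over 10 mini-batches
# before calling optimizer.step().
```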
S3.3: integral training:
initializing an RGB target segmentation convolution network module in the pedestrian target belief map acquisition model according to the training result of the step S3.1, initializing a Depth target segmentation convolution network module according to the training result of the step S3.2, taking an RGB image in a training sample as the input of the RGB target segmentation convolution network module, taking a corresponding Depth image as the input of the Depth target segmentation convolution network module, training the pedestrian target belief map acquisition model, similarly, acquiring a pedestrian target segmentation result by adopting a threshold method for an output fusion belief map in the training process, and carrying out error calculation with a pedestrian target segmentation label.
In this embodiment, the parameter values of the convolution layer of the pedestrian target confidence map acquisition model are initialized to 0, and the number of training iterations is 20000.
In order to weigh the confidence map of the RGB image against the confidence map of the Depth image automatically, this embodiment constructs a pseudo label Y_SW for the probability conversion map SW to guide the learning of SW. Y_SW is defined by the following formula:
Y_SW = T_RGB ⊙ Y + (1 - T_RGB) ⊙ (1 - Y)
where Y denotes the pedestrian target segmentation label in the training sample.
A value of 1 in Y_SW indicates that the pedestrian target segmentation result obtained from the confidence map of the RGB image is consistent with the pedestrian target segmentation label, i.e. the RGB prediction is reliable; Y_SW then guides the probability conversion map SW to increase the weight of the RGB image confidence map, and the weight of the Depth image confidence map decreases correspondingly. A value of 0 indicates that the segmentation result obtained from the RGB image confidence map is inconsistent with the label, i.e. the RGB prediction is unreliable; Y_SW then guides SW to decrease the weight of the RGB image confidence map, and the weight of the Depth image confidence map increases correspondingly.
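A short sketch of how the pseudo label Y_SW can be computed is shown below; binarizing T_RGB with the same threshold before the comparison is an assumption consistent with the description above.

```python
import torch

def make_sw_pseudo_label(t_rgb: torch.Tensor, y: torch.Tensor, thr: float = 0.5):
    """Y_SW = 1 where the RGB prediction agrees with the label Y, 0 where it does not."""
    t_rgb_bin = (t_rgb > thr).float()
    return t_rgb_bin * y + (1.0 - t_rgb_bin) * (1.0 - y)
```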
S104: and (3) pedestrian contour tracking:
acquiring a pedestrian video sequence, manually calibrating the pedestrian outline in the first frame image, inputting an RGB image and a corresponding Depth image into a pedestrian target confidence map acquisition model for a subsequent image to obtain a fusion confidence map, and inputting the confidence map as an external energy input guide outline of outline evolution to evolve to obtain a pedestrian outline tracking result of the current image.
In this embodiment, the DRLSE (distance regularized level set evolution) model is used for contour evolution, and its evolution equation can be expressed as:
∂φ/∂t = μ div(d_p(|∇φ|)∇φ) + λ δ_ε(φ) div(g ∇φ/|∇φ|) + α g δ_ε(φ)
where φ denotes the level set function, t denotes time, ∇φ denotes the gradient of the level set function, and d_p(s) = p′(s)/s, in which p(·) is the double-well potential function and p′(·) its first derivative; δ_ε(φ) denotes the Dirac function, div(·) denotes the divergence, g denotes the edge indication function, μ and λ are the weights of the distance regularization term and the edge term, and α is the weighted area term coefficient.
The first term of the evolution equation, μ div(d_p(|∇φ|)∇φ), is the distance regularization term. The second term, λ δ_ε(φ) div(g ∇φ/|∇φ|), corresponds to the GAC (geodesic active contour) model: the curvature and the edge gradient function jointly guide the curve evolution, the curvature term smooths the curve, and the edge gradient function makes the curve converge to the local gradient maxima of the image. The third term, α g δ_ε(φ), accelerates the level set evolution and determines the evolution direction of the curve. In the invention the fused confidence map is used to guide the continuous evolution of the pedestrian target contour, so this term is changed to use the fused confidence map ConfMap, and the modified level set evolution equation is:
∂φ/∂t = μ div(d_p(|∇φ|)∇φ) + λ δ_ε(φ) div(g ∇φ/|∇φ|) + α ConfMap δ_ε(φ)
because the pedestrian target confidence map acquisition model outputs a single fusion confidence map, the probability that each pixel belongs to the foreground target is described, and the numerical range p belongs to [0,1 ]. In the LevelSet evolution equation, if the fused confidence map ConfMap is to well guide the contour to evolve, positive and negative values (that is, the signs of the internal and external numerical values of the contour are different) are required, and a numerical range ConfMap is obtained for the range of [ -1,1], so that the fused confidence map output by the pedestrian target confidence map acquisition model is required to be mapped, and the numerical value of the fused confidence map is mapped into the range of [ -1,1 ]. In this embodiment, the formula ConfMap ═ 2p-1 is used for mapping, where p represents the fusion confidence map of the original output. When ConfMap > 0, i.e., inside the contour, the curve has a tendency to flare outward; when the ConfMap is less than 0, i.e. outside the contour, the curve tends to shrink inward, and when the ConfMap is 0, the curve is generally at the contour edge, and then the curve evolves so that it becomes smooth on the edge, guided only by the curvature and edge gradient functions. In addition to determining the direction of profile evolution, ConfMap also determines the profile evolution speed, and the larger the absolute value of the ConfMap is, the faster the profile evolution speed is.
In order to enable the pedestrian target confidence map acquisition model to be more adaptive to the change of the actual pedestrian target video sequence, the embodiment further provides an online tracking update strategy, and the pedestrian target confidence map acquisition model is updated online after the pedestrian contour tracking is completed. Fig. 6 is a flowchart of online updating of the pedestrian target confidence map acquisition model in the present embodiment. As shown in fig. 6, the specific steps of online updating the pedestrian target confidence map acquisition model in this embodiment include:
s601: and acquiring a pedestrian target segmentation result from the previous frame of fused confidence map by using a threshold method, judging whether the number of pixels contained in the pedestrian target of the previous frame is less than a preset threshold, if so, not updating the pedestrian target confidence map acquisition model, and otherwise, entering the step S602. This is because the pedestrian target may be completely blocked or go out of the field of view, so the number of foreground pixels of the pedestrian target in the previous frame needs to be calculated before updating, and if the number of foreground pixels is less than a certain value, the target is determined to disappear, and at this time, the network will not be updated online, so as to prevent the network from only learning the background information and covering the target information.
S602: optimizing the pedestrian target segmentation result of the current frame:
in many methods for tracking a target based on a discriminant and labeled with a bounding box, in order to update a discriminator, positive and negative samples are collected near a target position acquired in a previous frame, and the collected positive and negative samples are used as data for discriminant update training. In the video target segmentation, because the whole image is input, only the target cannot be subjected to sample acquisition alone, but the pedestrian target segmentation result of the current frame can also be optimized based on such a priori, so as to obtain updated data, and the specific method is as follows:
and obtaining a trip person target segmentation result from the fusion confidence map of the previous frame by adopting a threshold value method, calculating the minimum distance between each background pixel and a pedestrian target area in the previous frame, and setting the distance between the pixels of the pedestrian target area to be 0 so as to obtain a distance matrix. And then obtaining a trip human target segmentation result from the fusion confidence map of the current frame by adopting a threshold value method, inquiring each pixel in the pedestrian target from the distance matrix of the previous frame to obtain a distance value, if the distance value is less than or equal to a preset threshold value, determining that the pixel is a real pedestrian target pixel without any operation, if the distance value is greater than the preset threshold value, indicating that the pixel has high probability of belonging to the background (based on the prior that the position change of the target foreground between two frames is not too large), and setting the pixel as the background pixel. Therefore, the pixels which are wrongly segmented out of the range in the current frame pedestrian target segmentation process can be corrected, the segmentation result of the current frame is accurately guaranteed to be used as a next updated data label, and error propagation and accumulation of network updating are prevented.
S603: obtaining an updating sample to update the model:
and obtaining a pedestrian target segmentation label according to the manually calibrated pedestrian outline of the first frame image, and taking the RGB image, the corresponding Depth image and the pedestrian target segmentation label as an update sample. And obtaining a pedestrian target segmentation label according to the pedestrian target segmentation result of the current frame optimized in the step S602, and taking the RGB image, the corresponding Depth image and the pedestrian target segmentation label as an update sample. And selecting one of the two updating samples as an updating sample used for the on-line updating of the pedestrian target confidence map acquisition model at this time, and updating the pedestrian target confidence map acquisition model.
Considering that the current-frame pedestrian target may also be wrongly segmented inside the plausible activity area, where it cannot be corrected by the distance test, the current-frame segmentation result alone cannot be used as the update data; otherwise errors would propagate and accumulate. Since the annotation given in the first frame is always the most reliable information in the whole pedestrian contour tracking process (the labels of all other frames are obtained by segmentation), the first frame must always be kept as an alternative update sample: each update draws on data from two frames, the first frame and the current frame, and one of the two is selected for the update. Using the reliable information of the first frame effectively suppresses the error propagation and accumulation that segmentation errors inside the target activity area would otherwise cause in the network update.
Experiments show that alternating between the two update samples during online updating gives better results: the current-frame update sample is used with higher probability (i.e. more often), while the first frame plays the auxiliary role of preventing error propagation and accumulation and is used with lower probability (i.e. relatively less often), so that the pedestrian target confidence map acquisition model adapts well to the actual scene while keeping good performance. In practical application a usage period can be set for the first-frame update sample, larger than the video frame interval: when the period of the first-frame update sample has not been reached, the current-frame update sample is used for the online update, and when it is reached, the first-frame update sample is used.
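The selection policy can be sketched as follows; the period value and function name are assumptions.

```python
def pick_update_sample(frame_idx: int, first_frame_sample, current_frame_sample,
                       first_frame_period: int = 10):
    """Use the reliable first-frame sample periodically, the current frame otherwise."""
    if frame_idx % first_frame_period == 0:
        return first_frame_sample    # periodic anchor on the manually labelled frame
    return current_frame_sample      # otherwise adapt to the latest appearance
```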
In order to better illustrate the technical effect of the invention, a self-made dataset is used for pedestrian contour tracking test verification. Three comparison methods are used: a contour tracking method that only uses the RGB image as input and takes the MaskTrack method as the confidence map extraction module (denoted MT+LS), a contour tracking method that only uses the RGB image as input and takes the original OSVOS network as the confidence map extraction module (denoted OVS+LS), and a contour tracking method that only uses the RGB image as input and takes an OSVOS network with the online update strategy as the confidence map extraction module (denoted OVS+LS+UPD). Their tracking results are compared with those of the pedestrian contour tracking method of the invention, which uses the online update strategy and Depth information.
Fig. 7 is a comparison graph of the pedestrian contour tracking result of the pedestrian video sequence 1 according to the present invention and three comparison methods in this embodiment. Fig. 8 is a comparison graph of the regional similarity curves of the pedestrian target and the real pedestrian target of the pedestrian video sequence 1 according to the present invention and three comparison methods in this embodiment. Fig. 9 is a comparison graph of the accuracy curves of the pedestrian contour tracking result of the pedestrian video sequence 1 and the real pedestrian contour by the invention and three comparison methods in the embodiment.
Fig. 10 is a comparison graph of the pedestrian contour tracking result of the pedestrian video sequence 2 according to the present invention and three comparison methods in this embodiment. Fig. 11 is a comparison graph of the regional similarity curves of the pedestrian target and the real pedestrian target of the pedestrian video sequence 2 according to the present invention and the three comparison methods in this embodiment. Fig. 12 is a comparison graph of the accuracy curves of the pedestrian contour tracking result of the pedestrian video sequence 2 and the real pedestrian contour by the invention and three comparison methods in the embodiment.
As can be seen from fig. 7 to 12, the pedestrian contour tracking method of the invention, with the added Depth image information and online update strategy, achieves a good and stable tracking effect in tracking scenes with dim light or where the target appearance is similar to the background.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. Various changes apparent to those skilled in the art remain within the protection of the invention as long as they fall within the spirit and scope of the invention defined by the appended claims; all matter utilizing the inventive concept is protected.

Claims (6)

1. A pedestrian contour tracking method fusing RGBD multi-modal information is characterized by comprising the following steps:
s1: the method comprises the following steps of constructing a pedestrian target confidence map acquisition model, wherein the model comprises an RGB target segmentation convolution network module, a Depth target segmentation convolution network module, a feature map superposition module, a convolution module and a confidence map fusion module, and the method comprises the following steps:
the RGB target segmentation convolution network module takes the RGB image as input and outputs a confidence map T_RGB for segmenting the pedestrian target; its last layer feature map F_RGB is output to the feature map superposition module and the confidence map T_RGB is output to the confidence map fusion module;
the Depth target segmentation convolution network module takes the Depth image corresponding to the RGB image as input and outputs a confidence map T_Depth for segmenting the pedestrian target; its last layer feature map F_Depth is output to the feature map superposition module and the confidence map T_Depth is output to the confidence map fusion module;
the feature map superposition module superposes the feature map F_RGB and the feature map F_Depth to obtain a feature map F_SW and outputs it to the convolution module;
the convolution module performs a convolution operation on the feature map F_SW to obtain a probability conversion map SW and outputs it to the confidence map fusion module, where the convolution operation is expressed as:
SW = h(W_SW * F_SW + b_SW)
where W_SW and b_SW are the weight parameters of the 1 × 1 convolution kernel and the bias term respectively, and h(·) denotes an activation function;
the confidence map fusion module fuses the received confidence map T_RGB and confidence map T_Depth according to the probability conversion map SW to obtain a fused confidence map T_fused, where the fusion operation is expressed as:
T_fused = SW ⊙ T_RGB + (1 - SW) ⊙ T_Depth
where ⊙ denotes element-wise multiplication;
s2: acquiring a plurality of training samples, wherein each training sample comprises an RGB (red, green and blue) image containing a pedestrian target, a corresponding Depth image and a pedestrian target segmentation label, and the pedestrian target segmentation label is used for representing whether each pixel belongs to a pedestrian target or a background;
s3: training a pedestrian target confidence map acquisition model by adopting the following method:
s3.1: the method comprises the steps that an RGB image in a training sample is used as input, an RGB target segmentation convolution network module is trained, a pedestrian target segmentation result is obtained by an output confidence map through a threshold value method in the training process, and error calculation is carried out on the pedestrian target segmentation result and a pedestrian target segmentation label;
s3.2: taking a Depth image in a training sample as input, training a Depth target segmentation convolution network module, acquiring a pedestrian target segmentation result from an output confidence image by a threshold value method in the training process, and performing error calculation with a pedestrian target segmentation label;
s3.3: initializing the RGB target segmentation convolution network module in the pedestrian target confidence map acquisition model according to the training result of step S3.1, initializing the Depth target segmentation convolution network module according to the training result of step S3.2, taking the RGB image in the training sample as the input of the RGB target segmentation convolution network module and the corresponding Depth image as the input of the Depth target segmentation convolution network module, and training the pedestrian target confidence map acquisition model, wherein in the training process a pedestrian target segmentation result is obtained from the output fused confidence map by the threshold method and the error against the pedestrian target segmentation label is calculated;
s4: acquiring a pedestrian video sequence and manually calibrating the pedestrian contour in the first frame image; for each subsequent image, inputting the RGB image and the corresponding Depth image into the pedestrian target confidence map acquisition model to obtain a fused confidence map, and using the fused confidence map as the external energy guiding contour evolution to obtain the pedestrian contour tracking result of the current image.
2. The pedestrian contour tracking method according to claim 1, wherein the RGB target segmentation convolutional network module and the Depth target segmentation convolutional network module employ an OSVOS network, and when an input of the OSVOS network is a Depth image, a single-channel Depth image needs to be converted into a three-channel image, and the specific method is as follows: the method comprises the steps of firstly filling holes in a Depth image, then coding the Depth image after the holes are filled, and generating a three-channel RGB image.
3. The pedestrian contour tracking method according to claim 1, wherein in step S3.2 a pseudo label Y_SW of the probability conversion map SW is constructed to guide the learning of the probability conversion map SW, Y_SW being defined by the following formula:
Y_SW = T_RGB ⊙ Y + (1 - T_RGB) ⊙ (1 - Y)
where Y denotes the pedestrian target segmentation label in the training sample;
a value of 1 in Y_SW indicates that the pedestrian target segmentation result obtained from the confidence map of the RGB image is consistent with the pedestrian target segmentation label and the prediction of the RGB image is reliable, so Y_SW guides the probability conversion map SW to increase the weight of the RGB image confidence map while the weight of the Depth image confidence map decreases correspondingly; a value of 0 indicates that the pedestrian target segmentation result obtained from the confidence map of the RGB image is inconsistent with the pedestrian target segmentation label and the prediction of the RGB image is unreliable, so Y_SW guides the probability conversion map SW to decrease the weight of the RGB image confidence map while the weight of the Depth image confidence map increases correspondingly.
4. The pedestrian contour tracking method according to claim 1, wherein in step S4, a DRLSE LevelSet model is used for contour evolution.
5. The pedestrian contour tracking method according to any one of claims 1 to 4, further comprising updating a pedestrian target confidence map acquisition model on line, the specific steps comprising:
(1) acquiring a pedestrian target segmentation result from the previous frame of fused confidence map by adopting a threshold method, judging whether the number of pixels contained in the pedestrian target of the previous frame is less than a preset threshold, if so, not updating the pedestrian target confidence map acquisition model, otherwise, entering the step (2);
(2) obtaining the pedestrian target segmentation result from the fused confidence map of the previous frame by a threshold method, calculating the minimum distance from each background pixel to the pedestrian target area in the previous frame, and setting the distance of the pixels of the pedestrian target area to 0, thereby obtaining a distance matrix; then obtaining the pedestrian target segmentation result from the fused confidence map of the current frame by a threshold method, and looking up the distance value of each pixel in the pedestrian target from the distance matrix of the previous frame; if the distance value is less than or equal to a preset threshold, no operation is performed, and if the distance value is greater than the preset threshold, the pixel is set as a background pixel;
(3) obtaining a pedestrian target segmentation label according to the manually calibrated pedestrian contour of the first frame image, and taking the RGB image, the corresponding Depth image and the pedestrian target segmentation label as one update sample; obtaining a pedestrian target segmentation label according to the current-frame pedestrian target segmentation result optimized in step (2), and taking the RGB image, the corresponding Depth image and the pedestrian target segmentation label as another update sample; and selecting one of the two update samples as the update sample used for this online update, and updating the pedestrian target confidence map acquisition model.
6. The pedestrian contour tracking method according to claim 5, wherein in the selecting of the update sample, the use probability of the current frame update sample is higher than that of the first frame update sample.
CN201910389276.6A 2019-05-10 2019-05-10 Pedestrian contour tracking method fusing RGBD multi-modal information Active CN110111351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910389276.6A CN110111351B (en) 2019-05-10 2019-05-10 Pedestrian contour tracking method fusing RGBD multi-modal information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910389276.6A CN110111351B (en) 2019-05-10 2019-05-10 Pedestrian contour tracking method fusing RGBD multi-modal information

Publications (2)

Publication Number Publication Date
CN110111351A CN110111351A (en) 2019-08-09
CN110111351B true CN110111351B (en) 2022-03-25

Family

ID=67489299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910389276.6A Active CN110111351B (en) 2019-05-10 2019-05-10 Pedestrian contour tracking method fusing RGBD multi-modal information

Country Status (1)

Country Link
CN (1) CN110111351B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796100B (en) * 2019-10-31 2022-06-07 浙江大华技术股份有限公司 Gait recognition method and device, terminal and storage device
CN111627017B (en) * 2020-05-29 2024-02-23 苏州博动戎影医疗科技有限公司 Automatic segmentation method for vascular lumen based on deep learning
CN111968087B (en) * 2020-08-13 2023-11-07 中国农业科学院农业信息研究所 Plant disease area detection method
CN112990171B (en) * 2021-05-20 2021-08-06 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN113556600B (en) * 2021-07-13 2023-08-18 广州虎牙科技有限公司 Drive control method and device based on time sequence information, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105739702A (en) * 2016-01-29 2016-07-06 电子科技大学 Multi-posture fingertip tracking method for natural man-machine interaction
CN106952294A (en) * 2017-02-15 2017-07-14 北京工业大学 A kind of video tracing method based on RGB D data
CN108470355A (en) * 2018-04-04 2018-08-31 中山大学 Merge the method for tracking target of convolutional network feature and discriminate correlation filter
CN109146921A (en) * 2018-07-02 2019-01-04 华中科技大学 A kind of pedestrian target tracking based on deep learning
CN109543697A (en) * 2018-11-16 2019-03-29 西北工业大学 A kind of RGBD images steganalysis method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157307B (en) * 2016-06-27 2018-09-11 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105739702A (en) * 2016-01-29 2016-07-06 电子科技大学 Multi-posture fingertip tracking method for natural man-machine interaction
CN106952294A (en) * 2017-02-15 2017-07-14 北京工业大学 A kind of video tracing method based on RGB D data
CN108470355A (en) * 2018-04-04 2018-08-31 中山大学 Merge the method for tracking target of convolutional network feature and discriminate correlation filter
CN109146921A (en) * 2018-07-02 2019-01-04 华中科技大学 A kind of pedestrian target tracking based on deep learning
CN109543697A (en) * 2018-11-16 2019-03-29 西北工业大学 A kind of RGBD images steganalysis method based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"A spatiotemporal warping-based video synchronization method for video stitching";Xue Zhou .etc;《2018 IEEE Visual Communications and Image Processing (VCIP)》;20190425;第1-4页 *
"The Design of an Augmented Reality System for Urban Search and Rescue";Runze Wang .etc;《2018 IEEE International Conference on Intelligence and Safety for Robotics (ISR)》;20181115;第267-272页 *
"融合RGB特征和Depth特征的3D目标识别方法";胡良梅 等;《电子测量与仪器学报》;20151031;第29卷(第10期);第1431-1439页 *
《基于超像素的多特征融合的水平集轮廓跟踪》;周雪 等;《电子科技大学学报》;20180930;第47卷(第5期);第745-752页 *

Also Published As

Publication number Publication date
CN110111351A (en) 2019-08-09

Similar Documents

Publication Publication Date Title
CN110111351B (en) Pedestrian contour tracking method fusing RGBD multi-modal information
CN111507343B (en) Training of semantic segmentation network and image processing method and device thereof
CN109636905B (en) Environment semantic mapping method based on deep convolutional neural network
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN111797716B (en) Single target tracking method based on Siamese network
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
CN109753913B (en) Multi-mode video semantic segmentation method with high calculation efficiency
CN111368846B (en) Road ponding identification method based on boundary semantic segmentation
CN112950477B (en) Dual-path processing-based high-resolution salient target detection method
CN111507210A (en) Traffic signal lamp identification method and system, computing device and intelligent vehicle
CN109543632A (en) A kind of deep layer network pedestrian detection method based on the guidance of shallow-layer Fusion Features
CN108537147A (en) A kind of gesture identification method based on deep learning
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN107506792B (en) Semi-supervised salient object detection method
US20230059499A1 (en) Image processing system, image processing method, and non-transitory computer readable medium
CN109920018A (en) Black-and-white photograph color recovery method, device and storage medium neural network based
Wang et al. A feature-supervised generative adversarial network for environmental monitoring during hazy days
CN113326735A (en) Multi-mode small target detection method based on YOLOv5
CN111079807A (en) Ground object classification method and device
CN115100491B (en) Abnormal robust segmentation method and system for complex automatic driving scene
CN116486431A (en) RGB-T multispectral pedestrian detection method based on target perception fusion strategy
CN110942463A (en) Video target segmentation method based on generation countermeasure network
CN113506230B (en) Photovoltaic power station aerial image dodging processing method based on machine vision
CN114998132A (en) Weak supervision shadow detection method for mining complementary features through double networks
CN115035429A (en) Aerial photography target detection method based on composite backbone network and multiple measuring heads

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant