CN110111351B - Pedestrian contour tracking method fusing RGBD multi-modal information - Google Patents

Pedestrian contour tracking method fusing RGBD multi-modal information

Info

Publication number
CN110111351B
CN110111351B (application CN201910389276.6A)
Authority
CN
China
Prior art keywords
pedestrian
rgb
image
confidence map
target segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910389276.6A
Other languages
Chinese (zh)
Other versions
CN110111351A (en)
Inventor
周雪
周政
邹见效
徐红兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910389276.6A priority Critical patent/CN110111351B/en
Publication of CN110111351A publication Critical patent/CN110111351A/en
Application granted granted Critical
Publication of CN110111351B publication Critical patent/CN110111351B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/12Edge-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20116Active contour; Active surface; Snakes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a pedestrian contour tracking method fusing RGBD multi-modal information. A pedestrian target confidence map acquisition model is constructed, comprising an RGB target segmentation convolution network module, a Depth target segmentation convolution network module, a feature map superposition module, a convolution module and a confidence map fusion module, and is trained with training samples. In actual application, the RGB image and the corresponding Depth image are input into the pedestrian target confidence map acquisition model, and the resulting fused confidence map is used as the external energy guiding active contour evolution to obtain the target contour. By introducing the Depth image as an additional input, the invention can effectively improve pedestrian contour tracking in tracking scenes with insufficient ambient light or where the target appearance is similar to the background.

Description

Pedestrian contour tracking method fusing RGBD multi-modal information
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a pedestrian contour tracking method fusing RGBD multi-modal information.
Background
As one of the most significant visual tasks in computer vision research, target tracking involves advanced techniques such as pattern recognition and image processing. The target tracking task is, given the position of a target of interest in a designated frame of a video sequence, to find that target and give its position in each subsequent frame. In recent years, with the continuous development of computer science and technology and the growing practical demands in life and work, research on target tracking algorithms has become a major hot spot in the field of computer vision.
Most mature tracking algorithms today are based on RGB color images. First, RGB imaging technology is the most mature: large numbers of good-quality images can easily be obtained with a mobile phone or a camera, and visual datasets based on RGB color images are correspondingly rich. Second, RGB images carry rich information; various features such as color histograms, HOG and LBP can be extracted from them, which is one of the main reasons they are the primary data source for visual tasks. Despite these advantages, there are problems in real scenes that cannot be solved by relying on RGB images alone, for example when the target is similar in appearance to the background or the ambient light is dark. To compensate for the limited tracking performance caused by relying on RGB information alone, many tracking algorithms have introduced, for example, optical flow maps or Depth images to supplement the RGB image and improve tracking.
Unlike an RGB color image, a Depth image visually records distance data: each pixel value records the distance from the corresponding position of the imaged object to the camera. A Depth image is generally a 16-bit single-channel image whose display resembles an ordinary single-channel grayscale image: the closer the imaged object is to the camera, the smaller the pixel value and the darker the gray level; the farther from the camera, the larger the pixel value and the brighter the gray level. Because the Depth image is unaffected by illumination changes and depends only on distance, fusing the Depth image can help solve the poor tracking caused by the failure of RGB appearance information in certain tracking scenes, such as weak ambient light or a background similar in appearance to the target.
Target tracking forms divide mainly into bounding-box-based tracking and contour-based tracking. Bounding-box-based tracking represents the tracked target with a rectangular or elliptical box; it is simple, but its fixed form is inflexible for expressing non-rigid targets. Contour tracking follows the edge of the target body; it requires more complex mathematical computation, but it is very flexible in expressing non-rigid targets and can provide detailed information about the target shape and its changes, making it a very meaningful tracking approach. How to integrate the Depth image into contour tracking, however, is still a technical problem to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a pedestrian contour tracking method fusing RGBD multi-modal information, which combines an RGB image and a Depth image to improve the pedestrian contour tracking effect in a tracking scene with insufficient background light or similar target appearance to background appearance.
In order to achieve the above purpose, the pedestrian contour tracking method fusing RGBD multi-modal information of the present invention includes the following steps:
s1: the pedestrian target confidence map acquisition model is constructed and comprises an RGB target segmentation convolution network module, a Depth target segmentation convolution network module, a feature map superposition module, a convolution module and a confidence map fusion module, wherein:
the RGB target segmentation convolution network module takes the RGB image as input and outputs a confidence map T_RGB for segmenting the pedestrian target; its last layer feature map F_RGB is output to the feature map superposition module and the confidence map T_RGB is output to the confidence map fusion module;
the Depth target segmentation convolution network module takes the Depth image corresponding to the RGB image as input and outputs a confidence map T_Depth for segmenting the pedestrian target; its last layer feature map F_Depth is output to the feature map superposition module and the confidence map T_Depth is output to the confidence map fusion module;
the feature map superposition module superposes the feature map F_RGB and the feature map F_Depth to obtain a feature map F_SW and outputs it to the convolution module;
the convolution module performs a convolution operation on the feature map F_SW to obtain a probability conversion map SW and outputs it to the confidence map fusion module, where the convolution operation is expressed as:
SW = h(W_SW * F_SW + b_SW)
where W_SW and b_SW are the weight parameters of the 1 × 1 convolution kernel and the bias term respectively, and h(·) denotes an activation function;
the confidence map fusion module fuses the received confidence map T_RGB and confidence map T_Depth according to the probability conversion map SW to obtain a fused confidence map T_fused, where the fusion operation is expressed as:
T_fused = SW ⊙ T_RGB + (1 - SW) ⊙ T_Depth
where ⊙ denotes element-wise multiplication;
s2: acquiring a plurality of training samples, wherein each training sample comprises an RGB (red, green and blue) image containing a pedestrian target, a corresponding Depth image and a pedestrian target segmentation label, and the pedestrian target segmentation label is used for representing whether each pixel belongs to a pedestrian target or a background;
s3: training a pedestrian target confidence map acquisition model by adopting the following method:
s3.1: the method comprises the steps that an RGB image in a training sample is used as input, an RGB target segmentation convolution network module is trained, a pedestrian target segmentation result is obtained by an output confidence map through a threshold value method in the training process, and error calculation is carried out on the pedestrian target segmentation result and a pedestrian target segmentation label;
s3.2: taking a Depth image in a training sample as input, training a Depth target segmentation convolution network module, acquiring a pedestrian target segmentation result from an output confidence image by a threshold value method in the training process, and performing error calculation with a pedestrian target segmentation label;
s3.3: initializing the RGB target segmentation convolution network module in the pedestrian target confidence map acquisition model according to the training result of step S3.1, initializing the Depth target segmentation convolution network module according to the training result of step S3.2, taking the RGB image in the training sample as the input of the RGB target segmentation convolution network module and the corresponding Depth image as the input of the Depth target segmentation convolution network module, and training the pedestrian target confidence map acquisition model, wherein in the training process a pedestrian target segmentation result is obtained from the output fused confidence map by the threshold method and the error against the pedestrian target segmentation label is calculated;
s4: acquiring a pedestrian video sequence and manually calibrating the pedestrian contour in the first frame image; for each subsequent image, inputting the RGB image and the corresponding Depth image into the pedestrian target confidence map acquisition model to obtain a fused confidence map, and using the fused confidence map as the external energy guiding contour evolution to obtain the pedestrian contour tracking result of the current image.
The pedestrian contour tracking method fusing RGBD multi-modal information of the invention thus constructs a pedestrian target confidence map acquisition model comprising an RGB target segmentation convolution network module, a Depth target segmentation convolution network module, a feature map superposition module, a convolution module and a confidence map fusion module, trains the model with training samples, and in actual application inputs the RGB image and the corresponding Depth image into the model; the resulting fused confidence map serves as the external energy guiding active contour evolution to obtain the target contour. By introducing the Depth image as an additional input, the invention can effectively improve pedestrian contour tracking in tracking scenes with insufficient ambient light or where the target appearance is similar to the background.
Drawings
FIG. 1 is a flow chart of an embodiment of a pedestrian contour tracking method fusing RGBD multi-modal information according to the present invention;
FIG. 2 is a block diagram of a pedestrian object confidence map acquisition model in accordance with the present invention;
fig. 3 is a diagram of the OSVOS network structure in the present embodiment;
FIG. 4 is a structural diagram of a pedestrian target confidence map acquisition model based on an OSVOS network in the embodiment;
fig. 5 is an exemplary diagram of conversion of a Depth image into an RGB image in the present embodiment;
FIG. 6 is a flowchart of the online update of the pedestrian target confidence map acquisition model in this embodiment;
FIG. 7 is a comparison chart of the pedestrian contour tracking results of the pedestrian video sequence 1 according to the present invention and three comparison methods in this embodiment;
FIG. 8 is a comparison graph of the regional similarity curves of the pedestrian target and the real pedestrian target in the pedestrian video sequence 1 according to the present invention and three comparison methods in this embodiment;
FIG. 9 is a comparison graph of the accuracy curves of the pedestrian contour tracking result of the pedestrian video sequence 1 and the real pedestrian contour according to the present invention and three comparison methods in this embodiment;
FIG. 10 is a comparison chart of the pedestrian contour tracking results of the pedestrian video sequence 2 according to the present invention and three comparison methods in this embodiment;
FIG. 11 is a comparison graph of the regional similarity curves of the pedestrian target and the real pedestrian target of the pedestrian video sequence 2 according to the present invention and three comparison methods in this embodiment;
fig. 12 is a comparison graph of the accuracy curves of the pedestrian contour tracking result of the pedestrian video sequence 2 and the real pedestrian contour by the invention and three comparison methods in the embodiment.
Detailed Description
The following describes embodiments of the present invention with reference to the accompanying drawings so that those skilled in the art can better understand the invention. It is expressly noted that in the following description, detailed descriptions of known functions and designs are omitted where they might obscure the subject matter of the invention.
Examples
FIG. 1 is a flow chart of an embodiment of the pedestrian contour tracking method fusing RGBD multi-modal information. As shown in fig. 1, the pedestrian contour tracking method fusing RGBD multi-modal information of the present invention specifically includes the steps of:
s101: constructing a pedestrian target confidence map acquisition model:
fig. 2 is a structural diagram of a pedestrian target confidence map acquisition model in the present invention. As shown in fig. 2, the pedestrian target confidence map acquisition model in the invention includes an RGB target segmentation convolution network module, a Depth target segmentation convolution network module, a feature map superposition module, a convolution module and a confidence map fusion module, which are respectively described in detail below.
The RGB target segmentation convolution network module takes the RGB image as input and outputs a confidence map T_RGB for segmenting the pedestrian target; its last layer feature map F_RGB is output to the feature map superposition module and the confidence map T_RGB is output to the confidence map fusion module.
The Depth target segmentation convolution network module takes the Depth image corresponding to the RGB image as input and outputs a confidence map T_Depth for segmenting the pedestrian target; its last layer feature map F_Depth is output to the feature map superposition module and the confidence map T_Depth is output to the confidence map fusion module.
The feature map superposition module superposes the feature map F_RGB and the feature map F_Depth to obtain a feature map F_SW and outputs it to the convolution module.
The convolution module performs a convolution operation on the feature map F_SW to obtain a probability conversion map SW and outputs it to the confidence map fusion module, where the convolution operation is expressed as:
SW = h(W_SW * F_SW + b_SW)
where W_SW and b_SW are the weight parameters of the 1 × 1 convolution kernel and the bias term respectively, and h(·) denotes an activation function; in this embodiment a Sigmoid function is used.
In this embodiment, the probability conversion map SW is a single-channel image with pixel values in [0,1]; it decides the weights given to the confidence map of the RGB image and the confidence map of the Depth image during fusion.
The confidence map fusion module fuses the received confidence map T_RGB and confidence map T_Depth according to the probability conversion map SW to obtain a fused confidence map T_fused, where the fusion operation is expressed as:
T_fused = SW ⊙ T_RGB + (1 - SW) ⊙ T_Depth
where ⊙ denotes element-wise multiplication.
In this embodiment the target segmentation result is obtained from the confidence map T_fused by a threshold method: a pixel whose value in T_fused is greater than a preset threshold is regarded as belonging to the pedestrian target, otherwise it is regarded as background.
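For illustration, a minimal PyTorch-style sketch of the fusion head and the threshold step described above is given below; the module and function names are illustrative assumptions, and the two segmentation branches are assumed to already provide their confidence maps and last layer feature maps.

```python
import torch
import torch.nn as nn

class ConfidenceFusionHead(nn.Module):
    """Sketch of the SW convolution module plus the confidence map fusion module."""
    def __init__(self, feat_channels: int):
        super().__init__()
        # 1x1 convolution producing the single-channel probability conversion map SW
        self.conv_sw = nn.Conv2d(2 * feat_channels, 1, kernel_size=1)
        self.act = nn.Sigmoid()  # h(.) in the text; Sigmoid keeps SW in [0, 1]

    def forward(self, t_rgb, f_rgb, t_depth, f_depth):
        f_sw = torch.cat([f_rgb, f_depth], dim=1)    # feature map superposition -> F_SW
        sw = self.act(self.conv_sw(f_sw))            # SW = h(W_SW * F_SW + b_SW)
        t_fused = sw * t_rgb + (1.0 - sw) * t_depth  # T_fused = SW.T_RGB + (1-SW).T_Depth
        return t_fused, sw

def segment_from_confidence(t_fused, threshold: float = 0.5):
    """Threshold method: pixels above the (assumed) threshold are pedestrian."""
    return (t_fused > threshold).float()
```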
In the invention, the performance of the RGB target segmentation convolutional network module and the Depth target segmentation convolutional network module has a great influence on the quality of the pedestrian contour tracking result, so in practical application the target segmentation convolutional network can be chosen as needed. After comparison of several alternatives, an OSVOS (One-Shot Video Object Segmentation) network is preferably used as the target segmentation convolutional network in this embodiment. OSVOS is a deep-learning-based target segmentation network built on the FCN (Fully Convolutional Network) framework; the learned information can be transferred to general semantic information to realize foreground segmentation. Fig. 3 is a diagram of the OSVOS network structure in this embodiment. As shown in fig. 3, the OSVOS network includes 5 convolution groups (conv1 to conv5) for extracting features and a fusion convolution layer (conv final) for fusing all the superimposed feature maps. The feature maps output by convolution groups conv2 to conv5 are up-sampled, superimposed and fed into the fusion convolution layer; the 51-channel feature map obtained by this superposition is the last layer feature map, so the feature map superposition module forms a 102-channel feature map by stacking the two 51-channel feature maps, and the output of the fusion convolution layer (conv final) is the confidence map. The specific structure and operation of the OSVOS network can be found in "One-Shot Video Object Segmentation, S. Caelles, K.K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, Computer Vision and Pattern Recognition (CVPR), 2017".
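The side-output handling described above (up-sampling the feature maps of conv2 to conv5 to a common resolution and stacking them channel-wise) can be sketched as follows; the function name and the use of bilinear interpolation are assumptions for illustration, not a reproduction of the original OSVOS code.

```python
import torch
import torch.nn.functional as F

def stack_side_outputs(side_feats, out_size):
    """side_feats: list of feature maps from conv2..conv5 (decreasing resolution)."""
    upsampled = [F.interpolate(f, size=out_size, mode='bilinear', align_corners=False)
                 for f in side_feats]
    # Channel-wise superposition; in the text this per-branch stack is the
    # 51-channel "last layer feature map", and stacking the RGB and Depth
    # branches' stacks later yields the 102-channel F_SW.
    return torch.cat(upsampled, dim=1)
```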
Fig. 4 is a structural diagram of the pedestrian target confidence map acquisition model based on the OSVOS network in this embodiment. As shown in fig. 4, the RGB image and the Depth image are each input into an OSVOS network; the two last layer feature maps are superimposed and passed to the convolution module Conv-SW to obtain the probability conversion map SW, the two confidence maps are input to the confidence map fusion module fuse, and the fusion module performs the fusion operation to obtain the fused confidence map.
Because the existing OSVOS network is designed for three-channel RGB images while the input of the Depth target segmentation convolution network module is a single-channel Depth image, in this embodiment the single-channel Depth image needs to be converted into a three-channel image. The specific method is as follows:
the Depth image is first hole filled. This is because the imaging principle of a Depth image capturing device such as Kinect, etc. causes that the acquired Depth image has a pixel value of an invalid distance value 0, and is a black hollow region in visual perception, and a large area of the hollow region has a serious influence on the use of the Depth image, so that the filling of the hollow needs to be performed. In this embodiment, a hole filling method provided in an NYU Depth segmentation data set toolbox is used for hole filling processing of a Depth image, and the principle is based on a coloring method (coloring using optimization) proposed by Anat Levin et al, and the specific process is not described herein again.
The hole-filled Depth image is then encoded to generate a three-channel RGB image. In this embodiment the Depth image is encoded with the Jet colormap and converted into an RGB image. A colormap is a real m × 3 matrix in which each row defines the RGB vector of one color; the Jet colormap runs from blue to red, passing through cyan, yellow and orange. Different depth values in the Depth image are therefore mapped to gradually changing colors in the Jet colormap that represent gradually changing distances: blue represents the nearest range, the colors change progressively as the depth increases, and red represents the farthest range. This encoding balances the effectiveness of the depth information and computational efficiency, and is a relatively well-balanced choice overall.
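A minimal preprocessing sketch for the Depth branch input is given below, assuming OpenCV is available; the patent uses the colorization-based hole filling of the NYU Depth toolbox, and the cv2.inpaint call here is only a simple stand-in for that step.

```python
import cv2
import numpy as np

def depth_to_jet_rgb(depth_u16: np.ndarray) -> np.ndarray:
    # Holes are pixels with the invalid distance value 0.
    hole_mask = (depth_u16 == 0).astype(np.uint8)

    # Normalize the 16-bit depth to 8 bit before inpainting / colormapping.
    valid = depth_u16[depth_u16 > 0]
    d_min, d_max = (valid.min(), valid.max()) if valid.size else (0, 1)
    depth_u8 = np.clip((depth_u16.astype(np.float32) - d_min) /
                       max(d_max - d_min, 1) * 255.0, 0, 255).astype(np.uint8)

    # Stand-in hole filling (assumption: Telea inpainting instead of the NYU toolbox).
    filled = cv2.inpaint(depth_u8, hole_mask, 3, cv2.INPAINT_TELEA)

    # Jet colormap: near (small values) -> blue, far (large values) -> red.
    bgr = cv2.applyColorMap(filled, cv2.COLORMAP_JET)
    return cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
```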
Fig. 5 is an example of the conversion of a Depth image into an RGB image in this embodiment. As shown in fig. 5, after hole filling and encoding the Depth image is converted into an RGB image; unlike the original RGB image in the training sample, it merely uses RGB encoding to represent depth information.
S102: obtaining a training sample:
the method comprises the steps of obtaining a plurality of training samples, wherein each training sample comprises an RGB (red, green and blue) image containing a pedestrian target, a corresponding Depth image and a pedestrian target segmentation label, and the pedestrian target segmentation label is used for representing whether each pixel belongs to a pedestrian target or a background and is a binary image. The training samples can be selected from the existing general sample library or can be obtained by self.
S103: training a pedestrian target confidence map acquisition model:
next, training of the pedestrian target confidence map acquisition model is required. Because the pedestrian target confidence map acquisition model adopts a dual-channel structure model, in order to acquire better performance, end-to-end training is not suitable to be directly carried out, but needs to be carried out step by step, namely, a training mode of 'part first and whole second' is adopted, and the specific method comprises the following steps:
s3.1: training an RGB target segmentation convolution network module:
the method comprises the steps of training an RGB target segmentation convolution network module by using RGB images in a training sample as input, obtaining a pedestrian target segmentation result by using a threshold value method for an output confidence map in the training process, and carrying out error calculation on the pedestrian target segmentation result and a pedestrian target segmentation label.
S3.2: training a Depth target segmentation convolution network module:
and (3) training a Depth image in a training sample as an input, and similarly, acquiring a pedestrian target segmentation result from an output confidence map by using a threshold method in the training process, and performing error calculation with a pedestrian target segmentation label.
Because both target segmentation convolution networks in this embodiment are OSVOS networks, the parameters of the base network can be obtained directly; the base-network training stage can therefore be omitted and the two single-branch network models using different input information are trained directly and separately, with the convolution layer parameters migrated from the trained base network. Each OSVOS branch goes through three training stages with the following parameters: the learning strategy is "step", i.e. the learning rate is reduced by a given gamma value after a certain number of iterations; the base learning rate is set to 10^-8; the gamma decay rate is 0.1; stepsize (the decay step length) is set to 10000; the weight decay rate is set to 0.0002; iter_size is 10, i.e. 10 images are fed in together for each training iteration.
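For reference, a hedged PyTorch equivalent of the solver settings listed above might look as follows; the original implementation is Caffe-style, so the optimizer choice and the gradient-accumulation emulation of iter_size are assumptions.

```python
import torch

def make_branch_optimizer(model: torch.nn.Module):
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=1e-8,            # base learning rate 10^-8
                                weight_decay=2e-4)  # weight decay 0.0002
    # "step" policy: multiply the learning rate by gamma = 0.1 every 10000 iterations
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.1)
    return optimizer, scheduler

# iter_size = 10 can be emulated by accumulating gradients over 10 mini-batches
# before calling optimizer.step().
```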
S3.3: integral training:
initializing an RGB target segmentation convolution network module in the pedestrian target belief map acquisition model according to the training result of the step S3.1, initializing a Depth target segmentation convolution network module according to the training result of the step S3.2, taking an RGB image in a training sample as the input of the RGB target segmentation convolution network module, taking a corresponding Depth image as the input of the Depth target segmentation convolution network module, training the pedestrian target belief map acquisition model, similarly, acquiring a pedestrian target segmentation result by adopting a threshold method for an output fusion belief map in the training process, and carrying out error calculation with a pedestrian target segmentation label.
In this embodiment, the parameter values of the convolution layer of the pedestrian target confidence map acquisition model are initialized to 0, and the number of training iterations is 20000.
In order to weigh the confidence map of the RGB image against the confidence map of the Depth image automatically, this embodiment constructs a pseudo label Y_SW for the probability conversion map SW to guide the learning of SW. Y_SW is defined by the following formula:
Y_SW = T_RGB ⊙ Y + (1 - T_RGB) ⊙ (1 - Y)
where Y denotes the pedestrian target segmentation label in the training sample.
A value of 1 in Y_SW indicates that the pedestrian target segmentation result obtained from the confidence map of the RGB image is consistent with the pedestrian target segmentation label, i.e. the RGB prediction is reliable; Y_SW then guides the probability conversion map SW to increase the weight of the RGB image confidence map, and the weight of the Depth image confidence map decreases correspondingly. A value of 0 indicates that the segmentation result obtained from the RGB image confidence map is inconsistent with the label, i.e. the RGB prediction is unreliable; Y_SW then guides SW to decrease the weight of the RGB image confidence map, and the weight of the Depth image confidence map increases correspondingly.
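A short sketch of how the pseudo label Y_SW can be computed is shown below; binarizing T_RGB with the same threshold before the comparison is an assumption consistent with the description above.

```python
import torch

def make_sw_pseudo_label(t_rgb: torch.Tensor, y: torch.Tensor, thr: float = 0.5):
    """Y_SW = 1 where the RGB prediction agrees with the label Y, 0 where it does not."""
    t_rgb_bin = (t_rgb > thr).float()
    return t_rgb_bin * y + (1.0 - t_rgb_bin) * (1.0 - y)
```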
S104: and (3) pedestrian contour tracking:
acquiring a pedestrian video sequence, manually calibrating the pedestrian outline in the first frame image, inputting an RGB image and a corresponding Depth image into a pedestrian target confidence map acquisition model for a subsequent image to obtain a fusion confidence map, and inputting the confidence map as an external energy input guide outline of outline evolution to evolve to obtain a pedestrian outline tracking result of the current image.
In this embodiment, the DRLSE (distance regularized level set evolution) model is used for contour evolution, and its evolution equation can be expressed as:
∂φ/∂t = μ div(d_p(|∇φ|)∇φ) + λ δ_ε(φ) div(g ∇φ/|∇φ|) + α g δ_ε(φ)
where φ denotes the level set function, t denotes time, ∇φ denotes the gradient of the level set function, and d_p(s) = p′(s)/s, in which p(·) is the double-well potential function and p′(·) its first derivative; δ_ε(φ) denotes the Dirac function, div(·) denotes the divergence, g denotes the edge indication function, μ and λ are the weights of the distance regularization term and the edge term, and α is the weighted area term coefficient.
The first term of the evolution equation, μ div(d_p(|∇φ|)∇φ), is the distance regularization term. The second term, λ δ_ε(φ) div(g ∇φ/|∇φ|), corresponds to the GAC (geodesic active contour) model: the curvature and the edge gradient function jointly guide the curve evolution, the curvature term smooths the curve, and the edge gradient function makes the curve converge to the local gradient maxima of the image. The third term, α g δ_ε(φ), accelerates the level set evolution and determines the evolution direction of the curve. In the invention the fused confidence map is used to guide the continuous evolution of the pedestrian target contour, so this term is changed to use the fused confidence map ConfMap, and the modified level set evolution equation is:
∂φ/∂t = μ div(d_p(|∇φ|)∇φ) + λ δ_ε(φ) div(g ∇φ/|∇φ|) + α ConfMap δ_ε(φ)
because the pedestrian target confidence map acquisition model outputs a single fusion confidence map, the probability that each pixel belongs to the foreground target is described, and the numerical range p belongs to [0,1 ]. In the LevelSet evolution equation, if the fused confidence map ConfMap is to well guide the contour to evolve, positive and negative values (that is, the signs of the internal and external numerical values of the contour are different) are required, and a numerical range ConfMap is obtained for the range of [ -1,1], so that the fused confidence map output by the pedestrian target confidence map acquisition model is required to be mapped, and the numerical value of the fused confidence map is mapped into the range of [ -1,1 ]. In this embodiment, the formula ConfMap ═ 2p-1 is used for mapping, where p represents the fusion confidence map of the original output. When ConfMap > 0, i.e., inside the contour, the curve has a tendency to flare outward; when the ConfMap is less than 0, i.e. outside the contour, the curve tends to shrink inward, and when the ConfMap is 0, the curve is generally at the contour edge, and then the curve evolves so that it becomes smooth on the edge, guided only by the curvature and edge gradient functions. In addition to determining the direction of profile evolution, ConfMap also determines the profile evolution speed, and the larger the absolute value of the ConfMap is, the faster the profile evolution speed is.
In order to enable the pedestrian target confidence map acquisition model to be more adaptive to the change of the actual pedestrian target video sequence, the embodiment further provides an online tracking update strategy, and the pedestrian target confidence map acquisition model is updated online after the pedestrian contour tracking is completed. Fig. 6 is a flowchart of online updating of the pedestrian target confidence map acquisition model in the present embodiment. As shown in fig. 6, the specific steps of online updating the pedestrian target confidence map acquisition model in this embodiment include:
s601: and acquiring a pedestrian target segmentation result from the previous frame of fused confidence map by using a threshold method, judging whether the number of pixels contained in the pedestrian target of the previous frame is less than a preset threshold, if so, not updating the pedestrian target confidence map acquisition model, and otherwise, entering the step S602. This is because the pedestrian target may be completely blocked or go out of the field of view, so the number of foreground pixels of the pedestrian target in the previous frame needs to be calculated before updating, and if the number of foreground pixels is less than a certain value, the target is determined to disappear, and at this time, the network will not be updated online, so as to prevent the network from only learning the background information and covering the target information.
S602: optimizing the pedestrian target segmentation result of the current frame:
in many methods for tracking a target based on a discriminant and labeled with a bounding box, in order to update a discriminator, positive and negative samples are collected near a target position acquired in a previous frame, and the collected positive and negative samples are used as data for discriminant update training. In the video target segmentation, because the whole image is input, only the target cannot be subjected to sample acquisition alone, but the pedestrian target segmentation result of the current frame can also be optimized based on such a priori, so as to obtain updated data, and the specific method is as follows:
and obtaining a trip person target segmentation result from the fusion confidence map of the previous frame by adopting a threshold value method, calculating the minimum distance between each background pixel and a pedestrian target area in the previous frame, and setting the distance between the pixels of the pedestrian target area to be 0 so as to obtain a distance matrix. And then obtaining a trip human target segmentation result from the fusion confidence map of the current frame by adopting a threshold value method, inquiring each pixel in the pedestrian target from the distance matrix of the previous frame to obtain a distance value, if the distance value is less than or equal to a preset threshold value, determining that the pixel is a real pedestrian target pixel without any operation, if the distance value is greater than the preset threshold value, indicating that the pixel has high probability of belonging to the background (based on the prior that the position change of the target foreground between two frames is not too large), and setting the pixel as the background pixel. Therefore, the pixels which are wrongly segmented out of the range in the current frame pedestrian target segmentation process can be corrected, the segmentation result of the current frame is accurately guaranteed to be used as a next updated data label, and error propagation and accumulation of network updating are prevented.
S603: obtaining an updating sample to update the model:
and obtaining a pedestrian target segmentation label according to the manually calibrated pedestrian outline of the first frame image, and taking the RGB image, the corresponding Depth image and the pedestrian target segmentation label as an update sample. And obtaining a pedestrian target segmentation label according to the pedestrian target segmentation result of the current frame optimized in the step S602, and taking the RGB image, the corresponding Depth image and the pedestrian target segmentation label as an update sample. And selecting one of the two updating samples as an updating sample used for the on-line updating of the pedestrian target confidence map acquisition model at this time, and updating the pedestrian target confidence map acquisition model.
Considering that the current-frame pedestrian target may also be wrongly segmented inside the plausible activity area, where it cannot be corrected by the distance test, the current-frame segmentation result alone cannot be used as the update data; otherwise errors would propagate and accumulate. Since the annotation given in the first frame is always the most reliable information in the whole pedestrian contour tracking process (the labels of all other frames are obtained by segmentation), the first frame must always be kept as an alternative update sample: each update draws on data from two frames, the first frame and the current frame, and one of the two is selected for the update. Using the reliable information of the first frame effectively suppresses the error propagation and accumulation that segmentation errors inside the target activity area would otherwise cause in the network update.
Experiments show that alternating between the two update samples during online updating gives better results: the current-frame update sample is used with higher probability (i.e. more often), while the first frame plays the auxiliary role of preventing error propagation and accumulation and is used with lower probability (i.e. relatively less often), so that the pedestrian target confidence map acquisition model adapts well to the actual scene while keeping good performance. In practical application a usage period can be set for the first-frame update sample, larger than the video frame interval: when the period of the first-frame update sample has not been reached, the current-frame update sample is used for the online update, and when it is reached, the first-frame update sample is used.
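The selection policy can be sketched as follows; the period value and function name are assumptions.

```python
def pick_update_sample(frame_idx: int, first_frame_sample, current_frame_sample,
                       first_frame_period: int = 10):
    """Use the reliable first-frame sample periodically, the current frame otherwise."""
    if frame_idx % first_frame_period == 0:
        return first_frame_sample    # periodic anchor on the manually labelled frame
    return current_frame_sample      # otherwise adapt to the latest appearance
```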
In order to better illustrate the technical effect of the invention, a self-made dataset is used for pedestrian contour tracking test verification. Three comparison methods are used: a contour tracking method that only uses the RGB image as input and takes the MaskTrack method as the confidence map extraction module (denoted MT+LS), a contour tracking method that only uses the RGB image as input and takes the original OSVOS network as the confidence map extraction module (denoted OVS+LS), and a contour tracking method that only uses the RGB image as input and takes an OSVOS network with the online update strategy as the confidence map extraction module (denoted OVS+LS+UPD). Their tracking results are compared with those of the pedestrian contour tracking method of the invention, which uses the online update strategy and Depth information.
Fig. 7 is a comparison graph of the pedestrian contour tracking result of the pedestrian video sequence 1 according to the present invention and three comparison methods in this embodiment. Fig. 8 is a comparison graph of the regional similarity curves of the pedestrian target and the real pedestrian target of the pedestrian video sequence 1 according to the present invention and three comparison methods in this embodiment. Fig. 9 is a comparison graph of the accuracy curves of the pedestrian contour tracking result of the pedestrian video sequence 1 and the real pedestrian contour by the invention and three comparison methods in the embodiment.
Fig. 10 is a comparison graph of the pedestrian contour tracking result of the pedestrian video sequence 2 according to the present invention and three comparison methods in this embodiment. Fig. 11 is a comparison graph of the regional similarity curves of the pedestrian target and the real pedestrian target of the pedestrian video sequence 2 according to the present invention and the three comparison methods in this embodiment. Fig. 12 is a comparison graph of the accuracy curves of the pedestrian contour tracking result of the pedestrian video sequence 2 and the real pedestrian contour by the invention and three comparison methods in the embodiment.
As can be seen from fig. 7 to 12, the pedestrian contour tracking method of the invention, with the added Depth image information and online update strategy, achieves a good and stable tracking effect in tracking scenes with dim light or where the target appearance is similar to the background.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. Various changes apparent to those skilled in the art remain within the protection of the invention as long as they fall within the spirit and scope of the invention defined by the appended claims; all matter utilizing the inventive concept is protected.

Claims (6)

1. A pedestrian contour tracking method fusing RGBD multi-modal information is characterized by comprising the following steps:
s1: the method comprises the following steps of constructing a pedestrian target confidence map acquisition model, wherein the model comprises an RGB target segmentation convolution network module, a Depth target segmentation convolution network module, a feature map superposition module, a convolution module and a confidence map fusion module, and the method comprises the following steps:
the RGB target segmentation convolution network module takes the RGB image as input and outputs a confidence map T_RGB for segmenting the pedestrian target; its last layer feature map F_RGB is output to the feature map superposition module and the confidence map T_RGB is output to the confidence map fusion module;
the Depth target segmentation convolution network module takes the Depth image corresponding to the RGB image as input and outputs a confidence map T_Depth for segmenting the pedestrian target; its last layer feature map F_Depth is output to the feature map superposition module and the confidence map T_Depth is output to the confidence map fusion module;
the feature map superposition module superposes the feature map F_RGB and the feature map F_Depth to obtain a feature map F_SW and outputs it to the convolution module;
the convolution module performs a convolution operation on the feature map F_SW to obtain a probability conversion map SW and outputs it to the confidence map fusion module, where the convolution operation is expressed as:
SW = h(W_SW * F_SW + b_SW)
where W_SW and b_SW are the weight parameters of the 1 × 1 convolution kernel and the bias term respectively, and h(·) denotes an activation function;
the confidence map fusion module fuses the received confidence map T_RGB and confidence map T_Depth according to the probability conversion map SW to obtain a fused confidence map T_fused, where the fusion operation is expressed as:
T_fused = SW ⊙ T_RGB + (1 - SW) ⊙ T_Depth
where ⊙ denotes element-wise multiplication;
s2: acquiring a plurality of training samples, wherein each training sample comprises an RGB (red, green and blue) image containing a pedestrian target, a corresponding Depth image and a pedestrian target segmentation label, and the pedestrian target segmentation label is used for representing whether each pixel belongs to a pedestrian target or a background;
s3: training a pedestrian target confidence map acquisition model by adopting the following method:
s3.1: the method comprises the steps that an RGB image in a training sample is used as input, an RGB target segmentation convolution network module is trained, a pedestrian target segmentation result is obtained by an output confidence map through a threshold value method in the training process, and error calculation is carried out on the pedestrian target segmentation result and a pedestrian target segmentation label;
s3.2: taking a Depth image in a training sample as input, training a Depth target segmentation convolution network module, acquiring a pedestrian target segmentation result from an output confidence image by a threshold value method in the training process, and performing error calculation with a pedestrian target segmentation label;
s3.3: initializing the RGB target segmentation convolution network module in the pedestrian target confidence map acquisition model according to the training result of step S3.1, initializing the Depth target segmentation convolution network module according to the training result of step S3.2, taking the RGB image in the training sample as the input of the RGB target segmentation convolution network module and the corresponding Depth image as the input of the Depth target segmentation convolution network module, and training the pedestrian target confidence map acquisition model, wherein in the training process a pedestrian target segmentation result is obtained from the output fused confidence map by the threshold method and the error against the pedestrian target segmentation label is calculated;
s4: acquiring a pedestrian video sequence and manually calibrating the pedestrian contour in the first frame image; for each subsequent image, inputting the RGB image and the corresponding Depth image into the pedestrian target confidence map acquisition model to obtain a fused confidence map, and using the fused confidence map as the external energy guiding contour evolution to obtain the pedestrian contour tracking result of the current image.
2. The pedestrian contour tracking method according to claim 1, wherein the RGB target segmentation convolutional network module and the Depth target segmentation convolutional network module employ an OSVOS network, and when an input of the OSVOS network is a Depth image, a single-channel Depth image needs to be converted into a three-channel image, and the specific method is as follows: the method comprises the steps of firstly filling holes in a Depth image, then coding the Depth image after the holes are filled, and generating a three-channel RGB image.
3. The pedestrian contour tracking method according to claim 1, wherein in step S3.2 a pseudo label Y_SW of the probability conversion map SW is constructed to guide the learning of the probability conversion map SW, Y_SW being defined by the following formula:
Y_SW = T_RGB ⊙ Y + (1 - T_RGB) ⊙ (1 - Y)
where Y denotes the pedestrian target segmentation label in the training sample;
a value of 1 in Y_SW indicates that the pedestrian target segmentation result obtained from the confidence map of the RGB image is consistent with the pedestrian target segmentation label and the prediction of the RGB image is reliable, so Y_SW guides the probability conversion map SW to increase the weight of the RGB image confidence map while the weight of the Depth image confidence map decreases correspondingly; a value of 0 indicates that the pedestrian target segmentation result obtained from the confidence map of the RGB image is inconsistent with the pedestrian target segmentation label and the prediction of the RGB image is unreliable, so Y_SW guides the probability conversion map SW to decrease the weight of the RGB image confidence map while the weight of the Depth image confidence map increases correspondingly.
4. The pedestrian contour tracking method according to claim 1, wherein in step S4, a DRLSE LevelSet model is used for contour evolution.
5. The pedestrian contour tracking method according to any one of claims 1 to 4, further comprising updating a pedestrian target confidence map acquisition model on line, the specific steps comprising:
(1) acquiring a pedestrian target segmentation result from the previous frame of fused confidence map by adopting a threshold method, judging whether the number of pixels contained in the pedestrian target of the previous frame is less than a preset threshold, if so, not updating the pedestrian target confidence map acquisition model, otherwise, entering the step (2);
(2) obtaining the pedestrian target segmentation result from the fused confidence map of the previous frame by a threshold method, calculating the minimum distance from each background pixel to the pedestrian target area in the previous frame, and setting the distance of the pixels of the pedestrian target area to 0, thereby obtaining a distance matrix; then obtaining the pedestrian target segmentation result from the fused confidence map of the current frame by a threshold method, and looking up the distance value of each pixel in the pedestrian target from the distance matrix of the previous frame; if the distance value is less than or equal to a preset threshold, no operation is performed, and if the distance value is greater than the preset threshold, the pixel is set as a background pixel;
(3) obtaining a pedestrian target segmentation label according to the manually calibrated pedestrian contour of the first frame image, and taking the RGB image, the corresponding Depth image and the pedestrian target segmentation label as one update sample; obtaining a pedestrian target segmentation label according to the current-frame pedestrian target segmentation result optimized in step (2), and taking the RGB image, the corresponding Depth image and the pedestrian target segmentation label as another update sample; and selecting one of the two update samples as the update sample used for this online update, and updating the pedestrian target confidence map acquisition model.
6. The pedestrian contour tracking method according to claim 5, wherein in the selecting of the update sample, the use probability of the current frame update sample is higher than that of the first frame update sample.
CN201910389276.6A 2019-05-10 2019-05-10 Pedestrian contour tracking method fusing RGBD multi-modal information Active CN110111351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910389276.6A CN110111351B (en) 2019-05-10 2019-05-10 Pedestrian contour tracking method fusing RGBD multi-modal information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910389276.6A CN110111351B (en) 2019-05-10 2019-05-10 Pedestrian contour tracking method fusing RGBD multi-modal information

Publications (2)

Publication Number Publication Date
CN110111351A CN110111351A (en) 2019-08-09
CN110111351B true CN110111351B (en) 2022-03-25

Family

ID=67489299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910389276.6A Active CN110111351B (en) 2019-05-10 2019-05-10 Pedestrian contour tracking method fusing RGBD multi-modal information

Country Status (1)

Country Link
CN (1) CN110111351B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796100B (en) * 2019-10-31 2022-06-07 浙江大华技术股份有限公司 Gait recognition method and device, terminal and storage device
CN111627017B (en) * 2020-05-29 2024-02-23 苏州博动戎影医疗科技有限公司 Automatic segmentation method for vascular lumen based on deep learning
CN111968087B (en) * 2020-08-13 2023-11-07 中国农业科学院农业信息研究所 Plant disease area detection method
CN112990171B (en) * 2021-05-20 2021-08-06 腾讯科技(深圳)有限公司 Image processing method, image processing device, computer equipment and storage medium
CN113556600B (en) * 2021-07-13 2023-08-18 广州虎牙科技有限公司 Drive control method and device based on time sequence information, electronic equipment and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105739702A (en) * 2016-01-29 2016-07-06 电子科技大学 Multi-posture fingertip tracking method for natural man-machine interaction
CN106952294A (en) * 2017-02-15 2017-07-14 北京工业大学 A kind of video tracing method based on RGB D data
CN108470355A (en) * 2018-04-04 2018-08-31 中山大学 Merge the method for tracking target of convolutional network feature and discriminate correlation filter
CN109146921A (en) * 2018-07-02 2019-01-04 华中科技大学 A kind of pedestrian target tracking based on deep learning
CN109543697A (en) * 2018-11-16 2019-03-29 西北工业大学 A kind of RGBD images steganalysis method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157307B (en) * 2016-06-27 2018-09-11 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105739702A (en) * 2016-01-29 2016-07-06 电子科技大学 Multi-posture fingertip tracking method for natural man-machine interaction
CN106952294A (en) * 2017-02-15 2017-07-14 北京工业大学 A kind of video tracing method based on RGB D data
CN108470355A (en) * 2018-04-04 2018-08-31 中山大学 Merge the method for tracking target of convolutional network feature and discriminate correlation filter
CN109146921A (en) * 2018-07-02 2019-01-04 华中科技大学 A kind of pedestrian target tracking based on deep learning
CN109543697A (en) * 2018-11-16 2019-03-29 西北工业大学 A kind of RGBD images steganalysis method based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"A spatiotemporal warping-based video synchronization method for video stitching";Xue Zhou .etc;《2018 IEEE Visual Communications and Image Processing (VCIP)》;20190425;第1-4页 *
"The Design of an Augmented Reality System for Urban Search and Rescue";Runze Wang .etc;《2018 IEEE International Conference on Intelligence and Safety for Robotics (ISR)》;20181115;第267-272页 *
"融合RGB特征和Depth特征的3D目标识别方法";胡良梅 等;《电子测量与仪器学报》;20151031;第29卷(第10期);第1431-1439页 *
《基于超像素的多特征融合的水平集轮廓跟踪》;周雪 等;《电子科技大学学报》;20180930;第47卷(第5期);第745-752页 *

Also Published As

Publication number Publication date
CN110111351A (en) 2019-08-09

Similar Documents

Publication Publication Date Title
CN110111351B (en) Pedestrian contour tracking method fusing RGBD multi-modal information
CN111507343B (en) Training of semantic segmentation network and image processing method and device thereof
CN109636905B (en) Environment semantic mapping method based on deep convolutional neural network
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN111797716B (en) Single target tracking method based on Siamese network
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
CN109753913B (en) Multi-mode video semantic segmentation method with high calculation efficiency
CN111368846B (en) Road ponding identification method based on boundary semantic segmentation
CN112950477B (en) Dual-path processing-based high-resolution salient target detection method
CN111507210A (en) Traffic signal lamp identification method and system, computing device and intelligent vehicle
CN109543632A (en) A kind of deep layer network pedestrian detection method based on the guidance of shallow-layer Fusion Features
CN108537147A (en) A kind of gesture identification method based on deep learning
WO2019136591A1 (en) Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN107506792B (en) Semi-supervised salient object detection method
US20230059499A1 (en) Image processing system, image processing method, and non-transitory computer readable medium
CN109920018A (en) Black-and-white photograph color recovery method, device and storage medium neural network based
Wang et al. A feature-supervised generative adversarial network for environmental monitoring during hazy days
CN113326735A (en) Multi-mode small target detection method based on YOLOv5
CN111079807A (en) Ground object classification method and device
CN115100491B (en) Abnormal robust segmentation method and system for complex automatic driving scene
CN116486431A (en) RGB-T multispectral pedestrian detection method based on target perception fusion strategy
CN110942463A (en) Video target segmentation method based on generation countermeasure network
CN113506230B (en) Photovoltaic power station aerial image dodging processing method based on machine vision
CN114998132A (en) Weak supervision shadow detection method for mining complementary features through double networks
CN115035429A (en) Aerial photography target detection method based on composite backbone network and multiple measuring heads

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant