CN109977834A - Method and apparatus for segmenting a human hand and an interacting object from a depth image - Google Patents
- Publication number
- CN109977834A CN109977834A CN201910207311.8A CN201910207311A CN109977834A CN 109977834 A CN109977834 A CN 109977834A CN 201910207311 A CN201910207311 A CN 201910207311A CN 109977834 A CN109977834 A CN 109977834A
- Authority
- CN
- China
- Prior art keywords
- depth image
- human hand
- pixel
- image
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The present application proposes a method and apparatus for segmenting a human hand and an interacting object from a depth image. The method includes: constructing a depth-image-based hand segmentation dataset using a color-image-based segmentation method; training a segmentation model on the depth-image-based hand segmentation dataset, the segmentation model consisting of an encoder, an attention transfer model, and a decoder; and segmenting a depth image to be processed with the segmentation model to obtain a classification label map corresponding to that depth image, in which the value of each pixel is the class value of that pixel. By training the segmentation model on the depth-image-based hand segmentation dataset and applying it to the depth image to be processed, the method achieves pixel-level segmentation of the hand and the object, improves environmental robustness, attains higher segmentation accuracy, and can handle hand-object segmentation in complex interaction scenarios.
Description
Technical field
This application relates to the technical field of computer vision, and in particular to a method and apparatus for segmenting a human hand and an interacting object from a depth image.
Background technique
Hand segmentation is a fundamental problem in research fields such as gesture recognition, hand tracking, and hand reconstruction. Compared with the motion of an isolated hand, hand motion while interacting with an object is far more important for human-computer interaction and virtual reality. Neural-network-based semantic segmentation models have matured considerably in recent years, but existing models still suffer from low environmental robustness and poor segmentation accuracy, and cannot handle hand segmentation in complex interaction scenarios.
Summary of the invention
The present application proposes a method and apparatus for segmenting a human hand and an interacting object from a depth image, to address the problems in the related art that existing hand segmentation models have low environmental robustness and poor segmentation accuracy and cannot handle hand segmentation in complex interaction scenarios.
An embodiment of one aspect of the application proposes a method for segmenting a human hand and an interacting object from a depth image, including:
constructing a depth-image-based hand segmentation dataset using a color-image-based segmentation method;
training a segmentation model on the depth-image-based hand segmentation dataset, the segmentation model consisting of an encoder, an attention transfer model, and a decoder;
segmenting a depth image to be processed with the segmentation model to obtain a classification label map corresponding to the depth image to be processed, where the value of each pixel in the classification label map is the class value of that pixel, and the class value characterizes the category to which the pixel belongs in the depth image to be processed.
In the method for segmenting a human hand and an interacting object from a depth image according to this embodiment, a depth-image-based hand segmentation dataset is constructed using a color-image-based segmentation method; a segmentation model consisting of an encoder, an attention transfer model, and a decoder is trained on this dataset; and the depth image to be processed is segmented with the model to obtain a corresponding classification label map, in which the value of each pixel is its class value, so that the category of each pixel can be determined from its class value. By training the segmentation model on the depth-image-based hand segmentation dataset and using it to segment the depth image to be processed, pixel-level segmentation of hand and object is realized, environmental robustness is improved, segmentation accuracy is higher, and hand-object segmentation in complex interaction scenarios can be handled.
An embodiment of another aspect of the application proposes an apparatus for segmenting a human hand and an interacting object from a depth image, including:
a construction module, configured to construct a depth-image-based hand segmentation dataset using a color-image-based segmentation method;
a training module, configured to train a segmentation model on the depth-image-based hand segmentation dataset, the segmentation model consisting of an encoder, an attention transfer model, and a decoder;
a recognition module, configured to segment a depth image to be processed with the segmentation model to obtain a classification label map corresponding to the depth image to be processed, where the value of each pixel in the classification label map is the class value of that pixel, and the class value characterizes the category to which the pixel belongs in the depth image to be processed.
In the apparatus for segmenting a human hand and an interacting object from a depth image according to this embodiment, a depth-image-based hand segmentation dataset is constructed using a color-image-based segmentation method; a segmentation model consisting of an encoder, an attention transfer model, and a decoder is trained on this dataset; and the depth image to be processed is segmented with the model to obtain a corresponding classification label map, in which the value of each pixel is its class value, so that the category of each pixel can be determined from its class value. By training the segmentation model on the depth-image-based hand segmentation dataset and using it to segment the depth image to be processed, pixel-level segmentation of hand and object is realized, environmental robustness is improved, segmentation accuracy is higher, and hand-object segmentation in complex interaction scenarios can be handled.
Additional aspects and advantages of the application will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the application.
Detailed description of the invention
The above and/or additional aspects and advantages of the application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow diagram of a method for segmenting a human hand and an interacting object from a depth image provided by an embodiment of the application;
Fig. 2 is a structural diagram of a segmentation model provided by an embodiment of the application;
Fig. 3 is a structural diagram of an attention mechanism model provided by an embodiment of the application;
Fig. 4 is a flow diagram of another method for segmenting a human hand and an interacting object from a depth image provided by an embodiment of the application;
Fig. 5 is a diagram of the training process of a segmentation model provided by an embodiment of the application;
Fig. 6 illustrates the effect of using the contour error, provided by an embodiment of the application;
Fig. 7 is a structural diagram of an apparatus for segmenting a human hand and an interacting object from a depth image provided by an embodiment of the application.
Specific embodiment
Embodiments of the application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the application and should not be construed as limiting it.
The method and apparatus for segmenting a human hand and an interacting object from a depth image according to embodiments of the application are described below with reference to the drawings.
Fig. 1 is a flow diagram of a method for segmenting a human hand and an interacting object from a depth image provided by an embodiment of the application.
As shown in Fig. 1, the method for segmenting a human hand and an interacting object from a depth image includes:
Step 101: construct a depth-image-based hand segmentation dataset using a color-image-based segmentation method.
Since a depth camera can capture color and depth images simultaneously, color and depth images of the hand interacting with an object are captured with a depth camera, yielding multiple color-depth image pairs. The depth images are then processed based on the color images, producing a hand segmentation dataset on depth images.
To improve segmentation accuracy, in this embodiment the images may be captured under a fixed light source of constant brightness and color temperature, using objects whose color differs strongly from skin color. For example, images of a hand holding a blue pen may be captured under constant brightness and lighting.
Step 102: train a segmentation model on the depth-image-based hand segmentation dataset.
After the depth-image-based hand segmentation dataset is obtained, an initial neural network model is trained on it until a segmentation model meeting the requirements is obtained. During training, a loss function can be used to measure the predictive performance of the segmentation model.
In this embodiment, the segmentation model consists of an encoder, an attention transfer model, and a decoder. The encoder uses a large-scale convolutional network, and the decoder uses deconvolution (transposed convolution) layers to restore high-level information to the pixel scale of the image.
Fig. 2 is a structural diagram of a segmentation model provided by an embodiment of the application. As shown in Fig. 2, the segmentation model consists of an encoder, an attention transfer model, and a decoder. In this embodiment, an attention mechanism is added between the encoder and the decoder to strengthen the same-layer connections between them; by fusing multi-scale image features into attention feature maps, the accuracy and effectiveness of information transfer between the two can be improved.
Fig. 3 is a structural diagram of an attention mechanism model provided by an embodiment of the application. In Fig. 3, the feature maps of layers 1, 2, ..., i-1 are multiplied together to obtain the low-level attention map (FineAtt); each of layers 1 to i-1 includes a scale normalization network (SqueezeNet, SN) and a bilinear down-sampling layer (Bilinear down-sampling, DS), where SN normalizes the feature-map dimensions. The feature maps of layers i+1, i+2, ..., n are multiplied together to obtain the high-level attention map (CoarseAtt); each of layers i+1 to n includes an SN and an up-sampling layer (US). DS and US reduce and enlarge the feature-map scale, respectively. The obtained FineAtt and CoarseAtt attention maps are concatenated with the feature map of layer i and fed into the decoder. The attention mechanism is applied in this way at every feature-map scale from layer 1 to layer n in Fig. 3.
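The multi-scale fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the trained network: the SN block is approximated by a simple per-map normalization, the bilinear resampling is replaced by nearest-neighbour resampling to keep the sketch short, and all layers are assumed to share the same channel count so the elementwise products are defined.

```python
import numpy as np

def resize_nn(fmap, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) feature map.
    (The patent uses bilinear down-sampling; nearest keeps this short.)"""
    h, w, _ = fmap.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return fmap[rows][:, cols]

def attention_maps(features, i):
    """Fuse multi-scale feature maps into FineAtt / CoarseAtt at the
    scale of layer i, then concatenate them with layer i's features.
    `features` is a list of (H_k, W_k, C) arrays, layer 1 first.
    Hypothetical sketch: SN is approximated by per-map normalization
    rather than a learned block."""
    h, w, c = features[i].shape

    def normalized(f):  # stand-in for the SN (scale normalization) step
        return f / (np.abs(f).max() + 1e-8)

    fine = np.ones((h, w, c))
    for f in features[:i]:                 # layers 1 .. i-1, down-sampled
        fine = fine * resize_nn(normalized(f), h, w)
    coarse = np.ones((h, w, c))
    for f in features[i + 1:]:             # layers i+1 .. n, up-sampled
        coarse = coarse * resize_nn(normalized(f), h, w)

    # FineAtt and CoarseAtt are concatenated with layer i's feature map
    return np.concatenate([fine, coarse, features[i]], axis=-1)
```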
Step 103: segment the depth image to be processed with the segmentation model to obtain a classification label map corresponding to the depth image to be processed.
In this embodiment, the depth image to be processed may be acquired by a depth camera before recognition.
After the segmentation model is obtained, the depth image to be processed is input into the trained segmentation model, which outputs a classification label map corresponding to it. The classification label map has the same size as the depth image to be processed, and the value of each pixel in the label map is the class value of that pixel. The class value characterizes the category to which the pixel belongs in the depth image to be processed. The pixel coordinates are implicit in the pixel arrangement of the image, and the value of each pixel in the input depth image is its depth value.
The category to which a pixel belongs in the depth image to be processed may include hand, object, and background. In a specific implementation, these three categories may be represented by different class values, e.g. 0 for background, 1 for hand, and 2 for object.
In this embodiment, from the class value of each pixel and the category that value encodes, the segmentation of hand and object in the depth image to be processed can be obtained, realizing the segmentation of the hand and the interacting object.
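As a small illustration of how the class values are consumed downstream, the label map can be split into per-category masks, using the class values 0/1/2 suggested in the text:

```python
import numpy as np

BACKGROUND, HAND, OBJECT = 0, 1, 2  # class values as suggested in the text

def split_masks(label_map):
    """Turn a classification label map (H, W) of class values into
    boolean masks for the hand and the interacting object."""
    return label_map == HAND, label_map == OBJECT

label_map = np.array([[0, 1, 1],
                      [0, 1, 2],
                      [0, 2, 2]])
hand, obj = split_masks(label_map)
# hand marks 3 hand pixels, obj marks 3 object pixels
```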
As shown in Fig. 2, the depth image to be processed is input into the deep network model, passing first through the encoder, then through the attention transfer model, and finally through the decoder, which outputs the classification label map of the depth image to be processed. From the class value of each pixel, the positions of the hand and the object are obtained, realizing hand-object segmentation.
In this embodiment, from the class values of the pixels in the label map output by the segmentation model and the categories they encode, the pixels belonging to the hand and those belonging to the object can be determined, so that the interacting hand and object in the image to be processed are separated. Pixel-level segmentation of hand and object is realized with high accuracy, and the interacting hand and object can be segmented even in complex cases.
In one embodiment of the application, the depth-image-based hand segmentation training dataset can be constructed from color images. This is described in detail below with reference to Fig. 4, which is a flow diagram of another method for segmenting a human hand and an interacting object from a depth image provided by an embodiment of the application.
As shown in Fig. 4, the method for constructing the depth-image-based hand segmentation dataset includes:
Step 301: acquire multiple pairs of color images and depth images of the hand interacting with objects.
In this embodiment, some objects whose color differs strongly from skin color are first collected. Then images of the hand interacting with each object are captured with a depth camera, yielding multiple color-depth image pairs. In addition, to increase the amount of data, images of different interaction poses between the hand and the same object may be captured.
When capturing images with the depth camera, the illumination is kept fixed, e.g. a fixed light source of constant brightness and color temperature, to ensure that the captured color images are clear and shadow-free.
Step 302: perform HSV-color-space-based object segmentation on all color images to obtain the class value of each pixel in every color image.
In this embodiment, the background in all color and depth images may first be removed by a depth threshold, retaining only the hand and the object. Then, using the standard conversion from the RGB color space to the HSV color space, all acquired color images are converted to HSV. The parameters of the HSV color space are hue (H), saturation (S), and value (V).
The HSV representation of each color image is then segmented to obtain the class value of each pixel in the image. Specifically, the distributions of pixels from multiple bare-hand samples and interaction samples are analyzed in HSV space; the overlap region of the samples is the region of hand pixels, and several linear constraints are fitted to it. All color images are then analyzed: pixels inside the constraints are labeled as hand, and pixels outside them as object.
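A rough sketch of this step, assuming float RGB images in [0, 1]. The hand/object rule used here is a single hypothetical hue-saturation band, whereas the method fits several linear constraints to real hand samples:

```python
import numpy as np

def rgb_to_hsv(img):
    """Vectorized RGB -> HSV for an (H, W, 3) float image in [0, 1].
    Hue is returned in [0, 1) (multiply by 360 for degrees)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    v = img.max(-1)                       # value = max channel
    c = v - img.min(-1)                   # chroma
    s = np.where(v > 0, c / np.where(v > 0, v, 1), 0)
    safe_c = np.where(c > 0, c, 1)        # avoid division by zero
    h = np.select(
        [c == 0, v == r, v == g],
        [0, (g - b) / safe_c, (b - r) / safe_c + 2],
        (r - g) / safe_c + 4,
    )
    return np.stack([(h / 6) % 1.0, s, v], axis=-1)

def label_pixels(img):
    """Label each foreground pixel hand (1) or object (2) with an
    illustrative constraint in HSV; the real constraints are linear
    boundaries fitted to the overlap region of hand samples."""
    hsv = rgb_to_hsv(img)
    h, s = hsv[..., 0], hsv[..., 1]
    is_hand = (h < 50 / 360) & (s > 0.2)  # hypothetical skin-tone band
    return np.where(is_hand, 1, 2).astype(np.uint8)
```

With this rule, a reddish skin-tone pixel falls inside the band (hand) while a blue pen pixel falls outside it (object), which is why the text recommends objects whose color differs strongly from skin.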
Step 303: for each pair of color and depth images, map each pixel in the color image to the corresponding pixel in the depth image, and construct the depth-image-based hand segmentation training dataset.
For each pair of color and depth images, the color image and the depth image are aligned pixel by pixel: the intrinsic and extrinsic camera parameters of the depth and color sensors are estimated, and the depth point cloud is transformed by an affine transformation into the color camera space. An automated annotation method is used to generate the ground-truth classification label map based on the color image, which is also the ground-truth label map of the depth image corresponding to that color image. In the ground-truth label map, the class value of each pixel may be 0 for background, 1 for hand, and 2 for object.
In this embodiment, all depth images together with their ground-truth label maps constitute the depth-image-based hand segmentation training dataset.
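The alignment can be sketched as the usual back-project / transform / re-project chain. Here K_d and K_c stand for the estimated depth and color intrinsic matrices and (R, t) for the estimated extrinsic rigid transform; all of these are inputs the method estimates, not values given in the text:

```python
import numpy as np

def depth_to_color(depth, K_d, K_c, R, t):
    """Map each depth pixel into the color camera, as in the alignment
    step: back-project with the depth intrinsics K_d, apply the
    extrinsic transform (R, t), re-project with the color intrinsics
    K_c. Returns an (H, W, 2) array of color-image pixel coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    pts = np.linalg.inv(K_d) @ pix * depth.reshape(-1)  # 3D points, depth frame
    pts = R @ pts + t[:, None]                          # into the color frame
    proj = K_c @ pts
    uv = proj[:2] / proj[2]                             # perspective divide
    return uv.T.reshape(h, w, 2)
```

With identical intrinsics and an identity transform, every pixel maps back to itself, which is a convenient sanity check for the calibration code.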
Further, in one embodiment of the application, the depth images may be preprocessed before the mapping in order to improve segmentation accuracy: they are denoised with morphological operations and contour filtering, and the background in the depth images is analyzed so that only the hand and the object interacting with it are retained.
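A minimal version of this preprocessing, assuming depth values in meters and using SciPy's morphology; the depth thresholds and the structuring-element size are hypothetical choices for the sketch:

```python
import numpy as np
from scipy import ndimage

def preprocess_depth(depth, near=0.2, far=1.2):
    """Illustrative depth-image preprocessing: threshold out the
    background by depth, then morphologically open the foreground mask
    to remove small speckle noise (the text also mentions contour
    filtering, omitted here)."""
    fg = (depth > near) & (depth < far)   # depth-threshold background removal
    fg = ndimage.binary_opening(fg, structure=np.ones((3, 3)))
    return np.where(fg, depth, 0.0)
```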
After the dataset for training the segmentation model is obtained, the depth-image-based hand segmentation training dataset may first be divided into a training dataset and a test dataset, where the number of depth images in the training dataset is far greater than that in the test dataset; the training dataset is used for training, and the test dataset is used to test the trained model.
Then the initial segmentation model is trained on the training dataset, and a first loss function is computed. The first loss function is the softmax cross-entropy loss, as in formula (1):

L = -Σ_i y_i · log(e^{x_i} / Σ_j e^{x_j})        (1)

where y_i denotes the ground truth, x_i denotes the predicted value output by the segmentation model, and the subscripts i and j index the different categories. For example, with three pixel categories, the loss for class i = 0 is computed first as L_0 = -y_0 · log(e^{x_0} / Σ_j e^{x_j}); likewise L_1 = -y_1 · log(e^{x_1} / Σ_j e^{x_j}) for class i = 1 and L_2 = -y_2 · log(e^{x_2} / Σ_j e^{x_j}) for class i = 2. The loss of the model is then L = L_0 + L_1 + L_2.
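The softmax cross-entropy loss can be written directly in NumPy; the max-shift before the log-sum-exp is a standard numerical-stability trick, not something stated in the text:

```python
import numpy as np

def softmax_cross_entropy(logits, onehot):
    """Per-pixel softmax cross-entropy over the classes
    (background / hand / object): -sum_i y_i * log softmax(x)_i.
    `logits` and `onehot` have shape (..., num_classes)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_softmax = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(onehot * log_softmax).sum(axis=-1)
```

For uniform logits over three classes the loss is log 3, the entropy of a uniform three-way prediction.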
It should be noted that the first loss function may also be any other loss function capable of realizing the segmentation task.
Specifically, the depth images in the training dataset are input into the initial neural network model, which outputs predicted classification label maps for them. Then, according to the gap between the predicted label map and the ground-truth label map of the depth image, a gradient descent algorithm back-propagates to all parameters of the network and updates them accordingly. The next time a depth image is input, the predicted label map output by the network will be closer to the ground-truth label map.
When the value of the first loss function no longer decreases during training, that is, when the performance of the model under that loss function is optimal, training continues with the contour error as the loss function. The contour error is given by formula (2):

L_contour = || B(S(M_labels)) - B(S(M_logits)) ||        (2)

where B is a blurring operation, e.g. a Gaussian blur with a 5 × 5 kernel and σ = 2.121; S is contour extraction, e.g. with the Sobel operator; M_labels is the ground-truth classification label map; and M_logits is the network output, namely the predicted class value of each pixel.
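A sketch of the contour error using SciPy's Sobel and Gaussian filters; the composition order B(S(·)) and the choice of the L2 norm are one reading of the description, not confirmed by the text:

```python
import numpy as np
from scipy import ndimage

def contour_error(labels, logits):
    """Contour error sketch: extract contours with the Sobel operator
    (S), soften them with a Gaussian blur (B, sigma = 2.121 as in the
    text), and take the L2 norm of the difference."""
    def blurred_contour(m):
        m = m.astype(float)
        s = np.hypot(ndimage.sobel(m, axis=0),   # gradient magnitude
                     ndimage.sobel(m, axis=1))   # as the contour map
        return ndimage.gaussian_filter(s, sigma=2.121)
    return np.linalg.norm(blurred_contour(labels) - blurred_contour(logits))
```

Blurring the contour maps makes the loss tolerant of small contour offsets instead of penalizing only exact overlap, which matches the sharper boundaries reported in Fig. 6.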
When the value of the contour error stabilizes and no longer decreases, training can be stopped and the segmentation model is obtained. The model is then tested on the test dataset: the depth images in the test dataset are input into the segmentation model, and the intersection-over-union (IoU) score over all test depth images is computed; the IoU score is used to judge whether the segmentation model meets the requirements.
IoU is the ratio of intersection to union; in this embodiment it is the ratio of the intersection of the model prediction and the ground truth to the union of the model prediction and the ground truth.
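The IoU score can be computed per class as follows:

```python
import numpy as np

def iou(pred, truth, cls):
    """IoU for one class: |prediction AND ground truth| divided by
    |prediction OR ground truth|, on the pixel masks of that class."""
    p, t = pred == cls, truth == cls
    union = np.logical_or(p, t).sum()
    return np.logical_and(p, t).sum() / union if union else 1.0

pred  = np.array([[1, 1, 0], [1, 2, 2], [0, 2, 0]])
truth = np.array([[1, 1, 0], [1, 1, 2], [0, 2, 2]])
# hand (class 1): intersection 3, union 4 -> IoU 0.75
```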
Fig. 5 is a diagram of the training process of a segmentation model provided by an embodiment of the application. The left side of Fig. 5 shows the data construction process, and the right side shows the model training process. During data construction, the color and depth images captured by the depth camera are aligned, and an automated annotation method generates the ground-truth classification label map based on the color image, which is also the ground-truth label map of the aligned depth image. All depth images and their ground-truth label maps constitute the depth-image-based hand segmentation training dataset.
During model training, the depth images in the dataset are input into the attention segmentation network, the label maps predicted by the network model are compared with the ground-truth label maps to compute the loss, and the network parameters are updated iteratively, step by step.
Fig. 6 illustrates the effect of using the contour error, provided by an embodiment of the application. In Fig. 6, the left column shows the ground-truth labels of object and hand, the middle column shows the network output without the contour error, and the right column shows the network output after using the contour error.
In this embodiment, when training the segmentation model, a general loss function is used first; when its value stabilizes, i.e. the model is optimal under that loss function, training continues with the contour error as the loss function. Together with the attention mechanism model added to the segmentation model, this greatly improves the segmentation accuracy of the model.
Further, to enhance the generalization ability of the segmentation model, a data augmentation operation may be applied to the training dataset before training, and the depth images produced by the augmentation are added to the training dataset. The data augmentation operation includes at least one of freely rotating the depth image, adding random noise, and randomly flipping the depth image.
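The three augmentations named above can be sketched as follows; applying each with probability 1/2 and restricting rotation to multiples of 90° (to avoid interpolation) are choices made for the sketch, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(depth):
    """Apply the three augmentations from the text to a depth image:
    rotation, additive random noise, and a random flip."""
    out = depth.copy()
    if rng.random() < 0.5:
        out = np.rot90(out, k=rng.integers(1, 4))   # rotation
    if rng.random() < 0.5:
        out = out + rng.normal(0, 0.01, out.shape)  # random noise
    if rng.random() < 0.5:
        out = np.flip(out, axis=rng.integers(0, 2)) # random flip
    return out
```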
To realize the above embodiments, an embodiment of the application also proposes an apparatus for segmenting a human hand and an interacting object from a depth image. Fig. 7 is a structural diagram of an apparatus for segmenting a human hand and an interacting object from a depth image provided by an embodiment of the application.
As shown in Fig. 7, the apparatus for segmenting a human hand and an interacting object from a depth image includes: a construction module 610, a training module 620, and a recognition module 630.
The construction module 610 is configured to construct a depth-image-based hand segmentation dataset using a color-image-based segmentation method.
The training module 620 is configured to train a segmentation model on the depth-image-based hand segmentation dataset, the segmentation model consisting of an encoder, an attention transfer model, and a decoder.
The recognition module 630 is configured to segment a depth image to be processed with the segmentation model to obtain a classification label map corresponding to the depth image to be processed, where the value of each pixel in the label map is the class value of that pixel, and the class value characterizes the category to which the pixel belongs in the depth image to be processed.
In one possible implementation of the embodiment, the construction module 610 is specifically configured to:
acquire multiple pairs of color images and depth images of the hand interacting with an object;
perform HSV-color-space-based object segmentation on all color images to obtain the class value of each pixel in every color image;
for each pair of color and depth images, map each pixel in the color image to the corresponding pixel in the depth image, and construct the depth-image-based hand segmentation training dataset.
In one possible implementation of the embodiment, the depth images are preprocessed, including noise and background removal.
In one possible implementation of the embodiment, the depth-image-based hand segmentation dataset includes a training dataset and a test dataset, and the training module 620 is specifically configured to:
train the initial neural network model on the training dataset and compute a first loss function, where the first loss function is the softmax cross-entropy loss;
when the value of the first loss function no longer decreases, continue training with the contour error as the loss function.
In one possible implementation of the embodiment, the apparatus further includes:
a processing module, configured to apply a data augmentation operation to the training dataset, the data augmentation operation including at least one of freely rotating the depth image, adding random noise, and randomly flipping the depth image.
It should be noted that the above explanation of the method embodiments for segmenting a human hand and an interacting object from a depth image also applies to the apparatus of these embodiments, and is therefore not repeated here.
In the apparatus for segmenting a human hand and an interacting object from a depth image according to this embodiment, a depth-image-based hand segmentation dataset is constructed using a color-image-based segmentation method; a segmentation model consisting of an encoder, an attention transfer model, and a decoder is trained on this dataset; and the depth image to be processed is segmented with the model to obtain a corresponding classification label map, in which the value of each pixel is its class value, so that the category of each pixel can be determined from its class value. By training the segmentation model on the depth-image-based hand segmentation dataset and using it to segment the depth image to be processed, pixel-level segmentation of hand and object is realized, environmental robustness is improved, segmentation accuracy is higher, and hand-object segmentation in complex interaction scenarios can be handled.
Claims (10)
1. A method for segmenting a human hand and an interacting object from a depth image, characterized by comprising:
constructing a depth-image-based hand segmentation dataset using a color-image-based segmentation method;
training a segmentation model on the depth-image-based hand segmentation dataset, the segmentation model consisting of an encoder, an attention transfer model, and a decoder;
segmenting a depth image to be processed with the segmentation model to obtain a classification label map corresponding to the depth image to be processed, wherein the value of each pixel in the classification label map is the class value of that pixel, and the class value characterizes the category to which the pixel belongs in the depth image to be processed.
2. The method according to claim 1, characterized in that constructing the depth-image-based hand segmentation dataset using the color-image-based segmentation method comprises:
acquiring multiple pairs of color images and depth images of the hand interacting with an object;
performing HSV-color-space-based object segmentation on all color images to obtain the class value of each pixel in every color image;
for each pair of color and depth images, mapping each pixel in the color image to the corresponding pixel in the depth image, and constructing the depth-image-based hand segmentation training dataset.
3. The method according to claim 2, characterized in that after mapping each pixel in the color image to the corresponding pixel in the depth image, the method further comprises:
preprocessing the depth image, including noise and background removal.
4. The method according to claim 2, wherein the hand segmentation data set based on depth images includes a training data set and a test data set, and training the segmentation model using the hand segmentation data set based on depth images comprises:
training an initial neural network model using the training data set and computing a first loss function, wherein the first loss function is a softmax cross-entropy loss function;
when the value of the first loss function no longer decreases, continuing training with a contour error as the loss function.
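The two-stage schedule of claim 4 can be sketched as a softmax cross-entropy loss plus a plateau detector that switches to the contour-error stage. The patience and tolerance values, and the plateau criterion itself, are assumptions; the contour error is only named here, not implemented:

```python
import numpy as np

def softmax_xent(logits, labels):
    """Mean softmax cross-entropy. logits: (N, C), labels: (N,)."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(labels)), labels].mean()

def pick_loss(history, patience=3, tol=1e-4):
    """Stage-switch sketch: once the cross-entropy loss has not improved
    by more than `tol` for `patience` epochs, continue training on the
    contour error (names and thresholds assumed, not the patent's)."""
    if len(history) >= patience:
        recent = history[-patience:]
        if max(recent) - min(recent) < tol:
            return "contour"
    return "cross_entropy"

logits = np.array([[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]])
labels = np.array([0, 1])
loss = softmax_xent(logits, labels)
stage = pick_loss([0.50, 0.41, 0.40, 0.40, 0.40])
```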
5. The method according to claim 4, wherein before training the segmentation model using the training data set, the method further comprises:
performing a data augmentation operation on the training data set, the data augmentation operation including at least one of freely rotating a depth image, adding random noise, and randomly flipping a depth image.
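The three augmentations of claim 5 can be sketched as below. To stay dependency-free, a 90-degree rotation stands in for the free (arbitrary-angle) rotation, and the noise standard deviation and flip probability are assumed values, not the patent's:

```python
import numpy as np

def augment(depth, rng):
    """Data-augmentation sketch: rotation (90-degree stand-in for free
    rotation), additive Gaussian noise, and a random horizontal flip."""
    out = depth.astype(np.float32)
    out = np.rot90(out, k=rng.integers(0, 4))          # rotation stand-in
    out = out + rng.normal(0.0, 2.0, size=out.shape)   # random noise
    if rng.random() < 0.5:
        out = out[:, ::-1]                             # random flip
    return out

rng = np.random.default_rng(0)
depth = np.full((8, 8), 600.0)   # toy square depth patch
aug = augment(depth, rng)
```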
6. An apparatus for segmenting a human hand and an interacting object from a depth image, comprising:
a construction module, configured to construct a hand segmentation data set based on depth images by using a segmentation method based on color images;
a training module, configured to train a segmentation model using the hand segmentation data set based on depth images, the segmentation model being composed of an encoder, an attention transfer model, and a decoder;
a recognition module, configured to segment a depth image to be processed using the segmentation model to obtain a classification label map corresponding to the depth image to be processed, wherein the value of each pixel in the classification label map is the class value of that pixel, and the class value characterizes the class to which the pixel belongs in the depth image to be processed.
7. The apparatus according to claim 6, wherein the construction module is specifically configured to:
acquire multiple pairs of color images and depth images under a hand-object interaction scenario;
perform object segmentation based on the HSV color space on all color images to obtain the class value of each pixel in every color image;
for each pair of color image and depth image, map each pixel in the color image to the corresponding pixel in the depth image, so as to construct a hand segmentation training data set based on depth images.
8. The apparatus according to claim 7, further comprising:
a preprocessing module, configured to preprocess the depth image, including noise and background removal.
9. The apparatus according to claim 7, wherein the hand segmentation data set based on depth images includes a training data set and a test data set, and the training module is specifically configured to:
train an initial neural network model using the training data set and compute a first loss function, wherein the first loss function is a softmax cross-entropy loss function;
when the value of the first loss function no longer decreases, continue training with a contour error as the loss function.
10. The apparatus according to claim 9, further comprising:
a processing module, configured to perform a data augmentation operation on the training data set, the data augmentation operation including at least one of freely rotating a depth image, adding random noise, and randomly flipping a depth image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910207311.8A CN109977834B (en) | 2019-03-19 | 2019-03-19 | Method and device for segmenting human hand and interactive object from depth image |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910207311.8A CN109977834B (en) | 2019-03-19 | 2019-03-19 | Method and device for segmenting human hand and interactive object from depth image |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109977834A true CN109977834A (en) | 2019-07-05 |
CN109977834B CN109977834B (en) | 2021-04-06 |
Family
ID=67079395
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910207311.8A Active CN109977834B (en) | 2019-03-19 | 2019-03-19 | Method and device for segmenting human hand and interactive object from depth image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109977834B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106469446A (en) * | 2015-08-21 | 2017-03-01 | Xiaomi Technology Co., Ltd. | Depth image segmentation method and segmentation device
WO2017116879A1 (en) * | 2015-12-31 | 2017-07-06 | Microsoft Technology Licensing, Llc | Recognition of hand poses by classification using discrete values
CN107729326A (en) * | 2017-09-25 | 2018-02-23 | Shenyang Aerospace University | Neural machine translation method based on Multi-BiRNN encoding
CN108647214A (en) * | 2018-03-29 | 2018-10-12 | Institute of Automation, Chinese Academy of Sciences | Decoding method based on a deep neural network translation model
CN108898142A (en) * | 2018-06-15 | 2018-11-27 | Ningbo Yunjiang Internet Technology Co., Ltd. | Handwritten formula recognition method and computing device
CN109272513A (en) * | 2018-09-30 | 2019-01-25 | Tsinghua University | Hand and object interactive segmentation method and device based on a depth camera
CN109448006A (en) * | 2018-11-01 | 2019-03-08 | Jiangxi University of Science and Technology | U-shaped densely connected retinal vessel segmentation method with attention mechanism
- 2019-03-19: application CN201910207311.8A granted as CN109977834B (status: Active)
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111127535A (en) * | 2019-11-22 | 2020-05-08 | 北京华捷艾米科技有限公司 | Hand depth image processing method and device |
CN111568197A (en) * | 2020-02-28 | 2020-08-25 | 佛山市云米电器科技有限公司 | Intelligent detection method, system and storage medium |
CN112396137A (en) * | 2020-12-14 | 2021-02-23 | 南京信息工程大学 | Point cloud semantic segmentation method fusing context semantics |
CN112396137B (en) * | 2020-12-14 | 2023-12-15 | 南京信息工程大学 | Point cloud semantic segmentation method integrating context semantics |
CN113158774A (en) * | 2021-03-05 | 2021-07-23 | 北京华捷艾米科技有限公司 | Hand segmentation method, device, storage medium and equipment |
CN113158774B (en) * | 2021-03-05 | 2023-12-29 | 北京华捷艾米科技有限公司 | Hand segmentation method, device, storage medium and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN109977834B (en) | 2021-04-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108304765B (en) | Multi-task detection device for face key point positioning and semantic segmentation | |
CN109977834A (en) | Method and apparatus for segmenting a human hand and an interacting object from a depth image | |
CN106682633B (en) | Machine-vision-based classification and identification method for visible components in stool examination images | |
CN106127108B (en) | Hand image region detection method based on convolutional neural networks | |
Hou et al. | Classification of tongue color based on CNN | |
CN109558832A (en) | Human body posture detection method, apparatus, device and storage medium | |
CN104463250B (en) | Sign language recognition and translation method based on Davinci technology | |
CN106687989A (en) | Method and system of facial expression recognition using linear relationships within landmark subsets | |
CN109325395A (en) | Image recognition method, convolutional neural network model training method and device | |
Geetha et al. | A vision based dynamic gesture recognition of indian sign language on kinect based depth images | |
CN104484886B (en) | MR image segmentation method and device | |
CN110246181B (en) | Anchor-point-based pose estimation model training method, pose estimation method and system | |
CN113256637A (en) | Urine visible component detection method based on deep learning and context correlation | |
CN114998934B (en) | Clothes-changing pedestrian re-identification and retrieval method based on multi-modal intelligent perception and fusion | |
CN109145836A (en) | Ship target video detection method based on deep learning network and Kalman filtering | |
CN109598249A (en) | Clothing detection method and apparatus, electronic device, and storage medium | |
CN110008961A (en) | Real-time text recognition method, apparatus, computer device, and storage medium | |
CN110555830A (en) | Deep neural network skin detection method based on DeepLabv3+ | |
CN112700461A (en) | System for pulmonary nodule detection and characterization class identification | |
CN110334566A (en) | Internal and external fingerprint extraction method for OCT based on three-dimensional fully convolutional neural networks | |
CN109949200A (en) | Steganalysis framework construction method based on filter subset selection and CNN | |
CN106529486A (en) | Race recognition method based on a three-dimensional deformable face model | |
CN117036288A (en) | Tumor subtype diagnosis method for whole-slide pathological images | |
CN111862031A (en) | Face synthetic image detection method and apparatus, electronic device, and storage medium | |
CN113033305B (en) | Living body detection method, apparatus, terminal device, and storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||