CN108596256B - Object recognition classifier construction method based on RGB-D - Google Patents

Object recognition classifier construction method based on RGB-D

Info

Publication number
CN108596256B
Authority
CN
China
Prior art keywords
rgb
network
depth
layer
pictures
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810383002.1A
Other languages
Chinese (zh)
Other versions
CN108596256A (en)
Inventor
胡勇
周锋
迟小羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Research Institute Of Beihang University
Original Assignee
Qingdao Research Institute Of Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Research Institute Of Beihang University filed Critical Qingdao Research Institute Of Beihang University
Priority to CN201810383002.1A
Publication of CN108596256A
Application granted
Publication of CN108596256B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a new RGB-D-based method for constructing an object recognition classifier. It mainly addresses two problems: existing RGB-D databases are small in scale, and the RGB-D classifiers trained on them recognize rare objects in the database with low accuracy. The method comprises the following steps: collect RGB-modality pictures of an object together with depth-modality pictures at the same pose; extract the features of the RGB-modality pictures and of the corresponding depth-modality pictures in turn; manually analyze the collected RGB-modality and depth-modality pictures in sequence and add labels; and construct an object classifier by combining the RGB-modality features and the depth-modality features. The method can be applied to object recognition: by sampling RGB and depth modality data of the current object, its category can be identified effectively.

Description

Object recognition classifier construction method based on RGB-D
Technical Field
The invention belongs to the technical field of computer applications, and particularly relates to an RGB-D-based object recognition classifier construction method.
Background
Ever since the ENIAC computer began operating in Philadelphia on 14 February 1946, forward-looking researchers and users have asked whether computers can think and solve problems independently and autonomously as humans do; this line of thought is early artificial intelligence. To decide whether a machine is intelligent, the computer scientist and cryptography pioneer Alan Turing proposed the "Turing test" in his paper "Computing Machinery and Intelligence": a human tester poses a series of questions for five minutes, and the computer passes the test if more than 30% of its answers are judged to have been given by a human rather than by a computer. The ultimate goal of artificial intelligence is to free human beings from complex, dangerous, repetitive and monotonous work, improve people's lives and promote human development. Biological research indicates that more than 80% of the external information humans receive comes through the eyes, so machine vision is a particularly important branch of computer research. Object recognition is one of the most basic and important tasks in machine vision.
Existing object recognition technologies fall roughly into three categories: (1) RGB-based object recognition, which extracts feature information from RGB modality data and feeds the extracted RGB features into a classifier to identify the object; (2) depth-based object recognition, which extracts feature information from depth modality data and feeds the extracted depth features into a classifier to identify the object; (3) methods combining RGB and depth modality information, which either fuse the RGB and depth data into four-channel picture data before extracting features, or extract features from the RGB and depth modality data separately and then combine them in a classifier for object recognition.
The patent application CN201510402298.3 discloses an RGB-D image classification method and system that uses the currently popular deep convolutional neural network (CNN) to extract RGB and depth features, splices them together manually, and trains an SVM via dictionary learning. A CNN is a data-driven method: it needs a large amount of labeled training data, but existing labeled RGB-D classification datasets are tiny compared with RGB datasets, so they cannot adequately support the CNN proposed in that invention and severe overfitting is likely. Moreover, many real-world cases are rare; for example, some apples bought and sold in fruit shops are largely occluded by the trademarks stuck on them, a situation that is almost never seen in collected RGB-D datasets. This long-tail distribution means that the resulting RGB-D classifier handles such cases poorly.
In view of the above, an important object of the present invention is to provide a new object recognition method based on RGB-D modality data, so as to solve the problems that existing RGB-D databases are small in scale and that the RGB-D classifiers trained on them recognize rare objects in the database with low accuracy.
Disclosure of Invention
Aiming at the problems that existing RGB-D databases are not large enough to support training a deep neural network and easily cause overfitting, while large-scale databases exhibit a severe long-tail distribution, the invention provides a new RGB-D-based method for constructing an object recognition classifier. The scheme is as follows:
A construction method of an object recognition classifier based on RGB-D comprises the following steps:
Step one: construct an RGB-D object recognition database [formula image], in which the RGB modality data are recorded as [formula image] and the depth modality data as [formula image];
Step two: identify and classify the collected RGB-D pictures and manually label the category of each picture, c* ∈ {1, 2, ..., C}, where C is the total number of categories of captured pictures;
Step three: transform the acquired pictures with the four transformation operations T = {t, s, r, c}, creating a proxy class for each picture to obtain an RGB-modality proxy-class training set [formula image] and a depth-modality proxy-class training set [formula image], where t denotes vertical and horizontal translation of the picture, s a size (scale) transformation of the picture, r a rotation of the picture and c a color transformation of the picture;
Step four: network training. Using the collected RGB modality data [formula image] and the proxy classes created from them [formula image], train an RGB network for object recognition; the pictures fed into the RGB training network are preprocessed by selectively occluding their most discriminative region, and the processed pictures are input into the network to train the RGB network;
Step five: network training. Using the collected depth modality data [formula image] and the proxy classes created from them [formula image], train a depth network for object recognition; the depth modality data undergo the same preprocessing as the RGB modality data, and the processed pictures are input into the depth training network to train the depth network;
Step six: network training. Fuse the RGB network and the depth network together by a classifier-fusion method to form an RGB-D object recognition network;
Step seven: network inference. Extract the features of RGB modality data with the RGB network of the RGB-D object recognition network;
Step eight: extract the features of depth modality data with the depth network of the RGB-D object recognition network;
Step nine: fuse the extracted RGB features and depth features together through fusion at the classifier layer, recording the fused feature as f_rgbd;
Step ten: send the fused feature f_rgbd to the classifier classifier_rgbd to identify the object.
Further, the process of extracting the features of RGB modality data with the RGB network in step seven is as follows: the collected pictures are first normalized, the normalized pictures are sent into a 5-layer convolutional network, a pooling layer follows the convolutions to give a feature map, and the feature map is input into a three-layer fully connected network to obtain the feature f_rgb.
Further, the process of extracting the features of depth modality data with the depth network in step eight is as follows: the collected pictures are first normalized to the same size, the normalized pictures are sent into a 5-layer convolutional network, a 5-layer multilayer perceptron produces a heat map of the feature map fed into the network, a randomly selected one-third area of the heat map is occluded, the occluded result is passed through a pooling layer to obtain a feature map, and that feature map is input into a two-layer fully connected network to obtain the feature f_depth.
Further, in step nine the features extracted from the two modalities are fused to construct the fusion feature f_rgbd as follows: the obtained f_rgb and f_depth are spliced together along the channel dimension. For a convolutional layer, the formula
[formula image]
is used, where l denotes the l-th network layer, feature_l the l-th layer feature map and stride_l the step size of the convolution kernel; for a pooling layer, the formula
[formula image]
is used, where kernel denotes the pooling kernel size of the pooling layer.
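The two formulas above are rendered only as images in the source. Based on the surrounding definitions (feature_l, stride_l, kernel), a plausible reconstruction, stated here as an assumption rather than the patent's exact expression, is the standard output-size relation:

```latex
% Assumed reconstruction of the image-rendered formulas (not the patent's exact notation)
\[
\text{convolutional layer:}\quad
\mathrm{feature}_{l+1} \;=\; \frac{\mathrm{feature}_{l} - \mathrm{kernel}_{l}}{\mathrm{stride}_{l}} + 1
\]
\[
\text{pooling layer:}\quad
\mathrm{feature}_{l+1} \;=\; \frac{\mathrm{feature}_{l} - \mathrm{kernel}}{\mathrm{stride}_{l}} + 1
\]
```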
Further, in step ten the classification of the classifier is computed as follows: the fusion feature of a test sample is extracted and input into the classifier, the trained classifier returns C values for the input RGB-D object image through SoftMax, and the class of the object is predicted as the one whose value is largest.
Compared with the prior art, the invention has the following advantages and positive effects:
The invention provides a novel RGB-D-based object classifier construction method that analyzes the category of a collected object from the collected multimodal information and takes the characteristics of, and relations between, the modalities into account during training. Most object classifiers are trained on features extracted from RGB images, i.e. on texture information, but texture alone cannot always separate classes: two mugs of similar color, for example, are hard to distinguish by texture, yet they can be distinguished by their depth (distance) relationship. The proposed method trains the object classifier on RGB feature information combined with depth feature information, which better matches real conditions.
Detailed Description
The invention provides a method for constructing an RGB-D object recognition classifier. Its main contribution is an adversarial-learning module built on top of the currently popular deep convolutional neural network (CNN): samples that are hard to classify are constructed artificially for the classifier (in this method, the most discriminative region of an image is occluded by hand; for example, when judging whether the animal in an image is a dog, the most discriminative region is the dog's head, so the head is occluded). Training on these artificially manufactured hard examples makes training harder but leaves the classifier more discriminative and more robust. In addition, the proposed method is end to end: no separate stage-by-stage optimization is needed, and the whole pipeline can be optimized directly from input to output.
In order that the above objects, features and advantages of the present invention can be more clearly understood, the present invention will be further described with reference to the following examples.
The embodiment provides a classifier construction method based on RGB-D object recognition, which comprises the following steps:
the method comprises the following steps: constructing an RGB-D object recognition database;
the method comprises the steps of collecting indoor general office articles by using a Microsoft depth sensor Kinect V1, placing the articles on a rotary platform, collecting the articles once every 5 degrees to form an RGB-D object identification database
Figure BDA0001641455200000051
Wherein the RGB modality data is recorded as
Figure BDA0001641455200000052
depth modal data note
Figure BDA0001641455200000053
Step two: manually identify and classify the collected RGB-D pictures and manually label the category of each picture, c* ∈ {1, 2, ..., C}, where C is the total number of categories of captured pictures.
and step three, transforming the acquired picture by using four transformation operations of T ═ T, s, r and c, wherein T operation represents that the picture is vertically and horizontally translated, s represents a scale factor, the size of the picture is multiplied by the scale factor to achieve the transformation operation of the size of the picture, r represents that the picture is rotated, and c represents a color transformation operation. Through the four operations, an agent class is created for each picture, and an RGB modal agent class training set is obtained
Figure BDA0001641455200000054
And depth modal proxy class training set
Figure BDA0001641455200000055
The proxy class for each captured picture is created as follows. The first operation, t: horizontal and vertical translation with offsets in (-0.2, 0.2). The second operation, s: a picture-size (scale) transformation with s in (0.5, 1). The third operation, r: rotation of the image by a random value in (-20, 20) degrees. The fourth operation, c: a color transformation, performed by converting the RGB picture into HSV color space; the S and V components are transformed as pow(x, a')·b' + c', where pow denotes exponentiation, x is the current S or V value, a' is a random number in (0.25, 4), b' a random number in (0.7, 2.1) and c' a random number in (-0.25, 0.25); the H component is transformed as y·d' + e', where y is the current H value, d' is a value in (0.7, 1.4) and e' a value in (-0.1, 0.1).
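As an illustration only, the following Python sketch applies the four transformations with the parameter ranges listed above to build a proxy class for one picture; it assumes OpenCV-style uint8 RGB images, and the function name and proxy-class size are hypothetical, not taken from the patent.

```python
import cv2
import numpy as np

def make_proxy_class(img, n=8, rng=np.random.default_rng()):
    """Build a proxy class for one picture with the transforms t, s, r, c
    (translation, scale, rotation, HSV color change). Illustrative sketch;
    parameter ranges follow step three of this embodiment."""
    h, w = img.shape[:2]
    proxies = []
    for _ in range(n):
        # t: horizontal/vertical translation, offsets drawn from (-0.2, 0.2) of the image size
        tx, ty = rng.uniform(-0.2, 0.2, size=2) * (w, h)
        out = cv2.warpAffine(img, np.float32([[1, 0, tx], [0, 1, ty]]), (w, h))
        # s: scale factor drawn from (0.5, 1)
        s = rng.uniform(0.5, 1.0)
        out = cv2.resize(out, None, fx=s, fy=s)
        # r: rotation by a random angle in (-20, 20) degrees
        hh, ww = out.shape[:2]
        rot = cv2.getRotationMatrix2D((ww / 2, hh / 2), rng.uniform(-20, 20), 1.0)
        out = cv2.warpAffine(out, rot, (ww, hh))
        # c: color transform in HSV space: S, V -> pow(x, a)*b + c ; H -> y*d + e
        hsv = cv2.cvtColor(out, cv2.COLOR_RGB2HSV).astype(np.float32)
        hsv[..., 0] /= 179.0   # hue to [0, 1]
        hsv[..., 1:] /= 255.0  # saturation/value to [0, 1]
        a, b, c = rng.uniform(0.25, 4), rng.uniform(0.7, 2.1), rng.uniform(-0.25, 0.25)
        d, e = rng.uniform(0.7, 1.4), rng.uniform(-0.1, 0.1)
        hsv[..., 1:] = np.clip(np.power(hsv[..., 1:], a) * b + c, 0, 1) * 255.0
        hsv[..., 0] = np.clip(hsv[..., 0] * d + e, 0, 1) * 179.0
        out = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)
        proxies.append(out)
    return proxies
```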
This step mainly addresses the tendency of small datasets to overfit: combining an existing RGB-D dataset with the four transformations proposed by the invention generates a proxy class for each picture, and all members of a proxy class share one label, so the dataset is enlarged without extra labeling cost. The main consideration here is how the number of pictures in each proxy class affects network training: if a proxy class contains too few pictures, overfitting still occurs easily; if it contains too many, the pictures are too similar, the information between the data becomes redundant, and extraction of effective picture features suffers.
Step four: network training. Using the collected RGB modality data [formula image] and the proxy classes created from them [formula image], an RGB network for object recognition is trained. Because real scenes are more complicated than the sampled training data, the pictures fed into the RGB training network are preprocessed so that the trained RGB network can handle complicated real situations: the most discriminative region of each input picture is selectively occluded, and the processed pictures are input into the network to train the RGB network.
the embodiment adopts the most intuitive and simple method for artificially manufacturing the training difficult sample, and of course, other methods can be adopted, for example, the most discriminant area of the feature map in the specific layer is shielded in the running process.
Step five: network training. Using the collected depth modality data [formula image] and the proxy classes created from them [formula image], a depth network for object recognition is trained; the depth modality data undergo the same preprocessing as the RGB modality data, and the processed pictures are input into the depth training network to train the depth network.
Step six: network training. The RGB network and the depth network are fused together by a classifier-fusion method to form the RGB-D object recognition network.
Step seven: network inference. The features of RGB modality data are extracted with the RGB network of the RGB-D object recognition network.
First, the collected pictures are normalized to s × s (s is a fixed value with no special meaning; in this embodiment the pictures are normalized to 257 × 257). The normalized pictures are then sent into a 5-layer convolutional network: the first convolutional layer has 96 kernels of size 11 × 11, the second 256 kernels of size 5 × 5, the third 384 kernels of size 3 × 3, the fourth 384 kernels of size 3 × 3 and the fifth 256 kernels of size 3 × 3; this last convolution is followed by a pooling layer, giving a feature map denoted fm_rgb. fm_rgb is then input into a three-layer fully connected network in which the first fully connected layer outputs 4096 values and the second fully connected layer outputs 4096 values; the feature obtained after passing through these fully connected layers is f_rgb.
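For orientation, here is a minimal PyTorch-style sketch of the RGB branch described above. Kernel counts and sizes follow the text; strides, padding, the ReLU activations and the exact pooling configuration are not specified in the patent and are assumptions, and only the two fully connected layers whose sizes are given (4096, 4096) are included.

```python
import torch
import torch.nn as nn

class RGBBranch(nn.Module):
    """Sketch of the RGB feature network of step seven. Kernel counts/sizes
    follow the description; strides, padding and pooling are assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),   # pooling layer after the fifth convolution
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(inplace=True),   # first fully connected layer (4096)
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),  # second fully connected layer (4096)
        )

    def forward(self, x):
        fm_rgb = self.features(x)   # feature map fm_rgb
        return self.fc(fm_rgb)      # feature f_rgb

# usage sketch: f_rgb = RGBBranch()(torch.randn(1, 3, 257, 257))
```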
Step eight: the features of depth modality data are extracted with the depth network of the RGB-D object recognition network.
First, the collected pictures are normalized to the same size s × s and sent into a 5-layer convolutional network: the first convolutional layer has 96 kernels of size 11 × 11, the second 256 kernels of size 5 × 5, the third 384 kernels of size 3 × 3, the fourth 384 kernels of size 3 × 3 and the fifth 256 kernels of size 3 × 3. The resulting feature map is passed through a further 5-layer network with 3 × 3 kernels in each layer, a 5-layer multilayer perceptron produces a heat map of the feature map fed into the network, a randomly selected one-third area of the heat map is occluded, and the occluded result is input into the next pooling layer, giving a feature map denoted fm_depth. fm_depth is then input into a two-layer fully connected network in which the first fully connected layer outputs 4096 values and the second fully connected layer outputs 4096 values; the feature obtained after these two fully connected layers is f_depth.
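The heat-map-guided occlusion of step eight could be sketched as below; this is only one reading of the text, in which the "one third area" is taken as a rectangular block covering roughly a third of the feature-map positions, placed near the hottest point of the heat map and jittered randomly.

```python
import torch

def mask_third_of_heatmap(fm, heat, frac=1/3):
    """Occlude a block covering roughly `frac` of the feature-map area before
    the pooling layer. The block is placed near the hottest position of `heat`
    (shape n x 1 x h x w) and jittered randomly; an illustrative reading only."""
    n, c, h, w = fm.shape
    bh, bw = max(1, int(h * frac ** 0.5)), max(1, int(w * frac ** 0.5))
    out = fm.clone()
    for i in range(n):
        cy, cx = divmod(int(heat[i, 0].argmax()), w)
        y0 = min(max(cy - bh // 2 + int(torch.randint(-bh, bh + 1, (1,))), 0), h - bh)
        x0 = min(max(cx - bw // 2 + int(torch.randint(-bw, bw + 1, (1,))), 0), w - bw)
        out[i, :, y0:y0 + bh, x0:x0 + bw] = 0
    return out
```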
Step nine: the extracted RGB features and depth features are fused together through fusion at the classifier layer, and the fused feature is recorded as f_rgbd. The method is as follows:
The obtained f_rgb and f_depth are spliced together along the channel dimension. Suppose the feature map f_rgb obtained in step seven has dimensions n × h × w × c, where n is the number of pictures fed into the network at one time, h the length of the feature map input into this layer, w its width and c its number of channels. For the first layer of the network, i.e. the input layer, n is set according to the hardware (n ≥ 1), h and w are determined by the size of the input pictures, and c = 3 for an RGB image or c = 1 for a grayscale or depth picture. In subsequent layers n stays unchanged, c is determined by the number of convolution kernels of the previous layer, and h and w are determined by the type of the previous layer: for a convolutional layer the formula
[formula image]
is used, where l denotes the l-th network layer, feature_l the l-th layer feature map and stride_l the step size of the convolution kernel; for a pooling layer the formula
[formula image]
is used, where kernel denotes the pooling kernel size of the pooling layer. h and w are respectively the second and third dimensions of the computed feature_{l+1}. Finally f_rgb and f_depth are concatenated along the fourth (channel) dimension.
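As a small illustration of the splice in step nine, assuming the n × h × w × c layout used above (so the channel axis is the fourth dimension), the fusion amounts to a concatenation along that axis; the function below is a hypothetical helper, not the patent's code.

```python
import torch

def fuse_features(f_rgb, f_depth):
    """Splice the RGB and depth feature maps along the channel dimension.
    With the n x h x w x c layout of step nine the channel axis is dim=3."""
    return torch.cat([f_rgb, f_depth], dim=3)

# usage sketch with made-up sizes: two maps of 2 x 6 x 6 x 256 fuse to 2 x 6 x 6 x 512
f_rgbd = fuse_features(torch.randn(2, 6, 6, 256), torch.randn(2, 6, 6, 256))
```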
Step ten: the fused feature f_rgbd is sent to the classifier classifier_rgbd, and the object is identified.
The feature map obtained in step nine is input into a SoftMax classifier to train the classifier; at this point the RGB-D-based object recognition classifier is constructed. The classification result is computed as follows: the fused feature of a test sample is extracted and input into the classifier, the trained classifier returns C values for the input RGB-D object image through SoftMax, and the category corresponding to the largest of the C values is the category to which the test sample belongs.
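A minimal sketch of this classification step; the feature dimension and the number of categories C below are illustrative values, not taken from the patent.

```python
import torch
import torch.nn as nn

feat_dim, C = 8192, 51                        # illustrative sizes only
classifier_rgbd = nn.Sequential(nn.Linear(feat_dim, C), nn.Softmax(dim=1))

f_rgbd = torch.randn(1, feat_dim)             # fused feature of one test sample
scores = classifier_rgbd(f_rgbd)              # the C values returned via SoftMax
predicted_class = int(scores.argmax(dim=1))   # category with the largest value
```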
The execution environment of the invention is a computer with a 3.3 GHz central processing unit and 8 GB of memory; to accelerate training and inference of the object recognition network, four NVIDIA GeForce GTX 1080 Ti graphics cards are used for accelerated computation. The construction program for the RGB-D-based object recognition classifier is written in C++ and Python; other execution environments may also be used and are not described further here.
The invention mainly studies how to train an object recognition deep neural network on a small-scale RGB-D dataset without overfitting: a set of transformation rules is proposed and applied to the training sample image patches to generate a proxy class for each sample, which allows the deep neural network to be trained on a small-scale dataset. For the long-tail distribution in the dataset, i.e. samples so rare that their number in the training set cannot support deep neural network learning, an adversarial learning network is proposed to address the low recognition accuracy on hard examples. Together these two methods improve the accuracy and robustness of the constructed object recognition network.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention to this form. Any person skilled in the art may, without departing from the technical spirit of the present invention, use the disclosed content to make equivalent embodiments with equivalent changes; any simple modification, equivalent change or variation made to the above embodiment according to the technical essence of the present invention still falls within the protection scope of the present invention.

Claims (3)

1. A construction method of an object recognition classifier based on RGB-D is characterized by comprising the following steps:
step one, constructing an RGB-D object recognition database [formula image], in which the RGB modality data are recorded as [formula image] and the depth modality data as [formula image];
Step two, identifying and classifying the collected RGB-D pictures, manually calibrating the category of each picture, c* ∈ {1, 2, ..., C}, where C represents the total number of categories of captured pictures;
step three, transforming the acquired pictures by the four transformation operations T = {t, s, r, c}; creating a proxy class for each picture to obtain an RGB-modality proxy-class training set [formula image] and a depth-modality proxy-class training set [formula image], wherein t represents vertical and horizontal translation of the picture, s a size (scale) transformation of the picture, r a rotation of the picture and c a color transformation of the picture;
step four, network training: using the collected RGB modality data [formula image] and the proxy classes created from them [formula image], training an RGB network for object recognition; preprocessing the pictures input into the RGB training network by selectively occluding their most discriminative region, and inputting the processed pictures into the network to train the RGB network;
step five, network training: using the collected depth modality data [formula image] and the proxy classes created from them [formula image], training a depth network for object recognition; applying to the depth modality data the same preprocessing as to the RGB modality data, and inputting the processed pictures into the depth training network to train the depth network;
step six, network training: fusing the RGB network and the depth network together by a classifier-fusion method to form an RGB-D object recognition network;
step seven, network inference: extracting the features of RGB modality data with the RGB network of the RGB-D object recognition network;
step eight, extracting the features of depth modality data with the depth network of the RGB-D object recognition network;
step nine, fusing the extracted RGB features and depth features together through fusion at the classifier layer, and recording the fused feature as f_rgbd;
step ten, sending the fused feature f_rgbd to the classifier classifier_rgbd and identifying the object;
the process of extracting the features of the RGB modal data by using the RGB network in the seventh step is as follows: firstly, normalizing the collected pictures, then sending the normalized pictures into a 5-layer convolution network, obtaining a characteristic diagram after convolution and then connecting a pooling layer, and inputting the obtained characteristic diagram into a three-layer full-connection network to obtain a characteristic diagram frgb
In step eight, the process of extracting the features of depth modality data with the depth network is as follows: the collected pictures are first normalized to the same size, the normalized pictures are sent into a 5-layer convolutional network, a heat map of the feature map fed into the network is obtained through a 5-layer multilayer perceptron, a randomly selected one-third area of the heat map is occluded, the occluded picture is input into a pooling layer to obtain a feature map, and the obtained feature map is input into a two-layer fully connected network to obtain the feature f_depth.
2. The RGB-D based object recognition classifier construction method of claim 1, wherein: in step nine the features extracted from the two modalities are fused to construct the fusion feature f_rgbd as follows: the obtained f_rgb and f_depth are spliced together along the channel dimension; for a convolutional layer, the formula
[formula image]
is used to calculate h and w of the feature map after the convolutional layer, where l represents the l-th network layer, feature_l the l-th layer feature map and stride_l the step size of the convolution kernel; for a pooling layer, the formula
[formula image]
is used to calculate h and w of the feature map after the pooling layer, where kernel represents the pooling kernel size of the pooling layer; h represents the length of the feature map input into the layer and w represents its width.
3. The RGB-D based object recognition classifier construction method of claim 2, wherein: in step ten the classification of the classifier is computed as follows: the fusion feature of a test sample is extracted and input into the classifier, the trained classifier returns C values for the input RGB-D object image through SoftMax, and the class of the object is predicted as the one whose value is largest.
CN201810383002.1A 2018-04-26 2018-04-26 Object recognition classifier construction method based on RGB-D Active CN108596256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810383002.1A CN108596256B (en) 2018-04-26 2018-04-26 Object recognition classifier construction method based on RGB-D

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810383002.1A CN108596256B (en) 2018-04-26 2018-04-26 Object recognition classifier construction method based on RGB-D

Publications (2)

Publication Number Publication Date
CN108596256A CN108596256A (en) 2018-09-28
CN108596256B (en) 2022-04-01

Family

ID=63609380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810383002.1A Active CN108596256B (en) 2018-04-26 2018-04-26 Object recognition classifier construction method based on RGB-D

Country Status (1)

Country Link
CN (1) CN108596256B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084141B (en) * 2019-04-08 2021-02-09 南京邮电大学 Cross-domain scene recognition method based on private information
CN110348333A (en) * 2019-06-21 2019-10-18 深圳前海达闼云端智能科技有限公司 Object detecting method, device, storage medium and electronic equipment
CN111401426B (en) * 2020-03-11 2022-04-08 西北工业大学 Small sample hyperspectral image classification method based on pseudo label learning
CN115240106B (en) * 2022-07-12 2023-06-20 北京交通大学 Task self-adaptive small sample behavior recognition method and system
CN115496077B (en) * 2022-11-18 2023-04-18 之江实验室 Multimode emotion analysis method and device based on modal observation and grading

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224942A (en) * 2015-07-09 2016-01-06 华南农业大学 A kind of RGB-D image classification method and system
CN106228177A (en) * 2016-06-30 2016-12-14 浙江大学 Daily life subject image recognition methods based on convolutional neural networks
CN106778810A (en) * 2016-11-23 2017-05-31 北京联合大学 Original image layer fusion method and system based on RGB feature Yu depth characteristic
WO2017088125A1 (en) * 2015-11-25 2017-06-01 中国科学院自动化研究所 Dense matching relation-based rgb-d object recognition method using adaptive similarity measurement, and device
CN107194380A (en) * 2017-07-03 2017-09-22 上海荷福人工智能科技(集团)有限公司 The depth convolutional network and learning method of a kind of complex scene human face identification
CN107341440A (en) * 2017-05-08 2017-11-10 西安电子科技大学昆山创新研究院 Indoor RGB D scene image recognition methods based on multitask measurement Multiple Kernel Learning
CN107423698A (en) * 2017-07-14 2017-12-01 华中科技大学 A kind of gesture method of estimation based on convolutional neural networks in parallel
CN107424205A (en) * 2017-07-11 2017-12-01 北京航空航天大学 A kind of joint estimating method estimated based on image round the clock carrying out three-dimensional facade layout
CN107480704A (en) * 2017-07-24 2017-12-15 南开大学 It is a kind of that there is the real-time vision method for tracking target for blocking perception mechanism
WO2018045363A1 (en) * 2016-09-02 2018-03-08 Gargeya Rishab Screening method for automated detection of vision-degenerative diseases from color fundus images
CN107909065A (en) * 2017-12-29 2018-04-13 百度在线网络技术(北京)有限公司 The method and device blocked for detecting face

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778676A (en) * 2014-01-09 2015-07-15 中国科学院大学 Depth ranging-based moving target detection method and system
CN107085731B (en) * 2017-05-11 2020-03-10 湘潭大学 Image classification method based on RGB-D fusion features and sparse coding
CN107578060B (en) * 2017-08-14 2020-12-29 电子科技大学 Method for classifying dish images based on depth neural network capable of distinguishing areas

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224942A (en) * 2015-07-09 2016-01-06 华南农业大学 A kind of RGB-D image classification method and system
WO2017088125A1 (en) * 2015-11-25 2017-06-01 中国科学院自动化研究所 Dense matching relation-based rgb-d object recognition method using adaptive similarity measurement, and device
CN106228177A (en) * 2016-06-30 2016-12-14 浙江大学 Daily life subject image recognition methods based on convolutional neural networks
WO2018045363A1 (en) * 2016-09-02 2018-03-08 Gargeya Rishab Screening method for automated detection of vision-degenerative diseases from color fundus images
CN106778810A (en) * 2016-11-23 2017-05-31 北京联合大学 Original image layer fusion method and system based on RGB feature Yu depth characteristic
CN107341440A (en) * 2017-05-08 2017-11-10 西安电子科技大学昆山创新研究院 Indoor RGB D scene image recognition methods based on multitask measurement Multiple Kernel Learning
CN107194380A (en) * 2017-07-03 2017-09-22 上海荷福人工智能科技(集团)有限公司 The depth convolutional network and learning method of a kind of complex scene human face identification
CN107424205A (en) * 2017-07-11 2017-12-01 北京航空航天大学 A kind of joint estimating method estimated based on image round the clock carrying out three-dimensional facade layout
CN107423698A (en) * 2017-07-14 2017-12-01 华中科技大学 A kind of gesture method of estimation based on convolutional neural networks in parallel
CN107480704A (en) * 2017-07-24 2017-12-15 南开大学 It is a kind of that there is the real-time vision method for tracking target for blocking perception mechanism
CN107909065A (en) * 2017-12-29 2018-04-13 百度在线网络技术(北京)有限公司 The method and device blocked for detecting face

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features; Max Schwarz et al.; 2015 IEEE International Conference on Robotics and Automation (ICRA); 2015-07-02; pp. 1050-4729 *
A Survey of RGB-D Image Classification Methods (《RGB-D图像分类方法研究综述》); Tu Shuqin et al.; Laser & Optoelectronics Progress; 2016-12-31; vol. 53, no. 06; pp. 35-48 *
Background projection modeling for adaptive infrared stealth *** (《自适应红外隐身***的背景投影建模》); Zhang Dongxiao et al.; Infrared and Laser Engineering; 2016-03-25; vol. 45, no. 03; pp. 36-42 *

Also Published As

Publication number Publication date
CN108596256A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
CN108596256B (en) Object recognition classifier construction method based on RGB-D
Mascarenhas et al. A comparison between VGG16, VGG19 and ResNet50 architecture frameworks for Image Classification
Jain et al. Hybrid deep neural networks for face emotion recognition
Shao et al. Performance evaluation of deep feature learning for RGB-D image/video classification
Cruz et al. Detection of grapevine yellows symptoms in Vitis vinifera L. with artificial intelligence
CN110532900B (en) Facial expression recognition method based on U-Net and LS-CNN
WO2021022970A1 (en) Multi-layer random forest-based part recognition method and system
Gammulle et al. Fine-grained action segmentation using the semi-supervised action GAN
CN108596102B (en) RGB-D-based indoor scene object segmentation classifier construction method
CN114821014B (en) Multi-mode and countermeasure learning-based multi-task target detection and identification method and device
CN108009560B (en) Commodity image similarity category judgment method and device
Su et al. LodgeNet: Improved rice lodging recognition using semantic segmentation of UAV high-resolution remote sensing images
Gonçalves et al. Carcass image segmentation using CNN-based methods
Naseer et al. Pixels to precision: features fusion and random forests over labelled-based segmentation
US20240232627A1 (en) Systems and Methods to Train A Cell Object Detector
Pang et al. Dance video motion recognition based on computer vision and image processing
Ahmed et al. Robust Object Recognition with Genetic Algorithm and Composite Saliency Map
CN111612090B (en) Image emotion classification method based on content color cross correlation
Fujii et al. Hierarchical group-level emotion recognition in the wild
CN112800979A (en) Dynamic expression recognition method and system based on characterization flow embedded network
CN117333948A (en) End-to-end multi-target broiler behavior identification method integrating space-time attention mechanism
Ruan et al. Facial expression recognition in facial occlusion scenarios: A path selection multi-network
Kumar et al. Deep Learning-Based Web Application for Real-Time Apple Leaf Disease Detection and Classification
Wang et al. Strawberry ripeness classification method in facility environment based on red color ratio of fruit rind
CN108197593A (en) More size face's expression recognition methods and device based on three-point positioning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant