CN115648215B - Service robot grabbing detection method based on attention mechanism and twin deconvolution - Google Patents

Service robot grabbing detection method based on attention mechanism and twin deconvolution

Info

Publication number
CN115648215B
Authority
CN
China
Prior art keywords
feature map
grabbing
module
color image
attention
Prior art date
Legal status
Active
Application number
CN202211376120.2A
Other languages
Chinese (zh)
Other versions
CN115648215A (en)
Inventor
李忠辉
曹志强
王硕
任广力
谭民
亢晋立
Current Assignee
Beijing Nengchuang Technology Co ltd
Institute of Automation of Chinese Academy of Science
Original Assignee
Beijing Nengchuang Technology Co ltd
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Beijing Nengchuang Technology Co ltd, Institute of Automation of Chinese Academy of Science
Priority to CN202211376120.2A
Publication of CN115648215A
Application granted
Publication of CN115648215B


Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of service robots, and in particular relates to a service robot grabbing detection method, system and device based on an attention mechanism and twin deconvolution, aiming to solve the problem that existing grabbing regression networks based on an encoding-decoding structure present checkerboard artifacts during decoding, which degrades the grabbing detection performance of the service robot. The method comprises the following steps: acquiring an original color image and an original depth image containing a target object; acquiring a bounding box of the target object to obtain a first depth image area and a first color image area; adjusting the image areas to a set size; encoding the resized image areas; refining the encoded feature map; decoding the refined feature map; and obtaining an optimal grabbing detection frame, thereby realizing grabbing detection of the target object. The invention eliminates the checkerboard artifacts produced by encoding-decoding grabbing regression networks during decoding and improves grabbing detection performance.

Description

Service robot grabbing detection method based on attention mechanism and twin deconvolution
Technical Field
The invention belongs to the technical field of service robots, and particularly relates to a service robot grabbing detection method, system and device based on an attention mechanism and twin deconvolution.
Background
In recent years, the development of artificial intelligence and computer vision has widened the application range of service robots in daily life. To better serve humans, service robots equipped with robotic arms that can provide manipulation functions such as grasping have become a research hotspot. To grasp a target object, a bounding box of the object is usually obtained first by a deep-learning-based object detection method (e.g., Faster R-CNN), and then the optimal grabbing position of the target object is obtained by a grabbing detection method; such grabbing detection methods are now receiving increasing attention.
Traditional grabbing detection methods are generally based on a three-dimensional model of the object and perform grabbing detection by model matching, which scales poorly to new objects. In recent years, grabbing detection methods based on convolutional neural networks have become mainstream and mainly fall into two categories: candidate-evaluation-based methods and regression-based methods. Candidate-evaluation-based grabbing detection methods generally divide grabbing detection into two stages: a number of candidate grabbing positions are first obtained by sampling or similar methods, the candidates are then evaluated and ranked, and the top-ranked candidate is taken as the best grasp. Because every candidate must be evaluated and features may be extracted repeatedly, these methods are time-consuming. Regression-based grabbing detection methods analyze the whole image directly and offer good real-time performance; they can be further subdivided into grabbing regression networks based on an encoding structure and grabbing regression networks based on an encoding-decoding structure. Encoding-structure grabbing regression networks predict the optimal grabbing position directly by regression, but the generated optimal grasp tends toward the average of the grabbing positions, and when multiple solutions exist this average may be an invalid grasp. Encoding-decoding grabbing regression networks up-sample the feature map output by the encoder through a deconvolution-based decoder, so the optimal grasp corresponding to each pixel position can be predicted; some researchers add a feature refinement module between the encoder and the decoder, processing the encoder output with channel attention, channel random mixing (Channel Shuffle) and similar operations to provide more discriminative features to the decoder. Encoding-decoding grabbing regression networks achieve higher detection accuracy while preserving real-time performance, but the deconvolution used during decoding suffers from uneven overlap of convolution results, which causes the up-sampled result to present checkerboard artifacts and thereby degrades grabbing detection performance. How to eliminate the influence of checkerboard artifacts and further improve grabbing detection performance still needs deeper study.
Therefore, a solution to the above-mentioned problems is urgently needed by those skilled in the art.
Disclosure of Invention
In order to solve the above problem in the prior art, namely that existing grabbing regression networks based on an encoding-decoding structure present checkerboard artifacts during decoding and thereby reduce the grabbing detection performance of the service robot, the invention provides a service robot grabbing detection method based on an attention mechanism and twin deconvolution, which comprises the following steps:
step S10, the service robot acquires an original color image and an original depth image containing a target object through a vision sensor;
step S20, based on the original color image, obtaining a boundary box of a target object through an object detection method based on deep learning, and taking corresponding areas of the boundary box in the original depth image and the original color image as a first depth image area and a first color image area;
step S30, the first depth image area and the first color image area are respectively adjusted to set sizes and used as a second depth image area and a second color image area;
Step S40, splicing the second depth image area and the second color image area along the channel direction, and inputting the result into the encoder of the grabbing detection convolutional neural network to obtain a first feature map;
step S50, performing feature refinement on the first feature map through a feature refinement module of the grabbing detection convolutional neural network to obtain a second feature map;
step S60, up-sampling the second characteristic map through a decoder of the grabbing detection convolutional neural network to obtain a grabbing quality characteristic map, a width characteristic map, a first angle characteristic map and a second angle characteristic map;
step S70, obtaining an optimal grabbing rectangle based on the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map, and further obtaining an optimal grabbing detection frame of the target object in the original color image, so as to realize grabbing detection of the target object;
the values of the pixel points in the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map respectively describe the grabbing quality evaluation value, the width, the cosine of twice the orientation angle and the sine of twice the orientation angle of the grabbing rectangle centered on the corresponding pixel point.
In some preferred embodiments, the grabbing detection convolutional neural network comprises an encoder, a feature refinement module and a decoder;
the encoder comprises three cascaded standard convolution layers;
the characteristic fine modification module consists of a cross-amplitude attention module, a channel random mixing operation and a standard convolution layer which are connected in series;
the decoder comprises three cascaded twin deconvolution modules and four parallel standard convolutions; the three cascaded twin deconvolution modules are connected with four parallel standard convolutions;
the twin deconvolution module comprises two branches, an original branch and a twin branch; the input of the original branch is a feature map F_in, and a standard deconvolution operation is applied to each channel of F_in to generate a feature map F_t; the input of the twin branch is a matrix whose elements are all 1, and a deconvolution operation is applied to this all-ones matrix to generate a weight adjustment matrix M_w;
the feature map F_t is divided element-wise by the matrix M_w, and the result of this element-wise matrix division is passed through a convolution layer for channel dimension adjustment to obtain a feature map F_out as the output of the twin deconvolution module.
In some preferred embodiments, the first feature map is subjected to feature refinement to obtain a second feature map, and the method includes:
step S501, processing the first feature map by using cross-amplitude attention, to obtain a cross-amplitude attention feature map;
step S502, the cross-amplitude attention feature map sequentially passes through a channel attention module, a channel random mixing operation and a standard convolution layer to obtain a second feature map.
In some preferred embodiments, the first feature map is processed using cross-amplitude attention to obtain a cross-amplitude attention feature map as follows:
respectively applying average pooling along the height and width dimensions to the first feature map to obtain feature maps AVG_h and AVG_w corresponding to the height and width dimensions;
multiplying AVG_h and AVG_w as matrices to generate an average cross feature map F_avg;
respectively applying maximum pooling along the height and width dimensions to the first feature map to obtain feature maps MAX_h and MAX_w corresponding to the height and width dimensions;
multiplying MAX_h and MAX_w as matrices to generate a maximum cross feature map F_max;
performing element-wise matrix addition of F_avg and F_max, and sequentially applying average pooling along the channel direction, standard convolution, batch normalization and a ReLU activation function to obtain a cross attention map A_cross;
respectively applying maximum pooling and average pooling along the channel dimension to the first feature map to obtain feature maps MAX_c and AVG_c corresponding to the channel dimension;
performing element-wise matrix subtraction of MAX_c and AVG_c, and sequentially applying standard convolution, batch normalization and a ReLU activation function to obtain an amplitude attention map A_amp;
splicing A_cross and A_amp along the channel direction and applying a standard convolution to the spliced result to obtain an attention map A_fuse; multiplying A_fuse element-wise with the feature map of each channel in the first feature map to obtain the cross-amplitude attention feature map.
In some preferred embodiments, the method for obtaining the best grabbing rectangle based on the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map, and further obtaining the best grabbing detection frame of the target object in the original color image comprises the following steps:
selecting the position of the pixel point with the maximum grabbing quality evaluation value in the grabbing quality feature map as the center position of the optimal grabbing rectangle, denoted (u*, v*);
respectively obtaining the values of the first angle feature map and the second angle feature map at position (u*, v*), denoted c* and s*, and then calculating the orientation angle θ* of the optimal grabbing rectangle as θ* = (1/2)·arctan(s*/c*);
obtaining the value of the width feature map at position (u*, v*) as the width w* of the optimal grabbing rectangle, and setting the height h* of the optimal grabbing rectangle to half of w*, i.e. h* = w*/2;
combining u*, v*, θ*, w* and h* to obtain the optimal grabbing rectangle;
obtaining the coordinates of the four vertexes of the optimal grabbing rectangle, then obtaining the corresponding points of these four vertexes in the original color image, denoted P_1, P_2, P_3 and P_4;
constructing the optimal grabbing detection frame of the target object in the original color image with P_1, P_2, P_3 and P_4 as vertexes.
In some preferred embodiments, the loss function of the grab detection convolutional neural network is:
where L_loss is the loss function of the grabbing detection convolutional neural network during training, and N is the total number of training samples; Q_i, W_i, C_i and S_i are respectively the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map output by the grabbing detection convolutional neural network for the i-th sample, and the remaining symbols in the formula denote the ground-truth values corresponding to Q_i, W_i, C_i and S_i, i = 1, 2, …, N.
In a second aspect of the present invention, a service robot grabbing detection system based on an attention mechanism and twin deconvolution is provided, comprising: an image acquisition module, an object detection module, a size adjustment module, an encoding module, a refinement module, a decoding module and a grabbing detection module;
The image acquisition module is configured to acquire an original color image and an original depth image containing a target object through a visual sensor by a service robot;
the object detection module is configured to acquire a boundary box of a target object through an object detection method based on deep learning based on the original color image, and take corresponding areas of the boundary box in the original depth image and the original color image as a first depth image area and a first color image area;
the size adjusting module is configured to respectively adjust the first depth image area and the first color image area to set sizes as a second depth image area and a second color image area;
the encoding module is configured to splice the second depth image area and the second color image area along the channel direction and input the result into the encoder of the grabbing detection convolutional neural network to obtain a first feature map;
the refinement module is configured to perform feature refinement on the first feature map through the feature refinement module of the grabbing detection convolutional neural network to obtain a second feature map;
the decoding module is configured to up-sample the second feature map through a decoder of the grabbing detection convolutional neural network to obtain a grabbing quality feature map, a width feature map, a first angle feature map and a second angle feature map;
The grabbing detection module is configured to obtain an optimal grabbing rectangle based on the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map, so as to obtain an optimal grabbing detection frame of the target object in the original color image, and grabbing detection of the target object is achieved;
the values of the pixel points in the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map respectively describe the grabbing quality evaluation value, the width, the cosine of twice the orientation angle and the sine of twice the orientation angle of the grabbing rectangle centered on the corresponding pixel point.
In a third aspect of the present invention, a storage device is presented in which a plurality of programs are stored, said programs being adapted to be loaded and executed by a processor to implement the above-described method of service robot grip detection based on an attention mechanism and twin deconvolution.
In a fourth aspect of the present invention, a processing device is provided, including a processor and a storage device; a processor adapted to execute each program; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above described method of service robot grip detection based on an attention mechanism and twin deconvolution.
The invention has the beneficial effects that:
according to the invention, feature coding is carried out through a coder based on standard convolution, a feature trimming module based on an attention mechanism is adopted to trim the features of the feature image output by the coder, and finally, up-sampling is carried out through a decoder based on twin deconvolution to obtain a grabbing quality feature image, a width feature image, a first angle feature image and a second angle feature image, so that an optimal grabbing rectangle and an optimal grabbing detection frame thereof in an original color image are obtained, and grabbing detection of a target object is realized. The invention eliminates checkerboard artifacts presented by the grabbing regression network based on the encoding-decoding structure in the decoding process, improves the grabbing detection performance, and provides technical support for grabbing detection of service robots in environments such as offices, houses and the like.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings.
FIG. 1 is a flow diagram of a method for serving robot grip detection based on an attention mechanism and twin deconvolution in accordance with one embodiment of the present invention;
FIG. 2 is a schematic diagram of a twin deconvolution module in accordance with one embodiment of the present invention;
Fig. 3 is a schematic diagram of a frame of a service robot grip detection system based on an attention mechanism and twin deconvolution in accordance with an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The present application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other.
The invention discloses a service robot grabbing detection method based on an attention mechanism and twin deconvolution, which is shown in fig. 1 and comprises the following steps of:
Step S10, the service robot acquires an original color image and an original depth image containing a target object through a vision sensor;
step S20, based on the original color image, obtaining a boundary box of a target object through an object detection method based on deep learning, and taking corresponding areas of the boundary box in the original depth image and the original color image as a first depth image area and a first color image area;
step S30, the first depth image area and the first color image area are respectively adjusted to set sizes and used as a second depth image area and a second color image area;
step S40, splicing the second depth image area and the second color image area along the channel direction, and inputting the result into the encoder of the grabbing detection convolutional neural network to obtain a first feature map;
step S50, performing feature refinement on the first feature map through a feature refinement module of the grabbing detection convolutional neural network to obtain a second feature map;
step S60, up-sampling the second characteristic map through a decoder of the grabbing detection convolutional neural network to obtain a grabbing quality characteristic map, a width characteristic map, a first angle characteristic map and a second angle characteristic map;
Step S70, obtaining an optimal grabbing rectangle based on the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map, and further obtaining an optimal grabbing detection frame of the target object in the original color image, so as to realize grabbing detection of the target object;
the values of the pixel points in the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map respectively describe the grabbing quality evaluation value, the width, the cosine of twice the orientation angle and the sine of twice the orientation angle of the grabbing rectangle centered on the corresponding pixel point.
In order to more clearly describe the method for detecting the grabbing of the service robot based on the attention mechanism and the twin deconvolution, each step in one embodiment of the method of the present invention is described in detail below.
This embodiment is a preferred implementation. A grabbing detection convolutional neural network comprising an encoder, a feature refinement module and a decoder is constructed in advance and trained on a pre-constructed training sample set with a pre-designed loss function (described in detail below) to obtain the parameters of the encoder, the feature refinement module and the decoder; the trained network is then applied in the service robot grabbing detection method based on the attention mechanism and twin deconvolution. In addition, the grasp corresponding to each pixel position is described by a grabbing rectangle and parameterized by the five-tuple (u, v, θ, w, h), where (u, v) is the center of the grabbing rectangle, θ is the orientation angle of the grabbing rectangle, described by the angle between the width direction of the rectangle and the horizontal direction of the image, and w and h are the width and height of the rectangle, respectively, with h = 0.5w. Each grabbing rectangle has an associated grabbing quality evaluation value q ∈ [0, 1].
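For readability, the five-tuple parameterization above can be captured in a small data structure. The following Python sketch is purely illustrative (the class name and field names are not part of the patent) and only encodes the conventions stated in this paragraph, including h = 0.5w and q ∈ [0, 1].

```python
from dataclasses import dataclass

@dataclass
class GraspRect:
    """Grasp hypothesis (u, v, theta, w, h) with quality score q."""
    u: float      # column of the rectangle centre in the image
    v: float      # row of the rectangle centre in the image
    theta: float  # angle between the rectangle width direction and the image horizontal
    w: float      # rectangle width (gripper opening direction)
    q: float      # grasp quality evaluation value in [0, 1]

    @property
    def h(self) -> float:
        # height is fixed to half the width, as specified in the text
        return 0.5 * self.w
```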
The service robot grabbing detection method based on the attention mechanism and the twin deconvolution realizes the grabbing detection process of the target object, and specifically comprises the following steps:
step S10, the service robot acquires an original color image and an original depth image containing a target object through a vision sensor;
in the present embodiment, the service robot acquires an original color image and an original depth image containing the target object through a Kinect sensor mounted on itself.
Step S20, based on the original color image, obtaining a boundary box of a target object through an object detection method based on deep learning, and taking corresponding areas of the boundary box in the original depth image and the original color image as a first depth image area and a first color image area;
in this embodiment, the object detection method based on deep learning is applied to the original color image to detect the target object and obtain its bounding box, and the coordinates of the four vertexes of the bounding box are then obtained. The image area corresponding to these four coordinates in the original depth image is called the first depth image area of the target object and denoted D_o; the image area corresponding to these four coordinates in the original color image is called the first color image area of the target object and denoted G_o. In the present invention, the object detection method based on deep learning preferably employs Faster R-CNN.
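A minimal sketch of step S20 is given below. It assumes the off-the-shelf Faster R-CNN detector shipped with torchvision (the patent only states that a deep-learning detector, preferably Faster R-CNN, is used and does not specify an implementation), and the helper name detect_target is hypothetical.

```python
import torch
import torchvision

# Pretrained Faster R-CNN from torchvision (>= 0.13); the patent's own detector would be
# trained for its target objects, so this model is only an illustrative stand-in.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def detect_target(color_chw: torch.Tensor):
    """color_chw: original color image as a float CxHxW tensor with values in [0, 1]."""
    with torch.no_grad():
        out = detector([color_chw])[0]
    # keep the highest-scoring detection as the target object bounding box
    x1, y1, x2, y2 = out["boxes"][out["scores"].argmax()].tolist()
    return int(x1), int(y1), int(x2), int(y2)
```

The returned box is then cut out of both the original depth image and the original color image to give D_o and G_o.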
Step S30, the first depth image area and the first color image area are respectively adjusted to set sizes and used as a second depth image area and a second color image area;
in the present embodiment, D_o is preferably resized to 224×224 to obtain the second depth image area D_m, and G_o is preferably resized to 224×224 to obtain the second color image area G_m. Specifically:
D_o is scaled with the resize function of the computer vision library OpenCV to obtain the second depth image area D_m of size 224×224; G_o is scaled with the resize function of OpenCV to obtain the second color image area G_m of size 224×224.
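A two-line sketch of step S30, assuming the resize function referred to is OpenCV's cv2.resize (dsize is given as (width, height); the default bilinear interpolation is an assumption):

```python
import cv2

D_m = cv2.resize(D_o, (224, 224))  # second depth image region, 224x224
G_m = cv2.resize(G_o, (224, 224))  # second color image region, 224x224
```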
Step S40, splicing the second depth image area and the second color image area along the channel direction, and inputting the result into the encoder of the grabbing detection convolutional neural network to obtain a first feature map;
In this embodiment, the second depth image area D_m and the second color image area G_m are spliced along the channel direction, and the spliced result is input into the encoder of the grabbing detection convolutional neural network to obtain a first feature map F_1.
The encoder is formed by cascading three standard convolution layers: EnConv-1, EnConv-2 and EnConv-3. EnConv-1, EnConv-2 and EnConv-3 employ convolution kernels of 9×9, 7×7 and 5×5, respectively, all with stride (2, 2), and their output feature map sizes (channels×height×width) are 32×112×112, 48×56×56 and 72×28×28, respectively. The result of splicing the second depth image area D_m and the second color image area G_m along the channel direction is processed by EnConv-1, EnConv-2 and EnConv-3 in turn to obtain the first feature map F_1 of size 72×28×28.
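The encoder can be sketched in PyTorch as follows. Kernel sizes, strides and channel counts follow the text; the paddings (4, 3, 2) and the ReLU after each layer are assumptions chosen so that a 4-channel 224×224 input (depth spliced with color) yields exactly the stated sizes 32×112×112, 48×56×56 and 72×28×28.

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        # EnConv-1/2/3: 9x9, 7x7, 5x5 kernels, stride (2, 2)
        self.enconv1 = nn.Sequential(nn.Conv2d(4, 32, 9, stride=2, padding=4), nn.ReLU(inplace=True))
        self.enconv2 = nn.Sequential(nn.Conv2d(32, 48, 7, stride=2, padding=3), nn.ReLU(inplace=True))
        self.enconv3 = nn.Sequential(nn.Conv2d(48, 72, 5, stride=2, padding=2), nn.ReLU(inplace=True))

    def forward(self, x):            # x: B x 4 x 224 x 224 (depth + color spliced)
        x = self.enconv1(x)          # B x 32 x 112 x 112
        x = self.enconv2(x)          # B x 48 x 56 x 56
        return self.enconv3(x)       # first feature map F_1, B x 72 x 28 x 28
```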
Step S50, performing feature refinement on the first feature map through a feature refinement module of the grabbing detection convolutional neural network to obtain a second feature map;
in this embodiment, the feature refinement module of the grabbing detection convolutional neural network performs feature refinement on the first feature map F_1 to obtain a second feature map F_2. The feature refinement module adopts an attention mechanism and consists of cross-amplitude attention, a channel attention module, a channel random mixing operation and a standard convolution layer connected in series.
Step S501, the first feature map F_1 is processed using cross-amplitude attention to obtain a cross-amplitude attention feature map F_CAA. The specific process is as follows:
Average pooling is applied to the first feature map F_1 along the height and width dimensions, respectively, to obtain feature maps AVG_h and AVG_w corresponding to the height and width dimensions, of sizes 72×1×28 and 72×28×1; AVG_h and AVG_w are multiplied as matrices to generate an average cross feature map F_avg of size 72×28×28. Maximum pooling is applied to F_1 along the height and width dimensions, respectively, to obtain feature maps MAX_h and MAX_w of sizes 72×1×28 and 72×28×1; MAX_h and MAX_w are multiplied as matrices to generate a maximum cross feature map F_max of size 72×28×28. F_avg and F_max are added element-wise, and the result passes sequentially through average pooling along the channel direction, a standard convolution (kernel 7×7), batch normalization and a ReLU activation function to obtain a cross attention map A_cross of size 1×28×28.
Maximum pooling and average pooling are applied to the first feature map F_1 along the channel dimension, respectively, to obtain feature maps MAX_c and AVG_c, each of size 1×28×28; the element-wise matrix subtraction of MAX_c and AVG_c then passes sequentially through a standard convolution (kernel 3×3), batch normalization and a ReLU activation function to obtain an amplitude attention map A_amp of size 1×28×28.
A_cross and A_amp are spliced along the channel direction, and the spliced result is processed by a standard convolution with a 3×3 kernel and stride (1, 1) to obtain an attention map A_fuse of size 1×28×28. A_fuse is multiplied element-wise with the feature map of each channel in the first feature map F_1 to obtain the cross-amplitude attention feature map F_CAA of size 72×28×28.
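The cross-amplitude attention of step S501 can be sketched in PyTorch as below. The 7×7 and 3×3 kernels, the pooling and matrix-multiplication structure and the final fusion convolution follow the text; bias settings and other minor details are assumptions.

```python
import torch
import torch.nn as nn

class CrossAmplitudeAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.cross_conv = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3),
                                        nn.BatchNorm2d(1), nn.ReLU(inplace=True))
        self.amp_conv = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1),
                                      nn.BatchNorm2d(1), nn.ReLU(inplace=True))
        self.fuse_conv = nn.Conv2d(2, 1, 3, stride=1, padding=1)

    def forward(self, f1):                            # f1: B x 72 x 28 x 28
        avg_h = f1.mean(dim=2, keepdim=True)          # AVG_h: B x C x 1 x W
        avg_w = f1.mean(dim=3, keepdim=True)          # AVG_w: B x C x H x 1
        f_avg = torch.matmul(avg_w, avg_h)            # average cross map, B x C x H x W
        max_h = f1.amax(dim=2, keepdim=True)          # MAX_h
        max_w = f1.amax(dim=3, keepdim=True)          # MAX_w
        f_max = torch.matmul(max_w, max_h)            # maximum cross map
        # channel-wise average pooling, then 7x7 conv + BN + ReLU -> A_cross
        a_cross = self.cross_conv((f_avg + f_max).mean(dim=1, keepdim=True))
        # channel max/avg pooling, element-wise subtraction, 3x3 conv + BN + ReLU -> A_amp
        a_amp = self.amp_conv(f1.amax(dim=1, keepdim=True) - f1.mean(dim=1, keepdim=True))
        a_fuse = self.fuse_conv(torch.cat([a_cross, a_amp], dim=1))   # A_fuse: B x 1 x H x W
        return f1 * a_fuse                            # cross-amplitude attention feature map F_CAA
```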
Step S502, the cross-amplitude attention feature map F_CAA passes sequentially through a channel attention module, a channel random mixing operation and a standard convolution layer to obtain the refined second feature map F_2 of size 72×28×28. The specific implementation of the channel attention module is described in: Sanghyun Woo, Jongchan Park, Joon-Young Lee, In So Kweon. CBAM: Convolutional Block Attention Module. European Conference on Computer Vision, 2018, 3-19. The specific implementation of the channel random mixing operation is described in: Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. IEEE Conference on Computer Vision and Pattern Recognition, 2018, 6848-6856. The kernel and stride of the standard convolution layer are 3×3 and (1, 1), respectively.
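Step S502 can then be sketched as follows, reusing the CrossAmplitudeAttention sketch above. The channel attention follows the cited CBAM formulation and the channel random mixing follows ShuffleNet; the reduction ratio r and the shuffle group count are assumptions, since the patent does not state them.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention (Woo et al., 2018); the reduction ratio is assumed."""
    def __init__(self, channels, r=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // r),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // r, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        attn = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        return x * attn.view(b, c, 1, 1)

def channel_shuffle(x, groups):
    """ShuffleNet-style channel random mixing (Zhang et al., 2018)."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class FeatureRefinement(nn.Module):
    """Cross-amplitude attention -> channel attention -> channel shuffle -> 3x3 convolution."""
    def __init__(self, channels=72, groups=4):
        super().__init__()
        self.caa = CrossAmplitudeAttention()
        self.ca = ChannelAttention(channels)
        self.groups = groups
        self.conv = nn.Conv2d(channels, channels, 3, stride=1, padding=1)

    def forward(self, f1):
        x = self.ca(self.caa(f1))
        return self.conv(channel_shuffle(x, self.groups))   # second feature map F_2
```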
Step S60, up-sampling the second characteristic map through a decoder of the grabbing detection convolutional neural network to obtain a grabbing quality characteristic map, a width characteristic map, a first angle characteristic map and a second angle characteristic map;
in this embodiment, the decoder of the grabbing detection convolutional neural network up-samples the second feature map F_2 to generate a grabbing quality feature map Q, a width feature map W, a first angle feature map C and a second angle feature map S. The decoder consists of three cascaded twin deconvolution modules (TDconv-1, TDconv-2 and TDconv-3) and four parallel standard convolutions (Conv-1, Conv-2, Conv-3 and Conv-4).
The structure of a single twin deconvolution module is shown schematically in Fig. 2. It comprises an original branch and a twin branch, where F_in, c_in, h_in and w_in are the input feature map of the module and its channel dimension, height and width, and F_out, C_out, H_out and W_out are the output feature map and its channel dimension, height and width, with h_in ≤ H_out and w_in ≤ W_out. The original branch applies a standard deconvolution operation to each channel of the input feature map F_in, generating a feature map F_t of size c_in×H_out×W_out. Meanwhile, the twin branch takes as input a matrix Ones whose elements are all 1, of size 1×h_in×w_in; this input is deconvolved to generate a weight adjustment matrix M_w of size 1×H_out×W_out. Then the feature map F_t is divided element-wise by the matrix M_w, and the result passes through a convolution layer with a 1×1 kernel for channel dimension adjustment to obtain the output feature map F_out of the twin deconvolution module. The standard deconvolution in the original branch employs a K_g×K_g convolution kernel with stride (s_g, s_g), where K_g is the kernel size parameter and s_g is the preset stride; the deconvolution in the twin branch employs a K_g×K_g convolution kernel whose internal weights are all set to 1/m, m = (K_g)², with the stride also set to (s_g, s_g).
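The twin deconvolution module of Fig. 2 can be sketched in PyTorch as follows. Reading "standard deconvolution for each channel of F_in" as a depthwise (per-channel) transposed convolution, fixing the twin-branch kernel at 1/m and following it with element-wise division and a 1×1 convolution matches the description above; the padding and output_padding values are assumptions chosen so that each module doubles the spatial size as in this embodiment.

```python
import torch
import torch.nn as nn

class TwinDeconv(nn.Module):
    def __init__(self, c_in, k_g, s_g, c_out):
        super().__init__()
        pad, out_pad = k_g // 2, s_g - 1          # assumed so that H_out = s_g * h_in
        # original branch: per-channel standard deconvolution, K_g x K_g, stride (s_g, s_g)
        self.deconv = nn.ConvTranspose2d(c_in, c_in, k_g, stride=s_g, padding=pad,
                                         output_padding=out_pad, groups=c_in)
        # twin branch: deconvolution whose weights are all fixed to 1/m, m = K_g^2
        self.twin = nn.ConvTranspose2d(1, 1, k_g, stride=s_g, padding=pad,
                                       output_padding=out_pad, bias=False)
        nn.init.constant_(self.twin.weight, 1.0 / (k_g * k_g))
        self.twin.weight.requires_grad_(False)
        # 1x1 convolution for channel dimension adjustment
        self.proj = nn.Conv2d(c_in, c_out, 1)

    def forward(self, f_in):
        f_t = self.deconv(f_in)                   # F_t: c_in x H_out x W_out
        ones = torch.ones(f_in.size(0), 1, f_in.size(2), f_in.size(3), device=f_in.device)
        m_w = self.twin(ones)                     # weight adjustment matrix M_w: 1 x H_out x W_out
        return self.proj(f_t / m_w)               # element-wise division, then channel adjustment
```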
The (c_in, K_g, s_g, C_out) parameter values of the twin deconvolution modules TDconv-1, TDconv-2 and TDconv-3 are (72, 3, 2, 48), (48, 5, 2, 36) and (36, 7, 2, 18), respectively; the (h_in, w_in, H_out, W_out) parameter values of TDconv-1, TDconv-2 and TDconv-3 are (28, 28, 56, 56), (56, 56, 112, 112) and (112, 112, 224, 224), respectively.
In the present embodiment, the second feature map F_2 is fed into the decoder and processed sequentially by TDconv-1, TDconv-2 and TDconv-3; the output feature map of TDconv-3 is then fed to the four parallel standard convolutions Conv-1, Conv-2, Conv-3 and Conv-4, which output the grabbing quality feature map Q, the width feature map W, the first angle feature map C and the second angle feature map S, respectively. The four standard convolutions all use 3×3 kernels with stride 1, and each output feature map has size 1×224×224. The value of each pixel in the grabbing quality feature map Q describes the grabbing quality evaluation value of the grabbing rectangle centered on the corresponding pixel, the value of each pixel in the width feature map W describes the width w of the grabbing rectangle centered on the corresponding pixel, the value of each pixel in the first angle feature map C describes the cosine of twice the orientation angle θ of the grabbing rectangle centered on the corresponding pixel, and the value of each pixel in the second angle feature map S describes the sine of twice the orientation angle θ of the grabbing rectangle centered on the corresponding pixel.
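Putting the three twin deconvolution modules and the four parallel heads together, the decoder can be sketched as follows (reusing the TwinDeconv sketch above; the 3×3 head convolutions use an assumed padding of 1 so that the 224×224 size is preserved):

```python
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.tdconv1 = TwinDeconv(72, 3, 2, 48)   # 28x28   -> 56x56
        self.tdconv2 = TwinDeconv(48, 5, 2, 36)   # 56x56   -> 112x112
        self.tdconv3 = TwinDeconv(36, 7, 2, 18)   # 112x112 -> 224x224
        self.conv_q = nn.Conv2d(18, 1, 3, stride=1, padding=1)   # grabbing quality map Q
        self.conv_w = nn.Conv2d(18, 1, 3, stride=1, padding=1)   # width map W
        self.conv_c = nn.Conv2d(18, 1, 3, stride=1, padding=1)   # first angle map C (cos 2*theta)
        self.conv_s = nn.Conv2d(18, 1, 3, stride=1, padding=1)   # second angle map S (sin 2*theta)

    def forward(self, f2):                                       # f2: B x 72 x 28 x 28
        x = self.tdconv3(self.tdconv2(self.tdconv1(f2)))         # B x 18 x 224 x 224
        return self.conv_q(x), self.conv_w(x), self.conv_c(x), self.conv_s(x)
```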
Step S70, obtaining an optimal grabbing rectangle based on the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map, and further obtaining an optimal grabbing detection frame of the target object in the original color image, so as to realize grabbing detection of the target object;
in this embodiment, the specific process of obtaining the optimal grabbing rectangle from the grabbing quality feature map Q, the width feature map W, the first angle feature map C and the second angle feature map S is as follows:
The position of the pixel point with the maximum grabbing quality evaluation value in the grabbing quality feature map Q is selected as the center position of the optimal grabbing rectangle, denoted (u*, v*).
The values of the first angle feature map C and the second angle feature map S at position (u*, v*), denoted c* and s*, are obtained respectively, and the orientation angle θ* of the optimal grabbing rectangle is then calculated as θ* = (1/2)·arctan(s*/c*). The value of the width feature map W at position (u*, v*) is taken as the width w* of the optimal grabbing rectangle, and the height h* of the optimal grabbing rectangle is set to half of w*, i.e. h* = w*/2.
Combining u*, v*, θ*, w* and h* yields the optimal grabbing rectangle, characterized by (u*, v*, θ*, w*, h*).
Based on the optimal grabbing rectangle, the coordinates of its four vertexes are obtained as (u_1, v_1), (u_2, v_2), (u_3, v_3) and (u_4, v_4), and the corresponding points of these four vertexes in the original color image, denoted P_1, P_2, P_3 and P_4, are then obtained; their pixel coordinates in the original color image are (u_O + round(u_1×r_fw), v_O + round(v_1×r_fh)), (u_O + round(u_2×r_fw), v_O + round(v_2×r_fh)), (u_O + round(u_3×r_fw), v_O + round(v_3×r_fh)) and (u_O + round(u_4×r_fw), v_O + round(v_4×r_fh)), where (u_O, v_O) are the pixel coordinates of the top-left corner of the first color image area G_o in the original color image, round() is the rounding function, and r_fw and r_fh are the width and height ratios between the first color image area G_o and the second color image area G_m, with r_fw = w_o/224 and r_fh = h_o/224, where h_o and w_o are the height and width of the first color image area G_o. The optimal grabbing detection frame of the target object in the original color image is constructed with P_1, P_2, P_3 and P_4 as vertexes, thereby realizing grabbing detection of the target object.
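The best-grasp extraction and the mapping back to the original color image (step S70 of this embodiment) can be sketched numerically as follows. The argmax, the angle recovery θ* = arctan2(s*, c*)/2, the h* = w*/2 rule and the mapping through (u_O, v_O), r_fw and r_fh follow the text; the sign convention used to place the four vertexes around the centre is an assumption.

```python
import numpy as np

def best_grasp_box(Q, W, C, S, region_top_left, region_hw):
    """Q, W, C, S: 224x224 output maps; region_top_left = (u_O, v_O) of G_o in the
    original color image; region_hw = (h_o, w_o), the size of G_o."""
    v_s, u_s = np.unravel_index(np.argmax(Q), Q.shape)      # centre (u*, v*): col = u*, row = v*
    theta = 0.5 * np.arctan2(S[v_s, u_s], C[v_s, u_s])      # orientation angle theta*
    w = float(W[v_s, u_s])                                   # width w*
    h = 0.5 * w                                              # height h* = w*/2
    # four vertexes of the rotated rectangle in the 224x224 region (assumed corner order)
    ca, sa = np.cos(theta), np.sin(theta)
    corners = [(u_s + su * 0.5 * w * ca - sv * 0.5 * h * sa,
                v_s + su * 0.5 * w * sa + sv * 0.5 * h * ca)
               for su, sv in ((-1, -1), (1, -1), (1, 1), (-1, 1))]
    # map back to the original color image through (u_O, v_O) and the ratios r_fw, r_fh
    u_o, v_o = region_top_left
    h_o, w_o = region_hw
    r_fw, r_fh = w_o / 224.0, h_o / 224.0
    box = [(u_o + int(round(u * r_fw)), v_o + int(round(v * r_fh))) for u, v in corners]
    return (u_s, v_s, theta, w, h), box   # optimal grasp rectangle and detection frame P1..P4
```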
The following describes training samples and loss functions of a grabbing detection convolutional neural network including an encoder, a feature refinement module and a decoder.
In this embodiment, the grabbing detection convolutional neural network is trained on the Cornell grasping dataset, which contains 885 RGB-D images of objects to be grasped with 5110 manually labelled grabbing rectangles. After random cropping, magnification and rotation of the 885 images in the Cornell grasping dataset, 4425 RGB-D images of size 224×224 are generated; each RGB-D image comprises a 224×224 depth image area and a 224×224 color image area, which are fed directly into the grabbing detection convolutional neural network to train it. The ground-truth grabbing quality feature map, ground-truth width feature map, ground-truth first angle feature map and ground-truth second angle feature map corresponding to each RGB-D image are constructed as follows: first, the four ground-truth maps of size 1×224×224 are initialized with pixel value 0; for each manually labelled grabbing rectangle of the RGB-D image, a new rectangle is defined whose center point, width and height are the center point, 1/3 of the width and the height of the manually labelled rectangle; the pixel points of the ground-truth grabbing quality feature map inside the region corresponding to the new rectangle are assigned the value 1, and the pixel points of the ground-truth width, first angle and second angle feature maps inside the region corresponding to the new rectangle are assigned, respectively, the width of the original manually labelled grabbing rectangle, the cosine of twice its orientation angle and the sine of twice its orientation angle; after all manually labelled grabbing rectangles of the RGB-D image have been processed, the four ground-truth feature maps corresponding to that RGB-D image are obtained.
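The construction of the ground-truth maps can be sketched as follows; the rotated-rectangle rasterisation via OpenCV (cv2.boxPoints / cv2.fillPoly) is an implementation choice, not something specified in the patent, while the 1/3-width shrinking and the assigned values follow the text.

```python
import cv2
import numpy as np

def rect_mask(size, u, v, theta, w, h):
    """Boolean mask of a rotated rectangle centred at (u, v) with orientation theta."""
    box = cv2.boxPoints(((float(u), float(v)), (float(w), float(h)), float(np.degrees(theta))))
    mask = np.zeros((size, size), np.uint8)
    cv2.fillPoly(mask, [box.astype(np.int32)], 1)
    return mask.astype(bool)

def build_targets(grasp_rects, size=224):
    """grasp_rects: manually labelled rectangles (u, v, theta, w, h) for one RGB-D sample."""
    Q_gt = np.zeros((size, size), np.float32)
    W_gt = np.zeros((size, size), np.float32)
    C_gt = np.zeros((size, size), np.float32)
    S_gt = np.zeros((size, size), np.float32)
    for (u, v, theta, w, h) in grasp_rects:
        m = rect_mask(size, u, v, theta, w / 3.0, h)   # shrunken rectangle: 1/3 width, same height
        Q_gt[m] = 1.0                                  # grasp quality truth
        W_gt[m] = w                                    # width of the original labelled rectangle
        C_gt[m] = np.cos(2.0 * theta)                  # cosine of twice the orientation angle
        S_gt[m] = np.sin(2.0 * theta)                  # sine of twice the orientation angle
    return Q_gt, W_gt, C_gt, S_gt
```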
The grabbing detection convolutional neural network is trained with the Adam optimizer, and the loss function minimized during training is shown in equation (1):
where L_loss is the loss function of the grabbing detection convolutional neural network during training, and N is the total number of training samples; Q_i, W_i, C_i and S_i are respectively the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map output by the grabbing detection convolutional neural network for the i-th sample, and the remaining symbols in equation (1) denote the ground-truth values corresponding to Q_i, W_i, C_i and S_i, i = 1, 2, …, N.
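Equation (1) itself is reproduced only as an image in the source text. Assuming it is a sum of per-map regression errors averaged over the N training samples — a common choice for grasp regression networks of this kind, stated here purely as an assumption — the training loss can be sketched as:

```python
import torch.nn.functional as F

def grasp_loss(pred, target):
    """pred/target: tuples (Q, W, C, S) of predicted and ground-truth maps for a batch.
    The exact form of equation (1) is an assumption (summed mean-squared error)."""
    q, w, c, s = pred
    q_t, w_t, c_t, s_t = target
    return (F.mse_loss(q, q_t) + F.mse_loss(w, w_t) +
            F.mse_loss(c, c_t) + F.mse_loss(s, s_t))
```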
The invention can improve the grabbing detection performance, provide technical support for grabbing detection of the service robot in the environments of offices, houses and the like, and realize better technical effects.
A service robot grabbing detection system based on an attention mechanism and twin deconvolution according to a second embodiment of the present invention, as shown in Fig. 3, comprises: an image acquisition module 100, an object detection module 200, a size adjustment module 300, an encoding module 400, a refinement module 500, a decoding module 600 and a grabbing detection module 700;
The image acquisition module 100 is configured to acquire an original color image and an original depth image containing a target object through a vision sensor by a service robot;
the object detection module 200 is configured to obtain a bounding box of a target object through an object detection method based on deep learning based on the original color image, and take corresponding areas of the bounding box in the original depth image and the original color image as a first depth image area and a first color image area;
the size adjustment module 300 is configured to adjust the first depth image area and the first color image area to set sizes respectively, and to be a second depth image area and a second color image area;
the encoding module 400 is configured to splice the second depth image area and the second color image area along the channel direction and input the result into the encoder of the grabbing detection convolutional neural network to obtain a first feature map;
the refinement module 500 is configured to perform feature refinement on the first feature map through the feature refinement module of the grabbing detection convolutional neural network to obtain a second feature map;
the decoding module 600 is configured to upsample the second feature map by using a decoder of the grabbing detection convolutional neural network to obtain a grabbing quality feature map, a width feature map, a first angle feature map, and a second angle feature map;
The grabbing detection module 700 is configured to obtain an optimal grabbing rectangle based on the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map, so as to obtain an optimal grabbing detection frame of the target object in the original color image, and achieve grabbing detection of the target object;
the values of the pixel points in the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map respectively describe the grabbing quality evaluation value, the width, the cosine of twice the orientation angle and the sine of twice the orientation angle of the grabbing rectangle centered on the corresponding pixel point.
It should be noted that, in the service robot capture detection system based on the attention mechanism and the twin deconvolution provided in the foregoing embodiment, only the division of the foregoing functional modules is illustrated, in practical application, the foregoing functional allocation may be completed by different functional modules according to needs, that is, the modules or steps in the foregoing embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into a plurality of sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps related to the embodiments of the present invention are merely for distinguishing the respective modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device of a third embodiment of the present invention stores therein a plurality of programs adapted to be loaded and executed by a processor to implement the above-described service robot grip detection method based on an attention mechanism and twin deconvolution.
A processing device according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute each program; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above described method of service robot grip detection based on an attention mechanism and twin deconvolution.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the storage device and the processing device and the related description of the foregoing description may refer to the corresponding process in the foregoing method example, which is not repeated herein.
Those of skill in the art will appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the programs corresponding to the software modules and method steps may be embodied in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application, but such implementation is not intended to be limiting.
The terms "first," "second," "third," and the like, are used for distinguishing between similar objects and not for describing a particular sequential or chronological order.
Thus far, the technical solution of the present invention has been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of protection of the present invention is not limited to these specific embodiments. Equivalent modifications and substitutions for related technical features may be made by those skilled in the art without departing from the principles of the present invention, and such modifications and substitutions will fall within the scope of the present invention.

Claims (6)

1. A service robot grabbing detection method based on an attention mechanism and twin deconvolution, characterized in that the method comprises:
step S10, the service robot acquires an original color image and an original depth image containing a target object through a vision sensor;
step S20, based on the original color image, obtaining a boundary box of a target object through an object detection method based on deep learning, and taking corresponding areas of the boundary box in the original depth image and the original color image as a first depth image area and a first color image area;
Step S30, the first depth image area and the first color image area are respectively adjusted to set sizes and used as a second depth image area and a second color image area;
step S40, splicing the second depth image area and the second color image area along the channel direction, and inputting the result into the encoder of the grabbing detection convolutional neural network to obtain a first feature map;
step S50, performing feature refinement on the first feature map through a feature refinement module of the grabbing detection convolutional neural network to obtain a second feature map;
step S60, up-sampling the second characteristic map through a decoder of the grabbing detection convolutional neural network to obtain a grabbing quality characteristic map, a width characteristic map, a first angle characteristic map and a second angle characteristic map;
step S70, obtaining an optimal grabbing rectangle based on the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map, and further obtaining an optimal grabbing detection frame of the target object in the original color image, so as to realize grabbing detection of the target object;
the values of the pixel points in the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map respectively describe the grabbing quality evaluation value, the width, the cosine of twice the orientation angle and the sine of twice the orientation angle of the grabbing rectangle centered on the corresponding pixel point;
The grabbing detection convolutional neural network comprises an encoder, a characteristic fine modification module and a decoder;
the encoder comprises three cascaded standard convolution layers;
the characteristic fine modification module consists of a cross-amplitude attention module, a channel random mixing operation and a standard convolution layer which are connected in series;
the decoder comprises three cascaded twin deconvolution modules and four parallel standard convolutions; the three cascaded twin deconvolution modules are connected with four parallel standard convolutions;
the twin deconvolution module comprises two branches; the two branches comprise an original branch and a twin branch; the input of the original branch is a feature map F_in, and a standard deconvolution operation is applied to each channel of F_in to generate a feature map F_t; the input of the twin branch is a matrix whose elements are all 1, and a deconvolution operation is applied to this all-ones matrix to generate a weight adjustment matrix M_w;
the feature map F_t is divided element-wise by the matrix M_w, and the result of this element-wise matrix division is passed through a convolution layer for channel dimension adjustment to obtain a feature map F_out as the output of the twin deconvolution module;
Performing feature refinement on the first feature map to obtain a second feature map, wherein the method comprises the following steps:
step S501, processing the first feature map with cross-amplitude attention, to obtain a cross-amplitude attention feature map:
respectively applying average pooling along the height and width dimensions to the first feature map to obtain feature maps AVG_h and AVG_w corresponding to the height and width dimensions;
multiplying AVG_h and AVG_w as matrices to generate an average cross feature map F_avg;
respectively applying maximum pooling along the height and width dimensions to the first feature map to obtain feature maps MAX_h and MAX_w corresponding to the height and width dimensions;
multiplying MAX_h and MAX_w as matrices to generate a maximum cross feature map F_max;
performing element-wise matrix addition of F_avg and F_max, and sequentially applying average pooling along the channel direction, standard convolution, batch normalization and a ReLU activation function to obtain a cross attention map A_cross;
respectively applying maximum pooling and average pooling along the channel dimension to the first feature map to obtain feature maps MAX_c and AVG_c corresponding to the channel dimension;
performing element-wise matrix subtraction of MAX_c and AVG_c, and sequentially applying standard convolution, batch normalization and a ReLU activation function to obtain an amplitude attention map A_amp;
splicing A_cross and A_amp along the channel direction and applying a standard convolution to the spliced result to obtain an attention map A_fuse; multiplying A_fuse element-wise with the feature map of each channel in the first feature map to obtain the cross-amplitude attention feature map;
step S502, the cross-amplitude attention feature map sequentially passes through a channel attention module, a channel random mixing operation and a standard convolution layer to obtain a second feature map.
2. The method for detecting the grabbing of a service robot based on an attention mechanism and twin deconvolution according to claim 1, wherein an optimal grabbing rectangle is obtained based on the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map, and then an optimal grabbing detection frame of a target object in the original color image is obtained, and the method comprises the following steps:
selecting the position of the pixel point with the maximum grabbing quality evaluation value in the grabbing quality feature map as the center position of the optimal grabbing rectangle, denoted (u*, v*);
obtaining the values of the first angle feature map and the second angle feature map at position (u*, v*), denoted C* and S* respectively, and computing the orientation angle θ* of the optimal grabbing rectangle as θ* = (1/2)·arctan(S*/C*);
obtaining the value of the width feature map at position (u*, v*) as the width w* of the optimal grabbing rectangle, and setting the height h* of the optimal grabbing rectangle to half of w*, i.e. h* = w*/2;
combining u*, v*, θ*, w* and h* to obtain the optimal grabbing rectangle;
obtaining the coordinates of the four vertices of the optimal grabbing rectangle, and further obtaining the corresponding points of these four vertices in the original color image, denoted P_1, P_2, P_3 and P_4;
constructing the optimal grabbing detection frame of the target object in the original color image with P_1, P_2, P_3 and P_4 as vertices.
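For illustration, a short NumPy sketch of this decoding step follows. The function and variable names are hypothetical, and the final mapping of the four vertices back to the original color image (undoing the earlier resizing of the bounding-box region) is not shown.

```python
# Hedged sketch of selecting the optimal grabbing rectangle from the four output maps.
import numpy as np


def decode_best_grasp(quality, width, cos2theta, sin2theta):
    """All inputs are 2-D arrays of identical shape (one value per pixel)."""
    # centre (u*, v*): pixel with the maximum grabbing quality evaluation value
    v_star, u_star = np.unravel_index(np.argmax(quality), quality.shape)

    # orientation recovered from the cos(2*theta) and sin(2*theta) maps at the centre
    theta_star = 0.5 * np.arctan2(sin2theta[v_star, u_star], cos2theta[v_star, u_star])

    # width read from the width map; height fixed to half of the width
    w_star = float(width[v_star, u_star])
    h_star = w_star / 2.0
    return (u_star, v_star), theta_star, w_star, h_star


def grasp_rect_corners(center, theta, w, h):
    """Return the four corner points P1..P4 of the rotated grasp rectangle."""
    u, v = center
    c, s = np.cos(theta), np.sin(theta)
    half = np.array([[-w / 2, -h / 2], [w / 2, -h / 2], [w / 2, h / 2], [-w / 2, h / 2]])
    rot = np.array([[c, -s], [s, c]])
    return half @ rot.T + np.array([u, v])            # corners in the resized image frame
```

Using arctan2 rather than a plain arctangent keeps the recovered angle well defined even when the cosine value at the centre is close to zero.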
3. The service robot grabbing detection method based on an attention mechanism and twin deconvolution according to claim 2, wherein the loss function of the grabbing detection convolutional neural network is as follows:
wherein L_loss is the loss function of the grabbing detection convolutional neural network during training, and N is the total number of training samples; Q_i, W_i, C_i and S_i are respectively the grabbing quality feature map, width feature map, first angle feature map and second angle feature map output by the grabbing detection convolutional neural network for the i-th sample, and Q_i*, W_i*, C_i* and S_i* are their corresponding ground-truth values, i = 1, 2, …, N.
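The exact formula of this loss is not shown above; as a hedged placeholder only, the sketch below assumes a pixel-wise mean-squared error summed over the four output maps and averaged over the samples, which is a common choice for grasp-regression networks and should not be read as the patent's stated formula.

```python
# Assumed (not patent-specified) training loss over the four output feature maps.
import torch.nn.functional as F


def grasp_detection_loss(pred, target):
    """pred/target: dicts with keys 'Q', 'W', 'C', 'S' holding (N, 1, H, W) tensors."""
    return sum(F.mse_loss(pred[k], target[k]) for k in ("Q", "W", "C", "S"))
```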
4. A service robot grabbing detection system based on an attention mechanism and twin deconvolution, the system comprising: an image acquisition module, an object detection module, a size adjustment module, a coding module, a refinement module, a decoding module and a grabbing detection module;
The image acquisition module is configured for the service robot to acquire, through a visual sensor, an original color image and an original depth image containing the target object;
the object detection module is configured to acquire, based on the original color image, a bounding box of the target object through a deep-learning-based object detection method, and take the areas corresponding to the bounding box in the original depth image and the original color image as a first depth image area and a first color image area;
the size adjustment module is configured to resize the first depth image area and the first color image area to the set size, obtaining a second depth image area and a second color image area respectively;
the coding module is configured to splice the second depth image area and the second color image area along the channel direction and input the result into the encoder of the grabbing detection convolutional neural network to obtain a first feature map;
the refinement module is configured to perform feature refinement on the first feature map through the feature refinement module of the grabbing detection convolutional neural network to obtain a second feature map;
the decoding module is configured to up-sample the second feature map through a decoder of the grabbing detection convolutional neural network to obtain a grabbing quality feature map, a width feature map, a first angle feature map and a second angle feature map;
The grabbing detection module is configured to obtain an optimal grabbing rectangle based on the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map, and thereby obtain the optimal grabbing detection frame of the target object in the original color image, achieving grabbing detection of the target object;
the value of each pixel point in the grabbing quality feature map, the width feature map, the first angle feature map and the second angle feature map respectively describes the grabbing quality evaluation value, the width, the cosine of twice the orientation angle and the sine of twice the orientation angle of a grabbing rectangle centered at that pixel point;
the grabbing detection convolutional neural network comprises an encoder, a feature refinement module and a decoder;
the encoder comprises three cascaded standard convolution layers;
the feature refinement module consists of a cross-amplitude attention module, a channel random mixing operation and a standard convolution layer connected in series;
the decoder comprises three cascaded twin deconvolution modules and four parallel standard convolutions; the three cascaded twin deconvolution modules are connected with four parallel standard convolutions;
The twin deconvolution module comprises two branches, an original branch and a twin branch; the input of the original branch is a feature map F_in, and a standard deconvolution operation is applied to each channel of F_in to generate a feature map F_t; the input of the twin branch is a matrix whose elements are all 1, and a deconvolution operation is applied to this all-ones matrix to generate a weight adjustment matrix M_w;
the feature map F_t is divided element-wise by the matrix M_w, and the result of the element-wise division is passed through a convolution layer for channel dimension adjustment to obtain the feature map F_out as the output of the twin deconvolution module (a code sketch of this module follows this claim);
performing feature refinement on the first feature map to obtain a second feature map, wherein the method comprises the following steps:
step S501, processing the first feature map with cross-amplitude attention to obtain a cross-amplitude attention feature map:
applying average pooling to the first feature map along the height and width dimensions respectively, to obtain the feature maps AVG_h and AVG_w corresponding to those dimensions;
multiplying AVG_h and AVG_w by matrix multiplication to generate the average cross feature map F_avg;
applying maximum pooling to the first feature map along the height and width dimensions respectively, to obtain the feature maps MAX_h and MAX_w corresponding to those dimensions;
multiplying MAX_h and MAX_w by matrix multiplication to generate the maximum cross feature map F_max;
adding F_avg and F_max element-wise, then applying average pooling along the channel direction, a standard convolution, batch normalization and a ReLU activation function in sequence to obtain the cross attention map A_cross;
applying maximum pooling and average pooling to the first feature map along the channel dimension respectively, to obtain the feature maps MAX_c and AVG_c corresponding to the channel dimension;
subtracting AVG_c from MAX_c element-wise, then applying a standard convolution, batch normalization and a ReLU activation function in sequence to obtain the amplitude attention map A_amp;
splicing A_cross and A_amp along the channel direction and applying a standard convolution to the splicing result to obtain the attention map A_fuse; multiplying A_fuse element-wise with the feature map of each channel of the first feature map to obtain the cross-amplitude attention feature map;
step S502, the cross-amplitude attention feature map sequentially passes through a channel attention module, a channel random mixing operation and a standard convolution layer to obtain a second feature map.
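A minimal PyTorch sketch of the twin deconvolution module described in claim 4 above follows. The kernel size, stride, per-channel (depthwise) deconvolution and the fixed unit-weight kernel used to build M_w in the twin branch are assumptions; the claim only states that an all-ones matrix is deconvolved.

```python
# Hedged sketch of the twin deconvolution module (original branch + twin branch).
import torch
import torch.nn as nn


class TwinDeconv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int,
                 kernel_size: int = 4, stride: int = 2, padding: int = 1):
        super().__init__()
        # original branch: standard deconvolution applied to each channel of F_in -> F_t
        self.deconv = nn.ConvTranspose2d(in_ch, in_ch, kernel_size, stride, padding,
                                         groups=in_ch, bias=False)
        # twin branch: deconvolution of an all-ones matrix; a fixed unit-weight kernel is
        # assumed here, so M_w counts how many kernel taps contribute to each output pixel
        self.twin = nn.ConvTranspose2d(1, 1, kernel_size, stride, padding, bias=False)
        nn.init.ones_(self.twin.weight)
        self.twin.weight.requires_grad_(False)
        # channel-dimension adjustment of the divided result -> F_out
        self.adjust = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        f_t = self.deconv(f_in)                                     # feature map F_t
        ones = torch.ones(f_in.size(0), 1, f_in.size(2), f_in.size(3),
                          device=f_in.device, dtype=f_in.dtype)     # all-ones input
        m_w = self.twin(ones)                                       # weight adjustment matrix M_w
        return self.adjust(f_t / m_w)                               # element-wise division, 1x1 conv
```

Dividing F_t by M_w evens out the unequal number of kernel contributions that different output pixels receive from a strided deconvolution, which is the usual source of checkerboard artifacts in encoder-decoder grasp networks.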
5. A storage device in which a plurality of programs are stored, characterized in that the programs are adapted to be loaded and executed by a processor to implement the service robot grabbing detection method based on an attention mechanism and twin deconvolution according to any one of claims 1-3.
6. A processing device, comprising a processor and a storage device; the processor adapted to execute each program; the storage device adapted to store a plurality of programs; characterized in that the programs are adapted to be loaded and executed by the processor to implement the service robot grabbing detection method based on an attention mechanism and twin deconvolution according to any one of claims 1-3.
CN202211376120.2A 2022-11-04 2022-11-04 Service robot grabbing detection method based on attention mechanism and twin deconvolution Active CN115648215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211376120.2A CN115648215B (en) 2022-11-04 2022-11-04 Service robot grabbing detection method based on attention mechanism and twin deconvolution

Publications (2)

Publication Number Publication Date
CN115648215A CN115648215A (en) 2023-01-31
CN115648215B (en) 2024-01-26

Family

ID=84994756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211376120.2A Active CN115648215B (en) 2022-11-04 2022-11-04 Service robot grabbing detection method based on attention mechanism and twin deconvolution


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070073A (en) * 2019-05-07 2019-07-30 国家广播电视总局广播电视科学研究院 Pedestrian's recognition methods again of global characteristics and local feature based on attention mechanism
CN111815708A (en) * 2020-07-17 2020-10-23 中国科学院自动化研究所 Service robot grabbing detection method based on dual-channel convolutional neural network
WO2021164731A1 (en) * 2020-02-19 2021-08-26 华为技术有限公司 Image enhancement method and image enhancement apparatus
CN113611323A (en) * 2021-05-07 2021-11-05 北京至芯开源科技有限责任公司 Voice enhancement method and system based on dual-channel convolution attention network
CN114898212A (en) * 2022-05-12 2022-08-12 电子科技大学 High-resolution remote sensing image multi-terrain change information extraction method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220108097A1 (en) * 2020-10-05 2022-04-07 Rakuten, Inc. Dual encoder attention u-net


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant