CN116416307B - Prefabricated part hoisting splicing 3D visual guiding method based on deep learning - Google Patents

Prefabricated part hoisting splicing 3D visual guiding method based on deep learning

Info

Publication number
CN116416307B
Authority
CN
China
Prior art keywords
prefabricated
hoisting
robot
splicing
prefabricated part
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310074563.4A
Other languages
Chinese (zh)
Other versions
CN116416307A (en)
Inventor
舒江鹏
高一帆
张晓武
肖文楷
夏哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202310074563.4A priority Critical patent/CN116416307B/en
Publication of CN116416307A publication Critical patent/CN116416307A/en
Application granted granted Critical
Publication of CN116416307B publication Critical patent/CN116416307B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J5/00Manipulators mounted on wheels or on carriages
    • B25J5/02Manipulators mounted on wheels or on carriages travelling along a guideway
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a deep-learning-based 3D visual guidance method for hoisting and splicing prefabricated parts, which comprises the following steps: first, features of the prefabricated part are extracted from RGB-D images and mask processing is applied; then, the masked prefabricated-part pixel blocks are processed by a CNN architecture to iteratively estimate the 6D pose of the prefabricated part; finally, given the 6D pose to which the robot end effector is to be moved, 3D visual guidance of the hoisting and splicing is performed based on RRT-Star path planning, guiding the hoisting robot to automatically plan the motion path of its end effector so that the grasped prefabricated part is moved to the splicing position. Through computer 3D vision for intelligent perception, recognition and feature extraction, the invention enables automatic robotic assembly of prefabricated components.

Description

Prefabricated part hoisting splicing 3D visual guiding method based on deep learning
Technical Field
The invention relates to the application of computer vision, in particular to its application in the field of intelligent robotic construction, and specifically to a deep-learning-based 3D visual guidance method for hoisting and splicing prefabricated parts.
Background
At present, the fabricated structural system still faces a development bottleneck: its cost remains higher than that of the traditional cast-in-situ system, which to some extent limits its popularization. The higher cost is mainly due to the low degree of automation in assembling the prefabricated components; operations must be carried out under the guidance of on-site workers, so labor costs are not significantly reduced. In the field of intelligent construction, technologies such as deep learning and computer-vision perception are beginning to play an increasingly important role and provide a new route toward automated assembly of prefabricated components. With computer-vision technology, a construction robot gains 3D vision capability and can position, hoist and assemble prefabricated components under visual guidance. However, the existing methods, which mainly use algorithms based on the convolutional neural network (Convolutional Neural Network, CNN) architecture, consider only the position difference of the prefabricated part along the x and y coordinate axes when planning and predicting the hoisting path; they lack a comparison, along the third (z) coordinate axis, of the position and orientation (the included angles with the x, y and z axes) of the front/top ends of the overhanging steel bars of the part to be hoisted and of the existing part at the connecting location to be spliced. This may result in the prefabricated part being lifted to the assembly point (x, y) yet failing to match, along the z coordinate axis, the position and orientation of the front/top ends of the overhanging steel bars of the existing prefabricated part at the splicing location, so that high-precision assembly cannot be achieved. The invention therefore further explores a 3D visual guidance method for hoisting and splicing prefabricated parts on the basis of the prior art.
Disclosure of Invention
The invention aims to provide a deep-learning-based 3D visual guidance method for hoisting and splicing prefabricated parts that addresses the shortcomings of the prior art.
The aim of the invention is achieved by the following technical scheme. A deep-learning-based 3D visual guidance method for hoisting and splicing prefabricated parts comprises the following steps:
(1) Based on the RGB-D image, extracting the features of the prefabricated part and carrying out mask processing;
(2) Processing the prefabricated-part pixel blocks obtained by the mask processing in step (1) based on a CNN architecture to iteratively estimate the 6D pose of the prefabricated part;
(3) Acquiring the 6D pose to which the robot end effector is to be moved, and performing 3D visual guidance of the hoisting and splicing of the prefabricated part based on RRT-Star path planning, so as to guide the hoisting robot to automatically plan the motion path of its end effector and move the grasped prefabricated part to the splicing position.
Optionally, the RGB-D image in the step (1) is acquired by an Intel RealSense depth camera to obtain a color image and a depth image.
Optionally, the color image comprises surface color information and texture information of objects in the simulated hoisting scene, and the depth image comprises spatial shape information of objects in the simulated hoisting scene.
Optionally, the mask processing in step (1) is implemented based on a Faster R-CNN architecture, which includes a feature extraction network, a region candidate network, region-of-interest pooling, and a classification-regression network.
Optionally, the CNN architecture in step (2) includes:
a full convolutional network for processing color information, which maps each pixel point in the prefabricated-part pixel block obtained by the mask processing in step (1) into a color feature space as a color feature embedding;
a coordinate conversion unit, which converts the depth-channel information in the prefabricated-part pixel block obtained by the mask processing in step (1) into point cloud data and maps each data point into a geometric feature space as a geometric feature embedding; and
a CNN network for pixel-level image fusion, which combines the color feature embedding and the geometric feature embedding and iteratively outputs the 6D pose of the masked prefabricated-part pixel block based on an unsupervised confidence score.
Optionally, the iterative estimation of the prefabricated-part 6D pose in step (2) is performed on a multi-frame dense video stream.
Optionally, in step (3), given the 6D pose to which the robot end effector is to be moved, the motion position and joint angle of each robot joint are solved, so as to provide 3D visual guidance for the hoisting and splicing of the prefabricated parts and to control the mechanical arm.
The invention has the following beneficial effects. Taking robotic automated construction as the core and integrating computer vision with intelligent construction technology, the invention gives the construction robot 3D vision capability through real-time feature extraction and mask processing of the RGB-D image stream, and realizes intelligent, high-precision positioning and assembly of prefabricated components. The whole pipeline, from RGB-D image processing of the prefabricated part to digital control of the mechanical arm, runs automatically and offers the following advantages in practical engineering applications: an RGB-D image database of common prefabricated components is built, together with methods and procedures for extracting their color, shape and spatial features, so that feature extraction of prefabricated components based on computer 3D vision is more accurate and efficient; the method can replace manual assembly of prefabricated components, with a worker needed only to assist the data transmission inside the robot platform; compared with traditional methods, the number of workers required during construction is significantly reduced; and the fabricated structural system achieves a high degree of assembly automation and high construction efficiency.
Drawings
FIG. 1 is a flow chart of a pre-fabricated part hoisting splicing 3D visual guiding method based on deep learning according to an embodiment of the invention;
FIG. 2 is a schematic diagram of recording an RGB-D image sequence of a simulated hoist scene including a precast element using an Intel RealSense depth camera;
FIG. 3 is a schematic diagram of an RGB-HSV color space transform;
fig. 4 shows the component hoisting actuator (a KUKA industrial robot mounted on a ground linear slide rail).
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Referring to fig. 1, which shows the flow chart of the deep-learning-based prefabricated part hoisting and splicing 3D visual guidance method of the present invention, the method includes the following steps:
(1) Extracting and masking features of the prefabricated parts: based on the RGB-D image, the prefabricated part features are extracted and masking processing is performed.
In this embodiment, the RGB-D image sequence is acquired as follows: color and depth image (RGB-D) sequences of simulated hoisting scenes of common prefabricated components (such as prefabricated beams, prefabricated columns, prefabricated floor slabs, prefabricated wall panels, etc.) are recorded with an Intel RealSense depth camera, as shown in fig. 2. The color images contain the surface color and texture information of objects in the simulated hoisting scene, and the depth images contain their spatial shape information.
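As an illustration of this acquisition step, the following minimal Python sketch records an aligned color/depth sequence with the pyrealsense2 SDK; the stream resolution, frame rate, recording length and file names are assumptions chosen for the example, not values specified by the invention.

```python
# Illustrative sketch: capturing an aligned RGB-D sequence from an Intel RealSense camera.
# Stream settings and file naming are placeholders for demonstration only.
import numpy as np
import pyrealsense2 as rs
import cv2

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)   # depth stream
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)  # color stream
pipeline.start(config)
align = rs.align(rs.stream.color)  # align depth pixels to the color frame

try:
    for i in range(300):  # record roughly 10 s at 30 fps
        frames = align.process(pipeline.wait_for_frames())
        depth = np.asanyarray(frames.get_depth_frame().get_data())   # uint16 depth map
        color = np.asanyarray(frames.get_color_frame().get_data())   # uint8 BGR image
        cv2.imwrite(f"color_{i:04d}.png", color)
        np.save(f"depth_{i:04d}.npy", depth)
finally:
    pipeline.stop()
```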
It will be appreciated that, for identification of the prefabricated part in an actual scene, the image contains a broad background with no annotation of the target part's location, and the prefabricated part may appear anywhere in the image, in a variety of sizes and shapes.
RGB-D image noise reduction: when a continuous video stream is acquired from the RGB-D image sequence, noise may be introduced by mutual occlusion of objects in the scene, dense object distribution, a cluttered environment and other factors, for example high-brightness pixels or pixel blocks, blurred object pixels and geometric distortion, all of which produce strong visual artifacts. Noise is interference present in the image data and adversely affects subsequent processing such as feature extraction, so the invention applies noise reduction to the acquired RGB-D image sequence. The RGB color space is defined by the red, green and blue chromaticities, with other colors generated from the corresponding color triangle of each pixel. Images acquired under uneven illumination or low light exhibit chrominance shift, affected by saturation and brightness. Unlike the RGB color space, the HSV color space separates luminance, saturation and chrominance information and consists of three mutually independent channels: hue, saturation and value. In this method, the RGB image is first passed through Gaussian filtering (Gaussian binary mask) and then transformed into HSV space, as shown in fig. 3; the hue channel is kept unchanged, while the brightness and saturation channels are denoised with a Bayesian-estimation threshold, which reduces the number of Gaussian convolutions and improves the efficiency of noise reduction; finally, the new HSV components are transformed back through the inverse color-space transform to obtain the noise-reduced RGB-D image. This avoids the color distortion that easily arises when denoising directly in RGB space.
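The route described above can be sketched as follows with OpenCV; the median filter and non-local-means step applied to the saturation and brightness channels are simple stand-ins for the Bayesian-estimation-threshold denoising, whose exact formulation the text does not give, and the kernel sizes are illustrative.

```python
# Illustrative sketch of the noise-reduction route: Gaussian filtering, RGB->HSV,
# denoising only the S and V channels, then the inverse transform back to RGB/BGR.
# The median / non-local-means filters below stand in for the Bayesian-threshold
# denoising mentioned in the text.
import cv2
import numpy as np

def denoise_color_image(bgr: np.ndarray) -> np.ndarray:
    blurred = cv2.GaussianBlur(bgr, (5, 5), 0)            # suppress high-frequency speckle
    hsv = cv2.cvtColor(blurred, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    # Hue channel kept unchanged; saturation and value channels denoised.
    s = cv2.medianBlur(s, 5)
    v = cv2.fastNlMeansDenoising(v, None, 10, 7, 21)
    hsv_clean = cv2.merge([h, s, v])
    return cv2.cvtColor(hsv_clean, cv2.COLOR_HSV2BGR)     # inverse color-space transform
```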
Mask processing based on local feature extraction: local feature extraction extracts features of a region of interest in the image so that the region can be highly discriminative. In this embodiment, a CNN is used to extract the local color, shape and spatial features of the prefabricated part in the noise-reduced RGB-D image sequence. The color, shape and spatial features describe, respectively, the RGB composition of the object's surface pixels, the object outline, and the objects' relative spatial positions in the image region. First, the CNN is trained with a set of RGB images containing prefabricated-part color, shape and spatial features. Then, the trained CNN is used to identify the prefabricated part in a new image sequence, give its boundary, and separate it from the rest of the image. Finally, the value of every pixel in the image is recomputed with a mask kernel, pixels outside the target prefabricated-part region are shielded, and the frame-cropping of the target prefabricated-part pixel region is achieved.
In this embodiment, a Faster R-CNN architecture is adopted to build a CNN that can be trained on multi-frame dense video streams, so as to shorten the training/learning time of the network and increase the speed of extracting the color, shape and spatial features of the prefabricated part, thereby realizing mask processing of the prefabricated-part image.
The Faster R-CNN architecture contains four modules: a feature extraction network, a region candidate network, region-of-interest pooling, and a classification-regression network. Specifically, the feature extraction network takes a picture as input and outputs its red, green and blue channel features, which serve as input to the subsequent region candidate network. The region candidate network takes those channel features as input and outputs a number of regions of interest. Each region of interest is represented by a probability value (used to judge whether it is foreground or background) and four coordinate values; the probability value, obtained by two-class classification of each region with a normalized exponential (softmax) function, indicates how likely the region contains an object, and the coordinate values give the predicted object position, which is regressed against the ground-truth coordinates during training so that the predicted position is more accurate at test time. Region-of-interest pooling takes the regions of interest output by the region candidate network and the channel features output by the feature extraction network, combines the two into fixed-size region feature maps, and passes them to the following fully connected network for classification. The classification-regression network takes the region feature maps from the previous layer and outputs the category of the object in each region of interest and its precise position in the image; this layer classifies with a normalized exponential (softmax) function and refines the object position by bounding-box regression.
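As an illustration of the detect-then-mask step, the following sketch runs a pretrained torchvision Faster R-CNN and zeroes out all pixels outside the best-scoring detection box; the COCO pretrained weights, the 0.7 score threshold and the function name crop_target_region are placeholders, since the invention trains the network on its own prefabricated-component RGB-D dataset.

```python
# Illustrative sketch: Faster R-CNN detection followed by masking of all pixels
# outside the highest-scoring box (the "frame-cropping" of the target region).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def crop_target_region(image_np):
    """image_np: HxWx3 uint8 RGB array. Returns the masked image and the detection box."""
    img = to_tensor(image_np)                    # CHW float tensor in [0, 1]
    out = model([img])[0]                        # dict with "boxes", "labels", "scores"
    keep = out["scores"] > 0.7                   # placeholder confidence threshold
    if keep.sum() == 0:
        return None, None
    x1, y1, x2, y2 = out["boxes"][keep][0].int().tolist()   # best-scoring box
    masked = image_np.copy()
    masked[:y1, :] = 0                           # shield pixels outside the target region
    masked[y2:, :] = 0
    masked[:, :x1] = 0
    masked[:, x2:] = 0
    return masked, (x1, y1, x2, y2)
```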
(2) Iterative estimation of the prefabricated-part 6D pose: the prefabricated-part pixel blocks obtained by the mask processing in step (1) are processed with the CNN architecture to iteratively estimate the 6D pose of the prefabricated part.
In this embodiment, 6D pose iterative estimation means iteratively estimating the spatial position and orientation of the target object in the camera coordinate system. Hoisting and splicing between prefabricated components is currently achieved by docking the transverse/longitudinal steel bars extending from one component into the reserved bar-insertion holes of the component on the other side. For example, when hoisting a prefabricated column, the steel bars extending from the top of the lower column are inserted into the reserved grouting-sleeve openings at the bottom of the upper column, and concrete is poured into the sleeve openings to connect the upper and lower columns; such hoisting and splicing therefore requires accurate positioning and alignment.
It should be appreciated that the 6D poses of the target prefabricated parts in the camera coordinate system, including the part to be hoisted and the existing part at the splicing location, can be obtained by the prefabricated-part 6D pose iterative estimation algorithm.
Specifically, after the target prefabricated-part pixel block has been cropped out of the RGB-D image by mask processing, iterative 6D pose estimation of the target part makes full use of the two complementary data sources, the color (RGB) and depth image channels. The CNN architecture comprises the following three parts:
(1) A full convolutional network (Fully Convolutional Network, FCN) for processing color information, which maps each pixel point in the masked prefabricated-part pixel block into a color feature space as a color feature embedding.
(2) A coordinate conversion unit, which converts the depth-channel information in the masked prefabricated-part pixel block into point cloud data based on the camera intrinsics and maps each data point into a geometric feature space as a geometric feature embedding (see the back-projection sketch after this list).
(3) A CNN network for pixel-level image fusion, which combines the two embeddings (color and depth, i.e. the color feature embedding of (1) and the geometric feature embedding of (2)) and iteratively outputs the 6D pose of the masked prefabricated-part pixel block based on an unsupervised confidence score.
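As a concrete illustration of part (2), the following numpy sketch back-projects a masked depth patch into a point cloud with the pinhole camera model; the focal lengths, principal point and depth scale in the usage comment are placeholder intrinsics, not values from the invention.

```python
# Illustrative sketch of part (2): back-projecting a masked depth patch into a point
# cloud with the pinhole camera model. fx, fy, cx, cy and depth_scale are camera-specific
# and shown here only as placeholders.
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy, depth_scale=0.001):
    """depth: HxW uint16 depth image (0 = no measurement). Returns Nx3 points in metres."""
    v, u = np.nonzero(depth)                      # pixel rows (v) and columns (u) with valid depth
    z = depth[v, u].astype(np.float64) * depth_scale
    x = (u - cx) * z / fx                         # X = (u - cx) * Z / fx
    y = (v - cy) * z / fy                         # Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=1)

# Example with placeholder intrinsics:
# cloud = depth_to_point_cloud(depth_patch, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
```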
This embodiment builds a Faster R-CNN (Region-based CNN) that can be trained on multi-frame dense video streams, so as to shorten the training/learning time of the network and increase the speed of estimating the prefabricated-part 6D pose. It should be appreciated that the 6D pose estimation of the prefabricated part is performed on a multi-frame dense video stream.
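To make the structure of parts (1)-(3) more tangible, the following PyTorch skeleton shows one plausible way the per-pixel color embeddings and per-point geometric embeddings could be fused and regressed to a pose hypothesis plus confidence per point; the class name, layer sizes and output parameterization (quaternion plus translation) are assumptions, as the text does not specify the exact architecture.

```python
# Illustrative skeleton of the fusion CNN: per-pixel colour embeddings and per-point
# geometric embeddings are concatenated and regressed to a pose hypothesis and an
# unsupervised confidence for each point. Layer sizes are placeholders.
import torch
import torch.nn as nn

class PoseFusionNet(nn.Module):
    def __init__(self, color_dim=64, geo_dim=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv1d(color_dim + geo_dim, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 128, 1), nn.ReLU(),
        )
        self.head = nn.Conv1d(128, 7 + 1, 1)    # quaternion (4) + translation (3) + confidence (1)

    def forward(self, color_emb, geo_emb):
        # color_emb, geo_emb: (B, C, N) per-point feature embeddings
        x = self.fuse(torch.cat([color_emb, geo_emb], dim=1))
        out = self.head(x)                       # (B, 8, N): one pose hypothesis per point
        pose, conf = out[:, :7], torch.sigmoid(out[:, 7:])
        return pose, conf                        # the highest-confidence hypothesis is kept
```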
(3) 3D visual guidance for automatic robotic hoisting and splicing of the prefabricated part: the 6D pose to which the robot end effector is to be moved is acquired from step (2), and 3D visual guidance of the hoisting and splicing is performed based on RRT-Star path planning, guiding the hoisting robot to automatically plan the motion path of its end effector so that the grasped prefabricated part is moved to the splicing position.
Hand-eye coordinate conversion of the hoisting robot: a KUKA industrial robot (model KR1000 titan, payload 750-1300 kg, reach 3202-3601 mm) mounted on a ground linear slide rail is used as the prefabricated-part hoisting actuator, as shown in fig. 4. To establish the coordinate conversion relationship between the camera (the "eye" of the hoisting robot) and the hoisting end effector (its "hand"), a calibration plate is fixed at the hoisting end effector. By recording the 6D pose of a point on the calibration plate relative to the camera coordinate system, the transformation matrix from the hoisting end effector (i.e. the point on the calibration plate) to the camera, that is, the hand-eye calibration matrix of the hoisting robot, is solved. Through this hand-eye calibration matrix, the 6D pose of the splicing location in the camera coordinate system is converted into the 6D pose in the hoisting end-effector coordinate system.
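A hedged sketch of this calibration step using OpenCV's calibrateHandEye is given below; the pose lists would be recorded by moving the robot through several configurations while the camera observes the calibration plate fixed at the end effector, and the helper name solve_hand_eye is hypothetical. OpenCV's convention is the eye-in-hand form (AX = XB); an eye-to-hand setup can be handled by feeding the inverted robot poses, a detail omitted here.

```python
# Illustrative sketch: solving the hand-eye transform with OpenCV's calibrateHandEye.
# The pose lists must be filled with rotations (3x3) and translations (3x1) recorded
# at several (>= 3) distinct robot poses.
import cv2
import numpy as np

def solve_hand_eye(R_gripper2base, t_gripper2base, R_target2cam, t_target2cam):
    """Returns the 4x4 homogeneous transform from the camera frame to the end-effector frame."""
    R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
        R_gripper2base, t_gripper2base, R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI)
    T = np.eye(4)
    T[:3, :3] = R_cam2gripper
    T[:3, 3] = t_cam2gripper.ravel()
    return T   # use T to map 6D poses from the camera frame into the end-effector frame
```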
RRT-Star path planning solution: RRT-Star is a sampling-based path planning algorithm suited to mobile-robot path planning in high-dimensional spaces under complex constraints. Its basic idea is to search toward the target point step by step through randomly generated points; it avoids obstacles effectively, keeps the path from falling into local minima, and converges quickly. Prior path-planning designs for mechanical arms treat the component as a particle, whereas building components are rigid bodies with specific shapes in motion; for example, while the mechanical arm grasps and moves a prefabricated column, the column is a rigid body rather than a mass point. Planning motion with the grasped building component treated only as a mass point deviates from the real scene, so the obstacle-avoidance function is not fully considered. The invention therefore establishes a path-planning and obstacle-avoidance algorithm for automatic installation by the mechanical arm based on a rigid-body model of the component to be grasped. The specific method is to embed a Minkowski-difference operator in the RRT-Star path-planning architecture. The Minkowski difference is the difference set of two point sets in Euclidean space; when the difference set contains the zero vector, the two point sets overlap and the collision detection result is "collision expected". The combined RRT-Star and Minkowski-difference solution is innovative: it adds multi-dimensional kinematic constraints to the redundant system of equations in the independent variables, yielding the optimal motion path of the robot along the linear slide rail and the optimal motion trajectory of each robot joint, which are then used to control the work.
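The Minkowski-difference collision test at the heart of this step can be sketched as follows for two convex 2D vertex sets; the RRT-Star sampling loop that would invoke this test at each candidate node, and the extension to 3D rigid bodies, are not reproduced here.

```python
# Illustrative sketch of the Minkowski-difference collision test: two convex point
# sets overlap if and only if the origin (zero vector) lies inside the convex hull
# of their Minkowski difference A - B.
import numpy as np
from scipy.spatial import Delaunay

def minkowski_difference(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """A: Nx2, B: Mx2 vertex sets. Returns the (N*M)x2 pairwise difference set."""
    return (A[:, None, :] - B[None, :, :]).reshape(-1, 2)

def convex_sets_collide(A: np.ndarray, B: np.ndarray) -> bool:
    diff = minkowski_difference(A, B)
    hull = Delaunay(diff)                                  # triangulation of the difference set
    return bool(hull.find_simplex(np.zeros((1, 2)))[0] >= 0)   # origin inside => overlap

# Example: two unit squares, one shifted by (0.5, 0.5) -> collision expected.
sq = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
print(convex_sets_collide(sq, sq + 0.5))   # True  (overlapping)
print(convex_sets_collide(sq, sq + 2.0))   # False (disjoint)
```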
The invention develops algorithms for prefabricated-part feature extraction and mask processing, iterative 6D pose estimation of prefabricated parts, and 3D visual guidance for automatic robotic hoisting and splicing. The method finally issues control signals to the digital processor of the mechanical arm to guide it in automatically installing the prefabricated component, effectively addressing the low assembly automation and low construction efficiency of the traditional fabricated structural system.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (5)

1. A 3D visual guidance method for hoisting and splicing prefabricated parts based on deep learning, characterized by comprising the following steps:
(1) Based on the RGB-D image, extracting features of the prefabricated part and performing mask processing; wherein, for identification of the prefabricated part, the image contains a large-scale background with no annotation of the position of the target prefabricated part, and the prefabricated part may appear at any position in the image and in various sizes and shapes;
(2) Processing the prefabricated-part pixel blocks obtained by the mask processing in step (1) based on a CNN architecture to iteratively estimate the 6D pose of the prefabricated part;
the CNN architecture in the step (2) includes:
a full convolutional network for processing color information, which maps each pixel point in the prefabricated-part pixel block obtained by the mask processing in step (1) into a color feature space as a color feature embedding;
a coordinate conversion unit, which converts the depth-channel information in the prefabricated-part pixel block obtained by the mask processing in step (1) into point cloud data and maps each data point into a geometric feature space as a geometric feature embedding; and
a CNN network for pixel-level image fusion, which combines the color feature embedding and the geometric feature embedding and iteratively outputs the 6D pose of the masked prefabricated-part pixel block based on an unsupervised confidence score;
(3) Acquiring, through step (2), the 6D pose to which the robot end effector is to be moved, and performing 3D visual guidance of the hoisting and splicing of the prefabricated part based on RRT-Star path planning, so as to guide the hoisting robot to automatically plan the motion path of its end effector and move the grasped prefabricated part to the splicing position;
in step (3), given the 6D pose to which the robot end effector is to be moved, the motion position and joint angle of each robot joint are solved, so as to provide 3D visual guidance for the hoisting and splicing of the prefabricated parts and to control the mechanical arm;
a path-planning and obstacle-avoidance algorithm for automatic installation by the mechanical arm is established based on a rigid-body model of the component to be grasped; the specific method is to embed a Minkowski-difference operator in the RRT-Star path-planning architecture, the Minkowski difference being the difference set of two point sets in Euclidean space; the RRT-Star and Minkowski-difference solutions are superposed, and multi-dimensional kinematic constraints are added to the redundant system of equations in the independent variables, so as to obtain the optimal motion path of the robot on the linear slide rail and the optimal motion trajectory of each robot joint and thereby control the mechanical arm.
2. The 3D visual guidance method for hoisting and splicing prefabricated components based on deep learning according to claim 1, wherein the RGB-D images in the step (1) are acquired by an Intel RealSense depth camera to obtain color images and depth images.
3. The deep-learning-based prefabricated part hoisting and splicing 3D visual guidance method of claim 2, wherein the color image comprises surface color information and texture information of objects in the simulated hoisting scene, and the depth image comprises spatial shape information of objects in the simulated hoisting scene.
4. The deep-learning-based prefabricated part hoisting and splicing 3D visual guidance method according to claim 1, wherein the mask processing in step (1) is implemented based on a Faster R-CNN architecture, which includes a feature extraction network, a region candidate network, region-of-interest pooling, and a classification-regression network.
5. The 3D visual guidance method for hoisting and splicing prefabricated parts based on deep learning according to claim 1, wherein the iterative estimation of the 6D pose of the prefabricated parts in the step (2) is performed on a multi-frame dense video stream.
CN202310074563.4A 2023-02-07 2023-02-07 Prefabricated part hoisting splicing 3D visual guiding method based on deep learning Active CN116416307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310074563.4A CN116416307B (en) 2023-02-07 2023-02-07 Prefabricated part hoisting splicing 3D visual guiding method based on deep learning


Publications (2)

Publication Number Publication Date
CN116416307A CN116416307A (en) 2023-07-11
CN116416307B true CN116416307B (en) 2024-04-02

Family

ID=87053943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310074563.4A Active CN116416307B (en) 2023-02-07 2023-02-07 Prefabricated part hoisting splicing 3D visual guiding method based on deep learning

Country Status (1)

Country Link
CN (1) CN116416307B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215080A (en) * 2018-09-25 2019-01-15 清华大学 6D Attitude estimation network training method and device based on deep learning Iterative matching
CN110376195A (en) * 2019-07-11 2019-10-25 中国人民解放军国防科技大学 Explosive detection method
CN110962130A (en) * 2019-12-24 2020-04-07 中国人民解放军海军工程大学 Heuristic RRT mechanical arm motion planning method based on target deviation optimization
CN111145253A (en) * 2019-12-12 2020-05-12 深圳先进技术研究院 Efficient object 6D attitude estimation algorithm
CN114742888A (en) * 2022-03-12 2022-07-12 北京工业大学 6D attitude estimation method based on deep learning
CN114912287A (en) * 2022-05-26 2022-08-16 四川大学 Robot autonomous grabbing simulation system and method based on target 6D pose estimation


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Rui Zeng et al. View planning in robot active vision: A survey of systems, algorithms, and applications. 2020, 225-245. *
Zou Dewei. Intelligent grasping by collaborative robots based on machine vision. CNKI Master's Electronic Journals, Information Science and Technology Series; pp. 1-72. *
Zou Dewei. Intelligent grasping by collaborative robots based on machine vision. CNKI Master's Electronic Journals, Information Science and Technology Series. 2022, pp. 1-72. *

Also Published As

Publication number Publication date
CN116416307A (en) 2023-07-11


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant