CN117037062A - Target object grabbing method, system, electronic equipment and storage medium - Google Patents

Target object grabbing method, system, electronic equipment and storage medium

Info

Publication number
CN117037062A
Authority
CN
China
Prior art keywords
image
target object
mechanical arm
scene
shooting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311021072.XA
Other languages
Chinese (zh)
Inventor
乔辉
赵祯
胡鹏杰
关俊涛
游冰
杨建光
李刚
贺提胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sinomach Internet Research Institute Henan Co ltd
Original Assignee
Sinomach Internet Research Institute Henan Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sinomach Internet Research Institute Henan Co ltd
Priority to CN202311021072.XA
Publication of CN117037062A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a target object grabbing method, a target object grabbing system, electronic equipment and a storage medium, and belongs to the technical field of image processing. The target object grabbing method comprises the following steps: acquiring original images of a target object photographed under a plurality of photographing parameters; labeling the original images, and extracting a foreground image corresponding to the target object from the original images according to the labeling result; performing data expansion on the foreground image; shooting a scene image of a preset scene, and fusing the expanded foreground image with the scene image to obtain a fused image; training an image segmentation model with the fused image; and receiving an image of the working area of the mechanical arm shot by a camera, and identifying the object contour of the target object from that image with the image segmentation model, so as to control the mechanical arm to grasp the target object according to the object contour. The application can accurately identify the outline of the target object and improve the grabbing precision of the mechanical arm on the target object.

Description

Target object grabbing method, system, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and a system for capturing a target object, an electronic device, and a storage medium.
Background
Mechanical arms are important devices in mechanized and automated production and are widely used on automated production lines, where they complete various operation tasks through programming. Grabbing and placing is one of the essential functions needed to complete such tasks. With the development of intelligent robots, the environment-sensing capability of mechanical arms has increased, making intelligent automatic grabbing possible, and this has become an important research topic in the field. With the improvement of vision sensor performance and continued research on related algorithms, vision-based robotic grasping has become the preferred solution in many application scenarios.
In the related art, detection of a target object is generally realized with an image segmentation model, but such a model has high requirements on target classification and edge-pixel labeling, the data set is costly to produce, the outline of the target object cannot be accurately identified, and the grabbing precision of the mechanical arm is low.
Therefore, how to accurately identify the outline of the target object and improve the grabbing precision of the mechanical arm are technical problems that currently need to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide a target object grabbing method, a target object grabbing system, electronic equipment and a storage medium, which can accurately identify the outline of a target object and improve the grabbing precision of a mechanical arm on the target object.
In order to solve the technical problems, the present application provides a target object capturing method, which includes:
acquiring an original image of a target object photographed under a plurality of photographing parameters; the target object is in a preset scene, and the shooting parameters comprise a shooting angle and a shooting scale;
labeling the original image, and extracting a foreground image corresponding to the target object from the original image according to a labeling result;
performing data expansion on the foreground image, and updating a mask matrix corresponding to the expanded foreground image;
shooting a scene image of the preset scene, and fusing the expanded foreground image with the scene image to obtain a fused image;
training an image segmentation model by utilizing the fusion image;
and receiving an image of a working area of the mechanical arm shot by a camera, and identifying the object contour of the target object from the image of the working area of the mechanical arm by utilizing the image segmentation model so as to control the mechanical arm to grasp the target object according to the object contour.
Optionally, identifying an object contour of the target object from the robot arm working area image using the image segmentation model includes:
extracting a region of interest containing the target object from the mechanical arm working region image by using a target detection model;
and inputting the region of interest into the image segmentation model to obtain the object contour of the target object.
Optionally, the image segmentation model is a U2-Net model, the target detection model is a YOLOv5 model, the multi-scale feature fusion network of the YOLOv5 model includes a first feature layer, a second feature layer, a third feature layer and a fourth feature layer which are sequentially connected, the size of the first feature layer is 20×20, the size of the second feature layer is 40×40, the size of the third feature layer is 80×80, and the size of the fourth feature layer is 160×160.
Optionally, fusing the expanded foreground image and the scene image to obtain a fused image, including:
and performing poisson fusion operation on the expanded foreground image and the scene image to obtain the fusion image.
Optionally, labeling the original image includes:
And labeling the original image at the pixel level by using a labelimg data labeling tool.
Optionally, controlling the mechanical arm to grasp the target object according to the object profile includes:
generating a corresponding closed figure according to the object outline;
determining boundary coordinate values of the closed graph; the boundary coordinate values comprise an X-axis coordinate maximum value, an X-axis coordinate minimum value, a Y-axis coordinate maximum value and a Y-axis coordinate minimum value;
determining the circumscribed rectangle of the closed graph according to the boundary coordinate value;
rotating the closed graph for N times, and recording boundary coordinate values and circumscribed rectangles of the closed graph after each rotation;
setting the circumscribed rectangle with the smallest area as the smallest circumscribed rectangle of the target object;
and controlling the mechanical arm to grasp the target object according to the boundary coordinate value of the minimum circumscribed rectangle.
Optionally, controlling the mechanical arm to grasp the target object according to the boundary coordinate value of the minimum bounding rectangle includes:
determining a grabbing point and a clamping jaw angle according to the boundary coordinate value of the minimum circumscribed rectangle; the grabbing points are center points of two long sides of the minimum circumscribed rectangle, and the clamping jaw angles are parallel to two short sides of the minimum circumscribed rectangle;
And controlling the mechanical arm to grasp the target object according to the grasping point and the clamping jaw angle.
The application also provides a target object grabbing system, which comprises:
the image shooting module is used for acquiring original images shot by the target object under a plurality of shooting parameters; the target object is in a preset scene, and the shooting parameters comprise a shooting angle and a shooting scale;
the labeling module is used for labeling the original image and extracting a foreground image corresponding to the target object from the original image according to a labeling result;
the data expansion module is used for carrying out data expansion on the foreground image and updating a mask matrix corresponding to the expanded foreground image;
the fusion module is used for shooting a scene image of the preset scene, and fusing the expanded foreground image with the scene image to obtain a fused image;
the training module is used for training the image segmentation model by utilizing the fusion image;
and the grabbing control module is used for receiving the image of the working area of the mechanical arm shot by the camera, and identifying the object contour of the target object from the image of the working area of the mechanical arm by utilizing the image segmentation model so as to control the mechanical arm to grab the target object according to the object contour.
The present application also provides a storage medium having stored thereon a computer program which, when executed, implements the steps of the above-described target object gripping method.
The application also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps executed by the target object grabbing method when calling the computer program in the memory.
The application provides a target object grabbing method, which comprises the following steps: acquiring an original image of a target object photographed under a plurality of photographing parameters; the target object is in a preset scene, and the shooting parameters comprise a shooting angle and a shooting scale; labeling the original image, and extracting a foreground image corresponding to the target object from the original image according to a labeling result; performing data expansion on the foreground image, and updating a mask matrix corresponding to the expanded foreground image; shooting a scene image of the preset scene, and fusing the expanded foreground image with the scene image to obtain a fused image; training an image segmentation model by utilizing the fusion image; and receiving an image of a working area of the mechanical arm shot by a camera, and identifying the object contour of the target object from the image of the working area of the mechanical arm by utilizing the image segmentation model so as to control the mechanical arm to grasp the target object according to the object contour.
The method comprises the steps of obtaining an original image of a target object under a plurality of shooting parameters, and obtaining a foreground image corresponding to the target object by marking the original image; the application also expands the data of the foreground image and updates the mask matrix corresponding to the expanded foreground image; according to the application, the target object is shot under a plurality of shooting parameters, and the foreground image is subjected to data expansion to obtain the corresponding foreground image of the target object under a plurality of shooting scenes. According to the application, the expanded foreground image and the scene image are fused to obtain the fused image, so that the number of data set samples used for training the image segmentation model is increased, the outline of the target object can be accurately identified based on the image segmentation model trained by the fused image, and the grabbing precision of the mechanical arm on the target object can be increased. The application also provides a target object grabbing system, a storage medium and an electronic device, which have the beneficial effects and are not described herein.
Drawings
For a clearer description of embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
Fig. 1 is a flowchart of a target object capturing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a camera mounting position according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a multi-scale feature fusion network Neck of a YOLOv5 model according to an embodiment of the present application;
FIG. 4 is a flow chart of a data set required for training a U2-Net model according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a detection result of a YOLOv5 model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a minimum bounding rectangle according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a target object capturing system according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, fig. 1 is a flowchart of a target object capturing method according to an embodiment of the present application.
The specific steps may include:
s101: acquiring an original image of a target object photographed under a plurality of photographing parameters;
the embodiment can be used for running electronic devices such as a computer and a mechanical arm with an image segmentation model (also called a saliency target detection model, such as U-Net, U2-Net and FCN), and the target object is an object to be grabbed by the mechanical arm, for example, a workpiece, a circuit board, goods to be carried and the like.
Before the step, a plurality of shooting angles and a plurality of shooting scales can be combined to obtain a plurality of shooting parameters, and then at least one image of a target object is shot under each shooting parameter, so that a plurality of original images are obtained. The shooting angle is used for describing the shooting direction of the camera, and the shooting scale is used for describing the proportion of the target object in an imaging picture.
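As an illustration, the shooting parameters can be enumerated as the Cartesian product of the candidate angles and scales. The following minimal Python sketch uses hypothetical angle and scale values and only shows how the parameter combinations are formed.

```python
# Minimal sketch: shooting parameters as combinations of angle and scale.
# The concrete values below are assumptions for illustration only.
from itertools import product

shooting_angles = [0, 30, 60, 90]      # camera directions, in degrees (hypothetical)
shooting_scales = [0.5, 1.0, 1.5]      # proportion of the object in the frame (hypothetical)

shooting_parameters = list(product(shooting_angles, shooting_scales))
for angle, scale in shooting_parameters:
    # At least one original image of the target object is captured per combination.
    print(f"capture original image at angle={angle} deg, scale={scale}")
```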
In the process of executing S101, the target object is always in a preset scene; the preset scene can be a working scene of a mechanical arm such as a storage platform, a production workshop and a warehouse.
S102: labeling the original image, and extracting a foreground image corresponding to the target object from the original image according to a labeling result;
On the basis of the obtained original image, the foreground and background regions in the original image can be labeled to obtain a labeling result, and the foreground image corresponding to the target object is then extracted from the original image according to that result. This step may also store the mask matrix corresponding to the foreground image, that is, the foreground mask matrix.
As a possible implementation, the labelimg data labeling tool can be used to label the original image required for training at the pixel level.
S103: performing data expansion on the foreground image, and updating a mask matrix corresponding to the expanded foreground image;
After the foreground image is obtained, data expansion operations such as saturation adjustment, brightness adjustment, Gaussian noise addition, motion blur processing, random scaling, random rotation and random affine transformation can be performed on it to obtain new foreground images; in this embodiment, the original foreground image and the new foreground images are collectively referred to as the expanded foreground images. After data expansion, the mask matrix of each foreground image is updated so that it remains consistent with the shape edges of the expanded foreground image.
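A minimal sketch of such an expansion step is shown below, assuming OpenCV and NumPy and a single-channel 0/255 mask; the particular parameter values are illustrative only. The key point it demonstrates is that geometric transforms are applied to the image and its mask with the same matrix, while photometric changes touch the image alone.

```python
# Sketch of one expansion pass; parameter values are assumptions, not from the patent.
import cv2
import numpy as np

def augment_foreground(image, mask, angle_deg=15, scale=1.1, brightness=20):
    h, w = image.shape[:2]
    # One affine matrix shared by image and mask keeps the mask aligned with the foreground edges.
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, scale)
    image_aug = cv2.warpAffine(image, m, (w, h), flags=cv2.INTER_LINEAR)
    mask_aug = cv2.warpAffine(mask, m, (w, h), flags=cv2.INTER_NEAREST)

    # Photometric expansion (image only): brightness shift and Gaussian noise.
    image_aug = cv2.convertScaleAbs(image_aug, alpha=1.0, beta=brightness)
    noise = np.random.normal(0, 5, image_aug.shape).astype(np.int16)
    image_aug = np.clip(image_aug.astype(np.int16) + noise, 0, 255).astype(np.uint8)
    return image_aug, mask_aug
```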
S104: shooting a scene image of the preset scene, and fusing the expanded foreground image with the scene image to obtain a fused image;
In this step, a scene image of the preset scene is captured, that is, an image of the preset scene taken while no target object is placed in it. As a possible embodiment, scene images of the preset scene may be captured under a plurality of shooting parameters.
The expanded foreground images are fused with the scene images to obtain fused images; since there may be multiple expanded foreground images and multiple scene images, they can be combined pairwise to obtain a large number of fused images. In this embodiment, a Poisson fusion algorithm may be used to fuse a foreground image with a scene image.
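A minimal sketch of the fusion step, assuming OpenCV's seamlessClone (Poisson blending) with the NORMAL_CLONE mode discussed later in this description; the file names and the random-placement logic are illustrative assumptions.

```python
# Sketch: Poisson fusion of one expanded foreground onto one scene image.
# File names are hypothetical; the scene is assumed to be larger than the foreground patch.
import random
import cv2

scene = cv2.imread("scene.jpg")                                   # preset scene, no target object
foreground = cv2.imread("foreground_aug.jpg")                     # expanded foreground image
mask = cv2.imread("foreground_mask.png", cv2.IMREAD_GRAYSCALE)    # updated mask matrix

fh, fw = foreground.shape[:2]
sh, sw = scene.shape[:2]
# Random placement of the foreground patch centre inside the scene.
cx = random.randint(fw // 2, sw - fw // 2 - 1)
cy = random.randint(fh // 2, sh - fh // 2 - 1)

fused = cv2.seamlessClone(foreground, scene, mask, (cx, cy), cv2.NORMAL_CLONE)
cv2.imwrite("fused.jpg", fused)
```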
S105: training an image segmentation model by utilizing the fusion image;
After the fused images are obtained, a data set containing them can be constructed and divided into a training set, a test set and a validation set, and the image segmentation model is trained with this data set so that it acquires the ability to extract the outline of the target object. In this way a large amount of salient object detection data can be generated and labeled to build the data set and guide the training of the image segmentation model. Salient object detection is essentially a binary classification task that only distinguishes foreground from background, so the labeling cost is relatively low and the algorithm runs fast.
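The following schematic PyTorch loop illustrates the data-set split and training objective described above. The random tensors and the small placeholder network stand in for the fused-image data set and the segmentation model (e.g. U2-Net); they are assumptions for illustration, not the concrete implementation.

```python
# Schematic training loop over a fused-image data set (placeholder data and model).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in data: in practice these tensors come from the fused images and their masks.
images = torch.rand(100, 3, 160, 160)
masks = (torch.rand(100, 1, 160, 160) > 0.5).float()
dataset = TensorDataset(images, masks)

# 75% / 20% / 5% split, as in the data-set construction step later in this description.
n_train = int(0.75 * len(dataset))
n_test = int(0.20 * len(dataset))
n_val = len(dataset) - n_train - n_test
train_set, test_set, val_set = random_split(dataset, [n_train, n_test, n_val])

# Tiny placeholder network standing in for the segmentation model; it only illustrates
# the binary foreground/background objective, not the real architecture.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1),
)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):
    for x, y in DataLoader(train_set, batch_size=8, shuffle=True):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
# test_set and val_set would be used for evaluation and model selection.
```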
S106: and receiving an image of a working area of the mechanical arm shot by a camera, and identifying the object contour of the target object from the image of the working area of the mechanical arm by utilizing the image segmentation model so as to control the mechanical arm to grasp the target object according to the object contour.
The mechanical arm is connected with the camera, but the camera can be mounted independently of the arm so that it photographs the working area of the mechanical arm. The mechanical arm working area image refers to an image of the mechanical arm's working area captured by the camera in the preset scene.
According to the embodiment, the object contour of the target object can be identified from the working area image of the mechanical arm by using the image segmentation model, and then the mechanical arm is controlled to grasp the target object according to the object contour.
The method comprises the steps that original images of a target object under a plurality of shooting parameters are obtained, and foreground images corresponding to the target object are obtained through labeling the original images; the embodiment also expands the data of the foreground image and updates the mask matrix corresponding to the expanded foreground image; according to the embodiment, the target object is shot under a plurality of shooting parameters, and the foreground image is subjected to data expansion to obtain the foreground image corresponding to the target object under a plurality of shooting scenes. According to the embodiment, the expanded foreground image and the scene image are fused to obtain the fused image, so that the number of data set samples used for training the image segmentation model is increased, the outline of the target object can be accurately identified based on the image segmentation model trained by the fused image, and the grabbing precision of the mechanical arm on the target object can be improved.
As a further introduction to the corresponding embodiment of fig. 1, the present embodiment may use a target detection model (such as R-CNN, YOLO, etc.) and an image segmentation model to cooperate with each other to obtain an object profile of the target object. The specific process is as follows: extracting a region of interest containing the target object from the mechanical arm working region image by using a target detection model; and inputting the region of interest into the image segmentation model to obtain the object contour of the target object.
The image segmentation model can be a U2-Net model, and the target detection model can be an improved YOLOv5 model. The improved multi-scale feature fusion network of the YOLOv5 model comprises a first feature layer, a second feature layer, a third feature layer and a fourth feature layer which are sequentially connected, wherein the size of the first feature layer is 20 multiplied by 20, the size of the second feature layer is 40 multiplied by 40, the size of the third feature layer is 80 multiplied by 80, and the size of the fourth feature layer is 160 multiplied by 160. The embodiment adjusts the network structure of the feature extraction part in the YOLOv5 model, so that the model still has strong feature extraction capability for small-scale features after multiple downsampling and convolution operations are carried out on the model.
According to the embodiment, on the basis of an original YOLOv5 model, a 160-160 feature layer is added to a multiscale feature fusion network Neck part, so that the extraction capacity of the model for small-scale features is improved, and the detection effect of the model on small-scale targets is improved.
The YOLO model is a deep-learning target detection method and, as a one-stage detection algorithm, is one of the most commonly used detection methods at present. YOLOv5 is an upgraded version of YOLO. Its network structure contains four independent convolution blocks with different convolution kernel sizes and depths, which can adaptively extract features of different scales and improve detection accuracy. A feature pyramid is used to handle input images of different scales: feature maps at different levels are up-sampled and down-sampled to generate a multi-scale feature pyramid, so that the algorithm can detect target objects of different scales. Data enhancement techniques and a distributed training strategy improve the robustness and generalization performance of the model.
U2-Net is an efficient and accurate image segmentation model that has been widely used in salient object detection tasks. The model depth is increased by using two parallel U-Nets, thereby improving performance. Specifically, U2-Net processes high-resolution and low-resolution image features through the two U-Nets respectively, and the two feature maps are spliced together as the final output of the model.
As a further introduction to the corresponding embodiment of fig. 1, in order to reduce the influence of the boundary between the fused foreground image and the scene image on the prediction accuracy of the image segmentation model, this embodiment may perform a Poisson fusion operation on the expanded foreground image and the scene image to obtain the fused image.
As a further introduction to the corresponding embodiment of fig. 1, the present embodiment may determine a minimum bounding rectangle corresponding to the object contour, and determine the grabbing point according to the coordinate of the minimum bounding rectangle in the coordinate system of the mechanical arm, so that the mechanical arm performs the corresponding grabbing operation according to the coordinate of the grabbing point.
Specifically, the embodiment may generate a corresponding closed figure according to the object contour; determining boundary coordinate values of the closed graph; the boundary coordinate points comprise an X-axis coordinate maximum value Xmax, an X-axis coordinate minimum value Xmin, a Y-axis coordinate maximum value Ymax and a Y-axis coordinate minimum value Ymin; determining the circumscribed rectangle of the closed graph according to the boundary coordinate value; rotating the closed graph for N times, and recording boundary coordinate values and circumscribed rectangles of the closed graph after each rotation; setting the circumscribed rectangle with the smallest area as the smallest circumscribed rectangle of the target object; and controlling the mechanical arm to grasp the target object according to the boundary coordinate value of the minimum circumscribed rectangle.
The coordinates of the four vertexes of the circumscribed rectangle of the closed graph determined according to the boundary coordinate values are: (Xmax, Ymax), (Xmax, Ymin), (Xmin, Ymax), (Xmin, Ymin).
As a possible embodiment, the target object may be grasped by: determining a grabbing point and a clamping jaw angle according to the boundary coordinate value of the minimum circumscribed rectangle; the grabbing points are center points of two long sides of the minimum circumscribed rectangle, and the clamping jaw angles are parallel to two short sides of the minimum circumscribed rectangle; and controlling the mechanical arm to grasp the target object according to the grasping point and the clamping jaw angle.
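A sketch of this grasp-pose computation, assuming OpenCV/NumPy; here cv2.minAreaRect stands in for the rotation-based minimum-rectangle procedure detailed later in this description, and the function name is illustrative.

```python
# Sketch: grasp points are the mid-points of the two long sides of the minimum
# bounding rectangle; the jaw closes along the line joining them (parallel to the short sides).
import cv2
import numpy as np

def grasp_from_contour(contour):
    """contour: (N, 2) float32/int32 array of foreground edge points."""
    rect = cv2.minAreaRect(contour)          # ((cx, cy), (w, h), angle)
    box = cv2.boxPoints(rect)                # 4 corner points, ordered around the rectangle
    # Consecutive corners form the sides; pick the pair of opposite long sides.
    sides = [(box[i], box[(i + 1) % 4]) for i in range(4)]
    lengths = [np.linalg.norm(p2 - p1) for p1, p2 in sides]
    i_long = int(np.argmax(lengths))         # one long side; the opposite side is i_long + 2
    a = (sides[i_long][0] + sides[i_long][1]) / 2.0
    b = (sides[(i_long + 2) % 4][0] + sides[(i_long + 2) % 4][1]) / 2.0
    # Jaw angle: direction from one grasp point to the other (parallel to the short sides).
    jaw_angle = np.degrees(np.arctan2(b[1] - a[1], b[0] - a[0]))
    return a, b, jaw_angle
```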
The object contour obtained by the image segmentation model is converted into the pose estimate required by the grabbing task, which guides the end jaw of the mechanical arm to grasp autonomously. The result of the image segmentation model is a mask matrix of the foreground in the detection area, and this mask matrix needs to be converted into pose information of the target to be grabbed, such as its angle and size.
The flow described in the above embodiment is explained below by way of an embodiment in practical application.
Conventional mechanical arm grabbing is typically programmed offline with a teach pendant so that the robot performs point-to-point movements along a taught path. This places strict requirements on where the object to be grabbed is positioned, the arm cannot move to locations that have not been taught, and the tasks it can complete are limited. In addition, this method requires a skilled operator familiar with the programming scheme of the particular mechanical arm, and if the grabbed object or its pose changes, the arm must be reprogrammed.
Vision-based mechanical arm grabbing methods basically fall into two cases. In the first, the camera is fixed on the jaw at the end of the mechanical arm and moves with it, so the relative position between the camera and the end jaw is fixed; the positions of the camera and the target object can be updated in real time during movement, enabling accurate grabbing. However, the field of view depends on the pose of the mechanical arm: when the end of the arm is not oriented toward, or close to, the target object, the camera's field of view cannot fully present the target, and the recognition algorithm cannot determine the grabbing point on it. Enlarging the field of view by rotating the mechanical arm is inefficient, and because the arm has high power, prolonged movement increases power consumption and shortens its service life, so this approach is not desirable. In the second case, the camera is mounted independently of the mechanical arm and does not move with it; the relative position between the camera and the base coordinate system of the arm is fixed and the field of view is large. However, because the camera is far from the target object, the target appears small in the image and recognition accuracy is lower.
Automatic mechanical arm grabbing based on visual perception uses computer vision algorithms to identify the object and estimate its pose, including position and orientation. Object identification relies on existing object detection algorithms: a rectangular frame is used to select the object in the image acquired by the camera, the coordinates of the object in the image are obtained, the coordinates in the real-world coordinate system are calculated through a transformation matrix, and the mechanical arm is guided to perform the grabbing operation. Conventional pose estimation processes the target inside the rectangular frame with machine-vision algorithms such as image enhancement and edge detection, and then analyzes the processed image with complex geometric calculations to acquire the pose information of the target. With the continuous development of deep learning and computer vision, the pose estimation task can now be realized by deep-learning methods. Methods that rely on RGB-D images acquired by a depth camera and on depth-image algorithms cannot solve the pose when the depth information is wrong or missing. Methods that require only RGB color images rely on image segmentation algorithms, which need classification information and pixel-level edge labeling of the target; the data set is costly to produce and the running speed is low.
The traditional teach-pendant method suffers from the wide variety of teach pendants, high learning cost, strict requirements on the placement position and angle of the target objects, poor flexibility, and the need to occupy the robot for online programming. With the current camera mounting method, target detection algorithms perform poorly on small-scale targets. Traditional pose estimation algorithms place high demands on the actual conditions of the grabbing scene; illumination changes, motion blur and the like strongly affect the estimation result, stability is poor, and the image enhancement algorithms and geometric calculations depend on the professional experience of the algorithm designer, so the development cost is high. Among deep-learning pose estimation algorithms, methods that depend on depth images have high requirements on the environment and poor universality; methods that depend on image segmentation have high labeling requirements for the category and edge pixels of the target, the data set is costly to produce, and the algorithm is too slow to meet real-time requirements.
In order to solve the above technical problems, the application provides an automatic grabbing scheme for a monocular-camera mechanical arm based on deep learning, which can be applied to a mechanical arm connected with a camera. Referring to fig. 2, fig. 2 is a schematic diagram of the camera mounting position according to an embodiment of the application: the camera is mounted independently of the mechanical arm, and the mechanical arm with its jaws is arranged on the operation table. In this embodiment the camera is calibrated (distortion correction), and the transformation matrix between the camera image coordinate system and the real-world coordinate system is calculated (taking the origin of the mechanical arm coordinate system as the origin of the world coordinate system).
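As one possible form of that transformation, if the work surface is treated as a plane, a homography between image pixels and robot-base XY coordinates is sufficient. The sketch below assumes OpenCV and uses hypothetical calibration point pairs; the patent itself only states that a transformation matrix is computed.

```python
# Sketch: planar pixel-to-world mapping via a homography (hypothetical calibration points).
import cv2
import numpy as np

pixel_pts = np.array([[100, 120], [520, 118], [515, 430], [105, 433]], dtype=np.float32)
world_pts = np.array([[0.20, -0.15], [0.20, 0.15], [0.45, 0.15], [0.45, -0.15]], dtype=np.float32)

H, _ = cv2.findHomography(pixel_pts, world_pts)

def pixel_to_world(u, v):
    p = np.array([[[u, v]]], dtype=np.float32)
    return cv2.perspectiveTransform(p, H)[0, 0]   # (X, Y) in the robot base frame, metres

print(pixel_to_world(300, 275))
```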
This embodiment comprises a structural adjustment of the YOLO model, the establishment of the data set required by U2-Net, and a pose estimation strategy based on the U2-Net salient object detection result.
First, a convolution layer is added to the YOLOv5 model to improve its ability to extract low-dimensional features, and the model is trained with a data set of the target objects to be grabbed to obtain the improved YOLOv5 model.
The data set required for training U2-Net can be made from images of the targets to be grabbed. Original camera image data of the targets (foreground and background not separated) is acquired at multiple angles and multiple sizes, the images are segmented by manual labeling with labelimg to obtain target foreground images, and the segmented foreground images are combined, by permutation and combination, with grabbing-scene images (re-acquired scene images that contain no target object and need no foreground separation). Image processing methods such as illumination change, motion blur, random rotation and random flipping are added in this process to expand the training data and improve the sample diversity of the data set. The U2-Net model is then trained with this data set to obtain a salient object detection model, which is used for pose estimation of the targets. Because the data set for salient object detection is produced by data synthesis and data expansion, and the Poisson fusion algorithm makes the generated data realistic and usable, the labor and time cost of labeling is greatly reduced compared with traditional data labeling methods.
When grabbing starts, the adjusted YOLOv5 model performs target detection on the camera image to obtain the position and identification frame of the target; the identification frame is used as the ROI (region of interest) for U2-Net, which separates foreground from background; the minimum circumscribed rectangle is then drawn with the foreground mask matrix as the boundary, the corresponding grabbing strategy is generated, and the grabbing operation is executed. A sketch of this pipeline is given below.
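The following Python sketch outlines that pipeline with the detector, segmenter, grasp-pose routine and robot interface injected as callables; their concrete implementations are not given in this description and the names are illustrative.

```python
# Structural sketch of one grabbing cycle; all callables are injected placeholders.
import numpy as np

def grab_once(frame, detect, segment, grasp_from_points, send_grasp):
    """frame: BGR image of the robot work area.
    detect(frame)             -> (x1, y1, x2, y2) box of the target (e.g. from the improved YOLOv5)
    segment(roi)              -> uint8 foreground mask of the ROI (e.g. from U2-Net)
    grasp_from_points(pts)    -> (point_a, point_b, jaw_angle), e.g. the routine sketched earlier
    send_grasp(a, b, angle)   -> command the arm to grasp at the two jaw contact points
    """
    x1, y1, x2, y2 = detect(frame)
    roi = frame[y1:y2, x1:x2]
    mask = segment(roi)
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs + x1, ys + y1], axis=1).astype(np.float32)  # back to full-image coordinates
    a, b, angle = grasp_from_points(pts)
    send_grasp(a, b, angle)
```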
In this embodiment, adjusting the convolution layer structure of the YOLOv5 model improves its detection accuracy for small-scale targets. The Eye-to-Hand arrangement (camera mounted independently of the mechanical arm) provides a wide field of view, low energy consumption and stable imaging, which facilitates pose estimation of the target to be grabbed by the deep-learning algorithm. Using the U2-Net model for pose estimation gives a lower data-set production cost, a lower computing-power requirement and a higher processing speed.
The YOLOv5 model was modified as follows:
In mechanical arm grabbing, because the camera is mounted in an Eye-to-Hand configuration and the target objects are placed at random positions, many target objects appear small in the camera image, and the effective feature information they carry is lost after multiple downsampling and convolution operations, resulting in poor recognition accuracy and a low recall rate. The YOLOv5 model consists of four core parts: the input, the backbone network (Backbone), the multi-scale feature fusion network (Neck) and the prediction head (Head). The Neck of the YOLOv5 model contains three feature layers of sizes 20×20, 40×40 and 80×80. In order to improve the recall rate of the original algorithm for small target objects, a 160×160 small-scale prediction layer is added at the top while the existing three detection layers are kept unchanged, strengthening the extraction of low-dimensional features. For this prediction layer, a 256-channel upsampling step is added after the original 80×80 upsampling layer to generate a 160×160 feature map with deep semantic information for feature fusion, finally producing a small-scale detection layer with a denser detection grid. The prediction scale of the algorithm thus changes from the original 3-scale detection to 4-scale detection.
Referring to fig. 3, fig. 3 is a schematic structural diagram of the multi-scale feature fusion network Neck of the YOLOv5 model according to an embodiment of the present application. In the figure, C3 represents a backbone block, Concat represents feature fusion, Upsample represents upsampling, Conv represents a convolution operation, and YoloHead represents the detection head of the YOLOv5 model. The four feature layers in the multi-scale feature fusion network Neck are 20×20×1024, 40×40×512, 80×80×256 and 160×160×128, respectively.
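A schematic PyTorch sketch of the added small-object branch is given below: the 80×80 Neck feature is upsampled to 160×160, fused with a 160×160 backbone feature, and reduced to 128 channels as the fourth detection scale. Channel sizes follow the figure; plain Conv/Upsample layers stand in for the YOLOv5 C3 and Conv blocks, so this illustrates the structure rather than the exact network.

```python
# Schematic extra 160x160 branch (channel sizes from the figure; blocks simplified).
import torch
from torch import nn

class SmallObjectBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.reduce = nn.Conv2d(256, 128, kernel_size=1)           # 80x80x256 -> 80x80x128
        self.up = nn.Upsample(scale_factor=2, mode="nearest")      # 80x80 -> 160x160
        self.fuse = nn.Conv2d(128 + 128, 128, kernel_size=3, padding=1)

    def forward(self, p3, backbone_p2):
        # p3: 80x80x256 Neck feature; backbone_p2: 160x160x128 backbone feature.
        x = self.up(self.reduce(p3))
        x = torch.cat([x, backbone_p2], dim=1)                     # feature fusion (Concat)
        return self.fuse(x)                                        # 160x160x128 detection input

branch = SmallObjectBranch()
out = branch(torch.rand(1, 256, 80, 80), torch.rand(1, 128, 160, 160))
print(out.shape)  # torch.Size([1, 128, 160, 160])
```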
The preparation process of the saliency target detection data set is as follows:
referring to fig. 4, fig. 4 is a flow chart of a data set required for training a U2-Net model according to an embodiment of the present application, which includes the following steps: collecting an original image, and carrying out sample labeling and sample expansion on the original image to obtain an expanded foreground image and a mask matrix corresponding to the expanded foreground image; and acquiring a scene image, and carrying out image fusion on the scene image and the expanded foreground image to obtain a required data set.
Training the model requires a large amount of labeled data. The data can be acquired by shooting with a camera at multiple angles and multiple scales, but if the data labeling part used the traditional method, the labor and time cost of labeling the acquired data pixel by pixel would be extremely high. To solve this problem, this embodiment generates a large amount of labeled data by data synthesis; the specific implementation comprises the following steps A1 to A6.
Step A1: acquiring image data containing an object to be grabbed by using a camera, wherein the image needs to contain various shooting angles and various placing angles of a target object to be grabbed;
Step A2: accurately labeling the acquired image data pixel by pixel using the labelimg data labeling tool, and storing the corresponding foreground mask matrix;
Step A3: performing data expansion on the labeled samples: saturation adjustment, brightness adjustment, Gaussian noise addition and motion blur processing for the image quality, and random scaling, random rotation and random affine transformation for the image shape; whenever the foreground shape is adjusted, the corresponding mask matrix is transformed synchronously to keep it consistent with the edges of the foreground image;
step A4: acquiring image data of a use scene by using a camera, wherein the image needs to contain various shooting angles and various illumination conditions of the use scene;
step A5: combining the expanded foreground sample with the acquired scene image, taking the scene image as a background, taking the expanded sample as a foreground, and randomly positioning. Since the original image to which the foreground sample belongs and the background image to be combined subsequently are not the same image, if the foreground image is not processed and is directly covered with the background image, an obvious boundary exists between the edge part of the foreground image and the background image, and the part can influence the judgment of the model in the characteristic extraction stage of the model, and the characteristic of the combined boundary is misjudged as an important basis for influencing the segmentation boundary, the data synthesis part of the method adopts a poisson fusion method. The core idea of poisson fusion is to realize fusion by making a target image grow a new image according to a guiding field of a source image in a fusion part, so that a fusion result is more natural. Poisson fusion can be divided into three types according to the different ways of handling the pilot field: the method adopts NORMAL CLONE, MIXED CLONE and MONOCHROME TRANSFER, and the type applies the guiding field of the source image to the target image, so that the gradient field of the fusion edge part approximates to the source image, and meanwhile, the illumination and tone conditions of the source image are kept, and the method is most suitable for the image fusion requirement required by the method. The MIXED CLONE applies the guiding fields of the source image and the target image to the fusion part, so that the source image and the target image have certain contribution in the fusion part, and the fusion weakens the image characteristics of the foreground image, which is not beneficial to the follow-up characteristic advanced operation of the fusion picture; MONOCHROME TRANSFER the guiding field of the source image is applied to the target image, but the processing in a monochromatic mode can lose the color characteristics of the foreground image in the subsequent characteristic extraction stage, which is not in line with the image fusion requirement of the method.
Step A6: dividing the data set into a training set, a testing set and a verification set, wherein the dividing ratio is 75% of the training set, 20% of the testing set and 5% of the verification set;
the minimum circumscribed rectangle solving and grabbing strategy based on the U2-Net model is as follows:
The automatic mechanical arm grabbing based on a monocular camera realized by this embodiment mainly relies on target localization by the improved YOLOv5 and pose estimation by U2-Net. The detection frame obtained by YOLOv5 is shown in FIG. 5, which is a schematic diagram of a detection result of the YOLOv5 model according to an embodiment of the present application. After U2-Net obtains the foreground mask matrix of the object to be grabbed, the edge information of the mask matrix is acquired, and the minimum circumscribed rectangle of the figure formed by the edge points is solved; the center points of the two long sides of this rectangle are the grabbing points, and the jaw angle is parallel to its two short sides. Referring to fig. 6, fig. 6 is a schematic diagram of the minimum circumscribed rectangle according to an embodiment of the present application, wherein the two points A and B are the gripping points of the jaws.
The embodiment of the application also provides a grabbing strategy for the mechanical arm, comprising the following steps: collecting camera images, obtaining a target detection frame with the improved YOLOv5 model, obtaining a foreground mask matrix with U2-Net, converting the mask matrix into a closed figure, solving the minimum circumscribed rectangle, and generating a grabbing scheme.
Specifically, the process of solving the minimum circumscribed rectangle comprises the following steps B1 to B5:
step B1: drawing a closed graph S aiming at the foreground edge of the obtained target to be grabbed;
step B2: the simple circumscribed rectangle algorithm is realized, and the simple circumscribed rectangle refers to a circumscribed rectangle parallel to an X axis or an outer shaft. Finding the maximum value Xmax and the minimum value Xmin of the X axis of S, the maximum value Ymax and the minimum value Ymin of the Y axis, taking (Xmin, ymin) as the upper left corner point,
(Xmax, ymax) drawing a rectangle for the lower right corner, namely a simple circumscribed rectangle of S;
Step B3: rotating the figure S from -90 degrees to 90 degrees. The rotation step size directly influences precision and efficiency and is controlled, for example, between 5 degrees and 10 degrees. The simple circumscribed rectangle after each rotation is solved, and its area, rotated vertex coordinates and rotation angle are recorded. Let (X2, Y2) be the point obtained by rotating the point (X1, Y1) counterclockwise by an angle θ around the point (X0, Y0); then:
X2 = (X1 - X0)cosθ - (Y1 - Y0)sinθ + X0;
Y2 = (X1 - X0)sinθ + (Y1 - Y0)cosθ + Y0;
(for clockwise rotation, substitute -θ);
Step B4: traversing the areas of the simple circumscribed rectangles at all rotation angles; the simple circumscribed rectangle with the smallest area corresponds to the minimum circumscribed rectangle of the figure S, and its rotation angle θmin and rotated vertex coordinate O are recorded;
Step B5: rotating the circumscribed rectangle back. The simple circumscribed rectangle obtained in the previous step is rotated about the rotated vertex by the angle θmin in the opposite direction (or by 360 degrees minus θmin in the positive direction); the resulting rectangle is the minimum circumscribed rectangle of the figure S, i.e. the approximate minimum circumscribed rectangle of the object to be grabbed.
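A NumPy sketch of steps B1 to B5 is given below: the figure is rotated over -90 to 90 degrees in fixed steps, the axis-aligned (simple) circumscribed rectangle is taken at every angle, the smallest one is kept, and its corners are rotated back by the recorded angle. OpenCV's cv2.minAreaRect provides the same result directly; the code simply follows the procedure described above, with the step size as an assumed parameter.

```python
# Sketch of the rotation-based minimum circumscribed rectangle (steps B1-B5).
import numpy as np

def min_bounding_rect(points, step_deg=5):
    points = np.asarray(points, dtype=np.float64)   # edge points of the closed figure S
    centre = points.mean(axis=0)
    best = None
    for theta in np.arange(-90, 90 + step_deg, step_deg):
        rad = np.radians(theta)
        rot = np.array([[np.cos(rad), -np.sin(rad)],
                        [np.sin(rad),  np.cos(rad)]])
        p = (points - centre) @ rot.T               # counterclockwise rotation about the centre
        xmin, ymin = p.min(axis=0)
        xmax, ymax = p.max(axis=0)
        area = (xmax - xmin) * (ymax - ymin)        # area of the simple circumscribed rectangle
        if best is None or area < best[0]:
            corners = np.array([[xmin, ymin], [xmax, ymin], [xmax, ymax], [xmin, ymax]])
            best = (area, theta, corners)
    _, theta_min, corners = best
    rad = np.radians(-theta_min)                    # rotate the best simple rectangle back
    rot = np.array([[np.cos(rad), -np.sin(rad)],
                    [np.sin(rad),  np.cos(rad)]])
    return corners @ rot.T + centre                 # 4 corners of the minimum circumscribed rectangle
```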
Referring to fig. 7, fig. 7 is a schematic structural diagram of a target object capturing system according to an embodiment of the present application, where the system may include:
an image capturing module 701, configured to obtain an original image captured by a target object under a plurality of capturing parameters; the target object is in a preset scene, and the shooting parameters comprise a shooting angle and a shooting scale;
the labeling module 702 is configured to label the original image, and extract a foreground image corresponding to the target object from the original image according to a labeling result;
a data expansion module 703, configured to perform data expansion on the foreground image, and update a mask matrix corresponding to the expanded foreground image;
a fusion module 704, configured to capture a scene image of the preset scene, and fuse the expanded foreground image with the scene image to obtain a fused image;
the training module 705 is configured to train an image segmentation model by using the fused image;
And the grabbing control module 706 is configured to receive an image of a working area of the mechanical arm, and identify an object contour of the target object from the image of the working area of the mechanical arm by using the image segmentation model, so as to control the mechanical arm to grab the target object according to the object contour.
The method comprises the steps that original images of a target object under a plurality of shooting parameters are obtained, and foreground images corresponding to the target object are obtained through labeling the original images; the embodiment also expands the data of the foreground image and updates the mask matrix corresponding to the expanded foreground image; according to the embodiment, the target object is shot under a plurality of shooting parameters, and the foreground image is subjected to data expansion to obtain the foreground image corresponding to the target object under a plurality of shooting scenes. According to the embodiment, the expanded foreground image and the scene image are fused to obtain the fused image, so that the number of data set samples used for training the image segmentation model is increased, the outline of the target object can be accurately identified based on the image segmentation model trained by the fused image, and the grabbing precision of the mechanical arm on the target object can be improved.
The process of the grabbing control module 706 using the image segmentation model to identify the object profile of the target object from the robot arm working area image includes: extracting a region of interest containing the target object from the mechanical arm working region image by using a target detection model; and inputting the region of interest into the image segmentation model to obtain the object contour of the target object.
Further, the image segmentation model is a U2-Net model, the target detection model is a YOLOv5 model, the multi-scale feature fusion network of the YOLOv5 model comprises a first feature layer, a second feature layer, a third feature layer and a fourth feature layer which are sequentially connected, the size of the first feature layer is 20×20, the size of the second feature layer is 40×40, the size of the third feature layer is 80×80, and the size of the fourth feature layer is 160×160.
Further, the process of the fusion module 704 fusing the expanded foreground image and the scene image to obtain a fused image includes: and performing poisson fusion operation on the expanded foreground image and the scene image to obtain the fusion image.
Further, the labeling module 702 labeling the original image includes: and labeling the original image at the pixel level by using a labelimg data labeling tool.
Further, the process of controlling the mechanical arm to grasp the target object by the grasping control module 706 according to the object profile includes: generating a corresponding closed figure according to the object outline; determining boundary coordinate values of the closed graph; the boundary coordinate point comprises an X-axis coordinate maximum value, an X-axis coordinate minimum value, a Y-axis coordinate maximum value and a Y-axis coordinate minimum value; determining the circumscribed rectangle of the closed graph according to the boundary coordinate value; rotating the closed graph for N times, and recording boundary coordinate values and circumscribed rectangles of the closed graph after each rotation; setting the circumscribed rectangle with the smallest area as the smallest circumscribed rectangle of the target object; and controlling the mechanical arm to grasp the target object according to the boundary coordinate value of the minimum circumscribed rectangle.
Further, the process of controlling the mechanical arm to grasp the target object by the grasping control module 706 according to the boundary coordinate value of the minimum bounding rectangle includes: determining a grabbing point and a clamping jaw angle according to the boundary coordinate value of the minimum circumscribed rectangle; the grabbing points are center points of two long sides of the minimum circumscribed rectangle, and the clamping jaw angles are parallel to two short sides of the minimum circumscribed rectangle; and controlling the mechanical arm to grasp the target object according to the grasping point and the clamping jaw angle.
Since the system embodiments correspond to the method embodiments, reference may be made to the description of the method embodiments for details of the system embodiments, which are not repeated here.
The present application also provides a storage medium having stored thereon a computer program which, when executed, performs the steps provided by the above embodiments. The storage medium may include a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The present application also provides an electronic device, which may comprise a memory and a processor. The memory stores a computer program, and the processor, when calling the computer program in the memory, can implement the steps provided by the above embodiments. Of course, the electronic device may also include various network interfaces, a power supply, and other components.
In this description, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. For the system disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively brief, and relevant details can be found in the description of the method. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations can be made to the application without departing from its principles, and such modifications and adaptations are intended to fall within the scope of the application as defined by the following claims.
It should also be noted that, in this specification, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Claims (10)

1. A target object grabbing method, comprising:
acquiring an original image of a target object shot under a plurality of shooting parameters; wherein the target object is in a preset scene, and the shooting parameters comprise a shooting angle and a shooting scale;
labeling the original image, and extracting a foreground image corresponding to the target object from the original image according to a labeling result;
performing data expansion on the foreground image, and updating a mask matrix corresponding to the expanded foreground image;
shooting a scene image of the preset scene, and fusing the expanded foreground image with the scene image to obtain a fused image;
training an image segmentation model by utilizing the fusion image;
and receiving an image of a working area of the mechanical arm shot by a camera, and identifying the object contour of the target object from the image of the working area of the mechanical arm by utilizing the image segmentation model so as to control the mechanical arm to grasp the target object according to the object contour.
2. The target object grabbing method according to claim 1, wherein identifying the object contour of the target object from the image of the working area of the mechanical arm by using the image segmentation model comprises:
extracting a region of interest containing the target object from the image of the working area of the mechanical arm by using a target detection model;
and inputting the region of interest into the image segmentation model to obtain the object contour of the target object.
3. The target object grabbing method according to claim 2, wherein the image segmentation model is a U2-Net model, the target detection model is a YOLOv5 model, the multi-scale feature fusion network of the YOLOv5 model comprises a first feature layer, a second feature layer, a third feature layer and a fourth feature layer which are sequentially connected, the size of the first feature layer is 20×20, the size of the second feature layer is 40×40, the size of the third feature layer is 80×80, and the size of the fourth feature layer is 160×160.
4. The target object grabbing method according to claim 1, wherein fusing the expanded foreground image with the scene image to obtain a fused image comprises:
performing a Poisson fusion operation on the expanded foreground image and the scene image to obtain the fused image.
5. The target object grabbing method according to claim 1, wherein labeling the original image comprises:
labeling the original image at the pixel level by using a labelimg data labeling tool.
6. The target object grabbing method according to claim 1, wherein controlling the mechanical arm to grasp the target object according to the object contour comprises:
generating a corresponding closed figure according to the object outline;
determining boundary coordinate values of the closed graph; wherein the boundary coordinate values comprise an X-axis coordinate maximum value, an X-axis coordinate minimum value, a Y-axis coordinate maximum value and a Y-axis coordinate minimum value;
determining the circumscribed rectangle of the closed graph according to the boundary coordinate value;
rotating the closed graph for N times, and recording boundary coordinate values and circumscribed rectangles of the closed graph after each rotation;
setting the circumscribed rectangle with the smallest area as the smallest circumscribed rectangle of the target object;
and controlling the mechanical arm to grasp the target object according to the boundary coordinate value of the minimum circumscribed rectangle.
7. The target object grabbing method according to claim 6, wherein controlling the mechanical arm to grasp the target object according to the boundary coordinate values of the minimum circumscribed rectangle comprises:
determining grabbing points and a clamping jaw angle according to the boundary coordinate values of the minimum circumscribed rectangle; wherein the grabbing points are the center points of the two long sides of the minimum circumscribed rectangle, and the clamping jaw angle is parallel to the two short sides of the minimum circumscribed rectangle;
and controlling the mechanical arm to grasp the target object according to the grabbing points and the clamping jaw angle.
8. A target object grabbing system, comprising:
the image shooting module is used for acquiring original images of the target object shot under a plurality of shooting parameters; the target object is in a preset scene, and the shooting parameters comprise a shooting angle and a shooting scale;
the labeling module is used for labeling the original image and extracting a foreground image corresponding to the target object from the original image according to a labeling result;
the data expansion module is used for carrying out data expansion on the foreground image and updating a mask matrix corresponding to the expanded foreground image;
the fusion module is used for shooting a scene image of the preset scene, and fusing the expanded foreground image with the scene image to obtain a fused image;
the training module is used for training the image segmentation model by utilizing the fusion image;
and the grabbing control module is used for receiving the image of the working area of the mechanical arm shot by the camera, and identifying the object contour of the target object from the image of the working area of the mechanical arm by utilizing the image segmentation model so as to control the mechanical arm to grab the target object according to the object contour.
9. An electronic device comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor implements the steps of the target object grabbing method according to any one of claims 1 to 7 when calling the computer program in the memory.
10. A storage medium having computer-executable instructions stored therein, wherein the computer-executable instructions, when loaded and executed by a processor, implement the steps of the target object grabbing method according to any one of claims 1 to 7.
CN202311021072.XA 2023-08-14 2023-08-14 Target object grabbing method, system, electronic equipment and storage medium Pending CN117037062A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311021072.XA CN117037062A (en) 2023-08-14 2023-08-14 Target object grabbing method, system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311021072.XA CN117037062A (en) 2023-08-14 2023-08-14 Target object grabbing method, system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117037062A true CN117037062A (en) 2023-11-10

Family

ID=88624124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311021072.XA Pending CN117037062A (en) 2023-08-14 2023-08-14 Target object grabbing method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117037062A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117253231A (en) * 2023-11-15 2023-12-19 四川弘和数智集团有限公司 Oil-gas station image processing method and device, electronic equipment and storage medium
CN117253231B (en) * 2023-11-15 2024-01-26 四川弘和数智集团有限公司 Oil-gas station image processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication