CN116363505A - Target picking method based on picking robot vision system - Google Patents
Info
- Publication number
- CN116363505A CN116363505A CN202310210531.2A CN202310210531A CN116363505A CN 116363505 A CN116363505 A CN 116363505A CN 202310210531 A CN202310210531 A CN 202310210531A CN 116363505 A CN116363505 A CN 116363505A
- Authority
- CN
- China
- Prior art keywords
- mask
- picking
- cnn model
- training
- fruit
- Prior art date
- Legal status: Pending
Classifications
- G06V20/10—Terrestrial scenes
- G06N3/08—Learning methods (neural networks)
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06T2207/20016—Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
- Y02T10/40—Engine management systems
Abstract
The invention relates to a target picking method based on a picking robot vision system, comprising the following steps: acquiring original fruit images with a camera; preprocessing the collected images, forming the preprocessed images into a data set, and dividing the data set into a training set, a verification set and a test set; establishing and training a Mask R-CNN model; optimizing the training process to obtain the Mask R-CNN model with the highest training accuracy; and loading that model onto the CPU of the picking robot to realize target recognition and picking based on the robot's vision system. The invention acquires images of ripe fruit with a high-resolution camera, applies deep learning to intelligent fruit picking, adjusts the network structure to the actual use scene, and trains a Mask R-CNN model that can automatically detect ripe fruit and enable intelligent picking, thereby solving the problem that a picking robot cannot reliably recognize and extract fruits and vegetables in a complex surrounding environment.
Description
Technical Field
The invention relates to the technical field of deep learning and artificial intelligence, in particular to a target picking method based on a picking robot vision system.
Background
With social development and technological progress, quality of life has steadily improved and demand for fruit has grown. Meanwhile, fruit planting areas keep expanding, the number of agricultural workers is declining, and the population is aging, so manual picking can no longer keep up with the concentrated, rapid harvest required when fruit ripens. The traditional manual picking mode also suffers from low efficiency, high labor intensity, difficulty in picking at height, and significant safety hazards. These factors severely limit the long-term development of the planting industry. To save labor and materials and increase growers' income, mechanization of the planting industry has become an inevitable trend.
Large-scale mechanical picking equipment is aimed mainly at large farms; it offers high picking efficiency at scale, but the purchase price is relatively high and the maintenance cost is not negligible, so it is unsuitable for the small, family-scale fruit planting common in China. Moreover, because such machines are manually operated, some fruit damage during picking is unavoidable. Mechanical picking is also unselective: it may harvest unhealthy, immature or damaged fruit and shake off leaves, which increases the difficulty of subsequent fruit screening.
Existing picking robots operate in complex environments with many uncertainties, which makes picking difficult. Efficient, rapid fruit and vegetable picking requires accurate recognition and three-dimensional positioning. The vision system of a picking robot operates in four stages: target detection, target recognition, three-dimensional reconstruction and three-dimensional positioning. Target detection requires an algorithm that promptly detects target objects in the image; target recognition must distinguish the objects to be picked from other interfering items; three-dimensional reconstruction acquires a two-dimensional image of the target with a camera and then recovers the fruit's three-dimensional information in space through feature extraction, stereo matching and similar algorithms; the spatial coordinates obtained from reconstruction complete the three-dimensional positioning. The accuracy of target recognition and positioning directly determines the robot's picking efficiency, whether crops are damaged, and whether the robot body suffers collision damage. In picking operations, the factors causing inaccurate target recognition and localization are numerous and can be summarized as follows: 1) changes in natural illumination; 2) complex growth environments; 3) fruit overlapping or being occluded by branches, leaves and stems; 4) vibration of the mechanical arm causing inaccurate sensor imaging; 5) radio-frequency interference among the robot controller, cameras and sensors; 6) mechanical failure of the robot. With so many interfering factors, accurate fruit and vegetable recognition and positioning is a significant challenge.
Disclosure of Invention
The invention aims to provide a target picking method based on a picking robot vision system, which can automatically detect mature fruits and realize intelligent picking, and solves the problem that a picking robot cannot well identify and extract fruits and vegetables due to a complex surrounding environment.
In order to achieve the above purpose, the present invention adopts the following technical scheme: a method of picking a target based on a vision system of a picking robot, the method comprising the sequential steps of:
(1) Acquiring an original fruit image by using a camera;
(2) Preprocessing the collected original fruit image, forming a data set by the preprocessed fruit image, and dividing the data set into a training set, a verification set and a test set;
(3) Establishing a Mask R-CNN model, inputting a training set into the Mask R-CNN model for training, and obtaining a trained Mask R-CNN model;
(4) Optimizing the training process to obtain a Mask R-CNN model with highest training precision;
(5) And loading a Mask R-CNN model with highest training precision on a CPU of the picking robot, and finally realizing target identification picking based on a vision system of the picking robot.
The step (2) specifically comprises the following steps:
(2a) Primary screening: screening 1056 original fruit images which are clear in image and contain fruit targets according to actual requirements;
(2b) Labeling: labeling the preliminarily screened fruit images by using a labelme tool, labeling the ripe fruits as 1, labeling the immature fruits as 0, and setting the areas except the fruits as the background without labeling, and establishing a label image as a target detection label;
(2c) Data set classification: the labeled fruit images form a data set, which is divided into a training set, a verification set and a test set in an 8:1:1 ratio;
(2d) Data amplification: each image in the training and verification sets undergoes 5 data amplifications, namely rotation by 90°, 180° and 270°, horizontal and vertical flipping, color dithering and Gaussian noise, so that the training set contains 4000 images, the verification set 640 images and the test set 640 images.
In step (3), the Mask R-CNN model comprises a backbone network for extracting feature maps from the input image, a region proposal network, a region-of-interest alignment layer and a regional convolutional neural network. The model uses a residual network combined with a feature pyramid as the feature-extraction backbone. The backbone outputs feature maps to the region proposal network, which generates regions of interest and provides candidate object bounding boxes. The region-of-interest alignment layer matches each region of interest with the backbone's feature maps, aggregates and pools the features to a fixed size, and passes them through a fully connected layer to the regional convolutional neural network. That network comprises three branches: the first classifies fruit with a softmax classifier; the second refines target localization with a bounding-box regressor; the third performs contour segmentation of ripe fruit with a fully convolutional network and generates a mask. Finally, the outputs of the three branches are combined to produce images with fruit classification, localization bounding boxes and precise positioning.
Step (4) specifically refers to: a mosaic augmentation method is added to the training process to improve the Mask R-CNN model's ability to detect local targets, optimize the long-tail distribution of the data set and raise training accuracy; when the training accuracy reaches 98.7%, the corresponding model is taken as the Mask R-CNN model with the highest training accuracy.
The step (5) specifically refers to: the method comprises the steps of obtaining fruit images to be picked by using a camera of a picking robot, processing the fruit images to be picked based on a Mask R-CNN model with highest training precision, outputting a target recognition extraction result by the Mask R-CNN model with highest training precision, and picking by the picking robot according to the target recognition extraction result.
According to the above technical scheme, the beneficial effects of the invention are as follows. First, images of ripe fruit are acquired with a high-resolution camera, deep learning is applied to intelligent fruit picking, the network structure is adjusted to the actual use scene, and a Mask R-CNN model is trained that can automatically detect ripe fruit and enable intelligent picking. Second, the Mask R-CNN model is applied to the picking robot's vision system to realize adaptive recognition and extraction of targets, solving the problem that the robot cannot reliably recognize and extract fruits and vegetables in a complex environment; experimental results show the method handles target recognition and extraction well in such environments. Third, the mask branch of the Mask R-CNN model performs pixel-level classification, so its target recognition rate is higher than that of Faster R-CNN, and it provides a degree of screening between ripe and unripe fruit during recognition and extraction.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flow chart of a vision processing system of the picking robot in the present invention;
FIG. 3 is a diagram of a visual system object recognition and extraction framework.
Detailed Description
As shown in fig. 1, a method for picking targets based on a vision system of a picking robot, the method comprising the following sequential steps:
(1) Acquiring an original fruit image by using a camera;
(2) Preprocessing the collected original fruit image, forming a data set by the preprocessed fruit image, and dividing the data set into a training set, a verification set and a test set;
(3) Establishing a Mask R-CNN model, inputting a training set into the Mask R-CNN model for training, and obtaining a trained Mask R-CNN model;
(4) Optimizing the training process to obtain a Mask R-CNN model with highest training precision, namely an optimal model;
(5) And loading a Mask R-CNN model with highest training precision on a CPU of the picking robot, and finally realizing target identification picking based on a vision system of the picking robot.
The step (2) specifically comprises the following steps:
(2a) Primary screening: according to actual requirements, screen out 1056 original fruit images that are clear and contain fruit targets;
(2b) Labeling: label the preliminarily screened fruit images with the labelme tool, marking ripe fruit as 1 and unripe fruit as 0; regions other than fruit are treated as background and left unlabeled. The resulting label images serve as the target detection labels;
(2c) Data set classification: the labeled fruit images form a data set, which is divided into a training set, a verification set and a test set in an 8:1:1 ratio;
(2d) Data amplification: each image in the training and verification sets undergoes 5 data amplifications, namely rotation by 90°, 180° and 270°, horizontal and vertical flipping, color dithering and Gaussian noise, so that the training set contains 4000 images, the verification set 640 images and the test set 640 images.
In step (3), the Mask R-CNN model comprises a backbone network for extracting feature maps from the input image, a region proposal network, a region-of-interest alignment layer and a regional convolutional neural network. The model uses a residual network combined with a feature pyramid as the feature-extraction backbone. The backbone outputs feature maps to the region proposal network, which generates regions of interest and provides candidate object bounding boxes. The region-of-interest alignment layer matches each region of interest with the backbone's feature maps, aggregates and pools the features to a fixed size, and passes them through a fully connected layer to the regional convolutional neural network. That network comprises three branches: the first classifies fruit with a softmax classifier; the second refines target localization with a bounding-box regressor; the third performs contour segmentation of ripe fruit with a fully convolutional network and generates a mask. Finally, the outputs of the three branches are combined to produce images with fruit classification, localization bounding boxes and precise positioning.
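The classification branch described above relies on a softmax classifier to turn each region's raw scores into class probabilities. A minimal stdlib sketch of that step follows; the three-class layout (background, unripe fruit, ripe fruit) and the logit values are illustrative assumptions, not taken from the patent:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-region logits for [background, unripe fruit, ripe fruit]
logits = [0.5, 1.2, 3.1]
probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
```

In the full model this computation runs once per region of interest; off-the-shelf implementations of the residual-network-plus-FPN design described here include torchvision's `maskrcnn_resnet50_fpn`.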
Step (4) specifically refers to: a mosaic augmentation method is added to the training process to improve the Mask R-CNN model's ability to detect local targets, optimize the long-tail distribution of the data set and raise training accuracy; when the training accuracy reaches 98.7%, the corresponding model is taken as the Mask R-CNN model with the highest training accuracy. After initial training on the 4000 images, analysis of the results showed that the training accuracy had not reached the expected value. Mosaic augmentation was therefore added, improving the model's ability on local targets and optimizing the long-tail distribution of the data set; the final training accuracy reached 98.7%, and the detection speed also improved over the unimproved model.
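Mosaic augmentation stitches several training images into one composite, so each training sample exposes the model to more object instances and more scale variation. A stdlib-only sketch on 2D pixel grids follows; the two-by-two quadrant layout and equal image sizes are simplifying assumptions, and a real pipeline would also remap bounding boxes and masks into mosaic coordinates:

```python
def mosaic(tl, tr, bl, br):
    """Stitch four equally sized 2D grids into one quadrant mosaic."""
    h = len(tl)
    assert len(tr) == len(bl) == len(br) == h, "all quadrants must match in height"
    top = [row_l + row_r for row_l, row_r in zip(tl, tr)]
    bottom = [row_l + row_r for row_l, row_r in zip(bl, br)]
    return top + bottom

a = [[1, 1], [1, 1]]
b = [[2, 2], [2, 2]]
c = [[3, 3], [3, 3]]
d = [[4, 4], [4, 4]]
m = mosaic(a, b, c, d)
# m is a 4x4 grid: top rows are [1, 1, 2, 2], bottom rows are [3, 3, 4, 4]
```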
The step (5) specifically refers to: the method comprises the steps of obtaining fruit images to be picked by using a camera of a picking robot, processing the fruit images to be picked based on a Mask R-CNN model with highest training precision, outputting a target recognition extraction result by the Mask R-CNN model with highest training precision, and picking by the picking robot according to the target recognition extraction result.
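Once the highest-precision model is deployed on the robot, its detections must be filtered into picking targets. The output format below (a list of dicts with `label`, `score` and `box`) is a hypothetical convention for illustration; the ripe-equals-1 encoding follows the labeling step, while the 0.9 confidence threshold is an assumed tuning parameter:

```python
RIPE = 1  # per the labeling convention: ripe fruit labeled 1, unripe 0

def pickable_targets(detections, score_threshold=0.9):
    """Keep only confident ripe-fruit detections, highest score first."""
    keep = [d for d in detections
            if d["label"] == RIPE and d["score"] >= score_threshold]
    return sorted(keep, key=lambda d: d["score"], reverse=True)

detections = [
    {"label": 1, "score": 0.98, "box": (120, 80, 190, 150)},
    {"label": 0, "score": 0.95, "box": (40, 60, 95, 110)},   # unripe: skipped
    {"label": 1, "score": 0.62, "box": (220, 40, 260, 90)},  # low confidence
]
targets = pickable_targets(detections)
```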
The invention is further described below in connection with fig. 1 to 3.
1. Image acquisition
A Sony A7R4 camera is used; it has 61 million pixels and an image resolution of 9504 × 6336 pixels. Tomato image data were collected manually with the hand-held camera in an orchard and a greenhouse. When acquiring picture data, the following considerations apply:
1. ensure the collected pictures are shot from all angles and match the actual scene;
2. in the image data set, the detected objects should appear with varied shapes, viewing angles, relative sizes, rotation angles, etc.;
3. ensure the quality of the collected data: image sizes should be close to those of the deployment scene, and the data should cover as much in-scene diversity and as many natural-scene photos as possible;
4. the data set should be as large as possible, to avoid experimental contingency and improve target detection accuracy;
5. the data set may contain both target and non-target objects, but only the target objects are labeled;
6. if multiple classes are to be detected, each detected class should appear a similar number of times in the data;
7. the number of images in the data set can be increased by image enhancement.
2. Image processing
The photos collected by the camera are screened: photos with abnormal angles, imaging defects or blur are removed, and the remaining screened photos are made into the training and verification sets for target detection.
The training set is used for training the model and determining parameters;
the verification set is used for determining the network structure and adjusting the super parameters of the model;
a test set for checking generalization ability of the model;
parameters, which refer to variables obtained by learning a model, such as weights and biases;
hyper-parameters are set according to experience, such as the number of iterations, the number of hidden layers, the number of neurons per layer and the learning rate. Thus, a complete scientific data set was created, with 800 pictures as the training set, 128 as the verification set and 128 as the test set.
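The division of the 1056 screened originals into 800 training, 128 verification and 128 test images can be sketched with a seeded shuffle; the function name, file-name pattern and seed below are illustrative, since the patent does not specify how images are assigned:

```python
import random

def split_dataset(items, n_train=800, n_val=128, n_test=128, seed=42):
    """Shuffle a list of image paths and slice it into train/verification/test."""
    assert n_train + n_val + n_test == len(items), "split sizes must cover the data set"
    shuffled = items[:]                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

paths = [f"fruit_{i:04d}.jpg" for i in range(1056)]
train, val, test = split_dataset(paths)
```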
Data amplification can improve the generalization performance of a neural network. Therefore, 5 data amplification modes are applied to each image in the training and verification sets, namely rotation by 90°, 180° and 270°, horizontal and vertical flipping, color dithering and Gaussian noise. After the data are enhanced, the corresponding labeling information is updated, which completes the data-set expansion and further improves the training accuracy of the Mask R-CNN model. Finally, the training set has 4000 images, the verification set 640 images and the test set 640 images.
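The per-image amplification (which turns the 800 training originals into 4000) can be sketched on plain 2D pixel grids. This stdlib-only version implements the three rotations and the two flips as the five variants, with Gaussian noise as a separate helper; color dithering is omitted, and a real pipeline would use an image library such as Pillow or OpenCV rather than nested lists:

```python
import random

def rotate90(img):
    """Rotate a 2D pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_h(img):
    """Mirror the grid left-to-right (horizontal flip)."""
    return [row[::-1] for row in img]

def flip_v(img):
    """Mirror the grid top-to-bottom (vertical flip)."""
    return img[::-1]

def add_gaussian_noise(img, sigma=5.0, seed=0):
    """Add Gaussian noise and clamp pixel values to [0, 255]."""
    rng = random.Random(seed)
    return [[min(255, max(0, int(p + rng.gauss(0, sigma)))) for p in row]
            for row in img]

def amplify(img):
    """Five augmented variants per original, matching the 800 -> 4000 expansion."""
    r90 = rotate90(img)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return [r90, r180, r270, flip_h(img), flip_v(img)]

img = [[10, 20], [30, 40]]
variants = amplify(img)
noisy = add_gaussian_noise(img)
```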
The vision system of the picking robot mainly comprises the following three parts: the camera acquires the scene image, the vision system processes the scene image, saves and returns the processing result, and the whole flow is shown in figure 2.
The Mask R-CNN model is a neural network framework for object segmentation and recognition in a single image. It adopts the anchor mechanism of the R-CNN series of networks, combines a feature pyramid network (FPN) to optimize recognition of objects at different scales, and introduces a fully convolutional network (Fully Convolutional Network, FCN) to achieve accurate object segmentation. The overall implementation flow of the model is shown in FIG. 3.
The Mask R-CNN model can detect target objects in an image and mark the target regions. Building on Faster R-CNN, Mask R-CNN adds a branch mask prediction network, realizing target detection together with high-quality segmentation of target instances.
In summary, the invention acquires ripe-fruit images with a high-resolution camera, applies deep learning to intelligent fruit picking, adjusts the network structure to the actual use scene, and trains a Mask R-CNN model that can automatically detect ripe fruit and realize intelligent picking. Applying the Mask R-CNN model to the picking robot's vision system enables adaptive recognition and extraction of targets, solving the problem that the robot cannot reliably recognize and extract fruits and vegetables in a complex environment; experimental results show that target recognition and extraction in complex environments are handled well. Practical testing shows that the algorithm not only solves the problem of recognizing and extracting tomatoes in a complex surrounding environment, but also screens ripe from unripe tomatoes effectively.
Claims (5)
1. A picking method of targets based on a picking robot vision system is characterized in that: the method comprises the following steps in sequence:
(1) Acquiring an original fruit image by using a camera;
(2) Preprocessing the collected original fruit image, forming a data set by the preprocessed fruit image, and dividing the data set into a training set, a verification set and a test set;
(3) Establishing a Mask R-CNN model, inputting a training set into the Mask R-CNN model for training, and obtaining a trained Mask R-CNN model;
(4) Optimizing the training process to obtain a Mask R-CNN model with highest training precision;
(5) And loading a Mask R-CNN model with highest training precision on a CPU of the picking robot, and finally realizing target identification picking based on a vision system of the picking robot.
2. The picking robot vision system-based target picking method of claim 1, characterized by: the step (2) specifically comprises the following steps:
(2a) Primary screening: according to actual requirements, screen out 1056 original fruit images that are clear and contain fruit targets;
(2b) Labeling: label the preliminarily screened fruit images with the labelme tool, marking ripe fruit as 1 and unripe fruit as 0; regions other than fruit are treated as background and left unlabeled. The resulting label images serve as the target detection labels;
(2c) Data set classification: the labeled fruit images form a data set, which is divided into a training set, a verification set and a test set in an 8:1:1 ratio;
(2d) Data amplification: each image in the training and verification sets undergoes 5 data amplifications, namely rotation by 90°, 180° and 270°, horizontal and vertical flipping, color dithering and Gaussian noise, so that the training set contains 4000 images, the verification set 640 images and the test set 640 images.
3. The picking robot vision system-based target picking method of claim 1, characterized by: in step (3), the Mask R-CNN model comprises a backbone network for extracting feature maps from the input image, a region proposal network, a region-of-interest alignment layer and a regional convolutional neural network. The model uses a residual network combined with a feature pyramid as the feature-extraction backbone. The backbone outputs feature maps to the region proposal network, which generates regions of interest and provides candidate object bounding boxes. The region-of-interest alignment layer matches each region of interest with the backbone's feature maps, aggregates and pools the features to a fixed size, and passes them through a fully connected layer to the regional convolutional neural network. That network comprises three branches: the first classifies fruit with a softmax classifier; the second refines target localization with a bounding-box regressor; the third performs contour segmentation of ripe fruit with a fully convolutional network and generates a mask. Finally, the outputs of the three branches are combined to produce images with fruit classification, localization bounding boxes and precise positioning.
4. The picking robot vision system-based target picking method of claim 1, characterized by: the step (4) specifically refers to: a mosaic enhancement method is added in the training process to improve the local feature extraction capability of the Mask R-CNN model, optimize the long-tail distribution of the data set and improve the training precision; when the training precision reaches 98.7%, the corresponding Mask R-CNN model with the highest training precision is obtained.
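Mosaic enhancement stitches four training images into one composite so that each sample exposes the model to more object instances and scales. A minimal sketch follows, assuming four equal-size row-major images joined at the midpoint; real mosaic augmentation typically also rescales the inputs and chooses a random split point.

```python
def mosaic(imgs):
    """Stitch four equal-size images (nested row-major lists) into one
    2x2 composite: a|b over c|d."""
    a, b, c, d = imgs
    top = [ra + rb for ra, rb in zip(a, b)]
    bottom = [rc + rd for rc, rd in zip(c, d)]
    return top + bottom

# four hypothetical 1x1 "images"
composite = mosaic([[[0]], [[1]], [[2]], [[3]]])
# → [[0, 1], [2, 3]]
```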
5. The picking robot vision system-based target picking method of claim 1, characterized by: the step (5) specifically refers to: a fruit image to be picked is acquired by the camera of the picking robot and processed by the Mask R-CNN model with the highest training precision; the model outputs a target recognition and extraction result, and the picking robot performs picking according to that result.
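The use of the recognition result in step (5) can be sketched as a simple post-processing step that turns model detections into an ordered picking list. The detection-dictionary format, the "ripe" label and the 0.5 confidence threshold are assumptions for illustration, not specified in the patent.

```python
def select_picking_targets(detections, conf_threshold=0.5):
    """Filter the model output to ripe fruit above a confidence
    threshold and order targets so the most confident is picked first."""
    ripe = [d for d in detections
            if d["label"] == "ripe" and d["score"] >= conf_threshold]
    return sorted(ripe, key=lambda d: d["score"], reverse=True)

# hypothetical model output for one camera frame
dets = [
    {"label": "ripe", "score": 0.92, "box": (10, 20, 60, 80)},
    {"label": "unripe", "score": 0.88, "box": (100, 20, 150, 80)},
    {"label": "ripe", "score": 0.40, "box": (200, 30, 240, 90)},
]
targets = select_picking_targets(dets)
# → one target: the ripe fruit with score 0.92
```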
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310210531.2A CN116363505A (en) | 2023-03-07 | 2023-03-07 | Target picking method based on picking robot vision system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116363505A true CN116363505A (en) | 2023-06-30 |
Family
ID=86910986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310210531.2A Pending CN116363505A (en) | 2023-03-07 | 2023-03-07 | Target picking method based on picking robot vision system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116363505A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116935070A (en) * | 2023-09-19 | 2023-10-24 | 北京市农林科学院智能装备技术研究中心 | Modeling method for picking target of fruit cluster picking robot |
CN116935070B (en) * | 2023-09-19 | 2023-12-26 | 北京市农林科学院智能装备技术研究中心 | Modeling method for picking target of fruit cluster picking robot |
CN117456368A (en) * | 2023-12-22 | 2024-01-26 | 安徽大学 | Fruit and vegetable identification picking method, system and device |
CN117456368B (en) * | 2023-12-22 | 2024-03-08 | 安徽大学 | Fruit and vegetable identification picking method, system and device |
CN117617002A (en) * | 2024-01-04 | 2024-03-01 | 太原理工大学 | Method for automatically identifying tomatoes and intelligently harvesting tomatoes |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Dias et al. | Multispecies fruit flower detection using a refined semantic segmentation network | |
CN116363505A (en) | Target picking method based on picking robot vision system | |
Xu et al. | Fast method of detecting tomatoes in a complex scene for picking robots | |
CN109784204B (en) | Method for identifying and extracting main fruit stalks of stacked cluster fruits for parallel robot | |
CN106525732B (en) | Rapid nondestructive detection method for internal and external quality of apple based on hyperspectral imaging technology | |
CN114387520B (en) | Method and system for accurately detecting compact Li Zijing for robot picking | |
CN111178177A (en) | Cucumber disease identification method based on convolutional neural network | |
CN113160062B (en) | Infrared image target detection method, device, equipment and storage medium | |
CN112990103B (en) | String mining secondary positioning method based on machine vision | |
Zhang et al. | A method for organs classification and fruit counting on pomegranate trees based on multi-features fusion and support vector machine by 3D point cloud | |
CN113222959B (en) | Fresh jujube wormhole detection method based on hyperspectral image convolutional neural network | |
Ünal et al. | Classification of hazelnut kernels with deep learning | |
Chen et al. | A surface defect detection system for golden diamond pineapple based on CycleGAN and YOLOv4 | |
Chen et al. | Segmentation of field grape bunches via an improved pyramid scene parsing network | |
Peng et al. | Litchi detection in the field using an improved YOLOv3 model | |
CN116110042A (en) | Tomato detection method based on CBAM attention mechanism of YOLOv7 | |
Ma et al. | Using an improved lightweight YOLOv8 model for real-time detection of multi-stage apple fruit in complex orchard environments | |
CN109596620A (en) | Product surface shape defect detection method and system based on machine vision | |
CN111353432A (en) | Rapid honeysuckle medicinal material cleaning method and system based on convolutional neural network | |
Kong et al. | Detection model based on improved faster-RCNN in apple orchard environment | |
Huang et al. | Mango surface defect detection based on HALCON | |
Melnychenko et al. | Apple detection with occlusions using modified YOLOv5-v1 | |
CN116452872A (en) | Forest scene tree classification method based on improved deep pavv3+ | |
CN116071653A (en) | Automatic extraction method for multi-stage branch structure of tree based on natural image | |
CN115100533A (en) | Training and using method of litchi maturity recognition model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||