CN117058669A - Deep learning-based litchi fruit identification method - Google Patents

Deep learning-based litchi fruit identification method Download PDF

Info

Publication number
CN117058669A
Authority
CN
China
Prior art keywords
litchi
image
deep learning
identification
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311061583.4A
Other languages
Chinese (zh)
Inventor
彭红星 (Peng Hongxing)
张淇淇 (Zhang Qiqi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Agricultural University
Original Assignee
South China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Agricultural University filed Critical South China Agricultural University
Priority to CN202311061583.4A priority Critical patent/CN117058669A/en
Publication of CN117058669A publication Critical patent/CN117058669A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/68Food, e.g. fruit or vegetables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/422Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation for representing the structure of the pattern or shape of an object therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/52Scale-space analysis, e.g. wavelet analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a litchi fruit identification method based on deep learning, which comprises the following steps: S1, acquiring litchi images as sample images; S2, preprocessing the obtained litchi images; S3, labeling the preprocessed images, constructing a litchi data set and dividing the data set; S4, constructing a litchi identification model based on deep learning; S5, inputting the training set into the litchi identification model constructed in step S4 and training the model; S6, inputting the image to be identified into the litchi identification model trained in step S5 to obtain the litchi identification result for that image. The invention can automatically identify litchis of different maturity levels in natural environments, addressing the poor recognition performance of traditional litchi identification methods. The litchi identification model used in this scheme introduces a FasterNet module and an EMA attention mechanism into the deep learning network YOLOv8, so that important features can be extracted effectively and the network's recognition performance is improved.

Description

Deep learning-based litchi fruit identification method
Technical Field
The invention relates to the technical field of computer vision, in particular to a litchi fruit identification method based on deep learning.
Background
China is a major litchi-growing country, with a planting area of about 8.1 million mu and a 2022 litchi yield of about 2.2297 million tons. With the development of science and technology, more and more industries are shifting from manual to mechanized production, and agriculture is no exception. Introducing artificial intelligence technology into litchi picking makes automatic picking possible, helping farmers manage litchi orchards better, improving litchi quality and yield, reducing loss and waste, increasing farmers' income, and protecting the growing environment and biodiversity of litchi. A litchi picking robot is mainly responsible for identifying the litchi to be picked in the field environment, which is both a key point and a difficulty in developing such robots.
There are still many problems to be solved in existing litchi identification technology. For example, large-scale litchi data sets captured in natural environments are lacking. In addition, existing litchi identification techniques are generally limited to specific scenes and generalize poorly, because in real litchi growing environments fruit overlap, occlusion by branches and leaves, illumination changes, and variation in litchi shape and color caused by inconsistent maturity frequently occur and reduce identification accuracy.
Disclosure of Invention
The invention aims to provide a deep learning-based litchi fruit identification method that addresses the problems that existing litchi fruit identification methods are limited to specific scenes and generalize poorly: in real litchi growing environments, fruit overlap, occlusion by branches and leaves, illumination changes, and inconsistent maturity frequently occur and degrade identification accuracy.
In order to achieve the above purpose, the present invention provides the following technical solutions: the method comprises the following steps:
s1, acquiring a litchi image and taking the litchi image as a sample image;
s2, preprocessing the litchi image obtained in the step S1;
s3, marking the image preprocessed in the step S2, constructing a litchi data set, and dividing the data set into a training set, a verification set and a test set;
s4, constructing a litchi identification model based on deep learning according to the data set in the step S3;
s5, inputting the sample image of the training set in the step S3 into the litchi identification model constructed in the step S4, and training the litchi identification model;
s6, inputting the image to be identified into the litchi identification model trained in the step S5, and obtaining the litchi identification result of the image to be identified.
Preferably, the litchi identification model in step S4 is obtained by training, on the data set in step S3, a network model that uses YOLOv8 as the base network and introduces a FasterNet module and the EMA attention mechanism.
Preferably, the litchi fruits included in the sample image of the litchi dataset in step S3 include immature litchi fruits, semi-mature litchi fruits and completely mature litchi fruits.
Preferably, the preprocessing in step S2 is as follows:
The litchi data set is expanded by applying data enhancement of the geometric transformation class and the color transformation class to the sample images obtained in step S1.
Preferably, the data enhancement of the geometric transformation class includes, but is not limited to, horizontal flipping, and the data enhancement of the color transformation class includes, but is not limited to, random brightness transformation, Gaussian blurring, and Gaussian noise addition.
Preferably, the image labeling process in the step S3 is as follows:
The images preprocessed in step S2 are labeled manually, and the data set is divided into a training set, a verification set and a test set at a ratio of 7:2:1 in that order.
Compared with the prior art, the invention has the beneficial effects that:
1. Compared with the prior art, the invention first builds a litchi data set under natural environments to address the problem of small image data samples, then introduces the FasterNet module and an attention mechanism to improve the YOLOv8 network model, enhancing the feature extraction capability of the feature acquisition module and the fusion effect of the feature fusion module, and ultimately improving the detection precision, generalization ability and robustness of the model; finally, the trained litchi identification model is used to identify litchi, so that litchi at different maturity stages can be accurately identified. The litchi identification method provided by the invention has high accuracy in natural environments and can provide an effective identification method for litchi picking robots.
Drawings
FIG. 1 is an overall flow chart of the deep learning-based litchi fruit identification method of the present invention;
FIG. 2 shows example litchi images from the litchi data set of the deep learning-based litchi fruit identification method of the present invention;
FIG. 3 is a structural diagram of FasterNet in the deep learning-based litchi fruit identification method of the present invention;
FIG. 4 is a structural diagram of the EMA module in the deep learning-based litchi fruit identification method of the present invention;
FIG. 5 shows identification results on the test set in the deep learning-based litchi fruit identification method of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-5, the present invention provides a technical solution: the method comprises the following steps:
s1, acquiring a litchi image and taking the litchi image as a sample image;
s2, preprocessing the litchi image obtained in the step S1, wherein the preprocessing process is as follows:
The litchi data set is expanded by applying data enhancement of the geometric transformation class and the color transformation class to the sample images obtained in step S1.
Further, data enhancement of the geometric transformation class includes, but is not limited to, horizontal flipping, and data enhancement of the color transformation class includes, but is not limited to, random brightness transformation, Gaussian blurring, and adding Gaussian noise.
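As an illustrative sketch only (the patent does not disclose an implementation), the augmentations named above could be applied with OpenCV and NumPy roughly as follows; the function choices and parameter values are assumptions, not details from the disclosure.

```python
import cv2
import numpy as np

def augment(image):
    """Return augmented copies of a BGR litchi image (illustrative parameter values)."""
    augmented = []
    # Geometric transformation: horizontal flip
    augmented.append(cv2.flip(image, 1))
    # Color transformation: random brightness change
    factor = np.random.uniform(0.6, 1.4)
    augmented.append(np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8))
    # Color transformation: Gaussian blur
    augmented.append(cv2.GaussianBlur(image, (5, 5), sigmaX=1.5))
    # Color transformation: additive Gaussian noise
    noise = np.random.normal(0, 10, image.shape).astype(np.float32)
    augmented.append(np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8))
    return augmented
```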
S3, labeling the images preprocessed in step S2 with the LabelImg labeling software, constructing a litchi data set, and dividing the data set into a training set, a verification set and a test set; the litchi covered by the sample images of the litchi data set includes immature litchi fruits, semi-mature litchi fruits and completely mature litchi fruits, and the image labeling process is as follows:
The images preprocessed in step S2 are labeled manually, and the data set is divided into a training set, a verification set and a test set at a ratio of 7:2:1 in that order.
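A minimal sketch of the 7:2:1 split described above, assuming the labeled images are available as a flat list of file paths (the file layout and the fixed seed are illustrative assumptions):

```python
import random

def split_dataset(image_paths, seed=0):
    """Shuffle and split file paths into train/verification/test sets at a 7:2:1 ratio."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * 0.7)
    n_val = int(n * 0.2)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]   # remaining ~10%
    return train, val, test
```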
S4, constructing a deep learning-based litchi identification model from the data set in step S3: the model is based on YOLOv8, with its feature extraction part improved by replacing the backbone network with FasterNet and adding an EMA attention mechanism.
S5, inputting the sample image of the training set in the step S3 into the litchi identification model constructed in the step S4, and training the litchi identification model;
s6, inputting the image to be identified into the litchi identification model trained in the step S5, and obtaining the litchi identification result of the image to be identified.
Further, the YOLOv8 network model is the most recently released model in the YOLO (You Only Look Once) series of object detectors; it uses a single neural network to predict bounding boxes and categories of objects in an image. The YOLOv8 network model consists of several key parts, including a backbone network part (Backbone), a neck network part (Neck), and a detection head part (Head). The backbone extracts feature maps from the input image, and the neck and the detection head predict bounding boxes and categories of objects from those feature maps.
The backbone of YOLOv8 has 10 layers: layers 1, 2, 4, 6 and 8 are CBS modules, layers 3, 5, 7 and 9 are C2f modules, and layer 10 is an SPPF module. The CBS module first convolves the input data to extract its features, then uses a BN layer for normalization, improving the stability and generalization ability of the network, and finally applies a SiLU activation function to transform the convolution output nonlinearly, enhancing the expressive power of the network. The C2f module convolves the input data to extract features, then uses a Split operation to divide the features into two parts: one part is fed into Bottleneck modules to obtain richer gradient-flow information, the other part is concatenated (Concat) with the outputs of the Bottleneck modules, and a final convolution is applied. The C2f module draws on the design of the C3 module and the ELAN concept, so that YOLOv8 obtains richer gradient-flow information while remaining lightweight. The SPPF module convolves the input feature map, applies pooling operations of different scales to the convolved features to extract multi-scale feature information, fuses that information with a Concat operation, and finally applies another convolution.
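As an illustrative aid only (not code from the disclosure), the CBS unit and the Bottleneck used inside C2f, as described above, can be sketched in PyTorch roughly as follows; channel counts and kernel sizes are assumptions consistent with the publicly documented YOLOv8 design.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv2d -> BatchNorm2d -> SiLU, the basic unit described above."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two CBS units with an optional residual connection, as used inside the C2f module."""
    def __init__(self, ch, shortcut=True):
        super().__init__()
        self.cv1 = CBS(ch, ch)
        self.cv2 = CBS(ch, ch)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

# quick shape check
x = torch.randn(1, 64, 80, 80)
print(Bottleneck(64)(x).shape)   # torch.Size([1, 64, 80, 80])
```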
The neck of YOLOv8 has 12 layers: layers 11 and 14 are up-sampling modules, layers 12, 15, 18 and 21 are Concat modules, layers 13, 16, 19 and 22 are C2f modules, and layers 17 and 20 are CBS modules. The neck first applies the layer-11 up-sampling operation to the feature map output by the backbone, doubling its height and width. Layer 12 concatenates the feature map output by layer 11 with the output of layer 6, increasing the number of channels of the feature map. Layer 13 convolves the feature map with a C2f module to extract features. The layer-14 up-sampling module again doubles the height and width of the feature map output by layer 13. The layer-15 Concat module concatenates the feature map output by layer 14 with the output of layer 5, fusing feature information of different scales. Layer 16 convolves the feature map with a C2f module to extract features. The layer-17 CBS module convolves the feature map, halving its height and width. The layer-18 Concat module concatenates the feature map output by layer 17 with the output of layer 13, fusing feature information of different scales. Layer 19 convolves the feature map with a C2f module to extract features. The layer-20 CBS module convolves the feature map, halving its height and width. The layer-21 Concat module concatenates the feature map output by layer 20 with the output of layer 10, fusing feature information of different scales. Layer 22 uses a C2f module to convolve the feature map and extract the features fused from information of different scales.
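The recurring pattern in the neck (up-sample, Concat with a backbone feature map, then a C2f convolution) can be illustrated with dummy tensors; the shapes below are placeholders chosen for this sketch, not values taken from the patent.

```python
import torch
import torch.nn as nn

up = nn.Upsample(scale_factor=2, mode="nearest")   # the up-sampling module
p5 = torch.randn(1, 512, 20, 20)   # deep backbone output (placeholder shape)
c4 = torch.randn(1, 512, 40, 40)   # shallower backbone output (placeholder shape)

fused = torch.cat([up(p5), c4], dim=1)   # up-sampling doubles H and W; Concat adds channels
print(fused.shape)   # torch.Size([1, 1024, 40, 40]) -- a C2f module would then convolve this
```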
The YOLOv8 detection head takes the outputs of layers 16, 19 and 22 as inputs to the same decoupled head structure. The decoupled head consists of two branches, each made up of two CBS modules and a convolution module; one branch predicts the bounding box of the target and the other predicts its class.
YOLOv8 uses BCE Loss as the classification loss, and CIoU Loss together with DFL (Distribution Focal Loss) as the regression loss.
The formula of BCE Loss is: Loss = -w[p·log(q) + (1-p)·log(1-q)]
where p and q are the theoretical label and the actual predicted value respectively, and w is a weight; log here denotes the natural logarithm (ln).
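A small numerical check of the formula above, assuming PyTorch is available (the tensor values are made up for illustration): with w = 1 the expression matches the library's binary cross-entropy, confirming that log is the natural logarithm.

```python
import torch
import torch.nn.functional as F

p = torch.tensor([1.0, 0.0, 1.0])   # theoretical labels
q = torch.tensor([0.9, 0.2, 0.6])   # predicted probabilities
manual = -(p * q.log() + (1 - p) * (1 - q).log()).mean()
library = F.binary_cross_entropy(q, p)   # same value, about 0.2798
print(manual.item(), library.item())
```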
The formula of CIoU Loss is:
CIoU Loss = 1 - CIoU
where d_o is the Euclidean distance between the center points of the target box and the prediction box, d_c is the diagonal length of the smallest box enclosing the two boxes, w_gt and h_gt are the width and height of the ground-truth box, and w_p and h_p are the width and height of the prediction box.
The formula of the DFL is:
where y is the theoretical label, y_i and y_{i+1} are the two values adjacent to y, and p_i and p_{i+1} are the probabilities that the predicted bounding-box distribution assigns to them.
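The explicit CIoU and DFL expressions do not survive in the text above. For reference, the commonly published forms that match the variable definitions given here are reproduced below; this is a reconstruction from the standard literature, not the verbatim equations of the original filing.

```latex
\mathrm{CIoU} = \mathrm{IoU} - \frac{d_o^{2}}{d_c^{2}} - \alpha v,\qquad
v = \frac{4}{\pi^{2}}\left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w_{p}}{h_{p}}\right)^{2},\qquad
\alpha = \frac{v}{(1-\mathrm{IoU}) + v}

\mathrm{CIoU\ Loss} = 1 - \mathrm{CIoU}

\mathrm{DFL}(p_i, p_{i+1}) = -\bigl[(y_{i+1} - y)\log p_i + (y - y_i)\log p_{i+1}\bigr]
```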
In this technical scheme, a method for rapidly identifying litchi using FasterNet and the EMA attention mechanism in YOLOv8 is provided. The invention optimizes the feature extraction part of YOLOv8 and replaces its backbone network with the FasterNet network, improving the detection speed and detection precision of YOLOv8 and thus the accuracy of the model. The EMA attention mechanism is used to enhance the YOLOv8 network, improving the model's feature-discrimination and multi-scale processing abilities and its performance on small targets.
FasterNet is a lightweight deep convolutional network based on the partial convolution module PConv, which extracts spatial features efficiently while reducing redundant computation and memory access. Through a streamlined model design, FasterNet greatly increases running speed while maintaining detection performance.
FIG. 3 is a structural diagram of FasterNet. The FasterNet structure has 8 layers: layer 1 is a PatchEmbed layer, layers 2, 4, 6 and 8 are FasterNet Block stages, and layers 3, 5 and 7 are PatchMerging layers. FasterNet first extracts patch features through the PatchEmbed module and then passes them through multi-stage FasterNet Block modules; FasterNet defines 4 stages in total, each containing several FasterNet Block modules. The first stage contains 1 FasterNet Block module, after which the feature map is halved by the first PatchMerging module. The second stage contains 2 FasterNet Block modules, after which the feature map is halved by the second PatchMerging module. The third stage contains 8 FasterNet Block modules, after which the feature map is halved by the third PatchMerging module. Finally, the fourth stage contains 2 FasterNet Block modules.
The FasterNet Block module applies the PConv convolution to the input data, follows it with a 1×1 convolution layer, normalizes with a BN layer, applies a ReLU activation function for a nonlinear transformation, stacks another 1×1 convolution layer, and finally makes a residual connection between this result and the input of the FasterNet Block module, taking the sum as the module's final output.
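A simplified PyTorch sketch of PConv and the FasterNet Block described above; the channel fraction and expansion ratio follow the public FasterNet paper and are assumptions, not values taken from the disclosure.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: convolve only a fraction of the channels, pass the rest through."""
    def __init__(self, channels, partial_ratio=0.25):
        super().__init__()
        self.conv_ch = int(channels * partial_ratio)
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, 1, 1, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.conv_ch], x[:, self.conv_ch:]
        return torch.cat((self.conv(x1), x2), dim=1)

class FasterNetBlock(nn.Module):
    """PConv -> 1x1 conv -> BN -> ReLU -> 1x1 conv, with a residual connection."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)
        self.conv1 = nn.Conv2d(channels, hidden, 1, bias=False)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.ReLU()
        self.conv2 = nn.Conv2d(hidden, channels, 1, bias=False)

    def forward(self, x):
        y = self.conv2(self.act(self.bn(self.conv1(self.pconv(x)))))
        return x + y

# quick shape check
print(FasterNetBlock(64)(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```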
EMA (Efficient Multi-Scale Attention) is an efficient multi-scale attention mechanism that achieves flexible channel-relationship learning by modeling the dependencies between channels. The core idea of EMA is to divide the input feature map X into G groups and apply adaptive average pooling to each group along the height and width directions to obtain context information in those two directions. The pooled results are then concatenated (Concat) along the channel dimension, interactions between channels are learned through a 1×1 convolution, and the result is activated by a Sigmoid to serve as attention weights.
FIG. 4 is a structural diagram of EMA. EMA divides the input features into G sub-features along the channel dimension to learn different semantics, and extracts attention-weight descriptors of the grouped feature map with three parallel branches: the first and second branches are 1×1 branches and the third is a 3×3 branch. The first two branches apply average pooling along the X and Y directions respectively, concatenate the results, and apply a 1×1 convolution; the output of the 1×1 convolution is decomposed into two tensors, each is passed through a Sigmoid and multiplied with the grouped feature map, and the channels are then normalized by a GroupNorm operation. The 3×3 branch stacks only one 3×3 convolution kernel to capture multi-scale features. For the output of the 1×1 branch, EMA encodes global spatial information with two-dimensional global average pooling and fits a linear transformation with the nonlinear Softmax function at that output; the result is multiplied with the corresponding feature matrix to obtain the first spatial attention map. Global spatial information is likewise encoded on the 3×3 branch with two-dimensional global average pooling, a linear transformation is fitted with Softmax, and a second spatial attention map is obtained by matrix multiplication. Finally, the attention weights are passed through a Sigmoid activation and multiplied with the grouped feature map to realize the attention mechanism.
EMA aggregates cross-spatial information along different spatial dimension directions to achieve richer feature aggregation. Through the attention weights in the height and width directions, the dependencies between channels can be modeled, so that the network learns the interactions between different channels. The advantages of EMA are: 1) it is computationally efficient and does not add excessive computation; 2) it can model non-local channel dependencies; 3) it avoids the quadratic complexity of self-attention.
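A simplified PyTorch sketch of the EMA mechanism as described above; the group count, the GroupNorm choice and the exact cross-branch pairing follow the public EMA paper and are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Simplified Efficient Multi-Scale Attention sketch (channels must be divisible by groups)."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        c = channels // groups
        self.conv1x1 = nn.Conv2d(c, c, 1)
        self.conv3x3 = nn.Conv2d(c, c, 3, padding=1)
        self.gn = nn.GroupNorm(c, c)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over the width direction
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over the height direction
        self.gap = nn.AdaptiveAvgPool2d(1)              # two-dimensional global average pooling

    def forward(self, x):
        b, ch, h, w = x.shape
        c = ch // self.g
        g = x.reshape(b * self.g, c, h, w)               # split into G sub-features
        # 1x1 branch: directional pooling, shared 1x1 conv, per-direction Sigmoid gating
        x_h = self.pool_h(g)                             # (b*g, c, h, 1)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)         # (b*g, c, w, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: a single 3x3 convolution captures multi-scale context
        x2 = self.conv3x3(g)
        # cross-spatial aggregation: each branch's pooled descriptor attends to the other branch
        w1 = torch.softmax(self.gap(x1).reshape(b * self.g, 1, c), dim=-1)
        w2 = torch.softmax(self.gap(x2).reshape(b * self.g, 1, c), dim=-1)
        y1 = torch.matmul(w1, x2.reshape(b * self.g, c, h * w))
        y2 = torch.matmul(w2, x1.reshape(b * self.g, c, h * w))
        weights = (y1 + y2).reshape(b * self.g, 1, h, w).sigmoid()
        return (g * weights).reshape(b, ch, h, w)

# quick shape check
print(EMA(64)(torch.randn(2, 64, 40, 40)).shape)   # torch.Size([2, 64, 40, 40])
```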
In the implementation of this embodiment, precision P (Precision), recall R (Recall) and mAP50 (mean Average Precision) are adopted as the evaluation indices of the experiment. P and R respectively measure the proportion of predicted positives that are correct and the proportion of actual positives that are detected. mAP50 is the mean of the AP values over all classes when the IoU (Intersection Over Union) threshold between the predicted box and the ground-truth box is set to 0.5. The higher the mAP50, the more accurate the model; its value ranges from 0 to 1, and smaller values indicate worse detection. The calculation formulas of the evaluation indices are as follows:
where TP is the number of true-positive samples, FP the number of false-positive samples, FN the number of false-negative samples, C the number of categories, N the number of reference thresholds, k a threshold index, P(k) the precision at threshold k, and R(k) the recall at threshold k.
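The metric formulas referenced above do not appear in the extracted text; the standard definitions consistent with the variable list are given below (a reconstruction, not the verbatim equations of the filing). mAP50 is this mAP evaluated at an IoU threshold of 0.5.

```latex
P = \frac{TP}{TP + FP},\qquad
R = \frac{TP}{TP + FN}

AP = \sum_{k=1}^{N} P(k)\,\bigl[R(k) - R(k-1)\bigr],\qquad
mAP = \frac{1}{C}\sum_{c=1}^{C} AP_{c}
```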
This example was implemented in an environment with an AMD Ryzen 7 5800H CPU, an NVIDIA GeForce RTX 3060 Laptop GPU, CUDA 11.7.1, Python 3.8.16, Torch 1.13.1, and Ubuntu 18.04.6 LTS.
After testing on 197 litchi images, the precision for mature, semi-mature and immature litchi fruits is 80.3%, 68.9% and 81.5% respectively, the recall is 83.6%, 67.4% and 71.6% respectively, and the mAP50 values are 87.6%, 72.4% and 72.5% respectively, which verifies the effectiveness of the method provided in this embodiment.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A litchi fruit identification method based on deep learning, characterized by comprising the following steps:
s1, acquiring a litchi image and taking the litchi image as a sample image;
s2, preprocessing the litchi image obtained in the step S1;
s3, marking the image preprocessed in the step S2, constructing a litchi data set, and dividing the data set into a training set, a verification set and a test set;
s4, constructing a litchi identification model based on deep learning according to the data set in the step S3;
s5, inputting the sample image of the training set in the step S3 into the litchi identification model constructed in the step S4, and training the litchi identification model;
s6, inputting the image to be identified into the litchi identification model trained in the step S5, and obtaining the litchi identification result of the image to be identified.
2. The litchi fruit identification method based on deep learning as claimed in claim 1, characterized in that: the litchi identification model in step S4 is obtained by training, on the data set in step S3, a network model that uses YOLOv8 as the base network and introduces a FasterNet module and the EMA attention mechanism.
3. The litchi fruit identification method based on deep learning as claimed in claim 1, characterized in that: the litchi fruits included in the sample images of the litchi data set in step S3 include immature litchi fruits, semi-mature litchi fruits and completely mature litchi fruits.
4. The litchi fruit identification method based on deep learning as claimed in claim 1, characterized in that: the preprocessing process in step S2 is as follows:
the litchi data set is expanded by applying data enhancement of the geometric transformation class and the color transformation class to the sample images obtained in step S1.
5. The litchi fruit identification method based on deep learning as claimed in claim 4, characterized in that: data enhancement of the geometric transformation class includes, but is not limited to, horizontal flipping, and data enhancement of the color transformation class includes, but is not limited to, random brightness transformation, Gaussian blurring, and adding Gaussian noise.
6. The litchi fruit identification method based on deep learning as claimed in claim 1, characterized in that: the image labeling process in step S3 is as follows:
the images preprocessed in step S2 are labeled manually, and the data set is divided into a training set, a verification set and a test set at a ratio of 7:2:1 in that order.
CN202311061583.4A 2023-08-23 2023-08-23 Deep learning-based litchi fruit identification method Pending CN117058669A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311061583.4A CN117058669A (en) 2023-08-23 2023-08-23 Deep learning-based litchi fruit identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311061583.4A CN117058669A (en) 2023-08-23 2023-08-23 Deep learning-based litchi fruit identification method

Publications (1)

Publication Number Publication Date
CN117058669A true CN117058669A (en) 2023-11-14

Family

ID=88664148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311061583.4A Pending CN117058669A (en) 2023-08-23 2023-08-23 Deep learning-based litchi fruit identification method

Country Status (1)

Country Link
CN (1) CN117058669A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117630012A (en) * 2023-11-29 2024-03-01 广东石油化工学院 High-efficiency lightweight litchi fruit anthracnose detection method for complex agricultural scene
CN117630012B (en) * 2023-11-29 2024-05-17 广东石油化工学院 High-efficiency lightweight litchi fruit anthracnose detection method for complex agricultural scene
CN118071751A (en) * 2024-04-22 2024-05-24 成都中科卓尔智能科技集团有限公司 YOLOv 8-based defect detection method

Similar Documents

Publication Publication Date Title
CN108830188A (en) Vehicle checking method based on deep learning
CN111611924B (en) Mushroom identification method based on deep migration learning model
CN114332621B (en) Disease and pest identification method and system based on multi-model feature fusion
CN114821014B (en) Multi-mode and countermeasure learning-based multi-task target detection and identification method and device
Shen et al. Image recognition method based on an improved convolutional neural network to detect impurities in wheat
CN111582337A (en) Strawberry malformation state detection method based on small sample fine-grained image analysis
CN117058669A (en) Deep learning-based litchi fruit identification method
CN112329771B (en) Deep learning-based building material sample identification method
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
Yang et al. Instance segmentation and classification method for plant leaf images based on ISC-MRCNN and APS-DCCNN
CN114140665A (en) Dense small target detection method based on improved YOLOv5
CN114494910B (en) Multi-category identification and classification method for facility agricultural land based on remote sensing image
Sun et al. YOLO-P: An efficient method for pear fast detection in complex orchard picking environment
CN113435254A (en) Sentinel second image-based farmland deep learning extraction method
CN111882000A (en) Network structure and method applied to small sample fine-grained learning
CN111310820A (en) Foundation meteorological cloud chart classification method based on cross validation depth CNN feature integration
Wang et al. Apple rapid recognition and processing method based on an improved version of YOLOv5
CN114067171A (en) Image recognition precision improving method and system for overcoming small data training set
CN117372853A (en) Underwater target detection algorithm based on image enhancement and attention mechanism
CN112465821A (en) Multi-scale pest image detection method based on boundary key point perception
CN116977859A (en) Weak supervision target detection method based on multi-scale image cutting and instance difficulty
Liu Interfruit: deep learning network for classifying fruit images
Zhi-Feng et al. Light-YOLOv3: fast method for detecting green mangoes in complex scenes using picking robots
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion
CN112487909A (en) Fruit variety identification method based on parallel convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination