CN112766361A - Target fruit detection method and detection system under homochromatic background - Google Patents

Publication number
CN112766361A
Authority
CN
China
Legal status: Pending
Application number
CN202110061551.9A
Other languages
Chinese (zh)
Inventor
贾伟宽
张中华
邵文静
刘杰
侯素娟
Current Assignee
Shandong Normal University
Original Assignee
Shandong Normal University
Application filed by Shandong Normal University
Priority to CN202110061551.9A
Publication of CN112766361A

Classifications

    • G06F18/2415 — Pattern recognition; classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N3/08 — Neural networks; learning methods
    • G06V10/22 — Image preprocessing by selection of a specific region containing or referencing a pattern
    • G06V20/68 — Scenes; type of objects; food, e.g. fruit or vegetables
    • G06V2201/07 — Indexing scheme relating to image or video recognition; target detection


Abstract

The present disclosure provides a method and a system for detecting target fruits against a homochromatic (same-color) background, comprising the following steps: acquiring image data of a target fruit against a homochromatic background, and preprocessing the images; extracting image features from the acquired image data with a deep convolutional network, and fusing the image features with a feature pyramid network to obtain fused prediction feature maps; and predicting on the feature map of each level of the feature pyramid network, generating predicted values for the target fruit in a fully convolutional manner through separate classification and regression branches. The method combines a deep convolutional network and a feature pyramid network to extract feature maps and predicts in a single-stage, fully convolutional manner; it identifies fruit efficiently in both accuracy and speed, is highly robust in homochromatic background environments, and meets the requirements of practical operation.

Description

Target fruit detection method and detection system under homochromatic background
Technical Field
The present disclosure relates to the technical field of intelligent agriculture, and in particular to a target fruit detection method and detection system for homochromatic backgrounds.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In the traditional fruit and vegetable industry, every stage of the production cycle is currently dominated by manual labor, which is time-consuming, labor-intensive, costly, and inefficient, so automating production in complex orchard environments is an inevitable trend for the industry. The timing of variable-rate pesticide and fertilizer spraying, yield estimation, and intelligent picking is generally determined by detecting the actual condition of the fruit, so accurate and rapid detection of fruit targets is of great significance.
The inventors found that recognizing target fruits in a real orchard environment is usually hampered by interference such as branch occlusion, overlapping fruit, and illumination changes. For green fruits in particular, the fruit color is very similar to the background color of the leaves, so leaves and green fruits are easily confused with each other; this sharply increases the difficulty of fruit recognition and hinders the intelligent management of orchards.
A certain research foundation has been accumulated in this field, mostly based on machine learning or deep learning. Machine-learning-based recognition methods usually require preprocessing, feature selection, and similar operations, cannot realize an end-to-end detection process, and their recognition results are easily affected by the various interferences of the natural environment. Deep-learning-based recognition methods clearly improve accuracy and can realize an end-to-end detection process, but convolution operations and the models' dependence on anchor boxes consume large amounts of computing and storage resources, so the recognition speed cannot meet real-time requirements.
Disclosure of Invention
The present disclosure provides a target fruit detection method and detection system for homochromatic backgrounds that can meet the detection requirements of intelligent agricultural applications such as intelligent picking, variable-rate pesticide and fertilizer spraying, and yield estimation, while improving both detection speed and accuracy.
To achieve this purpose, the present disclosure adopts the following technical solution:
One or more embodiments provide a method for detecting a target fruit against a homochromatic background, comprising the following steps:
acquiring image data of a target fruit against a homochromatic background, and preprocessing the images;
extracting image features from the acquired image data with a deep convolutional network, and fusing the image features with a feature pyramid network to obtain fused prediction feature maps; and
predicting on the feature map of each level of the feature pyramid network, and generating predicted values for the target fruit in a fully convolutional manner through the two branches of classification and regression.
One or more embodiments provide a target fruit detection system for homochromatic backgrounds, comprising:
an image acquisition module, configured to acquire image data of a target fruit against a homochromatic background and preprocess the images;
a feature map extraction and fusion module, configured to extract image features from the acquired image data with a deep convolutional network and fuse them with a feature pyramid network to obtain fused prediction feature maps; and
a prediction module, configured to predict on the feature map of each level of the feature pyramid network and generate predicted values for the target fruit in a fully convolutional manner through the two branches of classification and regression.
An electronic device, comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the above method.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the above method.
Compared with the prior art, the beneficial effects of the present disclosure are:
(1) On the premise of guaranteeing accuracy, the method increases speed, balancing accuracy against efficiency, and, combined with the practical operating requirements of various automated applications in the orchard environment, provides a target fruit detection model with high accuracy, high speed, strong robustness, and good adaptability.
(2) The method can quickly and accurately locate target fruits against a homochromatic background. Predicting in a single-stage, fully convolutional manner, it identifies fruit efficiently in both accuracy and speed, is robust in homochromatic background environments, and meets the requirements of practical operation.
(3) The method eliminates the dependence of mainstream detection algorithms on anchor boxes, markedly improving algorithm complexity, detection speed, storage footprint, and adaptability, and effectively improving the stability and applicability of the model when deployed in various intelligent agriculture applications.
Advantages of additional aspects of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and not to limit the disclosure.
FIG. 1 is a flowchart of the overall detection method of embodiment 1 of the present disclosure;
fig. 2 shows fruit images collected under different interference scenes in the same color system background according to embodiment 1 of the present disclosure;
FIG. 3 is a diagram of a structure of a result prediction stage of a detection model corresponding to the detection method in embodiment 1 of the present disclosure;
FIG. 4 is a diagram of the effect of fruits of different scales predicted by the detection method of embodiment 1 of the present disclosure;
Fig. 5 is a schematic diagram of the partition of mapping a forward sampling region onto an input target fruit picture according to embodiment 1 of the present disclosure;
FIG. 6 is an overall flowchart of a single iteration in the training process of the detection model according to embodiment 1 of the present disclosure;
fig. 7 shows the detection effect of embodiment 1 of the present disclosure on fruits against two homochromatic backgrounds.
Detailed Description
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise. It should be noted that, in the case of no conflict, the embodiments and features in the embodiments in the present disclosure may be combined with each other. The embodiments will be described in detail below with reference to the accompanying drawings.
Example 1
In one or more embodiments, as shown in fig. 1, a method for detecting a target fruit against a homochromatic background includes the following steps:
step 1, acquiring image data of a target fruit under the background of the same color system, and preprocessing the image;
step 2, extracting image features of the acquired image data by adopting a depth convolution network, and fusing the image features by using a feature pyramid network to obtain a fused prediction feature map;
and 3, respectively predicting the feature map of each level of the feature pyramid network, and generating a predicted value of the target fruit by a full convolution method through classification and regression of two branches.
In this embodiment, a deep convolutional network and a feature pyramid network are combined to extract feature maps, and prediction is carried out in a single-stage, fully convolutional manner; fruits can be identified efficiently in both accuracy and speed, robustness is strong in homochromatic background environments, and the requirements of practical operation are met.
In step 1, images of different kinds of fruit can be collected against homochromatic backgrounds with an image acquisition device such as a camera.
The preprocessing includes filling (padding) and cropping the images.
In step 2, the method of extracting image features from the acquired image data with a deep convolutional network and fusing them with a feature pyramid network is specifically as follows: image features are extracted with a convolutional neural network (ResNet) as the backbone; the feature maps output by each residual block in ResNet are fused in a top-down, laterally connected fashion so that deep and shallow feature maps carry the same level of semantics, gradually enriching the semantic representation of the lower-level feature maps; the resulting feature pyramid is the fused prediction feature map.
in this embodiment, the method for extracting image features specifically includes: outputting the image to a residual error network ResNet by taking batch as a unit, and performing convolution and pooling operation; feature expression capabilities contained in deep feature maps are gradually enriched by convolution and pooling operations.
In this embodiment, the method for fusing image features specifically includes: and fusing the feature maps with different sizes output by each residual block in ResNet according to a top-down and transverse connection mode, so that the deep feature map and the shallow feature map have the same level of semantic capacity, and a feature pyramid is obtained.
Optionally, the lateral connections change the number of channels to a fixed value, such as 256, by a 1 × 1 convolution;
optionally, the top-down path may upsample to the same size by nearest-neighbor interpolation, with the deep and shallow features finally fused by pixel-wise addition.
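The lateral-plus-top-down fusion just described can be sketched in NumPy. This is a toy illustration: the 1 × 1 lateral convolutions are assumed to have already produced fixed-channel (e.g. 256) maps, and the function names are illustrative:

```python
import numpy as np

def upsample_nearest(x, factor=2):
    # Nearest-neighbour interpolation: repeat each pixel along H and W.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse_top_down(laterals):
    """Fuse laterally-connected maps (deepest, lowest-resolution first)
    top-down by upsampling and pixel-wise addition, as in a feature
    pyramid. Each map is assumed to already have the fixed channel
    count produced by the 1x1 lateral convolutions.
    """
    fused = [laterals[0]]
    for lat in laterals[1:]:
        fused.append(lat + upsample_nearest(fused[-1]))
    return fused[::-1]  # shallowest (highest resolution) first
```

With each successive lateral map twice the spatial size of the previous one, the upsampled deep semantics are added into every shallower level.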
The prediction model constructed by combining the deep convolutional network with the feature pyramid network effectively improves detection of target fruits at different scales, particularly small-scale target fruits.
In step 3, prediction is performed separately on the feature map of each level of the feature pyramid network. Through the two branches of classification and regression, a class-sensitive semantic map is predicted in a fully convolutional manner as the probability that a fruit exists, together with the mapping from each positive sample to the center point and box coordinates in the original image; these constitute the generated predicted values of the target fruit. The steps are specifically as follows:
step 31, assigning each target fruit annotation box, according to its scale, to the feature map of the appropriate level for prediction; obtaining the positive sampling region of that annotation box on the feature map with a shrink factor; and deciding whether each spatial position on the feature map is a positive or negative sample;
The scale-based assignment is specifically: according to the area of the target fruit annotation box, the box is assigned to the feature map best suited to predicting it.
The positive/negative sample decision is specifically: denote the current feature map Pl; map the annotation box it is responsible for onto Pl according to the downsampling factor s to obtain the corresponding region, then shrink that region by the shrink factor σ to obtain the positive sampling region Rpos; spatial positions inside Rpos are positive samples, and all others are negative samples.
Step 32: for each positive sample, predict, through the classification and box regression branches, the confidence that it belongs to a fruit and the regularized offsets between the positive sample and the ground-truth annotation box.
Optionally, as shown in fig. 6, the procedure of steps 1-3 may be implemented in a fully convolutional neural network model fused with a feature pyramid network. The target fruit detection model of this embodiment is such a model, and its structure is comparatively simple: a backbone network responsible for extracting features, a feature pyramid responsible for fusing features, and prediction branch networks responsible for generating the prediction results, connected in sequence; the backbone network and the prediction branch networks each adopt convolutional neural networks.
Optionally, the method for training the full convolution neural network model fused with the pyramid network includes the following steps:
(1) Acquire fruit images containing different types of interference against homochromatic backgrounds, preprocess them, and annotate each image with the minimum enclosing rectangle of the target fruit to obtain the annotation information of the fruit image.
Optionally, green fruits are selected and photographed against homochromatic backgrounds so that the acquired images contain as many different types of interference as possible, representing the real orchard environment, as shown in fig. 2.
The collected images are resized to a uniform 600 × 400, and the minimum enclosing rectangle of the target fruit in each image is then annotated. Label files can be generated in the MS COCO dataset format with the labelme image annotation tool, which makes it convenient to generate the model's training targets later.
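An MS COCO detection label stores each object as a dict whose `bbox` is `[x, y, width, height]`, with `area` and `iscrowd` fields. A hedged sketch of converting one labelme rectangle (two corner points) into such an annotation entry — the helper name and id bookkeeping are illustrative, not part of the patent:

```python
def labelme_rect_to_coco(points, image_id, category_id, ann_id):
    """Convert a labelme rectangle (two corner points, in any order)
    into an MS COCO style annotation dict.

    COCO stores boxes as [x, y, width, height]; labelme rectangles
    store two opposite corners.
    """
    (xa, ya), (xb, yb) = points
    x1, y1 = min(xa, xb), min(ya, yb)
    w, h = abs(xb - xa), abs(yb - ya)
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x1, y1, w, h],
        "area": w * h,
        "iscrowd": 0,
    }
```

A full converter would also emit the COCO `images` and `categories` lists, but the per-box entry above is the part the training targets depend on.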
(2) And extracting image features of the acquired image data by adopting a depth convolution network, and fusing the image features by using a feature pyramid network.
Image features are extracted with ResNet, and the outputs of the last three residual blocks, conv3, conv4, and conv5, are used for feature fusion to construct the feature pyramid network; they are denoted {C3, C4, C5}.
In this embodiment, the feature fusion steps are: C5 is passed through a 1 × 1 convolution that changes the number of channels to 256, giving a new feature map denoted P5; P5 is then downsampled by stride-2 convolutions to obtain P6 and P7 in turn; {C3, C4, C5} are then processed with lateral connections and a top-down structure to obtain the {P3, P4, P5} layers.
The lateral connections change the number of channels to a fixed value, which may be 256, with a 1 × 1 convolution.
The top-down path upsamples to the same size by nearest-neighbor interpolation, and finally the deep and shallow features are fused by pixel-wise addition, yielding the final fused feature pyramid {P3, P4, P5, P6, P7}.
(3) Predict separately on the feature map of each level of the feature pyramid network, and generate the predicted values of the target fruit in a fully convolutional manner through the two branches of classification and regression.
this step corresponds to the method of step 3 described above. Specifically, as shown in FIG. 3, for { P3,P4,P5,P6,P7Predicting each layer feature map separately, and assuming the current feature map is Pl∈RH×W×CRespectively inputting the semantic graph into a classification and regression subnet, and predicting a category-sensitive semantic graph P in a full convolution model cls∈RH×W×1As the probability of fruit existence, and the mapping relation P of the center point of the original image and the frame coordinate corresponding to a positive samplel reg∈RH×W×4. Taking the regression subnet as an example, it first generates the prediction value by 4 convolutions of 3 × 3, each convolution containing C convolution kernels and being activated by ReLU, and finally, by one convolution of 3 × 3 containing 4 convolution kernels.
(4) Determine the training target of each spatial position on the feature maps according to the annotation information (annotation boxes) from step (1), as follows:
(4-1) Scale assignment: according to the size of the target fruit annotation box in the image annotation information, assign the annotation box to the feature map of the appropriate level, as in step 31 above.
Specifically, in this step, for more stable model training, each layer feature map in {P3, P4, P5, P6, P7} is assigned a basic scale rl ∈ {32, 64, 128, 256, 512}; the scale range of the target fruit annotation boxes (i.e., ground-truth boxes) that the l-th layer feature map is responsible for predicting is:
[rl/η,rl·η] (1)
The value of the hyperparameter η controlling the scale range is tuned according to the dataset formed from the data acquired in step (1) and the evaluation performance on the validation set split from it. This approach increases the number of positive-sample spatial positions in each layer, relieves the positive/negative sample imbalance to a certain extent, and helps optimize the semantic expression of feature maps at adjacent levels.
When an image is predicted according to this feature-level assignment strategy, the prediction effect of all spatial positions in the positive sampling region on the feature map responsible for prediction is mapped back onto the original image, as shown in fig. 4. The figure contains two fruits, a persimmon and an apple; they have different scales and are assigned to feature maps of different levels for prediction, the dark-colored feature map layer in the figure being the assigned region.
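The scale assignment can be sketched as a small helper. The basic scales {32, 64, 128, 256, 512} come from the description; η = 2.0 is an assumed value (the patent tunes η on a validation set), and since adjacent ranges then overlap, resolving ties to the shallowest matching level is also an assumption:

```python
import math

def assign_level(box_w, box_h, base_scales=(32, 64, 128, 256, 512),
                 levels=(3, 4, 5, 6, 7), eta=2.0):
    """Assign a ground-truth box to the pyramid level whose valid scale
    range [r_l/eta, r_l*eta] contains sqrt(box area).

    eta=2.0 and the tie-breaking to the shallowest matching level are
    illustrative assumptions, not values fixed by the patent.
    """
    scale = math.sqrt(box_w * box_h)
    for r, lvl in zip(base_scales, levels):
        if r / eta <= scale <= r * eta:
            return lvl
    # Boxes outside every range fall back to the nearest extreme level.
    return levels[0] if scale < base_scales[0] / eta else levels[-1]
```

With η > sqrt(2) the ranges overlap, so one box can be valid on several levels — which is exactly how this strategy increases the number of positive positions per layer.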
(4-2) Positive/negative sample decision: the positive sample region is obtained through the downsampling factor and the shrink factor, in the same way as step 31.
In this embodiment, specifically, let a ground-truth box be G = (x1, y1, x2, y2), where the ground-truth box is an annotation box annotated in the target fruit image in step (1), and let it be predicted by the feature map Pl of the current level. Denoting the downsampling ratio between the l-th layer feature map and the original image as sl, G is mapped onto Pl to obtain the feature region G' = (x'1, y'1, x'2, y'2):
x'1 = x1/sl, y'1 = y1/sl, x'2 = x2/sl, y'2 = y2/sl    (2)
The center coordinates (c'x, c'y), width w', and height h' of G' are:
c'x = (x'1 + x'2)/2, c'y = (y'1 + y'2)/2, w' = x'2 - x'1, h' = y'2 - y'1    (3)
G' is then shrunk by the shrink factor σ to obtain the positive sampling region Rpos, whose coordinates are:
x''1 = c'x - 0.5·σ·w', y''1 = c'y - 0.5·σ·h'
x''2 = c'x + 0.5·σ·w', y''2 = c'y + 0.5·σ·h'    (4)
In the training phase, all spatial positions inside Rpos are treated as positive samples, whose class target is the annotation class of G; all spatial positions outside Rpos are treated as negative samples. The mapping onto the input picture is shown in fig. 5.
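The mapping and shrinking of the ground-truth box into the positive sampling region Rpos can be sketched directly; σ = 0.4 is an assumed value here (a common FoveaBox-style setting), since the patent leaves the shrink factor unspecified:

```python
def positive_region(box, s_l, sigma=0.4):
    """Map a ground-truth box (x1, y1, x2, y2) onto level-l feature map
    coordinates (downsampling ratio s_l) and shrink it by sigma to get
    the positive sampling region Rpos.

    sigma=0.4 is an assumed default, not a value stated in the patent.
    """
    x1, y1, x2, y2 = (v / s_l for v in box)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = x2 - x1, y2 - y1
    return (cx - 0.5 * sigma * w, cy - 0.5 * sigma * h,
            cx + 0.5 * sigma * w, cy + 0.5 * sigma * h)

def is_positive(x, y, region):
    # A spatial position (x, y) inside the shrunk region is a positive sample.
    rx1, ry1, rx2, ry2 = region
    return rx1 <= x <= rx2 and ry1 <= y <= ry2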
(4-3) Determine the training target of each spatial position on the feature map, comprising a classification target and a regression target; specifically, the classification target and the regularized box-offset target can be obtained from the annotation box information corresponding to each positive sampling point.
Because the scale of the target fruit varies widely across the target fruit images, directly regressing box coordinate values is unstable; this embodiment therefore instead regresses the regularized offsets between the predicted box and the ground-truth box. A positive sample point (x, y) is first mapped onto the input picture through the downsampling ratio sl, and its regularized offsets from the four edges of the ground-truth box G are then computed.
For each positive sample (x, y) in Rpos, the regularized distances to the four edges of the ground-truth box G are computed directly and taken as the regression target of that point, expressed as (tx1, ty1, tx2, ty2) and defined as:
tx1 = log((sl·(x + 0.5) - x1)/rl), ty1 = log((sl·(y + 0.5) - y1)/rl)
tx2 = log((x2 - sl·(x + 0.5))/rl), ty2 = log((y2 - sl·(y + 0.5))/rl)    (5)
where sl is the downsampling ratio and rl is the basic scale assigned to each layer feature map.
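The regularized offsets defined above can be computed with a small helper. The FoveaBox-style mapping sl·(coord + 0.5), division by the base scale rl, and log transform follow the description; the helper name is illustrative:

```python
import math

def regression_target(x, y, box, s_l, r_l):
    """Regularized offsets from a positive sample (x, y) on the level-l
    feature map to the four edges of the ground-truth box.

    The point is mapped back to the input image via s_l*(coord + 0.5),
    and each edge distance is normalized by the base scale r_l and
    log-transformed, in the FoveaBox style the description follows.
    """
    x1, y1, x2, y2 = box
    px, py = s_l * (x + 0.5), s_l * (y + 0.5)
    return (math.log((px - x1) / r_l), math.log((py - y1) / r_l),
            math.log((x2 - px) / r_l), math.log((y2 - py) / r_l))
```

Since positive samples lie strictly inside the shrunk region, all four distances are positive and the logarithms are well defined.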
(5) Compute the loss between the predicted values of the target fruit from step (3) and the training targets from step (4), update the network parameters by gradient backpropagation (SGD), and train and evaluate iteratively to obtain the optimal model.
In this embodiment, the loss between the predicted value of the target fruit and the training target includes classification loss and regression loss.
Optionally, for the classification loss: because the number of positive samples is relatively small, there is a certain positive/negative sample imbalance, so this embodiment adopts the Focal Loss function to compute the loss value.
Alternatively, for the regression Loss, a Smooth L1 Loss function may be used for the calculation.
The total loss can be expressed by formula (6):
L = (1/Ncls)·Σi Lcls(pi, p*i) + λ·(1/Nreg)·Σi p*i·Lreg(ti, t*i)    (6)
As shown in formula (6), Lcls and Lreg are the losses produced by the classification and regression branches of the network respectively; p*i and t*i are the classification and regression targets corresponding to the i-th spatial position on the feature map; pi and ti are the classification and regression predicted values corresponding to the i-th spatial position; and λ regulates the balance between the two losses. When Focal Loss is used to compute the classification loss, the α parameter regulates the imbalance between the numbers and importance of positive and negative samples, and the γ parameter regulates the imbalance between hard and easy samples, preventing the model degradation caused by simple negative samples dominating the loss during training. When Smooth L1 Loss is used to compute the regression loss, the β parameter switches between different loss forms according to the magnitude of the error, overcoming both the slow convergence of L1 loss and the sensitivity of L2 loss to outliers. In addition, the two losses are regularized by Ncls and Nreg respectively. Finally, the model parameters are updated by backpropagating the gradient of LFoveaBox, and training and evaluation are iterated repeatedly to obtain the optimal model; the training process of a single iteration of the network is shown in fig. 6. The final prediction effect of the model is shown in fig. 7, which contains persimmon and apple fruits against homochromatic backgrounds; it can be seen that the target fruits are accurately detected.
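A hedged NumPy sketch of the total loss described in formula (6) — Focal Loss for classification over all positions and Smooth L1 for regression over positives only, normalized by Ncls and Nreg. The defaults α = 0.25, γ = 2, β = 1 are the common settings from the respective loss papers, not values stated in the patent:

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss; alpha/gamma defaults are the usual Focal Loss
    settings and are assumptions here."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(target == 1, p, 1 - p)
    a = np.where(target == 1, alpha, 1 - alpha)
    return -(a * (1 - pt) ** gamma * np.log(pt))

def smooth_l1(t, target, beta=1.0):
    # Quadratic for small errors, linear for large ones.
    d = np.abs(t - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)

def total_loss(p, p_star, t, t_star, lam=1.0):
    """Classification loss over all positions plus lambda times the
    regression loss over positive positions, each normalized by its
    own count as the description states."""
    n_cls = max(p.size, 1)
    pos = p_star == 1
    n_reg = max(int(pos.sum()), 1)
    l_cls = focal_loss(p, p_star).sum() / n_cls
    l_reg = smooth_l1(t[pos], t_star[pos]).sum() / n_reg
    return l_cls + lam * l_reg
```

Better classification confidences strictly lower the focal term, and only positive positions contribute to the regression term, matching the p*i factor in the sum.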
Example 2
Based on embodiment 1, this embodiment provides a target fruit detection system under the background of the same color system, including:
an image acquisition module: configured to acquire image data of a target fruit under the background of the same color system and to preprocess the images;
a feature map extraction and fusion module: configured to extract image features from the acquired image data using a deep convolutional network, and to fuse them through a feature pyramid network to obtain fused prediction feature maps;
a prediction module: configured to make predictions on the feature map of each level of the feature pyramid network respectively, generating predicted values of the target fruit through classification and regression branches in a fully convolutional manner.
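The three modules can be outlined as a minimal skeleton. Note the backbone and heads here are stand-ins (simple 2x downsampling and zero-filled outputs), not the trained ResNet/FPN network of embodiment 1; the class and method names are hypothetical.

```python
import numpy as np

class FruitDetector:
    """Skeleton mirroring the three modules of the system."""

    def __init__(self, num_levels=3):
        self.num_levels = num_levels

    def acquire(self, image):
        # image acquisition module: normalise pixel values to [0, 1]
        return image.astype(np.float32) / 255.0

    def extract_and_fuse(self, img):
        # feature extraction + fusion module (stand-in pyramid built
        # by successive 2x downsampling instead of ResNet + FPN)
        levels, x = [], img
        for _ in range(self.num_levels):
            levels.append(x)
            x = x[::2, ::2]
        return levels

    def predict(self, levels):
        # prediction module: one class score and a 4-d box offset per
        # spatial position of every pyramid level (stand-in: zeros)
        return [(np.zeros(lvl.shape), np.zeros(lvl.shape + (4,)))
                for lvl in levels]
```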
Example 3
The present embodiment provides an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor; when executed by the processor, the computer instructions perform the steps of the method of embodiment 1.
Example 4
The present embodiment provides a computer readable storage medium for storing computer instructions which, when executed by a processor, perform the steps of the method of embodiment 1.
The above description covers only preferred embodiments of the present disclosure and is not intended to limit it; those skilled in the art may make various modifications and changes to the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present disclosure shall fall within its protection scope.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A method for detecting target fruits under the background of the same color system is characterized by comprising the following steps:
acquiring image data of a target fruit under a homochromatic background, and preprocessing the images;
extracting image features from the acquired image data using a deep convolutional network, and fusing the image features through a feature pyramid network to obtain a fused prediction feature map;
and making predictions on the feature map of each level of the feature pyramid network respectively, generating predicted values of the target fruit through classification and regression branches in a fully convolutional manner.
2. The method for detecting a target fruit under the background of the same color system according to claim 1, wherein extracting image features from the acquired image data using a deep convolutional network and fusing them through a feature pyramid network to obtain a fused prediction feature map comprises:
transmitting the target fruit image to a residual network (ResNet) and performing convolution and pooling operations;
and fusing the feature maps of different sizes output by the residual blocks of ResNet in a top-down, laterally connected manner, so that deep and shallow feature maps carry the same level of semantic information, yielding a feature pyramid.
3. The method for detecting a target fruit under the background of the same color system according to claim 2, wherein the fusion is performed top-down with lateral connections: the lateral connection changes the number of channels to a fixed value by 1 × 1 convolution; the top-down pathway upsamples the deeper map to the same size by nearest-neighbor interpolation; and finally the deep features and the shallow features are fused by pixel-wise addition.
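The top-down fusion with lateral connections can be sketched as follows. This is a minimal NumPy illustration under assumed tensor shapes; the 1 × 1 convolution weights are hypothetical stand-ins rather than trained parameters.

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W); w: (C_out, C_in). A 1x1 convolution is a
    # per-pixel linear map over channels, changing C_in to C_out,
    # which is how the lateral connection fixes the channel count.
    return np.tensordot(w, x, axes=([1], [0]))

def nearest_upsample_2x(x):
    # (C, H, W) -> (C, 2H, 2W): each pixel is repeated, which is
    # exactly nearest-neighbour interpolation for a factor of 2.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(top_down, lateral, w_lateral):
    # top_down: deeper, coarser map already at the fixed channel
    # count; lateral: shallower backbone map. The lateral branch is
    # projected by 1x1 convolution, the top-down branch is upsampled,
    # and the two are fused by pixel-wise addition.
    return nearest_upsample_2x(top_down) + conv1x1(lateral, w_lateral)
```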
4. The method for detecting a target fruit under the background of the same color system according to claim 1, wherein making predictions on the feature map of each level of the feature pyramid network and generating predicted values of the target fruit through classification and regression branches in a fully convolutional manner specifically comprises:
assigning each target fruit labeling box to a feature map of a different level according to its scale, then obtaining, with a shrink factor, the positive sampling area of the labeling box that each level is responsible for predicting on its feature map, and determining each spatial position on the feature map as a positive or negative sample;
and for each positive sample, predicting, through the classification and box-regression branches, the confidence that it belongs to a fruit and the regularized offset between it and the target fruit labeling box.
5. The method for detecting a target fruit under the background of the same color system according to claim 1, wherein the target fruit detection method is implemented in a fully convolutional neural network model incorporating a feature pyramid network; the model structure comprises a backbone network responsible for extracting features, a feature pyramid responsible for fusing the features, and prediction branch networks responsible for generating prediction results; the backbone network, the feature pyramid and the prediction branch networks are connected in sequence, and the backbone network and the prediction branch networks each adopt a convolutional neural network.
6. The method for detecting a target fruit under the background of the same color system according to claim 5, wherein the method for training the fully convolutional neural network model incorporating the feature pyramid network comprises:
acquiring fruit images containing different types of interference under the background of the same color system, preprocessing and labeling them to obtain labeling information of the fruit images;
extracting image features from the acquired image data using a deep convolutional network, and fusing the image features through a feature pyramid network;
making predictions on the feature map of each level of the feature pyramid network respectively, and generating predicted values of the target fruit through classification and regression branches in a fully convolutional manner;
determining the training target of each spatial position on the feature maps according to the labeling information;
and calculating the loss between the predicted values of the target fruit and the training targets, updating the network parameters through gradient back-propagation, and iteratively training and evaluating to obtain the optimal model.
7. The method for detecting a target fruit under the background of the same color system according to claim 1, wherein determining the training target of each spatial position on the feature map according to the labeling information specifically comprises:
assigning each target fruit labeling box of the image labeling information to a feature map of a different level according to its size;
obtaining, according to the downsampling multiple and the shrink coefficient, the positive sampling area of the labeling box that each level is responsible for predicting on its feature map, and determining each spatial position on the feature map as a positive or negative sample;
and obtaining, for each spatial position on the feature map, a training target consisting of a classification target and a regularized box-offset target according to the labeling-box information corresponding to the positive sampling points.
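The positive-sample assignment can be sketched as follows. This is an illustrative NumPy version only: `positive_region` and `label_positions` are hypothetical helper names, and the stride and shrink-factor values used below are examples, not the values of the embodiment.

```python
import numpy as np

def positive_region(box, stride, sigma=0.4):
    # box = (x1, y1, x2, y2) in image coordinates. Map it onto the
    # feature map by the downsampling multiple (stride), then shrink
    # it around its centre by the shrink factor sigma; feature-map
    # locations inside the shrunken box become positive samples.
    x1, y1, x2, y2 = (v / stride for v in box)
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * sigma, (y2 - y1) * sigma
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def label_positions(shape, pos_box):
    # Classify every spatial position of a feature map of the given
    # (H, W) shape as positive (inside the shrunken box) or negative.
    x1, y1, x2, y2 = pos_box
    ys, xs = np.mgrid[:shape[0], :shape[1]]
    return (xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)
```

For example, a 64 × 64 box on a stride-8 level with sigma = 0.5 shrinks to a 4 × 4 region around the box centre on the 8 × 8 feature map; only positions inside it are trained as positives, everything else as negatives.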
8. A target fruit detection system under the background of the same color system, characterized by comprising:
an image acquisition module: configured to acquire image data of a target fruit under the background of the same color system and to preprocess the images;
a feature map extraction and fusion module: configured to extract image features from the acquired image data using a deep convolutional network, and to fuse them through a feature pyramid network to obtain fused prediction feature maps;
a prediction module: configured to make predictions on the feature map of each level of the feature pyramid network respectively, generating predicted values of the target fruit through classification and regression branches in a fully convolutional manner.
9. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
CN202110061551.9A 2021-01-18 2021-01-18 Target fruit detection method and detection system under homochromatic background Pending CN112766361A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110061551.9A CN112766361A (en) 2021-01-18 2021-01-18 Target fruit detection method and detection system under homochromatic background

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110061551.9A CN112766361A (en) 2021-01-18 2021-01-18 Target fruit detection method and detection system under homochromatic background

Publications (1)

Publication Number Publication Date
CN112766361A true CN112766361A (en) 2021-05-07

Family

ID=75702477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110061551.9A Pending CN112766361A (en) 2021-01-18 2021-01-18 Target fruit detection method and detection system under homochromatic background

Country Status (1)

Country Link
CN (1) CN112766361A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN111797846A (en) * 2019-04-08 2020-10-20 四川大学 Feedback type target detection method based on characteristic pyramid network
CN110619632A (en) * 2019-09-18 2019-12-27 华南农业大学 Mango example confrontation segmentation method based on Mask R-CNN

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHENCHEN ZHU ET AL.: "Feature Selective Anchor-Free Module for Single-Shot Object Detection", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *
TAO KONG ET AL.: "FoveaBox: Beyound Anchor-Based Object Detection", 《IEEE TRANSACTIONS ON IMAGE PROCESSING》 *
ZHI TIAN ET AL.: "FCOS: Fully Convolutional One-Stage Object Detection", 《2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
刘树春 等,: "《深度实践OCR基于深度学习的文字识别》", 31 May 2020, 机械工业出版社 *
岳有军 等: "基于改进Mask RCNN的复杂环境下苹果检测研究", 《中国农机化学报》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177529A (en) * 2021-05-27 2021-07-27 腾讯音乐娱乐科技(深圳)有限公司 Method, device and equipment for identifying screen splash and storage medium
CN113177529B (en) * 2021-05-27 2024-04-23 腾讯音乐娱乐科技(深圳)有限公司 Method, device, equipment and storage medium for identifying screen
CN113536986A (en) * 2021-06-29 2021-10-22 南京逸智网络空间技术创新研究院有限公司 Representative feature-based dense target detection method in remote sensing image
CN114494151A (en) * 2021-12-30 2022-05-13 山东师范大学 Fruit detection method and system under complex orchard environment
CN114549970A (en) * 2022-01-13 2022-05-27 山东师范大学 Night small target fruit detection method and system fusing global fine-grained information

Similar Documents

Publication Publication Date Title
CN112766361A (en) Target fruit detection method and detection system under homochromatic background
CN109829399B (en) Vehicle-mounted road scene point cloud automatic classification method based on deep learning
CN109711288B (en) Remote sensing ship detection method based on characteristic pyramid and distance constraint FCN
CN111695482A (en) Pipeline defect identification method
Zhang et al. Hybrid region merging method for segmentation of high-resolution remote sensing images
CN109523520A (en) A kind of chromosome automatic counting method based on deep learning
CN111126472A (en) Improved target detection method based on SSD
CN109858569A (en) Multi-tag object detecting method, system, device based on target detection network
CN110991435A (en) Express waybill key information positioning method and device based on deep learning
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
CN111612002A (en) Multi-target object motion tracking method based on neural network
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN112651404A (en) Green fruit efficient segmentation method and system based on anchor-frame-free detector
CN111091101B (en) High-precision pedestrian detection method, system and device based on one-step method
CN109242826B (en) Mobile equipment end stick-shaped object root counting method and system based on target detection
CN110222215A (en) A kind of crop pest detection method based on F-SSD-IV3
CN109840559A (en) Method for screening images, device and electronic equipment
CN109492596A (en) A kind of pedestrian detection method and system based on K-means cluster and region recommendation network
CN113033516A (en) Object identification statistical method and device, electronic equipment and storage medium
US11978210B1 (en) Light regulation method, system, and apparatus for growth environment of leafy vegetables
CN111259808A (en) Detection and identification method of traffic identification based on improved SSD algorithm
CN111353440A (en) Target detection method
CN114359245A (en) Method for detecting surface defects of products in industrial scene
CN113435254A (en) Sentinel second image-based farmland deep learning extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210507