CN116091770A

CN116091770A - Grape leaf lesion image segmentation method based on cross-resolution Transformer model

Info

Publication number: CN116091770A
Application number: CN202310045185.7A
Authority: CN
Inventors: 穆维松; 张馨心; 张慧; 徐子睿; 邹彬
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2023-01-30
Filing date: 2023-01-30
Publication date: 2023-05-09

Abstract

The invention belongs to the technical field of agricultural information, and particularly relates to a grape leaf lesion segmentation method based on a cross-resolution Transformer model, which addresses grape leaf lesion segmentation against complex backgrounds in natural field environments. Whereas previous Transformer-based models considered the acquisition of either high-resolution features or semantic information alone, the cross-resolution Transformer model retains both high-resolution features and strong semantic information. The cross-resolution Transformer model has an encoder-decoder structure: the encoder consists of 4 stages, the i-th stage contains i multi-resolution Transformer blocks connected in parallel, and the blocks are arranged in a pyramid shape; the decoder is a Hamburger decoder. Each Transformer block includes a large-kernel mining attention mechanism, a multi-path feed-forward network, and a cross-resolution fusion strategy. Finally, the feature information at different resolution scales is fused by the Hamburger decoder.

Description

Grape leaf lesion image segmentation method based on cross-resolution Transformer model

Technical Field

The invention belongs to the technical field of agricultural information, and particularly relates to a grape leaf lesion image segmentation method based on a cross-resolution Transformer model.

Background

Plant leaf lesions have become a major obstacle to the development of the grape industry, directly reducing yield and quality. Monitoring disease information and taking appropriate measures at an early stage of disease can effectively limit agricultural economic losses. Automatic segmentation is an important basis for plant disease detection and identification, so automatic segmentation of grape leaf lesions helps prevent the spread of disease. However, in field images of grape leaf disease the background is complex, the small diseased areas have rich edge textures, and the symptoms of different diseases are similar, all of which seriously affect segmentation accuracy. Common segmentation approaches to this challenge are as follows:

(1) Convolutional neural network

Convolutional neural networks such as the DeepLab series, U-Net, and PSPNet are widely used in the agricultural field. These architectures mainly aim to improve segmentation performance by increasing network depth or introducing residual learning. Although these networks have been very successful in extracting lesion features, several problems limit their performance: (1) the fixed convolution kernel constrains the size of the receptive field, so segmentation of natural images with complex backgrounds is poor; (2) local connectivity means that long-range semantic interaction is neglected; (3) small diseased areas are difficult to segment precisely. To solve these problems, it is necessary to extract local features while understanding global information in depth, and to explore more scene-level semantic information across the entire natural scene.

(2) Transformer

The Transformer has further advanced the vision field and has shown better performance than convolutional-neural-network-based models in segmentation tasks. Owing to the strong representation capability of the self-attention mechanism, the Transformer can explicitly model global context information when the input is a high-resolution natural image with a complex background. Some researchers have improved plant lesion segmentation by modifying the Transformer. In addition, many works extract low-resolution lesion features by introducing convolution operators, downsampling feature maps, employing pyramid hierarchies, and redesigning tokens. However, these models still follow a serial topology in which each stage outputs features at a single resolution scale, and they neglect information interaction between different resolutions within the same stage, so they cannot generate high-quality segmentation maps. When identifying grape leaf lesions with complex backgrounds and small targets, segmentation performance drops and edge detail is lost. Empirically, high-resolution feature maps capture finer-grained information, especially at the edges of grape leaf lesions, while low-resolution feature maps typically carry stronger semantic representations, especially for small diseased areas that are difficult to segment. For this reason, maintaining both high-resolution feature maps and deep semantic information is critical for handling segmentation tasks effectively.

(3) Attention mechanism

The goal of the attention mechanism is to focus the network on the most important small part of the data by increasing the weight of some parts of the input and decreasing the weight of others. Attention mechanisms fall roughly into two branches: spatial attention and channel attention. Different types of attention serve different functions: spatial attention enhances the spatial representation of critical areas, whereas channel attention models the correlation between different channels. However, currently popular Transformer architectures ignore the importance of adaptability in the channel dimension. The challenges of grape leaf lesion image segmentation can be addressed by inheriting the advantages of both channel attention and spatial attention.

Disclosure of Invention

The invention aims to provide a model tailored to the grape leaf lesion image segmentation task, namely a cross-resolution Transformer model, to solve the problem of segmenting grape leaf lesion images with complex backgrounds in natural field environments. Whereas previous Transformer-based models considered the acquisition of either high-resolution features or semantic information alone, the cross-resolution Transformer model retains both high-resolution features and strong semantic information.

In particular, the cross-resolution Transformer model has an encoder-decoder structure. The encoder consists of 4 stages; the i-th stage contains i multi-resolution Transformer blocks connected in parallel, and the blocks are arranged in a pyramid shape. The decoder is a Hamburger decoder.
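The pyramid arrangement of parallel branches can be sketched as follows. This is a minimal illustration only: the text states the number of stages and the number of branches per stage, while the function name `encoder_layout`, the base stride of 4, and the halving of resolution from one branch to the next are assumptions made for this example.

```python
def encoder_layout(num_stages=4, base_stride=4):
    """Return, for each stage i (1-based), the strides of its i parallel branches.

    Assumption (not stated in the text): branch b (0-based) works at
    1 / (base_stride * 2**b) of the input resolution.
    """
    layout = []
    for stage in range(1, num_stages + 1):
        # Stage i holds i branches; each new branch halves the resolution.
        layout.append([base_stride * 2 ** branch for branch in range(stage)])
    return layout


for stage, strides in enumerate(encoder_layout(), start=1):
    print(f"stage {stage}: {len(strides)} parallel branch(es), strides {strides}")
```

Under these assumptions the fourth stage keeps four branches alive at once, which is what allows high-resolution and low-resolution features to be exchanged within a stage.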

Following the design concept of parallel Transformers, the input is first downsampled with a Conv-BN-ReLU block to efficiently extract a low-resolution feature map.

In particular, the network architecture of the multi-resolution Transformer block includes a large-kernel mining attention mechanism, a multi-path feed-forward network, and a cross-resolution fusion strategy.

Specifically, the convolution kernel size is set to 11×11 to enlarge the receptive field; the input data undergoes overlapping feature embedding and reshaping and is then passed into the large-kernel mining attention mechanism, and the output features of the attention mechanism are computed with the Hadamard product.

Specifically, given input data $X$, where H×W is the number of input tokens and C is the channel feature dimension, the input tokens undergo overlapping feature embedding and reshaping and are then passed into the large-kernel mining attention mechanism. The similarity-based score matrix $A$ is computed by a large-kernel depthwise convolution with a kernel size of k×k. In the present invention, the kernel size is set to k = 11 to effectively enlarge the receptive field. As in the self-attention mechanism, the embedded input is multiplied by a weight matrix to obtain the value $V$. In stage i, the score matrix is formulated as follows:

$$A = \mathrm{DWConv}_{k \times k}(W_i X)$$

where $W_i$ and the weight matrix used to obtain $V$ are derived from linear projection, the matrix $A$ represents the similarity or correlation between each pair of input tokens, and DWConv(·) denotes the depthwise separable convolution operation. The Hadamard product of $V$ and $A$ is taken as the attention output, as given by the following formula:

$$\mathrm{Attention}(X) = A \odot V$$
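The score-and-gate computation above can be sketched in plain Python. This is an illustrative sketch, not the patented implementation: the function names are hypothetical, the two linear projections are taken as given inputs rather than modelled, and the demo uses a 3×3 kernel instead of 11×11 to stay readable. As in deep-learning frameworks, the "convolution" is computed as a zero-padded cross-correlation.

```python
def depthwise_conv2d(channel, kernel):
    """Zero-padded 'same' cross-correlation of one H x W channel with one k x k kernel (k odd)."""
    height, width = len(channel), len(channel[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for dy in range(k):
                for dx in range(k):
                    yy, xx = y + dy - pad, x + dx - pad
                    if 0 <= yy < height and 0 <= xx < width:
                        total += kernel[dy][dx] * channel[yy][xx]
            out[y][x] = total
    return out


def large_kernel_attention(score_input, value, kernels):
    """Compute A = DWConv(score_input) channel by channel, then return A ⊙ V.

    score_input and value stand for the two linear projections of X
    (assumed precomputed); each is a list of C channels of shape H x W.
    kernels holds one k x k kernel per channel (depthwise).
    """
    output = []
    for channel, v_channel, kernel in zip(score_input, value, kernels):
        scores = depthwise_conv2d(channel, kernel)
        # Hadamard (element-wise) product of the score map and the value map.
        output.append([[a * v for a, v in zip(a_row, v_row)]
                       for a_row, v_row in zip(scores, v_channel)])
    return output


# Tiny demo: one 3 x 3 channel of ones, an all-ones 3 x 3 kernel, values of 2.
ones = [[1.0] * 3 for _ in range(3)]
twos = [[2.0] * 3 for _ in range(3)]
for row in large_kernel_attention([ones], [twos], [ones])[0]:
    print(row)
```

Because each channel has its own kernel and the gating is element-wise, no token-by-token similarity matrix is ever materialized.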

Compared with the self-attention mechanism, whose computation grows quadratically, the large-kernel convolution is fully convolutional, so its complexity and parameter count grow only linearly. The large-kernel mining attention mechanism enables effective information interaction between channels. When a high-resolution image, such as a grape leaf disease image, is used as input, the weight matrix produced by the large-kernel mining attention mechanism, which encodes spatial and channel information, adapts to the input.
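The difference in growth rate can be made concrete with a rough operation count. The sketch below is a back-of-the-envelope estimate under stated assumptions: it counts only multiply-accumulates for the token-mixing step, ignores the linear projections, and uses a hypothetical channel count of 64.

```python
def self_attention_macs(num_tokens, channels):
    """Multiply-accumulates for Q.K^T plus attention.V: about 2 * N^2 * C."""
    return 2 * num_tokens * num_tokens * channels


def depthwise_conv_macs(num_tokens, channels, kernel_size=11):
    """Multiply-accumulates for a k x k depthwise convolution: N * C * k^2."""
    return num_tokens * channels * kernel_size * kernel_size


for side in (32, 64, 128):
    tokens = side * side  # one token per spatial position
    print(side, self_attention_macs(tokens, 64), depthwise_conv_macs(tokens, 64))
```

Doubling the side length of the feature map quadruples the number of tokens, which multiplies the self-attention count by 16 but the depthwise-convolution count only by 4.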

In particular, the multi-path feed-forward network takes into account the importance of multi-level semantic mining. It is implemented by directly applying two branches of paired k×1 and 1×k convolutions (k = 3, 5) with an expansion ratio r (r = 4), and can be expressed as:

$$x_3 = \mathrm{Conv}_{3 \times 1}(\mathrm{Conv}_{1 \times 3}(\mathrm{Linear}(x_{in})))$$

$$x_5 = \mathrm{Conv}_{5 \times 1}(\mathrm{Conv}_{1 \times 5}(\mathrm{Linear}(x_{in})))$$

$$x_{out} = \mathrm{Cat}(x_3, x_5) + x_{in}$$

where layer normalization (LN) and the activation layer (GELU) are omitted, $x_{in}$ denotes the features output by the large-kernel mining attention mechanism, and $x_{out}$ denotes the features output by the multi-path feed-forward network. This design further captures receptive fields of different scales and improves the capacity for multi-scale information aggregation.
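To see why the two branches cover receptive fields of different scales, the following sketch applies a 1×k kernel followed by a k×1 kernel to a single impulse and measures the area it reaches. It is only an illustration: the helper names are hypothetical, the kernels are all ones rather than learned weights, and the linear layer, concatenation, and residual connection of the formulas are not modelled.

```python
def conv2d_same(image, kernel):
    """Zero-padded 'same' cross-correlation with a kh x kw kernel (odd sizes)."""
    height, width = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy in range(kh):
                for dx in range(kw):
                    yy, xx = y + dy - kh // 2, x + dx - kw // 2
                    if 0 <= yy < height and 0 <= xx < width:
                        out[y][x] += kernel[dy][dx] * image[yy][xx]
    return out


def strip_pair_footprint(k, size=9):
    """Apply a 1 x k then a k x 1 all-ones kernel to a centred impulse.

    Returns (rows, cols): how many rows and columns the response touches.
    """
    impulse = [[0.0] * size for _ in range(size)]
    impulse[size // 2][size // 2] = 1.0
    row_kernel = [[1.0] * k]                 # shape 1 x k
    col_kernel = [[1.0] for _ in range(k)]   # shape k x 1
    response = conv2d_same(conv2d_same(impulse, row_kernel), col_kernel)
    rows = [y for y in range(size) if any(response[y])]
    cols = [x for x in range(size) if any(response[y][x] for y in range(size))]
    return len(rows), len(cols)


for k in (3, 5):
    print(f"k={k}: footprint {strip_pair_footprint(k)}")
```

Each strip pair reaches a k×k neighbourhood while using 2k weights per channel instead of k², so the k = 3 and k = 5 branches contribute two different spatial scales to the concatenated output.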

Experiments show that exchanging information between different resolution scales helps generate high-quality, high-resolution images; the invention therefore adopts a cross-resolution fusion strategy to transfer semantic information between two consecutive stages, realizing information interaction between adjacent stages.

Specifically, in the cross-resolution fusion strategy, a static two-dimensional matrix is constructed with a double loop, and semantic information is fused by upsampling or downsampling. The semantic information of the low-resolution feature map branch is upsampled to the high-resolution feature map branch to extract semantic features with a larger receptive field, and the high-resolution feature map is downsampled to the low-resolution feature map to retain more image detail, thereby achieving accurate segmentation of small diseased targets against complex backgrounds.

More specifically, let j be the feature resolution index of the input branch and n the feature resolution index of the output branch. To obtain high-level features with larger receptive fields, low-resolution features are upsampled and merged into the high-resolution features: when j > n, a 1×1 convolution keeps the number of channels of layers j and n the same, while the spatial dimension is upsampled by nearest-neighbor interpolation. To let the low-resolution features keep more image detail, they are combined with downsampled high-resolution features: when j < n, a depthwise separable convolution with a stride of 2 (j-n) +1 reduces the high-resolution spatial dimension and matches the number of output channels. When j = n, a skip connection outputs the features directly. The cross-resolution semantic fusion strategy inherits the advantages of both high-resolution representations and the stronger semantic information of low resolutions, which helps achieve accurate segmentation of small lesions against complex backgrounds.
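The three cases (j > n, j < n, j = n) can be laid out as a static two-dimensional matrix built with a nested loop. This is a sketch under stated assumptions: the function name is hypothetical, branch 0 is taken to be the highest resolution with each further branch at half the resolution of the previous one, and the scale factor 2^|j-n| is chosen for illustration rather than taken from the stride expression in the text.

```python
def build_fusion_matrix(num_branches):
    """Return ops[j][n]: how features of input branch j reach output branch n.

    Assumption: branch 0 has the highest resolution and each further branch
    halves it, so a larger index means a lower resolution.
    """
    ops = []
    for j in range(num_branches):        # input branch
        row = []
        for n in range(num_branches):    # output branch
            if j > n:
                # Low-resolution input feeds a high-resolution output:
                # 1x1 convolution to match channels, then upsampling.
                row.append(("upsample", 2 ** (j - n)))
            elif j < n:
                # High-resolution input feeds a low-resolution output:
                # strided depthwise separable convolution.
                row.append(("downsample", 2 ** (n - j)))
            else:
                # Same resolution: skip connection.
                row.append(("identity", 1))
        ops.append(row)
    return ops


for row in build_fusion_matrix(3):
    print(row)
```

Because the matrix depends only on the number of branches, it can be built once per stage and reused for every input.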

In particular, the Hamburger decoder models global spatial information with a matrix decomposition method and aggregates the context information of the last three stages to fuse feature information at different resolution scales. Because the first stage contains more low-level features and aggregating it would increase the computational cost, only the context information of the last three stages is aggregated, which combines information from both low-resolution and high-resolution features.
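As an illustration of what a matrix decomposition step looks like, the sketch below factors a non-negative matrix into two low-rank factors. The text does not say which decomposition the decoder uses; non-negative matrix factorization (NMF) with multiplicative updates is assumed here purely for illustration, and all function names are hypothetical.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(x, d, c):
    """Squared Frobenius norm of X - D @ C."""
    approx = matmul(d, c)
    return sum((xv - av) ** 2 for xr, ar in zip(x, approx) for xv, av in zip(xr, ar))


def nmf(x, rank, steps=50, eps=1e-9, seed=0):
    """Factor non-negative X (m x n) as D (m x rank) @ C (rank x n).

    Uses multiplicative updates, which keep both factors non-negative.
    """
    rng = random.Random(seed)
    m, n = len(x), len(x[0])
    d = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    c = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(steps):
        dt = transpose(d)
        num, den = matmul(dt, x), matmul(matmul(dt, d), c)
        c = [[c[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(rank)]
        ct = transpose(c)
        num, den = matmul(x, ct), matmul(d, matmul(c, ct))
        d = [[d[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(m)]
    return d, c


# A rank-1 matrix is recovered almost exactly by a rank-1 factorization.
features = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
bases, codes = nmf(features, rank=1, steps=20)
print(frobenius_error(features, bases, codes))
```

In a decoder, the rows of the input matrix would be flattened feature vectors, and the low-rank reconstruction `D @ C` would act as a globally smoothed version of the features.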

The advantage of the invention is that, whereas previous Transformer-based models considered the acquisition of either high-resolution features or semantic information alone, the cross-resolution Transformer model retains high-resolution features and strong semantic information at the same time. A novel cross-resolution architecture is provided, in which cross-resolution information is transmitted in parallel and the advantages of cross-resolution are used to improve representation learning and extract robust semantic information. A large-kernel mining attention mechanism is introduced, in which a large-kernel convolution reshapes the pixel weight matrix, adapts to channel and spatial information without increasing the computational cost, and mines context information from the whole scene. A multi-path feed-forward network and a Hamburger decoder are designed to further expand the multi-scale receptive field and enhance multi-scale information aggregation. The invention can effectively solve the problem of grape leaf lesion segmentation against complex backgrounds in natural fields.

Drawings

FIG. 1 is a diagram of the overall architecture of the cross-resolution Transformer model;

FIG. 2 is a diagram of the Transformer block framework;

FIG. 3 is a schematic diagram of the Hamburger decoder;

FIG. 4 is a schematic diagram of a cross-resolution fusion strategy.

Detailed Description

The overall framework of the grape leaf lesion image segmentation method based on the cross-resolution Transformer model is shown in FIG. 1; the model is an encoder-decoder model. FIG. 2 shows the Transformer block framework of the invention, FIG. 3 is a schematic diagram of the decoder of the invention, i.e., the Hamburger decoder, and FIG. 4 is a schematic diagram of the cross-resolution fusion strategy employed by the model. In the training stage, the experiments of the invention and the comparison experiments were implemented with PyTorch and the MMSegmentation library for semantic segmentation. All models were trained on an NVIDIA Tesla V100 GPU. For a fair comparison, the invention follows the same training strategy as previous work. Specifically, training images are cropped to a resolution of 1024×1024. The model was optimized during training using AdamW with a weight decay of 0.01. Training used the "poly" learning-rate strategy, lr = baselr × (1 - epoch/maxiter)^power, where the power is set to 1, the initial learning rate is 6×10^-5, and training runs for a total of 160,000 iterations.
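The "poly" schedule with the stated hyperparameters can be sketched as a small function. The function and argument names are hypothetical, and the sketch assumes the schedule is evaluated once per training iteration.

```python
def poly_lr(step, base_lr=6e-5, max_steps=160_000, power=1.0):
    """Polynomial learning-rate decay: base_lr * (1 - step / max_steps) ** power."""
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1 - step / max_steps) ** power


for step in (0, 40_000, 80_000, 160_000):
    print(step, poly_lr(step))
```

With the power set to 1 the schedule is a straight line from the initial learning rate down to zero at the final iteration.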

The invention is trained and evaluated on three datasets: the Field-PV dataset, the Plant Village dataset, and the synthetic Syn-PV dataset. The Field-PV dataset was collected with an OLYMPUS OM-D camera by the Institute of Forestry and Pomology, Beijing Academy of Agriculture and Forestry Sciences, China, and contains a total of 400 original images of grape gray mold disease in natural scenes. The Plant Village dataset is a public dataset dedicated to the identification of crop diseases and pests; it consists of 54303 high-resolution images of diseased and healthy leaves in 38 categories, acquired in a controlled laboratory, of which 1383 grape black measles images and 1180 grape black rot images are used and manually annotated. The Syn-PV dataset consists of natural field images synthesized by background replacement from the segmented Plant Village images obtained in the controlled laboratory, giving grape disease images with complex backgrounds.

In all datasets, the diseased areas and leaf areas are manually annotated with the labelme tool; the annotations are saved in JSON format and converted into the PASCAL VOC 2012 data format with foreground and background semantic labels. The invention uses the Augmentor module to perform transformations such as random left/right flipping, random cropping, random sampling, and increasing or decreasing color and brightness. During training, the data augmentation methods provided by the semantic segmentation library are also applied.

To evaluate the effectiveness of the cross-resolution Transformer (Cross-Resolution Transformer, CRFormer) model, the invention compares the model with other image segmentation methods. Four metrics, Precision, IoU, Recall, and Dice, are used to measure model performance, with bold indicating the best result and underlining indicating the second-best result. The number of parameters (Params) and the giga floating-point operations (GFLOPs) of each model were also analyzed. The results are shown in Tables 1 to 5.
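The four metrics can be computed from pixel-level counts of true positives, false positives, and false negatives for one class. The sketch below uses the standard definitions; the function name and the flat 0/1 label representation are assumptions for this example, and empty denominators are mapped to 0.0 by convention.

```python
def segmentation_metrics(pred, target):
    """Precision, Recall, IoU and Dice for one class from flat 0/1 label sequences."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)

    def ratio(numerator, denominator):
        # Convention for this sketch: an empty denominator yields 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
    }


predicted = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
print(segmentation_metrics(predicted, reference))
```

In this toy example there are two true positives, one false positive, and one false negative, so IoU is 0.5 while Precision, Recall, and Dice are each 2/3.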

Table 1. Quantitative comparison of CRFormer with other segmentation methods for black measles and black rot segmentation on the Plant Village dataset

Table 2. Quantitative comparison of CRFormer with other segmentation methods for background and grape leaf segmentation on the Plant Village dataset

Table 3. Quantitative comparison of CRFormer with other segmentation methods for black measles and black rot segmentation on the Syn-PV dataset

Table 4. Quantitative comparison of CRFormer with other segmentation methods for background and grape leaf segmentation on the Syn-PV dataset

Table 5. Quantitative comparison of CRFormer with other segmentation methods for background and gray mold segmentation on the Field-PV dataset

Experimental results show that CRFormer achieves better segmentation performance on grape leaf lesions than current state-of-the-art Transformer-based methods and other deep-learning-based methods. Considering image segmentation performance together with training and running costs, the invention performs best in complex grape leaf lesion segmentation tasks.

Claims

1. A grape leaf lesion image segmentation method based on a cross-resolution Transformer model, characterized in that the cross-resolution Transformer model has an encoder-decoder structure; the encoder consists of 4 stages, the i-th stage contains i multi-resolution Transformer blocks connected in parallel, and the blocks are arranged in a pyramid shape; the decoder is a Hamburger decoder.

2. The multi-resolution Transformer block of claim 1, wherein the network architecture comprises a large-kernel mining attention mechanism, a multi-path feed-forward network, and a cross-resolution fusion strategy.

3. The large-kernel mining attention mechanism of claim 2, wherein the convolution kernel size is set to 11×11 to enlarge the receptive field, the input data undergoes overlapping feature embedding and reshaping and is then passed into the large-kernel mining attention mechanism, and the output features of the attention mechanism are computed with the Hadamard product.

4. The multi-path feed-forward network according to claim 2, wherein the multi-path feed-forward network takes into account the importance of multi-level semantic mining and is implemented using two branches of paired k×1 and 1×k convolutions and an expansion ratio r, wherein k is set to 3 and 5 and r is set to 4, and the specific formulas are:

$$x_3 = \mathrm{Conv}_{3 \times 1}(\mathrm{Conv}_{1 \times 3}(\mathrm{Linear}(x_{in})))$$

$$x_5 = \mathrm{Conv}_{5 \times 1}(\mathrm{Conv}_{1 \times 5}(\mathrm{Linear}(x_{in})))$$

$$x_{out} = \mathrm{Cat}(x_3, x_5) + x_{in}$$

wherein layer normalization (LN) and the activation layer (GELU) are omitted, $x_{in}$ denotes the features output by the large-kernel mining attention mechanism, and $x_{out}$ denotes the features output by the multi-path feed-forward network.

5. The cross-resolution fusion strategy of claim 2, wherein a static two-dimensional matrix is constructed with a double loop and semantic information is fused by upsampling or downsampling; the semantic information of the low-resolution feature map branch is upsampled to the high-resolution feature map branch to extract semantic features with a larger receptive field, and the high-resolution feature map is downsampled to the low-resolution feature map to retain more image detail, thereby achieving accurate segmentation of small diseased targets against complex backgrounds.

6. The Hamburger decoder of claim 1, wherein the decoder models global spatial information using a matrix decomposition method and aggregates the context information of the last three stages to fuse feature information at different resolution scales.