CN117611924A - Plant leaf phenotype disease classification method based on image-text subspace joint learning - Google Patents

Plant leaf phenotype disease classification method based on image-text subspace joint learning

Info

Publication number
CN117611924A
Authority
CN
China
Prior art keywords
text
subspace
learning
mode
plant leaf
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410067314.7A
Other languages
Chinese (zh)
Other versions
CN117611924B (en)
Inventor
王崎
张家伟
吴雪
王亚洲
高珍冉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University filed Critical Guizhou University
Priority to CN202410067314.7A
Publication of CN117611924A
Application granted
Publication of CN117611924B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/098: Distributed learning, e.g. federated learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a plant leaf phenotype disease classification method based on image-text subspace joint learning, which comprises the following steps: first, the image and text modality data are projected into a subspace shared across the modalities to learn the commonality of the two modalities; each modality is then projected into its own modality-specific subspace, and the corresponding feature representations are obtained. These feature representations provide an overall view of the multimodal data for feature fusion, enhancing the final classification. With the invention, which disease category of which plant a sample belongs to can be predicted more accurately from the given text and image; the plant leaf disease classification performance is good and a higher accuracy is obtained.

Description

Plant leaf phenotype disease classification method based on image-text subspace joint learning
Technical Field
The invention belongs to the technical field of intelligent processing of digital images, and particularly relates to a plant leaf phenotype disease classification method based on image-text subspace joint learning.
Background
In the traditional plant leaf phenotype disease classification task, diseases on leaves are mainly identified manually. This approach has serious limitations: the manifestations of different diseases on leaves can differ greatly, so accurate classification requires a domain expert, and the cost in time and effort is too high for the method to be applied successfully to large-scale classification tasks. In recent years, with the rapid development of natural language processing and computer vision, more and more researchers have begun to explore plant leaf phenotype disease classification methods that use text information to assist images. Disease-related keywords extracted from the text assist the classification of the images, improving classification accuracy and stability; the approach performs well even on tasks with demanding fine-grained requirements. In the plant leaf disease classification task, text-guided image classification can help the classifier identify multiple plants and multiple disease types more accurately. For example, if a leaf carries yellow-brown spots that are not clearly visible in the image, text guidance makes them easier for the classifier to identify. Moreover, during disease identification, text guidance helps the classifier understand the characteristics and manifestations of each disease more comprehensively, improving classification accuracy and robustness. For example, Chinese patent publication No. CN 115050014A, published on 13 September 2022, discloses a small-sample tomato disease recognition system and method based on image-text learning, comprising an image classification module, a text classification module and a joint classification module: the image classification module obtains a first prediction probability of the tomato disease type from tomato images; the text classification module obtains a second prediction probability of the tomato disease type from tomato text information; and the joint classification module combines the two prediction probabilities to output the disease category. Although a text modality assists image classification there, that recognition method targets disease classification of a single plant only, covers few classes, lacks generality, and ignores the correlations and distinctions between different modalities, so detailed features in some complex scenes are not used effectively.
In summary, for current multi-plant, multi-disease classification, neither the manual method nor single-modality image classification achieves high accuracy. The technique of using text to guide image classification therefore has broad application prospects in plant leaf phenotype disease classification and provides a more effective technical means for future health management in agriculture.
Disclosure of Invention
The invention aims to overcome the above defects and to provide a plant leaf phenotype disease classification method based on image-text subspace joint learning that has good plant leaf disease classification performance and higher accuracy.
The invention discloses a plant leaf phenotype disease classification method based on image-text subspace joint learning, which comprises the following steps:
Step 1, data preprocessing:
traversing the catalog of the data set, arranging the relative path and label information of each group of samples into one row and writing the rows into a csv file; reading back the csv file with the written information, sampling it, and dividing it into a test set and a training set at a ratio of 2:8, thereby obtaining the test.csv and train.csv files;
Step 2, constructing a network model:
in the feature extraction stage, for data of the text modality, performing self-attention encoding on the input with a BERT model from deep learning to learn contextual relations; after obtaining the BERT output, using the mask in the input to average the output along dimension 1, obtaining the utterance (corpus-level) representation u_t of the text input modality; for the visual modality information in the model, adopting a ViT model from deep learning to obtain the utterance-level representation u_v of the visual modality; the image size is set to 128, the size of each patch to 16, the number of categories to classify to 200, the dimension of each representation to 1024, the model depth to 6, the number of attention heads to 16, and the multi-layer perceptron dimension to 2048;
in the modality representation learning stage, using a framework to project the text and image modalities into two different subspaces, one of which is modality-invariant and the other modality-specific, by means of three self-encoders E_t, E_v and E_s: passing the text and visual representations u_t and u_v obtained in the feature extraction stage through the self-encoders E_t and E_v to obtain the text modality-specific representation h_t^p and the visual modality-specific representation h_v^p, and inputting the text information u_t and the visual information u_v into the shared self-encoder E_s to obtain the text modality-invariant representation h_t^c and the visual modality-invariant representation h_v^c; u_t and u_v are thereby projected into the two different subspaces to obtain modality-specific and modality-shared information, giving four vectors h_t^p, h_v^p, h_t^c and h_v^c that represent the specific and invariant representations of the text and visual modalities respectively; performing a series (stacking) operation on them along dimension dim=0 to obtain a new matrix M, and transforming this matrix M with a multi-head attention mechanism to obtain a new matrix h, the parameters of the multi-head attention mechanism being: input dimension 128, heads=2; finally, applying a fusion operation to the output h, the fusion network comprising the following layers: first, a fully connected layer with input dimension 512 and output dimension 384; then a dropout layer with drop_rate 0.5; then a ReLU activation layer; and finally a fully connected layer with input dimension 384 and output dimension 200, thereby obtaining the final classification result;
Step 3, model training and loss calculation:
inputting the train.csv file of the plant leaf disease data set obtained in step 1 into the network model constructed in step 2 and training the network model, wherein the optimizer is Adam and the loss function comprises: a similarity loss L_sim, a difference loss L_diff, and a cross-entropy loss L_task used for classification; the similarity loss L_sim is computed with the Central Moment Discrepancy (CMD): CMD measures the difference between two distributions by matching their order-wise moment differences, and its value becomes smaller as the similarity of the two distributions increases; let X and Y be bounded random samples drawn with respective probability distributions p and q on the interval [a, b]; then

CMD_K(X, Y) = (1/|b - a|) ||E(X) - E(Y)||_2 + Σ_{k=2}^{K} (1/|b - a|^k) ||C_k(X) - C_k(Y)||_2,

where E(X) = (1/|X|) Σ_{x∈X} x is the empirical expectation of the sample X and C_k(X) = E((x - E(X))^k) is the vector of its k-th order central moments, and the similarity loss is taken as L_sim = CMD_K(h_t^c, h_v^c); for the difference loss, during training, letting H_c^m and H_p^m denote the matrices whose rows are the hidden invariant and specific vectors of modality m ∈ {t, v}, the non-correlation constraint on the modality vectors is ||(H_c^m)^T H_p^m||_F^2, where ||·||_F denotes the Frobenius norm, and the loss is defined as

L_diff = Σ_{m ∈ {t,v}} ||(H_c^m)^T H_p^m||_F^2;

the cross-entropy loss L_task is used as the prediction loss function for the downstream task, with the formula:

L_task = -(1/N) Σ_{i=1}^{N} y_i · log(ŷ_i),

where N is the number of samples, y_i is the ground-truth label of sample i and ŷ_i is its predicted class distribution;
the total loss function is:

L = L_task + α·L_sim + β·L_diff,

where α and β are the weights of the similarity and difference regularization terms;
and step 4, training the network to obtain the best accuracy of 0.999257.
In the above plant leaf phenotype disease classification method based on image-text subspace joint learning, the feature extraction stage described in step 2 further applies dropout and emb_dropout of 0.1.
In the above plant leaf phenotype disease classification method based on image-text subspace joint learning, the three self-encoders E_t, E_v and E_s in step 2 are structurally identical, each comprising a fully connected layer with 128 inputs and outputs and a sigmoid activation layer; E_t processes the text information and E_v processes the visual information.
In the above plant leaf phenotype disease classification method based on image-text subspace joint learning, the method of projecting into the modality-invariant subspace in step 2 is: the corpus vector is processed through a fully connected layer with 128 inputs and outputs and a sigmoid activation layer.
In the above plant leaf phenotype disease classification method based on image-text subspace joint learning, the method of projecting into the modality-specific subspace in step 2 is: each corpus vector is first encoded by a fully connected layer with 128 output neurons, and the encoded result is then mapped to the range (0, 1) by a sigmoid activation function.
Compared with the prior art, the invention has clear beneficial effects, as can be seen from the technical solution: the invention uses a plant leaf disease data set photographed in real field conditions, with rich plant types and comprehensive disease types. The adopted framework projects each modality into two different subspaces. One subspace is modality-invariant; the cross-modal representations in it learn the commonality of the modalities and reduce the modality gap. The other subspace is modality-specific; it is private to each modality and captures its features. By training a network model with this framework, a model with good plant leaf disease classification performance is obtained, so that which disease category of which plant a sample belongs to can be predicted more accurately from the given text and image, with higher accuracy. Training the model on an image-text data set of real plant leaf diseases makes the invention more practical for agricultural applications and also supports research on agricultural diseases.
Drawings
FIG. 1 is a schematic diagram of a plant leaf phenotype disease classification network according to the present invention;
FIG. 2 is a graph showing the results of classifying plant leaf phenotype diseases according to the present invention.
Detailed Description
The following is a detailed description of the specific embodiments, structures, features and effects of the plant leaf phenotype disease classification method based on image-text subspace joint learning according to the present invention, with reference to FIG. 1 and FIG. 2.
The invention discloses a plant leaf phenotype disease classification method based on image-text subspace joint learning, which comprises the following steps:
Step 1, data preprocessing:
traversing the catalog of the data set, arranging the relative path and label information of each group of samples into one row and writing the rows into a csv file; reading back the csv file with the written information, sampling it, and dividing it at a ratio of 2:8 into a test set (used to evaluate model performance) and a training set (used to optimize the model parameters), thereby obtaining the test.csv and train.csv files.
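The following is a minimal preprocessing sketch, assuming a dataset/<label>/<sample> directory layout and the pandas and scikit-learn libraries; the layout, column names and random seed are illustrative assumptions rather than details fixed by the disclosure.

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Walk the dataset directory and arrange each sample's relative path and
# label into one row, as described in step 1.
rows = []
for label in sorted(os.listdir("dataset")):
    class_dir = os.path.join("dataset", label)
    if not os.path.isdir(class_dir):
        continue
    for name in sorted(os.listdir(class_dir)):
        rows.append({"path": os.path.join(label, name), "label": label})

df = pd.DataFrame(rows)
df.to_csv("all.csv", index=False)

# Read the written file back, sample it, and split test:train at 2:8.
df = pd.read_csv("all.csv")
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
```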
Step 2, constructing a network model:
in the feature extraction stage, for text-modality data, a BERT model from deep learning performs self-attention encoding on the input and, by learning contextual relations, obtains a rich semantic representation. After the BERT output is obtained, the mask in the input is used to average the output along dimension 1, giving the utterance (corpus-level) representation u_t of the text input modality, which captures the characteristics and semantic information of the text data well. For the visual-modality information in the model, a ViT model from deep learning is adopted; using the ViT model, a representation of the image can be generated, giving the utterance-level representation u_v of the visual modality. To control the capacity of the ViT model, the relevant parameters are set as follows: the image size is 128, the size of each patch is 16, the number of categories to classify is 200, the dimension of each representation is 1024, the depth of the model is 6, the number of attention heads is 16, and the dimension of the multi-layer perceptron is 2048; dropout and emb_dropout of 0.1 are also applied to improve the generalization ability of the model. A minimal sketch of this stage is given below.
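This sketch assumes the HuggingFace transformers package for BERT and the vit-pytorch package for ViT; the bert-base-uncased checkpoint and the helper name text_utterance are assumptions, while the ViT parameters are those stated above.

```python
from transformers import BertModel, BertTokenizer
from vit_pytorch import ViT

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-uncased")

def text_utterance(texts):
    """Masked mean over the token dimension (dim=1) yields the
    utterance-level text representation u_t."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    out = bert(**enc).last_hidden_state                  # (B, L, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float()   # (B, L, 1)
    return (out * mask).sum(dim=1) / mask.sum(dim=1)     # (B, 768)

# ViT configured with the parameters stated above; the utterance-level image
# representation u_v would be taken from the embedding preceding the
# classification head rather than from the 200-way logits.
vit = ViT(image_size=128, patch_size=16, num_classes=200, dim=1024,
          depth=6, heads=16, mlp_dim=2048, dropout=0.1, emb_dropout=0.1)
```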
In the modality representation learning stage, following the modality representation learning in FIG. 1, a framework is used to project the text and image modalities into two different subspaces. One subspace is modality-invariant: it aims to learn the commonality of the two modalities across modalities and reduce the differences between them. The other subspace is modality-specific and captures the features of each modality. Three self-encoders E_t, E_v and E_s are used, where E_t, E_v and E_s are structurally identical, each comprising a fully connected layer with 128 inputs and outputs and a sigmoid activation layer; E_t processes the text information, E_v processes the visual information, and E_s is a shared self-encoder that processes text and visual information simultaneously. The representations u_t and u_v of the text and visual modalities obtained in the feature extraction stage are passed through E_t and E_v to obtain the text modality-specific representation h_t^p and the visual modality-specific representation h_v^p. The text information u_t and the visual information u_v are also input to the shared self-encoder E_s to obtain the text modality-invariant representation h_t^c and the visual modality-invariant representation h_v^c (learning the shared representations in a common subspace with a distribution-similarity constraint helps minimize the heterogeneity gap, a desirable property for multimodal fusion). Thus u_t and u_v are projected into the two different subspaces to obtain modality-specific and modality-shared information. The method of projecting into the modality-specific subspace is: each corpus vector is first encoded by a fully connected layer with 128 output neurons, and the encoded result is mapped to the range (0, 1) by a sigmoid activation function; this operation effectively models and represents the modality-specific information. The method of projecting into the modality-invariant subspace (which processes text and visual information simultaneously) is: the corpus vector is processed through a fully connected layer with 128 inputs and outputs and a sigmoid activation layer; this operation helps minimize the heterogeneity gap between modalities so as to better capture their common features. Through this modality representation learning process, content can be described in more detail while the semantics are retained: the information of the text and image modalities is projected into different subspaces, the specificity and commonality of the modalities are captured simultaneously, and the differences between modalities are minimized, achieving better modality representation learning. The preceding steps yield four vectors h_t^p, h_v^p, h_t^c and h_v^c (each corpus vector is thus projected onto two different representations, one modality-invariant and one modality-specific, whose coexistence makes it possible to fuse the desired features effectively), representing the specific and invariant representations of the text and visual modalities respectively. To integrate them further, a series (stacking) operation is performed on them along dimension dim=0 to obtain a new matrix M, and this matrix M is transformed with a multi-head attention mechanism to obtain a new matrix h (in doing so, each representation can induce from the other representations the latent information that contributes synergistically to the overall task). A minimal sketch of the subspace projection is given below.
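The sketch assumes the utterance vectors have already been reduced to 128 dimensions (the disclosure fixes every encoder input and output at 128 but does not name the preceding projection); the names SubspaceEncoder and project_and_attend are hypothetical.

```python
import torch
import torch.nn as nn

class SubspaceEncoder(nn.Module):
    """One of the three structurally identical self-encoders: a fully
    connected 128 -> 128 layer followed by a sigmoid activation."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)

E_t, E_v, E_s = SubspaceEncoder(), SubspaceEncoder(), SubspaceEncoder()
attn = nn.MultiheadAttention(embed_dim=128, num_heads=2)

def project_and_attend(u_t, u_v):
    # Modality-specific representations from the private encoders.
    h_t_p, h_v_p = E_t(u_t), E_v(u_v)
    # Modality-invariant representations from the shared encoder.
    h_t_c, h_v_c = E_s(u_t), E_s(u_v)
    # Series operation along dim=0: stack the four vectors into matrix M.
    M = torch.stack([h_t_p, h_v_p, h_t_c, h_v_c], dim=0)  # (4, B, 128)
    h, _ = attn(M, M, M)                                  # transformed matrix h
    return h, (h_t_p, h_v_p, h_t_c, h_v_c)
```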
The multi-head attention mechanism fully captures the relations between the different attention heads and extracts a richer feature representation. Its parameters are: input dimension 128, heads=2. Finally, a fusion operation is applied once to the output h. The fusion network comprises the following layers: first, a fully connected layer with input dimension 512 and output dimension 384; then a dropout layer with drop_rate 0.5 to reduce the risk of overfitting; then a ReLU activation layer to introduce non-linearity; and finally a fully connected layer with input dimension 384 and output dimension 200. The role of the fusion network is to further fuse and map the features obtained above to produce the final classification result, as sketched below.
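The fusion head follows the enumerated layers exactly; flattening the four attended 128-dimensional representations into one 512-dimensional vector is inferred from the stated dimensions rather than spelled out in the disclosure.

```python
import torch.nn as nn

# Fusion network exactly as enumerated: 512 -> 384, dropout 0.5, ReLU, 384 -> 200.
fusion = nn.Sequential(
    nn.Linear(512, 384),
    nn.Dropout(p=0.5),
    nn.ReLU(),
    nn.Linear(384, 200),
)

# Assumed flattening of h from (4, B, 128) to (B, 512) before the head:
# logits = fusion(h.permute(1, 0, 2).reshape(-1, 4 * 128))
```

The 200-way output matches the number of disease categories fixed in the feature extraction stage.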
Step 3, model training and loss calculation:
inputting the train.csv file of the plant leaf disease data set obtained in step 1 into the network model constructed in step 2 and training the network model, wherein the optimizer is Adam and the loss function comprises: a similarity loss L_sim, a difference loss L_diff, and a cross-entropy loss L_task used for classification. The similarity loss is introduced to reduce the discrepancy between the representations of each modality and helps unify the common cross-modal features in the shared subspace. L_sim is computed with the Central Moment Discrepancy (CMD): CMD measures the difference between two distributions by matching their order-wise moment differences, and its value becomes smaller as the similarity of the two distributions increases. Let X and Y be bounded random samples drawn with respective probability distributions p and q on the interval [a, b]; then

CMD_K(X, Y) = (1/|b - a|) ||E(X) - E(Y)||_2 + Σ_{k=2}^{K} (1/|b - a|^k) ||C_k(X) - C_k(Y)||_2,

where E(X) = (1/|X|) Σ_{x∈X} x is the empirical expectation of the sample X and C_k(X) = E((x - E(X))^k) is the vector of its k-th order central moments; the similarity loss is taken as L_sim = CMD_K(h_t^c, h_v^c). The difference loss L_diff ensures that the modality-invariant and modality-specific characterizations capture the input features from different angles. During training, let H_c^m and H_p^m denote the matrices whose rows are the hidden invariant and specific vectors of modality m ∈ {t, v}; the non-correlation constraint on the modality vectors is ||(H_c^m)^T H_p^m||_F^2, where ||·||_F denotes the Frobenius norm, and the loss is defined as

L_diff = Σ_{m ∈ {t,v}} ||(H_c^m)^T H_p^m||_F^2.

The cross-entropy loss L_task is used as the prediction loss function for the downstream task, with the formula:

L_task = -(1/N) Σ_{i=1}^{N} y_i · log(ŷ_i),

where N is the number of samples, y_i is the ground-truth label of sample i and ŷ_i is its predicted class distribution. A minimal sketch of the two regularizers is given below.
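Because the encoder outputs pass through a sigmoid, the samples are bounded in [0, 1] and the |b - a| scaling factors equal 1; the truncation order K = 5 is an assumed choice, not one stated in the disclosure.

```python
import torch

def cmd_loss(x, y, k_max=5):
    """Central Moment Discrepancy between two batches of representations.
    x, y: (B, D) tensors assumed bounded in [0, 1], so |b - a| = 1."""
    mx, my = x.mean(dim=0), y.mean(dim=0)
    loss = torch.norm(mx - my, p=2)            # first-moment (mean) term
    cx, cy = x - mx, y - my
    for k in range(2, k_max + 1):              # k-th order central moments
        loss = loss + torch.norm(cx.pow(k).mean(dim=0) - cy.pow(k).mean(dim=0), p=2)
    return loss

def diff_loss(h_c, h_p):
    """Squared Frobenius norm of H_c^T H_p, pushing the invariant and
    specific representations of one modality towards orthogonality."""
    return torch.norm(h_c.t() @ h_p, p="fro").pow(2)
```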
The total loss function is:

L = L_task + α·L_sim + β·L_diff,

where α and β are the weights of the similarity and difference regularization terms; a hedged sketch of one training step follows.
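Here model, the weights alpha and beta, and the learning rate are hypothetical, since the disclosure fixes only the Adam optimizer and the three loss terms; the sketch reuses cmd_loss and diff_loss from the previous block.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                          # L_task
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an assumption

alpha, beta = 1.0, 0.1  # hypothetical weights; the disclosure gives no values

def train_step(texts, images, labels):
    # model is assumed to return the logits and the four subspace vectors
    # produced by project_and_attend in the earlier sketch.
    logits, (h_t_p, h_v_p, h_t_c, h_v_c) = model(texts, images)
    l_task = criterion(logits, labels)
    l_sim = cmd_loss(h_t_c, h_v_c)
    l_diff = diff_loss(h_t_c, h_t_p) + diff_loss(h_v_c, h_v_p)
    loss = l_task + alpha * l_sim + beta * l_diff
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```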
Step 4, the best accuracy obtained by training the network is 0.999257 (see the effect shown in the last row of FIG. 2).
The foregoing description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any simple modification, equivalent change or variation made to the above embodiment according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (5)

1. A plant leaf phenotype disease classification method based on image-text subspace joint learning, comprising the following steps:
Step 1, data preprocessing:
traversing the catalog of the data set, arranging the relative path and label information of each group of samples into one row and writing the rows into a csv file; reading back the csv file with the written information, sampling it, and dividing it into a test set and a training set at a ratio of 2:8, thereby obtaining the test.csv and train.csv files;
Step 2, constructing a network model:
in the feature extraction stage, for data of the text modality, performing self-attention encoding on the input with a BERT model from deep learning to learn contextual relations; after obtaining the BERT output, using the mask in the input to average the output along dimension 1, obtaining the utterance (corpus-level) representation u_t of the text input modality; for the visual modality information in the model, adopting a ViT model from deep learning to obtain the utterance-level representation u_v of the visual modality; the image size is set to 128, the size of each patch to 16, the number of categories to classify to 200, the dimension of each representation to 1024, the model depth to 6, the number of attention heads to 16, and the multi-layer perceptron dimension to 2048;
in the modality representation learning stage, using a framework to project the text and image modalities into two different subspaces, one of which is modality-invariant and the other modality-specific, by means of three self-encoders E_t, E_v and E_s: passing the text and visual representations u_t and u_v obtained in the feature extraction stage through the self-encoders E_t and E_v to obtain the text modality-specific representation h_t^p and the visual modality-specific representation h_v^p, and inputting the text information u_t and the visual information u_v into the shared self-encoder E_s to obtain the text modality-invariant representation h_t^c and the visual modality-invariant representation h_v^c; u_t and u_v are thereby projected into the two different subspaces to obtain modality-specific and modality-shared information, giving four vectors h_t^p, h_v^p, h_t^c and h_v^c that represent the specific and invariant representations of the text and visual modalities respectively; performing a series (stacking) operation on them along dimension dim=0 to obtain a new matrix M, and transforming this matrix M with a multi-head attention mechanism to obtain a new matrix h, the parameters of the multi-head attention mechanism being: input dimension 128, heads=2; finally, applying a fusion operation to the output h, the fusion network comprising the following layers: first, a fully connected layer with input dimension 512 and output dimension 384; then a dropout layer with drop_rate 0.5; then a ReLU activation layer; and finally a fully connected layer with input dimension 384 and output dimension 200, thereby obtaining the final classification result;
Step 3, model training and loss calculation:
inputting the train.csv file of the plant leaf disease data set obtained in step 1 into the network model constructed in step 2 and training the network model, wherein the optimizer is Adam and the loss function comprises: a similarity loss L_sim, a difference loss L_diff, and a cross-entropy loss L_task used for classification; the similarity loss L_sim is computed with the Central Moment Discrepancy (CMD): CMD measures the difference between two distributions by matching their order-wise moment differences, and its value becomes smaller as the similarity of the two distributions increases; let X and Y be bounded random samples drawn with respective probability distributions p and q on the interval [a, b]; then

CMD_K(X, Y) = (1/|b - a|) ||E(X) - E(Y)||_2 + Σ_{k=2}^{K} (1/|b - a|^k) ||C_k(X) - C_k(Y)||_2,

where E(X) = (1/|X|) Σ_{x∈X} x is the empirical expectation of the sample X and C_k(X) = E((x - E(X))^k) is the vector of its k-th order central moments, and the similarity loss is taken as L_sim = CMD_K(h_t^c, h_v^c); for the difference loss, during training, letting H_c^m and H_p^m denote the matrices whose rows are the hidden invariant and specific vectors of modality m ∈ {t, v}, the non-correlation constraint on the modality vectors is ||(H_c^m)^T H_p^m||_F^2, where ||·||_F denotes the Frobenius norm, and the loss is defined as

L_diff = Σ_{m ∈ {t,v}} ||(H_c^m)^T H_p^m||_F^2;

the cross-entropy loss L_task is used as the prediction loss function for the downstream task, with the formula:

L_task = -(1/N) Σ_{i=1}^{N} y_i · log(ŷ_i),

where N is the number of samples, y_i is the ground-truth label of sample i and ŷ_i is its predicted class distribution;
the total loss function is:

L = L_task + α·L_sim + β·L_diff,

where α and β are the weights of the similarity and difference regularization terms;
and step 4, training the network to obtain the best accuracy of 0.999257.
2. The plant leaf phenotype disease classification method based on image-text subspace joint learning of claim 1, wherein: the feature extraction stage described in step 2 further applies dropout and emb_dropout of 0.1.
3. The plant leaf phenotype disease classification method based on image-text subspace joint learning of claim 1, wherein: the three self-encoders E_t, E_v and E_s described in step 2 are structurally identical, each comprising a fully connected layer with 128 inputs and outputs and a sigmoid activation layer; E_t processes the text information and E_v processes the visual information.
4. The plant leaf phenotype disease classification method based on image-text subspace joint learning of claim 1, wherein in the modality representation learning stage in step 2, a framework is used to project the text and image modalities into two different subspaces, one of which is modality-invariant, and the method of projecting into the modality-invariant subspace is: the corpus vector is processed through a fully connected layer with 128 inputs and outputs and a sigmoid activation layer.
5. The plant leaf phenotype disease classification method based on image-text subspace joint learning of claim 1 or 4, wherein in the modality representation learning stage in step 2, a framework is used to project the text and image modalities into two different subspaces, one of which is modality-invariant and the other modality-specific, and the method of projecting into the modality-specific subspace is: each corpus vector is first encoded by a fully connected layer with 128 output neurons, and the encoded result is then mapped to the range (0, 1) by a sigmoid activation function.
CN202410067314.7A 2024-01-17 2024-01-17 Plant leaf phenotype disease classification method based on image-text subspace joint learning Active CN117611924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410067314.7A CN117611924B (en) 2024-01-17 2024-01-17 Plant leaf phenotype disease classification method based on image-text subspace joint learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410067314.7A CN117611924B (en) 2024-01-17 2024-01-17 Plant leaf phenotype disease classification method based on image-text subspace joint learning

Publications (2)

Publication Number Publication Date
CN117611924A 2024-02-27
CN117611924B 2024-04-09

Family

ID=89958138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410067314.7A Active 2024-01-17 2024-01-17 Plant leaf phenotype disease classification method based on image-text subspace joint learning

Country Status (1)

Country Link
CN (1) CN117611924B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018219198A1 (en) * 2017-06-02 2018-12-06 腾讯科技(深圳)有限公司 Man-machine interaction method and apparatus, and man-machine interaction terminal
WO2019148898A1 (en) * 2018-02-01 2019-08-08 北京大学深圳研究生院 Adversarial cross-media retrieving method based on restricted text space
US20200183989A1 (en) * 2018-12-10 2020-06-11 Ebay Inc. Generating app or web pages via extracting interest from images
CN114168784A (en) * 2021-12-10 2022-03-11 桂林电子科技大学 Layered supervision cross-modal image-text retrieval method
US20220245391A1 (en) * 2021-01-28 2022-08-04 Adobe Inc. Text-conditioned image search based on transformation, aggregation, and composition of visio-linguistic features
CN115048537A (en) * 2022-07-11 2022-09-13 河北农业大学 Disease recognition system based on image-text multi-mode collaborative representation
CN115050014A (en) * 2022-06-15 2022-09-13 河北农业大学 Small sample tomato disease identification system and method based on image text learning
CN115131627A (en) * 2022-07-01 2022-09-30 贵州大学 Construction and training method of lightweight plant disease and insect pest target detection model
WO2022257578A1 (en) * 2021-06-07 2022-12-15 京东科技信息技术有限公司 Method for recognizing text, and apparatus
CN115984842A (en) * 2023-02-13 2023-04-18 广州数说故事信息科技有限公司 Multi-mode-based video open tag extraction method
CN116258989A (en) * 2023-01-10 2023-06-13 南京邮电大学 Text and vision based space-time correlation type multi-modal emotion recognition method and system
US20230196633A1 (en) * 2021-07-09 2023-06-22 Nanjing University Of Posts And Telecommunications Method of image reconstruction for cross-modal communication system and device thereof
CN116824366A (en) * 2023-06-14 2023-09-29 天津商业大学 Crop disease identification method based on local selection and feature interaction
CN116842475A (en) * 2023-06-30 2023-10-03 东航技术应用研发中心有限公司 Fatigue driving detection method based on multi-mode information fusion
CN116843952A (en) * 2023-06-06 2023-10-03 中国海洋大学 Small sample learning classification method for fruit and vegetable disease identification
US20230342550A1 (en) * 2018-06-06 2023-10-26 Nippon Telegraph And Telephone Corporation Degree of difficulty estimating device, and degree of difficulty estimating model learning device, method, and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Feng Xuguang: "Research on multimodal vegetable leaf disease identification based on images and text", Hebei Agricultural University thesis collection, 31 August 2023 (2023-08-31), pages 1-52 *
Li Yaoxi et al.: "CSNet: a count-supervised wheat ear counting method based on multi-scale MLP-Mixer", Abstracts of the 20th Annual Academic Conference of the Crop Science Society of China, 1 November 2023 (2023-11-01), page 434 *

Also Published As

Publication number Publication date
CN117611924B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
Wen et al. Ensemble of deep neural networks with probability-based fusion for facial expression recognition
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN111898736A (en) Efficient pedestrian re-identification method based on attribute perception
CN112732921B (en) False user comment detection method and system
Hou et al. Distilling knowledge from object classification to aesthetics assessment
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
Oyewole et al. Product image classification using Eigen Colour feature with ensemble machine learning
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
Rao et al. Exploring deep learning techniques for kannada handwritten character recognition: A boon for digitization
Ye et al. A joint-training two-stage method for remote sensing image captioning
Ribeiro et al. Deep learning in digital marketing: brand detection and emotion recognition
Zhou et al. Semantic adaptation network for unsupervised domain adaptation
Wu et al. Sentimental visual captioning using multimodal transformer
Wang et al. R2-trans: Fine-grained visual categorization with redundancy reduction
Yao [Retracted] Application of Higher Education Management in Colleges and Universities by Deep Learning
CN113221680B (en) Text pedestrian retrieval method based on text dynamic guiding visual feature extraction
Alrowais et al. Modified earthworm optimization with deep learning assisted emotion recognition for human computer interface
Ma et al. Multi-scale cooperative multimodal transformers for multimodal sentiment analysis in videos
Hou et al. Confidence-guided self refinement for action prediction in untrimmed videos
Jadhav et al. Content based facial emotion recognition model using machine learning algorithm
CN117611924B (en) Plant leaf phenotype disease classification method based on graphic subspace joint learning
CN116958677A (en) Internet short video classification method based on multi-mode big data
Saffari et al. Low-rank sparse generative adversarial unsupervised domain adaptation for multi-target traffic scene semantic segmentation
CN114936279A (en) Unstructured chart data analysis method for collaborative manufacturing enterprise
Raut et al. Mood-Based Emotional Analysis for Music Recommendation

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant