CN117453859A - Agricultural pest and disease damage image-text retrieval method, system and electronic equipment - Google Patents
- Publication number: CN117453859A
- Application number: CN202311475104.3A
- Authority: CN (China)
- Prior art keywords: text, data, image, sample, sequence
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/334 — Information retrieval of unstructured textual data; query execution
- G06F16/335 — Information retrieval of unstructured textual data; filtering based on additional data, e.g. user or group profiles
- G06F16/535 — Information retrieval of still image data; filtering based on additional data, e.g. user or group profiles
- G06F18/213 — Pattern recognition; feature extraction, e.g. by transforming the feature space
- G06F18/22 — Pattern recognition; matching criteria, e.g. proximity measures
- G06F18/256 — Pattern recognition; fusion of classification results relating to different input data, e.g. multimodal recognition
- G06F40/289 — Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
- G06N3/045 — Neural networks; combinations of networks
- G06N3/0455 — Auto-encoder networks; encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Neural networks; learning methods
- G06V10/454 — Image recognition; integrating filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/761 — Image or video pattern matching; proximity, similarity or dissimilarity measures
- G06V10/765 — Image recognition using classification, with rules for classification or partitioning the feature space
- G06V10/82 — Image recognition using neural networks
Abstract
The invention discloses an agricultural pest and disease image-text retrieval method, system and electronic device, relating to the technical field of electric digital data processing. A feature extraction model extracts features from the data to be retrieved, a cross-modal feature fusion network is constructed, and this network determines the matching information of the feature sequence of the data to be retrieved within a data retrieval library; the matching information serves as the cross-modal retrieval result. Searching pest and disease text by image, or pest and disease images by text, is thereby better realized, and the efficiency and speed of agricultural information retrieval are improved.
Description
Technical Field
The invention relates to the technical field of electric digital data processing, and in particular to an agricultural pest and disease image-text retrieval method, system and electronic device.
Background
Cross-modal retrieval is a popular research direction in current artificial intelligence. It aims to establish mutual correspondence between, and retrieve across, data of different modalities such as text, speech, video and pictures; that is, data of one modality is used to query data of another. Because of the "semantic gap" between data of different modalities, this is not simple to realize: latent associations between the data must be discovered and matching completed.
Cross-modal retrieval based on deep learning at the present stage mainly extracts features from the raw data with a pre-trained object detector, maps the extracted features into a common space, reduces the gap between cross-modal data and high-level semantic features through constraints such as a loss function, and then retrieves the sample with the highest similarity (for example, by cosine similarity) as the match. However, no suitable pre-trained object detector is available for the agricultural pest and disease domain, and training and inference require substantial computational resources.
Various information technologies are widely used in the field of agricultural informatization at present, such as text classification, intelligent question answering and image-based pest detection, but cross-modal technology has not yet been widely applied. At present, most pest and disease detection processes image data, and the related text information is searched only after the pest or disease has been identified from the image, which is inefficient. A suitable cross-modal retrieval technology, by which the matching text can be retrieved from a picture of a crop pest or disease, or the pest or disease image retrieved directly from a sentence, can effectively improve farmers' planting skills and their means of preventing pests and diseases, promote information transmission, and improve retrieval efficiency.
Therefore, the development of the method for realizing the cross-mode retrieval in the field of agricultural diseases and insect pests has high application value.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an agricultural pest and disease damage image-text retrieval method, an agricultural pest and disease damage image-text retrieval system and electronic equipment.
In order to achieve the above object, the present invention provides the following solutions:
an agricultural pest and disease damage image-text retrieval method comprises the following steps:
acquiring data to be retrieved; the data to be retrieved is image data of agricultural diseases and insect pests or text data of the agricultural diseases and insect pests;
extracting the characteristics of the data to be searched by adopting a characteristic extraction model to obtain a characteristic sequence of the data to be searched;
constructing a cross-modal feature fusion network;
and determining, with the cross-modal feature fusion network, the matching information of the feature sequence of the data to be retrieved in a data retrieval library, and taking the matching information as the cross-modal retrieval result.
Optionally, extracting the features of the data to be retrieved by using a feature extraction model to obtain a feature sequence of the data to be retrieved, including:
when the data to be retrieved is image data, extracting the characteristics of the image data by adopting an image encoder, and combining a pixel point characteristic extraction method to obtain image sequence characteristics;
when the data to be retrieved is text data, extracting the characteristics of the text data by adopting a text encoder, and combining a dictionary information fusion method to obtain text sequence characteristics.
Optionally, the construction process of the image encoder includes:
adopting a network of Resnet-50 added with channel attention as an original feature extractor, and processing non-sequence features output by the original feature extractor into sequence features by using a pixel feature extraction method so as to construct and obtain an image feature extraction model;
training an image feature extraction model on an image classification task by adopting image sample data, and taking the trained image feature extraction model as the image encoder.
Optionally, the construction process of the text encoder includes:
extracting character-level features of text sample data with a Chinese RoBERTa model;
collecting a vocabulary of the pest and disease domain and adding it to the word segmenter, so that during text word segmentation the words of the text sample data that appear in the vocabulary are input into the Chinese RoBERTa model as word-level tokens; fine-tuning the Chinese RoBERTa model on a text classification task with the collected texts, so that embedding vectors meeting the set requirements are generated for the words newly added to the vocabulary, to obtain an adjusted Chinese RoBERTa model; and taking the adjusted Chinese RoBERTa model as the text encoder.
Optionally, before the text data is input into the text encoder, a marker "<cls>" and a marker "<seq>" are added at the beginning and the end of the text data, respectively; the marked text data is:
S2 = "<cls>" + S1 + "<seq>";
where S2 is the marked text data and S1 is the original text data.
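A minimal sketch of this marking step (the function name is an assumption; it simply implements S2 = "<cls>" + S1 + "<seq>"):

```python
def add_text_markers(s1: str) -> str:
    """Wrap raw pest/disease text with the <cls>/<seq> markers
    added before the text enters the text encoder."""
    return "<cls>" + s1 + "<seq>"
```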
Optionally, the process of constructing the cross-modal feature fusion network includes:
acquiring picture sample characteristics and text sample characteristics, and combining the picture sample characteristics and the text sample characteristics to obtain an input sample sequence;
training an Embedding layer with the length of 2, and respectively adding the Embedding layer with the picture sample characteristics and the text sample characteristics;
constructing an encoder structure of a self-attention transformation network, and inputting the input sample sequence into the encoder structure to obtain an output result;
performing attention pooling on the text sequence features of the output result and on the picture sequence features of the output result, respectively, to obtain a one-dimensional characterization of the text sequence features and a one-dimensional characterization of the picture sequence features; passing each characterization through the nonlinear transformation of an FC-ReLU-FC combination for multi-class prediction, to obtain the class of each one-dimensional characterization;
taking the joint embedding feature in the output result, and predicting the similarity of the picture and the text in the input sample sequence through an FC-Sigmoid combination;
and determining matching information based on the similarity and a preset value, and determining a cross-modal retrieval result based on the matching information.
Optionally, in the training process of the cross-modal feature fusion network, the loss function adopted is:
Loss = L_1 + λ·L_2;
where the first partial loss function L_1 is:
L_1 = −(1/m) · Σ_{p=1..m} [ y_p · log(ŷ_p) + (1 − y_p) · log(1 − ŷ_p) ]
where m denotes the first sample number, y_p denotes the match label of the p-th sample, and ŷ_p denotes the predicted similarity of the p-th sample;
the second partial loss function L_2 is:
L_2 = −(1/nums) · Σ_{q=1..nums} Σ_{s=1..9} [ y_{q,s}^img · log(p_{q,s}^img) + y_{q,s}^txt · log(p_{q,s}^txt) ]
where nums denotes the second sample number, 9 denotes the 9 kinds of pests and diseases, s denotes the s-th pest or disease, y_{q,s}^img denotes the label class of the q-th picture sample, y_{q,s}^txt denotes the class label of the q-th text sample, p_{q,s}^img denotes the probability predicted by the softmax function for the picture sample, and p_{q,s}^txt denotes the probability predicted by the softmax function for the text sample.
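As a sketch of how this combined loss could be computed, the function and variable names below are illustrative assumptions: the first part is taken as the binary cross-entropy of the match prediction, and the second as the classification cross-entropy over the 9 pest and disease classes for both modalities, consistent with the variable descriptions above.

```python
import numpy as np

def softmax(z):
    # numerically stable row-wise softmax over the class dimension
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fusion_loss(y_match, y_hat, img_logits, txt_logits,
                img_labels, txt_labels, lam=1.0):
    # L1: binary cross-entropy between match labels y_p and predicted similarities
    l1 = -np.mean(y_match * np.log(y_hat) + (1 - y_match) * np.log(1 - y_hat))
    # L2: classification cross-entropy over the 9 classes, picture + text branches
    p_img, p_txt = softmax(img_logits), softmax(txt_logits)
    idx = np.arange(img_logits.shape[0])
    l2 = -np.mean(np.log(p_img[idx, img_labels]) + np.log(p_txt[idx, txt_labels]))
    return l1 + lam * l2
```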
Further, the invention provides an agricultural pest image-text retrieval system, which is used for implementing the agricultural pest image-text retrieval method; the system comprises:
the data acquisition module is used for acquiring data to be retrieved; the data to be retrieved is image data of agricultural diseases and insect pests or text data of the agricultural diseases and insect pests;
the feature extraction module is used for extracting the features of the data to be searched by adopting a feature extraction model to obtain a feature sequence of the data to be searched;
the network construction module is used for constructing a cross-modal feature fusion network;
and the image-text retrieval module is used for determining, with the cross-modal feature fusion network, the matching information of the feature sequence of the data to be retrieved in the data retrieval library, and taking the matching information as the cross-modal retrieval result.
Still further, the present invention also provides an electronic device including:
a memory for storing a computer program;
and the processor is connected with the memory and used for calling and executing the computer program so as to implement the agricultural pest image-text retrieval method.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
according to the invention, the feature extraction model is adopted to extract the features of the data to be searched, a cross-modal feature fusion network is constructed, the cross-modal feature fusion network is adopted to determine the matching information of the feature sequence of the data to be searched in the data search library, and the matching information is used as a cross-modal search result, so that the graph search or the graph search of the plant diseases and insect pests is better realized, and the efficiency and the speed of agricultural information search are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an agricultural pest image-text retrieval method provided by the invention;
FIG. 2 is a schematic diagram of the implementation of agricultural pest and disease image-text retrieval provided by an embodiment of the present invention;
FIG. 3 is a flowchart of word segmentation processing and training performed by fusing dictionary information in RoBERTa according to an embodiment of the present invention; wherein (a) of fig. 3 is a flowchart of the original text encoder word segmentation process, and (b) of fig. 3 is a flowchart of the word segmentation process after adding a dictionary;
fig. 4 is a flowchart of text retrieval provided in an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide an agricultural pest image-text retrieval method, an agricultural pest image-text retrieval system and electronic equipment, which can improve the efficiency and speed of agricultural information retrieval.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
As shown in fig. 1, the image-text searching method for agricultural diseases and insect pests provided by the invention comprises the following steps:
step 100: and obtaining data to be retrieved. The data to be retrieved is image data of an agricultural pest or text data of an agricultural pest.
Step 101: and extracting the characteristics of the data to be searched by adopting a characteristic extraction model to obtain a characteristic sequence of the data to be searched. When the data to be retrieved is image data, an image encoder is adopted to extract the characteristics of the image data, and an image sequence characteristic is obtained by combining a pixel characteristic extraction method. When the data to be retrieved is text data, extracting the characteristics of the text data by adopting a text encoder, and combining a dictionary information fusion method to obtain the text sequence characteristics.
In the practical application process, the construction process of the image encoder comprises the following steps:
(1) And adopting a network of Resnet-50 added with channel attention as an original feature extractor, and processing the non-sequence features output by the original feature extractor into sequence features by using a pixel feature extraction method so as to construct and obtain an image feature extraction model.
(2) Training an image feature extraction model on an image classification task by adopting image sample data, and taking the trained image feature extraction model as an image encoder.
B. The construction process of the text encoder comprises the following steps:
(1) The character-level features of the text sample data are extracted with a Chinese RoBERTa model.
(2) A vocabulary of the pest and disease domain is collected and added to the word segmenter; during text word segmentation preprocessing, words of the text sample data that appear in the vocabulary are input into the Chinese RoBERTa model as word-level tokens. The Chinese RoBERTa model is then fine-tuned on a text classification task with the collected texts, so that embedding vectors meeting the set requirements are generated for the words newly added to the vocabulary; the adjusted Chinese RoBERTa model is taken as the text encoder.
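The dictionary-fused word segmentation can be sketched as a greedy longest-match over the collected domain vocabulary, with everything not in the vocabulary falling back to character-level tokens. This is an illustrative approximation of the behavior described, not the RoBERTa tokenizer itself; the function name and the greedy strategy are assumptions.

```python
def segment_with_lexicon(text, lexicon):
    """Greedy longest-match segmentation: spans found in the pest-domain
    lexicon become word-level tokens; all other characters fall back to
    single character-level tokens (RoBERTa's default for Chinese)."""
    max_len = max(map(len, lexicon)) if lexicon else 1
    tokens, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + l] in lexicon:      # domain word found
                tokens.append(text[i:i + l])
                i += l
                break
        else:                                  # no word matched here
            tokens.append(text[i])
            i += 1
    return tokens
```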
Step 102: and constructing a cross-modal feature fusion network.
In the actual application process, the process of constructing the cross-modal feature fusion network comprises the following steps:
(1) And acquiring the picture sample characteristics and the text sample characteristics, and combining the picture sample characteristics and the text sample characteristics to obtain an input sample sequence.
(2) Training an Embedding layer with the length of 2, and respectively adding the Embedding layer with the picture sample characteristics and the text sample characteristics.
(3) And constructing an encoder structure of the self-attention transformation network, and inputting the input sample sequence into the encoder structure to obtain an output result.
(4) Attention pooling is performed on the text sequence features of the output result and on the picture sequence features of the output result, respectively, to obtain a one-dimensional characterization of each; each characterization then passes through the nonlinear transformation of an FC-ReLU-FC combination for multi-class prediction, giving the class of the one-dimensional characterization of the text sequence features and the class of the one-dimensional characterization of the picture sequence features.
(5) And taking joint embedded features in the output result, and predicting the similarity of the pictures and the texts in the input sample sequence through FC-Sigmoid combination.
(6) And determining matching information based on the similarity and a preset value, and determining a cross-modal retrieval result based on the matching information.
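Steps (1) through (4) above can be sketched as follows. The feature width and all weight values are toy illustrations, and the self-attention (transformer) encoder between input assembly and pooling is elided; only the length-2 modality Embedding addition and the attention pooling are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16                                  # toy feature width
img_seq = rng.normal(size=(9, D))       # picture sequence features (9 pixel points)
txt_seq = rng.normal(size=(12, D))      # text sequence features (token features)

# length-2 Embedding layer: one learned vector per modality,
# added to every position of the corresponding sequence
modality_emb = rng.normal(size=(2, D))
x = np.concatenate([img_seq + modality_emb[0], txt_seq + modality_emb[1]])

# ... a self-attention (transformer) encoder would process x here ...

def attention_pool(seq, w):
    """Collapse a feature sequence into a one-dimensional characterization
    using softmax attention weights over positions."""
    scores = seq @ w
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ seq

w = rng.normal(size=D)
img_vec = attention_pool(x[:9], w)      # one-dimensional picture characterization
txt_vec = attention_pool(x[9:], w)      # one-dimensional text characterization
```

The two pooled vectors would then feed the FC-ReLU-FC classification heads, while the joint embedding feature feeds the FC-Sigmoid similarity head.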
Further, in order to improve the accuracy of retrieval, in the training process of the cross-modal feature fusion network, the loss function adopted may be:
Loss = L_1 + λ·L_2;
where the first partial loss function L_1 is:
L_1 = −(1/m) · Σ_{p=1..m} [ y_p · log(ŷ_p) + (1 − y_p) · log(1 − ŷ_p) ]
where m denotes the first sample number, y_p denotes the match label of the p-th sample, and ŷ_p denotes the predicted similarity of the p-th sample;
the second partial loss function L_2 is:
L_2 = −(1/nums) · Σ_{q=1..nums} Σ_{s=1..9} [ y_{q,s}^img · log(p_{q,s}^img) + y_{q,s}^txt · log(p_{q,s}^txt) ]
where nums denotes the second sample number, 9 denotes the 9 kinds of pests and diseases, s denotes the s-th pest or disease, y_{q,s}^img denotes the label class of the q-th picture sample, y_{q,s}^txt denotes the class label of the q-th text sample, p_{q,s}^img denotes the probability predicted by the softmax function for the picture sample, and p_{q,s}^txt denotes the probability predicted by the softmax function for the text sample.
Step 103: and determining the matching information of the data feature sequence to be searched in the data search library by adopting a cross-modal feature fusion network, and taking the matching information as a cross-modal search result.
The following provides two examples to explain the specific application process of the image-text searching method for agricultural diseases and insect pests.
Example 1
In this embodiment, the overall architecture of the image-text searching method for agricultural diseases and insect pests is shown in fig. 2, and the implementation process is as follows:
step one, constructing a disease and pest cross-modal data set, expanding the data set through data enhancement, using a Resnet-50 network model added with channel attention and a Chinese RoBERTa pre-training model as a feature extraction network of images and texts, and respectively extracting sequence features of the images and texts by combining pixel feature extraction and dictionary information fusion methods.
A) The specific steps of constructing the disease and pest cross-modal data set in the first step are as follows:
training and testing data sets are constructed in a graphic pair mode. The pictures are mainly pictures of disease and insect damage symptoms of crops, the texts are related texts of the disease and insect damage, and the data set is expanded through data enhancement. And taking all matched image-text pairs as positive samples, randomly constructing unmatched negative samples among different kinds of pest and disease data, and keeping the same number of positive samples and negative samples.
B) The specific steps of extracting text and picture feature vectors by using a pre-training text encoder and an image encoder in the first step are as follows:
1) A pre-trained text encoder is constructed. It extracts character-level features of the text with a pre-trained Chinese RoBERTa model. A professional vocabulary of the pest and disease domain is then collected and added to the word segmenter; during text word segmentation preprocessing, words of the text that appear in the vocabulary are input into the model as word-level tokens. The RoBERTa model is fine-tuned on a text classification task with the collected texts, so that suitable embedding vectors are generated for the words newly added to the dictionary, and the model is saved as the text encoder.
Further, before text is input into the text encoder, the text S1 needs to be processed as follows:
S2=“<cls>”+S1+“<seq>” (4)
Two marks are added at the beginning and the end of the text: "&lt;cls&gt;" and "&lt;seq&gt;". Text features are then extracted using the text encoder, expressed as:

V = {'[ismh]', v_1, v_2, ..., v_n, '[seq]'} (5)

where v_1, v_2, ..., v_n are the features of the n text tokens, and the cls token of the original BERT is denoted ismh after feature fusion.
2) Constructing a pre-trained image encoder: the image encoder uses a Resnet-50 network with channel attention added as the original feature extractor, and the non-sequence features output by the model are processed into sequence features with the pixel feature extraction method. The Resnet model is trained on an image classification task with the collected pictures and then used as the image encoder.
Further, the channel attention and pixel feature extraction are specifically as follows:
Channel attention: after each residual connection of the Resnet-50 network, feature compression is performed on the feature map of input dimension (D×H×W) along the spatial dimensions, and each two-dimensional feature map (H×W) is turned into a real number by the following formula:

Z_i = (1/(H×W)) · Σ_{h=1}^{H} Σ_{w=1}^{W} d_i(h, w) (6)

where d_i denotes the feature map of the i-th channel, Z_i is the computed value for the feature map of the i-th channel, and H and W denote the height and width of the feature map.
After obtaining the real-valued representation of each channel feature, it is mapped through the nonlinear transformation (FC-Tanh-FC) and finally mapped into (0, 1) by a Sigmoid(·) operation, giving the set of attention weights for the channels of the original feature map. Each two-dimensional feature map along the D dimension of the original feature map is multiplied by its corresponding weight in the attention weight set, forming the new channel-attention feature map.
The feature map output by the Resnet network has dimension [2048, 3, 3]. The latter two dimensions of the feature map are regarded as 9 pixel points; the 2048-dimensional feature of each pixel point is extracted along the spatial dimension and the 9 groups are concatenated, so the processed image features have dimension [9, 2048], expressed as U = {u_1, u_2, ..., u_9}, where u_1, u_2, ..., u_9 are the 9 processed image features.
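The reshaping above can be sketched in PyTorch; this is a minimal sketch and the function name is illustrative:

```python
import torch

def pixels_to_sequence(feature_map: torch.Tensor) -> torch.Tensor:
    """Flatten a (C, H, W) CNN feature map into a (H*W, C) sequence,
    treating each spatial position as one 'pixel' token."""
    c, h, w = feature_map.shape
    # (C, H, W) -> (H*W, C): row-major order (1,1), (1,2), ..., (3,3)
    return feature_map.reshape(c, h * w).permute(1, 0)

# A dummy Resnet-50 output of dimension [2048, 3, 3] becomes [9, 2048].
fmap = torch.randn(2048, 3, 3)
seq = pixels_to_sequence(fmap)
print(seq.shape)  # torch.Size([9, 2048])
```

Each row of the result is the 2048-dimensional channel feature of one spatial position, in the row-major order stated above.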
And secondly, constructing a Transformer encoder-based cross-modal feature fusion network to fuse the picture and text features. The picture and text features are mapped into a common subspace through mapping matrices, then concatenated and jointly input into the feature fusion network, which generates new joint embedding vectors to learn image-text matching features; their similarity is predicted through a fully connected layer to judge whether the current input samples match, and matched samples are returned. The multi-dimensional picture and text features output by the fusion network are constrained into one-dimensional features through attention pooling, and their categories are predicted by a linear layer as a subtask that constrains model training; through classification the model perceives more of the differences between input samples.
The specific steps of constructing the cross-modal feature fusion network in the second step are as follows:
The encoder part of a self-attention transformation network (Transformer) is adopted as the cross-modal feature fusion network. Specifically, the mapped picture feature sequence U and text feature sequence V are combined into an input sequence f according to the following rule:

f = {'[ismh]', v_1, v_2, ..., v_n, '[seq]', u_1, u_2, ..., u_9} (7)
At the same time, an Embedding layer of length 2 is trained; its first and second entries are added to the text features v_i and the picture features u_i respectively to distinguish the data of the two modalities.
Then the encoder structure of the Transformer is constructed; the concatenated picture and text features are taken as the model input, and the output result is expressed as:

f̂ = Encoder(f) (8)

where x̂ denotes the output result corresponding to x.
Finally, the text sequence features V̂ and picture sequence features Û of the output result are reduced to one-dimensional characterizations v and u through attention pooling, and their categories are predicted by multi-classification through the nonlinear transformation of the FC-Relu-FC combination. The joint embedding feature at the '[ismh]' position is taken, and the similarity of the input picture and text is predicted through the FC-Sigmoid combination, with values above 0.5 treated as matched and values below 0.5 as unmatched.
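The fusion step can be sketched with PyTorch's built-in Transformer encoder; the class name, layer counts, head count and the 1024-dimensional common subspace are illustrative assumptions, not the patent's exact configuration:

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of the Transformer-encoder fusion described above.
    Dimensions and layer counts are illustrative assumptions."""
    def __init__(self, dim=1024, heads=8, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        # length-2 Embedding distinguishing the two modalities
        self.modality = nn.Embedding(2, dim)
        self.match_head = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, text_seq, img_seq):
        # add modality embeddings: id 0 for text tokens, id 1 for image tokens
        t = text_seq + self.modality.weight[0]
        u = img_seq + self.modality.weight[1]
        f = torch.cat([t, u], dim=1)      # joint input sequence, as in formula (7)
        out = self.encoder(f)             # encoder output, as in formula (8)
        # joint embedding at the first position -> FC-Sigmoid match score
        return self.match_head(out[:, 0, :]).squeeze(-1)

net = CrossModalFusion()
score = net(torch.randn(2, 50, 1024), torch.randn(2, 9, 1024))
print(score.shape)  # torch.Size([2])
```

Scores above 0.5 would then be treated as matched, below 0.5 as unmatched.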
When training the model, the loss function used for constraining the model is shown in the formulas (1) to (3).
And step three, training the cross-modal feature fusion network on the pest and disease image-text pair data set, adding prompt engineering in the training, and collecting relevant texts and pictures of the pests and diseases as a data retrieval library.
Further, the step of training the network model and creating the data retrieval library in the step three is as follows:
The whole model, including the text encoder, the image encoder and the cross-modal feature fusion network, is trained on the data set using the collected image-text pairs. A paradigm T is added to the text training data, a short text: "This is a picture of XX pest", where XX represents the name of the pest. In text-to-image retrieval during model inference, the user inputs the pest name, the name is placed in the XX position of T, and the whole text is used as the search Query, so that related pictures can be retrieved by directly inputting the pest name. Partial pictures and texts are collected from the training and test sets as the model's data retrieval library, and the features of the data in the library are extracted in advance with the image encoder and text encoder and stored.
And step four, when the trained model is used for cross-modal retrieval, extracting the serialization features of the input pictures (or texts) by using a corresponding feature extraction method, judging whether the pictures are matched with the texts (or pictures) in the data retrieval library through a cross-modal feature fusion network, and outputting matched information as a cross-modal retrieval result.
Further, in the fourth step, the step of using the model to perform cross-modal matching on the input data is:
When a picture (or text) is input, the image encoder (or text encoder) is used to obtain its serialized features; these are then passed, together with the texts (or images) whose features were extracted in advance from the data retrieval library, through the cross-modal feature fusion network to obtain the matching results, which are returned to the requester.
Example two
In this implementation, pre-trained models for extracting picture and text features are prepared. A Resnet50 network with channel attention added is used as the backbone for image feature extraction, combined with the pixel feature extraction method to obtain the picture characterization, and the Chinese RoBERTa model combined with the dictionary information fusion method extracts the text sequence characterization. After subspace mapping through fully connected layers, the features are concatenated and passed together into the cross-modal feature fusion network; the output joint embedding features are taken, their similarity is predicted through a fully connected layer to judge whether they match, and prompt engineering is added to better realize text-to-image retrieval of pests and diseases. The trained model can search the retrieval library for data of the other modality that matches the input data and return it as output. The overall implementation framework is shown in fig. 2, and the specific steps are as follows:
step 1: the Resnet-50 model with added channel attention is used as an image encoder, and pixel channel characteristics of the image are extracted from the output characteristic diagram.
The added channel attention performs feature compression on the feature map of input dimension (D×H×W) after each residual block of the Resnet-50 network, turning the two-dimensional feature map (H×W) of each channel into a real number through formula (6).
After obtaining the real-valued representation of each channel-dimension feature, it is mapped through the nonlinear transformation (FC-ReLU-FC) and finally mapped into (0, 1) by a Sigmoid(·) operation as the attention weight set of the channels of the original feature map. A scale operation is performed between each two-dimensional feature map along the D dimension of the original feature map and its corresponding weight in the attention weight set, i.e., the Sigmoid(·)-derived weight is applied to each channel's features, forming the new channel-attention feature map, which is then added to the original features.
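The channel attention described above can be sketched as a squeeze-and-excitation style module in PyTorch; the module name and reduction ratio are assumptions:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the described channel attention: global average squeeze,
    FC-ReLU-FC, Sigmoid weights, then scale plus residual add.
    The reduction ratio is an assumption, not from the patent."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, d, h, w = x.shape
        z = x.mean(dim=(2, 3))             # formula (6): squeeze each (H, W) map to a real number
        att = self.fc(z).view(b, d, 1, 1)  # per-channel attention weights in (0, 1)
        return x * att + x                 # scale, then add back the original features

ca = ChannelAttention(2048)
y = ca(torch.randn(1, 2048, 3, 3))
print(y.shape)  # torch.Size([1, 2048, 3, 3])
```

The output keeps the input shape, so the module can be dropped in after each residual block.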
The pixel feature extraction method: after removing the classification layer of the Resnet-50 network and adding a (3×3) AdaptiveAvgPool2d layer, the downsampled image is finally represented as a (2048×3×3) feature map. Because the deep network has learned the channel features sufficiently, and the subsequent feature fusion requires the image to be processed into sequence features, the 2048-dimensional spatial feature corresponding to each pixel in the (3×3) dimensions is extracted from the feature map, and the features are concatenated in the order (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1), (3,2), (3,3) into a (9×2048) sequence feature. Adding channel attention and extracting channel-dimension features as the image serialization characterization noticeably improves the quality of the image features. The image encoder is pre-trained on the collected picture data set with an image classification task, and the trained checkpoint is saved as the image encoder of this method.
The process of using the Chinese RoBERTa model as a text encoder and fusing domain dictionary information is as follows:
Because the Chinese RoBERTa model takes single characters as tokens, and its pre-training data may not contain much data from this field, splitting domain terms of the pest and disease field into character-level characterizations is inaccurate. Suitable domain nouns can therefore be added to the tokenizer of the Chinese RoBERTa model so that each noun is treated directly as one token, and the model is then fine-tuned on domain text data to learn more accurate noun characterizations.
As shown in fig. 3, the process of word segmentation and training to obtain new tensors is as follows: terms in the field of plant diseases and insect pests are collected, including scientific names, common symptoms, common equipment, colloquial expressions and the like; these words are added to the vocabulary of the tokenizer used in text preprocessing, so that terms in the text are treated as word-level tokens during segmentation. The Embedding layer of the Chinese RoBERTa model is expanded, and a new random vector representation is generated for each term. The Chinese RoBERTa model is then fine-tuned on the collected pest and disease text data set with a text classification task, making the random vector representations of the newly expanded terms more accurate.
The tokenizer and the Chinese RoBERTa model are created through the BertTokenizer and BertModel classes provided by Hugging Face. The collected domain terms are added to the vocabulary with the api tokenizer.add_tokens(list); the tokenizer will then no longer split nouns present in the list. New initialization vectors for the newly added nouns are created in the Embedding layer of the Chinese RoBERTa model by calling the api model.resize_token_embeddings(len(tokenizer)). The Chinese RoBERTa model is then fine-tuned with a text classification task on the collected pest-field texts to obtain more accurate characterizations of the domain nouns newly added to the tokenizer, and is saved as the text encoder.
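The vocabulary-extension step can be sketched as a small helper; with Hugging Face's BertTokenizer and BertModel, `add_tokens` and `resize_token_embeddings` are the corresponding real APIs, while the helper's name is illustrative:

```python
def add_domain_terms(tokenizer, model, terms):
    """Add pest-domain terms to the tokenizer vocabulary and grow the
    model's embedding table to match. Assumes the Hugging Face
    BertTokenizer / BertModel interfaces (add_tokens,
    resize_token_embeddings)."""
    n_added = tokenizer.add_tokens(list(terms))
    if n_added > 0:
        # creates randomly initialised vectors for the new tokens
        model.resize_token_embeddings(len(tokenizer))
    return n_added
```

Usage would be e.g. `add_domain_terms(tokenizer, model, ["白粉病", "蚜虫"])`, followed by fine-tuning on the classification task so the new random vectors become meaningful.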
Before extracting text features with the text encoder, the two marks '[cls]' and '[seq]' are added before and after the text as the classification feature and segmentation vector respectively, and the text length max_len is set to 50; texts longer or shorter than 50 are truncated or padded. The extracted features are expressed as {'[cls]', v_1, v_2, ..., v_48, '[seq]'}, where each feature dimension is 768.
Step 2: and constructing a disease and pest cross-mode image-text matching data set and preprocessing, wherein pictures and texts of various disease and pests are collected through crawler, book scanning, field shooting and other modes. The picture is mainly a symptom picture of a certain type of plant diseases and insect pests, and the text is various text descriptions related to the plant diseases and insect pests, including introduction, symptoms, severity, prevention measures and the like. Firstly, expanding a picture data set for picture mode data in a manner of overturning, random cutting and the like, and expanding a text data set for text mode data in a manner of synonym replacement, random insertion, random deletion and the like. In constructing the training set (of picture-text pairs), by means of mutual matching, for example for a certain class of image and text data: wherein (1)>A picture set is represented and is displayed,representing a text set. Then by adding->And->Method for matching with each other, constituting a data set +.> As positive samples of cross-modality retrieval matches. At the same time randomly shuffleThe unmatched image-sentence pairs are used as negative samples, and the same number of positive samples and negative samples are kept, so that the model can learn a correct distinguishing method, and training data can be amplified by the data enhancement method, and meanwhile, the generalization capability of the model is enhanced.
Prompt Engineering is added in model training to improve the accuracy of text-to-image retrieval in the model inference stage. The specific steps are as follows:
A text paradigm T is added to the training and test sets: "This is a picture of XX pest", where XX represents the name of the pest. In text-to-image retrieval during model inference, the user inputs the pest name, the name is placed in the XX position of T, and the whole text is used as the search Query, so that related pictures can be retrieved by directly inputting the pest name.
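Filling the paradigm T can be sketched as follows; the template string follows the text above and the function name is illustrative:

```python
PROMPT_T = "This is a picture of {name} pest"  # the paradigm T

def make_query(pest_name: str) -> str:
    """Fill the user's pest name into the XX slot of paradigm T to
    form the retrieval Query for text-to-image search."""
    return PROMPT_T.format(name=pest_name)

print(make_query("rice planthopper"))  # This is a picture of rice planthopper pest
```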
Step 3: the collected image-text data are respectively transmitted into an image and a text encoder to obtain characterization of the image-text data, and the image characterization is represented as U= { U 1 ,u 2 ,...,u 9 [9,2048 ] dimension]The text representation is expressed as V = {' [ cls ]]’,v 1 ,v 2 ,...,v 48 ,’[seq]' dimension [48,768 ]]。
Step 4: the extracted image and text features are mapped in subspace by training two subspace matrixes, wherein the steps are training two fully connected layers, mapping the image and text features to a common subspace through the fully connected layers, and mapping the features in U and V to 1024 lengths.
Step 5: mapping the picture characteristics U i And text feature V i The input sequences as shown in equation (7) are combined according to the following rules.
A Transformer encoder-based cross-modal feature fusion network is constructed; cross-modal attention between image pixels and language is learned through the multi-head self-attention of the Transformer's encoder part, and the combined input sequence is passed through the encoder part, with the result expressed as shown in formula (8).
Further, the text sequence features V̂ and picture sequence features Û of the output result are reduced to one-dimensional characterizations v and u through attention pooling, and their categories are predicted by multi-classification through the FC-Relu-FC combination. The joint embedding feature at the '[ismh]' position is taken, and the similarity of the input picture and text is predicted through the FC-Sigmoid combination to judge whether they match.
Attention pooling is as follows: the attention weight set of each sequence, denoted AttU and AttV, is learned through a fully connected layer plus Softmax(·). Each sequence feature in the image features U and text features V is multiplied by the corresponding attention weight in AttU and AttV, and the weighted sequence features are summed to obtain the respective one-dimensional characterizations u and v of the picture and the text. Multi-classification then yields the classification results of u and v.
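Attention pooling as described can be sketched in PyTorch (the module name is illustrative):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Sketch of the described attention pooling: a fully connected
    layer plus Softmax learns one weight per sequence position, and the
    weighted sequence features are summed into one one-dimensional
    characterization."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, length, dim) -> weights: (batch, length, 1)
        att = torch.softmax(self.score(seq), dim=1)
        return (seq * att).sum(dim=1)      # (batch, dim)

pool = AttentionPool(1024)
v = pool(torch.randn(2, 48, 1024))
print(v.shape)  # torch.Size([2, 1024])
```

Since the weights sum to one over the sequence, the pooled vector is a convex combination of the sequence features.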
Model training is constrained by the loss function of formulas (1) to (3), where λ in the Loss is a hyper-parameter; a reference value is 0.001.
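Assuming, as the claims suggest, that L1 is a binary matching loss on the predicted similarity and L2 a classification cross-entropy over the 9 pest classes, the combined Loss = L1 + λ·L2 can be sketched as follows (a sketch under those assumptions, not the patent's exact formulas):

```python
import torch
import torch.nn.functional as F

def total_loss(match_pred, match_label, img_logits, txt_logits, labels, lam=0.001):
    """Loss = L1 + lam * L2: L1 is binary cross-entropy on the match
    probability; L2 is cross-entropy over pest classes for the pooled
    picture and text characterizations."""
    l1 = F.binary_cross_entropy(match_pred, match_label)
    l2 = F.cross_entropy(img_logits, labels) + F.cross_entropy(txt_logits, labels)
    return l1 + lam * l2

loss = total_loss(torch.tensor([0.8, 0.2]), torch.tensor([1.0, 0.0]),
                  torch.randn(2, 9), torch.randn(2, 9), torch.tensor([3, 7]))
```

The classification term acts as the auxiliary subtask that lets the model perceive differences between input samples.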
Step 5: after training is completed, the picture is searched for text or the picture is searched for text using the flow shown in fig. 4:
A data retrieval library of agricultural pest and disease pictures and texts is collected; the texts in the library can be selected to include pest introduction, symptoms, harm, treatment means, etc. Tensor representations of the data are obtained in advance through the pre-trained models and stored in files in .mat format, which saves the time of the feature extraction stage during inference and improves retrieval efficiency.
When a picture (or text) is input, the image encoder (or text encoder) is used to obtain its serialized features; these are then passed, together with the texts (or images) whose features were extracted in advance from the data retrieval library, through the cross-modal feature fusion network to obtain the matching results, which are returned to the requester. The user inputs the pest name to be searched, the name is placed in the XX position of the paradigm T to obtain a complete text, and the complete text is input into the model to be matched against the pictures in the retrieval library.
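The retrieval loop can be sketched with a pluggable scoring function standing in for the fusion network; all names here are illustrative:

```python
def cross_modal_search(query_feat, library_feats, match_score, threshold=0.5):
    """Match a query's features against pre-extracted library features
    via a scoring function (a stand-in for the cross-modal fusion
    network) and return matched items ranked by similarity."""
    scored = [(item, match_score(query_feat, feat))
              for item, feat in library_feats.items()]
    matches = [(item, s) for item, s in scored if s > threshold]
    return sorted(matches, key=lambda x: x[1], reverse=True)

# toy scorer standing in for the fusion network's FC-Sigmoid output
toy_score = lambda q, f: 1.0 - abs(q - f)
hits = cross_modal_search(0.9, {"rust.jpg": 0.85, "aphid.jpg": 0.2}, toy_score)
print([item for item, _ in hits])  # ['rust.jpg']
```

In the real system the library features would be loaded from the pre-computed .mat files and the scorer would be the trained fusion network.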
Based on the above description, the invention uses the Resnet50 network with channel attention added to extract pixel-point channel features of the image, and the Chinese RoBERTa model combined with the dictionary information fusion method to extract text sequence features. After subspace mapping through the matrices, the features are passed together into the cross-modal feature fusion network; the similarity of the output joint embedding features is computed to judge whether they match, and prompt engineering is added to better realize text-to-image retrieval of pests and diseases. The trained model can find, in the retrieval library, the data of the other modality that matches the input data and output it.
Further, the invention provides an agricultural pest image-text retrieval system, which is used for implementing the agricultural pest image-text retrieval method. The system comprises:
and the data acquisition module is used for acquiring the data to be retrieved. The data to be retrieved is image data of an agricultural pest or text data of an agricultural pest.
And the feature extraction module is used for extracting features of the data to be searched by adopting a feature extraction model to obtain a feature sequence of the data to be searched.
And the network construction module is used for constructing a cross-modal feature fusion network.
And the image-text retrieval module is used for determining the matching information of the data feature sequence to be retrieved in the data retrieval library by adopting a cross-modal feature fusion network, and taking the matching information as a cross-modal retrieval result.
Still further, the present invention also provides an electronic device including: memory and a processor.
The memory is used for storing a computer program.
The processor is connected with the memory for retrieving and executing the computer program to implement the agricultural pest image-text retrieval method.
Furthermore, the computer program in the above-described memory may be stored in a computer-readable storage medium when it is implemented in the form of a software functional unit and sold or used as a separate product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a mobile hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; also, it is within the scope of the present invention to be modified by those of ordinary skill in the art in light of the present teachings. In view of the foregoing, this description should not be construed as limiting the invention.
Claims (9)
1. An agricultural pest and disease pattern searching method is characterized by comprising the following steps:
acquiring data to be retrieved; the data to be retrieved is image data of agricultural diseases and insect pests or text data of the agricultural diseases and insect pests;
extracting the characteristics of the data to be searched by adopting a characteristic extraction model to obtain a characteristic sequence of the data to be searched;
constructing a cross-modal feature fusion network;
and determining the matching information of the data feature sequence to be searched in a data search library by adopting the cross-modal feature fusion network, and taking the matching information as a cross-modal search result.
2. The agricultural pest and disease damage image-text retrieval method according to claim 1, wherein extracting features of the data to be retrieved by using a feature extraction model to obtain a feature sequence of the data to be retrieved, comprises:
when the data to be retrieved is image data, extracting the characteristics of the image data by adopting an image encoder, and combining a pixel point characteristic extraction method to obtain image sequence characteristics;
when the data to be retrieved is text data, extracting the characteristics of the text data by adopting a text encoder, and combining a dictionary information fusion method to obtain text sequence characteristics.
3. The agricultural pest and disease damage image-text retrieval method according to claim 2, wherein the construction process of the image encoder comprises:
adopting a network of Resnet-50 added with channel attention as an original feature extractor, and processing non-sequence features output by the original feature extractor into sequence features by using a pixel feature extraction method so as to construct and obtain an image feature extraction model;
training an image feature extraction model on an image classification task by adopting image sample data, and taking the trained image feature extraction model as the image encoder.
4. The agricultural pest and disease damage image-text retrieval method according to claim 2, wherein the construction process of the text encoder comprises:
extracting word-level features of text sample data by adopting a Chinese Roberta model;
collecting word lists in the field of plant diseases and insect pests, adding the word lists into a word segmentation device, inputting words appearing in the word lists in text sample data as word level token into a Chinese RoBERTa model during text word segmentation, adjusting the Chinese RoBERTa model on a text classification task by using collected texts to generate embedded vectors meeting set requirements for the words newly added into the word lists, obtaining an adjusted Chinese RoBERTa model, and taking the adjusted Chinese RoBERTa model as the text encoder.
5. The agricultural pest and disease pattern retrieval method according to claim 2, wherein before the text data is input into the text encoder, a tag "&lt;cls&gt;" and a tag "&lt;seq&gt;" are added at its beginning and end respectively; the marked text data is:
S2 = "&lt;cls&gt;" + S1 + "&lt;seq&gt;";
wherein S2 is the marked text data and S1 is the text data.
6. The agricultural pest and disease damage graph-text retrieval method according to claim 1, wherein the process of constructing a cross-modal feature fusion network comprises:
acquiring picture sample characteristics and text sample characteristics, and combining the picture sample characteristics and the text sample characteristics to obtain an input sample sequence;
training an Embedding layer with the length of 2, and respectively adding the Embedding layer with the picture sample characteristics and the text sample characteristics;
constructing an encoder structure of a self-attention transformation network, and inputting the input sample sequence into the encoder structure to obtain an output result;
respectively carrying out attention pooling on the text sequence characteristics of the output result and the picture sequence characteristics of the output result to obtain one-dimensional characterization of the text sequence characteristics and one-dimensional characterization of the picture sequence characteristics, and respectively carrying out multi-classification prediction on nonlinear changes of FC-Relu-FC combinations to obtain the categories of the one-dimensional characterization of the text sequence characteristics and the categories of the one-dimensional characterization of the picture sequence characteristics;
taking joint embedded features in an output result, and predicting the similarity of pictures and texts in an input sample sequence through FC-Sigmoid combination;
and determining matching information based on the similarity and a preset value, and determining a cross-modal retrieval result based on the matching information.
7. The agricultural pest and disease pattern retrieval method according to claim 1, wherein in the cross-modal feature fusion network training process, a loss function is adopted as follows:
Loss = L_1 + λ·L_2;
wherein the first partial loss function L_1 is:
L_1 = -(1/m) · Σ_{p=1}^{m} [ y_p·log(ŷ_p) + (1 - y_p)·log(1 - ŷ_p) ];
wherein m represents the first sample number, y_p represents the matching label of the p-th sample, and ŷ_p represents the predicted similarity of the p-th sample;
the second partial loss function L_2 is:
L_2 = -(1/nums) · Σ_{q=1}^{nums} Σ_{s=1}^{9} [ y_{q,s}^{img}·log(p_{q,s}^{img}) + y_{q,s}^{txt}·log(p_{q,s}^{txt}) ];
wherein nums represents the second sample number, 9 represents the 9 pest classes, s represents the s-th pest class, y_{q,s}^{img} represents the tag class of the q-th picture sample, y_{q,s}^{txt} represents the class tag of the q-th text sample, p_{q,s}^{img} represents the probability of the picture sample predicted by the softmax function, and p_{q,s}^{txt} represents the probability of the text sample predicted by the softmax function.
8. An agricultural pest image-text retrieval system, characterized in that the system is used for implementing the agricultural pest image-text retrieval method according to any one of claims 1-7; the system comprises:
the data acquisition module is used for acquiring data to be retrieved; the data to be retrieved is image data of agricultural diseases and insect pests or text data of the agricultural diseases and insect pests;
the feature extraction module is used for extracting the features of the data to be searched by adopting a feature extraction model to obtain a feature sequence of the data to be searched;
the network construction module is used for constructing a cross-modal feature fusion network;
and the image-text retrieval module is used for determining the matching information of the data feature sequence to be retrieved in the data retrieval library by adopting the cross-modal feature fusion network, and taking the matching information as a cross-modal retrieval result.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor, connected to the memory, for retrieving and executing the computer program to implement the agricultural pest image-text retrieval method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311475104.3A CN117453859A (en) | 2023-11-07 | 2023-11-07 | Agricultural pest and disease damage image-text retrieval method, system and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117453859A true CN117453859A (en) | 2024-01-26 |
Family
ID=89581519
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117951330A (en) * | 2024-03-27 | 2024-04-30 | 吉林大学 | Medical data retrieval method based on artificial intelligence |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110866140B (en) | Image feature extraction model training method, image searching method and computer equipment | |
CN108829822B (en) | Media content recommendation method and device, storage medium and electronic device | |
CN113297975A (en) | Method and device for identifying table structure, storage medium and electronic equipment | |
CN111209384A (en) | Question and answer data processing method and device based on artificial intelligence and electronic equipment | |
CN111259851B (en) | Multi-mode event detection method and device | |
CN114332679A (en) | Video processing method, device, equipment, storage medium and computer program product | |
CN111105013A (en) | Optimization method of countermeasure network architecture, image description generation method and system | |
CN113220832A (en) | Text processing method and device | |
CN117453859A (en) | Agricultural pest and disease damage image-text retrieval method, system and electronic equipment | |
CN114419351A (en) | Image-text pre-training model training method and device and image-text prediction model training method and device | |
CN110852071B (en) | Knowledge point detection method, device, equipment and readable storage medium | |
CN114239730A (en) | Cross-modal retrieval method based on neighbor sorting relation | |
CN114282528A (en) | Keyword extraction method, device, equipment and storage medium | |
CN113537206B (en) | Push data detection method, push data detection device, computer equipment and storage medium | |
CN116578738B (en) | Graph-text retrieval method and device based on graph attention and generating countermeasure network | |
CN113627550A (en) | Image-text emotion analysis method based on multi-mode fusion | |
CN117079310A (en) | Pedestrian re-identification method based on image-text multi-mode fusion | |
CN117009570A (en) | Image-text retrieval method and device based on position information and confidence perception | |
CN116955591A (en) | Recommendation language generation method, related device and medium for content recommendation | |
CN110851629A (en) | Image retrieval method | |
CN117216617A (en) | Text classification model training method, device, computer equipment and storage medium | |
CN116186312A (en) | Multi-mode data enhancement method for data sensitive information discovery model | |
CN116955707A (en) | Content tag determination method, device, equipment, medium and program product | |
CN112800191B (en) | Question and answer method and device based on picture and computer readable storage medium | |
CN114443916A (en) | Supply and demand matching method and system for test data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||