CN116861361B - Dam deformation evaluation method based on image-text multi-mode fusion - Google Patents

Dam deformation evaluation method based on image-text multi-mode fusion

Info

Publication number
CN116861361B
Authority
CN
China
Prior art keywords
image
feature
text
images
dam
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310768316.4A
Other languages
Chinese (zh)
Other versions
CN116861361A (en)
Inventor
王龙宝
张津豪
储洪强
毛莺池
张雪洁
徐淑芳
徐荟华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202310768316.4A priority Critical patent/CN116861361B/en
Publication of CN116861361A publication Critical patent/CN116861361A/en
Application granted granted Critical
Publication of CN116861361B publication Critical patent/CN116861361B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/08Construction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/52Scale-space analysis, e.g. wavelet analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A10/00TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
    • Y02A10/40Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Development Economics (AREA)

Abstract

The invention discloses a dam deformation evaluation method based on image-text multi-mode fusion, which comprises the following steps: acquiring a previous image and a current image; obtaining a difference image; performing multi-scale feature extraction and fusion on the previous image and the difference image to obtain an original image; preprocessing the original image and the dam deformation discrimination text; inputting the preprocessed image and text features into a dual-stream cross-modal Transformer model for pre-training, jointly modeling intra-modal and cross-modal representations to obtain a pre-trained model; optimizing and adjusting the parameters of the pre-trained model; and using the trained model to predict from the test-set images and question text data to obtain a dam deformation evaluation result. By integrating knowledge of the dam scene graph into multimodal pre-training, the invention greatly improves a machine's ability to understand dam deformation scenes and allows the model to align fine-grained features across the image and text modalities more accurately, thereby improving the accuracy of answering dam deformation questions.

Description

Dam deformation evaluation method based on image-text multi-mode fusion
Technical Field
The invention belongs to the field of hydraulic dam deformation monitoring and evaluation, and particularly relates to a dam deformation evaluation method based on image-text multi-mode fusion.
Background
More than 100,000 dams have been built in China to date, making it one of the countries with the most reservoir dams in the world. With the further development and utilization of water resources, more and more high dams and large reservoirs are being built, and these projects play a major role in agricultural irrigation, flood control and drought relief, water resource allocation, hydroelectric power generation, urban water supply, soil and water conservation, ecological and environmental protection, and so on. However, some dams built in the 1960s and 1970s were constrained by the economic conditions and scientific and technical level of the time and suffer from safety problems such as low design standards, geological defects, poor construction quality and aging, which affect the comprehensive benefits of the reservoirs and can even threaten downstream towns, traffic, and people's lives and property. Dam safety has therefore become an increasingly prominent public safety issue that must be taken very seriously.
The main items of dam safety monitoring are: deformation, seepage, pressure, stress-strain, hydraulics, environmental quantities, and so on. Deformation monitoring is the most intuitive and reliable of these and can essentially reflect the safety state of the dam under the action of various loads, so it is the most important monitoring item. Deformation monitoring mainly covers surface deformation, internal deformation, dam foundation deformation, cracks and joints, concrete face-slab deformation, bank slope displacement, and so on. Dam surface deformation monitoring consists mainly of the observation of vertical displacement and the observation of horizontal displacement. Observation of horizontal displacement means measuring the horizontal displacement of representative points of hydraulic structures and their foundations with observation instruments and equipment; the monitoring methods include the sight line method, tension wire method, laser collimation method, plumb line method, intersection method, traverse method, and so on.
Traditional engineering monitoring methods often consume considerable manpower and material resources and cannot observe horizontal displacement automatically. With the rapid development of multimodal feature extraction methods for images, natural language and other data, domain knowledge can be made to interact with the corresponding domain image information, ultimately enabling cross-modal learning and prediction. At present, taking the two modalities of visual images and text as research objects, remarkable progress has been made in directions such as visual question answering and image-text matching. Therefore, taking a long-time-span dam image set of the same area together with text knowledge for judging dam deformation as the research object, and taking the observation of horizontal displacement deformation of the dam surface as the research goal, a dam deformation visual question-answering evaluation method based on image-text multi-mode fusion is of great practical significance.
Disclosure of Invention
Purpose of the invention: in order to overcome the defects in the prior art, a dam deformation evaluation method based on image-text multi-mode fusion is provided.
The technical scheme is as follows: in order to achieve the above purpose, the invention provides a dam deformation evaluation method based on image-text multi-mode fusion, which comprises the following steps:
S1: acquiring a dam image set through a fixed-point industrial monitoring camera, and obtaining a previous image and a current image respectively;
S2: obtaining a difference image from the previous image and the current image;
S3: performing multi-scale feature extraction and fusion on the previous image and the difference image respectively with a feature pyramid (FPN) network, and taking the resulting current feature image as the original image;
S4: preprocessing the original image and the dam deformation discrimination text;
S5: inputting the preprocessed image and text features into a dual-stream cross-modal Transformer model for pre-training, jointly modeling intra-modal and cross-modal representations to obtain a pre-trained model;
S6: optimizing and adjusting the parameters of the pre-trained model with a training set of previous and current dam images and a training set of question texts about dam deformation risk, to complete training;
S7: using the model trained in step S6 to predict from the test-set images and question text data, obtaining a dam deformation evaluation result.
Further, in step S2, true color feature enhancement and feature differencing are performed on the previous image and the current image, and the resulting current feature image is used as the difference image; the specific process comprises the following steps:
A1: performing true color feature enhancement with a PCA-based color feature enhancement method, which significantly enhances the brightness of the image without changing the dominant colors of objects or the color-difference contrast of the image;
A2: calculating the feature difference between the true-color-enhanced previous image and current image; the feature matrix of the previous image is src_init and the feature matrix of the current image is src_final, and the feature difference d_src is then expressed as:
Further, the specific process of true color feature enhancement in step A1 is as follows:
B1: standardizing the previous image P_init and the current image P_final separately over the three RGB channels to zero mean and unit variance, which preserves the relative relationship between the RGB channels without changing the pixel-value distribution within each channel;
B2: flattening the images P_init and P_final channel-wise into N×3 vectors, denoted I(θ), θ ∈ D;
B3: computing the covariance matrix of the vectors I(θ);
B4: performing eigendecomposition of the covariance matrix to obtain the eigenvectors F(θ) and eigenvalues λ(θ);
B5: adding the processed eigenvectors to the three channel feature vectors of images P_init and P_final respectively to obtain the feature-enhanced images. Taking one channel of image P_init as an example, the formula is as follows, where α is the added dither coefficient:
P_result(θ) = P_init(θ) + F(θ)_i · (α_i · λ(θ)_i)^T,  θ, i ∈ D
Further, step S3 is specifically as follows:
D1: extracting features from the previous image and the difference image with backbone ResNet networks of the same structure; the final output features of stages C2, C3, C4 and C5 are passed through a 1×1 convolution with stride 1 so that the number of channels becomes 256, and the results are denoted F2, F3, F4 and F5;
D2: (lateral operation) the F5 feature is passed through a 3×3 convolution with stride 1 to output the P5 image feature; the F5 feature is also upsampled (top-down operation) so that the length and width of the feature image are doubled, making it consistent in shape with the F4 feature, with which it is fused; a 3×3 convolution with stride 1 is then applied to output the P4 image feature; and so on until the P2 image feature is output;
D3: the output features of the previous image and the difference image after FPN processing are denoted F′_θ and F″_θ, where θ denotes the level index (θ = 4 levels in total); the features of corresponding levels are fused and taken as the features of the original image, according to the following formula, where ⊕ denotes feature concatenation:
Further, the specific operations of step D1 are as follows:
D1-1: stage C1 uses a 7×7 convolution with stride 2 followed by a 3×3 max-pooling operation with stride 2, with 64 channels;
D1-2: the connections between stages C2 and C5 are divided into two branches, a main branch and a shortcut branch; the main branch uses 1×1, 3×3 and 1×1 convolutions with strides 1, 2 and 1, which together are called a residual block; stages C2 to C5 use 3, 4, 6 and 3 residual blocks respectively, with 256, 512, 1024 and 2048 channels, so that the length and width of the feature image are halved at each stage; the shortcut branch uses a 1×1 convolution with stride 2 so that the shape of its feature matrix is the same as that of the main branch.
Further, the preprocessing operation in step S4 is as follows: the RPN module of the Faster R-CNN network is used to select salient image regions and extract region features, and each region retained after screening uses an average-pooled representation as its region feature.
Further, the preprocessing operation in step S4 specifically comprises the following steps:
E1: generating candidate boxes for the original-image features of each scale through the RPN structure;
E2: projecting the candidate boxes generated by the RPN onto the feature maps to obtain the corresponding feature matrices, scaling each feature matrix to a 7×7 feature map through an ROI Pooling layer, and flattening the feature map through a series of fully connected layers to obtain the salient image regions.
Further, the operation of step E1 is specifically as follows:
E1-1: the RPN structure uses a 3×3 convolution with stride 1 as a sliding window over the original-image features of each scale; the centre point of each sliding window (each point to be detected) is mapped to the corresponding centre point on the original image; after sliding, the mapping between the feature image and the original image is:
s_width = w_origin / w_feature
s_height = h_origin / h_feature
where w_feature and h_feature are the width and height of the feature image, w_origin and h_origin are the width and height of the original image, and s_width and s_height are the scaling factors relating the original image to the feature image; the coordinate on the original image is obtained by multiplying the corresponding coordinate of a point on the feature image by the scaling factor in that direction;
E1-2: after the centre point on the original image corresponding to each point of the feature image of each scale has been computed, 9 anchor boxes (three areas {128², 256², 512²} × three aspect ratios {1:1, 1:2, 2:1}) are generated at each such centre point on the original image; the width and height of the generated anchor boxes are then computed from the following formula, where area is the area of the generated anchor box, ratio is its aspect ratio, h is its height and w is its width;
E1-3: the feature image with 256 channels is passed through 18 1×1 convolutions to obtain a feature image with 18 channels, which is then passed through a Softmax layer to compute classification scores; if the score is greater than 0.5, the anchor box on the original image corresponding to that point of the feature image is foreground (positive), otherwise background (negative); the formula is as follows, where j is the number of samples:
E1-4: the scale feature images with 256 channels are subjected to 36 convolution operations of 1×1 to generate 4 coordinate offsets [ t x,ty,tw,th ] of each anchor frame, and the coordinate offsets are used for correcting the anchor frames, and the offset calculation formula is as follows:
tx=(x-xa)/wa ty=(y-ya)/ha
tw=log(w/wa) th=log(h/ha)
where [ x a,ya,wa,ha ] is the center point coordinate and width and height of the anchor frame, [ t x,ty,tw,th ] is the predicted offset, then the corrected anchor frame coordinates [ x, y, w, h ] are calculated by the following formula:
Wherein [ p x,py,pw,ph ] represents the coordinates of the original anchor frame, [ d x,dy,dw,dh ] represents the coordinate offset predicted by the RPN network, [ g x,gy,gw,gh ] represents the coordinates of the modified anchor frame;
E1-5: and correcting all original anchor frames by using the offset generated by E1-4, arranging the positive anchor frames from large to small according to the classification probability generated by E1-4, taking the first 6000 anchor frames, adopting non-maximum suppression, setting IoU to be 0.7, only leaving 2000 candidate frames for each picture, and finally outputting coordinates corresponding to the upper left corner and the lower right corner of the anchor frames of the original picture, wherein the anchor frames at the moment are called as candidate frames.
Further, the operation of step E2 is specifically as follows:
E2-1: each candidate box is mapped back onto the original-image features of the corresponding scale; the feature map corresponding to each candidate box is divided into a 7×7 grid, and max pooling is applied to each cell of the grid, i.e. a 7×7 feature map is obtained for each candidate box projected onto the original-image features; the mapping of a candidate box to the feature map of the corresponding scale is given by the following formula, where k is the level of the feature map used for the mapping, k_0 is the number of feature-map scales (here 4), w and h are the width and height of a single candidate box (mapped back to the original image), and area_origin is the size (area) of the input picture;
E2-2: finally, classification and regression of the candidate boxes are completed: all candidate boxes are classified into specific categories through fully connected layers and Softmax, an operation similar to E1-3; regression prediction is then performed again on the candidate boxes to obtain more accurate final predicted boxes, consistent with the operation in E1-4.
Further, step S5 is specifically as follows:
G1: input representation of region images: the region features retained after the S4 processing are position-encoded with a 5-dimensional vector whose elements are the normalized coordinates of the upper-left and lower-right corners of the region and the fraction of the image covered by the region; the position encoding is mapped to match the visual feature dimension and added to the visual features to obtain the image region features; finally, a special image token marks the beginning and end of the image sequence, and its output is used to represent the whole image;
G2: input representation of text: the dam deformation discrimination text preprocessed in S4 is input into a BERT model to obtain the corresponding text embeddings;
G3: joint characterization of region images and text: the image and text features obtained from G1 and G2 exchange information through 6 co-attention Transformer layers; that is, given an image I represented as a set of region features v_0, …, v_T and a text input w_0, …, w_T, the final outputs are h_v0, …, h_vT and h_w0, …, h_wT.
Beneficial effects: compared with the prior art, the invention takes a long-time-span dam image set of the same area together with text knowledge for judging dam deformation as the research object, takes the observation of horizontal displacement deformation of the dam surface as the research goal, and provides a dam deformation visual question-answering evaluation method based on image-text multi-mode fusion, which has the following advantages:
1. Compared with existing engineering monitoring methods, it overcomes the drawbacks of manual operation, saves manpower and material resources, and achieves a better evaluation effect.
2. Through the two feature pyramid networks, the features of the previous image and the difference image, which differ considerably in scale, can be extracted more fully with essentially no increase in the computation of the original model, which greatly improves dam deformation detection performance on the difference image.
3. Knowledge of the dam scene graph is integrated into the multimodal pre-training, which greatly improves a machine's ability to understand dam deformation scenes and allows the model to align fine-grained features across the image and text modalities more accurately, thereby improving the accuracy of answering dam deformation questions.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic illustration of a process according to the present invention.
FIG. 3 is a schematic representation of feature pyramid multi-scale feature extraction in accordance with the method of the present invention.
Fig. 4 is a schematic diagram of the original image feature fusion of the method of the present invention.
FIG. 5 is a schematic diagram of an original image of a candidate frame mapped to a corresponding scale according to the method of the present invention.
FIG. 6 is a schematic diagram of scene graph parsing in the method of the present invention.
FIG. 7 is a schematic diagram of a multi-modal pretraining scheme of the method of the present invention.
FIG. 8 is a schematic diagram of a dam deformation evaluation visual question-answering model of the method of the present invention.
Detailed Description
The present application is further illustrated by the accompanying drawings and the following detailed description, which are to be understood as merely illustrative of the application and not limiting its scope. After reading the application, various equivalent modifications made by those skilled in the art fall within the scope of the application as defined by the appended claims.
The invention provides a dam deformation evaluation method based on image-text multi-mode fusion, which is shown in FIG. 1 and FIG. 2 and comprises the following steps:
S1: acquiring a dam image set of the same area with a time interval of 3 years using a fixed-point industrial monitoring camera; the image farther from the current time is taken as the previous image and the image closer to the current time as the current image;
S2: for the two acquired remote sensing images, i.e. the previous image and the current image, true color feature enhancement is performed first, then the feature difference of the two images is taken, and the resulting current feature image is called the difference image;
the specific process comprises the following steps:
A1: performing true color feature enhancement with a PCA-based color feature enhancement method, which significantly enhances the brightness of the image without changing the dominant colors of objects or the color-difference contrast of the image;
A2: calculating the feature difference between the true-color-enhanced previous image and current image; the feature matrix of the previous image is src_init and the feature matrix of the current image is src_final, and the feature difference d_src is then expressed as:
The specific process of true color feature enhancement in step A1 is as follows:
B1: standardizing the previous image P_init and the current image P_final separately over the three RGB channels to zero mean and unit variance, which preserves the relative relationship between the RGB channels without changing the pixel-value distribution within each channel;
B2: flattening the images P_init and P_final channel-wise into N×3 vectors, denoted I(θ), θ ∈ D;
B3: computing the covariance matrix of the vectors I(θ);
B4: performing eigendecomposition of the covariance matrix to obtain the eigenvectors F(θ) and eigenvalues λ(θ);
B5: adding the processed eigenvectors to the three channel feature vectors of images P_init and P_final respectively to obtain the feature-enhanced images. Taking one channel of image P_init as an example, the formula is as follows, where α is the added dither coefficient:
P_result(θ) = P_init(θ) + F(θ)_i · (α_i · λ(θ)_i)^T,  θ, i ∈ D
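As an illustration of steps B1-B5 and the differencing in A2, the following is a minimal NumPy sketch; the scale of the dither coefficient α and the subtraction order src_final − src_init are assumptions, since the exact formulas are not reproduced above.

```python
import numpy as np

def pca_color_enhance(img, alpha_std=0.1, rng=None):
    """PCA-based true-color enhancement (steps B1-B5).
    img: H x W x 3 RGB array; alpha_std is the assumed scale of the dither coefficient."""
    rng = np.random.default_rng() if rng is None else rng
    x = img.astype(np.float64)
    # B1: standardize each RGB channel to zero mean, unit variance
    mean = x.reshape(-1, 3).mean(axis=0)
    std = x.reshape(-1, 3).std(axis=0) + 1e-8
    flat = ((x - mean) / std).reshape(-1, 3)       # B2: N x 3 pixel matrix
    cov = np.cov(flat, rowvar=False)               # B3: 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # B4: eigenvalues / eigenvectors
    # B5: add F(theta) . (alpha * lambda) to every pixel of every channel
    alpha = rng.normal(0.0, alpha_std, size=3)
    offset = eigvecs @ (alpha * eigvals)
    return x + offset                              # broadcast over H x W

def feature_difference(prev_img, curr_img):
    """A2: difference image, assumed to be the element-wise src_final - src_init."""
    return pca_color_enhance(curr_img) - pca_color_enhance(prev_img)
```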
S3: performing multi-scale feature extraction and fusion on the previous image and the difference image respectively with a feature pyramid (FPN) network, and taking the resulting current feature image as the original image;
Referring to FIG. 3 and FIG. 4, the specific steps of multi-scale feature extraction and feature fusion are as follows:
D1: extracting features from the previous image and the difference image with backbone ResNet networks of the same structure; the final output features of stages C2, C3, C4 and C5 are passed through a 1×1 convolution with stride 1 so that the number of channels becomes 256, and the results are denoted F2, F3, F4 and F5;
The backbone ResNet network performs feature extraction in the following 5 stages:
D1-1: stage C1 uses a 7×7 convolution with stride 2 followed by a 3×3 max-pooling operation with stride 2, with 64 channels;
D1-2: the connections between stages C2 and C5 are divided into two branches, a main branch and a shortcut branch; the main branch uses 1×1, 3×3 and 1×1 convolutions with strides 1, 2 and 1, which together are called a residual block; stages C2 to C5 use 3, 4, 6 and 3 residual blocks respectively, with 256, 512, 1024 and 2048 channels, so that the length and width of the feature image are halved at each stage; the shortcut branch uses a 1×1 convolution with stride 2 so that the shape of its feature matrix is the same as that of the main branch;
A residual structure can be expressed as:
x_{l+1} = x_l + F(x_l, W_l)
where F(x_l, W_l) is the output of the main branch of the l-th unit and x_l is the output of the shortcut branch of the l-th unit;
D2: (lateral operation) the F5 feature is passed through a 3×3 convolution with stride 1 to output the P5 image feature; the F5 feature is also upsampled (top-down operation) so that the length and width of the feature image are doubled, making it consistent in shape with the F4 feature, with which it is fused; a 3×3 convolution with stride 1 is then applied to output the P4 image feature; and so on until the P2 image feature is output;
The upsampling of the features at each level is specifically as follows:
nearest-neighbour interpolation is used for upsampling; dst_x and dst_y denote the horizontal and vertical coordinates of a pixel of the upsampled target image, dst_width and dst_height the width and height of the target image, src_width and src_height the width and height of the original (source) image, and src_x and src_y the original-image coordinates corresponding to the target-image point (dst_x, dst_y):
src_x = dst_x · (src_width / dst_width)
src_y = dst_y · (src_height / dst_height)
D3: the output features of the previous image and the difference image after FPN processing are denoted F′_θ and F″_θ, where θ denotes the level index (θ = 4 levels in total); the features of corresponding levels are fused and taken as the features of the original image, according to the following formula, where ⊕ denotes feature concatenation:
S4: preprocessing the dam deformation discrimination text of the original image subjected to expert demonstration examination:
Preprocessing an original image, namely selecting a significant image area and extracting area characteristics by adopting an RPN module of a Faster R-CNN network, screening each reserved area, and using an average pooling representation as the area characteristics;
Preprocessing the dam deformation discrimination text subjected to expert demonstration examination, and referring to FIG. 6, the method is characterized in that a scene graph is resolved from sentences through a scene graph resolver, the discrimination text is marked in WordPieces mode, and then 15% of word segmentation and 30% of scene graph nodes are randomly covered;
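A small sketch of the text-side preprocessing (WordPiece tokenization plus random masking of 15% of tokens and 30% of scene-graph nodes); the HuggingFace tokenizer and the bert-base-chinese checkpoint are illustrative assumptions, and the scene graph parser itself is not shown.

```python
import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint

def mask_discrimination_text(text, scene_graph_nodes, p_token=0.15, p_node=0.30):
    """Tokenize the dam-deformation discrimination text into WordPieces and
    randomly mask 15% of the tokens and 30% of the scene-graph nodes."""
    tokens = tokenizer.tokenize(text)
    masked_tokens = [tokenizer.mask_token if random.random() < p_token else t
                     for t in tokens]
    masked_nodes = ["[MASK]" if random.random() < p_node else n
                    for n in scene_graph_nodes]
    return masked_tokens, masked_nodes
```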
The preprocessing operation specifically comprises the following steps:
E1: generating candidate boxes for the original-image features of each scale through the RPN structure:
E1-1: the RPN structure uses a 3×3 convolution with stride 1 as a sliding window over the original-image features of each scale; the centre point of each sliding window (each point to be detected) is mapped to the corresponding centre point on the original image; after sliding, the mapping between the feature image and the original image is:
s_width = w_origin / w_feature
s_height = h_origin / h_feature
where w_feature and h_feature are the width and height of the feature image, w_origin and h_origin are the width and height of the original image, and s_width and s_height are the scaling factors relating the original image to the feature image; the coordinate on the original image is obtained by multiplying the corresponding coordinate of a point on the feature image by the scaling factor in that direction;
E1-2: after the centre point on the original image corresponding to each point of the feature image of each scale has been computed, 9 anchor boxes (three areas {128², 256², 512²} × three aspect ratios {1:1, 1:2, 2:1}) are generated at each such centre point on the original image; the width and height of the generated anchor boxes are then computed from the following formula, where area is the area of the generated anchor box, ratio is its aspect ratio, h is its height and w is its width;
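A sketch of the anchor generation in E1-2; since the width/height formula is not reproduced above, the standard parameterisation w = sqrt(area / ratio), h = w · ratio (with ratio = h / w) is assumed.

```python
import numpy as np

def generate_anchors(center_x, center_y,
                     areas=(128 ** 2, 256 ** 2, 512 ** 2),
                     ratios=(1.0, 0.5, 2.0)):
    """9 anchor boxes (3 areas x 3 aspect ratios) centred at one point of the
    original image (step E1-2). ratio is assumed to mean h / w."""
    anchors = []
    for area in areas:
        for ratio in ratios:
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2])
    return np.array(anchors)  # (9, 4) as [x1, y1, x2, y2]
```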
E1-3: the feature image with 256 channels is passed through 18 1×1 convolutions to obtain a feature image with 18 channels, which is then passed through a Softmax layer to compute classification scores; if the score is greater than 0.5, the anchor box on the original image corresponding to that point of the feature image is foreground (positive), otherwise background (negative); the formula is as follows, where j is the number of samples:
E1-4: the feature image with 256 channels is passed through 36 1×1 convolutions to generate the 4 coordinate offsets [t_x, t_y, t_w, t_h] of each anchor box, which are used to correct the anchor boxes; the offsets are computed as:
t_x = (x − x_a) / w_a,  t_y = (y − y_a) / h_a
t_w = log(w / w_a),  t_h = log(h / h_a)
where [x_a, y_a, w_a, h_a] are the centre coordinates, width and height of the anchor box and [t_x, t_y, t_w, t_h] are the predicted offsets; the corrected anchor box coordinates [x, y, w, h] are then calculated by the following formula, where [p_x, p_y, p_w, p_h] are the coordinates of the original anchor box, [d_x, d_y, d_w, d_h] are the coordinate offsets predicted by the RPN network, and [g_x, g_y, g_w, g_h] are the coordinates of the corrected anchor box;
E1-5: all original anchor boxes are corrected with the offsets generated in E1-4; the positive anchor boxes are sorted in descending order of the classification scores generated in E1-3 and the first 6000 are taken; non-maximum suppression with an IoU threshold of 0.7 is then applied so that only 2000 candidate boxes remain for each picture; finally the coordinates of the upper-left and lower-right corners of these anchor boxes on the original picture are output, and the anchor boxes at this point are called candidate boxes.
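A sketch of E1-4/E1-5: decoding anchors with the predicted offsets and filtering proposals with non-maximum suppression; the decoding equations are the standard Faster R-CNN ones and are stated here as an assumption, since the patent's correction formula is not reproduced above.

```python
import torch
from torchvision.ops import nms

def decode_anchors(anchors, deltas):
    """Apply predicted offsets [dx, dy, dw, dh] to anchors [xa, ya, wa, ha]
    (step E1-4), using the standard Faster R-CNN parameterisation (assumed)."""
    xa, ya, wa, ha = anchors.unbind(dim=1)
    dx, dy, dw, dh = deltas.unbind(dim=1)
    gx = wa * dx + xa          # corrected centre x
    gy = ha * dy + ya          # corrected centre y
    gw = wa * torch.exp(dw)    # corrected width
    gh = ha * torch.exp(dh)    # corrected height
    # convert to corner form [x1, y1, x2, y2] for NMS
    return torch.stack([gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2], dim=1)

def select_proposals(boxes, scores, pre_nms_topk=6000, post_nms_topk=2000, iou_thresh=0.7):
    """Step E1-5: sort positive anchors by score, keep the top 6000,
    apply NMS with IoU 0.7 and keep at most 2000 candidate boxes."""
    order = scores.argsort(descending=True)[:pre_nms_topk]
    boxes, scores = boxes[order], scores[order]
    keep = nms(boxes, scores, iou_thresh)[:post_nms_topk]
    return boxes[keep]
```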
E2: projecting the candidate frames generated by the RPN onto the feature map to obtain corresponding feature matrixes, scaling each feature matrix to the feature map with the size of 7 multiplied by 7 through ROI Pooling layers, flattening the feature map through a series of full-connection layers to obtain a salient image area:
E2-1: referring to fig. 5, the candidate frames are mapped back to the original image of the corresponding scale, the feature map corresponding to each candidate frame is divided into 7×7 grids, and the maximum pooling operation is performed on each part of the grids, that is, the feature map with the corresponding size of 7×7 is obtained by projecting the feature map onto the original image, and the formula is as follows:
Where k is the number of layers of the feature map used for mapping, k 0 is the number of scales of the feature map (here, 4), w and h are the width and height of a single candidate frame (mapped to the original image), area origin is the input picture size (area of the candidate frame);
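A sketch of E2-1: choosing the feature-map scale for a candidate box and max-pooling its region into a 7×7 grid; because the exact mapping formula is not reproduced above, the standard FPN assignment floor(k0 + log2(sqrt(w·h) / 224)) is used here as a stand-in.

```python
import math
import torch
from torchvision.ops import roi_pool

def assign_fpn_level(w, h, k0=4, canonical=224.0):
    """Choose which of the 4 feature-map scales a candidate box is pooled from.
    The standard FPN rule is used as a stand-in for the patent's mapping."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return min(max(k, 2), 5)  # clamp to the available levels P2..P5

def pool_candidate(feature_map, box_xyxy, spatial_scale):
    """Step E2-1: project one candidate box onto its feature map and
    max-pool the region into a fixed 7x7 grid."""
    # rois layout: [batch_index, x1, y1, x2, y2]
    rois = torch.cat([torch.zeros(1, 1), box_xyxy.view(1, 4)], dim=1)
    return roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=spatial_scale)
```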
E2-2: finally, classification and regression of the candidate boxes are completed: all candidate boxes are classified into specific categories through fully connected layers and Softmax, an operation similar to E1-3; regression prediction is then performed again on the candidate boxes to obtain more accurate final predicted boxes, consistent with the operation in E1-4.
S5: inputting the preprocessed image and text features into a dual-stream cross-modal Transformer model for pre-training, jointly modeling intra-modal and cross-modal representations to obtain a pre-trained model;
Referring to FIG. 7, the specific operation steps are as follows:
G1: input representation of region images: the region features retained after the S4 processing are position-encoded with a 5-dimensional vector whose elements are the normalized coordinates of the upper-left and lower-right corners of the region and the fraction of the image covered by the region; the position encoding is mapped to match the visual feature dimension and added to the visual features to obtain the image region features; finally, a special image token marks the beginning and end of the image sequence, and its output is used to represent the whole image;
The encoding of the region feature positions with a 5-dimensional vector is specifically as follows:
W and H denote the length and width of the region feature, respectively; the upper-left corner of the image region is [x_1, y_1] and the lower-right corner is [x_2, y_2]; the region is then position-encoded as the 5-dimensional vector v = [x, y, w, h, s].
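A sketch of the 5-dimensional region position encoding in G1; the feature dimension of 2048 (typical of Faster R-CNN region features) and the linear projection used to match the visual feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class RegionPositionEncoding(nn.Module):
    """G1: encode each region as [x1/W, y1/H, x2/W, y2/H, area_ratio],
    project it to the visual feature dimension and add it to the region feature."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.proj = nn.Linear(5, feat_dim)

    def forward(self, region_feats, boxes, img_w, img_h):
        # boxes: (N, 4) as [x1, y1, x2, y2] in image coordinates
        x1, y1, x2, y2 = boxes.unbind(dim=1)
        area_ratio = (x2 - x1) * (y2 - y1) / (img_w * img_h)   # image coverage ratio
        pos = torch.stack([x1 / img_w, y1 / img_h,
                           x2 / img_w, y2 / img_h, area_ratio], dim=1)
        return region_feats + self.proj(pos)
```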
G2: input representation of text: the dam deformation discrimination text preprocessed in S4 is input into a BERT model to obtain the corresponding text embeddings;
G3: joint characterization of region images and text: the image and text features obtained from G1 and G2 exchange information through 6 co-attention Transformer layers; that is, given an image I represented as a set of region features v_0, …, v_T and a text input w_0, …, w_T, the final outputs are h_v0, …, h_vT and h_w0, …, h_wT;
The 6 co-attention Transformer layers are consistent with the Transformer encoder structure, but the sources of Q, K and V after the linear transformations differ; the co-attention mechanism can be expressed as:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
For the image stream, Q is derived from the region features v_0, …, v_T and K, V are derived from the text input w_0, …, w_T; for the text stream, Q is derived from the text input w_0, …, w_T and K, V are derived from the region features v_0, …, v_T;
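A sketch of one of the 6 co-attention Transformer layers in G3, where each stream takes Q from its own modality and K, V from the other; the hidden size of 768, the 12 heads and the residual/LayerNorm arrangement are assumptions.

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One co-attention Transformer layer (G3): each stream queries the other
    modality's keys and values, then applies its own feed-forward network."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.txt_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, v, w):
        v_in, w_in = v, w
        # image stream: Q from region features, K/V from the text stream
        v = self.norm[0](v_in + self.img_attn(v_in, w_in, w_in)[0])
        # text stream: Q from text features, K/V from the region stream
        w = self.norm[1](w_in + self.txt_attn(w_in, v_in, v_in)[0])
        v = self.norm[2](v + self.img_ffn(v))
        w = self.norm[3](w + self.txt_ffn(w))
        return v, w
```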
The two pre-training tasks are, respectively, predicting the masked text tokens from the unmasked text tokens and the region features (MLM task), and predicting whether the text features and the region features match (ITM task); the loss functions of the MLM and ITM tasks can be expressed as:
L_MLM = −E_(W,V)∈D log P_θ(w_m | w_\m, V)
where w_m and w_\m denote the masked and unmasked text tokens, respectively, and (W, V) ∈ D denotes a sample pair of text W and region image V from the dam deformation dataset;
L_ITM = −E_(W,V)∈D [ y·log s_θ(w_[CLS], v_[IMG]) + (1 − y)·log(1 − s_θ(w_[CLS], v_[IMG])) ]
where the scoring function s_θ measures the matching probability between the region image and the text, y ∈ {0, 1} indicates whether the text W matches the region image V, and w_[CLS] and v_[IMG] represent the text W and the region image V, respectively.
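A sketch of the two pre-training losses L_MLM and L_ITM written with standard cross-entropy terms; the shapes of the prediction heads are assumptions.

```python
import torch
import torch.nn.functional as F

def mlm_loss(token_logits, token_targets):
    """MLM task: predict the masked WordPiece tokens.
    token_logits: (N_masked, vocab_size); token_targets: (N_masked,) token ids."""
    return F.cross_entropy(token_logits, token_targets)

def itm_loss(match_prob, y):
    """ITM task: s_theta(w_[CLS], v_[IMG]) gives the matching probability between
    the discrimination text and the region image; y holds labels in {0, 1}."""
    return F.binary_cross_entropy(match_prob, y.float())

# Pre-training objective (assumed equal weighting of the two tasks):
# loss = mlm_loss(token_logits, token_targets) + itm_loss(match_prob, y)
```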
S6: optimizing and adjusting the parameters of the pre-trained model with a training set of previous and current dam images and a training set of question texts about dam deformation risk, to complete training;
The visual question-answering training task is a multi-class classification task, so its loss function can be expressed by the following formula, where N is the number of the most frequent answer labels in the training set, y_v ∈ {0, 1} is the label indicator for the predicted result, and p_v is the probability that the predicted classification result is the v-th class.
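A sketch of the fine-tuning loss in S6; since the formula is not reproduced above, a binary cross-entropy over the N most frequent answer labels is assumed here, consistent with y_v ∈ {0, 1} and p_v.

```python
import torch

def vqa_answer_loss(pred_probs, labels):
    """S6 fine-tuning loss (assumed form): pred_probs (B, N) are the predicted
    probabilities p_v over the N most frequent training-set answers, and
    labels (B, N) hold the indicators y_v in {0, 1}."""
    eps = 1e-8
    return -(labels * torch.log(pred_probs + eps)
             + (1 - labels) * torch.log(1 - pred_probs + eps)).mean()
```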
S7: referring to FIG. 8, the model trained in step S6 is used to predict from the test-set images and question text data, and the dam deformation evaluation result is obtained for professionals to consult and for early warning.

Claims (6)

1. A dam deformation evaluation method based on image-text multi-mode fusion, characterized by comprising the following steps:
S1: acquiring a dam image set through a fixed-point industrial monitoring camera, and obtaining a previous image and a current image respectively;
S2: obtaining a difference image from the previous image and the current image;
S3: performing multi-scale feature extraction and fusion on the previous image and the difference image respectively with a feature pyramid (FPN) network, and taking the resulting current feature image as the original image;
S4: preprocessing the original image and the dam deformation discrimination text;
S5: inputting the preprocessed image and text features into a dual-stream cross-modal Transformer model for pre-training, jointly modeling intra-modal and cross-modal representations to obtain a pre-trained model;
S6: optimizing and adjusting the parameters of the pre-trained model with a training set of previous and current dam images and a training set of question texts about dam deformation risk, to complete training;
S7: using the model trained in step S6 to predict from the test-set images and question text data, obtaining a dam deformation evaluation result;
wherein in step S2, true color feature enhancement and feature differencing are performed on the previous image and the current image, and the resulting current feature image is used as the difference image; the specific process comprises the following steps:
A1: performing true color feature enhancement with a PCA-based color feature enhancement method;
A2: calculating the feature difference between the true-color-enhanced previous image and current image, wherein the feature matrix of the previous image is src_init and the feature matrix of the current image is src_final, and the feature difference d_src is expressed as:
The specific process of true color feature enhancement in step A1 is as follows:
B1: standardizing the previous image P_init and the current image P_final separately over the three RGB channels to zero mean and unit variance, which preserves the relative relationship between the RGB channels without changing the pixel-value distribution within each channel;
B2: flattening the images P_init and P_final channel-wise into N×3 vectors, denoted I(θ), θ ∈ D;
B3: computing the covariance matrix of the vectors I(θ);
B4: performing eigendecomposition of the covariance matrix to obtain the eigenvectors F(θ) and eigenvalues λ(θ);
B5: adding the processed eigenvectors to the three channel feature vectors of images P_init and P_final respectively to obtain the feature-enhanced images;
Step S3 is specifically as follows:
D1: extracting features from the previous image and the difference image with backbone ResNet networks of the same structure; the final output features of stages C2, C3, C4 and C5 are passed through a 1×1 convolution with stride 1 so that the number of channels becomes 256, and the results are denoted F2, F3, F4 and F5;
D2: the F5 feature is passed through a 3×3 convolution with stride 1 to output the P5 image feature; the F5 feature is also upsampled so that the length and width of the feature image are doubled, making it consistent in shape with the F4 feature, with which it is fused; a 3×3 convolution with stride 1 is then applied to output the P4 image feature; and so on until the P2 image feature is output;
D3: the output features of the previous image and the difference image after FPN processing are denoted F′_θ and F″_θ, where θ denotes the level index; the features of corresponding levels are fused and taken as the features of the original image, according to the following formula, where ⊕ denotes feature concatenation:
The specific operations of step D1 are as follows:
D1-1: stage C1 uses a 7×7 convolution with stride 2 followed by a 3×3 max-pooling operation with stride 2, with 64 channels;
D1-2: the connections between stages C2 and C5 are divided into two branches, a main branch and a shortcut branch; the main branch uses 1×1, 3×3 and 1×1 convolutions with strides 1, 2 and 1, which together are called a residual block; stages C2 to C5 use 3, 4, 6 and 3 residual blocks respectively, with 256, 512, 1024 and 2048 channels, so that the length and width of the feature image are halved at each stage; the shortcut branch uses a 1×1 convolution with stride 2 so that the shape of its feature matrix is the same as that of the main branch.
2. The dam deformation evaluation method based on image-text multi-mode fusion according to claim 1, wherein the preprocessing operation in step S4 is as follows: the RPN module of the Faster R-CNN network is used to select salient image regions and extract region features, and each region retained after screening uses an average-pooled representation as its region feature.
3. The dam deformation evaluation method based on image-text multi-mode fusion according to claim 2, wherein the preprocessing operation in step S4 specifically comprises the following steps:
E1: generating candidate boxes for the original-image features of each scale through the RPN structure;
E2: projecting the candidate boxes generated by the RPN onto the feature maps to obtain the corresponding feature matrices, scaling each feature matrix to a 7×7 feature map through an ROI Pooling layer, and flattening the feature map through a series of fully connected layers to obtain the salient image regions.
4. The dam deformation evaluation method based on image-text multi-mode fusion according to claim 3, wherein the operation of step E1 is specifically as follows:
E1-1: the RPN structure uses a 3×3 convolution with stride 1 as a sliding window over the original-image features of each scale; the centre point of each sliding window is mapped to the corresponding centre point on the original image; after sliding, the mapping between the feature image and the original image is:
s_width = w_origin / w_feature
s_height = h_origin / h_feature
where w_feature and h_feature are the width and height of the feature image, w_origin and h_origin are the width and height of the original image, and s_width and s_height are the scaling factors relating the original image to the feature image; the coordinate on the original image is obtained by multiplying the corresponding coordinate of a point on the feature image by the scaling factor in that direction;
E1-2: after the centre point on the original image corresponding to each point of the feature image of each scale has been computed, 9 anchor boxes (three areas {128², 256², 512²} × three aspect ratios {1:1, 1:2, 2:1}) are generated at each such centre point on the original image; the width and height of the generated anchor boxes are then computed from the following formula, where area is the area of the generated anchor box, ratio is its aspect ratio, h is its height and w is its width;
E1-3: the feature image with 256 channels is passed through 18 1×1 convolutions to obtain a feature image with 18 channels, which is then passed through a Softmax layer to compute classification scores; if the score is greater than 0.5, the anchor box on the original image corresponding to that point of the feature image is foreground (positive), otherwise background (negative); the formula is as follows, where j is the number of samples:
E1-4: the feature image with 256 channels is passed through 36 1×1 convolutions to generate the 4 coordinate offsets [t_x, t_y, t_w, t_h] of each anchor box, which are used to correct the anchor boxes; the offsets are computed as:
t_x = (x − x_a) / w_a,  t_y = (y − y_a) / h_a
t_w = log(w / w_a),  t_h = log(h / h_a)
where [x_a, y_a, w_a, h_a] are the centre coordinates, width and height of the anchor box and [t_x, t_y, t_w, t_h] are the predicted offsets; the corrected anchor box coordinates [x, y, w, h] are then calculated by the following formula, where [p_x, p_y, p_w, p_h] are the coordinates of the original anchor box, [d_x, d_y, d_w, d_h] are the coordinate offsets predicted by the RPN network, and [g_x, g_y, g_w, g_h] are the coordinates of the corrected anchor box;
E1-5: all original anchor boxes are corrected with the offsets generated in E1-4; the positive anchor boxes are sorted in descending order of the classification scores generated in E1-3 and the first 6000 are taken; non-maximum suppression with an IoU threshold of 0.7 is then applied so that only 2000 candidate boxes remain for each picture; finally the coordinates of the upper-left and lower-right corners of these anchor boxes on the original picture are output, and the anchor boxes at this point are called candidate boxes.
5. The dam deformation evaluation method based on image-text multi-mode fusion according to claim 3, wherein step E2 specifically comprises:
E2-1: each candidate box is mapped back to the feature map of the corresponding scale; the feature region corresponding to each candidate box is divided into a 7×7 grid and max pooling is applied to each grid cell, so that projecting onto the feature map yields a 7×7 feature map for every candidate box; the feature-map scale assigned to a candidate box is given by:
k = floor(k_0 + log2(sqrt(w × h) / area_origin))
where k is the feature-map level used for the mapping, k_0 is the number of feature-map scales, w and h are the width and height of a single candidate box, and area_origin is the size of the input image;
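A rough sketch of E2-1 under the assumptions above: the reconstructed level-assignment formula (with area_origin defaulting to 224 here purely for illustration) and a plain 7×7 grid max pooling over the feature region of one candidate box; all names and defaults are illustrative.

```python
import numpy as np

def fpn_level(w, h, k0=4, area_origin=224, k_min=2, k_max=5):
    """Choose the feature-map level for a box of width w and height h,
    using k = floor(k0 + log2(sqrt(w*h) / area_origin)), clamped to valid levels."""
    k = int(np.floor(k0 + np.log2(np.sqrt(w * h) / area_origin)))
    return min(max(k, k_min), k_max)

def roi_max_pool(feature, box, out_size=7):
    """Divide the feature region of one candidate box into out_size x out_size
    cells and max-pool each cell. feature is (C, H, W); box is (x1, y1, x2, y2)
    already mapped to feature-map coordinates."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    region = feature[:, y1:y2, x1:x2]
    assert region.size > 0, "box must cover at least one feature-map cell"
    C, H, W = region.shape
    out = np.zeros((C, out_size, out_size))
    ys = np.linspace(0, H, out_size + 1).astype(int)
    xs = np.linspace(0, W, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out

pooled = roi_max_pool(np.random.randn(256, 48, 64), box=(5, 5, 40, 30))
print(pooled.shape)   # (256, 7, 7)
```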
E2-2: finally, classification and regression of the candidate boxes are completed: the specific category of every candidate box is predicted through fully connected layers and Softmax, and a further regression is performed on the candidate boxes to obtain the final predicted boxes.
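A minimal sketch of the detection head in E2-2: the flattened 7×7 region features pass through a fully connected layer, with one branch producing Softmax class probabilities and another producing the final box regression; the weight shapes, class count, and random weights are placeholders, not the patented configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def detection_head(roi_feats, num_classes=3):
    """roi_feats: (N, C, 7, 7) pooled features of N candidate boxes."""
    n = roi_feats.shape[0]
    flat = roi_feats.reshape(n, -1)                        # flatten the 7x7 features
    rng = np.random.default_rng(0)
    w_fc = rng.normal(size=(flat.shape[1], 1024)) * 0.01   # shared FC layer (placeholder weights)
    hidden = np.maximum(flat @ w_fc, 0)                    # ReLU
    w_cls = rng.normal(size=(1024, num_classes)) * 0.01
    w_reg = rng.normal(size=(1024, 4 * num_classes)) * 0.01
    class_probs = softmax(hidden @ w_cls)                  # Softmax classification branch
    box_deltas = hidden @ w_reg                            # per-class box regression branch
    return class_probs, box_deltas

probs, deltas = detection_head(np.random.randn(5, 256, 7, 7))
print(probs.shape, deltas.shape)   # (5, 3) (5, 12)
```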
6. The dam deformation evaluation method based on image-text multi-mode fusion according to claim 1, wherein step S5 specifically comprises:
G1: input representation of region images: the region features retained after the processing in S4 are position-encoded with a 5-dimensional vector whose elements are the normalized coordinates of the top-left and bottom-right corners of the region and the fraction of the image area covered by the region; the position encoding is projected to match the dimension of the visual features and the two are added to obtain the image region features; finally, special image tokens mark the beginning and end of the image sequence, and the output of the special image token is used to represent the whole image;
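A small sketch of the G1 region-image representation: the 5-dimensional position vector (normalized corner coordinates plus area fraction) is projected to the visual-feature dimension and added to the region features; the random projection stands in for a learned linear layer and is an assumption, as are the function name and example shapes.

```python
import numpy as np

def region_position_features(boxes, region_feats, img_w, img_h, rng=np.random):
    """Build image-region features as in G1 (a sketch).

    boxes:        (N, 4) array of (x1, y1, x2, y2) in pixels
    region_feats: (N, D) visual features of the retained regions
    Returns (N, D) region features with projected 5-d position encodings added.
    """
    x1, y1, x2, y2 = boxes.T
    area_frac = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
    pos5 = np.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area_frac], axis=1)

    D = region_feats.shape[1]
    W_proj = rng.randn(5, D) * 0.02      # stand-in for a learned 5 -> D projection
    return region_feats + pos5 @ W_proj

boxes = np.array([[10.0, 20.0, 110.0, 220.0]])
feats = np.random.randn(1, 768)
print(region_position_features(boxes, feats, img_w=640, img_h=480).shape)  # (1, 768)
```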
G2: input representation of text: the dam deformation assessment text preprocessed in S4 is fed into a BERT model to obtain the corresponding text embeddings;
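A brief sketch of G2 using the Hugging Face transformers API; the checkpoint name bert-base-chinese and the example sentence are assumptions, since the patent only specifies that a BERT model produces the text embeddings.

```python
# Requires: pip install transformers torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # checkpoint is an assumption
model = BertModel.from_pretrained("bert-base-chinese")

# A hypothetical dam-deformation assessment sentence (illustrative only).
text = "The horizontal displacement of the dam crest increased slightly during the flood season."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
text_embeddings = outputs.last_hidden_state   # (1, sequence_length, 768) token embeddings
print(text_embeddings.shape)
```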
G3: joint characterization of region images and text: the image and text features obtained from G1 and G2 exchange information through 6 Transformer layers with co-attention; that is, given an image I represented as a set of region features υ_0, ..., υ_T and a text input w_0, ..., w_T, the final outputs are h_υ0, ..., h_υT and h_w0, ..., h_wT.
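For intuition, a stripped-down NumPy sketch of one co-attention exchange between region features and text features (single head, with no learned projections, layer normalization, or feed-forward sublayers), repeated 6 times as in G3; this is an illustrative simplification, not the patented architecture, and all shapes are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d))
    return attn @ values

def co_attention_layer(image_feats, text_feats):
    """One co-attention exchange: image features attend to text and vice versa."""
    h_v = image_feats + cross_attention(image_feats, text_feats, text_feats)
    h_w = text_feats + cross_attention(text_feats, image_feats, image_feats)
    return h_v, h_w

# Example shapes: region features v_0..v_T and text embeddings w_0..w_T of dimension 768.
v = np.random.randn(36, 768)   # placeholder region features
w = np.random.randn(20, 768)   # placeholder text embeddings
for _ in range(6):             # six co-attention layers, as in G3 (simplified)
    v, w = co_attention_layer(v, w)
print(v.shape, w.shape)        # (36, 768) (20, 768)
```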
CN202310768316.4A 2023-06-27 2023-06-27 Dam deformation evaluation method based on image-text multi-mode fusion Active CN116861361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310768316.4A CN116861361B (en) 2023-06-27 2023-06-27 Dam deformation evaluation method based on image-text multi-mode fusion

Publications (2)

Publication Number Publication Date
CN116861361A CN116861361A (en) 2023-10-10
CN116861361B true CN116861361B (en) 2024-05-03

Family

ID=88231403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310768316.4A Active CN116861361B (en) 2023-06-27 2023-06-27 Dam deformation evaluation method based on image-text multi-mode fusion

Country Status (1)

Country Link
CN (1) CN116861361B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095363B (en) * 2023-10-20 2024-01-26 安能三局(成都)工程质量检测有限公司 Dam safety monitoring method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113658176A (en) * 2021-09-07 2021-11-16 重庆科技学院 Ceramic tile surface defect detection method based on interactive attention and convolutional neural network
WO2022027986A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Cross-modal person re-identification method and device
CN115331075A (en) * 2022-08-11 2022-11-11 杭州电子科技大学 Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
CN115761757A (en) * 2022-11-04 2023-03-07 福州大学 Multi-mode text page classification method based on decoupling feature guidance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A review of applications of deep learning in medical imaging; Shi Jun; Wang Linlin; Wang Shanshan; Chen Yanxia; Wang Qian; Wei Dongming; Liang Shujun; Peng Jialin; Yi Jiajin; Liu Shengfeng; Ni Dong; Wang Mingliang; Zhang Daoqiang; Shen Dinggang; Journal of Image and Graphics, no. 10; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant