CN116861361B - Dam deformation evaluation method based on image-text multi-mode fusion - Google Patents

Dam deformation evaluation method based on image-text multi-mode fusion

Info

Publication number
CN116861361B
Authority
CN
China
Prior art keywords
image
feature
text
images
dam
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310768316.4A
Other languages
Chinese (zh)
Other versions
CN116861361A (en)
Inventor
王龙宝
张津豪
储洪强
毛莺池
张雪洁
徐淑芳
徐荟华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN202310768316.4A priority Critical patent/CN116861361B/en
Publication of CN116861361A publication Critical patent/CN116861361A/en
Application granted granted Critical
Publication of CN116861361B publication Critical patent/CN116861361B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0635Risk analysis of enterprise or organisation activities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/08Construction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/52Scale-space analysis, e.g. wavelet analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A10/00TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
    • Y02A10/40Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Development Economics (AREA)

Abstract

The invention discloses a dam deformation evaluation method based on image-text multi-mode fusion, which comprises the following steps: acquiring a previous image and a current image; obtaining a difference image; performing multi-scale feature extraction and fusion on the previous image and the difference image to obtain an original image; preprocessing the original image and the dam deformation discrimination text; inputting the preprocessed image and text features into a dual-stream cross-modal Transformer model for pre-training, jointly modeling intra-modal and cross-modal representations to obtain a pre-trained model; optimizing and adjusting the parameters of the pre-trained model; and using the trained model to predict from the test-set images and question text data to obtain a dam deformation evaluation result. By integrating knowledge of the dam scene graph into multimodal pre-training, the invention greatly improves a machine's ability to understand dam deformation scenes and allows the model to align fine-grained features across the image and text modalities more accurately, thereby improving the accuracy of answering dam deformation questions.

Description

Dam deformation evaluation method based on image-text multi-mode fusion
Technical Field
The invention belongs to the field of hydraulic dam deformation monitoring and evaluation, and particularly relates to a dam deformation evaluation method based on image-text multi-mode fusion.
Background
More than 100,000 dams have been built in China to date, making it one of the countries with the most reservoir dams in the world. With the further development and utilization of water resources, more and more high dams and large reservoirs are being built, and these projects play a major role in agricultural irrigation, flood control and drought relief, water resource allocation, hydroelectric power generation, urban water supply, soil and water conservation, ecological and environmental protection, and so on. However, some dams built in the 1960s and 1970s were constrained by the economic conditions and scientific and technical level of the time and suffer from safety problems such as low design standards, geological defects, poor construction quality and aging, which affect the comprehensive benefits of the reservoirs and can even threaten downstream towns, traffic, and people's lives and property. Dam safety has therefore become an increasingly prominent public safety issue that must be taken very seriously.
The main items of dam safety monitoring are: deformation, seepage, pressure, stress-strain, hydraulics, environmental quantities, and so on. Deformation monitoring is the most intuitive and reliable of these and can essentially reflect the safety state of the dam under the action of various loads, so it is the most important monitoring item. Deformation monitoring mainly covers surface deformation, internal deformation, dam foundation deformation, cracks and joints, concrete face-slab deformation, bank slope displacement, and so on. Dam surface deformation monitoring consists mainly of the observation of vertical displacement and the observation of horizontal displacement. Observation of horizontal displacement means measuring the horizontal displacement of representative points of hydraulic structures and their foundations with observation instruments and equipment; the monitoring methods include the sight line method, tension wire method, laser collimation method, plumb line method, intersection method, traverse method, and so on.
Traditional engineering monitoring methods often consume considerable manpower and material resources and cannot observe horizontal displacement automatically. With the rapid development of multimodal feature extraction methods for images, natural language and other data, domain knowledge can be made to interact with the corresponding domain image information, ultimately enabling cross-modal learning and prediction. At present, taking the two modalities of visual images and text as research objects, remarkable progress has been made in directions such as visual question answering and image-text matching. Therefore, taking a long-time-span dam image set of the same area together with text knowledge for judging dam deformation as the research object, and taking the observation of horizontal displacement deformation of the dam surface as the research goal, a dam deformation visual question-answering evaluation method based on image-text multi-mode fusion is of great practical significance.
Disclosure of Invention
Purpose of the invention: in order to overcome the defects in the prior art, a dam deformation evaluation method based on image-text multi-mode fusion is provided.
The technical scheme is as follows: in order to achieve the above purpose, the invention provides a dam deformation evaluation method based on image-text multi-mode fusion, which comprises the following steps:
S1: acquiring a dam image set through a fixed-point industrial monitoring camera, and obtaining a previous image and a current image respectively;
S2: obtaining a difference image from the previous image and the current image;
S3: performing multi-scale feature extraction and fusion on the previous image and the difference image respectively with a feature pyramid (FPN) network, and taking the resulting current feature image as the original image;
S4: preprocessing the original image and the dam deformation discrimination text;
S5: inputting the preprocessed image and text features into a dual-stream cross-modal Transformer model for pre-training, jointly modeling intra-modal and cross-modal representations to obtain a pre-trained model;
S6: optimizing and adjusting the parameters of the pre-trained model with a training set of previous and current dam images and a training set of question texts about dam deformation risk, to complete training;
S7: using the model trained in step S6 to predict from the test-set images and question text data, obtaining a dam deformation evaluation result.
Further, in step S2, true color feature enhancement and feature differencing are performed on the previous image and the current image, and the resulting current feature image is used as the difference image; the specific process comprises the following steps:
A1: performing true color feature enhancement with a PCA-based color feature enhancement method, which significantly enhances the brightness of the image without changing the dominant colors of objects or the color-difference contrast of the image;
A2: calculating the feature difference between the true-color-enhanced previous image and current image; the feature matrix of the previous image is src_init and the feature matrix of the current image is src_final, and the feature difference d_src is then expressed as:
Further, the specific process of true color feature enhancement in step A1 is as follows:
B1: standardizing the previous image P_init and the current image P_final separately over the three RGB channels to zero mean and unit variance, which preserves the relative relationship between the RGB channels without changing the pixel-value distribution within each channel;
B2: flattening the images P_init and P_final channel-wise into N×3 vectors, denoted I(θ), θ ∈ D;
B3: computing the covariance matrix of the vectors I(θ);
B4: performing eigendecomposition of the covariance matrix to obtain the eigenvectors F(θ) and eigenvalues λ(θ);
B5: adding the processed eigenvectors to the three channel feature vectors of images P_init and P_final respectively to obtain the feature-enhanced images. Taking one channel of image P_init as an example, the formula is as follows, where α is the added dither coefficient:
P_result(θ) = P_init(θ) + F(θ)_i · (α_i · λ(θ)_i)^T,  θ, i ∈ D
Further, step S3 is specifically as follows:
D1: extracting features from the previous image and the difference image with backbone ResNet networks of the same structure; the final output features of stages C2, C3, C4 and C5 are passed through a 1×1 convolution with stride 1 so that the number of channels becomes 256, and the results are denoted F2, F3, F4 and F5;
D2: (lateral operation) the F5 feature is passed through a 3×3 convolution with stride 1 to output the P5 image feature; the F5 feature is also upsampled (top-down operation) so that the length and width of the feature image are doubled, making it consistent in shape with the F4 feature, with which it is fused; a 3×3 convolution with stride 1 is then applied to output the P4 image feature; and so on until the P2 image feature is output;
D3: the output features of the previous image and the difference image after FPN processing are denoted F′_θ and F″_θ, where θ denotes the level index (θ = 4 levels in total); the features of corresponding levels are fused and taken as the features of the original image, according to the following formula, where ⊕ denotes feature concatenation:
Further, the specific operations of step D1 are as follows:
D1-1: stage C1 uses a 7×7 convolution with stride 2 followed by a 3×3 max-pooling operation with stride 2, with 64 channels;
D1-2: the connections between stages C2 and C5 are divided into two branches, a main branch and a shortcut branch; the main branch uses 1×1, 3×3 and 1×1 convolutions with strides 1, 2 and 1, which together are called a residual block; stages C2 to C5 use 3, 4, 6 and 3 residual blocks respectively, with 256, 512, 1024 and 2048 channels, so that the length and width of the feature image are halved at each stage; the shortcut branch uses a 1×1 convolution with stride 2 so that the shape of its feature matrix is the same as that of the main branch.
Further, the preprocessing operation in step S4 is as follows: the RPN module of the Faster R-CNN network is used to select salient image regions and extract region features, and each region retained after screening uses an average-pooled representation as its region feature.
Further, the preprocessing operation in step S4 specifically comprises the following steps:
E1: generating candidate boxes for the original-image features of each scale through the RPN structure;
E2: projecting the candidate boxes generated by the RPN onto the feature maps to obtain the corresponding feature matrices, scaling each feature matrix to a 7×7 feature map through an ROI Pooling layer, and flattening the feature map through a series of fully connected layers to obtain the salient image regions.
Further, the operation of step E1 is specifically as follows:
E1-1: the RPN structure uses a 3×3 convolution with stride 1 as a sliding window over the original-image features of each scale; the centre point of each sliding window (each point to be detected) is mapped to the corresponding centre point on the original image; after sliding, the mapping between the feature image and the original image is:
s_width = w_origin / w_feature
s_height = h_origin / h_feature
where w_feature and h_feature are the width and height of the feature image, w_origin and h_origin are the width and height of the original image, and s_width and s_height are the scaling factors relating the original image to the feature image; the coordinate on the original image is obtained by multiplying the corresponding coordinate of a point on the feature image by the scaling factor in that direction;
E1-2: after the centre point on the original image corresponding to each point of the feature image of each scale has been computed, 9 anchor boxes (three areas {128², 256², 512²} × three aspect ratios {1:1, 1:2, 2:1}) are generated at each such centre point on the original image; the width and height of the generated anchor boxes are then computed from the following formula, where area is the area of the generated anchor box, ratio is its aspect ratio, h is its height and w is its width;
E1-3: the feature image with 256 channels is passed through 18 1×1 convolutions to obtain a feature image with 18 channels, which is then passed through a Softmax layer to compute classification scores; if the score is greater than 0.5, the anchor box on the original image corresponding to that point of the feature image is foreground (positive), otherwise background (negative); the formula is as follows, where j is the number of samples:
E1-4: the scale feature images with 256 channels are subjected to 36 convolution operations of 1×1 to generate 4 coordinate offsets [ t x,ty,tw,th ] of each anchor frame, and the coordinate offsets are used for correcting the anchor frames, and the offset calculation formula is as follows:
tx=(x-xa)/wa ty=(y-ya)/ha
tw=log(w/wa) th=log(h/ha)
where [ x a,ya,wa,ha ] is the center point coordinate and width and height of the anchor frame, [ t x,ty,tw,th ] is the predicted offset, then the corrected anchor frame coordinates [ x, y, w, h ] are calculated by the following formula:
Wherein [ p x,py,pw,ph ] represents the coordinates of the original anchor frame, [ d x,dy,dw,dh ] represents the coordinate offset predicted by the RPN network, [ g x,gy,gw,gh ] represents the coordinates of the modified anchor frame;
E1-5: and correcting all original anchor frames by using the offset generated by E1-4, arranging the positive anchor frames from large to small according to the classification probability generated by E1-4, taking the first 6000 anchor frames, adopting non-maximum suppression, setting IoU to be 0.7, only leaving 2000 candidate frames for each picture, and finally outputting coordinates corresponding to the upper left corner and the lower right corner of the anchor frames of the original picture, wherein the anchor frames at the moment are called as candidate frames.
Further, the operation of step E2 is specifically as follows:
E2-1: each candidate box is mapped back onto the original-image features of the corresponding scale; the feature map corresponding to each candidate box is divided into a 7×7 grid, and max pooling is applied to each cell of the grid, i.e. a 7×7 feature map is obtained for each candidate box projected onto the original-image features; the mapping of a candidate box to the feature map of the corresponding scale is given by the following formula, where k is the level of the feature map used for the mapping, k_0 is the number of feature-map scales (here 4), w and h are the width and height of a single candidate box (mapped back to the original image), and area_origin is the size (area) of the input picture;
E2-2: finally, classification and regression of the candidate boxes are completed: all candidate boxes are classified into specific categories through fully connected layers and Softmax, an operation similar to E1-3; regression prediction is then performed again on the candidate boxes to obtain more accurate final predicted boxes, consistent with the operation in E1-4.
Further, step S5 is specifically as follows:
G1: input representation of region images: the region features retained after the S4 processing are position-encoded with a 5-dimensional vector whose elements are the normalized coordinates of the upper-left and lower-right corners of the region and the fraction of the image covered by the region; the position encoding is mapped to match the visual feature dimension and added to the visual features to obtain the image region features; finally, a special image token marks the beginning and end of the image sequence, and its output is used to represent the whole image;
G2: input representation of text: the dam deformation discrimination text preprocessed in S4 is input into a BERT model to obtain the corresponding text embeddings;
G3: joint characterization of region images and text: the image and text features obtained from G1 and G2 exchange information through 6 co-attention Transformer layers; that is, given an image I represented as a set of region features v_0, …, v_T and a text input w_0, …, w_T, the final outputs are h_v0, …, h_vT and h_w0, …, h_wT.
Beneficial effects: compared with the prior art, the invention takes a long-time-span dam image set of the same area together with text knowledge for judging dam deformation as the research object, takes the observation of horizontal displacement deformation of the dam surface as the research goal, and provides a dam deformation visual question-answering evaluation method based on image-text multi-mode fusion, which has the following advantages:
1. Compared with existing engineering monitoring methods, it overcomes the drawbacks of manual operation, saves manpower and material resources, and achieves a better evaluation effect.
2. Through the two feature pyramid networks, the features of the previous image and the difference image, which differ considerably in scale, can be extracted more fully with essentially no increase in the computation of the original model, which greatly improves dam deformation detection performance on the difference image.
3. Knowledge of the dam scene graph is integrated into the multimodal pre-training, which greatly improves a machine's ability to understand dam deformation scenes and allows the model to align fine-grained features across the image and text modalities more accurately, thereby improving the accuracy of answering dam deformation questions.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a schematic illustration of a process according to the present invention.
FIG. 3 is a schematic representation of feature pyramid multi-scale feature extraction in accordance with the method of the present invention.
Fig. 4 is a schematic diagram of the original image feature fusion of the method of the present invention.
FIG. 5 is a schematic diagram of an original image of a candidate frame mapped to a corresponding scale according to the method of the present invention.
FIG. 6 is a schematic diagram of scene graph parsing in the method of the present invention.
FIG. 7 is a schematic diagram of a multi-modal pretraining scheme of the method of the present invention.
FIG. 8 is a schematic diagram of a dam deformation evaluation visual question-answering model of the method of the present invention.
Detailed Description
The present application is further illustrated by the accompanying drawings and the following detailed description, which are to be understood as merely illustrative of the application and not limiting its scope. After reading the application, various equivalent modifications made by those skilled in the art fall within the scope of the application as defined by the appended claims.
The invention provides a dam deformation evaluation method based on image-text multi-mode fusion, which is shown in FIG. 1 and FIG. 2 and comprises the following steps:
S1: acquiring a dam image set of the same area with a time interval of 3 years using a fixed-point industrial monitoring camera; the image farther from the current time is taken as the previous image and the image closer to the current time as the current image;
S2: for the two acquired remote sensing images, i.e. the previous image and the current image, true color feature enhancement is performed first, then the feature difference of the two images is taken, and the resulting current feature image is called the difference image;
the specific process comprises the following steps:
A1: performing true color feature enhancement with a PCA-based color feature enhancement method, which significantly enhances the brightness of the image without changing the dominant colors of objects or the color-difference contrast of the image;
A2: calculating the feature difference between the true-color-enhanced previous image and current image; the feature matrix of the previous image is src_init and the feature matrix of the current image is src_final, and the feature difference d_src is then expressed as:
The specific process of true color feature enhancement in step A1 is as follows:
B1: standardizing the previous image P_init and the current image P_final separately over the three RGB channels to zero mean and unit variance, which preserves the relative relationship between the RGB channels without changing the pixel-value distribution within each channel;
B2: flattening the images P_init and P_final channel-wise into N×3 vectors, denoted I(θ), θ ∈ D;
B3: computing the covariance matrix of the vectors I(θ);
B4: performing eigendecomposition of the covariance matrix to obtain the eigenvectors F(θ) and eigenvalues λ(θ);
B5: adding the processed eigenvectors to the three channel feature vectors of images P_init and P_final respectively to obtain the feature-enhanced images. Taking one channel of image P_init as an example, the formula is as follows, where α is the added dither coefficient:
P_result(θ) = P_init(θ) + F(θ)_i · (α_i · λ(θ)_i)^T,  θ, i ∈ D
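As an illustration of steps B1-B5 and the differencing in A2, the following is a minimal NumPy sketch; the scale of the dither coefficient α and the subtraction order src_final − src_init are assumptions, since the exact formulas are not reproduced above.

```python
import numpy as np

def pca_color_enhance(img, alpha_std=0.1, rng=None):
    """PCA-based true-color enhancement (steps B1-B5).
    img: H x W x 3 RGB array; alpha_std is the assumed scale of the dither coefficient."""
    rng = np.random.default_rng() if rng is None else rng
    x = img.astype(np.float64)
    # B1: standardize each RGB channel to zero mean, unit variance
    mean = x.reshape(-1, 3).mean(axis=0)
    std = x.reshape(-1, 3).std(axis=0) + 1e-8
    flat = ((x - mean) / std).reshape(-1, 3)       # B2: N x 3 pixel matrix
    cov = np.cov(flat, rowvar=False)               # B3: 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # B4: eigenvalues / eigenvectors
    # B5: add F(theta) . (alpha * lambda) to every pixel of every channel
    alpha = rng.normal(0.0, alpha_std, size=3)
    offset = eigvecs @ (alpha * eigvals)
    return x + offset                              # broadcast over H x W

def feature_difference(prev_img, curr_img):
    """A2: difference image, assumed to be the element-wise src_final - src_init."""
    return pca_color_enhance(curr_img) - pca_color_enhance(prev_img)
```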
S3: performing multi-scale feature extraction and fusion on the previous image and the difference image respectively with a feature pyramid (FPN) network, and taking the resulting current feature image as the original image;
Referring to FIG. 3 and FIG. 4, the specific steps of multi-scale feature extraction and feature fusion are as follows:
D1: extracting features from the previous image and the difference image with backbone ResNet networks of the same structure; the final output features of stages C2, C3, C4 and C5 are passed through a 1×1 convolution with stride 1 so that the number of channels becomes 256, and the results are denoted F2, F3, F4 and F5;
The backbone ResNet network performs feature extraction in the following 5 stages:
D1-1: stage C1 uses a 7×7 convolution with stride 2 followed by a 3×3 max-pooling operation with stride 2, with 64 channels;
D1-2: the connections between stages C2 and C5 are divided into two branches, a main branch and a shortcut branch; the main branch uses 1×1, 3×3 and 1×1 convolutions with strides 1, 2 and 1, which together are called a residual block; stages C2 to C5 use 3, 4, 6 and 3 residual blocks respectively, with 256, 512, 1024 and 2048 channels, so that the length and width of the feature image are halved at each stage; the shortcut branch uses a 1×1 convolution with stride 2 so that the shape of its feature matrix is the same as that of the main branch;
A residual structure can be expressed as:
x_{l+1} = x_l + F(x_l, W_l)
where F(x_l, W_l) is the output of the main branch of the l-th unit and x_l is the output of the shortcut branch of the l-th unit;
D2: (lateral operation) the F5 feature is passed through a 3×3 convolution with stride 1 to output the P5 image feature; the F5 feature is also upsampled (top-down operation) so that the length and width of the feature image are doubled, making it consistent in shape with the F4 feature, with which it is fused; a 3×3 convolution with stride 1 is then applied to output the P4 image feature; and so on until the P2 image feature is output;
The upsampling of the features at each level is specifically as follows:
nearest-neighbour interpolation is used for upsampling; dst_x and dst_y denote the horizontal and vertical coordinates of a pixel of the upsampled target image, dst_width and dst_height the width and height of the target image, src_width and src_height the width and height of the original (source) image, and src_x and src_y the original-image coordinates corresponding to the target-image point (dst_x, dst_y):
src_x = dst_x · (src_width / dst_width)
src_y = dst_y · (src_height / dst_height)
D3: the output features of the previous image and the difference image after FPN processing are denoted F′_θ and F″_θ, where θ denotes the level index (θ = 4 levels in total); the features of corresponding levels are fused and taken as the features of the original image, according to the following formula, where ⊕ denotes feature concatenation:
S4: preprocessing the dam deformation discrimination text of the original image subjected to expert demonstration examination:
Preprocessing an original image, namely selecting a significant image area and extracting area characteristics by adopting an RPN module of a Faster R-CNN network, screening each reserved area, and using an average pooling representation as the area characteristics;
Preprocessing the dam deformation discrimination text subjected to expert demonstration examination, and referring to FIG. 6, the method is characterized in that a scene graph is resolved from sentences through a scene graph resolver, the discrimination text is marked in WordPieces mode, and then 15% of word segmentation and 30% of scene graph nodes are randomly covered;
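A small sketch of the text-side preprocessing (WordPiece tokenization plus random masking of 15% of tokens and 30% of scene-graph nodes); the HuggingFace tokenizer and the bert-base-chinese checkpoint are illustrative assumptions, and the scene graph parser itself is not shown.

```python
import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint

def mask_discrimination_text(text, scene_graph_nodes, p_token=0.15, p_node=0.30):
    """Tokenize the dam-deformation discrimination text into WordPieces and
    randomly mask 15% of the tokens and 30% of the scene-graph nodes."""
    tokens = tokenizer.tokenize(text)
    masked_tokens = [tokenizer.mask_token if random.random() < p_token else t
                     for t in tokens]
    masked_nodes = ["[MASK]" if random.random() < p_node else n
                    for n in scene_graph_nodes]
    return masked_tokens, masked_nodes
```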
The preprocessing operation specifically comprises the following steps:
E1: generating candidate boxes for the original-image features of each scale through the RPN structure:
E1-1: the RPN structure uses a 3×3 convolution with stride 1 as a sliding window over the original-image features of each scale; the centre point of each sliding window (each point to be detected) is mapped to the corresponding centre point on the original image; after sliding, the mapping between the feature image and the original image is:
s_width = w_origin / w_feature
s_height = h_origin / h_feature
where w_feature and h_feature are the width and height of the feature image, w_origin and h_origin are the width and height of the original image, and s_width and s_height are the scaling factors relating the original image to the feature image; the coordinate on the original image is obtained by multiplying the corresponding coordinate of a point on the feature image by the scaling factor in that direction;
E1-2: after the centre point on the original image corresponding to each point of the feature image of each scale has been computed, 9 anchor boxes (three areas {128², 256², 512²} × three aspect ratios {1:1, 1:2, 2:1}) are generated at each such centre point on the original image; the width and height of the generated anchor boxes are then computed from the following formula, where area is the area of the generated anchor box, ratio is its aspect ratio, h is its height and w is its width;
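A sketch of the anchor generation in E1-2; since the width/height formula is not reproduced above, the standard parameterisation w = sqrt(area / ratio), h = w · ratio (with ratio = h / w) is assumed.

```python
import numpy as np

def generate_anchors(center_x, center_y,
                     areas=(128 ** 2, 256 ** 2, 512 ** 2),
                     ratios=(1.0, 0.5, 2.0)):
    """9 anchor boxes (3 areas x 3 aspect ratios) centred at one point of the
    original image (step E1-2). ratio is assumed to mean h / w."""
    anchors = []
    for area in areas:
        for ratio in ratios:
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2])
    return np.array(anchors)  # (9, 4) as [x1, y1, x2, y2]
```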
E1-3: the feature image with 256 channels is passed through 18 1×1 convolutions to obtain a feature image with 18 channels, which is then passed through a Softmax layer to compute classification scores; if the score is greater than 0.5, the anchor box on the original image corresponding to that point of the feature image is foreground (positive), otherwise background (negative); the formula is as follows, where j is the number of samples:
E1-4: the feature image with 256 channels is passed through 36 1×1 convolutions to generate the 4 coordinate offsets [t_x, t_y, t_w, t_h] of each anchor box, which are used to correct the anchor boxes; the offsets are computed as:
t_x = (x − x_a) / w_a,  t_y = (y − y_a) / h_a
t_w = log(w / w_a),  t_h = log(h / h_a)
where [x_a, y_a, w_a, h_a] are the centre coordinates, width and height of the anchor box and [t_x, t_y, t_w, t_h] are the predicted offsets; the corrected anchor box coordinates [x, y, w, h] are then calculated by the following formula, where [p_x, p_y, p_w, p_h] are the coordinates of the original anchor box, [d_x, d_y, d_w, d_h] are the coordinate offsets predicted by the RPN network, and [g_x, g_y, g_w, g_h] are the coordinates of the corrected anchor box;
E1-5: all original anchor boxes are corrected with the offsets generated in E1-4; the positive anchor boxes are sorted in descending order of the classification scores generated in E1-3 and the first 6000 are taken; non-maximum suppression with an IoU threshold of 0.7 is then applied so that only 2000 candidate boxes remain for each picture; finally the coordinates of the upper-left and lower-right corners of these anchor boxes on the original picture are output, and the anchor boxes at this point are called candidate boxes.
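A sketch of E1-4/E1-5: decoding anchors with the predicted offsets and filtering proposals with non-maximum suppression; the decoding equations are the standard Faster R-CNN ones and are stated here as an assumption, since the patent's correction formula is not reproduced above.

```python
import torch
from torchvision.ops import nms

def decode_anchors(anchors, deltas):
    """Apply predicted offsets [dx, dy, dw, dh] to anchors [xa, ya, wa, ha]
    (step E1-4), using the standard Faster R-CNN parameterisation (assumed)."""
    xa, ya, wa, ha = anchors.unbind(dim=1)
    dx, dy, dw, dh = deltas.unbind(dim=1)
    gx = wa * dx + xa          # corrected centre x
    gy = ha * dy + ya          # corrected centre y
    gw = wa * torch.exp(dw)    # corrected width
    gh = ha * torch.exp(dh)    # corrected height
    # convert to corner form [x1, y1, x2, y2] for NMS
    return torch.stack([gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2], dim=1)

def select_proposals(boxes, scores, pre_nms_topk=6000, post_nms_topk=2000, iou_thresh=0.7):
    """Step E1-5: sort positive anchors by score, keep the top 6000,
    apply NMS with IoU 0.7 and keep at most 2000 candidate boxes."""
    order = scores.argsort(descending=True)[:pre_nms_topk]
    boxes, scores = boxes[order], scores[order]
    keep = nms(boxes, scores, iou_thresh)[:post_nms_topk]
    return boxes[keep]
```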
E2: projecting the candidate frames generated by the RPN onto the feature map to obtain corresponding feature matrixes, scaling each feature matrix to the feature map with the size of 7 multiplied by 7 through ROI Pooling layers, flattening the feature map through a series of full-connection layers to obtain a salient image area:
E2-1: referring to fig. 5, the candidate frames are mapped back to the original image of the corresponding scale, the feature map corresponding to each candidate frame is divided into 7×7 grids, and the maximum pooling operation is performed on each part of the grids, that is, the feature map with the corresponding size of 7×7 is obtained by projecting the feature map onto the original image, and the formula is as follows:
Where k is the number of layers of the feature map used for mapping, k 0 is the number of scales of the feature map (here, 4), w and h are the width and height of a single candidate frame (mapped to the original image), area origin is the input picture size (area of the candidate frame);
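A sketch of E2-1: choosing the feature-map scale for a candidate box and max-pooling its region into a 7×7 grid; because the exact mapping formula is not reproduced above, the standard FPN assignment floor(k0 + log2(sqrt(w·h) / 224)) is used here as a stand-in.

```python
import math
import torch
from torchvision.ops import roi_pool

def assign_fpn_level(w, h, k0=4, canonical=224.0):
    """Choose which of the 4 feature-map scales a candidate box is pooled from.
    The standard FPN rule is used as a stand-in for the patent's mapping."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return min(max(k, 2), 5)  # clamp to the available levels P2..P5

def pool_candidate(feature_map, box_xyxy, spatial_scale):
    """Step E2-1: project one candidate box onto its feature map and
    max-pool the region into a fixed 7x7 grid."""
    # rois layout: [batch_index, x1, y1, x2, y2]
    rois = torch.cat([torch.zeros(1, 1), box_xyxy.view(1, 4)], dim=1)
    return roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=spatial_scale)
```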
E2-2: finally, classification and regression of the candidate boxes are completed: all candidate boxes are classified into specific categories through fully connected layers and Softmax, an operation similar to E1-3; regression prediction is then performed again on the candidate boxes to obtain more accurate final predicted boxes, consistent with the operation in E1-4.
S5: inputting the preprocessed image and text features into a dual-stream cross-modal Transformer model for pre-training, jointly modeling intra-modal and cross-modal representations to obtain a pre-trained model;
Referring to FIG. 7, the specific operation steps are as follows:
G1: input representation of region images: the region features retained after the S4 processing are position-encoded with a 5-dimensional vector whose elements are the normalized coordinates of the upper-left and lower-right corners of the region and the fraction of the image covered by the region; the position encoding is mapped to match the visual feature dimension and added to the visual features to obtain the image region features; finally, a special image token marks the beginning and end of the image sequence, and its output is used to represent the whole image;
The encoding of the region feature positions with a 5-dimensional vector is specifically as follows:
W and H denote the length and width of the region feature, respectively; the upper-left corner of the image region is [x_1, y_1] and the lower-right corner is [x_2, y_2]; the region is then position-encoded as the 5-dimensional vector v = [x, y, w, h, s].
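A sketch of the 5-dimensional region position encoding in G1; the feature dimension of 2048 (typical of Faster R-CNN region features) and the linear projection used to match the visual feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class RegionPositionEncoding(nn.Module):
    """G1: encode each region as [x1/W, y1/H, x2/W, y2/H, area_ratio],
    project it to the visual feature dimension and add it to the region feature."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.proj = nn.Linear(5, feat_dim)

    def forward(self, region_feats, boxes, img_w, img_h):
        # boxes: (N, 4) as [x1, y1, x2, y2] in image coordinates
        x1, y1, x2, y2 = boxes.unbind(dim=1)
        area_ratio = (x2 - x1) * (y2 - y1) / (img_w * img_h)   # image coverage ratio
        pos = torch.stack([x1 / img_w, y1 / img_h,
                           x2 / img_w, y2 / img_h, area_ratio], dim=1)
        return region_feats + self.proj(pos)
```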
G2: input representation of text: the dam deformation discrimination text preprocessed in S4 is input into a BERT model to obtain the corresponding text embeddings;
G3: joint characterization of region images and text: the image and text features obtained from G1 and G2 exchange information through 6 co-attention Transformer layers; that is, given an image I represented as a set of region features v_0, …, v_T and a text input w_0, …, w_T, the final outputs are h_v0, …, h_vT and h_w0, …, h_wT;
The 6 co-attention Transformer layers are consistent with the Transformer encoder structure, but the sources of Q, K and V after the linear transformations differ; the co-attention mechanism can be expressed as:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
For the image stream, Q is derived from the region features v_0, …, v_T and K, V are derived from the text input w_0, …, w_T; for the text stream, Q is derived from the text input w_0, …, w_T and K, V are derived from the region features v_0, …, v_T;
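A sketch of one of the 6 co-attention Transformer layers in G3, where each stream takes Q from its own modality and K, V from the other; the hidden size of 768, the 12 heads and the residual/LayerNorm arrangement are assumptions.

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One co-attention Transformer layer (G3): each stream queries the other
    modality's keys and values, then applies its own feed-forward network."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.txt_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, v, w):
        v_in, w_in = v, w
        # image stream: Q from region features, K/V from the text stream
        v = self.norm[0](v_in + self.img_attn(v_in, w_in, w_in)[0])
        # text stream: Q from text features, K/V from the region stream
        w = self.norm[1](w_in + self.txt_attn(w_in, v_in, v_in)[0])
        v = self.norm[2](v + self.img_ffn(v))
        w = self.norm[3](w + self.txt_ffn(w))
        return v, w
```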
The two pre-training tasks are, respectively, predicting the masked text tokens from the unmasked text tokens and the region features (MLM task), and predicting whether the text features and the region features match (ITM task); the loss functions of the MLM and ITM tasks can be expressed as:
L_MLM = −E_(W,V)∈D log P_θ(w_m | w_\m, V)
where w_m and w_\m denote the masked and unmasked text tokens, respectively, and (W, V) ∈ D denotes a sample pair of text W and region image V from the dam deformation dataset;
L_ITM = −E_(W,V)∈D [ y·log s_θ(w_[CLS], v_[IMG]) + (1 − y)·log(1 − s_θ(w_[CLS], v_[IMG])) ]
where the scoring function s_θ measures the matching probability between the region image and the text, y ∈ {0, 1} indicates whether the text W matches the region image V, and w_[CLS] and v_[IMG] represent the text W and the region image V, respectively.
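A sketch of the two pre-training losses L_MLM and L_ITM written with standard cross-entropy terms; the shapes of the prediction heads are assumptions.

```python
import torch
import torch.nn.functional as F

def mlm_loss(token_logits, token_targets):
    """MLM task: predict the masked WordPiece tokens.
    token_logits: (N_masked, vocab_size); token_targets: (N_masked,) token ids."""
    return F.cross_entropy(token_logits, token_targets)

def itm_loss(match_prob, y):
    """ITM task: s_theta(w_[CLS], v_[IMG]) gives the matching probability between
    the discrimination text and the region image; y holds labels in {0, 1}."""
    return F.binary_cross_entropy(match_prob, y.float())

# Pre-training objective (assumed equal weighting of the two tasks):
# loss = mlm_loss(token_logits, token_targets) + itm_loss(match_prob, y)
```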
S6: optimizing and adjusting the parameters of the pre-trained model with a training set of previous and current dam images and a training set of question texts about dam deformation risk, to complete training;
The visual question-answering training task is a multi-class classification task, so its loss function can be expressed by the following formula, where N is the number of the most frequent answer labels in the training set, y_v ∈ {0, 1} is the label indicator for the predicted result, and p_v is the probability that the predicted classification result is the v-th class.
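A sketch of the fine-tuning loss in S6; since the formula is not reproduced above, a binary cross-entropy over the N most frequent answer labels is assumed here, consistent with y_v ∈ {0, 1} and p_v.

```python
import torch

def vqa_answer_loss(pred_probs, labels):
    """S6 fine-tuning loss (assumed form): pred_probs (B, N) are the predicted
    probabilities p_v over the N most frequent training-set answers, and
    labels (B, N) hold the indicators y_v in {0, 1}."""
    eps = 1e-8
    return -(labels * torch.log(pred_probs + eps)
             + (1 - labels) * torch.log(1 - pred_probs + eps)).mean()
```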
S7: referring to FIG. 8, the model trained in step S6 is used to predict from the test-set images and question text data, and the dam deformation evaluation result is obtained for professionals to consult and for early warning.

Claims (6)

1. A dam deformation evaluation method based on image-text multi-mode fusion, characterized by comprising the following steps:
S1: acquiring a dam image set through a fixed-point industrial monitoring camera, and obtaining a previous image and a current image respectively;
S2: obtaining a difference image from the previous image and the current image;
S3: performing multi-scale feature extraction and fusion on the previous image and the difference image respectively with a feature pyramid (FPN) network, and taking the resulting current feature image as the original image;
S4: preprocessing the original image and the dam deformation discrimination text;
S5: inputting the preprocessed image and text features into a dual-stream cross-modal Transformer model for pre-training, jointly modeling intra-modal and cross-modal representations to obtain a pre-trained model;
S6: optimizing and adjusting the parameters of the pre-trained model with a training set of previous and current dam images and a training set of question texts about dam deformation risk, to complete training;
S7: using the model trained in step S6 to predict from the test-set images and question text data, obtaining a dam deformation evaluation result;
wherein in step S2, true color feature enhancement and feature differencing are performed on the previous image and the current image, and the resulting current feature image is used as the difference image; the specific process comprises the following steps:
A1: performing true color feature enhancement with a PCA-based color feature enhancement method;
A2: calculating the feature difference between the true-color-enhanced previous image and current image, wherein the feature matrix of the previous image is src_init and the feature matrix of the current image is src_final, and the feature difference d_src is expressed as:
The specific process of true color feature enhancement in step A1 is as follows:
B1: standardizing the previous image P_init and the current image P_final separately over the three RGB channels to zero mean and unit variance, which preserves the relative relationship between the RGB channels without changing the pixel-value distribution within each channel;
B2: flattening the images P_init and P_final channel-wise into N×3 vectors, denoted I(θ), θ ∈ D;
B3: computing the covariance matrix of the vectors I(θ);
B4: performing eigendecomposition of the covariance matrix to obtain the eigenvectors F(θ) and eigenvalues λ(θ);
B5: adding the processed eigenvectors to the three channel feature vectors of images P_init and P_final respectively to obtain the feature-enhanced images;
Step S3 is specifically as follows:
D1: extracting features from the previous image and the difference image with backbone ResNet networks of the same structure; the final output features of stages C2, C3, C4 and C5 are passed through a 1×1 convolution with stride 1 so that the number of channels becomes 256, and the results are denoted F2, F3, F4 and F5;
D2: the F5 feature is passed through a 3×3 convolution with stride 1 to output the P5 image feature; the F5 feature is also upsampled so that the length and width of the feature image are doubled, making it consistent in shape with the F4 feature, with which it is fused; a 3×3 convolution with stride 1 is then applied to output the P4 image feature; and so on until the P2 image feature is output;
D3: the output features of the previous image and the difference image after FPN processing are denoted F′_θ and F″_θ, where θ denotes the level index; the features of corresponding levels are fused and taken as the features of the original image, according to the following formula, where ⊕ denotes feature concatenation:
The specific operations of step D1 are as follows:
D1-1: stage C1 uses a 7×7 convolution with stride 2 followed by a 3×3 max-pooling operation with stride 2, with 64 channels;
D1-2: the connections between stages C2 and C5 are divided into two branches, a main branch and a shortcut branch; the main branch uses 1×1, 3×3 and 1×1 convolutions with strides 1, 2 and 1, which together are called a residual block; stages C2 to C5 use 3, 4, 6 and 3 residual blocks respectively, with 256, 512, 1024 and 2048 channels, so that the length and width of the feature image are halved at each stage; the shortcut branch uses a 1×1 convolution with stride 2 so that the shape of its feature matrix is the same as that of the main branch.
2. The dam deformation evaluation method based on image-text multi-mode fusion according to claim 1, wherein the preprocessing operation in step S4 is as follows: the RPN module of the Faster R-CNN network is used to select salient image regions and extract region features, and each region retained after screening uses an average-pooled representation as its region feature.
3. The dam deformation evaluation method based on image-text multi-mode fusion according to claim 2, wherein the preprocessing operation in step S4 specifically comprises the following steps:
E1: generating candidate boxes for the original-image features of each scale through the RPN structure;
E2: projecting the candidate boxes generated by the RPN onto the feature maps to obtain the corresponding feature matrices, scaling each feature matrix to a 7×7 feature map through an ROI Pooling layer, and flattening the feature map through a series of fully connected layers to obtain the salient image regions.
4. The dam deformation evaluation method based on image-text multi-mode fusion according to claim 3, wherein the operation of step E1 is specifically as follows:
E1-1: the RPN structure uses a 3×3 convolution with stride 1 as a sliding window over the original-image features of each scale; the centre point of each sliding window is mapped to the corresponding centre point on the original image; after sliding, the mapping between the feature image and the original image is:
s_width = w_origin / w_feature
s_height = h_origin / h_feature
where w_feature and h_feature are the width and height of the feature image, w_origin and h_origin are the width and height of the original image, and s_width and s_height are the scaling factors relating the original image to the feature image; the coordinate on the original image is obtained by multiplying the corresponding coordinate of a point on the feature image by the scaling factor in that direction;
E1-2: after the centre point on the original image corresponding to each point of the feature image of each scale has been computed, 9 anchor boxes (three areas {128², 256², 512²} × three aspect ratios {1:1, 1:2, 2:1}) are generated at each such centre point on the original image; the width and height of the generated anchor boxes are then computed from the following formula, where area is the area of the generated anchor box, ratio is its aspect ratio, h is its height and w is its width;
E1-3: the feature image with 256 channels is passed through 18 1×1 convolutions to obtain a feature image with 18 channels, which is then passed through a Softmax layer to compute classification scores; if the score is greater than 0.5, the anchor box on the original image corresponding to that point of the feature image is foreground (positive), otherwise background (negative); the formula is as follows, where j is the number of samples:
E1-4: the feature image with 256 channels is passed through 36 1×1 convolutions to generate the 4 coordinate offsets [t_x, t_y, t_w, t_h] of each anchor box, which are used to correct the anchor boxes; the offsets are computed as:
t_x = (x − x_a) / w_a,  t_y = (y − y_a) / h_a
t_w = log(w / w_a),  t_h = log(h / h_a)
where [x_a, y_a, w_a, h_a] are the centre coordinates, width and height of the anchor box and [t_x, t_y, t_w, t_h] are the predicted offsets; the corrected anchor box coordinates [x, y, w, h] are then calculated by the following formula, where [p_x, p_y, p_w, p_h] are the coordinates of the original anchor box, [d_x, d_y, d_w, d_h] are the coordinate offsets predicted by the RPN network, and [g_x, g_y, g_w, g_h] are the coordinates of the corrected anchor box;
E1-5: all original anchor boxes are corrected with the offsets generated in E1-4; the positive anchor boxes are sorted in descending order of the classification scores generated in E1-3 and the first 6000 are taken; non-maximum suppression with an IoU threshold of 0.7 is then applied so that only 2000 candidate boxes remain for each picture; finally the coordinates of the upper-left and lower-right corners of these anchor boxes on the original picture are output, and the anchor boxes at this point are called candidate boxes.
5. The dam deformation evaluation method based on image-text multi-mode fusion according to claim 3, wherein step E2 specifically comprises:
E2-1: each candidate box is mapped back to the feature map of the corresponding scale; the feature region corresponding to each candidate box is divided into a 7×7 grid and max pooling is applied to each grid cell, so that projecting onto the feature map yields a 7×7 feature map for every candidate box; the feature-map scale assigned to a candidate box is given by:
k = floor(k_0 + log2(sqrt(w × h) / area_origin))
where k is the feature-map level used for the mapping, k_0 is the number of feature-map scales, w and h are the width and height of a single candidate box, and area_origin is the size of the input image;
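A rough sketch of E2-1 under the assumptions above: the reconstructed level-assignment formula (with area_origin defaulting to 224 here purely for illustration) and a plain 7×7 grid max pooling over the feature region of one candidate box; all names and defaults are illustrative.

```python
import numpy as np

def fpn_level(w, h, k0=4, area_origin=224, k_min=2, k_max=5):
    """Choose the feature-map level for a box of width w and height h,
    using k = floor(k0 + log2(sqrt(w*h) / area_origin)), clamped to valid levels."""
    k = int(np.floor(k0 + np.log2(np.sqrt(w * h) / area_origin)))
    return min(max(k, k_min), k_max)

def roi_max_pool(feature, box, out_size=7):
    """Divide the feature region of one candidate box into out_size x out_size
    cells and max-pool each cell. feature is (C, H, W); box is (x1, y1, x2, y2)
    already mapped to feature-map coordinates."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    region = feature[:, y1:y2, x1:x2]
    assert region.size > 0, "box must cover at least one feature-map cell"
    C, H, W = region.shape
    out = np.zeros((C, out_size, out_size))
    ys = np.linspace(0, H, out_size + 1).astype(int)
    xs = np.linspace(0, W, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out

pooled = roi_max_pool(np.random.randn(256, 48, 64), box=(5, 5, 40, 30))
print(pooled.shape)   # (256, 7, 7)
```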
E2-2: finally, classification and regression of the candidate boxes are completed: the specific category of every candidate box is predicted through fully connected layers and Softmax, and a further regression is performed on the candidate boxes to obtain the final predicted boxes.
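A minimal sketch of the detection head in E2-2: the flattened 7×7 region features pass through a fully connected layer, with one branch producing Softmax class probabilities and another producing the final box regression; the weight shapes, class count, and random weights are placeholders, not the patented configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def detection_head(roi_feats, num_classes=3):
    """roi_feats: (N, C, 7, 7) pooled features of N candidate boxes."""
    n = roi_feats.shape[0]
    flat = roi_feats.reshape(n, -1)                        # flatten the 7x7 features
    rng = np.random.default_rng(0)
    w_fc = rng.normal(size=(flat.shape[1], 1024)) * 0.01   # shared FC layer (placeholder weights)
    hidden = np.maximum(flat @ w_fc, 0)                    # ReLU
    w_cls = rng.normal(size=(1024, num_classes)) * 0.01
    w_reg = rng.normal(size=(1024, 4 * num_classes)) * 0.01
    class_probs = softmax(hidden @ w_cls)                  # Softmax classification branch
    box_deltas = hidden @ w_reg                            # per-class box regression branch
    return class_probs, box_deltas

probs, deltas = detection_head(np.random.randn(5, 256, 7, 7))
print(probs.shape, deltas.shape)   # (5, 3) (5, 12)
```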
6. The dam deformation evaluation method based on image-text multi-mode fusion according to claim 1, wherein step S5 specifically comprises:
G1: input representation of region images: the region features retained after the processing in S4 are position-encoded with a 5-dimensional vector whose elements are the normalized coordinates of the top-left and bottom-right corners of the region and the fraction of the image area covered by the region; the position encoding is projected to match the dimension of the visual features and the two are added to obtain the image region features; finally, special image tokens mark the beginning and end of the image sequence, and the output of the special image token is used to represent the whole image;
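A small sketch of the G1 region-image representation: the 5-dimensional position vector (normalized corner coordinates plus area fraction) is projected to the visual-feature dimension and added to the region features; the random projection stands in for a learned linear layer and is an assumption, as are the function name and example shapes.

```python
import numpy as np

def region_position_features(boxes, region_feats, img_w, img_h, rng=np.random):
    """Build image-region features as in G1 (a sketch).

    boxes:        (N, 4) array of (x1, y1, x2, y2) in pixels
    region_feats: (N, D) visual features of the retained regions
    Returns (N, D) region features with projected 5-d position encodings added.
    """
    x1, y1, x2, y2 = boxes.T
    area_frac = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
    pos5 = np.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area_frac], axis=1)

    D = region_feats.shape[1]
    W_proj = rng.randn(5, D) * 0.02      # stand-in for a learned 5 -> D projection
    return region_feats + pos5 @ W_proj

boxes = np.array([[10.0, 20.0, 110.0, 220.0]])
feats = np.random.randn(1, 768)
print(region_position_features(boxes, feats, img_w=640, img_h=480).shape)  # (1, 768)
```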
G2: input representation of text: the dam deformation assessment text preprocessed in S4 is fed into a BERT model to obtain the corresponding text embeddings;
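A brief sketch of G2 using the Hugging Face transformers API; the checkpoint name bert-base-chinese and the example sentence are assumptions, since the patent only specifies that a BERT model produces the text embeddings.

```python
# Requires: pip install transformers torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # checkpoint is an assumption
model = BertModel.from_pretrained("bert-base-chinese")

# A hypothetical dam-deformation assessment sentence (illustrative only).
text = "The horizontal displacement of the dam crest increased slightly during the flood season."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
text_embeddings = outputs.last_hidden_state   # (1, sequence_length, 768) token embeddings
print(text_embeddings.shape)
```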
G3: joint characterization of region images and text: the image and text features obtained from G1 and G2 exchange information through 6 Transformer layers with co-attention; that is, given an image I represented as a set of region features υ_0, ..., υ_T and a text input w_0, ..., w_T, the final outputs are h_υ0, ..., h_υT and h_w0, ..., h_wT.
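For intuition, a stripped-down NumPy sketch of one co-attention exchange between region features and text features (single head, with no learned projections, layer normalization, or feed-forward sublayers), repeated 6 times as in G3; this is an illustrative simplification, not the patented architecture, and all shapes are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d))
    return attn @ values

def co_attention_layer(image_feats, text_feats):
    """One co-attention exchange: image features attend to text and vice versa."""
    h_v = image_feats + cross_attention(image_feats, text_feats, text_feats)
    h_w = text_feats + cross_attention(text_feats, image_feats, image_feats)
    return h_v, h_w

# Example shapes: region features v_0..v_T and text embeddings w_0..w_T of dimension 768.
v = np.random.randn(36, 768)   # placeholder region features
w = np.random.randn(20, 768)   # placeholder text embeddings
for _ in range(6):             # six co-attention layers, as in G3 (simplified)
    v, w = co_attention_layer(v, w)
print(v.shape, w.shape)        # (36, 768) (20, 768)
```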
CN202310768316.4A 2023-06-27 2023-06-27 Dam deformation evaluation method based on image-text multi-mode fusion Active CN116861361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310768316.4A CN116861361B (en) 2023-06-27 2023-06-27 Dam deformation evaluation method based on image-text multi-mode fusion

Publications (2)

Publication Number Publication Date
CN116861361A CN116861361A (en) 2023-10-10
CN116861361B true CN116861361B (en) 2024-05-03

Family

ID=88231403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310768316.4A Active CN116861361B (en) 2023-06-27 2023-06-27 Dam deformation evaluation method based on image-text multi-mode fusion

Country Status (1)

Country Link
CN (1) CN116861361B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117095363B (en) * 2023-10-20 2024-01-26 安能三局(成都)工程质量检测有限公司 Dam safety monitoring method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image
CN113220919A (en) * 2021-05-17 2021-08-06 河海大学 Dam defect image text cross-modal retrieval method and model
CN113658176A (en) * 2021-09-07 2021-11-16 重庆科技学院 Ceramic tile surface defect detection method based on interactive attention and convolutional neural network
WO2022027986A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Cross-modal person re-identification method and device
CN115331075A (en) * 2022-08-11 2022-11-11 杭州电子科技大学 Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
CN115761757A (en) * 2022-11-04 2023-03-07 福州大学 Multi-mode text page classification method based on decoupling feature guidance

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A review of applications of deep learning in medical imaging; Shi Jun; Wang Linlin; Wang Shanshan; Chen Yanxia; Wang Qian; Wei Dongming; Liang Shujun; Peng Jialin; Yi Jiajin; Liu Shengfeng; Ni Dong; Wang Mingliang; Zhang Daoqiang; Shen Dinggang; Journal of Image and Graphics, no. 10; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant