CN113239886A - Method and device for describing underground pipeline leakage based on cross-language image change description

Method and device for describing underground pipeline leakage based on cross-language image change description

Info

Publication number
CN113239886A
Authority
CN
China
Prior art keywords
image
module
cross
attention
convolutional
Prior art date
Legal status
Granted
Application number
CN202110626949.2A
Other languages
Chinese (zh)
Other versions
CN113239886B (en)
Inventor
胡迪
刘玉洁
罗辉
段章领
卫星
赵冲
赵明
陆阳
李航
帅竞贤
Current Assignee
Hefei University of Technology
Intelligent Manufacturing Institute of Hefei University Technology
Original Assignee
Hefei University of Technology
Intelligent Manufacturing Institute of Hefei University Technology
Priority date
Filing date
Publication date
Application filed by Hefei University of Technology, Intelligent Manufacturing Institute of Hefei University Technology
Priority to CN202110626949.2A
Publication of CN113239886A
Application granted
Publication of CN113239886B
Legal status: Active


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, by performing operations on regions, e.g. growing, shrinking or watersheds


Abstract

The invention discloses a method and device for describing underground pipeline leakage based on cross-language image change description. The method comprises the following steps: acquiring underground pipeline scene images and preprocessing the images to obtain a training set and a test set; constructing a cross-language image change description model based on a dual dynamic attention mechanism; training the model on the training set; and testing on the test set with the trained model to obtain the image description results. The invention has the advantage that the downhole pipeline leakage description is accurate.

Description

Method and device for describing underground pipeline leakage based on cross-language image change description
Technical Field
The invention relates to the field of underground pipeline change description, in particular to an underground pipeline leakage description method and device based on cross-language image change description.
Background
A mine is the generic term for the roadways, chambers, equipment, and surface buildings and structures that form the production system of an underground coal mine. Most of China's key coal mines are large and medium-sized mines, while most locally state-run coal mines are medium and small mines. With the rapid development of the national economy, China's demand for energy keeps increasing. Coal will serve as China's pillar energy source for a long time to come; the quantity demanded grows year by year, and the coal industry is trending toward large-scale operation. As a basic energy industry, coal has seen its investment scale grow with the current trend toward larger coal mine groups after decades of development, the average investment for a 300-kiloton/a mine being about 6-7 million yuan. Extensive, profit-driven extraction has gradually been replaced by large-scale mechanized production, which has given rise to various underground safety protection systems; the detection of underground pipeline leakage is one of these preventive measures.
Pipeline transportation is the fifth major mode of transportation after railway, highway, aviation and water transport, and it has unique advantages in transporting fluids such as oil and natural gas. However, as pipelines age, leaks occur frequently because of construction defects, corrosion and human damage, posing a great threat to people's lives, property and living environment. Pipeline leaks in underground scenes are concealed and hard to discover and handle in time; they consume a great deal of the maintenance and inspection personnel's time and energy, with very little effect.
There are many methods for detecting leakage in fluid transport pipelines, and many ways of classifying them. According to domestic and foreign literature from the last decade, the generally accepted classification methods mainly include: hardware-based and software-based methods, classification by measurement medium, classification by the location of the detection device, classification by detection object, classification based on signal processing, and so on.
The introduction of deep learning network models has pushed the field of computer vision further forward. A deep learning model learns adaptively from images and is an end-to-end detection method. With the advent of the big data era, the data sets used to train deep learning network models have been continuously enriched and improved, which has also promoted the development of deep-learning-based computer vision. Among these advances, change captioning, a cross-domain task between computer vision and natural language processing, has developed considerably. The main task in this field is to annotate images: the annotated images are processed as groups of two, the two images are compared in temporal order, and descriptive text conforming to the image content is generated; the main objects in the images are identified, and the change relations between the objects are also considered. Describing the underground mine pipeline scene through a change captioning model can assist patrol personnel in monitoring the downhole pipeline state in real time and provide timely early warning.
Chinese patent No. CN107013812B discloses a three-field-coupling pipeline leakage monitoring method comprising the following steps: constructing a pipeline three-field coupling sensing system, monitoring a no-load state simulation of the detected pipeline with the system, monitoring a normal working condition simulation of the detected pipeline with the system, simulating a leakage event with the system, modeling and learning a pipeline monitoring neural network, and monitoring pipeline leakage. That invention aims to provide a method for alarming, locating and judging the size of a pipeline leak which, by acquiring three parameters around the pipeline and establishing the interrelation between the detection parameters, can effectively reduce false alarms, avoid missed alarms, accurately locate the leakage point, provide the leakage size through a neural network algorithm, and provide a reliable basis for formulating a maintenance scheme. However, it describes the pipeline with data acquired by sensors; sensor data in downhole pipelines are unstable, and if a sensor fails or is damaged, the pipeline leakage description becomes inaccurate.
Disclosure of Invention
The invention aims to solve the technical problem that the underground pipeline leakage description method in the prior art is not accurate enough.
The invention solves the technical problems through the following technical means: a downhole pipeline leakage description method based on cross-language image change description, the method comprising:
step a: acquiring an underground pipeline scene image, and preprocessing the image to obtain a training set and a test set;
step b: constructing a cross-language image change description model based on a dual dynamic attention mechanism;
step c: training a cross-language image change description model based on a dual dynamic attention mechanism on a training set;
step d: and testing the test set by using a trained cross-language image change description model based on a dual dynamic attention mechanism to obtain an image description result.
The method collects downhole pipeline scene images instead of relying on sensor detection, which keeps the collected data reliable. It then constructs a cross-language image change description model based on a dual dynamic attention mechanism, trains the model, and finally uses the trained model to describe the pipeline leakage state, thereby ensuring an accurate description of downhole pipeline leakage.
Further, the step a comprises:
step a 1: installing a camera at the front end of the underground pipeline to acquire daily state video stream data of the underground pipeline;
step a 2: extracting key frames in video stream data according to a preset time interval and storing the key frames as an underground pipeline scene image;
step a 3: cutting all the underground pipeline scene images to 512 x 512 to obtain an image data set; dividing the image data set into a plurality of groups of two images each, where one image in each group is the pipeline non-leakage state image of the previous frame, and the other is the subsequent frame, either a changed image with leakage or an image without leakage change but with changes in other factors; annotating the images with the official COCO pycocotools package to obtain an annotated data set; and splitting the annotated data set into a training set and a test set at a ratio of 3:1.
Further, the step b comprises:
the cross-language image change description model based on the dual dynamic attention mechanism comprises an encoder, an RNN (recurrent neural network) embedded with a spatial attention mechanism, and a dynamic attention module and a labeling module based on a dynamic speaking mechanism, wherein both the dynamic attention module and the labeling module are recurrent models based on LSTM (long short-term memory). The training set or test set is input to the encoder, and the encoder is connected to the RNN network embedded with the spatial attention mechanism, which outputs the spatial attention result, namely the image positions that need attention. The RNN network embedded with the spatial attention mechanism is connected to the dynamic attention module, the dynamic attention module is connected to the labeling module, and the labeling module outputs and distributes the current word; the current word carries the time of attending to the images, namely when each image begins to be attended to.
Still further, the step b further comprises:

extracting the input image group features $(X_{bef}, X_{aft})$ using one ResNet-101 network as the encoder;

inputting the input image group features $(X_{bef}, X_{aft})$ into the RNN network embedded with the dual attention mechanism, differencing the encoded input image group features $(X_{bef}, X_{aft})$ by the formula $X_{aft} - X_{bef}$ to obtain the difference feature $X_{diff}$, and concatenating the obtained difference feature $X_{diff}$ with each of the input image group features $(X_{bef}, X_{aft})$ to obtain two different spatial attention image groups $A_{bef}$ and $A_{aft}$;

the LSTM decoder in the dynamic attention module takes the previous hidden state $h^{(c)}_{t-1}$ of the labeling module and $l_{bef}$, $l_{diff}$, $l_{aft}$ as input and predicts the attention weights $\alpha_t$; the attention weights $\alpha_t$ cumulatively sum the visual features to obtain the dynamically attended feature $l^{(t)}_{dyn}$; the dynamically attended feature $l^{(t)}_{dyn}$ and the previous word $x_{t-1}$ are input to the LSTM decoder of the labeling module to generate the current word distribution, which distributes the current word.
Further, the ResNet-101 network includes, connected in sequence, 1 conv1 convolutional layer, 3 conv2_x convolutional blocks, 4 conv3_x convolutional blocks, 23 conv4_x convolutional blocks, 3 conv5_x convolutional blocks and 1 fully connected layer. The conv1 layer is a 7 × 7 convolutional layer with a stride of 2; each conv2_x block consists of a 1 × 1 convolution with 64 filters, a 3 × 3 convolution with 64 filters and a 1 × 1 convolution with 256 filters; each conv3_x block consists of a 1 × 1 convolution with 128 filters, a 3 × 3 convolution with 128 filters and a 1 × 1 convolution with 512 filters; each conv4_x block consists of a 1 × 1 convolution with 256 filters, a 3 × 3 convolution with 256 filters and a 1 × 1 convolution with 1024 filters; and each conv5_x block consists of a 1 × 1 convolution with 512 filters, a 3 × 3 convolution with 512 filters and a 1 × 1 convolution with 2048 filters.
Still further, the step c includes:
initializing the training parameters;

inputting the input image group features $(X_{bef}, X_{aft})$ into the ResNet-101 network of the cross-language image change description model based on the dual dynamic attention mechanism, continuously updating the learning rate of the ResNet-101 network, the weight coefficients of the dynamic attention module and the weight coefficients of the labeling module, and stopping training when the loss function value is minimal, to obtain the trained cross-language image change description model based on the dual dynamic attention mechanism.
Further, the initialized training parameters include the initial learning rate, the maximum iteration number, the update gradient, the weight coefficients of the dynamic attention module and the weight coefficients of the labeling module, where the learning rate is updated by the formula

$$learningrate = lr_0 \times \left(1 - \frac{iter}{max\_iter}\right)^{power}$$

where $iter$ is the current iteration number, $max\_iter$ is the maximum iteration number, $power$ is the update gradient, $lr_0$ is the initial learning rate, and $learningrate$ is the current learning rate.
Further, the loss function is formulated as

$$L(\theta) = L_{XE} + \lambda_1 L_1 + \lambda_{ent} L_{ent}$$

$$L_{XE} = -\sum_t \log p_\theta(\omega_t \mid \omega_1, \ldots, \omega_{t-1})$$

$$L_{ent} = -\sum_t \sum_i \alpha_{t,i} \log \alpha_{t,i}$$

$$L_1 = \lVert W_c \rVert + \lVert W_{d2} \rVert$$

where $L_{XE}$ denotes the value obtained by minimizing the cross-entropy loss as the training target, $L_1$ denotes the regularization value, $L_{ent}$ denotes the entropy loss over the attention weights, $\lambda_1$ denotes a preset first hyperparameter, $\lambda_{ent}$ denotes a preset second hyperparameter, $p_\theta$ denotes the probability value, $W_c$ denotes the weight coefficients of the labeling module, $W_{d2}$ denotes the weight coefficients of the dynamic attention module, $\omega_t$ denotes the word output by the labeling module at time step $t$, and $\alpha_t$ denotes the attention weights of the dynamic attention module.
The invention also provides a device for describing the leakage of the underground pipeline based on cross-language image change description, which comprises:
the image preprocessing module is used for acquiring an image of a scene of the underground pipeline and preprocessing the image to obtain a training set and a test set;
the model construction module is used for constructing a cross-language image change description model based on a dual dynamic attention mechanism;
the model training module is used for training the cross-language image change description model based on the dual dynamic attention mechanism on a training set;
and the test module is used for testing the test set by utilizing the trained cross-language image change description model based on the dual dynamic attention mechanism to obtain an image description result.
Further, the image preprocessing module is further configured to:
step a 1: installing a camera at the front end of the underground pipeline to acquire daily state video stream data of the underground pipeline;
step a 2: extracting key frames in video stream data according to a preset time interval and storing the key frames as an underground pipeline scene image;
step a 3: cutting all the underground pipeline scene images to 512 x 512 to obtain an image data set; dividing the image data set into a plurality of groups of two images each, where one image in each group is the pipeline non-leakage state image of the previous frame, and the other is the subsequent frame, either a changed image with leakage or an image without leakage change but with changes in other factors; annotating the images with the official COCO pycocotools package to obtain an annotated data set; and splitting the annotated data set into a training set and a test set at a ratio of 3:1.
Further, the model building module is further configured to:
the cross-language image change description model based on the dual dynamic attention mechanism comprises an encoder, an RNN (recurrent neural network) embedded with a spatial attention mechanism, and a dynamic attention module and a labeling module based on a dynamic speaking mechanism, wherein both the dynamic attention module and the labeling module are recurrent models based on LSTM (long short-term memory). The training set or test set is input to the encoder, and the encoder is connected to the RNN network embedded with the spatial attention mechanism, which outputs the spatial attention result, namely the image positions that need attention. The RNN network embedded with the spatial attention mechanism is connected to the dynamic attention module, the dynamic attention module is connected to the labeling module, and the labeling module outputs and distributes the current word; the current word carries the time of attending to the images, namely when each image begins to be attended to.
Still further, the model building module is further configured to:

extract the input image group features $(X_{bef}, X_{aft})$ using one ResNet-101 network as the encoder;

input the input image group features $(X_{bef}, X_{aft})$ into the RNN network embedded with the dual attention mechanism, difference the encoded input image group features $(X_{bef}, X_{aft})$ by the formula $X_{aft} - X_{bef}$ to obtain the difference feature $X_{diff}$, and concatenate the obtained difference feature $X_{diff}$ with each of the input image group features $(X_{bef}, X_{aft})$ to obtain two different spatial attention image groups $A_{bef}$ and $A_{aft}$;

wherein the LSTM decoder in the dynamic attention module takes the previous hidden state $h^{(c)}_{t-1}$ of the labeling module and $l_{bef}$, $l_{diff}$, $l_{aft}$ as input and predicts the attention weights $\alpha_t$; the attention weights $\alpha_t$ cumulatively sum the visual features to obtain the dynamically attended feature $l^{(t)}_{dyn}$; and the dynamically attended feature $l^{(t)}_{dyn}$ and the previous word $x_{t-1}$ are input to the LSTM decoder of the labeling module to generate the current word distribution, which distributes the current word.
Further, the ResNet-101 network includes, connected in sequence, 1 conv1 convolutional layer, 3 conv2_x convolutional blocks, 4 conv3_x convolutional blocks, 23 conv4_x convolutional blocks, 3 conv5_x convolutional blocks and 1 fully connected layer. The conv1 layer is a 7 × 7 convolutional layer with a stride of 2; each conv2_x block consists of a 1 × 1 convolution with 64 filters, a 3 × 3 convolution with 64 filters and a 1 × 1 convolution with 256 filters; each conv3_x block consists of a 1 × 1 convolution with 128 filters, a 3 × 3 convolution with 128 filters and a 1 × 1 convolution with 512 filters; each conv4_x block consists of a 1 × 1 convolution with 256 filters, a 3 × 3 convolution with 256 filters and a 1 × 1 convolution with 1024 filters; and each conv5_x block consists of a 1 × 1 convolution with 512 filters, a 3 × 3 convolution with 512 filters and a 1 × 1 convolution with 2048 filters.
Still further, the model training module is further configured to:
initializing the training parameters;

inputting the input image group features $(X_{bef}, X_{aft})$ into the ResNet-101 network of the cross-language image change description model based on the dual dynamic attention mechanism, continuously updating the learning rate of the ResNet-101 network, the weight coefficients of the dynamic attention module and the weight coefficients of the labeling module, and stopping training when the loss function value is minimal, to obtain the trained cross-language image change description model based on the dual dynamic attention mechanism.
Further, the initialized training parameters include the initial learning rate, the maximum iteration number, the update gradient, the weight coefficients of the dynamic attention module and the weight coefficients of the labeling module, where the learning rate is updated by the formula

$$learningrate = lr_0 \times \left(1 - \frac{iter}{max\_iter}\right)^{power}$$

where $iter$ is the current iteration number, $max\_iter$ is the maximum iteration number, $power$ is the update gradient, $lr_0$ is the initial learning rate, and $learningrate$ is the current learning rate.
Further, the loss function is formulated as

$$L(\theta) = L_{XE} + \lambda_1 L_1 + \lambda_{ent} L_{ent}$$

$$L_{XE} = -\sum_t \log p_\theta(\omega_t \mid \omega_1, \ldots, \omega_{t-1})$$

$$L_{ent} = -\sum_t \sum_i \alpha_{t,i} \log \alpha_{t,i}$$

$$L_1 = \lVert W_c \rVert + \lVert W_{d2} \rVert$$

where $L_{XE}$ denotes the value obtained by minimizing the cross-entropy loss as the training target, $L_1$ denotes the regularization value, $L_{ent}$ denotes the entropy loss over the attention weights, $\lambda_1$ denotes a preset first hyperparameter, $\lambda_{ent}$ denotes a preset second hyperparameter, $p_\theta$ denotes the probability value, $W_c$ denotes the weight coefficients of the labeling module, $W_{d2}$ denotes the weight coefficients of the dynamic attention module, $\omega_t$ denotes the word output by the labeling module at time step $t$, and $\alpha_t$ denotes the attention weights of the dynamic attention module.
The invention has the advantages that:
(1) The method collects downhole pipeline scene images instead of relying on sensor detection, which keeps the collected data reliable. It then constructs a cross-language image change description model based on a dual dynamic attention mechanism, trains the model, and finally uses the trained model to describe the pipeline leakage state, thereby ensuring an accurate description of downhole pipeline leakage.
(2) The cross-language image change description model based on the dual dynamic attention mechanism is trained on a training set of annotated downhole pipeline state images. During training, the RNN network embedded with the spatial attention mechanism yields the spatial attention result, namely the image positions that need attention; the dynamic attention module and the labeling module output and distribute the current word, which carries the time of attending to the images, namely when each image begins to be attended to. The whole model finally generates a Chinese description of the target scene, the downhole pipeline state no longer needs to be checked by manual observation, and the description effect is good.
(3) The invention overcomes the problems of traditional downhole pipeline leakage state detection, such as the large amount of manual inspection required, the misjudgment of visual observation caused by the complex environment, and the inability of traditional monitoring equipment (e.g. sensor detection) to provide effective state information; it improves the accuracy of downhole pipeline leakage state detection and is better suited to complex industrial scenes.
Drawings
FIG. 1 is a flow chart of a method for describing a downhole tubular leak based on cross-language image change description, according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of ResNet-101 architecture in a cross-language image change description-based downhole tubing leak description method disclosed in an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a downhole pipeline state image acquisition process in the downhole pipeline leakage description method based on cross-language image change description disclosed in the embodiment of the present invention;
FIG. 4 is a flow chart illustrating the preprocessing of the downhole tubing state image in the cross-language image change description-based downhole tubing leak description method disclosed in the embodiments of the present invention;
FIG. 5 is a flowchart illustrating a processing of a preprocessed data set in a cross-language image change description-based downhole tubular leak description method according to an embodiment of the present invention;
FIG. 6 is a flowchart of model training in a cross-language image change description-based downhole tubing leak description method disclosed in an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a cross-language image change description model based on a dual dynamic attention mechanism in the cross-language image change description-based downhole tubular leakage description method disclosed in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1 and 2, a method for describing a downhole pipeline leakage based on cross-language image change description, the method comprising:
step a: acquiring an underground pipeline scene image, and preprocessing the image to obtain a training set and a test set; as shown in fig. 3, the specific process is as follows:
s11, mounting cameras at positions with the vertical distance h from the side face of the underground pipeline, wherein the focal length of the cameras is f, and the cameras can be mounted at multiple angles to achieve multi-directional observation of the underground pipeline;
s12, setting camera parameters, wherein the camera is set to adopt higher resolution to capture more characteristics of images because the industrial field environment is more complex and has great interference on the images acquired by the camera; setting a camera frame rate, and adopting a higher camera frame rate when the underground pipeline leaks to enable the acquired image to be clearer; and adjusting parameters such as saturation and contrast of the camera according to the underground light characteristics so as to achieve optimal shooting of underground pipeline state acquisition.
And S13, acquiring the underground pipeline state image from the video frame, setting a fixed time interval, extracting the key frame according to the specified time interval and converting the key frame into the image. The downhole tubular state image is a data source for a training set and a testing set.
As shown in fig. 4, the process of preprocessing the downhole pipe state image is as follows:
and S21, primarily screening the images, removing unqualified images such as excessive blur, excessive occlusion, excessive exposure, insufficient exposure and the like, and processing the images with the size resolution of 512 multiplied by 512.
S22, labeling the qualified images: the image data are annotated with the official COCO pycocotools package. The labeling rules follow the Amazon Mechanical Turk standard. The annotation data are stored in json format, and each image comprises the following label fields:

(1) info: includes the creation time of the data set, the download address, the version number, and so on;

(2) license: the data set usage terms;

(3) images: includes the filename, height and width of the picture, and the id of the caption corresponding to the picture;

(4) annotation: includes the id of the image, the id of the corresponding caption, and the 3 sentences corresponding to each picture.
And S23, splitting the labeled data set into a training set and a test set according to a certain proportion.
As shown in FIG. 5, S31, according to the Amazon Mechanical Turk standard, each image annotation description is checked manually, and descriptions that do not meet the standard are eliminated.

S32, according to the Amazon Mechanical Turk standard, the eliminated descriptions are supplemented; a sketch of this data-preparation pipeline is given below.
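As a concrete illustration of S13 and S21-S23, the following Python sketch extracts key frames from a camera stream, rescales them to 512 x 512, pairs consecutive frames, and performs the 3:1 split. The video path, frame interval, file names and pairing of consecutive frames are assumptions made for illustration, not specifics fixed by the patent.

```python
import json
import random
from pathlib import Path

import cv2


def extract_key_frames(video_path, out_dir, interval_s=60):
    """Save one 512x512 key frame every `interval_s` seconds (cf. S13/S21)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * interval_s))
    idx, saved = 0, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.resize(frame, (512, 512))
            path = out_dir / f"frame_{idx:08d}.png"
            cv2.imwrite(str(path), frame)
            saved.append(str(path))
        idx += 1
    cap.release()
    return saved


def build_pairs_and_split(frames, ratio=3):
    """Pair consecutive key frames as (before, after) and split 3:1 (cf. S23)."""
    pairs = [{"before": a, "after": b} for a, b in zip(frames, frames[1:])]
    random.shuffle(pairs)
    cut = len(pairs) * ratio // (ratio + 1)
    return pairs[:cut], pairs[cut:]


if __name__ == "__main__":
    frames = extract_key_frames("pipeline_cam.mp4", "frames", interval_s=60)
    train, test = build_pairs_and_split(frames)
    with open("split.json", "w") as f:
        json.dump({"train": train, "test": test}, f, indent=2)
```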
Step b: constructing a cross-language image change description model based on a dual dynamic attention mechanism. Specifically: an Encoder network and a Decoder network are selected first, and the hyperparameters for training the network are set. Alternative Encoder network types include LeNet, AlexNet, VGGNet-16, VGGNet-19, ResNet-50, ResNet-101, ResNet-152, GoogleNet, and so on. Starting from the VGG networks, the number of layers in neural networks has grown deeper and deeper; deeper networks can extract more features, but the vanishing gradient problem harms training. ResNet introduces a residual network structure, through which the vanishing gradient problem can be effectively alleviated. Alternative Decoder networks include RNN, LSTM, GRU, etc. Longer input sequences generally require a deeper network to capture long-term dependencies, but like general deep networks, RNNs are difficult to optimize, suffering from vanishing and exploding gradients. Under vanishing gradients, the interaction gradient decreases exponentially, so long-term dependency signals become very weak and are easily swamped by short-term fluctuations. The LSTM implements information retention and selection (forget gate and input gate) through its gate structure, allowing input information to propagate over long spans. The GRU simplifies the LSTM by merging the input gate and forget gate into an update gate (which decides what part of the hidden state to retain or discard); among the many LSTM variants, its performance and robustness are comparable to RNN and LSTM on many tasks. A single-layer LSTM is selected here, with hidden_size set to 512.
Setting the hyperparameters for training the neural network, including: the optimization method (SGD, AdaGrad, RMSProp, Adam), the initial learning rate, the weight decay rate, and so on.
In summary, the cross-language image change description model based on the dual dynamic attention mechanism constructed by the invention comprises an encoder, an RNN (recurrent neural network) embedded with a spatial attention mechanism, and a dynamic attention module and a labeling module based on a dynamic speaking mechanism; both the dynamic attention module and the labeling module are recurrent models based on LSTM. The training set or test set is input to the encoder, and the encoder is connected to the RNN network embedded with the spatial attention mechanism, which outputs the spatial attention result, namely the image positions that need attention. The RNN network embedded with the spatial attention mechanism is connected to the dynamic attention module, the dynamic attention module is connected to the labeling module, and the labeling module outputs and distributes the current word; the current word carries the time of attending to the images, namely when each image begins to be attended to.
The working process of the cross-language image change description model based on the dual dynamic attention mechanism is as follows. First, one ResNet-101 network is adopted as the encoder to extract the features of the input image group $(X_{bef}, X_{aft})$. The ResNet-101 network includes, connected in sequence, 1 conv1 convolutional layer, 3 conv2_x convolutional blocks, 4 conv3_x convolutional blocks, 23 conv4_x convolutional blocks, 3 conv5_x convolutional blocks and 1 fully connected layer. The conv1 layer is a 7 × 7 convolutional layer with a stride of 2; each conv2_x block consists of a 1 × 1 convolution with 64 filters, a 3 × 3 convolution with 64 filters and a 1 × 1 convolution with 256 filters; each conv3_x block consists of a 1 × 1 convolution with 128 filters, a 3 × 3 convolution with 128 filters and a 1 × 1 convolution with 512 filters; each conv4_x block consists of a 1 × 1 convolution with 256 filters, a 3 × 3 convolution with 256 filters and a 1 × 1 convolution with 1024 filters; and each conv5_x block consists of a 1 × 1 convolution with 512 filters, a 3 × 3 convolution with 512 filters and a 1 × 1 convolution with 2048 filters.
Then, the input image group features $(X_{bef}, X_{aft})$ are fed into the RNN network embedded with the dual attention mechanism. The encoded input image group features $(X_{bef}, X_{aft})$ are differenced by the formula $X_{aft} - X_{bef}$ to obtain the difference feature $X_{diff}$; the obtained difference feature $X_{diff}$ is concatenated with each of the input image group features $(X_{bef}, X_{aft})$ to obtain two different spatial attention image groups $A_{bef}$ and $A_{aft}$. The specific formulas are as follows:

$$X_{diff} = X_{aft} - X_{bef} \quad (1)$$

$$X'_{bef} = [X_{bef};\, X_{diff}]; \qquad X'_{aft} = [X_{aft};\, X_{diff}] \quad (2)$$

$$a_{bef} = \sigma(\mathrm{conv}_2(\mathrm{ReLU}(\mathrm{conv}_1(X'_{bef})))) \quad (3)$$

$$a_{aft} = \sigma(\mathrm{conv}_2(\mathrm{ReLU}(\mathrm{conv}_1(X'_{aft})))) \quad (4)$$

$$l_{bef} = \sum_{H,W} a_{bef} \odot X_{bef} \quad (5)$$

$$l_{aft} = \sum_{H,W} a_{aft} \odot X_{aft} \quad (6)$$
The above dual attention mechanism allows the system to process different images depending on the type of change and the amount of viewpoint movement, which is critical to detection. In order to describe a pipeline leak correctly, the model needs to locate and match the changing object in both images; attending to the pipeline state in only one image may lead to misjudging the leak and hurt the accuracy of the result. In a pipeline leak, the most obvious state change is a property change (e.g. color) that does not involve object displacement; a single attention may not suffice to correctly locate the changed object under viewpoint movement, while dual attention adapts well to this environment.

Finally, to successfully describe a change, the model should learn not only where to look in each image (spatial attention, predicted by the dual attention), but also when to look at each image (semantic attention). In effect, the model is expected to exhibit dynamic reasoning, through which it can learn when to focus on the "before" feature $l_{bef}$, the "after" feature $l_{aft}$, or the "difference" feature $l_{diff} = l_{aft} - l_{bef}$, and generate a word sequence accordingly, i.e. the final output Chinese description.
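Equations (1)-(6) translate almost directly into a small PyTorch module; a sketch follows. The channel widths and the single-channel attention maps are assumptions consistent with, but not fixed by, the formulas above.

```python
import torch
import torch.nn as nn


class DualAttention(nn.Module):
    """Spatial dual attention following Eqs. (1)-(6)."""

    def __init__(self, channels=2048, hidden=512):
        super().__init__()
        # conv1/conv2 of Eqs. (3)-(4); the input is [X; X_diff], hence 2*channels.
        self.conv1 = nn.Conv2d(2 * channels, hidden, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden, 1, kernel_size=1)

    def _attend(self, x, x_diff):
        a = torch.sigmoid(self.conv2(torch.relu(self.conv1(torch.cat([x, x_diff], 1)))))
        return (a * x).sum(dim=(2, 3))  # Eqs. (5)-(6): sum over H and W

    def forward(self, x_bef, x_aft):
        x_diff = x_aft - x_bef              # Eq. (1)
        l_bef = self._attend(x_bef, x_diff)  # Eqs. (2), (3), (5)
        l_aft = self._attend(x_aft, x_diff)  # Eqs. (2), (4), (6)
        l_diff = l_aft - l_bef               # the "difference" feature
        return l_bef, l_diff, l_aft
```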
Therefore, a dynamic attention module and a labeling module based on a dynamic speaking mechanism are designed. The LSTM decoder in the dynamic attention module takes the previous hidden state $h^{(c)}_{t-1}$ of the labeling module and $l_{bef}$, $l_{diff}$, $l_{aft}$ as input and predicts the attention weights $\alpha_t$; the attention weights $\alpha_t$ cumulatively sum the visual features to obtain the dynamically attended feature $l^{(t)}_{dyn}$; the dynamically attended feature $l^{(t)}_{dyn}$ and the previous word $x_{t-1}$ are input to the LSTM decoder of the labeling module to generate the current word distribution, which distributes the current word. The specific formulas are as follows:

$$l^{(t)}_{dyn} = \sum_{i} \alpha_{t,i}\, l_i \quad (7)$$

$$u_t = [l_{bef};\, l_{diff};\, l_{aft};\, h^{(c)}_{t-1}] \quad (8)$$

$$v_t = \mathrm{ReLU}(W_{d1} u_t + b_{d1}) \quad (9)$$

$$h^{(d)}_t = \mathrm{LSTM}^{(d)}(v_t,\, h^{(d)}_{t-1}) \quad (10)$$

$$\alpha_t = \mathrm{softmax}(W_{d2} h^{(d)}_t + b_{d2}) \quad (11)$$

where $l_i$ ranges over the visual features $l_{bef}$, $l_{diff}$, $l_{aft}$ at time $t$, $h^{(d)}_t$ and $h^{(c)}_t$ are the LSTM outputs of the dynamic attention module and the labeling module at decoder time step $t$, respectively, and $W_{d1}$, $b_{d1}$, $W_{d2}$, $b_{d2}$ are learnable parameters. The dynamically attended feature $l^{(t)}_{dyn}$ is obtained from equation (7) using the attention weights predicted by equation (11).
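A sketch of the dynamic attention module of equations (7)-(11) follows. Since equations (8)-(10) are reconstructed from the surrounding definitions, the exact placement of the learnable parameters W_d1, b_d1 as an input projection is an assumption.

```python
import torch
import torch.nn as nn


class DynamicAttention(nn.Module):
    """Dynamic attention over l_bef, l_diff, l_aft, following Eqs. (7)-(11)."""

    def __init__(self, feat=2048, hidden=512):
        super().__init__()
        self.fc_in = nn.Linear(3 * feat + hidden, hidden)  # W_d1, b_d1 (assumed role)
        self.lstm = nn.LSTMCell(hidden, hidden)            # LSTM^(d)
        self.fc_out = nn.Linear(hidden, 3)                 # W_d2, b_d2

    def forward(self, l_bef, l_diff, l_aft, h_c_prev, state):
        u = torch.cat([l_bef, l_diff, l_aft, h_c_prev], dim=1)  # Eq. (8)
        h_d, c_d = self.lstm(torch.relu(self.fc_in(u)), state)  # Eqs. (9)-(10)
        alpha = torch.softmax(self.fc_out(h_d), dim=1)          # Eq. (11)
        l_all = torch.stack([l_bef, l_diff, l_aft], dim=1)      # (N, 3, feat)
        l_dyn = (alpha.unsqueeze(-1) * l_all).sum(dim=1)        # Eq. (7)
        return l_dyn, alpha, (h_d, c_d)
```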
Finally, $l^{(t)}_{dyn}$ and the previous word $x_{t-1}$ are input to the LSTM decoder of the labeling module, and the next word is distributed:

$$x_{t-1} = E\, \omega_{t-1} \quad (12)$$

$$c_t = [l^{(t)}_{dyn};\, x_{t-1}] \quad (13)$$

$$h^{(c)}_t = \mathrm{LSTM}^{(c)}(c_t,\, h^{(c)}_{t-1}) \quad (14)$$

$$\omega_t \sim \mathrm{softmax}(W_c h^{(c)}_t + b_c) \quad (15)$$

where $\omega_{t-1}$ is the one-hot encoding of the previous word and $E$ is the embedding layer; $x_{t-1}$ is the encoding value of the previous word at the embedding layer; and $c_t$ is the concatenation of $l^{(t)}_{dyn}$ and the encoded previous word $x_{t-1}$, which is then input to the LSTM decoder of the labeling module to begin generating the next word distribution. The two decoders predict each word in parallel and keep interacting with each other.
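The labeling-module step of equations (12)-(15) might be sketched as follows; the vocabulary size and embedding width are placeholders, not values given in the patent.

```python
import torch
import torch.nn as nn


class LabelingModule(nn.Module):
    """Caption ("labeling") LSTM decoder following Eqs. (12)-(15)."""

    def __init__(self, vocab_size=5000, embed=300, feat=2048, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)   # E of Eq. (12)
        self.lstm = nn.LSTMCell(embed + feat, hidden)  # LSTM^(c) of Eq. (14)
        self.fc_word = nn.Linear(hidden, vocab_size)   # W_c, b_c of Eq. (15)

    def step(self, word_prev, l_dyn, state):
        x_prev = self.embed(word_prev)           # Eq. (12): embed the previous word
        c_t = torch.cat([l_dyn, x_prev], dim=1)  # Eq. (13): concatenate
        h_c, c_c = self.lstm(c_t, state)         # Eq. (14)
        logits = self.fc_word(h_c)               # Eq. (15): word distribution
        return logits, (h_c, c_c)
```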
The inputs $h_t$ and $\hat{z}_t$ at each time step are computed following the baseline model. Using $T_{s,t}: \mathbb{R}^s \rightarrow \mathbb{R}^t$ to denote an affine transformation with learned parameters:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\,4n} \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_t \end{pmatrix}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$

$$h_t = o_t \odot \tanh(c_t)$$

where $i_t$, $f_t$, $c_t$, $o_t$, $h_t$ are the input, forget, memory, output and hidden states of the LSTM, respectively. The vector $\hat{z}_t \in \mathbb{R}^D$ is an image vector that captures the visual information associated with a particular input location, as described below. $E \in \mathbb{R}^{m \times K}$ is an embedding matrix. Let $m$ and $n$ denote the embedding dimension and the LSTM dimension, respectively, and let $\sigma$ and $\odot$ denote the logistic sigmoid activation and element-wise multiplication, respectively.
Step c: training a cross-language image change description model based on a dual dynamic attention mechanism on a training set; the specific process is as follows:
initializing the training parameters;

inputting the input image group features $(X_{bef}, X_{aft})$ into the ResNet-101 network of the cross-language image change description model based on the dual dynamic attention mechanism, continuously updating the learning rate of the ResNet-101 network, the weight coefficients of the dynamic attention module and the weight coefficients of the labeling module, and stopping training when the loss function value is minimal, to obtain the trained cross-language image change description model based on the dual dynamic attention mechanism.
The initialized training parameters include the initial learning rate, the maximum iteration number, the update gradient, the weight coefficients of the dynamic attention module and the weight coefficients of the labeling module, and the learning rate is updated by the formula

$$learningrate = lr_0 \times \left(1 - \frac{iter}{max\_iter}\right)^{power}$$

where $iter$ is the current iteration number, $max\_iter$ is the maximum iteration number, $power$ is the update gradient, $lr_0$ is the initial learning rate, and $learningrate$ is the current learning rate. In this example, the training batch size is batchsize = 4, and the maximum number of iterations is set to 30000. Momentum is 0.9, and the initial learning rate is set to 0.001. The learning rate is adjusted with the inv strategy during model training.
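A small helper reflecting the update formula above under the example settings (initial learning rate 0.001, max_iter 30000); the value of power is not given in the text, so the 0.9 below is a placeholder.

```python
def poly_lr(base_lr, it, max_iter=30000, power=0.9):
    """Learning-rate decay reconstructed from the variables listed above."""
    return base_lr * (1.0 - it / max_iter) ** power


# Example settings from this embodiment: base_lr = 0.001, max_iter = 30000.
for it in (0, 15000, 29999):
    print(it, poly_lr(0.001, it))
```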
As shown in fig. 6, the ResNet-101 network weights are initialized. The weights of all layers except the last layer of the network are initialized without bias, i.e. the bias is 0 and the weights follow a Gaussian distribution with σ = 0.01. The weight parameters of the last layer of the network take the unbalanced sample distribution into account, and during weight initialization the bias is set by the formula

$$b = -\log\left(\frac{1-\pi}{\pi}\right)$$

where $\pi$ is a hyperparameter, set to 0.01 in this example; this changed model initialization strategy ensures that the model does not deflect toward the more numerous negative samples.
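A sketch of the described last-layer initialization. The Gaussian σ = 0.01 for the weights is stated above, while the closed form b = -log((1 - π)/π) for the bias is an assumption reconstructed from the role of π, since the formula image is not reproduced in the source text.

```python
import math

import torch


def init_last_layer(fc, pi=0.01):
    """Initialize the final layer so the model does not deflect to negative samples."""
    torch.nn.init.normal_(fc.weight, mean=0.0, std=0.01)  # Gaussian, sigma = 0.01
    # Assumed closed form for the prior-probability bias, reconstructed from pi.
    torch.nn.init.constant_(fc.bias, -math.log((1.0 - pi) / pi))
```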
The model stops training when the optimal solution is found using the following loss function:

$$L(\theta) = L_{XE} + \lambda_1 L_1 + \lambda_{ent} L_{ent}$$

$$L_{XE} = -\sum_t \log p_\theta(\omega_t \mid \omega_1, \ldots, \omega_{t-1})$$

$$L_{ent} = -\sum_t \sum_i \alpha_{t,i} \log \alpha_{t,i}$$

$$L_1 = \lVert W_c \rVert + \lVert W_{d2} \rVert$$

where $L_{XE}$ denotes the value obtained by minimizing the cross-entropy loss as the training target, $L_1$ denotes the regularization value, $L_{ent}$ denotes the entropy loss over the attention weights, $\lambda_1$ denotes a preset first hyperparameter, and $\lambda_{ent}$ denotes a preset second hyperparameter. $p_\theta$ denotes the probability value at the initial time. $W_c$, $b_c$ and $W_{d2}$, $b_{d2}$ are all given initial values. The process enters the dual attention module, substitutes the initial values of $W_{d2}$, $b_{d2}$ into formula (11) to obtain the initial $\alpha_t$, and obtains the initial $L_{ent}$ from the initial $\alpha_t$; it then enters the dynamic speaking mechanism, substitutes the initial values of $W_c$, $b_c$ into formula (15) to obtain the initial $\omega_t$, and obtains the initial $L_{XE}$ from the initial $\omega_t$. The initial $L_1$ is then calculated from the initial $W_c$ and $W_{d2}$, and the initial loss value is obtained from the initial $L_{XE}$, $L_{ent}$ and $L_1$. $W_c$ and $W_{d2}$ are then updated separately by back propagation, a loss value being obtained in each update; updating stops when the loss function finds the optimal solution, the parameters are fixed, and substituting them into formulas (11) and (15) yields the finally trained model.
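A hedged sketch of one evaluation of L(θ) = L_XE + λ1·L1 + λent·Lent as reconstructed above; the λ values, the norm used for ‖W‖, and the sign convention of the entropy term are assumptions not fixed by the source text.

```python
import torch
import torch.nn.functional as F


def total_loss(logits, targets, alphas, W_c, W_d2, lam1=1e-4, lam_ent=1e-4):
    """One loss evaluation; lam1/lam_ent are placeholder hyperparameters."""
    l_xe = F.cross_entropy(logits, targets)                 # L_XE over word targets
    entropy = -(alphas * (alphas + 1e-8).log()).sum(dim=1)  # entropy of alpha_t
    l_ent = entropy.mean()                                  # sign convention assumed
    l_1 = W_c.norm() + W_d2.norm()                          # L1 = ||W_c|| + ||W_d2||
    return l_xe + lam1 * l_1 + lam_ent * l_ent
```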
Step d: and testing the test set by using a trained cross-language image change description model based on a dual dynamic attention mechanism to obtain an image description result. FIG. 7 is a schematic diagram of a cross-language image change description model architecture according to the present invention.
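For step d, a greedy decoding loop tying the sketched modules together might look as follows; the BOS/EOS token ids, hidden size and maximum caption length are hypothetical choices, not values from the patent.

```python
import torch


@torch.no_grad()
def describe_pair(enc, dual, dyn, labeler, img_bef, img_aft,
                  bos=1, eos=2, max_len=20, hid=512):
    """Greedy decoding sketch using the encoder and attention modules above."""
    x_bef, x_aft = enc(img_bef, img_aft)
    l_bef, l_diff, l_aft = dual(x_bef, x_aft)
    n = img_bef.size(0)
    h_d = (torch.zeros(n, hid), torch.zeros(n, hid))  # dynamic attention state
    h_c = (torch.zeros(n, hid), torch.zeros(n, hid))  # labeling module state
    word = torch.full((n,), bos, dtype=torch.long)
    out = []
    for _ in range(max_len):
        l_dyn, _alpha, h_d = dyn(l_bef, l_diff, l_aft, h_c[0], h_d)
        logits, h_c = labeler.step(word, l_dyn, h_c)
        word = logits.argmax(dim=1)
        out.append(word)
        if (word == eos).all():
            break
    return torch.stack(out, dim=1)  # token ids, to be mapped to Chinese words
```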
Through the above technical scheme, the method collects downhole pipeline scene images instead of relying on sensor detection, which keeps the collected data reliable; it constructs and trains a cross-language image change description model based on a dual dynamic attention mechanism, and finally uses the trained model to describe the pipeline leakage state, ensuring an accurate description of downhole pipeline leakage.
Example 2
The invention also provides a device for describing the leakage of the underground pipeline based on cross-language image change description, which comprises:
the image preprocessing module is used for acquiring an image of a scene of the underground pipeline and preprocessing the image to obtain a training set and a test set;
the model construction module is used for constructing a cross-language image change description model based on a dual dynamic attention mechanism;
the model training module is used for training the cross-language image change description model based on the dual dynamic attention mechanism on a training set;
and the test module is used for testing the test set by utilizing the trained cross-language image change description model based on the dual dynamic attention mechanism to obtain an image description result.
Specifically, the image preprocessing module is further configured to:
step a 1: installing a camera at the front end of the underground pipeline to acquire daily state video stream data of the underground pipeline;
step a 2: extracting key frames in video stream data according to a preset time interval and storing the key frames as an underground pipeline scene image;
step a 3: cutting all the underground pipeline scene images to 512 x 512 to obtain an image data set; dividing the image data set into a plurality of groups of two images each, where one image in each group is the pipeline non-leakage state image of the previous frame, and the other is the subsequent frame, either a changed image with leakage or an image without leakage change but with changes in other factors; annotating the images with the official COCO pycocotools package to obtain an annotated data set; and splitting the annotated data set into a training set and a test set at a ratio of 3:1.
Specifically, the model building module is further configured to:
the cross-language image change description model based on the dual dynamic attention mechanism comprises an encoder, an RNN (recurrent neural network) embedded with a spatial attention mechanism, and a dynamic attention module and a labeling module based on a dynamic speaking mechanism, wherein both the dynamic attention module and the labeling module are recurrent models based on LSTM (long short-term memory). The training set or test set is input to the encoder, and the encoder is connected to the RNN network embedded with the spatial attention mechanism, which outputs the spatial attention result, namely the image positions that need attention. The RNN network embedded with the spatial attention mechanism is connected to the dynamic attention module, the dynamic attention module is connected to the labeling module, and the labeling module outputs and distributes the current word; the current word carries the time of attending to the images, namely when each image begins to be attended to.
More specifically, the model building module is further configured to:

extract the input image group features $(X_{bef}, X_{aft})$ using one ResNet-101 network as the encoder;

input the input image group features $(X_{bef}, X_{aft})$ into the RNN network embedded with the dual attention mechanism, difference the encoded input image group features $(X_{bef}, X_{aft})$ by the formula $X_{aft} - X_{bef}$ to obtain the difference feature $X_{diff}$, and concatenate the obtained difference feature $X_{diff}$ with each of the input image group features $(X_{bef}, X_{aft})$ to obtain two different spatial attention image groups $A_{bef}$ and $A_{aft}$;

wherein the LSTM decoder in the dynamic attention module takes the previous hidden state $h^{(c)}_{t-1}$ of the labeling module and $l_{bef}$, $l_{diff}$, $l_{aft}$ as input and predicts the attention weights $\alpha_t$; the attention weights $\alpha_t$ cumulatively sum the visual features to obtain the dynamically attended feature $l^{(t)}_{dyn}$; and the dynamically attended feature $l^{(t)}_{dyn}$ and the previous word $x_{t-1}$ are input to the LSTM decoder of the labeling module to generate the current word distribution, which distributes the current word.
More specifically, the ResNet-101 network includes, connected in sequence, 1 conv1 convolutional layer, 3 conv2_x convolutional blocks, 4 conv3_x convolutional blocks, 23 conv4_x convolutional blocks, 3 conv5_x convolutional blocks and 1 fully connected layer. The conv1 layer is a 7 × 7 convolutional layer with a stride of 2; each conv2_x block consists of a 1 × 1 convolution with 64 filters, a 3 × 3 convolution with 64 filters and a 1 × 1 convolution with 256 filters; each conv3_x block consists of a 1 × 1 convolution with 128 filters, a 3 × 3 convolution with 128 filters and a 1 × 1 convolution with 512 filters; each conv4_x block consists of a 1 × 1 convolution with 256 filters, a 3 × 3 convolution with 256 filters and a 1 × 1 convolution with 1024 filters; and each conv5_x block consists of a 1 × 1 convolution with 512 filters, a 3 × 3 convolution with 512 filters and a 1 × 1 convolution with 2048 filters.
More specifically, the model training module is further configured to:
initializing the training parameters;

inputting the input image group features $(X_{bef}, X_{aft})$ into the ResNet-101 network of the cross-language image change description model based on the dual dynamic attention mechanism, continuously updating the learning rate of the ResNet-101 network, the weight coefficients of the dynamic attention module and the weight coefficients of the labeling module, and stopping training when the loss function value is minimal, to obtain the trained cross-language image change description model based on the dual dynamic attention mechanism.
More specifically, the initialized training parameters include the initial learning rate, the maximum iteration number, the update gradient, the weight coefficients of the dynamic attention module and the weight coefficients of the labeling module, and the learning rate is updated by the formula

$$learningrate = lr_0 \times \left(1 - \frac{iter}{max\_iter}\right)^{power}$$

where $iter$ is the current iteration number, $max\_iter$ is the maximum iteration number, $power$ is the update gradient, $lr_0$ is the initial learning rate, and $learningrate$ is the current learning rate.
More specifically, the loss function is formulated as

$$L(\theta) = L_{XE} + \lambda_1 L_1 + \lambda_{ent} L_{ent}$$

$$L_{XE} = -\sum_t \log p_\theta(\omega_t \mid \omega_1, \ldots, \omega_{t-1})$$

$$L_{ent} = -\sum_t \sum_i \alpha_{t,i} \log \alpha_{t,i}$$

$$L_1 = \lVert W_c \rVert + \lVert W_{d2} \rVert$$

where $L_{XE}$ denotes the value obtained by minimizing the cross-entropy loss as the training target, $L_1$ denotes the regularization value, $L_{ent}$ denotes the entropy loss over the attention weights, $\lambda_1$ denotes a preset first hyperparameter, $\lambda_{ent}$ denotes a preset second hyperparameter, $p_\theta$ denotes the probability value, $W_c$ denotes the weight coefficients of the labeling module, $W_{d2}$ denotes the weight coefficients of the dynamic attention module, $\omega_t$ denotes the word output by the labeling module at time step $t$, and $\alpha_t$ denotes the attention weights of the dynamic attention module.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A downhole pipeline leakage description method based on cross-language image change description is characterized by comprising the following steps:
step a: acquiring an underground pipeline scene image, and preprocessing the image to obtain a training set and a test set;
step b: constructing a cross-language image change description model based on a dual dynamic attention mechanism;
step c: training a cross-language image change description model based on a dual dynamic attention mechanism on a training set;
step d: and testing the test set by using a trained cross-language image change description model based on a dual dynamic attention mechanism to obtain an image description result.
2. The method for describing the leakage of the underground pipeline based on the cross-language image change description according to claim 1, wherein the step a comprises the following steps:
step a 1: installing a camera at the front end of the underground pipeline to acquire daily state video stream data of the underground pipeline;
step a 2: extracting key frames in video stream data according to a preset time interval and storing the key frames as an underground pipeline scene image;
step a 3: cutting all the underground pipeline scene images to 512 x 512 to obtain an image data set; dividing the image data set into a plurality of groups of two images each, where one image in each group is the pipeline non-leakage state image of the previous frame, and the other is the subsequent frame, either a changed image with leakage or an image without leakage change but with changes in other factors; annotating the images with the official COCO pycocotools package to obtain an annotated data set; and splitting the annotated data set into a training set and a test set at a ratio of 3:1.
3. The method for describing the leakage of the underground pipeline based on the cross-language image change description according to claim 1, wherein the step b comprises the following steps:
the cross-language image change description model based on the dual dynamic attention mechanism comprises an encoder, an RNN (recurrent neural network) embedded with a spatial attention mechanism, and a dynamic attention module and a labeling module based on a dynamic speaking mechanism, wherein both the dynamic attention module and the labeling module are recurrent models based on LSTM (long short-term memory); the training set or test set is input to the encoder, and the encoder is connected to the RNN network embedded with the spatial attention mechanism, which outputs the spatial attention result, namely the image positions that need attention; the RNN network embedded with the spatial attention mechanism is connected to the dynamic attention module, the dynamic attention module is connected to the labeling module, and the labeling module outputs and distributes the current word, the current word carrying the time of attending to the images, namely when each image begins to be attended to.
4. The method for describing the leakage of the underground pipeline based on the cross-language image change description according to claim 3, wherein the step b further comprises:
extraction of input image set features (X) using 1 ResNet-101 network as encoderbef,Xaft);
Input image set features (X)bef,Xaft) Inputting the image into an RNN network embedded with a double attention mechanism, and performing image feature (X) on the coded input image groupbef,Xaft) By the formula Xaft-XbefDifference is made to obtain difference characteristic Xdiff(ii) a The obtained difference characteristic XdiffRespectively with input image group characteristics (X)bef,Xaft) Connecting to obtain two different space attention image groups AbefAnd Aaft
The LSTM decoder in the dynamic attention module will tag the previous hidden state of the module
Figure FDA0003101751830000021
And lbef、ldiff、laftAs an input, predicting attention weights
Figure FDA0003101751830000022
Attention is paid to the weight
Figure FDA0003101751830000023
Cumulatively summing visual features to obtain dynamic engagement features
Figure FDA0003101751830000024
Dynamic engagement feature
Figure FDA0003101751830000025
And the previous word xt-1Inputting the word into LSTM decoder of labeling module to generate current word distribution for distributing current word.
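Below is a hedged PyTorch sketch of one decoding step of this claim: the dynamic-attention LSTM predicts α_t over the three localized features, and the labeling LSTM consumes the attended feature plus the previous word. Hidden sizes, the word embedding, and the linear heads are assumptions, not claimed values.

```python
import torch
import torch.nn as nn

class DualDynamicDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=1024, hid_dim=512, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.att_lstm = nn.LSTMCell(hid_dim + 3 * feat_dim, hid_dim)
        self.att_head = nn.Linear(hid_dim, 3)      # weights over bef/diff/aft
        self.spk_lstm = nn.LSTMCell(feat_dim + emb_dim, hid_dim)
        self.word_head = nn.Linear(hid_dim, vocab_size)

    def step(self, l_bef, l_diff, l_aft, prev_word, att_state, spk_state):
        # Previous labeling-module hidden state + localized features -> alpha_t.
        x = torch.cat([spk_state[0], l_bef, l_diff, l_aft], dim=1)
        att_state = self.att_lstm(x, att_state)
        alpha = torch.softmax(self.att_head(att_state[0]), dim=-1)  # (B, 3)
        feats = torch.stack([l_bef, l_diff, l_aft], dim=1)          # (B, 3, C)
        l_dyn = (alpha.unsqueeze(-1) * feats).sum(1)  # dynamically attended feature
        # Labeling module: attended feature + previous word -> word distribution.
        spk_in = torch.cat([l_dyn, self.embed(prev_word)], dim=1)
        spk_state = self.spk_lstm(spk_in, spk_state)
        return self.word_head(spk_state[0]), alpha, att_state, spk_state
```

At t = 0 both LSTM states can be zero-initialized and prev_word set to a start-of-sentence token.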
5. The method for describing the leakage of the underground pipeline based on the cross-language image change description according to claim 4, wherein the ResNet-101 network comprises, connected in sequence, 1 conv1 convolutional layer, 3 conv2_x convolutional blocks, 4 conv3_x convolutional blocks, 23 conv4_x convolutional blocks, 3 conv5_x convolutional blocks, and 1 fully connected layer; the conv1 layer is a 7 × 7 convolution with 64 kernels and a stride of 2; each conv2_x block consists of a 1 × 1 convolution with 64 kernels, a 3 × 3 convolution with 64 kernels, and a 1 × 1 convolution with 256 kernels; each conv3_x block consists of a 1 × 1 convolution with 128 kernels, a 3 × 3 convolution with 128 kernels, and a 1 × 1 convolution with 512 kernels; each conv4_x block consists of a 1 × 1 convolution with 256 kernels, a 3 × 3 convolution with 256 kernels, and a 1 × 1 convolution with 1024 kernels; and each conv5_x block consists of a 1 × 1 convolution with 512 kernels, a 3 × 3 convolution with 512 kernels, and a 1 × 1 convolution with 2048 kernels.
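Since this layout matches the standard ResNet-101, an implementation can simply reuse a library backbone and drop the classification head; a sketch assuming torchvision:

```python
import torch.nn as nn
from torchvision.models import resnet101

backbone = resnet101(weights=None)  # conv1, conv2_x..conv5_x (3/4/23/3 blocks), fc
encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
# A 512 x 512 input yields a (2048, 16, 16) feature map from conv5_x.
```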
6. The method for describing the leakage of the underground pipeline based on the cross-language image change description according to claim 4, wherein the step c comprises the following steps:
initializing the training parameters;
inputting the image group features (X_bef, X_aft) into the ResNet-101 network of the cross-language image change description model based on the dual dynamic attention mechanism; continuously updating the learning rate of the ResNet-101 network, the weight coefficients of the dynamic attention module, and the weight coefficients of the labeling module; and stopping training when the loss function value reaches its minimum, to obtain the trained cross-language image change description model based on the dual dynamic attention mechanism.
7. The method for describing the leakage of the underground pipeline based on the cross-language image change description according to claim 6, wherein the initialized training parameters comprise an initial learning rate, a maximum iteration number, an update gradient, initial weight coefficients of the dynamic attention module, and initial weight coefficients of the labeling module, and the learning rate is updated by the formula
learning_rate = base_lr × (1 − iter / max_iter)^power
wherein iter is the current iteration number, max_iter is the maximum iteration number, power is the update gradient, base_lr is the initial learning rate, and learning_rate is the current learning rate.
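This update rule is the standard "poly" decay. A sketch using PyTorch's LambdaLR, with the base learning rate, iteration budget, and power chosen here purely for illustration:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

base_lr, max_iter, power = 1e-4, 50000, 0.9     # illustrative values
params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for model parameters
optimizer = torch.optim.SGD(params, lr=base_lr)
# learning_rate = base_lr * (1 - iter / max_iter) ** power
scheduler = LambdaLR(optimizer, lambda it: (1 - it / max_iter) ** power)

for it in range(max_iter):
    # ... forward pass and loss.backward() on the captioning loss, then:
    optimizer.step()
    scheduler.step()                            # apply the poly decay per iteration
```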
8. The method for describing the leakage of the underground pipeline based on the cross-language image change description according to claim 6, wherein the loss function is formulated as
L(θ) = L_XE + λ_1 L_1 + λ_ent L_ent
L_XE = −Σ_t log p_θ(ω_t | ω_1, …, ω_{t−1})
L_ent = −Σ_t α_t log α_t
L_1 = ||W_c|| + ||W_d2||
wherein L_XE represents the cross-entropy loss over the training target words, L_1 represents the regularization term, L_ent represents the entropy term over the attention weights, λ_1 represents a preset first hyperparameter, λ_ent represents a preset second hyperparameter, p_θ represents the probability predicted by the model, W_c represents the weight coefficients of the labeling module, W_d2 represents the weight coefficients of the dynamic attention module, ω_t represents the word output by the labeling module at time step t, and α_t represents the attention weight of the dynamic attention module.
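A sketch of this objective in PyTorch follows. The exact forms of L_XE and L_ent in the patent are given as formula images, so the word-level cross entropy, the L1 reading of ||·||, and the sign convention of the entropy term here are assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(word_logits, targets, alpha, W_c, W_d2,
               lambda1=1e-4, lambda_ent=1e-3):   # hyperparameter values assumed
    # L_XE: cross entropy of the predicted word distribution p_theta
    l_xe = F.cross_entropy(word_logits, targets)
    # L_1 = ||W_c|| + ||W_d2||, read here as the L1 norm of the module weights
    l1 = W_c.abs().sum() + W_d2.abs().sum()
    # L_ent: entropy of the attention weights alpha_t (sign convention assumed)
    l_ent = -(alpha * (alpha + 1e-8).log()).sum(dim=-1).mean()
    return l_xe + lambda1 * l1 + lambda_ent * l_ent
```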
9. An underground pipeline leakage description apparatus based on cross-language image change description, characterized in that the apparatus comprises:
the image preprocessing module is used for acquiring an image of a scene of the underground pipeline and preprocessing the image to obtain a training set and a test set;
the model construction module is used for constructing a cross-language image change description model based on a dual dynamic attention mechanism;
the model training module is used for training the cross-language image change description model based on the dual dynamic attention mechanism on a training set;
and the test module is used for testing the test set by utilizing the trained cross-language image change description model based on the dual dynamic attention mechanism to obtain an image description result.
10. The apparatus for describing the leakage of the underground pipeline based on the cross-language image change description according to claim 9, wherein the image preprocessing module is further configured to:
step a 1: installing a camera at the front end of the underground pipeline to acquire daily state video stream data of the underground pipeline;
step a 2: extracting key frames in video stream data according to a preset time interval and storing the key frames as an underground pipeline scene image;
step a 3: cropping all the underground pipeline scene images to 512 × 512 to obtain an image data set; dividing the image data set into a plurality of groups of two images each, wherein one image in each group is a leak-free pipeline state image from a previous frame, and the other is a subsequent frame that either shows a leakage change or shows no leakage change but changes in other factors; labeling the images with the official COCO pycocotools package to obtain a labeled data set; and dividing the labeled data set into a training set and a test set at a ratio of 3:1.
CN202110626949.2A 2021-06-04 2021-06-04 Underground pipeline leakage description method and device based on cross-language image change description Active CN113239886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110626949.2A CN113239886B (en) 2021-06-04 2021-06-04 Underground pipeline leakage description method and device based on cross-language image change description

Publications (2)

Publication Number Publication Date
CN113239886A true CN113239886A (en) 2021-08-10
CN113239886B CN113239886B (en) 2024-03-19

Family

ID=77136997

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110626949.2A Active CN113239886B (en) 2021-06-04 2021-06-04 Underground pipeline leakage description method and device based on cross-language image change description

Country Status (1)

Country Link
CN (1) CN113239886B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067103A (en) * 2021-11-23 2022-02-18 南京工业大学 Intelligent pipeline third party damage identification method based on YOLOv3
CN114577410A (en) * 2022-03-04 2022-06-03 浙江蓝能燃气设备有限公司 Automatic leakage detection system for helium leakage of bottle group container and application method

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170330153A1 (en) * 2014-05-13 2017-11-16 Monster Worldwide, Inc. Search Extraction Matching, Draw Attention-Fit Modality, Application Morphing, and Informed Apply Apparatuses, Methods and Systems
US20190005069A1 (en) * 2017-06-28 2019-01-03 Google Inc. Image Retrieval with Deep Local Feature Descriptors and Attention-Based Keypoint Descriptors
US20200034948A1 (en) * 2018-07-27 2020-01-30 Washington University Ml-based methods for pseudo-ct and hr mr image estimation
WO2020028382A1 (en) * 2018-07-30 2020-02-06 Memorial Sloan Kettering Cancer Center Multi-modal, multi-resolution deep learning neural networks for segmentation, outcomes prediction and longitudinal response monitoring to immunotherapy and radiotherapy
CN111160467A (en) * 2019-05-31 2020-05-15 北京理工大学 Image description method based on conditional random field and internal semantic attention
CN111368846A (en) * 2020-03-19 2020-07-03 中国人民解放军国防科技大学 Road ponding identification method based on boundary semantic segmentation
CN111832501A (en) * 2020-07-20 2020-10-27 中国人民解放军战略支援部队航天工程大学 Remote sensing image text intelligent description method for satellite on-orbit application
WO2020222985A1 (en) * 2019-04-30 2020-11-05 The Trustees Of Dartmouth College System and method for attention-based classification of high-resolution microscopy images
CN111914710A (en) * 2020-07-24 2020-11-10 合肥工业大学 Method and system for describing scenes of railway locomotive depot
WO2020244287A1 (en) * 2019-06-03 2020-12-10 中国矿业大学 Method for generating image semantic description
WO2020248471A1 (en) * 2019-06-14 2020-12-17 华南理工大学 Aggregation cross-entropy loss function-based sequence recognition method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
姚义; 王诗珂; 陈希豪; 林宇翩: "Research on structured image annotation based on deep learning", Computer Knowledge and Technology (电脑知识与技术), no. 33.
牛斌; 李金泽; 房超; 马利; 徐和然; 纪兴海: "An image captioning method based on an attention mechanism and multiple modalities", Journal of Liaoning University (Natural Science Edition) (辽宁大学学报(自然科学版)), no. 01.
赵小虎; 尹良飞; 赵成龙: "An image semantic description algorithm based on global-local features and an adaptive attention mechanism", Journal of Zhejiang University (Engineering Science) (浙江大学学报(工学版)), no. 01.
韦人予; 蒙祖强: "An image captioning model based on adaptive correction of attention features", Journal of Computer Applications (计算机应用), no. 1.

Also Published As

Publication number Publication date
CN113239886B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
Kumar et al. Deep learning–based automated detection of sewer defects in CCTV videos
Kopbayev et al. Gas leakage detection using spatial and temporal neural network model
CN111325323B (en) Automatic power transmission and transformation scene description generation method integrating global information and local information
CN107944412A (en) Transmission line of electricity automatic recognition system and method based on multilayer convolutional neural networks
EP3699579B1 (en) Inspection method and inspection device and computer-readable medium
CN113239886B (en) Underground pipeline leakage description method and device based on cross-language image change description
CN109376736A (en) A kind of small video target detection method based on depth convolutional neural networks
CN116824335A (en) YOLOv5 improved algorithm-based fire disaster early warning method and system
CN117523177A (en) Gas pipeline monitoring system and method based on artificial intelligent hybrid big model
Qiu et al. A lightweight yolov4-edam model for accurate and real-time detection of foreign objects suspended on power lines
CN117808739A (en) Method and device for detecting pipeline defects
CN113780111A (en) Pipeline connector based on optimized YOLOv3 algorithm and defect accurate identification method
CN116467485B (en) Video image retrieval construction system and method thereof
CN113111184A (en) Event detection method based on explicit event structure knowledge enhancement and terminal equipment
Li et al. An integrated underwater structural multi-defects automatic identification and quantification framework for hydraulic tunnel via machine vision and deep learning
CN115935241A (en) Real-time positioning method and device for pipe cleaner with multi-parameter mutual fusion
CN113516179B (en) Method and system for identifying water leakage performance of underground infrastructure
Zhang et al. Combining Self‐Supervised Learning and Yolo v4 Network for Construction Vehicle Detection
CN113283382B (en) Method and device for describing leakage scene of underground pipeline
CN112712497B (en) Cast iron pipeline joint stability detection method based on local descriptor
Jia et al. Sample generation of semi‐automatic pavement crack labelling and robustness in detection of pavement diseases
CN115147684A (en) Target striking effect evaluation method based on deep learning
CN114140879A (en) Behavior identification method and device based on multi-head cascade attention network and time convolution network
Jiang et al. Lightweight object detection network for multi‐damage recognition of concrete bridges in complex environments
Wang et al. An improved Image Description Method Using Recurrent Neural Network with Gated Recurrent Unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant