CN116012687A - Image interaction fusion method for identifying tread defects of wheel set - Google Patents

Image interaction fusion method for identifying tread defects of wheel set

Info

Publication number
CN116012687A
Authority
CN
China
Prior art keywords: image, interaction, decoding, module, scale
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310101570.9A
Other languages
Chinese (zh)
Inventor
***
杨皓楠
何静
张昌凡
李哲姝
王忠美
贾林
黄刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University of Technology
Original Assignee
Hunan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University of Technology
Priority to CN202310101570.9A
Publication of CN116012687A
Legal status: Pending

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 — Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 — Computing systems specially adapted for manufacturing

Abstract

The invention discloses an image interaction fusion method for recognizing wheel set tread defects. The method mainly comprises five stages: a data acquisition and processing stage, a multi-scale interaction attention feature extraction stage, a constraint coupled decoding stage, a fusion inference stage and a result display stage. The adaptive hybrid interaction attention module applied in the multi-scale interaction attention feature extraction stage helps the model distinguish target shapes, and the multi-scale sparse feature extraction module applied in the same stage helps distinguish scales during feature extraction. The designed constraint coupled decoding stage introduces a consistency loss and a reconstruction loss into the modality decoders, improving the recognition accuracy of the model. In conclusion, the method addresses the difficulty of distinguishing target shapes and scales during feature extraction and achieves high recognition accuracy.

Description

Image interaction fusion method for identifying tread defects of wheel set
Technical Field
The invention relates to the technical field of wheel set tread recognition, in particular to an image interaction fusion method for wheel set tread defect recognition.
Background
As trains enter the era of high speed and heavy load, intelligent operation and maintenance of trains is attracting increasing attention. Wheel sets are important running and load-bearing components of a train, and factors such as high-frequency vibration caused by high-speed wheel set rotation and wheel-rail temperature rise caused by sliding/rolling during wheel-rail contact can cause wheel-rail contact fatigue, producing multiple types of defects on the wheel set tread, such as abrasions, scratches and block loss; accurate identification of tread defects can therefore provide key support for train operation and maintenance. Constrained by factors such as complex and changeable wheel-rail contact conditions, tread defect samples mostly show small shape differences between different defect types and large scale differences within the same defect type, so existing deep learning networks face problems such as difficulty in distinguishing target shapes and scales during feature extraction and low model recognition accuracy, and can hardly meet the real-time tread defect diagnosis requirements of intelligent train operation and maintenance.
The invention patent with publication No. CN114663344A discloses a train wheel set tread defect identification method based on image fusion. When identifying tread defects of a train wheel set, that method collects images with a visible light camera and an infrared camera to obtain images of the train wheel set tread region; a fusion model of the visible light image and the infrared image is constructed based on a neural network and trained until convergence, and the corresponding visible light image and infrared image are input into the trained fusion model to obtain a fused image; a region growing method is then used to aggregate pixel points in the fused image according to the similarity of their gray values, so as to obtain an image of the train wheel set tread defect region. That invention constructs the fusion model of the visible light image and the infrared image with a conventional neural network, and when identifying wheel set tread defects it still suffers from problems such as difficulty in distinguishing target shapes and scales and low model recognition accuracy.
Disclosure of Invention
Aiming at the technical problems, the invention provides an image interaction fusion method for identifying wheel set tread defects.
The invention adopts the following specific technical scheme:
The main body of the method comprises five stages, namely a data acquisition and processing stage, a multi-scale interaction attention feature extraction stage, a constraint coupled decoding stage, a fusion inference stage and a result display stage, and the method comprises the following steps:
s1, data acquisition and processing: and acquiring RGB image samples of the tread defect of the wheel set on site, and encoding the original RGB image samples by using a Poisson encoder to obtain Poisson mode POS images for image fusion.
S2, a multi-scale interaction attention feature extraction stage:
s2.1, taking a pre-training lightweight network Mobilenetv2 fused with RGB and Poisson images as a model backbone network, extracting bottom features of modes of respectively acquiring the RGB images and the Poisson mode images in the step S1, and setting an input feature map x m ∈R H×W×C For an m-modality input image, its final encoding characteristics are as follows:
h m =Mobile m (x m ),m∈{r,p}
its characteristic map x m ∈R H×W×C H, W, C in the formula is expressed as the length, width and channel number of RGB image characteristics of the input image, in the formula
Figure BDA0004085416830000021
Wherein H is 1 、W 1 、C 1 But also respectively represents the length, width and channel number of the characteristic diagram,
Figure BDA0004085416830000022
Mobile m coding features respectively representing m modes, mobilenetv2 network, r, p represent the features of RGB image modality and poisson modality POS image modality, respectively;
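For illustration, a minimal PyTorch sketch of the two-branch backbone in step S2.1 is given below. It assumes torchvision's pre-trained MobileNetV2 as Mobile_m and adds a 1×1 projection to the 128-channel encoding used in Embodiment 1; the projection layer and the weight choice are assumptions of this sketch, not taken from the patent.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class ModalityBackbone(nn.Module):
    """One MobileNetV2 branch per modality m in {r, p}, as in step S2.1."""
    def __init__(self, out_channels=128):
        super().__init__()
        # Pre-trained MobileNetV2 feature extractor; a 224x224 input gives a 7x7x1280 map.
        self.features = mobilenet_v2(weights="DEFAULT").features
        # Illustrative 1x1 projection down to the 128-channel encoding h_m.
        self.proj = nn.Conv2d(1280, out_channels, kernel_size=1)

    def forward(self, x):            # x: (B, 3, 224, 224)
        h = self.features(x)         # (B, 1280, 7, 7)
        return self.proj(h)          # (B, 128, 7, 7)  -> h_m

backbone_r, backbone_p = ModalityBackbone(), ModalityBackbone()
x_r = torch.randn(2, 3, 224, 224)    # RGB modality input
x_p = torch.randn(2, 3, 224, 224)    # Poisson (POS) modality input
h_r, h_p = backbone_r(x_r), backbone_p(x_p)
```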
s2.2, adopting a multi-scale sparse feature extraction module DGASPP to extract multi-scale features of the bottom layer features so as to solve the problem of large difference of defect scales in the image sample, wherein the multi-scale sparse feature extraction module DGASPP is used for encoding features h of two modes m Extracting multi-scale features, wherein the extraction formula is as follows:
s m =DGASPP m (h m ),m∈{r,p}
in the middle of
Figure BDA0004085416830000023
DGASPP m Respectively representing the multi-scale coding characteristics of m modes and a DGASPP module;
s2.3, extracting a spatial channel attention feature with interaction information in the bottom layer feature by using an adaptive hybrid interaction attention module AHMA module, so as to solve the problem of small defect shape difference in an image sample, and extracting a formula of the spatial channel attention feature with interaction information by using the AHMA module:
m r ,m p =AHMA(s r ,s p )
in the method, in the process of the invention,
Figure BDA0004085416830000024
attention weighting features respectively representing two modalities; AHMA is the adaptive hybrid interactive attention module.
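The patent does not give the internal equations of the AHMA module (it is detailed only as a hybrid non-local module, attention matrix fusion and adaptive reliability weights), so the following is merely a generic cross-modal spatial-channel attention sketch with learnable reliability weights; it illustrates the interface m_r, m_p = AHMA(s_r, s_p), not the module itself.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Generic stand-in for the AHMA interface m_r, m_p = AHMA(s_r, s_p)."""
    def __init__(self, channels=128, reduction=4):
        super().__init__()
        # Channel attention computed from pooled statistics of both modalities.
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, 2 * channels), nn.Sigmoid())
        # Shared spatial attention over mean/max channel descriptors.
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # Adaptive reliability weights for the two modalities.
        self.w = nn.Parameter(torch.zeros(2))

    def forward(self, s_r, s_p):
        b, c, _, _ = s_r.shape
        g = torch.cat([s_r, s_p], dim=1).mean(dim=(2, 3))            # (B, 2C) interaction statistics
        a_r, a_p = self.mlp(g).view(b, 2 * c, 1, 1).chunk(2, dim=1)  # channel attention per modality
        joint = s_r + s_p
        sp = torch.sigmoid(self.spatial(torch.cat(
            [joint.mean(1, keepdim=True), joint.amax(1, keepdim=True)], dim=1)))
        w_r, w_p = torch.softmax(self.w, dim=0)                      # reliability weights
        m_r = s_r * (1 + w_r * a_r * sp)
        m_p = s_p * (1 + w_p * a_p * sp)
        return m_r, m_p
```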
S3, constraint coupled decoding stage: the constraint coupled decoding module is enabled, and the features extracted by the multi-scale interaction attention feature extraction module in step S2.2 are decoded with modality decoders. The modality decoder adopts the DeepLabv3 network with an improved encoding-decoding structure, and the decoding part is improved as follows:
s3.1, adding additional characteristic splicing convolution CConv and shortcut convolution SConv for obtaining decoding characteristics on more scales;
s3.2, adding consistency constraint loss among decoding features of different modes, and capturing interaction features in the decoding features;
s3.3, increasing reconstruction loss between the input image and the reconstructed image so as to enhance shape feature extraction;
s3.4, constructing a total target loss function composed of task loss, consistency loss and reconstruction loss, wherein the total target loss function is used for guiding the network to learn related characteristics, and the formula of the total target loss function is as follows:
L total =μL task +(1-μ)(L consis +L recon )
wherein: mu is a loss function adjustment factor, L task L is a task loss function consis As a consistency loss function, L recon Reconstructing a loss function, wherein the total target loss function is used for adjusting the contribution of task loss and decoding part loss to network learning, and taking the minimized loss function as a target training network;
s4, fusion inference phase: and adopting a global average pooling and a multi-layer perceptron to perform fusion inference on the interaction attention characteristics, wherein the fusion inference formula is as follows:
Figure BDA0004085416830000031
wherein P represents fusion inference output; MLP represents a multi-layer perceptron; avg represents an average pooling operation;
Figure BDA0004085416830000032
representing a channel splice operation.
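A minimal sketch of the fusion inference of step S4 (channel splice, global average pooling, multi-layer perceptron); the hidden width and the 10 output classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Fusion inference: P = MLP(Avg(m_r ⊕ m_p))."""
    def __init__(self, channels=128, num_classes=10, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def forward(self, m_r, m_p):
        f = torch.cat([m_r, m_p], dim=1)   # channel splicing, (B, 2C, H, W)
        f = f.mean(dim=(2, 3))             # global average pooling, (B, 2C)
        return self.mlp(f)                 # fusion inference output P (class logits)
```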
S5, result display stage: part of the test set image samples are displayed together with their predicted categories.
Further, the constraint coupled decoding module is used to assist the multi-scale interaction attention feature extraction stage in training the network and does not participate in testing.
Further, the multi-scale sparse feature extraction module DGASPP is formed by introducing the ghost module and the dense connection idea into the atrous spatial pyramid pooling module ASPP.
Further, in step S2.3, the adaptive hybrid interaction attention module AHMA is composed of a hybrid non-local module HNL, attention matrix fusion and adaptive attention reliability weights.
Further, the formula of the task loss function L_task in step S3.4 is:

L_task = −α·(1 − P)^γ·log(P)

where α = 0.25 is a weight factor and γ = 2 is an adjustment factor.
Further, the formula of the consistency loss function L_consis in step S3.4 is:

L_consis = MSE(d_r^16, d_p^16) + MSE(d_r^8, d_p^8)

where the 16× downsampled and 8× downsampled decoding features of modality m are denoted d_m^16 and d_m^8, with m ∈ {r, p}; d_r^16 and d_p^16 denote the decoding features of the 16× downsampled RGB image modality and Poisson-modality POS image modality, respectively, and d_r^8 and d_p^8 denote the decoding features of the 8× downsampled RGB image modality and Poisson-modality POS image modality, respectively. The consistency loss function applies the mean square error (MSE) loss to the 16× and 8× downsampled decoding features between the modalities.
Further, the 16× downsampled and 8× downsampled decoding features d_m^16 and d_m^8 of modality m are constructed from the corresponding backbone features through the feature splicing convolutions CConv_16 and CConv_8, the shortcut convolution SConv, the bilinear-interpolation 2× upsampling operation up_2 and the channel splicing operation ⊕.
Further, the convolution CConv_8 used to obtain the decoding feature d_m^8 and the convolution CConv_16 used to obtain the decoding feature d_m^16 have the same topology.
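Because the formulas for d_m^16 and d_m^8 are given in the original only as images, the sketch below shows one plausible reading of the added CConv/SConv step: the deeper decoding feature is upsampled 2× (up_2), channel-spliced with the encoder feature at the target scale, fused by CConv, and summed with a 1×1 shortcut convolution SConv of that encoder feature. The structure and channel counts are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecodeStage(nn.Module):
    """Hypothetical CConv/SConv decoding step; CConv_16 and CConv_8 share this topology."""
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.cconv = nn.Sequential(                       # feature splicing convolution CConv
            nn.Conv2d(deep_ch + skip_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.sconv = nn.Conv2d(skip_ch, out_ch, 1)        # shortcut convolution SConv

    def forward(self, deep, skip):
        up = F.interpolate(deep, scale_factor=2, mode="bilinear", align_corners=False)  # up_2
        return self.cconv(torch.cat([up, skip], dim=1)) + self.sconv(skip)

# Illustrative channel counts for the 16x and 8x stages:
stage16 = DecodeStage(deep_ch=128, skip_ch=96, out_ch=128)   # 32x feature -> d_m^16
stage8 = DecodeStage(deep_ch=128, skip_ch=32, out_ch=128)    # d_m^16 -> d_m^8
```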
Further, the reconstruction loss function L_recon in step S3.4 measures the error between the input images and their decoded reconstructions, where x_r and x_p denote the RGB image and Poisson image inputs and r_r and r_p denote the RGB image and Poisson image decoding reconstruction outputs, respectively.
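Pulling steps S3.2 to S3.4 together, a minimal sketch of the loss terms might look as follows. The focal-style task loss uses α = 0.25 and γ = 2 as stated, the consistency term is the MSE between the 16× and 8× decoding features of the two modalities, the reconstruction term is assumed here to be MSE (the patent does not reproduce its formula), and μ = 0.8 follows Embodiment 1.

```python
import torch
import torch.nn.functional as F

def task_loss(p, alpha=0.25, gamma=2.0, eps=1e-7):
    """L_task = -alpha * (1 - P)^gamma * log(P); p is taken as the predicted
    probability of the true class (an interpretation of P)."""
    p = p.clamp(eps, 1.0)
    return (-alpha * (1 - p) ** gamma * torch.log(p)).mean()

def consistency_loss(d16_r, d16_p, d8_r, d8_p):
    """MSE between the 16x and 8x downsampled decoding features of the two modalities."""
    return F.mse_loss(d16_r, d16_p) + F.mse_loss(d8_r, d8_p)

def reconstruction_loss(x_r, r_r, x_p, r_p):
    """Reconstruction error between inputs and decoded reconstructions (MSE is an assumption)."""
    return F.mse_loss(r_r, x_r) + F.mse_loss(r_p, x_p)

def total_loss(l_task, l_consis, l_recon, mu=0.8):
    """L_total = mu * L_task + (1 - mu) * (L_consis + L_recon)."""
    return mu * l_task + (1 - mu) * (l_consis + l_recon)
```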
Further, the image interaction fusion method is verified and analyzed through experiments.
The beneficial effects of the image interaction fusion method for identifying wheel set tread defects of the invention are as follows:
(1) In the multi-scale interaction attention feature extraction stage, an adaptive hybrid interaction attention module is proposed; by introducing a channel branch and an adaptive interaction mechanism into the non-local attention module, it can use inter-modality interaction information to increase the attention information gain and optimize the capture of target shape features, which helps distinguish target shapes during feature extraction and alleviates the problem of small shape differences between defects.
(2) In the multi-scale interaction attention feature extraction stage, a multi-scale sparse feature extraction module is proposed; by introducing the ghost module and the dense connection idea into the atrous spatial pyramid pooling module, it can optimize the multi-scale sparse representation by associating the branch information of each scale, which helps distinguish scales during feature extraction and alleviates the problem of large scale differences between defects.
(3) A constraint coupled decoding stage is designed, and a consistency loss and a reconstruction loss are introduced into the modality decoders, so that the auxiliary network helps the model better capture defect shape features and inter-modality interaction feature information during training, giving high model recognition accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for the embodiments or the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a simplified flow chart of an image interaction fusion method for identifying wheel set tread defects;
FIG. 2 is a schematic diagram of a data acquisition and processing stage;
FIG. 3 is a schematic illustration of a portion of a sample;
FIG. 4 is a schematic diagram of a multi-scale interaction attention feature extraction stage;
FIG. 5 is a schematic diagram of a fusion inference phase;
FIG. 6 is a schematic diagram of a constraint coupled decoding stage;
FIG. 7 is a schematic diagram of a result display stage;
fig. 8 is a schematic diagram of a DGASPP module structure.
Detailed Description
The present invention is further illustrated and described below with reference to examples, which are not intended to be limiting in any way.
Example 1
As shown in FIG. 1, the main body of the method comprises five stages, namely a data acquisition and processing stage, a multi-scale interaction attention feature extraction stage, a constraint coupled decoding stage, a fusion inference stage and a result display stage, and the method comprises the following steps:
s1, data acquisition and processing: as shown in fig. 2, RGB image samples of tread defects of the wheel set are collected on site, and a poisson encoder is adopted to encode the original RGB image samples to obtain poisson mode POS images for image fusion;
the data set used in the experiment in the embodiment 1 is acquired by a CCD camera in a unified standard from a vehicle section wheel axle workshop, and the database established by the invention comprises 9 state images and 1 interference image derived from the wheel set tread under the action of wheel rail contact force in the actual running process of the train.
Specifically, as shown in Table 1, a total of 343 images covering 10 wheel set states were acquired for the tread defect data set: 57 tread normal, 52 punch damage, 47 crack, 45 tread scratch, 42 stripping, 29 stain (interference images), 24 scratch on tread, 21 edge abrasion, 20 circumferential abrasion and 6 tread block loss.
Tread defect class         Training samples (images)   Test samples (images)
Tread normal               34                          23
Punch damage on tread      31                          21
Tread crack                28                          19
Tread scratch              27                          18
Tread stripping            25                          17
Tread stain                17                          12
Scratch on tread           14                          10
Edge abrasion              12                          9
Circumferential abrasion   12                          8
Tread block loss           3                           3

TABLE 1
To facilitate network training and testing, the defects in the original RGB images are reconstructed into images of 224×224 pixels, and the RGB images are Poisson-encoded to obtain the POS images; some of the samples are shown in FIG. 3. Finally, the data of each state are divided at an approximate ratio of 3:2, giving 203 training images and 140 test images.
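A minimal sketch of the approximate per-state 3:2 split described above is given below; the 224×224 resizing and the Poisson encoding are omitted, since the patent does not specify the Poisson encoder's implementation, and the function name and data layout are assumptions.

```python
import random
from collections import defaultdict

def split_by_state(samples, train_ratio=3 / 5, seed=0):
    """Split (image_path, label) pairs per class at an approximate 3:2 ratio,
    as in the 203 training / 140 test division of the tread defect data set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)
    train, test = [], []
    for label, paths in by_label.items():
        rng.shuffle(paths)
        k = round(len(paths) * train_ratio)
        train += [(p, label) for p in paths[:k]]
        test += [(p, label) for p in paths[k:]]
    return train, test
```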
S2, as shown in FIG. 4, a multi-scale interaction attention feature extraction stage:
s2.1, taking a pre-training lightweight network Mobilenetv2 fused with RGB and Poisson images as a model backbone network, extracting bottom features of modes of respectively acquiring the RGB images and the Poisson mode images in the step S1, and setting an input feature map x m ∈R 224×224×3 For an m-modality input image, its final encoding characteristics are as follows:
h m =Mobile m (x m ),m∈{r,p} (1)
h m ∈R 7×7×128 、Mobile m respectively representing coding characteristics of m modes and a Mobilenetv2 network, and r and p respectively represent characteristics of RGB image modes and Poisson mode POS image modes;
s2.2, adopting a multi-scale sparse feature extraction module DGASPP to extract multi-scale features of the bottom layer features so as to solve the problem of large difference of defect scales in the image sample, wherein the multi-scale sparse feature extraction module DGASPP is used for encoding features h of two modes m Extracting multi-scale features, wherein the extraction formula is as follows:
s m =DGASPP m (h m ),m∈{r,p} (2)
s in the formula (2) m ∈R 7×7×128 、DGASPP m Respectively representing the multi-scale coding characteristics of m modes and a DGASPP module;
s2.3, extracting a spatial channel attention feature with interaction information in the bottom layer feature by using an adaptive hybrid interaction attention module AHMA module, so as to solve the problem of small defect shape difference in an image sample, and extracting a formula of the spatial channel attention feature with interaction information by using the AHMA module:
m r ,m p =AHMA(s r ,s p ) (3)
in the formula (3), m r ,m p ∈R 7×7×128 Attention weighting features respectively representing two modalities; AHMA is the self-adaptive mixed interaction attention module;
s3, constraint coupling decoding stage: as shown in fig. 5, the constraint coupled decoding module is enabled, the features extracted by the multi-scale interaction attention feature extraction module in step S2.2 are decoded by using a modal decoder, and the process of improving the decoding part by adopting a network deep 3 with an improved encoding and decoding structure in the modal decoder is as follows:
s3.1, adding additional characteristic splicing convolution CConv and shortcut convolution SConv for decoding characteristics acquired on more scales, wherein the decoding characteristics of 16 times downsampling and 8 times downsampling of m modes in a Mobilenetv2 module are respectively expressed as
Figure BDA0004085416830000071
Wherein m is { r, p }, 16 times downsampling and 8 times downsampling of m modes decode the characteristic ∈ ->
Figure BDA0004085416830000072
And->
Figure BDA0004085416830000073
The formulas of (a) are respectively as follows:
Figure BDA0004085416830000074
Figure BDA0004085416830000075
/>
in the formula (4)
Figure BDA0004085416830000076
In formula (5)>
Figure BDA0004085416830000077
up 2 Representing bilinear interpolation 2 x upsampling operation,/->
Figure BDA0004085416830000078
Representing a channel splicing operation;
s3.2, adding consistency constraint loss among decoding features of different modes, and capturing interaction features in the decoding features, wherein a task loss function L task The formula of (2) is:
L task =-α(1-P) γ log(P) (6)
wherein α=0.25 represents a weight factor, and γ=2 represents an adjustment factor;
The formula of the consistency loss function L_consis is:

L_consis = MSE(d_r^16, d_p^16) + MSE(d_r^8, d_p^8)   (7)

where d_r^16 and d_p^16 in formula (7) denote the decoding features of the 16× downsampled RGB image modality and Poisson-modality POS image modality, respectively, and d_r^8 and d_p^8 denote the decoding features of the 8× downsampled RGB image modality and Poisson-modality POS image modality, respectively; the consistency loss function applies the mean square error (MSE) loss to the 16× and 8× downsampled decoding features between the modalities;
s3.3, adding reconstruction loss between the input image and the reconstructed image to enhance shape feature extraction, wherein the reconstruction loss function L recon The method comprises the following steps:
Figure BDA0004085416830000084
wherein x is r 、x p Representing RGB image and poisson image inputs, r, respectively r 、r p Respectively representing RGB image and Poisson image decoding reconstruction output;
s3.4, constructing a total target loss function composed of task loss, consistency loss and reconstruction loss, wherein the total target loss function is used for guiding the network to learn related characteristics, and the formula of the total target loss function is as follows:
L total =μL task +(1-μ)(L consis +L recon ) (9)
in formula (9): mu is a loss function adjustment factor, mu is set to 0.8, L task L is a task loss function consis As a consistency loss function, L recon Reconstructing a loss function, wherein the total target loss function is used for adjusting the contribution of task loss and decoding part loss to network learning, and taking the minimized loss function as a target training network;
s4, fusion inference phase: as shown in fig. 6, the global average pooling and multi-layer perceptron is used to perform fusion inference on the interaction attention features, and the fusion inference formula is as follows:
Figure BDA0004085416830000085
in the formula (10), P represents a fusion inference output; MLP represents a multi-layer perceptron; avg represents an average pooling operation;
Figure BDA0004085416830000086
representing a channel splicing operation;
s5, a result display stage: as shown in fig. 7, the respective categories of the partial test set image samples are shown as well as the belonging categories.
Further, the constraint coupled decoding stage is used to assist the multi-scale interaction attention feature extraction stage in training the network and does not participate in testing.
Further, the multi-scale sparse feature extraction module DGASPP is formed by introducing the ghost module and the dense connection idea into the atrous spatial pyramid pooling module ASPP. The structure of the DGASPP module is shown in FIG. 8: three GASPP Block branches with different dilation coefficients are designed in the DGASPP module and densely connected to obtain global features; another pooling branch performs average pooling on the input feature map; finally, the input feature map and the features of the four branches are spliced, and a 1×1 convolution compresses the result back to 128 channels to restore the input channel dimension. The multi-scale sparse feature extraction module DGASPP proposed by the invention can optimize the multi-scale sparse representation by associating the branch information of each scale, which helps distinguish scales during feature extraction and alleviates the problem of large scale differences between defects.
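A minimal sketch of a DGASPP-style module following the above description is given below; the dilation rates (1, 2, 4) and the ghost-convolution layout (a primary convolution plus a cheap depthwise convolution, GhostNet-style) are assumptions, since the patent does not list them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostDilatedConv(nn.Module):
    """Ghost-style dilated branch: a primary conv makes half the channels,
    a cheap depthwise conv generates the rest."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        half = out_ch // 2
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, half, 3, padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))

    def forward(self, x):
        p = self.primary(x)
        return torch.cat([p, self.cheap(p)], dim=1)

class DGASPP(nn.Module):
    """Three densely connected ghost dilated branches, an average-pooling branch,
    concatenation with the input and a 1x1 conv back to 128 channels."""
    def __init__(self, channels=128, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList()
        in_ch = channels
        for d in dilations:
            self.branches.append(GhostDilatedConv(in_ch, channels, d))
            in_ch += channels                      # dense connection: next branch sees all previous outputs
        self.pool_proj = nn.Conv2d(channels, channels, 1)
        self.fuse = nn.Conv2d(channels * (len(dilations) + 2), channels, 1)

    def forward(self, h):
        feats, x = [], h
        for branch in self.branches:
            y = branch(x)
            feats.append(y)
            x = torch.cat([x, y], dim=1)           # densely connect branch outputs
        pooled = F.adaptive_avg_pool2d(h, 1)
        pooled = F.interpolate(self.pool_proj(pooled), size=h.shape[2:],
                               mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([h] + feats + [pooled], dim=1))
```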
Further, in step S2.3, the adaptive hybrid interaction attention module AHMA is composed of a hybrid non-local module HNL, attention matrix fusion and adaptive attention reliability weights.
Further, the convolution CConv_8 used to obtain the decoding feature d_m^8 and the convolution CConv_16 used to obtain the decoding feature d_m^16 have the same topology, and are used to integrate the corresponding downsampled feature with the decoding feature of the previous layer.
Further, the image interaction fusion method is verified and analyzed through experiments.
Example 2
Embodiment 2 differs from embodiment 1 in that embodiment 2 uses only the HNL module mentioned in embodiment 1 for attention feature extraction in the multi-scale interactive attention feature extraction stage.
With consistent experimental parameter settings, the results of extracting attention features with the HNL module and the AHMA module, respectively, are shown in Table 2 below.

Module   Acc     P       R       F1
HNL      0.85    0.859   0.834   0.836
AHMA     0.871   0.903   0.877   0.883

TABLE 2
The experiments mainly adopt the following indexes: accuracy (Acc), recall (R), precision (P), F1 value (F1), Pa and T. The accuracy Acc is the proportion of correctly predicted samples (positive and negative classes) among all samples; the recall R is the proportion of samples correctly predicted as positive among all actually positive samples; the precision P is the proportion of samples correctly predicted as positive among all samples predicted as positive; the F1 value is the weighted harmonic mean of recall and precision; Pa is the number of model parameters (unit: M); and T is the time (unit: s) taken by the model to test a single sample on the CPU.
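A minimal sketch of how these classification indexes could be computed with scikit-learn; macro averaging over the 10 tread states is an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def classification_indexes(y_true, y_pred):
    """Acc, precision P, recall R and F1 as reported in Tables 2 and 3."""
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "P": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "R": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "F1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }
```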
As can be seen from Table 2, all classification indexes of the AHMA module used in Embodiment 1 are better than those of the HNL module used in Embodiment 2, which verifies the effectiveness and superiority of the AHMA module. The indexes of the HNL module are 0.021, 0.044, 0.043 and 0.047 lower, respectively, than those of the AHMA module, indicating that the AHMA module, formed by the modality interaction strategy designed on the basis of the HNL module in Embodiment 1, can effectively capture the interaction information between modalities and improves the feature extraction capability of the HNL module.
Example 3
Embodiment 3 differs from embodiment 1 in that embodiment 3 uses only the ASPP scale feature extraction module mentioned in embodiment 1 for feature extraction in the multi-scale interactive attention feature extraction stage.
With consistent experimental parameter settings, the results of extracting features with the ASPP module and the DGASPP module are shown in Table 3 below.

Module   Acc     P       R       F1
ASPP     0.821   0.863   0.81    0.819
DGASPP   0.871   0.903   0.877   0.883

TABLE 3
As can be seen from Table 3, the DGASPP module of Embodiment 1 maintains a relatively small parameter count while performing significantly better than the comparison ASPP module, indicating that the DGASPP module of Embodiment 1 extracts multi-scale features more effectively and thus better alleviates the problem of large scale differences between defects.
It is to be understood that the above examples of the present invention are provided by way of illustration only and are not intended to limit the embodiments of the present invention. Other variations or modifications will be apparent to those of ordinary skill in the art on the basis of the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.

Claims (10)

1. The image interaction fusion method for recognizing wheel set tread defects is characterized in that the main body of the method comprises five stages, namely a data acquisition and processing stage, a multi-scale interaction attention feature extraction stage, a constraint coupled decoding stage, a fusion inference stage and a result display stage;
the method comprises the following steps:
s1, data acquisition and processing: on-site collecting RGB image samples of tread defects of the wheel set, and encoding the original RGB image samples by using a Poisson encoder to obtain Poisson mode POS images for image fusion;
s2, a multi-scale interaction attention feature extraction stage:
s2.1, taking a pre-training lightweight network Mobilenetv2 fused with RGB and Poisson images as a model backbone network, extracting bottom features of modes of respectively acquiring the RGB images and the Poisson mode images in the step S1, and setting an input feature map x m ∈R H ×W×C For an m-modality input image, its final encoding characteristics are as follows:
h m =Mobile m (x m ),m∈{r,p}
its characteristic map x m ∈R H×W×C H, W, C in the formula is expressed as the length, width and channel number of the input image, in the formula
Figure FDA0004085416820000011
Wherein H is 1 、W 1 、C 1 And respectively represent the length, width and channel number of the feature map, ">
Figure FDA0004085416820000012
Mobile m Respectively representing coding characteristics of m modes and a mobiletv 2 network;
s2.2, adopting a multi-scale sparse feature extraction module DGASPP for multi-scale feature extraction of the bottom layer features, wherein the multi-scale sparse feature extraction module DGASPP is used for encoding features h of two modes m Extracting multi-scale features, wherein the extraction formula is as follows:
Figure FDA0004085416820000013
in the middle of
Figure FDA0004085416820000014
DGASPP m Respectively representing the multi-scale coding characteristics of m modes and a DGASPP module;
s2.3, extracting a spatial channel attention feature with interaction information in the bottom layer feature by using an adaptive hybrid interaction attention module AHMA module, so as to solve the problem of small defect shape difference in an image sample, and extracting a formula of the spatial channel attention feature with interaction information by using the AHMA module:
m r ,m p =AHMA(s r ,s p )
in the method, in the process of the invention,
Figure FDA0004085416820000015
attention weighting features respectively representing two modalities; AHMA is the self-adaptive mixed interaction attention module;
s3, constraint coupling decoding stage: enabling constraint coupled decoding module, decoding step using modal decoder
S2.2, the multi-scale interaction attention feature extraction module extracts features, the mode decoder adopts a network deep 3 with an improved coding and decoding structure, and the decoding part is improved as follows:
s3.1, adding additional characteristic splicing convolution CConv and shortcut convolution SConv for obtaining decoding characteristics on more scales;
s3.2, adding consistency constraint loss among decoding features of different modes, and capturing interaction features in the decoding features;
s3.3, increasing reconstruction loss between the input image and the reconstructed image, and extracting shape features;
s3.4, constructing a total target loss function composed of task loss, consistency loss and reconstruction loss, and guiding the network to learn relevant characteristics, wherein the formula of the total target loss function is as follows:
L total =μL task +(1-μ)(L consis +L recon )
wherein: mu is a loss function adjustment factor, L task L is a task loss function consis As a consistency loss function, L recon Reconstructing a loss function;
s4, fusion inference phase: and adopting a global average pooling and a multi-layer perceptron to perform fusion inference on the interaction attention characteristics, wherein the fusion inference formula is as follows:
Figure FDA0004085416820000021
wherein P represents fusion inference output, MLP represents a multi-layer perceptron, avg represents average pooling operation, and avg represents channel splicing operation;
s5, a result display stage: the respective categories of the partial test set image samples are shown as belonging to the categories.
2. The image interaction fusion method for identifying wheel set tread defects according to claim 1, wherein the constraint coupled decoding module is used to assist the multi-scale interaction attention feature extraction stage in training the network and does not participate in testing.
3. The image interaction fusion method for identifying wheel set tread defects according to claim 1, wherein the multi-scale sparse feature extraction module DGASPP is formed by introducing the ghost module and the dense connection idea into the atrous spatial pyramid pooling module ASPP.
4. The image interaction fusion method for identifying wheel set tread defects according to claim 1, wherein the adaptive hybrid interaction attention module AHMA in step S2.3 is composed of a hybrid non-local module HNL, attention matrix fusion and adaptive attention reliability weights.
5. The image interaction fusion method for identifying wheel set tread defects according to claim 1, wherein the formula of the task loss function L_task in step S3.4 is:

L_task = −α·(1 − P)^γ·log(P)

where α = 0.25 is a weight factor and γ = 2 is an adjustment factor.
6. The image interaction fusion method for identifying wheel set tread defects according to claim 1, wherein the formula of the consistency loss function L_consis in step S3.4 is:

L_consis = MSE(d_r^16, d_p^16) + MSE(d_r^8, d_p^8)

where the 16× downsampled and 8× downsampled decoding features of modality m are denoted d_m^16 and d_m^8, with m ∈ {r, p}; d_r^16 and d_p^16 denote the decoding features of the 16× downsampled RGB image modality and Poisson-modality POS image modality, respectively, and d_r^8 and d_p^8 denote the decoding features of the 8× downsampled RGB image modality and Poisson-modality POS image modality, respectively; the consistency loss function applies the mean square error (MSE) loss to the 16× and 8× downsampled decoding features between the modalities.
7. The image interaction fusion method for identifying wheel set tread defects according to claim 6, wherein the 16× downsampled and 8× downsampled decoding features d_m^16 and d_m^8 of modality m are constructed from the corresponding backbone features through the feature splicing convolutions CConv_16 and CConv_8, the shortcut convolution SConv, the bilinear-interpolation 2× upsampling operation up_2 and the channel splicing operation ⊕.
8. The image interaction fusion method for identifying wheel set tread defects according to claim 7, wherein the convolution CConv_8 used to obtain the decoding feature d_m^8 and the convolution CConv_16 used to obtain the decoding feature d_m^16 have the same topology.
9. The image interaction fusion method for identifying wheel set tread defects according to claim 1, wherein the reconstruction loss function L_recon in step S3.4 measures the error between the input images and their decoded reconstructions, where x_r and x_p denote the RGB image and Poisson image inputs and r_r and r_p denote the RGB image and Poisson image decoding reconstruction outputs, respectively.
10. The image interaction fusion method for identifying wheel set tread defects according to claim 1, wherein the image interaction fusion method is verified and analyzed through experiments.
CN202310101570.9A 2023-02-10 2023-02-10 Image interaction fusion method for identifying tread defects of wheel set Pending CN116012687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310101570.9A CN116012687A (en) 2023-02-10 2023-02-10 Image interaction fusion method for identifying tread defects of wheel set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310101570.9A CN116012687A (en) 2023-02-10 2023-02-10 Image interaction fusion method for identifying tread defects of wheel set

Publications (1)

Publication Number Publication Date
CN116012687A true CN116012687A (en) 2023-04-25

Family

ID=86024962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310101570.9A Pending CN116012687A (en) 2023-02-10 2023-02-10 Image interaction fusion method for identifying tread defects of wheel set

Country Status (1)

Country Link
CN (1) CN116012687A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116843685A (en) * 2023-08-31 2023-10-03 山东大学 3D printing workpiece defect identification method and system based on image detection
CN116843685B (en) * 2023-08-31 2023-12-12 山东大学 3D printing workpiece defect identification method and system based on image detection


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination