CN117591888B - Cluster autonomous learning fault diagnosis method for key parts of train - Google Patents


Info

Publication number
CN117591888B
Authority
CN
China
Prior art keywords
model
edge
training
sample
local
Prior art date
Legal status
Active
Application number
CN202410064148.5A
Other languages
Chinese (zh)
Other versions
CN117591888A (en)
Inventor
王彪
邱海权
秦勇
伊枭剑
郭亮
丁奥
Current Assignee
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202410064148.5A priority Critical patent/CN117591888B/en
Publication of CN117591888A publication Critical patent/CN117591888A/en
Application granted granted Critical
Publication of CN117591888B publication Critical patent/CN117591888B/en


Classifications

    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/0464 Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The invention relates to the technical field of fault diagnosis for rail transit equipment, and in particular to a cluster autonomous learning fault diagnosis method for key components of a train, comprising the following steps: constructing a data stream hierarchy and setting up a plurality of edge clients and a central server; entering an initial learning level, performing R rounds of cluster collaborative training on each edge model, and aggregating the results at the central server; entering an autonomous learning level and constructing a local loss function for each edge client; and, with the local loss function as the learning criterion, performing R rounds of cluster collaborative training on each edge model and aggregating the results at the central server, where the central server selects, from a global perspective, the model with the highest diagnosis accuracy among the R rounds as the optimal global model of the level. Under a cloud-edge collaborative architecture, the invention can fully utilize dynamic data resources scattered at the edge while protecting data privacy, thereby realizing autonomous training of the cluster.

Description

Cluster autonomous learning fault diagnosis method for key parts of train
Technical Field
The invention relates to the technical field of rail transit equipment fault diagnosis, and in particular to a cluster autonomous learning fault diagnosis method for key components of a train.
Background
Rail transit is an important travel choice because it is fast, safe, and comfortable. As train running speed and service intensity increase, advanced monitoring and diagnosis technology for key train components is receiving more and more attention. Existing monitoring and diagnosis, limited by earlier technology, generally adopts an edge-acquisition/centralized-processing architecture: data are collected at edge nodes and uploaded to a central server, where algorithms deployed on the server analyze the data and output diagnosis results. With the development of edge computing hardware, edge nodes offer ever stronger computing performance, making it possible to deploy increasingly complex algorithms at the edge.
A monitoring and diagnosis architecture based on cloud-edge collaboration is favored by industry because of its faster response and lower communication cost. However, the existing cloud-edge collaborative diagnosis architecture simply offloads the data analysis task to the edge client and does not fully consider two problems in practical application: 1) data in real scenarios accumulate continuously, and when a new fault appears, how to quickly propagate the knowledge from a global perspective has not been fully considered; 2) data barriers exist between different vehicle suppliers and operating entities, and sharing raw data often involves cumbersome data confidentiality issues; under a cloud-edge collaborative framework, how diagnostic experience can be shared among different entities in a desensitized manner remains an open problem.
Disclosure of Invention
In view of the above, the invention provides a cluster autonomous learning fault diagnosis method for key train components. Under a cloud-edge collaborative architecture, the model can fully utilize dynamic data resources scattered across edge clients while protecting data privacy, realizing autonomous training of the cluster.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a cluster autonomous learning fault diagnosis method for key components of a train comprises the following steps:
constructing a series of data stream levels, and setting up a plurality of edge clients and a central server;
entering an initial learning level, where each edge client downloads the initial global model from the central server as its local model, and collects its own training samples from the data stream level in batches to perform R rounds of initial training on the local model;
after each round of initial training is completed, each edge client randomly selects M old-class reference samples from the training samples used in the current round and stores them in an old reference sample memory for the next round of training; the central server collects the local models trained by the edge clients for aggregation and updating, obtaining the global model of the initial learning level;
after R rounds of cluster collaborative training at the initial learning level, the autonomous learning level is entered;
in each autonomous learning level, each edge client continuously collects new training data and calculates its average entropy; when the average entropy meets a preset condition, the edge client is considered to be receiving data of a new level, at which point the next autonomous learning level begins;
in each autonomous learning level, constructing a gradient-weighted loss function for each edge client to normalize the learning rate on new-class samples and the forgetting rate on old-class samples;
constructing an old-class distillation loss function for each edge client based on the underlying relation between new-class and old-class samples;
constructing a local loss function for each edge client from the gradient-weighted loss function and the old-class distillation loss function;
with the local loss function as the learning criterion, each edge client collects training samples in batches and performs R rounds of autonomous training on its local model;
after each round of autonomous training is completed, each edge client randomly selects M old-class reference samples from the training samples used in the current round and stores them in the old reference sample memory for the next round of training, and the central server collects the trained local models of the edge clients for aggregation; after the R rounds of cluster collaborative training of the autonomous learning level, the central server selects, from a global perspective, the model with the highest diagnosis accuracy among the R rounds as the optimal global model of the level.
Further, the series of data stream levels is denoted $\{D^t\}_{t=1}^{layer}$, where $t$ denotes the $t$-th level and $layer$ the total number of levels. The data of the $t$-th level is denoted $D^t=\{(x_i^t,y_i^t)\}_{i=1}^{N_t}$, consisting of $N_t$ samples $x_i^t$ and their label values $y_i^t$, where $x_i^t$ is a collected time-domain vibration signal of a key component and the label value $y_i^t$ represents the health status category of the component. $Y^t$ denotes the label set at the $t$-th level, which contains $C^t$ new health status categories; the set of all old health status categories over the first $t-1$ levels is $C^p=\bigcup_{j=1}^{t-1}Y^j$, where $\bigcup$ denotes taking the union over the $t-1$ sets and $Y^j$ is the label set at the $j$-th level.
Further, after training at each level is completed, the number M of old-class reference samples randomly selected by the edge client satisfies the following relationship:
further, in the autonomous learning level, the data stream level will change with time, and when the data stream level is the t level, the edge client terminalDynamically changing with the change of data stream layer, adding +/periodically at each layer>Let the number of clients +.>Gradually increasing; wherein (1)>By->Edge client composition, which->The edge clients do not collect new data at the current stage, and store old reference samples through the previous learning level; />By->Edge client composition, which->The edge clients not only collect new data of the current stage, but also store old reference samples of the previous stage; />By->Edge client composition, which->The edge clients only receive new data for the current stage, not including old reference samples.
Further, the average entropy is calculated as:

$$H_{avg}=\frac{1}{n}\sum_{i=1}^{n}I(\hat{P}_i)$$

where $\hat{P}_i$ is the output value of the local model, $n$ is the number of samples, and $I(\cdot)$ is the entropy function, expressed as $I(P)=-\sum_i P_i\log P_i$.

When $H_{avg}$ increases suddenly and satisfies $H_{avg}^{new}\geq r_h\cdot H_{avg}^{old}$ with $r_h=1.2$, the edge client is considered to be receiving new-level data, and the level value is updated from the original $t-1$ to $t$.
Further, the gradient-weighted loss function is constructed as follows:

inputting the training samples $\{(x_i^{l,t},y_i^{l,t})\}$ of the current level into the local model of the edge client to obtain the output values of the last layer of the local model;

calculating a gradient measurement value of each training sample from the output values of the local model;

applying gradient normalization to the rate at which the local model learns new-class samples;

applying gradient normalization to the rate at which the local model forgets old-class samples;

constructing the gradient-weighted loss function $L_{GC}$:

$$L_{GC}=\frac{1}{b}\sum_{i=1}^{b}\bar{w}_i\,D_{CE}(\hat{y}_i^{l,t},y_i^{l,t}),\qquad \bar{w}_i=\mathbb{1}(y_i^{l,t}\in Y^t)\,\frac{|g_i^{l,t}|}{G_n}+\mathbb{1}(y_i^{l,t}\in C^p)\,\frac{|g_i^{l,t}|}{G_o}$$

where $b$ is the batch size; $g_i^{l,t}$ is the gradient measurement of the $i$-th training sample at the $t$-th level for the $l$-th edge client; $\bar{w}_i$ is the gradient-normalized characterization value; $D_{CE}$ is the binary cross-entropy loss; $\hat{y}_i^{l,t}$ is the output of the local model of the $l$-th edge client for sample $i$ at the $t$-th level, and $y_i^{l,t}$ is the health status category label value of sample $i$ at the $t$-th level for the $l$-th edge client; $\mathbb{1}(\cdot)$ is the indicator function, equal to 1 when its condition holds and 0 otherwise; $G_n$ is the gradient-normalized value for the learning rate on new-class samples, and $G_o$ is the gradient-normalized value for the forgetting rate on old-class samples.
Further, the old-class distillation loss function is constructed as follows:

for the $l$-th edge client, obtaining the local model $\Theta^{l,t-1}$ of the previous level and the local model $\Theta^{l,t}$ of the current level;

inputting the training samples into the local models in batches to obtain the output values $\hat{y}^{l,t-1}$ of the previous-level local model and $\hat{y}^{l,t}$ of the current-level local model;

substituting the output values of the previous-level local model for the first $C^p$ dimension values of the sample label (the dimensions corresponding to the old classes), obtaining the label variant $\tilde{y}^{l,t}$;

calculating the old-class distillation loss as

$$L_{RD}=\frac{1}{b}\sum_{i=1}^{b}D_{KL}\!\left(\tilde{y}_i^{l,t}\,\middle\|\,\hat{y}_i^{l,t}\right)$$

where $D_{KL}(\cdot\|\cdot)$ denotes calculating the KL divergence of the two.
Further, for the $l$-th edge client, its local loss function is expressed as:

$$L_l=\lambda_1 L_{GC}+\lambda_2 L_{RD}$$

where $L_{GC}$ is the gradient-weighted loss function, $L_{RD}$ is the old-class distillation loss function, and $\lambda_1$ and $\lambda_2$ are the weights of the two loss functions.
Further, in each autonomous learning level, the determination of the optimal global model includes:
when the edge client detects the appearance of new fault categories, selecting, for each new fault category, a representative prototype sample $x_{pro}$ whose output value in the local model is close to the average output value, in the local model, of all samples of that category;

constructing a feature extractor model, whose gradient is $\nabla W_E$, for extracting the prototype sample features;

adding Gaussian noise to the prototype sample features to obtain the noise prototype sample features;

constructing, in the central server, a sample reconstruction model whose gradient is the same as that of the feature extractor model and whose structure is its reverse;

the central server collecting the noise prototype sample features and the feature extractor model gradients of the edge clients, scrambling them, and inputting the scrambled noise prototype sample features into the sample reconstruction model to obtain prototype-like samples;

after the central server and the edge clients perform the R rounds of cluster collaborative training, storing the global model aggregated in each round, inputting the prototype-like samples into each of the R global models to obtain the diagnosis accuracy of each round's global model, and selecting the global model with the highest diagnosis accuracy as the optimal global model of the level.
Further, the noise prototype sample features are expressed as:

$$\tilde{f}=E(x_{pro})+\gamma\cdot\mathcal{N}(0,\sigma^2)$$

where $\sigma$ is the standard deviation of the prototype sample, $x_{pro}$ is the prototype sample, $E(\cdot)$ denotes the feature extraction process, $\mathcal{N}(\cdot,\cdot)$ denotes the Gaussian distribution, and $\gamma$ is the noise weight.
According to the above technical scheme, compared with the prior art, the invention can fully utilize dynamic data resources scattered across edge clients on the premise of protecting data privacy under a cloud-edge collaborative architecture, and has the following beneficial effects:

1) The central server collects the local model information of each edge client, evaluates the effectiveness of the edge client models from a more comprehensive perspective, and then aggregates the model information, so that knowledge propagates effectively and the appearance of new category data does not degrade the model's diagnosis performance.

2) A gradient communication mechanism is introduced, in which data information is transmitted through model gradients, so that edge clients can share information at the central end without providing raw data, realizing desensitization of the data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a cluster autonomous learning fault diagnosis method for key train components provided by the invention;
FIG. 2 is a network architecture diagram of the autonomous learning hierarchy cluster collaborative training process provided by the invention;
FIG. 3 is a schematic diagram comparing the performance of the method of the present invention with existing deep learning methods.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the embodiment of the invention discloses a cluster autonomous learning fault diagnosis method for key components of a train, which comprises the following steps:
constructing a series of data stream levels, and setting up a plurality of edge clients and a central server;
entering an initial learning level, where each edge client downloads the initial global model from the central server as its local model, and collects its own training samples from the data stream level in batches to perform R rounds of initial training on the local model;
after each round of initial training is completed, each edge client randomly selects M old-class reference samples from the initial training samples and stores them in an old reference sample memory; the central server collects the local models trained by the edge clients for aggregation and updating, obtaining the global model of the initial learning level;
after R rounds of cluster collaborative training at the initial learning level, the autonomous learning level is entered;
in each autonomous learning level, each edge client continuously collects new training data and calculates its average entropy; when the average entropy meets a preset condition, the edge client is considered to be receiving data of a new level, at which point the next autonomous learning level begins;
in each autonomous learning level, constructing a gradient-weighted loss function for each edge client to normalize the learning rate on new-class samples and the forgetting rate on old-class samples;

constructing an old-class distillation loss function for each edge client based on the underlying relation between new-class and old-class samples;

constructing a local loss function for each edge client from the gradient-weighted loss function and the old-class distillation loss function;

with the local loss function as the learning criterion, each edge client collects training samples in batches and performs R rounds of autonomous training on its local model;
after each round of autonomous training is completed, each edge client randomly selects M old-class reference samples from the training samples used in the current round and stores them in the old reference sample memory for subsequent training; the central server collects the trained local models of the edge clients for aggregation; after the R rounds of cluster collaborative training of the autonomous learning level, the central server selects, from a global perspective, the model with the highest diagnosis accuracy among the R rounds as the optimal global model of the level.
The above steps of the present invention will be further described below.
1) Constructing a data stream:
constructing a series of data stream hierarchies, expressed asWherein t represents the t th hierarchy, and layer represents the total number of layers; the t th hierarchical data is expressed as +>The data of the hierarchy is composed of->Sample->And their tag value->Composition (S)/(S)>Time domain vibration signal representing collected critical component, tag value +.>Representing the health status category of the component, +.>Indicated at tTag sets in the hierarchy, which contain->A new health status category is seeded; the set of all old health status categories in the t-1 layer is +.>,/>For the set of old health categories in layer i,to get the union set for t-1 sets, +.>Is the label set in layer j.
K edge clients $\{S_l\}_{l=1}^{K}$ and a central server $S_c$ are set according to actual engineering requirements. An old reference sample memory is established at each edge client, into which M old-class reference samples are randomly selected after training at each level is completed.
A global model $F(x;W)$ formed by alternately stacking convolutional layers, pooling layers, and fully connected layers is constructed on the central server, where $x$ denotes the input data and $W$ the trainable parameters of the model. At each level, the edge clients are divided into three categories, i.e., $S^t=S_o\cup S_b\cup S_n$. Specifically, $S_o$ consists of $K_o$ edge clients that do not collect new data of the current stage but have stored old reference samples from previous learning levels; $S_b$ consists of $K_b$ edge clients that both collect new data of the current stage and store old reference samples of the previous stage; $S_n$ consists of $K_n$ edge clients that only receive new data of the current stage and hold no old reference samples.
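For illustration, the following is a minimal PyTorch sketch of such a convolution-pooling-fully-connected global model F(x; W). The channel counts, kernel sizes, and the single-channel 3200-point input (matching the sample length used in the experiments below) are assumptions of the sketch, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class GlobalModel(nn.Module):
    """Global model F(x; W): stacked Conv1d/pooling blocks plus a fully
    connected classifier head. Layer sizes are illustrative assumptions."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8, padding=28),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.classifier = nn.Linear(32 * 100, num_classes)  # assumes 3200-point input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x)          # (batch, 32, 100) for a 3200-point signal
        return self.classifier(f.flatten(1))

# A batch of single-channel, 3200-point time-domain vibration signals.
model = GlobalModel(num_classes=8)
logits = model(torch.randn(4, 1, 3200))
```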
2) Initial learning hierarchy:
the initial learning level is denoted as level 0. At this level, all edge clients are. The edge client firstly downloads an initial global model from a central server to a local area to obtain a local model and initial values of trainable parameters, and secondly trains the local model by utilizing unique data; wherein each edge client has unique data only a subset of the data in the data stream hierarchy, each edge client has only a portion of the data stream hierarchy, and all client data are added together to be the data in the data stream hierarchy. Specifically, the first edge client will be global model +.>Downloaded to local, local model +.>It is shown that the local model and the global model have the same model structure, the trainable parameters in the global model +.>Initial value trained for local model +.>. Using the collected monitoring data +.>Training local model by batch (i.e. setting a certain batch size, inputting the batch size into the model, and determining the batch size according to the data size), and adopting a classification loss function +.>As learning criterion, a small batch of random gradient descent method is used for the parameters +.>Update by loss function->The following formula describes:
wherein b is the batch size,is a binary cross entropy loss.
After training is completed, M old-class reference samples are randomly selected from the initial samples and stored in the old reference sample memory; these old reference samples join the training set of the next round, which reduces the model forgetting rate. The trained local model is then uploaded to the central server without exposing any raw data. The central server aggregates the local models $\Theta^l$ of the edge clients; after this procedure has been executed for R rounds, the updated global model is obtained. When the initial learning stage ends, the model enters the autonomous learning stage.
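A simplified sketch of one round of this procedure follows: local mini-batch training, random selection of M old-class reference samples, and server-side aggregation. Plain parameter averaging (FedAvg-style) is assumed for the aggregation step, since the text states only that the local models are aggregated; the helper names are illustrative.

```python
import copy
import random
import torch
import torch.nn.functional as F

def local_round(model, loader, lr=1e-3):
    """One round of local training with mini-batch SGD and a
    cross-entropy classification loss, as in the initial learning level."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in loader:
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    return model

def select_reference_samples(samples, M):
    """Randomly keep M old-class reference samples for the next round."""
    return random.sample(samples, min(M, len(samples)))

def aggregate(local_models):
    """Server-side aggregation; plain parameter averaging (FedAvg-style)
    is assumed here, the patent only states that models are aggregated."""
    global_model = copy.deepcopy(local_models[0])
    state = global_model.state_dict()
    for key in state:
        state[key] = torch.stack(
            [m.state_dict()[key].float() for m in local_models]).mean(0)
    global_model.load_state_dict(state)
    return global_model
```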
3) Autonomous learning hierarchy:
in the autonomous learning level, the data is divided into different levels, and the data flow levelWill change with time, when the current is the t-th hierarchy, the edge client is +.>Dynamically changing with the change of data stream layer, adding +/periodically at each layer>Let the number of clients +.>Gradually increasing.
In practical problems, edge client data change dynamically and an edge client does not know when new-level data will arrive; when a new health category appears, it cannot determine whether the newly received label comes from new-level data or from old data types collected by other edge clients. The invention therefore establishes an entropy-change monitoring mechanism to accurately identify the appearance of new-level data. Specifically, for the training data $D^{l,t}$, the average entropy $H_{avg}$ is calculated by the following formula:
$$H_{avg}=\frac{1}{n}\sum_{i=1}^{n}I(\hat{P}_i)$$

where $\hat{P}_i$ is the output value of the local model, $n$ is the number of samples, and $I(\cdot)$ is the entropy function, expressed as $I(P)=-\sum_i P_i\log P_i$.
When $H_{avg}$ increases suddenly and satisfies $H_{avg}^{new}\geq r_h\cdot H_{avg}^{old}$ with $r_h=1.2$, the edge client is considered to be receiving new-level data; the level value is updated from the original $t-1$ to $t$, the data in the old reference sample memory are updated, and the old local model $\Theta^{l,t-1}$ is saved.
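The entropy-change monitor can be sketched as follows; the comparison of the new average entropy against $r_h$ times the previous value follows the condition above, and taking a softmax over the local model's outputs is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

def average_entropy(model, x):
    """Average prediction entropy H_avg over a batch of samples:
    I(P) = -sum_i P_i log P_i, averaged over the batch."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=1)
        ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return ent.mean().item()

def new_level_detected(h_new, h_old, r_h=1.2):
    """Entropy-change monitor: a sudden rise of the average entropy
    above r_h times its previous value signals new-level data."""
    return h_new >= r_h * h_old
```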
At the $t$-th level, the number of samples of each category is unbalanced, so the local model learns the new health categories at different speeds and likewise forgets the old health categories at different speeds, which greatly affects diagnosis efficiency. To address this, the invention constructs a gradient-weighted loss function that normalizes the learning rate on new classes and the forgetting rate on old classes. The loss function is calculated as follows:
i. The training samples $\{(x_i^{l,t},y_i^{l,t})\}$ of the current level are input into the local model of the edge client to obtain the output values of the last layer of the local model.

ii. The gradient measurement value $g_i^{l,t}$ of each training sample is calculated from the output values of the local model, taken as the gradient of the classification loss with respect to the last-layer output for that sample.

iii. Gradient normalization is applied to the rate at which the local model learns new-class samples, yielding the normalizing statistic $G_n$ over the new-class samples of the batch.

iv. Gradient normalization is applied to the rate at which the local model forgets old-class samples, yielding the normalizing statistic $G_o$ over the old-class samples of the batch.

v. The loss function is re-weighted to obtain the gradient-weighted loss function $L_{GC}$:

$$L_{GC}=\frac{1}{b}\sum_{i=1}^{b}\bar{w}_i\,D_{CE}(\hat{y}_i^{l,t},y_i^{l,t}),\qquad \bar{w}_i=\mathbb{1}(y_i^{l,t}\in Y^t)\,\frac{|g_i^{l,t}|}{G_n}+\mathbb{1}(y_i^{l,t}\in C^p)\,\frac{|g_i^{l,t}|}{G_o}$$

where $b$ is the batch size; $g_i^{l,t}$ is the gradient measurement of the $i$-th training sample at the $t$-th level for the $l$-th edge client; $\bar{w}_i$ is the gradient-normalized characterization value; $D_{CE}$ is the binary cross-entropy loss; $\hat{y}_i^{l,t}$ is the output of the local model of the $l$-th edge client for sample $i$ at the $t$-th level; $y_i^{l,t}$ is the health status category label value of sample $i$ at the $t$-th level for the $l$-th edge client; $\mathbb{1}(\cdot)$ is the indicator function, equal to 1 when its condition holds and 0 otherwise; $G_n$ is the gradient-normalized value for the learning rate on new-class samples, and $G_o$ is the gradient-normalized value for the forgetting rate on old-class samples.
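A minimal sketch of this gradient-weighted loss is given below. It assumes the gradient measurement $g_i$ is the magnitude of the cross-entropy gradient with respect to the true-class output, $|p_{y_i}-1|$, and that $G_n$ and $G_o$ are the mean measurements over the new-class and old-class samples of the batch; the exact normalization statistics of the original filing are not reproduced here.

```python
import torch
import torch.nn.functional as F

def gradient_weighted_loss(logits, labels, new_classes):
    """Sketch of the gradient-weighted loss L_GC. Assumptions: the
    per-sample gradient measurement g_i is |p_true - 1| (the CE gradient
    w.r.t. the true-class output), and G_n / G_o are the mean measurements
    over the new-class / old-class samples of the batch."""
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    g = (p_true - 1.0).abs()                      # gradient measurement g_i

    is_new = torch.tensor([int(y) in new_classes for y in labels])
    g_n = g[is_new].mean() if is_new.any() else g.new_tensor(1.0)
    g_o = g[~is_new].mean() if (~is_new).any() else g.new_tensor(1.0)

    w = torch.where(is_new, g / g_n, g / g_o)     # normalized weight w_i
    ce = F.cross_entropy(logits, labels, reduction="none")
    return (w.detach() * ce).mean()
```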
In addition, the appearance of new categories can cause catastrophic forgetting in the local model. To address this, the invention considers the underlying relation between the new and old classes and constructs an old-class distillation loss to effectively relieve catastrophic forgetting. The procedure is as follows:
i. For the $l$-th edge client, the local model $\Theta^{l,t-1}$ of the previous level and the local model $\Theta^{l,t}$ of the current level are obtained.

ii. The training samples are input into the local models in batches, yielding the output values $\hat{y}^{l,t-1}$ of the previous-level local model and $\hat{y}^{l,t}$ of the current-level local model.

iii. The output values of the previous-level local model are substituted for the first $C^p$ dimension values of the sample label (the dimensions corresponding to the old classes), producing the label variant $\tilde{y}^{l,t}$.

iv. The old-class distillation loss is calculated as

$$L_{RD}=\frac{1}{b}\sum_{i=1}^{b}D_{KL}\!\left(\tilde{y}_i^{l,t}\,\middle\|\,\hat{y}_i^{l,t}\right)$$

where $D_{KL}(\cdot\|\cdot)$ denotes calculating the KL divergence of the two.
For the $l$-th edge client, its local loss function is expressed as:

$$L_l=\lambda_1 L_{GC}+\lambda_2 L_{RD}$$

where $L_{GC}$ is the gradient-weighted loss function, $L_{RD}$ is the old-class distillation loss function, and $\lambda_1$ and $\lambda_2$ are the weights of the two loss functions.
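The old-class distillation loss and the combined local criterion can be sketched as follows; it is assumed that the previous-level model's softmax output covers the first $C^p$ (old-class) dimensions and that the label variant replaces exactly those dimensions of the one-hot label.

```python
import torch
import torch.nn.functional as F

def old_class_distillation_loss(cur_logits, prev_logits, labels, n_old, n_cls):
    """Sketch of the old-class distillation loss L_RD: the first C^p
    (old-class) dimensions of the one-hot label are replaced by the
    previous-level model's predictions, and the KL divergence between
    this label variant and the current model's output is computed."""
    target = F.one_hot(labels, n_cls).float()
    with torch.no_grad():
        prev = F.softmax(prev_logits, dim=1)      # previous-level model output
    target[:, :n_old] = prev[:, :n_old]           # label variant
    log_q = F.log_softmax(cur_logits, dim=1)
    return F.kl_div(log_q, target, reduction="batchmean")

def local_loss(l_gc, l_rd, lam1=1.0, lam2=1.0):
    """Combined local criterion L_l = lam1*L_GC + lam2*L_RD (weights assumed)."""
    return lam1 * l_gc + lam2 * l_rd
```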
Through the local loss function, catastrophic forgetting caused by the expansion of new fault classes is relieved from the local perspective, and the model's ability to diagnose new fault classes is improved. However, although the local loss function effectively addresses catastrophic forgetting at the local end, it cannot resolve the data heterogeneity between edge clients, which biases the model's diagnosis results. To address this, the invention adjusts the learning direction of the model from a global perspective, as follows:
When an edge client detects the appearance of new fault categories, a representative prototype sample $x_{pro}$ is selected for each new fault category, such that the output value of the prototype sample in the local model is close to the average output value, in the local model, of all samples of that category. To achieve privacy protection, the prototype sample information is propagated through a gradient communication mechanism.
Specifically, a feature extractor model $E(\cdot)$ is constructed with the same structure as the feature extraction part of the local model; its gradient is denoted $\nabla W_E$, and its output is a feature vector representing the features of the prototype sample.
To prevent information leakage, Gaussian noise is added to the prototype sample features, giving the noise prototype sample features:

$$\tilde{f}=E(x_{pro})+\gamma\cdot\mathcal{N}(0,\sigma^2)$$

where $\sigma$ is the standard deviation of the prototype sample, $x_{pro}$ is the prototype sample, $E(\cdot)$ denotes the feature extraction process, $\mathcal{N}(\cdot,\cdot)$ denotes the Gaussian distribution, and $\gamma$ is the noise weight, which is set to control the influence of the Gaussian noise.
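A sketch of the noise-injection step follows; taking $\sigma$ as the standard deviation of the prototype sample itself and $\gamma=0.1$ are assumptions of the sketch.

```python
import torch

def noisy_prototype_features(extractor, x_proto, gamma=0.1):
    """Sketch of the noise prototype sample features:
    f~ = E(x_proto) + gamma * N(0, sigma^2), with sigma taken as the
    standard deviation of the prototype sample (gamma=0.1 is assumed)."""
    with torch.no_grad():
        feat = extractor(x_proto)                 # E(x_proto)
        sigma = x_proto.std()
        noise = torch.randn_like(feat) * sigma    # N(0, sigma^2)
    return feat + gamma * noise
```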
A sample reconstruction model, whose gradient is the same as that of the feature extractor model and whose structure is its reverse, is constructed in the central server.
The central server collects the noise prototype sample features and feature extractor model gradients of all edge clients and then scrambles them; because the uploads of all edge clients are shuffled together at the central server, a network intruder cannot tell which edge client a given item came from, preventing the information of any particular edge client from being traced. The scrambled noise prototype sample features uploaded by the edge clients are then input into the sample reconstruction model to obtain prototype-like samples.
after R-round cluster collaborative training is carried out on the central server and each edge client, each round of aggregated global modelAnd (5) storing.
The models are then evaluated with the prototype-like samples. The prototype-like samples are simulated counterparts of the prototype samples: whereas the prototype samples contain private data information, the prototype-like samples only mimic the corresponding health-state information, so the privacy of the data is protected. Specifically, the prototype-like samples are input into each of the R global models to obtain the diagnosis accuracy of each round's global model, and the global model with the highest diagnosis accuracy is selected as the optimal global model of the level.
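The server-side scrambling and model-selection step can be sketched as follows; the decoder stands for the sample reconstruction model, and evaluating each round's global model by its classification accuracy on the reconstructed prototype-like samples follows the description above.

```python
import torch

def select_best_global_model(global_models, decoder, noisy_feats, labels):
    """Sketch of the server-side selection step: scramble the client
    uploads, reconstruct prototype-like samples with the decoder, and
    keep the round whose aggregated global model diagnoses them best."""
    perm = torch.randperm(noisy_feats.size(0))    # scramble client uploads
    feats, ys = noisy_feats[perm], labels[perm]
    with torch.no_grad():
        pseudo = decoder(feats)                   # prototype-like samples
        accs = []
        for model in global_models:               # one global model per round r
            pred = model(pseudo).argmax(dim=1)
            accs.append((pred == ys).float().mean().item())
    best = max(range(len(accs)), key=accs.__getitem__)
    return global_models[best], accs[best]
```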
This strategy adjusts the learning direction of the model from the global perspective, effectively relieving the deviation of diagnosis results caused by data heterogeneity between companies and improving the stability of the model.
4) Thereafter, the model remains in the autonomous learning stage, repeating the procedure of 3) whenever new-level data arrive.
In general, the method effectively solves the difficulty of achieving autonomous learning with diagnosis models under a cloud-edge collaborative framework. It breaks the data barriers between edge clients and fully utilizes the data resources scattered across them while protecting data privacy, and it also resolves the catastrophic forgetting of the model caused by the growth of fault types, providing a flexible solution for real rail transit fault diagnosis scenarios.
The method handles the data heterogeneity between edge clients from a global perspective and designs a dedicated gradient communication mechanism that shares intermediate models rather than raw data, breaking the constraint of data privacy and realizing autonomous-learning fault diagnosis in cloud-edge collaborative scenarios. Specifically, the method combines the gradient-weighted loss function and the distillation loss function at the edge client, locally expanding the fault types the model can diagnose; it adopts a cluster strategy that uploads the local models of multiple edge clients to the central server to capture the optimal propagation path of knowledge between edge clients, alleviating the limited knowledge propagation of the model during autonomous learning. In addition, rather than sharing data directly, the invention extracts features from the data, adds a certain amount of interference, and communicates them together with the model gradients, realizing desensitization of the data. The proposed method does not require any edge client to hand over its private data; by sharing model information alone, the edge clients can cooperate closely to update the model and achieve high-precision fault diagnosis. This is of great significance for improving the application potential of cloud-edge collaboration technology in the rail transit field.
The invention also verifies the effectiveness of the method with a fault diagnosis simulation experiment on a train running gear as a case study.
The test bench is designed according to an actual train running gear, with components including a motor, a gearbox, and axle boxes; the scale ratio of the test bench to the actual running gear is set to 1:2 for the fault simulation experiments. The experiment considers 40 working conditions in total. Different train speeds are simulated by setting different running speeds for the running gear; different lateral loads are applied to simulate the lateral forces borne by the train on straight track or curves; and the vertical load is set to a constant 10 kN to simulate the weight of the car body. Different train operating conditions are thus simulated by the working-condition settings, and each working condition corresponds to the data of one edge client. The motor speed is controlled by a frequency converter and the load is applied by electro-hydraulic loading equipment, simulating 32 different health states of the train running gear, including motor phase loss, rotor bending, and axle box bearing inner-race faults. For each working-condition state, 180 samples are taken, each containing 3200 sample points.
To create a cloud-edge collaborative diagnosis scenario, a central server and 20 edge clients are set up in the initial learning stage. The 32 health states in each edge client are divided into 4 parts, arranged in 4 learning levels respectively. As each level is added, 5 new edge clients are introduced. The experiments compare the proposed method with traditional intelligent diagnosis methods. Method A builds 5 individual edge clients that do not adopt the cluster learning strategy and train models only with their local data; their network structures are identical to the global model structure in cluster learning. Method B uses the cluster learning strategy but does not adjust the model learning direction from a global perspective, training the model only with the local loss function. Method C uses the cluster learning strategy and adjusts the learning direction of the model from a global perspective, but does not employ the distillation loss in the local loss function.
Meanwhile, the experiment tests the gradient communication mechanism on the data set to verify the confidentiality of the method. The network parameter configuration is shown in Table 1, and the training-related parameters in Table 2. The experimental results are summarized in Table 3 and FIG. 3.
TABLE 1 Configuration of network parameters

TABLE 2 Summary of training-related parameters

TABLE 3 Comparison of diagnostic accuracy of each method
The experimental results show that the invention maintains high diagnosis accuracy across successive data stream levels, and that, as the levels change, the method shows clear superiority over the other methods.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the others, and the identical or similar parts of the embodiments can be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed therein, its description is relatively brief; for relevant details, refer to the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A cluster autonomous learning fault diagnosis method for key components of a train, characterized by comprising the following steps:
constructing a series of data stream levels, and setting up a plurality of edge clients and a central server; the series of data stream levels is denoted $\{D^t\}_{t=1}^{layer}$, where $t$ denotes the $t$-th level and $layer$ the total number of levels; the data of the $t$-th level is denoted $D^t=\{(x_i^t,y_i^t)\}_{i=1}^{N_t}$, consisting of $N_t$ samples $x_i^t$ and their label values $y_i^t$, where $x_i^t$ is a collected time-domain vibration signal of a key component and the label value $y_i^t$ represents the health status category of the component; $Y^t$ denotes the label set at the $t$-th level, which contains $C^t$ new health status categories; the set of all old health status categories over the first $t-1$ levels is $C^p=\bigcup_{j=1}^{t-1}Y^j$, where $\bigcup$ denotes taking the union over the $t-1$ sets and $Y^j$ is the label set at the $j$-th level;
entering an initial learning level, where each edge client downloads the initial global model from the central server as its local model, and collects its own training samples from the data stream level in batches to perform R rounds of initial training on the local model;

after each round of initial training is completed, each edge client randomly selects M old-class reference samples from the training samples used in the current round and stores them in an old reference sample memory for the next round of training, and the central server collects the local models trained by the edge clients for aggregation and updating, obtaining the global model of the initial learning level;

after R rounds of cluster collaborative training at the initial learning level, entering the autonomous learning level;

in each autonomous learning level, each edge client continuously collecting new training data and calculating its average entropy, and, when the average entropy meets a preset condition, the edge client being considered to be receiving data of a new level, at which point the next autonomous learning level begins;
in each autonomous learning level, constructing a gradient-weighted loss function for each edge client to normalize the learning rate on new-class samples and the forgetting rate on old-class samples; the gradient-weighted loss function being constructed as follows:

inputting the training samples $\{(x_i^{l,t},y_i^{l,t})\}$ of the current level into the local model of the edge client to obtain the output values of the last layer of the local model;

calculating a gradient measurement value of each training sample from the output values of the local model;

applying gradient normalization to the rate at which the local model learns new-class samples;

applying gradient normalization to the rate at which the local model forgets old-class samples;

constructing the gradient-weighted loss function $L_{GC}$:

$$L_{GC}=\frac{1}{b}\sum_{i=1}^{b}\bar{w}_i\,D_{CE}(\hat{y}_i^{l,t},y_i^{l,t}),\qquad \bar{w}_i=\mathbb{1}(y_i^{l,t}\in Y^t)\,\frac{|g_i^{l,t}|}{G_n}+\mathbb{1}(y_i^{l,t}\in C^p)\,\frac{|g_i^{l,t}|}{G_o}$$

wherein $b$ is the batch size; $g_i^{l,t}$ is the gradient measurement of the $i$-th training sample at the $t$-th level for the $l$-th edge client; $\bar{w}_i$ is the gradient-normalized characterization value; $D_{CE}$ is the binary cross-entropy loss; $\hat{y}_i^{l,t}$ is the output of the local model of the $l$-th edge client for sample $i$ at the $t$-th level; $y_i^{l,t}$ is the health status category label value of sample $i$ at the $t$-th level for the $l$-th edge client; $\mathbb{1}(\cdot)$ is the indicator function, whose value is 1 when its condition holds and 0 otherwise; $G_n$ is the gradient-normalized value for the learning rate on new-class samples, and $G_o$ is the gradient-normalized value for the forgetting rate on old-class samples;
constructing an old-class distillation loss function for each edge client based on the underlying relation between new-class and old-class samples; the old-class distillation loss function being constructed as follows:

for the $l$-th edge client, obtaining the local model $\Theta^{l,t-1}$ of the previous level and the local model $\Theta^{l,t}$ of the current level;

inputting the training samples into the local models in batches to obtain the output values $\hat{y}^{l,t-1}$ of the previous-level local model and $\hat{y}^{l,t}$ of the current-level local model;

substituting the output values of the previous-level local model for the first $C^p$ dimension values of the sample label, obtaining the label variant $\tilde{y}^{l,t}$;

calculating the old-class distillation loss $L_{RD}$ as

$$L_{RD}=\frac{1}{b}\sum_{i=1}^{b}D_{KL}\!\left(\tilde{y}_i^{l,t}\,\middle\|\,\hat{y}_i^{l,t}\right)$$

wherein $D_{KL}(\cdot\|\cdot)$ denotes calculating the KL divergence of the two;
constructing a local loss function for each edge client from the gradient-weighted loss function and the old-class distillation loss function; for the $l$-th edge client, the local loss function being expressed as

$$L_l=\lambda_1 L_{GC}+\lambda_2 L_{RD}$$

wherein $L_{GC}$ is the gradient-weighted loss function, $L_{RD}$ is the old-class distillation loss function, and $\lambda_1$ and $\lambda_2$ are the weights of the two loss functions;
with the local loss function as the learning criterion, each edge client collecting training samples in batches and performing R rounds of autonomous training on its local model;
after each round of autonomous training is completed, the central server collecting the trained local models of the edge clients for aggregation; after the R rounds of cluster collaborative training of the autonomous learning level, the central server selecting, from a global perspective, the model with the highest diagnosis accuracy among the R rounds as the optimal global model of the level;
and performing fault diagnosis on the key parts of the train based on the optimal global model.
2. The cluster autonomous learning fault diagnosis method for key train components according to claim 1, wherein, after training at each level is completed, the M old-class reference samples randomly selected by the edge client satisfy the following relationship:
3. The cluster autonomous learning fault diagnosis method for key train components according to claim 1, wherein, in the autonomous learning stage, the data stream level changes over time; at the $t$-th level, the edge client set $S^t$ changes dynamically with the data stream level, and a new group $S_n$ is added at each level from time to time, so that the number of clients $K=K_o+K_b+K_n$ gradually increases; wherein $S_o$ consists of $K_o$ edge clients that do not collect new data of the current stage but store old reference samples from previous learning levels; $S_b$ consists of $K_b$ edge clients that both collect new data of the current stage and store old reference samples of the previous stage; and $S_n$ consists of $K_n$ edge clients that only receive new data of the current stage and hold no old reference samples.
4. The cluster autonomous learning fault diagnosis method for key train components according to claim 1, wherein the average entropy is calculated as:

$$H_{avg}=\frac{1}{n}\sum_{i=1}^{n}I(\hat{P}_i)$$

wherein $\hat{P}_i$ is the output value of the local model, $n$ is the number of samples, and $I(\cdot)$ is the entropy function, denoted $I(P)=-\sum_i P_i\log P_i$;

when $H_{avg}$ increases suddenly and satisfies $H_{avg}^{new}\geq r_h\cdot H_{avg}^{old}$ with $r_h=1.2$, the edge client is considered to be receiving new-level data, and the level value is updated from $t-1$ to $t$.
5. The cluster autonomous learning fault diagnosis method for key train components according to claim 1, wherein the determination of the optimal global model in each autonomous learning level comprises:
when the edge client detects the appearance of new fault categories, selecting therefrom, for each new fault category, a representative prototype sample $x_{pro}$ whose output value in the local model is close to the average output value of all samples of that category in the local model;

constructing a feature extractor model, whose gradient is $\nabla W_E$, for extracting the prototype sample features;
adding Gaussian noise to the prototype sample characteristics to obtain noise prototype sample characteristics;
constructing, in the central server, a sample reconstruction model whose gradient is the same as that of the feature extractor model and whose structure is its reverse;
the central server collects noise prototype sample characteristics of each edge client and the gradient of the characteristic extractor model, then performs scrambling, and inputs the scrambled noise prototype sample characteristics into a sample reconstruction model to obtain a prototype-like sample;
after the central server and each edge client perform R round cluster collaborative training, each round of aggregated global model is stored, prototype samples are respectively input into the R round global models, the diagnosis accuracy of each round of global models is obtained, and the global model with the highest diagnosis accuracy is selected as the optimal global model of the hierarchy.
6. The cluster autonomous learning fault diagnosis method for key train components according to claim 5, wherein the noise prototype sample features are expressed as:

$$\tilde{f}=E(x_{pro})+\gamma\cdot\mathcal{N}(0,\sigma^2)$$

wherein $\sigma$ is the standard deviation of the prototype sample, $x_{pro}$ is the prototype sample, $E(\cdot)$ denotes the feature extraction process, $\mathcal{N}(\cdot,\cdot)$ denotes the Gaussian distribution, and $\gamma$ is the noise weight.
CN202410064148.5A 2024-01-17 2024-01-17 Cluster autonomous learning fault diagnosis method for key parts of train Active CN117591888B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410064148.5A CN117591888B (en) 2024-01-17 2024-01-17 Cluster autonomous learning fault diagnosis method for key parts of train

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410064148.5A CN117591888B (en) 2024-01-17 2024-01-17 Cluster autonomous learning fault diagnosis method for key parts of train

Publications (2)

Publication Number Publication Date
CN117591888A CN117591888A (en) 2024-02-23
CN117591888B (en) 2024-04-12

Family

ID=89918630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410064148.5A Active CN117591888B (en) 2024-01-17 2024-01-17 Cluster autonomous learning fault diagnosis method for key parts of train

Country Status (1)

Country Link
CN (1) CN117591888B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113127931A (en) * 2021-06-18 2021-07-16 国网浙江省电力有限公司信息通信分公司 Federal learning differential privacy protection method for adding noise based on Rayleigh divergence
CN114254700A (en) * 2021-12-06 2022-03-29 中国海洋大学 TBM hob fault diagnosis model construction method based on federal learning
CN114429153A (en) * 2021-12-31 2022-05-03 苏州大学 Lifetime learning-based gearbox increment fault diagnosis method and system
CN114828095A (en) * 2022-03-24 2022-07-29 上海科技大学 Efficient data perception layered federated learning method based on task unloading
CN114861671A (en) * 2022-04-11 2022-08-05 深圳追一科技有限公司 Model training method and device, computer equipment and storage medium
TW202242722A (en) * 2021-04-16 2022-11-01 馬爾他商優奈有限公司 Xai and xnn conversion
WO2022261353A1 (en) * 2021-06-09 2022-12-15 Intel Corporation Uses of coded data at multi-access edge computing server
WO2023175335A1 (en) * 2022-03-18 2023-09-21 King's College London A time-triggered federated learning algorithm
CN117077765A (en) * 2023-06-01 2023-11-17 华东理工大学 Electroencephalogram signal identity recognition method based on personalized federal incremental learning
CN117313251A (en) * 2023-11-30 2023-12-29 北京交通大学 Train transmission device global fault diagnosis method based on non-hysteresis progressive learning
CN117313000A (en) * 2023-09-19 2023-12-29 北京交通大学 Motor brain learning fault diagnosis method based on sample characterization topology

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2759087C1 (en) * 2020-12-07 2021-11-09 Общество с ограниченной ответственностью "Группа АйБи ТДС" Method and system for static analysis of executable files based on predictive models
US20220383123A1 (en) * 2021-05-28 2022-12-01 Microsoft Technology Licensing, Llc Data-aware model pruning for neural networks
GB2612866A (en) * 2021-11-09 2023-05-17 Samsung Electronics Co Ltd Method and apparatus for class incremental learning
US20230214642A1 (en) * 2022-01-05 2023-07-06 Google Llc Federated Learning with Partially Trainable Networks

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW202242722A (en) * 2021-04-16 2022-11-01 馬爾他商優奈有限公司 Xai and xnn conversion
WO2022261353A1 (en) * 2021-06-09 2022-12-15 Intel Corporation Uses of coded data at multi-access edge computing server
CN113127931A (en) * 2021-06-18 2021-07-16 国网浙江省电力有限公司信息通信分公司 Federal learning differential privacy protection method for adding noise based on Rayleigh divergence
CN114254700A (en) * 2021-12-06 2022-03-29 中国海洋大学 TBM hob fault diagnosis model construction method based on federal learning
CN114429153A (en) * 2021-12-31 2022-05-03 苏州大学 Lifetime learning-based gearbox increment fault diagnosis method and system
WO2023175335A1 (en) * 2022-03-18 2023-09-21 King's College London A time-triggered federated learning algorithm
CN114828095A (en) * 2022-03-24 2022-07-29 上海科技大学 Efficient data perception layered federated learning method based on task unloading
CN114861671A (en) * 2022-04-11 2022-08-05 深圳追一科技有限公司 Model training method and device, computer equipment and storage medium
CN117077765A (en) * 2023-06-01 2023-11-17 华东理工大学 Electroencephalogram signal identity recognition method based on personalized federal incremental learning
CN117313000A (en) * 2023-09-19 2023-12-29 北京交通大学 Motor brain learning fault diagnosis method based on sample characterization topology
CN117313251A (en) * 2023-11-30 2023-12-29 北京交通大学 Train transmission device global fault diagnosis method based on non-hysteresis progressive learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yang Yang et al. Two-Stage Edge-Side Fault Diagnosis Method Based on Double Knowledge Distillation. Computers, Materials & Continua. 2023, pp. 3624-3649. *
Research on fault diagnosis technology for planetary gearboxes based on deep belief networks; Feng Jianyu; China Master's Theses Full-text Database, Engineering Science and Technology II; 2021-06-15 (No. 06); C033-274 *

Also Published As

Publication number Publication date
CN117591888A (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN112101532B (en) Self-adaptive multi-model driving equipment fault diagnosis method based on edge cloud cooperation
CN107967803A (en) Traffic congestion Forecasting Methodology based on multi-source data and variable-weight combined forecasting model
CN108537259A (en) Train control on board equipment failure modes and recognition methods based on Rough Sets Neural Networks model
CN114118156A (en) Equipment fault diagnosis method and device, electronic equipment and storage medium
Obiedat et al. A new method for identifying the central nodes in fuzzy cognitive maps using consensus centrality measure
CN107895038A (en) A kind of link prediction relation recommends method and device
CN109242250A A user's behavior confidence level detection method based on the entropy method and cloud model
CN101404591B (en) Self-adapting dynamic trust weight estimation method
CN110298374A (en) A kind of driving locus energy consumption analysis method and apparatus based on deep learning
WO2022217210A1 (en) Privacy-aware pruning in machine learning
CN116489038A (en) Network traffic prediction method, device, equipment and medium
CN117150416A (en) Method, system, medium and equipment for detecting abnormal nodes of industrial Internet
CN115879369A (en) Coal mill fault early warning method based on optimized LightGBM algorithm
CN108228959A (en) Using the method for Random censorship estimating system virtual condition and using its wave filter
CN117591888B (en) Cluster autonomous learning fault diagnosis method for key parts of train
Meskauskas et al. XAI-based fuzzy SWOT maps for analysis of complex systems
CN111464327A (en) Spatial information network survivability evaluation method based on graph convolution network
Zamri et al. A novel hybrid fuzzy weighted average for MCDM with interval triangular type-2 fuzzy sets
Mark et al. Network estimation via poisson autoregressive models
Rivero et al. Short-term rainfall time series prediction with incomplete data
Kim et al. Performance impact of differential privacy on federated learning in vehicular networks
CN117033997A (en) Data segmentation method, device, electronic equipment and medium
CN107421738A (en) A kind of epicyclic gearbox method for diagnosing faults based on flow graph
Luo et al. A novel method for remaining useful life prediction of roller bearings involving the discrepancy and similarity of degradation trajectories
Krishnamurthy et al. Segregation in social networks: Markov bridge models and estimation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant