CN114970709B

CN114970709B - Improved GA-based data-driven AHU multi-fault diagnosis feature selection method

Info

Publication number: CN114970709B
Application number: CN202210555420.0A
Authority: CN
Inventors: 苏义鑫; 史凡跃; ***; 张华军; 刘晨宇
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2022-05-20
Filing date: 2022-05-20
Publication date: 2024-06-14
Anticipated expiration: 2042-05-20
Also published as: CN114970709A

Abstract

The invention relates to the technical field of feature selection for AHU fault diagnosis sample data, and discloses an improved-GA-based data-driven AHU multi-fault diagnosis feature selection method comprising the following steps: determining the faults, acquiring sample data, preprocessing the data, rejecting redundant features, initializing the improved GA parameters, heuristically initializing the improved GA population, performing feature selection on the sample data, establishing and training a fault diagnosis model, and obtaining an optimal feature subset. The method can accurately select the feature elements that are highly correlated with a given fault, effectively reduce the dimension of the sample data, reduce the amount of computation, and improve the accuracy of AHU fault diagnosis.

Description

Improved GA-based data-driven AHU multi-fault diagnosis feature selection method

Technical Field

The invention relates to the technical field of feature selection of AHU fault diagnosis sample data, in particular to a data-driven AHU multi-fault diagnosis feature selection method based on improved GA.

Background

An air handling unit (AHU) is a core component of a heating, ventilation and air-conditioning (HVAC) system and plays a crucial role in keeping the HVAC system working normally. Because it runs for long periods under heavy load, it is also the subsystem of the HVAC system most prone to failure, so its faults need to be diagnosed in order to improve the safety and reliability of the HVAC system and to facilitate service and maintenance. In recent years, data-driven AHU fault diagnosis methods, which build a fault diagnosis model by learning useful patterns from large amounts of historical data, have gradually become a research hotspot because they offer higher accuracy and depend little on expert knowledge or on a mathematical model of the AHU.

Because the AHU system is large and complex and has many parameters, how to select effective system parameters for constructing the fault diagnosis model has become both the key and the difficulty of data-driven AHU fault diagnosis research. If useful sample data cannot be acquired, it is difficult to establish an accurate AHU fault diagnosis model; if the feature elements selected from the sample data are too few, too many, or weakly correlated with the faults, the diagnostic accuracy of the resulting model is low. In addition, too many feature elements lead to the curse of dimensionality in computation. It is therefore necessary to find a method that can effectively select the feature elements suited to diagnosing a given fault.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and to provide an improved-GA-based data-driven AHU multi-fault diagnosis feature selection method that can accurately select the feature elements highly correlated with a given fault, effectively reduce the dimension of the sample data, reduce the amount of computation, and improve the accuracy of AHU fault diagnosis.

In order to achieve the above object, the data-driven AHU multi-fault diagnosis feature selection method based on an improved GA according to the present invention includes the steps of:

A) Determining the faults: determining the faults in the AHU for which a fault diagnosis model needs to be established;

B) Sample data acquisition: first acquiring sample data under normal operation, then artificially applying the faults to the AHU and acquiring sample data under each fault, so as to obtain the sample data;

C) Data preprocessing: labeling the sample data and normalizing the sample data;

D) Rejecting redundant features: removing redundant features from the sample data by using the Pearson correlation coefficient method;

E) Improved GA parameter initialization: the parameters comprise the population size, the maximum number of iterations, the mutation probability, the crossover probability, the expected feature number, and the variance of the normal distribution obeyed by the feature number;

F) Heuristic improved GA population initialization: using binary coding, generating a population in which the feature number of each feature subset obeys a normal distribution whose mean is the expected feature number and whose variance is the variance set in the step E);

G) Feature selection on the sample data: decoding the individuals in the population, and performing feature selection on the sample data with the decoded feature subsets to obtain the sample data after feature selection;

H) Establishing and training a fault diagnosis model: randomly dividing a part of the sample data into a training set, and inputting the training set into a classifier for training to obtain a fault diagnosis model; using the remaining sample data as a test set to test the fault diagnosis model and outputting a test result, the test result comprising an accuracy rate, a false alarm rate and a missed alarm rate; returning the test result to the fitness function of the improved GA and calculating the fitness value of each individual; performing selection, crossover and mutation according to the individual fitness values, the crossover probability and the mutation probability being adaptively adjusted according to the individual fitness value and the convergence speed;

L) Obtaining an optimal feature subset: after the population is updated, returning to the step G) until the iteration termination condition is met; the feature subset obtained by decoding the optimal individual in the last-generation population is the optimal feature subset.

Preferably, in the step A), the system parameters that become abnormal after the AHU fails are selected and defined as potential feature elements {x₁, x₂, ..., x_n} for fault diagnosis, sensors of the corresponding types are arranged at the positions corresponding to the system parameters, and the sensor readings are transmitted to a computer through a data acquisition module.

Preferably, in the step B), the sample data under each type of fault are denoted X_i, where i is the serial number of the sample data, X_i = (x₁, x₂, ..., x_n)^T, and x₁, x₂, ..., x_n are the feature elements used for fault diagnosis; in the step C), the actual label l_c of the fault class is added to the sample data, so that X_i = (x₁, x₂, ..., x_n, l_c)^T s.t. l_c ∈ {1, 2, ..., k}, where k is the total number of fault classes, and all the sample data are then normalized to the (0, 1) interval; in the step D), redundant features in the sample data are removed by the Pearson correlation coefficient method: the Pearson correlation coefficient ρ between any two feature elements is calculated, and if its absolute value is greater than or equal to a set threshold T_ρ, either one of the two feature elements is removed. The mathematical expression of the Pearson correlation coefficient ρ is as follows:

$$\rho(x_a, x_b) = \frac{E(x_a x_b) - E(x_a)E(x_b)}{\sqrt{E(x_a^2) - E^2(x_a)}\,\sqrt{E(x_b^2) - E^2(x_b)}}$$

where x_a and x_b represent any two feature elements and E(·) represents the expectation. After the redundant features are removed, the sample data become Y_i = (y₁, y₂, ..., y_p, l_c)^T s.t. p ≤ n, and the number of features removed is (n − p).
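The redundancy check of step D) can be illustrated with a minimal sketch in plain Python. The function names `pearson` and `drop_redundant`, the sample columns and the threshold value 0.95 are all assumptions made for illustration; the source only states that one of the two correlated feature elements is removed, so keeping the earlier column is an arbitrary choice here.

```python
import math

def pearson(xa, xb):
    """Pearson correlation of two equal-length feature columns, written with expectations."""
    n = len(xa)
    e_a = sum(xa) / n
    e_b = sum(xb) / n
    e_ab = sum(a * b for a, b in zip(xa, xb)) / n
    var_a = sum(a * a for a in xa) / n - e_a ** 2
    var_b = sum(b * b for b in xb) / n - e_b ** 2
    if var_a <= 0 or var_b <= 0:
        return 0.0  # constant column: correlation undefined, treated as 0 in this sketch
    return (e_ab - e_a * e_b) / math.sqrt(var_a * var_b)

def drop_redundant(columns, threshold):
    """Return the indices of the feature columns that are kept.

    A column is dropped when |rho| against an already kept column is >= threshold.
    Keeping the earlier column is an assumption of this sketch.
    """
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(columns[i], col)) < threshold for i in kept):
            kept.append(j)
    return kept

# Made-up, already normalized feature columns (one list per feature element).
columns = [
    [0.1, 0.2, 0.3, 0.4],  # x1
    [0.2, 0.4, 0.6, 0.8],  # x2 = 2 * x1, fully redundant with x1
    [0.9, 0.1, 0.8, 0.2],  # x3, only weakly correlated with x1
]
kept = drop_redundant(columns, threshold=0.95)  # hypothetical T_rho
```

With these columns, x2 is rejected and x1 and x3 survive, so n = 3 and p = 2.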

Preferably, in the step E), the population size P = 50, the maximum number of iterations T_i = 100, the mutation probability P_m⁰ = 0.1, the crossover probability P_c⁰ = 0.8, the expected feature number N_g = N/2, where N is the total number of feature elements in the sample data, and the variance σ² = 1.
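A minimal sketch of how the heuristic initialization of step F) could use these parameters is given below. The helper name `init_population`, the rounding of the drawn feature number to an integer, its clipping to the range [1, N], and the value N = 20 are assumptions; the source only states that the feature number obeys a normal distribution with mean N_g and variance σ².

```python
import random

def init_population(pop_size, n_features, expected, variance, rng):
    """Binary-coded population whose count of 1s is drawn from N(expected, variance)."""
    std = variance ** 0.5
    population = []
    for _ in range(pop_size):
        # Draw the feature number, then round and clip it (assumed handling).
        k = round(rng.gauss(expected, std))
        k = max(1, min(n_features, k))
        # Place the k ones at random gene positions.
        ones = set(rng.sample(range(n_features), k))
        population.append([1 if j in ones else 0 for j in range(n_features)])
    return population

rng = random.Random(0)
N = 20  # hypothetical total number of feature elements
population = init_population(pop_size=50, n_features=N, expected=N / 2,
                             variance=1.0, rng=rng)
counts = [sum(individual) for individual in population]
```

Each individual then selects roughly N/2 feature elements, which concentrates the initial search around the expected subset size instead of spreading it over all possible sizes.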

Preferably, in the step G), the individuals in the population are decoded and the corresponding feature subset is selected, where 1 represents selection and 0 represents rejection; the sample data after feature selection are Z_i = (z₁, z₂, ..., z_q, l_c)^T, where q is the number of 1s in the chromosome code of the individual.
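The decoding of step G) amounts to applying the chromosome as a mask over the feature elements. The sketch below assumes each sample is stored as a list whose last entry is the label l_c; the function name `select_features` and the numeric values are made up for illustration.

```python
def select_features(sample, chromosome):
    """Keep the feature values whose gene is 1; the trailing label is always kept."""
    *features, label = sample
    if len(features) != len(chromosome):
        raise ValueError("chromosome length must equal the number of features")
    selected = [x for x, gene in zip(features, chromosome) if gene == 1]
    return selected + [label]

y_i = [0.12, 0.80, 0.33, 0.57, 2]  # (y1, ..., y4, l_c), illustrative values
chromosome = [1, 0, 0, 1]          # selects y1 and y4, so q = 2
z_i = select_features(y_i, chromosome)
```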

Preferably, in the step H), the training set is 60%-80% of the sample data, the rest is the test set, and the classifier is one of an SVM, an ANN, a decision tree or a random forest.
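The random division into training and test sets can be sketched as follows. The function name `split_samples`, the 70% share and the toy samples are assumptions; training the classifier itself is left out, since the source allows several classifier types.

```python
import random

def split_samples(samples, train_fraction, rng):
    """Randomly split the samples; the share must lie in the 60%-80% range of the source."""
    if not 0.6 <= train_fraction <= 0.8:
        raise ValueError("training share is expected to be between 0.6 and 0.8")
    shuffled = samples[:]          # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten made-up samples of the form [feature value, label].
samples = [[i / 10, i % 3] for i in range(10)]
train, test = split_samples(samples, 0.7, random.Random(1))
```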

Preferably, in the step H), the accuracy rate ac, the false alarm rate wb and the missed alarm rate lb are:

$$ac = \frac{N_{r}}{N_{test}}, \qquad wb = \frac{N_{FP}}{N_{FP} + N_{TN}}, \qquad lb = \frac{N_{FN}}{N_{FN} + N_{TP}}$$

where N_test is the total number of sample data in the test set; N_r is the number of correctly predicted sample data in the test set, i.e. the number of sample data whose predicted label matches the actual label l_c; N_TP is the number of sample data for which both the predicted label and the actual label l_c are faulty; N_TN is the number of sample data for which both the predicted label and the actual label l_c are normal; N_FP is the number of sample data for which the predicted label is faulty and the actual label l_c is normal; and N_FN is the number of sample data for which the predicted label is normal and the actual label l_c is faulty.
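As a worked illustration, the three rates can be computed from the test-set labels as sketched below. The sketch assumes the conventional definitions, i.e. false alarm rate = false alarms / all actually normal samples and missed alarm rate = missed faults / all actually faulty samples, and it assumes that one label value (here 1) denotes the normal state; neither the function name `diagnosis_rates` nor the label convention comes from the source.

```python
def diagnosis_rates(actual, predicted, normal_label):
    """Return (accuracy, false alarm rate, missed alarm rate) for a test set."""
    pairs = list(zip(actual, predicted))
    n_test = len(pairs)
    n_correct = sum(a == p for a, p in pairs)
    # "Faulty" means any label other than the normal one, whatever the fault class.
    tp = sum(a != normal_label and p != normal_label for a, p in pairs)
    tn = sum(a == normal_label and p == normal_label for a, p in pairs)
    fp = sum(a == normal_label and p != normal_label for a, p in pairs)
    fn = sum(a != normal_label and p == normal_label for a, p in pairs)
    ac = n_correct / n_test
    wb = fp / (fp + tn) if fp + tn else 0.0
    lb = fn / (fn + tp) if fn + tp else 0.0
    return ac, wb, lb

actual = [1, 1, 1, 1, 2, 2, 3, 3]     # made-up labels, 1 = normal (assumed)
predicted = [1, 1, 1, 2, 2, 1, 3, 2]
ac, wb, lb = diagnosis_rates(actual, predicted, normal_label=1)
```

In this example five of eight predictions are correct (ac = 0.625), one of four normal samples raises a false alarm (wb = 0.25), and one of four faulty samples is missed (lb = 0.25).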

Preferably, in the step H), the fitness function of the improved GA is:

f(v_h)＝λ*ac+μ*wb+ε*lb s.t.λ+μ+ε＝1,h∈{1,2,...,P}

where v_h is an individual in the population, P is the population size, λ is the weight coefficient of the accuracy rate ac, μ is the weight coefficient of the false alarm rate wb, and ε is the weight coefficient of the missed alarm rate lb.
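The fitness function can be evaluated directly from the three rates. The sketch below implements the expression exactly as printed, including the constraint λ + μ + ε = 1; the weight values are illustrative only, because the source does not state how the weights are chosen.

```python
def fitness(ac, wb, lb, lam, mu, eps):
    """f(v_h) = lam*ac + mu*wb + eps*lb, with the weights required to sum to 1."""
    if abs(lam + mu + eps - 1.0) > 1e-9:
        raise ValueError("weights must satisfy lam + mu + eps = 1")
    return lam * ac + mu * wb + eps * lb

# Rates of one hypothetical individual and illustrative weights.
value = fitness(ac=0.625, wb=0.25, lb=0.25, lam=0.5, mu=0.25, eps=0.25)
```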

Preferably, in the step H), the crossover probability and the mutation probability are adaptively adjusted according to the following strategies:

where T is the current iteration number, i.e. the generation of the population; p_c and p_m are respectively the crossover probability and the mutation probability of the individual v_h in the T-th generation population; ξ is the adjustment coefficient of the crossover probability; the crossover probability and the mutation probability each have an upper limit value; f_max, f_min and f_avg are respectively the maximum, minimum and average fitness values in the T-th generation population; f(v_h) is the fitness value of the individual v_h in the T-th generation population; f_g is the desired optimal fitness value; and the second-order backward difference of the maximum fitness during the iteration is calculated as follows:

where f_max(0) is the maximum fitness value in the initial population.

Preferably, in the step L), the iteration termination condition is: the iteration number T of the population equals the maximum iteration number T_i, or the second-order backward difference of the maximum fitness of the population is less than or equal to a set threshold, or the maximum fitness value f_max of the population is greater than or equal to the expected optimal fitness value f_g.
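The three termination conditions can be sketched as below. The sketch uses the textbook second-order backward difference f_max(T) − 2·f_max(T−1) + f_max(T−2); comparing its absolute value with the threshold, skipping the check while fewer than three generations are recorded, and the names `second_backward_difference` and `should_stop` are all assumptions, since the exact expression and its handling of the first generations are not legible in this text.

```python
def second_backward_difference(f_max_history):
    """f_max(T) - 2*f_max(T-1) + f_max(T-2) for the latest generation, or None."""
    if len(f_max_history) < 3:
        return None  # not enough generations yet (assumed handling)
    older, previous, latest = f_max_history[-3:]
    return latest - 2 * previous + older

def should_stop(generation, max_generations, f_max_history, diff_threshold, f_goal):
    """True when any of the three termination conditions holds."""
    if generation >= max_generations:
        return True
    if f_max_history and f_max_history[-1] >= f_goal:
        return True
    diff = second_backward_difference(f_max_history)
    # Using the absolute value is an assumption of this sketch.
    return diff is not None and abs(diff) <= diff_threshold

stop_by_count = should_stop(100, 100, [0.5, 0.6, 0.7], 1e-3, 0.99)
stop_by_stall = should_stop(5, 100, [0.9, 0.9, 0.9], 1e-3, 0.99)
stop_by_goal = should_stop(5, 100, [0.5, 0.7, 0.995], 1e-3, 0.99)
keep_going = should_stop(5, 100, [0.5, 0.6, 0.8], 1e-3, 0.99)
```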

Compared with the prior art, the invention has the following advantages:

1. The method can accurately select the feature elements that are highly correlated with a given fault, effectively reduce the dimension of the sample data, reduce the amount of computation, and improve the accuracy of AHU fault diagnosis;

2. Using the improved GA for fault diagnosis feature selection effectively improves the selection result and shortens the selection time.

Drawings

FIG. 1 is a schematic flow diagram of a diagnostic model in a data driven AHU multi-fault diagnostic feature selection method based on an improved GA of the present invention;

FIG. 2 is a schematic flow chart of the improved GA of the present invention.

Detailed Description

The invention will now be described in further detail with reference to the drawings and to specific examples.

As shown in FIGS. 1 and 2, an improved-GA-based data-driven AHU multi-fault diagnosis feature selection method includes the steps of:

A) Determining the faults: determining the faults in the AHU, such as coil fouling and scaling, fan faults, controller faults, fresh air valve faults, return air valve faults, cooling coil faults and the like; selecting the system parameters that become abnormal after the AHU fails, i.e. the system parameters that are highly correlated with the faults, and defining them as potential feature elements {x₁, x₂, ..., x_n} for fault diagnosis; arranging sensors of the corresponding types at the positions corresponding to the system parameters, the sensor readings being transmitted to a computer through a data acquisition module; for example, to acquire the supply air temperature and humidity, temperature and humidity sensors are normally arranged at the supply air duct or outlet;

B) Sample data acquisition: first acquiring sample data under normal operation, then manually applying the faults to the AHU and acquiring sample data under each fault, so as to obtain the sample data, wherein the sample data under each type of fault are denoted X_i, i is the serial number of the sample data, X_i = (x₁, x₂, ..., x_n)^T, and x₁, x₂, ..., x_n are the feature elements used for fault diagnosis;

C) Data preprocessing: adding the actual label l_c of the fault class to the sample data, so that X_i = (x₁, x₂, ..., x_n, l_c)^T s.t. l_c ∈ {1, 2, ..., k}, where k is the total number of fault classes, and then normalizing all the sample data to the (0, 1) interval;

D) Rejecting redundant features: removing redundant features from the sample data by using the Pearson correlation coefficient method: the Pearson correlation coefficient ρ between any two feature elements is calculated, and if its absolute value is greater than or equal to a set threshold T_ρ, either one of the two feature elements is removed, wherein the mathematical expression of the Pearson correlation coefficient ρ is as follows:

$$\rho(x_a, x_b) = \frac{E(x_a x_b) - E(x_a)E(x_b)}{\sqrt{E(x_a^2) - E^2(x_a)}\,\sqrt{E(x_b^2) - E^2(x_b)}}$$

where x_a and x_b represent any two feature elements and E(·) represents the expectation; after the redundant features are removed, the sample data become Y_i = (y₁, y₂, ..., y_p, l_c)^T s.t. p ≤ n, and the number of features removed is (n − p);

E) Improved GA parameter initialization: the parameters comprise the population size, the maximum number of iterations, the mutation probability, the crossover probability, the expected feature number, and the variance of the normal distribution obeyed by the feature number; in this embodiment, the population size P = 50, the maximum number of iterations T_i = 100, the mutation probability P_m⁰ = 0.1, the crossover probability P_c⁰ = 0.8, the expected feature number N_g = N/2, where N is the total number of feature elements in the sample data, and the variance σ² = 1;

F) Heuristic improved GA population initialization: using binary coding, generating a population in which the feature number of each feature subset obeys a normal distribution with mean N_g = N/2 and variance σ² = 1;

G) Feature selection on the sample data: decoding the individuals in the population and selecting the corresponding feature subset, where 1 represents selection and 0 represents rejection; the sample data after feature selection are Z_i = (z₁, z₂, ..., z_q, l_c)^T, where q is the number of 1s in the chromosome code of the individual;

H) Establishing and training a fault diagnosis model: randomly dividing a part of the sample data into a training set, and inputting the training set into a classifier for training to obtain a fault diagnosis model; using the remaining sample data as a test set, wherein the training set is 60%-80% of the sample data and the classifier is one of an SVM, an ANN, a decision tree or a random forest; testing the fault diagnosis model and outputting a test result, the test result comprising an accuracy rate, a false alarm rate and a missed alarm rate; returning the test result to the fitness function of the improved GA and calculating the fitness value of each individual; performing selection, crossover and mutation according to the individual fitness values, the crossover probability and the mutation probability being adaptively adjusted according to the individual fitness value and the convergence speed, wherein the accuracy rate ac, the false alarm rate wb and the missed alarm rate lb are:

$$ac = \frac{N_{r}}{N_{test}}, \qquad wb = \frac{N_{FP}}{N_{FP} + N_{TN}}, \qquad lb = \frac{N_{FN}}{N_{FN} + N_{TP}}$$

where N_test is the total number of sample data in the test set; N_r is the number of correctly predicted sample data in the test set, i.e. the number of sample data whose predicted label matches the actual label l_c; N_TP is the number of sample data for which both the predicted label and the actual label l_c are faulty; N_TN is the number of sample data for which both the predicted label and the actual label l_c are normal; N_FP is the number of sample data for which the predicted label is faulty and the actual label l_c is normal; and N_FN is the number of sample data for which the predicted label is normal and the actual label l_c is faulty; the fitness function of the improved GA is:

f(v_h)＝λ*ac+μ*wb+ε*lb s.t.λ+μ+ε＝1,h∈{1,2,...,P}

where v_h is an individual in the population, P is the population size, λ is the weight coefficient of the accuracy rate ac, μ is the weight coefficient of the false alarm rate wb, and ε is the weight coefficient of the missed alarm rate lb; the crossover probability and the mutation probability are adaptively adjusted according to the following strategies:

where f_max(0) is the maximum fitness value in the initial population;

L) Obtaining an optimal feature subset: after the population is updated, returning to the step G) until the iteration termination condition is met; the feature subset obtained by decoding the optimal individual in the last-generation population is the optimal feature subset, and the iteration termination condition is: the iteration number T of the population equals the maximum iteration number T_i, or the second-order backward difference of the maximum fitness of the population is less than or equal to a set threshold, or the maximum fitness value f_max of the population is greater than or equal to the expected optimal fitness value f_g.
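The loop formed by steps F) to L) can be sketched end to end as follows. This is only a skeleton under stated assumptions, not the patented procedure: the crossover and mutation probabilities are held fixed because the adaptive adjustment rule is not legible in this text, binary tournament selection, single-point crossover, per-gene bit-flip mutation and elitism are swapped in as common choices, the loop stops on the generation count only, and `evaluate` is a hypothetical callback standing in for "train the classifier on the selected features and return the fitness value".

```python
import random

def run_ga(evaluate, n_features, pop_size=50, max_generations=100,
           p_c=0.8, p_m=0.1, seed=0):
    """Skeleton GA for feature selection; returns (best chromosome, best-fitness history)."""
    rng = random.Random(seed)

    def new_individual():
        # Heuristic start: the count of 1s is drawn around n_features / 2 (sigma = 1).
        k = max(1, min(n_features, round(rng.gauss(n_features / 2, 1.0))))
        ones = set(rng.sample(range(n_features), k))
        return [1 if j in ones else 0 for j in range(n_features)]

    population = [new_individual() for _ in range(pop_size)]
    history = []
    for _ in range(max_generations):
        scored = sorted(((evaluate(ind), ind) for ind in population), reverse=True)
        history.append(scored[0][0])
        next_population = [scored[0][1][:]]            # elitism (assumed)
        while len(next_population) < pop_size:
            parent1 = max(rng.sample(scored, 2))[1]    # binary tournament (assumed)
            parent2 = max(rng.sample(scored, 2))[1]
            child = parent1[:]
            if rng.random() < p_c:                     # single-point crossover
                cut = rng.randrange(1, n_features)
                child = parent1[:cut] + parent2[cut:]
            child = [1 - g if rng.random() < p_m else g for g in child]
            if not any(child):                         # keep at least one feature
                child[rng.randrange(n_features)] = 1
            next_population.append(child)
        population = next_population
    best_score, best = max((evaluate(ind), ind) for ind in population)
    history.append(best_score)
    return best, history

# Toy stand-in for the classifier-based fitness: share of genes matching a target mask.
target = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
def toy_fitness(chromosome):
    return sum(g == t for g, t in zip(chromosome, target)) / len(target)

best, history = run_ga(toy_fitness, n_features=len(target), max_generations=30)
```

Because the best individual is copied unchanged into the next generation, the recorded best fitness never decreases from one generation to the next in this sketch.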

In the improved-GA-based data-driven AHU multi-fault diagnosis feature selection method of the invention, feature subsets are selected by binary coding, where 1 represents selection and 0 represents rejection. The fitness function of the GA is constructed from the diagnostic accuracy rate and/or the false alarm rate of the fault diagnosis model, the population is updated through crossover and mutation, and iterative optimization yields the final optimal feature subset, so that the feature elements most suitable for diagnosing a given fault are selected.

Claims

1. A data-driven AHU multi-fault diagnosis feature selection method based on improved GA is characterized in that: the method comprises the following steps:

C) Data preprocessing: labeling the sample data and normalizing the sample data;

H) Establishing and training a fault diagnosis model: randomly dividing a part of the sample data into a training set, and inputting the training set into a classifier for training to obtain a fault diagnosis model; using the remaining sample data as a test set to test the fault diagnosis model and outputting a test result, the test result comprising an accuracy rate, a false alarm rate and a missed alarm rate; returning the test result to the fitness function of the improved GA and calculating the fitness value of each individual; performing selection, crossover and mutation according to the individual fitness values, the crossover probability and the mutation probability being adaptively adjusted according to the individual fitness value and the convergence speed, wherein the fitness function of the improved GA is as follows:

f(v_h)＝λ*ac+μ*wb+ε*lb s.t.λ+μ+ε＝1,h∈{1,2,...,P}

2. The improved GA-based data-driven AHU multi-fault diagnosis feature selection method of claim 1, wherein: in the step A), the system parameters that become abnormal after the AHU fails are selected and defined as potential feature elements {x₁, x₂, ..., x_n} for fault diagnosis, sensors of the corresponding types are arranged at the positions corresponding to the system parameters, and the sensor readings are transmitted to a computer through a data acquisition module.

3. The improved GA-based data-driven AHU multi-fault diagnosis feature selection method of claim 2, wherein: in the step B), the sample data under each type of fault are denoted X_i, where i is the serial number of the sample data, X_i = (x₁, x₂, ..., x_n)^T, and x₁, x₂, ..., x_n are the feature elements used for fault diagnosis; in the step C), the actual label l_c of the fault class is added to the sample data, so that X_i = (x₁, x₂, ..., x_n, l_c)^T s.t. l_c ∈ {1, 2, ..., k}, where k is the total number of fault classes, and all the sample data are then normalized to the (0, 1) interval; in the step D), redundant features in the sample data are removed by the Pearson correlation coefficient method: the Pearson correlation coefficient ρ between any two feature elements is calculated, and if its absolute value is greater than or equal to a set threshold T_ρ, either one of the two feature elements is removed, the mathematical expression of the Pearson correlation coefficient ρ being as follows:

$$\rho(x_a, x_b) = \frac{E(x_a x_b) - E(x_a)E(x_b)}{\sqrt{E(x_a^2) - E^2(x_a)}\,\sqrt{E(x_b^2) - E^2(x_b)}}$$

where x_a and x_b represent any two feature elements and E(·) represents the expectation; after the redundant features are removed, the sample data become Y_i = (y₁, y₂, ..., y_p, l_c)^T s.t. p ≤ n, and the number of features removed is (n − p).

4. The improved GA-based data-driven AHU multi-fault diagnosis feature selection method of claim 3, wherein: in the step E), the population size P = 50, the maximum number of iterations T_i = 100, the mutation probability P_m⁰ = 0.1, the crossover probability P_c⁰ = 0.8, the expected feature number N_g = N/2, where N is the total number of feature elements in the sample data, and the variance σ² = 1.

5. The improved GA-based data-driven AHU multi-fault diagnosis feature selection method of claim 4, wherein: in the step G), the individuals in the population are decoded and the corresponding feature subset is selected, where 1 represents selection and 0 represents rejection; the sample data after feature selection are Z_i = (z₁, z₂, ..., z_q, l_c)^T, where q is the number of 1s in the chromosome code of the individual.

6. The improved GA-based data-driven AHU multi-fault diagnosis feature selection method of claim 5, wherein: in the step H), the training set is 60%-80% of the sample data, the rest is the test set, and the classifier is one of an SVM, an ANN, a decision tree or a random forest.

7. The improved GA-based data-driven AHU multi-fault diagnosis feature selection method of claim 6, wherein: in the step H), the accuracy rate ac, the false alarm rate wb and the missed alarm rate lb are:

$$ac = \frac{N_{r}}{N_{test}}, \qquad wb = \frac{N_{FP}}{N_{FP} + N_{TN}}, \qquad lb = \frac{N_{FN}}{N_{FN} + N_{TP}}$$

where N_test is the total number of sample data in the test set; N_r is the number of correctly predicted sample data in the test set, i.e. the number of sample data whose predicted label matches the actual label l_c; N_TP is the number of sample data for which both the predicted label and the actual label l_c are faulty; N_TN is the number of sample data for which both the predicted label and the actual label l_c are normal; N_FP is the number of sample data for which the predicted label is faulty and the actual label l_c is normal; and N_FN is the number of sample data for which the predicted label is normal and the actual label l_c is faulty.

8. The improved GA-based data-driven AHU multi-fault diagnosis feature selection method of claim 7, wherein: in the step L), the iteration termination condition is: the iteration number T of the population equals the maximum iteration number T_i, or the second-order backward difference of the maximum fitness of the population is less than or equal to a set threshold, or the maximum fitness value f_max of the population is greater than or equal to the expected optimal fitness value f_g.