CN110175195B - Mixed gas detection model construction method based on extreme random tree - Google Patents

Info

Publication number
CN110175195B
Authority
CN
China
Prior art keywords
gas
algorithm
mixed gas
extreme random
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910329097.3A
Other languages
Chinese (zh)
Other versions
CN110175195A (en)
Inventor
许永辉
孙超
赵玺
杨子萱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201910329097.3A
Publication of CN110175195A
Application granted
Publication of CN110175195B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/0004 Gaseous mixtures, e.g. polluted air
    • G01N33/0009 General constructional details of gas analysers, e.g. portable test equipment
    • G01N33/0027 General constructional details of gas analysers, e.g. portable test equipment concerning the detector
    • G01N33/0031 General constructional details of gas analysers, e.g. portable test equipment concerning the detector comprising two or more sensors, e.g. a sensor array
    • G01N33/0034 General constructional details of gas analysers, e.g. portable test equipment concerning the detector comprising two or more sensors, e.g. a sensor array comprising neural networks or related mathematical techniques
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/0004 Gaseous mixtures, e.g. polluted air
    • G01N33/0009 General constructional details of gas analysers, e.g. portable test equipment
    • G01N33/0027 General constructional details of gas analysers, e.g. portable test equipment concerning the detector
    • G01N33/0036 General constructional details of gas analysers, e.g. portable test equipment concerning the detector specially adapted to detect a particular component
    • G01N33/004 CO or CO2
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/0004 Gaseous mixtures, e.g. polluted air
    • G01N33/0009 General constructional details of gas analysers, e.g. portable test equipment
    • G01N33/0027 General constructional details of gas analysers, e.g. portable test equipment concerning the detector
    • G01N33/0036 General constructional details of gas analysers, e.g. portable test equipment concerning the detector specially adapted to detect a particular component
    • G01N33/0047 Organic compounds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Food Science & Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biochemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Immunology (AREA)
  • Pathology (AREA)
  • Combustion & Propulsion (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for constructing a mixed gas detection model based on an extreme random tree. The method comprises: carrying out data acquisition on the mixed gas to obtain a data set that comprises at least three gas signal time series; calculating the optimal curved path of the gas signal time series and screening the time series by using the optimal curved path; extracting gas features from the screened gas signal time series by principal component analysis; and establishing a model with the extreme random tree algorithm to classify the target mixed gas. The method improves both classification accuracy and time efficiency.

Description

Mixed gas detection model construction method based on extreme random tree
Technical Field
The invention relates to the technical field of machine olfaction, in particular to a mixed gas detection model construction method based on an extreme random tree.
Background
In the field of mixed gas detection, many researchers have achieved good classification results with algorithms such as support vector machines (SVM), artificial neural networks (ANN), and K-nearest neighbors (KNN). To improve classification accuracy, some researchers proposed an optimized AdaBoost.M2 model that fuses multiple classifiers; in a drug classification experiment with different fusion rules, it reached a highest recognition accuracy of 91.75%. An algorithm that estimates posterior probabilities extracted from an SVM has also been proposed and used with machine olfaction to detect 10 bacterial components in human blood; its recognition accuracy is high, but so is its time cost. Other work adopts a probabilistic Bayesian algorithm to handle the uncertainty in gas source localization and improves practical localization efficiency with a path planning algorithm based on a Markov decision process. Combining PCA with an artificial neural network (ANN) can improve the discrimination of water content in soil, but the ANN lacks interpretability, converges slowly, and is inefficient. No algorithm in the prior art brings the detection accuracy above 99%. Moreover, researchers have not considered the accuracy of the gas sensor data itself; PCA, the traditional feature extraction method, is intended for high-dimensional data, so features must first be constructed when the data dimensionality is low; and among classification algorithms, many with strong resistance to overfitting, short training time, and high classification accuracy have not yet been applied.
However, no existing patent applies an extreme random tree model, which is an improvement on the random forest algorithm, to the problems of mixed gas detection.
Therefore, how to provide a mixed gas detection model construction method that is based on an extreme random tree and has high detection accuracy is a technical problem that those skilled in the art urgently need to solve.
Disclosure of Invention
Existing models such as the traditional support vector machine (SVM) achieve low accuracy and insufficient time efficiency when classifying two mixed gases. The invention therefore provides a mixed gas detection model construction method based on an extreme random tree that improves both classification accuracy and time efficiency. The specific scheme is as follows:
S1, carrying out data acquisition on the mixed gas to obtain a data set, wherein the data set comprises at least three gas signal time series; calculating the optimal curved path of the gas signal time series; and screening the gas signal time series by using the optimal curved path;
S2, extracting gas features from the screened gas signal time series by principal component analysis;
S3, establishing a model with the extreme random tree algorithm, and classifying the target mixed gas.
Preferably, the calculation process of the optimal curved path of the gas time series in S1 is as follows:
S11, constructing a distance matrix of two gas signal time series. The two time series are $X = (x_1, x_2, \dots, x_m)$ and $Y = (y_1, y_2, \dots, y_n)$, with lengths $m$ and $n$ respectively. $D_{m \times n}$ is the $m \times n$ distance matrix constructed from the two time series:
$$D_{m \times n} = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{m1} & d_{m2} & \cdots & d_{mn} \end{pmatrix}$$
where the element $d_{ij}$ of $D_{m \times n}$ is the coordinate distance between $x_i$ and $y_j$, calculated as
$$d_{ij} = \lVert x_i - y_j \rVert_w$$
When $w = 2$ this is the Euclidean distance (2-norm), with $1 \le i \le m$ and $1 \le j \le n$;
S12, finding through $D_{m \times n}$ the curved path $p_{\min}$ with the minimum distance, i.e. the optimal curved path:
$$p_{\min} = \{p_1, p_2, \dots, p_d, \dots, p_k\}$$
$$k \in \{\max(m, n),\ m + n + 1\}$$
where $p_d$ is the current cumulative distance of the curved path when the search reaches $d_{ij}$; $p_{d+1}$ is then calculated as
$$p_{d+1} = p_d + \min\left[d_{(i+1)j},\ d_{(i+1)(j+1)},\ d_{i(j+1)}\right];$$
S13, discarding the two gas signal time series with the largest $p_{\min}$, and using the remaining gas signal time series as input data for S2.
Preferably, S2 specifically includes:
S21, constructing original features of the gas signal: multi-dimensional original features of the gas signal are constructed by the interactive feature method;
S22, performing dimensionality reduction on the multi-dimensional original features of the gas signal by principal component analysis to obtain the original data samples.
Preferably, the S3 specifically includes:
S31, in the extreme random tree classification model, each base classifier is trained on all original data samples, where D is the original data set, N is the number of samples, and M is the number of features;
S32, generating a decision tree according to the CART algorithm: at each split node, m features are randomly selected from the M features; some categories are randomly drawn and placed into one branch, and the remaining categories are placed into the other branch; the split value of each candidate is calculated and the attribute with the best split value is selected for splitting; no pruning is performed during splitting; the split subsets are iterated until the preset value is reached, generating a decision tree;
S33, repeating steps S31 and S32 K times to generate an extreme random tree model consisting of K decision trees;
S34, testing the trained extreme random tree model, and producing the final classification result by voting.
Compared with the prior art, the invention has the following beneficial effects:
By preprocessing the data with the dynamic time warping (DTW) algorithm, the invention improves classification accuracy by 26.87%; original feature construction combined with principal component analysis improves classification accuracy by 25.8%; finally, the extreme random tree algorithm addresses the time efficiency problem of the random forest algorithm: the final classification accuracy reaches 99.17%, and the running time is only 103.2568 seconds, a 66.85% improvement in time efficiency over the random forest algorithm. The method thereby solves the mixed gas classification problem, substantially improves on the random forest algorithm, raises the classification accuracy of machine olfaction systems, and provides a theoretical basis for algorithms that simulate the olfactory nervous system. Because the extreme random tree algorithm produces its prediction by a voting decision, its generalization ability is stronger; because all original data samples are used to train each base classifier, the training result is more accurate; and because node splits are chosen at random, randomness is greatly increased.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flow chart of a method for constructing a mixed gas detection model based on an extreme random tree according to the present invention;
FIG. 2 is a graph showing the response of a sensor to collecting gas data in accordance with the present invention;
FIG. 3 is a graph of the dynamic response of the sensor TGS2602 of the present invention to Et_L_Me_H;
FIG. 4 is a three-dimensional abstract feature map of the feature engineering of the present invention;
FIG. 5 is a diagram of an extreme random tree algorithm according to the present invention;
FIG. 6 is a schematic diagram of the accuracy of cross validation of 10 folds after DTW in accordance with the present invention;
FIG. 7 is a schematic diagram of cross-validation accuracy after feature construction according to the present invention;
FIG. 8 is a comparison graph of the algorithm model runtime of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
This embodiment provides a method for constructing a mixed gas detection model based on an extreme random tree, as follows.
S1, dynamic time warping (DTW) algorithm:
In this embodiment, detection is carried out on mixed gases obtained by mixing ethylene with CH4 and ethylene with CO. Under each label, 6 experiments are combined into a data set, where each label denotes one mixed gas category. The data sampling phase lasts 300 seconds. No gas is introduced during the first 60 seconds. At 60 seconds, the mixed gas with the set concentration ratio is introduced into the gas chamber for 180 seconds. Finally, no mixed gas is introduced for the last 60 seconds. The sensor array consists of 8 sensors sampled at 50 Hz, and the mixed gas data set is acquired by these 8 sensors. Data are stored in time order, and each data set contains 11 columns: time (s), temperature, humidity (%), and the readings of the TGS2600, TGS2612, TGS2611, TGS2610, TGS2602, and TGS2620 sensors. Each raw sensor reading, denoted a, is converted into a resistance value by Rs (KOhm) = 10 × (3110 - a)/a. FIG. 2 shows the sensor response for one experiment, taking Et_H_Me_n as an example: Et_H denotes ethylene at high concentration (H), and Me_n denotes methane at zero concentration (n). The abscissa is time and the ordinate is the converted sensor reading.
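The conversion from raw reading to resistance can be sketched in a few lines. This is a minimal illustration of the formula Rs (KOhm) = 10 × (3110 - a)/a only; the function name and the sample readings are hypothetical and do not come from the data set.

```python
def reading_to_resistance_kohm(a):
    """Convert a raw sensor reading a into a resistance Rs in kOhm."""
    if a <= 0:
        # The formula divides by a, so non-positive readings are rejected.
        raise ValueError("raw reading must be positive")
    return 10.0 * (3110.0 - a) / a


# Hypothetical raw readings, for illustration only.
for raw in (500.0, 1555.0, 3000.0):
    print(raw, round(reading_to_resistance_kohm(raw), 3))
```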
To examine how the sensors collect data, the response curves of TGS2602 under a single label (the Et_M_Me_M label) were analyzed; see the dynamic response curves of TGS2602 to Et_L_Me_H in FIG. 3 (1)-(6). The figure shows that the response of the same sensor under the same conditions varies to different degrees, and the response curves of the last two experiments differ markedly from the earlier ones. It is therefore inferred that problems such as the setup of the experimental conditions caused varying degrees of data inconsistency.
Given the preceding analysis, the data must be preprocessed effectively. Because the mixed gas data are time-series gas signal response curves, dynamic time warping is applied to the data set. Dynamic time warping is an algorithm based on dynamic programming (DP) that corrects misalignment between feature sequences; its basic principle is to find the optimal curved path between time series. For each data point in one sequence, it searches the other sequence for the point with the most similar characteristics, computes the distance between the matched points, and sums these distances over the two time series to obtain the optimal curved path.
Suppose the two time series are $X = (x_1, x_2, \dots, x_m)$ and $Y = (y_1, y_2, \dots, y_n)$, with lengths $m$ and $n$ respectively. $D_{m \times n}$ is the $m \times n$ distance matrix constructed from the two time series:
$$D_{m \times n} = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1n} \\ d_{21} & d_{22} & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{m1} & d_{m2} & \cdots & d_{mn} \end{pmatrix}$$
where the element $d_{ij}$ of $D_{m \times n}$ is the coordinate distance between $x_i$ and $y_j$, calculated as
$$d_{ij} = \lVert x_i - y_j \rVert_w$$
When $w = 2$ this is the Euclidean distance (2-norm). Through $D_{m \times n}$, the curved path $p_{\min}$ with the minimum distance is found, which gives the DTW distance between the two time series:
$$p_{\min} = \{p_1, p_2, \dots, p_d, \dots, p_k\}$$
$$k \in \{\max(m, n),\ m + n + 1\}$$
where $p_d$ is the current cumulative distance of the curved path when the search reaches $d_{ij}$.
The search for $p_{\min}$ must satisfy three conditions:
1) Fixed endpoints: the path starts at $d_{11}$ and ends at $d_{mn}$.
2) Monotonicity: if the current search point is $d_{ij}$ with cumulative distance $p_d$, and $p_{d+1} = p_d + d_{i'j'}$, then $i' \ge i$ and $j' \ge j$.
3) Continuity: with the same notation, $i' \le i + 1$ and $j' \le j + 1$.
Under these conditions, the first condition fixes the start of the search path, and the other two restrict the next point of the path to the position to the right of, above, or diagonally above and to the right of the current point. If the current cumulative distance is $p_d$ and the current search point is $d_{ij}$, then $p_{d+1}$ is calculated as
$$p_{d+1} = p_d + \min\left[d_{(i+1)j},\ d_{(i+1)(j+1)},\ d_{i(j+1)}\right]$$
Once $p_{\min}$ is obtained, the cumulative distance is averaged to remove the effect of differing sequence lengths on the accumulated distance:
$$d = p_{\min} / k$$
where $d$ is the averaged cumulative distance between the two sequences.
Because of the three constraints, the DTW algorithm traverses all observation points, and every point of each original sequence finds a corresponding point. With the DTW algorithm configured in this way, the method performs a preliminary screening of the raw data samples to further improve the classification result.
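The DTW computation described above can be sketched as follows. This is a minimal sketch that uses the standard dynamic-programming form of DTW, which is an assumption about the intent of the step-wise recurrence in the text; it treats each series as a sequence of scalar samples, so the 2-norm reduces to an absolute difference. The function name `dtw_distance` is hypothetical.

```python
import numpy as np


def dtw_distance(x, y):
    """Return (p_min, k, d): cumulative cost, path length, averaged cost."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = len(x), len(y)
    # Distance matrix D (m x n); for scalar samples the 2-norm is |x_i - y_j|.
    dist = np.abs(x[:, None] - y[None, :])
    acc = np.full((m, n), np.inf)        # cumulative distance up to (i, j)
    steps = np.zeros((m, n), dtype=int)  # number of path points up to (i, j)
    acc[0, 0] = dist[0, 0]               # fixed starting point d_11
    steps[0, 0] = 1
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            # Monotonic, continuous moves: from below, from the left, or diagonal.
            candidates = []
            if i > 0:
                candidates.append((acc[i - 1, j], steps[i - 1, j]))
            if j > 0:
                candidates.append((acc[i, j - 1], steps[i, j - 1]))
            if i > 0 and j > 0:
                candidates.append((acc[i - 1, j - 1], steps[i - 1, j - 1]))
            best_cost, best_steps = min(candidates)
            acc[i, j] = best_cost + dist[i, j]
            steps[i, j] = best_steps + 1
    p_min = float(acc[-1, -1])           # fixed end point d_mn
    k = int(steps[-1, -1])
    return p_min, k, p_min / k           # d = p_min / k
```

Dividing by the path length k mirrors the averaging step d = p_min/k, so that series of different lengths yield comparable distances.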
Each label of the original data set contains the data of 6 repeated experiments; that is, each sensor performs 6 acquisitions for one mixed gas category, yielding 6 gas signal time series. $p_{\min}$ is computed for these time series with the DTW algorithm, the two experiments with the largest $p_{\min}$ are discarded, and the remaining data are used as input data for S2.
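The screening step can be illustrated with a short sketch. The text does not state how a single $p_{\min}$ value is assigned to each of the 6 runs, so this sketch assumes (hypothetically) that a run's score is the sum of its DTW distances to the other runs under the same label; the function name `screen_runs` and the example matrix are illustrative only.

```python
import numpy as np


def screen_runs(pairwise, n_drop=2):
    """Return sorted indices of the runs kept after dropping the n_drop worst.

    pairwise[i][j] is the (averaged) DTW distance between runs i and j.
    """
    pairwise = np.asarray(pairwise, dtype=float)
    # Assumption: a run's score is its total distance to all other runs.
    scores = pairwise.sum(axis=1)
    order = np.argsort(scores, kind="stable")
    keep = np.sort(order[: len(scores) - n_drop])
    return keep.tolist()


# Hypothetical distances for 6 runs; runs 4 and 5 deviate from the rest.
values = np.array([0.0, 0.1, 0.2, 0.1, 5.0, 6.0])
example = np.abs(values[:, None] - values[None, :])
print(screen_runs(example))
```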
S2, data selection and feature extraction:
Original feature construction
In the comparison tests, the ensemble classifier and the comparison classifiers are trained and compared using both the original features and the constructed features. The original data set has 8-dimensional features. To improve classification accuracy, new features are constructed from the data, and the features giving the best classification result are found by comparing different feature sets. Feature construction matters because the features determine the highest accuracy attainable from the training data, and it can compensate for the limited learning ability of a recognition algorithm. New features are therefore constructed on top of the original ones to improve the accuracy of the classification algorithm.
A common feature construction method is interactive features: given features A and B, it creates features such as A × B, A - B, A/B, and A + B, which can make the feature space grow explosively. In this embodiment, 8 sensors acquire the gas signal data, so the input has 8 dimensions; the created features are A - B and A/B, and after these interactive features are created the multi-dimensional original features of the gas signal are obtained, with the number of features becoming 56.
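The feature count of 56 follows from taking A - B and A/B over every unordered pair of the 8 sensor channels (28 pairs each). The sketch below illustrates this; the pair ordering, the direction of each ratio, and the small `eps` guard against division by zero are assumptions, and the function name is hypothetical.

```python
from itertools import combinations

import numpy as np


def interaction_features(X, eps=1e-12):
    """Build A - B and A / B features for every unordered column pair."""
    X = np.asarray(X, dtype=float)
    pairs = list(combinations(range(X.shape[1]), 2))  # 28 pairs for 8 columns
    diffs = [X[:, a] - X[:, b] for a, b in pairs]
    # eps is an illustrative guard against division by zero.
    ratios = [X[:, a] / (X[:, b] + eps) for a, b in pairs]
    return np.column_stack(diffs + ratios)


# Hypothetical 2-sample, 8-channel input.
X_demo = np.arange(1, 17, dtype=float).reshape(2, 8)
F_demo = interaction_features(X_demo)
print(F_demo.shape)
```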
Specific implementation steps of Principal Component Analysis (PCA)
The specific implementation steps of the principal component analysis are as follows:
(1) Normalizing raw data
PCA operates on the covariance matrix of the data, and the features differ in scale, so the original feature data are first standardized to put all features on a consistent scale: the mean of each dimension is subtracted from the data, and the result is divided by the standard deviation of that dimension.
$$X_i^{*} = \frac{X_i - E(X_i)}{\sqrt{D(X_i)}}$$
where $E(X_i)$ denotes the mean of the data and $D(X_i)$ denotes its variance.
(2) Computing covariance matrices for data
The covariance matrix of the normalized data is the correlation coefficient matrix of the original features. The derivation is shown in the formula.
Figure BDA0002037144320000091
The correlation coefficient matrix R can be expressed as
Figure BDA0002037144320000092
(3) Calculating the eigenvalues and eigenvectors of the correlation coefficient matrix R
From the characteristic equation
$$\lvert R - \lambda E \rvert = 0$$
the eigenvalues $\lambda_i$ ($i = 1, 2, \dots, p$) of the correlation coefficient matrix are solved and sorted from largest to smallest, $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$. Substituting each $\lambda_i$ into $(R - \lambda_i E)x = 0$ gives the eigenvector $a_i$, and normalizing $a_i$ to unit length gives $e_i$.
(4) Calculating the cumulative contribution rate to obtain the principal components
The cumulative contribution rate of the sorted eigenvalues is calculated. In general, when the cumulative contribution rate of the first t eigenvalues reaches 85%-95%, the corresponding t components are taken as the principal components. In this embodiment, the cumulative contribution rate reaches 90% at t = 3.
$$\text{cumulative contribution rate} = \frac{\sum_{i=1}^{t} \lambda_i}{\sum_{i=1}^{p} \lambda_i}$$
(5) Determining the load of the principal component
Figure BDA0002037144320000101
From the above equation, the 8-dimensional data are converted into linear combinations of the 8 variables, giving the principal components $Y = (y_1, y_2, \dots, y_m)^T$.
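Steps (1) to (5) can be sketched with NumPy as follows. This is a minimal illustration, not the patented implementation: it assumes no feature column is constant (otherwise the standardization divides by zero), and the function name and the 0.90 default threshold are illustrative choices within the 85%-95% range given above.

```python
import numpy as np


def pca_by_contribution(X, threshold=0.90):
    """Standardize X, then keep the first t components reaching the threshold."""
    X = np.asarray(X, dtype=float)
    # (1) Standardize: subtract the mean, divide by the standard deviation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # (2) Covariance of standardized data = correlation matrix R of X.
    R = np.cov(Z, rowvar=False, bias=True)
    # (3) Eigen-decomposition of the symmetric matrix R, sorted descending.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # (4) Cumulative contribution rate and the smallest t that reaches it.
    cum = np.cumsum(eigvals) / eigvals.sum()
    t = min(int(np.searchsorted(cum, threshold)) + 1, len(eigvals))
    # (5) Project the standardized data onto the first t unit eigenvectors.
    return Z @ eigvecs[:, :t], cum[:t]


# Hypothetical random data standing in for the constructed features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
scores_demo, cum_demo = pca_by_contribution(X_demo, threshold=0.90)
print(scores_demo.shape, cum_demo[-1])
```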
To illustrate how scattered the data features are, all data of every class are abstracted into three-dimensional features, as shown in FIG. 4: the original 8-dimensional feature data are reduced to 3-dimensional features and displayed in a three-dimensional plot, where X, Y, and Z denote the three coordinate axes. The features are clearly scattered and cannot be classified by a conventional single algorithm.
S3, extreme random tree algorithm:
Extreme random tree
The extreme random tree (ET, also called extreme random forest) is similar to the random forest algorithm: both are ensembles of multiple decision trees and share many advantages, such as excellent classification performance, high accuracy, good handling of high-dimensional feature data without feature selection, and parallelizable computation with high execution efficiency. In mixed gas detection and classification, ensemble learning algorithms achieve higher classification accuracy. The two algorithms differ in that each decision tree in the extreme random tree algorithm uses all the original data, whereas the random forest algorithm generates its training samples by bootstrap sampling; and when splitting a node, the extreme random tree chooses the split at random rather than searching for the optimal split threshold or feature. FIG. 5 is a schematic diagram of the extreme random tree algorithm.
Differences between the extreme random tree and random forest algorithms:
First, the training samples of the random forest algorithm are generated by bootstrap sampling, whereas each decision tree in the extreme random tree uses all the original training sample data, which helps reduce the bias of the model.
Second, when splitting a node, the random forest algorithm first selects a subset of all features and then builds the decision tree by choosing precisely the optimal split (for example by the GINI index) among those features. The extreme random tree algorithm instead chooses the split at random. Specifically: for a categorical split, some categories are randomly drawn and placed into one branch, and the remaining categories are placed into the other branch; for a numerical split, a threshold between the minimum and maximum values is chosen at random and used to divide the data, with data greater than the threshold placed into one branch and data smaller than the threshold into the other. For the classification problem here, the split value is then calculated with the GINI index: all features at the node are traversed to obtain their split values, and the feature with the largest split value is selected for the split (for regression problems, the split value is calculated with the mean square error).
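The numerical split rule described above can be sketched as follows: for each candidate feature one threshold is drawn uniformly between that feature's minimum and maximum, each random split is scored by the reduction in GINI impurity, and the best-scoring feature is kept. The function names, the `n_candidates` parameter, and the use of impurity reduction as the "split value" are assumptions made for illustration.

```python
import numpy as np


def gini(labels):
    """GINI impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def random_split(X, y, n_candidates, rng):
    """Return (feature, threshold, gain) for the best of several random splits."""
    n_samples, n_features = X.shape
    parent = gini(y)
    best = None
    for f in rng.choice(n_features, size=n_candidates, replace=False):
        lo, hi = X[:, f].min(), X[:, f].max()
        if lo == hi:
            continue  # a constant feature cannot be split
        thr = rng.uniform(lo, hi)  # random threshold, not an optimized one
        left = X[:, f] <= thr
        n_left = int(left.sum())
        if n_left == 0 or n_left == n_samples:
            continue
        child = (n_left * gini(y[left])
                 + (n_samples - n_left) * gini(y[~left])) / n_samples
        gain = parent - child
        if best is None or gain > best[2]:
            best = (int(f), float(thr), float(gain))
    return best


# Hypothetical toy node: feature 0 separates the classes, feature 1 is constant.
X_demo = np.array([[0.0, 5.0], [1.0, 5.0], [10.0, 5.0], [11.0, 5.0]])
y_demo = np.array([0, 0, 1, 1])
split_demo = random_split(X_demo, y_demo, n_candidates=2,
                          rng=np.random.default_rng(0))
print(split_demo)
```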
In the extreme random tree algorithm, all training data samples are treated as OOB (out-of-bag) data samples, so the prediction error of the extreme random tree is computed as the error on the OOB samples. In this work, the extreme random tree was found to outperform the random forest algorithm in training time efficiency, classification accuracy, and ability to fit the training data.
Extreme random tree algorithm implementation steps
The extreme random tree algorithm is denoted by {E(K, X, D)}, where E denotes the classifier model, D the raw data samples, and K the number of decision trees. Each decision tree generates a prediction from the sample input $X = \{x_1, x_2, \ldots, x_m\}$, and the final classification decision is obtained by a voting rule. The specific steps of the extreme random tree algorithm are as follows:
(1): In the extreme random tree classification model, each base classifier is trained using all training samples (OOB samples). Let the original data set be D, the number of samples N, and the number of features M.
(2): A decision tree is generated according to the CART algorithm. At each split node, a subset of the M features is randomly selected; some categories are randomly drawn and placed in one branch, and the remaining categories are placed in the other branch; the split value of each candidate is calculated and the best attribute is chosen for the split. No pruning is performed during splitting. The resulting subsets are split iteratively until the preset value is reached, which yields a decision tree.
(3): Steps (1) and (2) are repeated K times to generate an extreme random tree model consisting of K decision trees.
(4): The trained extreme random tree model is evaluated on the test data, and the final classification result is produced by voting.
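The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the patent: it handles numerical features only (the categorical branch is omitted), uses a minimum node size as the "preset value" stopping rule, and all function and parameter names (`fit_extra_trees`, `n_try`, `min_samples`, and so on) are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def random_split(X, y, n_try, rng):
    """Draw one random threshold for each of n_try random features
    and keep the candidate with the largest Gini gain."""
    best = None
    for f in rng.sample(range(len(X[0])), n_try):
        lo, hi = min(r[f] for r in X), max(r[f] for r in X)
        if lo == hi:
            continue  # constant feature: no split possible
        t = rng.uniform(lo, hi)  # random threshold between min and max
        left = [i for i, r in enumerate(X) if r[f] <= t]
        right = [i for i, r in enumerate(X) if r[f] > t]
        if not left or not right:
            continue
        gain = gini(y) - sum(len(s) / len(y) * gini([y[i] for i in s])
                             for s in (left, right))
        if best is None or gain > best[0]:
            best = (gain, f, t, left, right)
    return best

def build_tree(X, y, n_try, rng, min_samples=2):
    """Grow an unpruned tree; a leaf stores the majority class.
    Labels must not be tuples, since tuples mark internal nodes."""
    if len(set(y)) == 1 or len(y) < min_samples:
        return Counter(y).most_common(1)[0][0]
    split = random_split(X, y, n_try, rng)
    if split is None:
        return Counter(y).most_common(1)[0][0]
    _, f, t, left, right = split
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left],
                       n_try, rng, min_samples),
            build_tree([X[i] for i in right], [y[i] for i in right],
                       n_try, rng, min_samples))

def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node

def fit_extra_trees(X, y, n_trees=25, n_try=1, seed=0):
    """Steps (1)-(3): every tree is trained on the full sample set."""
    rng = random.Random(seed)
    return [build_tree(X, y, n_try, rng) for _ in range(n_trees)]

def predict(forest, x):
    """Step (4): majority vote over all trees."""
    return Counter(predict_tree(t, x) for t in forest).most_common(1)[0][0]
```

Here `n_trees` plays the role of K and `n_try` the size of the random feature subset. Calling `fit_extra_trees(X, y)` on a list of feature rows and a list of labels returns the forest, and `predict(forest, x)` returns the voted class for one sample.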
To verify the proposed classifier, the model was analyzed and validated by 10-fold cross-validation on the original mixed gas samples of ethylene with methane and of ethylene with carbon monoxide. The classification results and analysis are as follows.
For the dynamic time warping (DTW) algorithm, the reference parameter num of the DTW was set to 1, 2, and 3, and the case without DTW (num = 0) was also tested. The results are shown in Fig. 6 of the specification.
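As a note on the validation protocol, k-fold cross-validation can be sketched as below. This is a generic illustration, not the authors' evaluation code: the helper names `k_fold_indices` and `cross_val_accuracy` are hypothetical, and `fit` and `predict` stand for any classifier supplied by the caller.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_val_accuracy(fit, predict, X, y, k=10, seed=0):
    """Mean accuracy over k folds for caller-supplied fit/predict functions."""
    scores = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)
```

Each sample is used for testing exactly once and for training in the other k - 1 rounds, so the reported mean accuracy does not depend on a single train/test split.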
Table 1. Cross-validation accuracy (presented as image BDA0002037144320000121 in the original publication)
As can be seen from Fig. 6 and Table 1, with num = 3 the mean five-fold cross-validation accuracy is 26.87% higher than with num = 0. In terms of time efficiency, the num = 3 model improves run-time efficiency by 56.04% over the num = 0 model. Applying DTW therefore clearly improves the model and raises the classification accuracy. Fig. 7 of the specification shows the running time of the DTW model; over repeated experiments, the running time of the model is shortest, 103.2568 seconds, when the DTW parameter is set to 3.
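For readers unfamiliar with DTW, the cumulative warping distance used for screening can be sketched as follows. This is the usual dynamic-programming form with an absolute-difference local cost, given as an assumption rather than as the exact procedure of the patent. The screening helper `screen_series`, its `reference` argument, and `num_discard` are hypothetical names; the passage above does not specify how the parameter num maps onto the screening step.

```python
import math

def dtw_distance(x, y):
    """Cumulative DTW cost between two 1-D sequences."""
    m, n = len(x), len(y)
    # acc[i][j] = minimal cumulative cost of aligning x[:i] with y[:j]
    acc = [[math.inf] * (n + 1) for _ in range(m + 1)]
    acc[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(x[i - 1] - y[j - 1])  # local distance d_ij
            acc[i][j] = d + min(acc[i - 1][j],      # step in x only
                                acc[i][j - 1],      # step in y only
                                acc[i - 1][j - 1])  # step in both
    return acc[m][n]

def screen_series(series, reference, num_discard):
    """Drop the num_discard series farthest (by DTW) from the reference."""
    ranked = sorted(series, key=lambda s: dtw_distance(s, reference))
    return ranked[:len(ranked) - num_discard]
```

Because the path may repeat an element of either sequence, two signals that differ only by a local stretch in time, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, have a DTW distance of zero even though their lengths differ.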
Feature construction is used to increase the dimensionality of the data, and the best features are selected for training; the analysis result is shown in Fig. 7. According to Fig. 7, if the feature dimension is left unchanged, the recognition accuracy is only 73.37%. If the dimension is raised by the A-B method, the data become 28-dimensional and the recognition accuracy rises to 87.50%, which shows that purposefully increasing the dimension of specific features tends to make them easier to discriminate. After the features are raised to 56 dimensions, the recognition rate is 18.97% higher than with the dimension unchanged, and after dimensionality reduction with the PCA algorithm the final recognition rate is 99.17%.
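This passage does not define the A-B method. One plausible reading, stated here purely as an assumption, is an interaction feature built from the difference between every pair of sensor channels. Under the further assumption of 8 sensor channels, this gives 8 × 7 / 2 = 28 features, which matches the dimension quoted above. The function name is hypothetical.

```python
from itertools import combinations

def pairwise_difference_features(sample):
    """Return a_i - a_j for every channel pair i < j of one sample."""
    return [a - b for a, b in combinations(sample, 2)]
```

For a sample with n channels the function returns n(n - 1)/2 values, so the feature count grows quadratically with the number of sensors, which is one reason a dimensionality reduction step such as PCA follows.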
Analysis of the extreme random tree algorithm and comparison with other algorithms show that it achieves higher accuracy and better time efficiency than the random forest and XGBoost algorithms, giving it an advantage on both counts.
Among current ensemble learning classification algorithms, the random forest algorithm is the most widely used and gives the best results. The comparison experiment is therefore carried out with the extreme random tree algorithm, which is an improvement of the random forest algorithm. Accuracy and time efficiency are also compared against the XGBoost and GBDT algorithms. As Table 2 shows, the accuracy of the extreme random tree algorithm is 4.42% higher than that of the random forest algorithm, 5.00% higher than that of the XGBoost algorithm, and 7.99% higher than that of the GBDT algorithm.
Table 2. Comparison of classification accuracy across algorithms (presented as image BDA0002037144320000131 in the original publication)
Fig. 8 of the specification compares the running times of the classification models. According to Fig. 8, in the run-time efficiency experiment the extreme random tree algorithm has the shortest running time, only 103.2568 seconds, a 66.85% improvement in time efficiency over the random forest algorithm, while the XGBoost algorithm has the longest running time because its model is the most complex. The proposed extreme random tree algorithm therefore clearly improves both accuracy and time efficiency.
The mixed gas detection model construction method based on the extreme random tree provided by the invention has been described in detail above. A specific example is used herein to explain the principle and implementation of the invention, and the description of the embodiment is only intended to help in understanding the method and core idea of the invention. For a person skilled in the art, the specific embodiments and the scope of application may vary according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

Claims (2)

1. A mixed gas detection model construction method based on an extreme random tree is characterized by comprising the following steps:
S1, acquiring data from the mixed gas to obtain a data set, wherein the data set comprises at least three gas signal time series; calculating the optimal curved (warping) path of the gas signal time series, and screening the gas signal time series by using the optimal curved path; the optimal curved path of the gas signal time series is calculated as follows:
S11, constructing a distance matrix of two gas signal time series; the two time series are $X = (x_1, x_2, \ldots, x_m)$ and $Y = (y_1, y_2, \ldots, y_n)$, whose lengths are m and n respectively, and $D_{m \times n}$ is the m × n distance matrix constructed from the two time series:
$$D_{m \times n} = \begin{bmatrix} d_{11} & \cdots & d_{1n} \\ \vdots & \ddots & \vdots \\ d_{m1} & \cdots & d_{mn} \end{bmatrix}$$
wherein the element $d_{ij}$ of $D_{m \times n}$ is the distance between $x_i$ and $y_j$, calculated as:
$$d_{ij} = \lVert x_i - y_j \rVert_w$$
when w = 2, this is the Euclidean distance (2-norm), with $1 \le i \le m$ and $1 \le j \le n$;
S12, finding through $D_{m \times n}$ the curved path $p_{\min}$ with the minimum distance, i.e. the optimal curved path:
$$p_{\min} = \{p_1, p_2, \ldots, p_d, \ldots, p_k\}$$
$$k \in \{\max(m, n),\ m + n + 1\}$$
wherein $p_d$ is the current cumulative distance of the curved path when the search reaches point $d_{ij}$, and $p_{d+1}$ is calculated as:
$$p_{d+1} = p_d + \min\left[d_{(i+1)j},\ d_{(i+1)(j+1)},\ d_{i(j+1)}\right];$$
S13, discarding the two groups of gas signal time series with the largest $p_{\min}$, the remaining gas signal time series being used as the input data of step S2;
S2, extracting gas features from the screened gas signal time series by principal component analysis;
S3, establishing a model by using the extreme random tree algorithm and classifying the target mixed gas, comprising the following steps:
S31, in the extreme random tree classification model, each base classifier is trained using all original data samples, wherein the original data set is D, the number of samples is N, and the number of features is M;
S32, generating a decision tree according to the CART algorithm; when a node is split, a subset of the M features is randomly selected at each split node, a plurality of categories are randomly drawn and placed in one branch, the remaining categories are placed in the other branch, the optimal split value of each node is calculated at the same time, and the optimal attribute is selected for the split; no pruning is performed during splitting; the split subsets are iterated to a preset value to generate a decision tree;
S33, repeating steps S31 and S32 K times to generate an extreme random tree model consisting of K decision trees;
S34, testing the trained extreme random tree model, and generating the final classification result by voting.
2. The mixed gas detection model construction method based on an extreme random tree according to claim 1, wherein S2 specifically comprises:
S21, constructing original features of the gas signal: multi-dimensional original features of the gas signal are constructed by using an interactive feature method;
S22, reducing the dimensionality of the multi-dimensional original features of the gas signal by principal component analysis to obtain the original data samples.
CN201910329097.3A 2019-04-23 2019-04-23 Mixed gas detection model construction method based on extreme random tree Active CN110175195B (en)

Publications (2)

Publication Number Publication Date
CN110175195A CN110175195A (en) 2019-08-27
CN110175195B true CN110175195B (en) 2022-11-29

Family

ID=67689897

Country Status (1)

Country Link
CN (1) CN110175195B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant