CN108280289A - Rock burst danger grade prediction method based on locally weighted C4.5 algorithm - Google Patents

Rock burst danger grade prediction method based on locally weighted C4.5 algorithm

Info

Publication number
CN108280289A
CN108280289A (application CN201810058598.8A; granted as CN108280289B)
Authority
CN
China
Prior art keywords
sample
attribute
data
training set
cut point
Prior art date
Legal status
Granted
Application number
CN201810058598.8A
Other languages
Chinese (zh)
Other versions
CN108280289B (en)
Inventor
王彦彬
彭连会
何满辉
Current Assignee
Liaoning Technical University
Original Assignee
Liaoning Technical University
Priority date
Filing date
Publication date
Application filed by Liaoning Technical University filed Critical Liaoning Technical University
Priority to CN201810058598.8A
Publication of CN108280289A
Application granted
Publication of CN108280289B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation


Abstract

The present invention provides a rock burst danger grade prediction method based on a locally weighted C4.5 algorithm, in the technical field of rock burst prediction. The method first discretizes the continuous attribute data in the sample data using the Minimum Description Length Principle (MDLP), then selects a training set by a locally weighted method and computes sample weights. The information gain ratio of each attribute is computed from these sample weights, and according to the information gain ratio a sample attribute is selected as the split attribute of the root node and of each branch node of a C4.5 decision tree. Finally, pessimistic pruning is applied to the tree, with sums of sample weights used in place of sample counts, and the pruned tree predicts the rock burst danger grade of the region to be predicted. The rock burst danger grade prediction method based on the locally weighted C4.5 algorithm provided by the invention overcomes the bias of the ID3 algorithm's information-gain criterion toward attributes with many values, avoids overfitting, and yields higher prediction accuracy.

Description

Rock burst danger grade prediction method based on locally weighted C4.5 algorithm
Technical field
The present invention relates to the technical field of rock burst prediction, and in particular to a rock burst danger grade prediction method based on a locally weighted C4.5 algorithm.
Background technology
A rock burst is a dynamic phenomenon in which the coal and rock mass around mine roadways and stopes is destroyed suddenly, sharply and violently by the release of stored deformation energy. It is one of the disasters affecting coal mine production safety, and nearly every coal-producing country in the world is threatened by rock bursts to some extent. In recent years, developed countries have successively closed rock-burst-prone mines for energy-restructuring and safety reasons, leaving China as the country most affected by rock bursts and the main country engaged in their prevention and control.
Predicting and evaluating rock bursts on the basis of research into their genesis mechanism is a key step in rock burst prevention. However, the mechanism of rock bursts is not yet fully understood (in particular, research on the genesis mechanism of deep rock bursts is still at an early stage), which increases the difficulty of rock burst prediction. Current prediction methods fall mainly into rock mechanics methods and geophysical methods: rock mechanics methods include the drilling cuttings method and mining stress monitoring, while geophysical methods include acoustic emission monitoring, microseismic monitoring and electromagnetic radiation monitoring. In addition, with the development of artificial intelligence, intelligent algorithms such as neural networks, Bayesian discriminant analysis and support vector machines have been applied to rock burst prediction. These methods have produced substantial research results in rock burst danger grade prediction, but problems remain: neural networks generally require large sample sizes while rock burst samples are scarce; Bayesian methods require a high degree of independence between attributes, which real rock burst data rarely satisfy; and none of the above methods addresses model overfitting.
Summary of the invention
In view of the defects of the prior art, the present invention provides a rock burst danger grade prediction method based on a locally weighted C4.5 algorithm, which predicts the rock burst danger grade of the coal and rock mass around mine roadways and stopes.
The rock burst danger grade prediction method based on the locally weighted C4.5 algorithm includes the following steps:
Step 1: collect rock burst data of known danger grades as sample data. Let T be the collected sample data set, C the set of sample classes, k' the total number of sample classes, and N the number of samples;
Step 2: discretize the continuous attribute data in the sample data of known classes using the Minimum Description Length Principle (MDLP), as follows:
Step 2.1: sort each group of continuous attribute values to be discretized, together with their corresponding classes, in ascending order of the attribute values;
Step 2.2: wherever the class changes between adjacent sorted attribute values, select the attribute value as a cut point; the cut points form the cut point set. If the same attribute value corresponds to different classes, the attribute value corresponding to the smallest class is selected as the cut point;
Step 2.3: compute the information gain of every cut point in the cut point set and select the cut point with the largest information gain (i.e. the smallest class-conditional entropy); if that cut point satisfies the MDLP criterion, retain it, otherwise discard it;
The information gain of a cut point is computed as:
Gain(a) = H(C) - H(C|a)
where a is a cut point in the cut point set, H(C) is the class information entropy, and H(C|a) is the class information entropy after cut point a divides the class set C into two subsets;
Let a_min be the selected cut point, dividing the class set C into the two subsets C1 and C2. Whether a_min satisfies the MDLP criterion is judged by the following inequality (the Fayyad-Irani MDLP stopping rule):
Gain(a_min) > log2(N - 1)/N + [log2(3^k' - 2) - (k'·H(C) - k'1·H(C1) - k'2·H(C2))]/N
where k'1 and k'2 are the numbers of classes contained in subsets C1 and C2, respectively;
Step 2.4: check whether the two intervals into which the cut point of step 2.3 divides the original data still contain other cut points. If so, the cut points within each interval form a new cut point set and step 2.3 is repeated, judging from the samples and the class set of each interval whether its cut point is retained; otherwise, go to step 2.5;
Step 2.5: divide the continuous attribute data into intervals according to the finally selected cut points. If no cut point satisfies the MDLP criterion, all values of the attribute are placed in a single interval; otherwise the cut points divide the values into different intervals. This yields the discretization result of the continuous attribute;
Step 2.6: check whether every continuous attribute in the sample data set has been discretized. If so, go to step 3; otherwise repeat steps 2.1-2.5 until all continuous attributes of the sample data set have been discretized;
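As an illustration, the cut point selection and MDLP test of steps 2.1-2.5 can be sketched as below. This is a minimal sketch of the standard Fayyad-Irani MDLP procedure under the usual formulation, not the patent's exact implementation; all function names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a class-label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, left, right):
    """Information gain of splitting `labels` into `left` + `right`."""
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

def mdlp_accept(labels, left, right):
    """Fayyad-Irani MDLP stopping criterion for one candidate cut point."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * entropy(labels)
                                     - k1 * entropy(left) - k2 * entropy(right))
    return info_gain(labels, left, right) > math.log2(n - 1) / n + delta / n

def best_cut(values, labels):
    """Among class-boundary attribute values, return (gain, cut) of the best cut."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][1] != pairs[i][1]:       # class changes: boundary point
            cut = pairs[i][0]                    # cut at the attribute value itself
            left = [l for v, l in pairs if v < cut]
            right = [l for v, l in pairs if v >= cut]
            if left and right:
                g = info_gain(labels, left, right)
                if best is None or g > best[0]:
                    best = (g, cut)
    return best
```

A retained cut recurses into the two resulting intervals (step 2.4); a rejected one leaves the whole range as a single interval (step 2.5).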
Step 3: collect the rock burst attribute data of the region to be predicted and compare its continuous attribute data with the corresponding attributes from step 2. Determine from this comparison the interval in which each continuous attribute value of the region lies, thereby discretizing the continuous attribute data of the region to be predicted;
Step 4: in the discretized data set generated in step 2, use the k-nearest-neighbor algorithm to find the k samples nearest to the sample to be predicted; these k samples form the training set of the C4.5 decision tree. Then compute the weight of each sample in the training set;
The weight of each sample in the training set is computed by the following formula:
where ω_i is the weight of the i-th training sample nearest to the sample to be predicted, i = 1, 2, ..., k; d_i is the distance from the sample to be predicted to the i-th sample x_i, computed from the attribute data of the samples by a distance formula; and d_max is the maximum of the distances from the sample to be predicted to all samples in the training set;
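The neighbor selection and weighting of step 4 might be sketched as follows. The patent text describes ω_i only as a function of d_i and d_max (the formula itself appears in the original figure, not in this text), so the linear kernel w_i = 1 - d_i/d_max used below is an assumed illustrative choice from locally weighted learning, and `knn_training_set` is an invented name.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_training_set(query, dataset, k):
    """Select the k samples nearest to `query` and weight them by distance.

    `dataset` is a list of (attribute_vector, label) pairs. The weight
    kernel is an assumption: the patent only states that w_i depends on
    d_i and d_max.
    """
    nearest = sorted(dataset, key=lambda s: euclidean(query, s[0]))[:k]
    dists = [euclidean(query, s[0]) for s in nearest]
    d_max = max(dists) or 1.0                    # guard: all distances zero
    # Linear kernel: closest sample gets weight 1, farthest gets weight 0.
    weights = [1.0 - d / d_max for d in dists]
    return nearest, weights
```

Note that under this kernel the farthest of the k neighbors receives weight 0; other kernels (e.g. Gaussian) keep it strictly positive.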
Step 5: compute the information gain ratio of every attribute in the training set from the weights of the training samples. When generating the root node and each branch node, select in every iteration the attribute with the largest information gain ratio as the split attribute of the root node or branch node of the C4.5 decision tree;
The information gain ratio of an attribute in the training set is computed as follows:
Let V be an attribute of the training set and v_j the j-th value of V, j = 1, 2, ..., m, where m is the number of mutually distinct values of V among the training samples. Let C' = {c_1, c_2, ..., c_n} be the class set of the training samples, where c_{i'} is the i'-th class, i' = 1, 2, ..., n, and n is the total number of classes of the training samples. The information gain ratio of V is then computed as follows:
Compute the class information entropy of the training samples:
I(C') = -Σ_{i'=1..n} p(c_{i'}) log2 p(c_{i'})
where ω_{c_{i'}} is the sum of the weights of the training samples of class c_{i'}, ω_{C'} is the sum of the weights of the training samples of all classes, and p(c_{i'}) is the ratio of ω_{c_{i'}} to ω_{C'};
Compute the class-conditional entropy of the training samples:
I(C'|V) = -Σ_{j=1..m} p(v_j) Σ_{i'=1..n} p(c_{i'}|v_j) log2 p(c_{i'}|v_j)
where ω_{v_j} is the sum of the weights of the samples with attribute value v_j, ω_V is the sum of the weights of all samples for attribute V, ω_{c_{i'} v_j} is the sum of the weights of the samples with value v_j that belong to class c_{i'}, p(v_j) is the ratio of ω_{v_j} to ω_V, and p(c_{i'}|v_j) is the ratio of ω_{c_{i'} v_j} to ω_{v_j};
Compute the information gain of attribute V of the training samples:
I(C', V) = I(C') - I(C'|V)
Compute the information entropy (split information) of attribute V of the training samples:
I(V) = -Σ_{j=1..m} p(v_j) log2 p(v_j)
Compute the information gain ratio of attribute V of the training samples:
Gain_ratio(V) = I(C', V)/I(V);
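A minimal sketch of the weighted gain-ratio computation of step 5, with sums of sample weights replacing counts throughout, as the locally weighted C4.5 variant requires. The data layout (a list of (attributes, label, weight) triples) and the function names are assumptions for the example.

```python
import math
from collections import defaultdict

def weighted_entropy(weights_by_key):
    """Entropy of a distribution given as {key: weight-sum}."""
    total = sum(weights_by_key.values())
    return -sum((w / total) * math.log2(w / total)
                for w in weights_by_key.values() if w > 0)

def weighted_gain_ratio(samples, attr):
    """Information gain ratio of one attribute, with sample weights
    standing in for sample counts everywhere.

    `samples` is a list of (attributes_dict, label, weight) triples.
    """
    class_w = defaultdict(float)                      # weight per class
    value_w = defaultdict(float)                      # weight per attribute value
    cond_w = defaultdict(lambda: defaultdict(float))  # per value, per class
    total = 0.0
    for attrs, label, w in samples:
        v = attrs[attr]
        class_w[label] += w
        value_w[v] += w
        cond_w[v][label] += w
        total += w
    h_c = weighted_entropy(class_w)                   # I(C')
    h_c_given_v = sum((value_w[v] / total) * weighted_entropy(cond_w[v])
                      for v in value_w)               # I(C'|V)
    h_v = weighted_entropy(value_w)                   # I(V), the split information
    gain = h_c - h_c_given_v                          # I(C', V)
    return gain / h_v if h_v > 0 else 0.0
```

At each node, the attribute maximizing this ratio becomes the split attribute.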
Step 6: build the decision tree from the split attributes, then prune it with the pessimistic pruning method, using sums of sample weights in place of sample counts when computing the error rates of branch nodes and their corresponding leaf nodes during pruning. Finally, use the generated decision tree to predict the potential rock burst danger grade of the region to be predicted.
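The weighted pessimistic pruning test of step 6 might look like the following. It follows the standard C4.5 pessimistic-error scheme (0.5 continuity correction per leaf, one-standard-error comparison) with weight sums in place of counts, and is a sketch rather than the patent's exact procedure.

```python
import math

def pessimistic_errors(err_w, n_leaves):
    """Corrected error mass: misclassified weight plus the usual 0.5
    continuity correction per leaf, weight sums replacing counts."""
    return err_w + 0.5 * n_leaves

def should_prune(leaf_err_w, subtree_err_w, n_leaves, total_w):
    """Prune a branch when collapsing it to a single leaf does not raise
    the corrected error by more than one standard error."""
    e_subtree = pessimistic_errors(subtree_err_w, n_leaves)
    se = math.sqrt(e_subtree * (total_w - e_subtree) / total_w)
    e_leaf = pessimistic_errors(leaf_err_w, 1)
    return e_leaf <= e_subtree + se
```

Applied bottom-up, this replaces any branch whose best single-leaf error is statistically indistinguishable from the subtree's error.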
As the above technical solution shows, the beneficial effects of the present invention are as follows. In the rock burst danger grade prediction method based on the locally weighted C4.5 algorithm provided by the invention, discretizing the continuous attribute data with the Minimum Description Length Principle (MDLP) handles continuous attributes well; the locally weighted method selects the training set according to the distances from the discretized samples to the sample to be predicted and assigns different weights to the training samples; the C4.5 algorithm uses these sample weights to compute the information gain ratio for selecting node split attributes, which overcomes the bias of the ID3 algorithm's information-gain criterion toward attributes with many values; and replacing sample counts with sample weights in pessimistic pruning avoids overfitting and improves the accuracy of the prediction model.
Description of the drawings
Fig. 1 is a flow chart of the rock burst danger grade prediction method based on the locally weighted C4.5 algorithm provided by an embodiment of the present invention.
Specific embodiments
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments illustrate the present invention and do not limit its scope.
Taking the Yanshitai Colliery as an example, this embodiment applies the rock burst danger grade prediction method based on the locally weighted C4.5 algorithm of the present invention to predict the rock burst danger grade of that colliery.
The rock burst danger grade prediction method based on the locally weighted C4.5 algorithm, as shown in Fig. 1, includes the following steps:
Step 1: collect rock burst data of known danger grades as sample data. Let T be the collected sample data set, C the set of sample classes, k' the total number of sample classes, and N the number of samples.
Since many factors influence rock bursts, this embodiment selects ten factors as the attributes of the sample data for predicting the rock burst danger grade of the coal mine: coal seam thickness (V1), dip angle (V2), burial depth (V3), geological structure (V4), dip angle variation (V5), coal thickness variation (V6), gas concentration (V7), roof management (V8), pressure relief (V9) and coal burst sound (V10). Of these, geological structure (V4), dip angle variation (V5), coal thickness variation (V6), roof management (V8), pressure relief (V9) and coal burst sound (V10) are state parameters, assigned values as shown in Table 1:
Table 1: State parameter assignments
The danger grades of rock bursts are divided into four classes according to impact intensity: class 1 (micro impact), class 2 (weak impact), class 3 (medium impact) and class 4 (strong impact).
The rock burst data collected as sample data in this embodiment are shown in Table 2.
Table 2: Rock burst data used as sample data
Step 2: discretize the continuous attribute data in the sample data of known classes using the Minimum Description Length Principle (MDLP), as follows:
Step 2.1: sort each group of continuous attribute values to be discretized, together with their corresponding classes, in ascending order of the attribute values;
Step 2.2: wherever the class changes between adjacent sorted attribute values, select the attribute value as a cut point; the cut points form the cut point set. If the same attribute value corresponds to different classes, the attribute value corresponding to the smallest class is selected as the cut point;
Step 2.3: compute the information gain of every cut point in the cut point set and select the cut point with the largest information gain (i.e. the smallest class-conditional entropy); if that cut point satisfies the MDLP criterion, retain it, otherwise discard it;
The information gain of a cut point is computed as:
Gain(a) = H(C) - H(C|a)
where a is a cut point in the cut point set, H(C) is the class information entropy, and H(C|a) is the class information entropy after cut point a divides the class set C into two subsets;
Let a_min be the selected cut point, dividing the class set C into the two subsets C1 and C2. Whether a_min satisfies the MDLP criterion is judged by:
Gain(a_min) > log2(N - 1)/N + [log2(3^k' - 2) - (k'·H(C) - k'1·H(C1) - k'2·H(C2))]/N
where k'1 and k'2 are the numbers of classes contained in subsets C1 and C2, respectively;
Step 2.4: check whether the two intervals into which the cut point of step 2.3 divides the original data still contain other cut points. If so, the cut points within each interval form a new cut point set and step 2.3 is repeated, judging from the samples and the class set of each interval whether its cut point is retained; otherwise, go to step 2.5;
Step 2.5: divide the continuous attribute data into intervals according to the finally selected cut points. If no cut point satisfies the MDLP criterion, all values of the attribute are placed in a single interval; otherwise the cut points divide the values into different intervals. This yields the discretization result of the continuous attribute;
Step 2.6: check whether every continuous attribute in the sample data set has been discretized. If so, go to step 3; otherwise repeat steps 2.1-2.5 until all continuous attributes of the sample data set have been discretized.
In this embodiment, the information gains of the cut points of continuous attributes V1, V3 and V7 do not satisfy the MDLP criterion, so by the MDLP discretization principle the corresponding continuous attribute data are discretized into a single interval, whose output in this embodiment is 1. The final cut point of continuous attribute V2 is the attribute value 45, so attribute values greater than or equal to 45 are grouped into one interval with output 2, and values less than 45 into another interval with output 1. The discretized sample data used as the training set in this embodiment are shown in Table 3.
Table 3: Sample data after discretization
Step 3: collect the rock burst attribute data of the region to be predicted and compare its continuous attribute data with the corresponding attributes from step 2. Determine from this comparison the interval in which each continuous attribute value of the region lies, thereby discretizing the continuous attribute data of the region to be predicted.
In this embodiment, to verify the effectiveness of the method of the present invention, the attribute data in Table 4 are taken as the collected rock burst attribute data of the region to be predicted; the class labels in Table 4 serve only for comparison with the prediction results. The continuous attribute data of the 10 groups were compared with the corresponding attributes of the 25 groups of data in Table 2 to obtain their discretization results, shown in Table 5.
Table 4: Data to be predicted
Serial number V1/m V2/(°) V3/m V4 V5 V6 V7/(m3·min-1) V8 V9 V10 Classification
1 1.5 35 530 0 0 0 0.56 3 3 0 1
2 1.6 62 307 3 2 2 1 0 0 2 4
3 1.9 59 542 1 2 3 0.25 0 0 1 3
4 1.3 44 570 0 0 0 0.66 3 3 0 1
5 2.2 54 290 3 2 2 1 0 0 2 4
6 3 34 475 2 2 1 0.42 0 0 2 3
7 3.2 42 574 3 0 0 0.29 0 0 2 3
8 1.8 62 283 3 2 3 1 0 0 2 4
9 1.3 44 656 2 1 3 0.24 1 1 2 3
10 1.2 40 553 2 2 2 0.49 1 2 2 3
Table 5: Data to be predicted after discretization
Step 4: in the discretized data set generated in step 2, use the k-nearest-neighbor algorithm to find the k samples nearest to the sample to be predicted; these k samples form the training set of the C4.5 decision tree. Then compute the weight of each sample in the training set;
The weight of each sample in the training set is computed by the following formula:
where ω_i is the weight of the i-th training sample nearest to the sample to be predicted, i = 1, 2, ..., k; d_i is the distance from the sample to be predicted to the i-th sample x_i, computed from the attribute data of the samples by a distance formula; and d_max is the maximum of the distances from the sample to be predicted to all samples in the training set.
Step 5: compute the information gain ratio of every attribute in the training set from the weights of the training samples. When generating the root node and each branch node, select in every iteration the attribute with the largest information gain ratio as the split attribute of the root node or branch node of the C4.5 decision tree;
The information gain ratio of an attribute in the training set is computed as follows:
Let V be an attribute of the training set and v_j the j-th value of V, j = 1, 2, ..., m, where m is the number of mutually distinct values of V among the training samples. Let C' = {c_1, c_2, ..., c_n} be the class set of the training samples, where c_{i'} is the i'-th class, i' = 1, 2, ..., n, and n is the total number of classes of the training samples. The information gain ratio of V is then computed as follows:
Compute the class information entropy of the training samples:
I(C') = -Σ_{i'=1..n} p(c_{i'}) log2 p(c_{i'})
where ω_{c_{i'}} is the sum of the weights of the training samples of class c_{i'}, ω_{C'} is the sum of the weights of the training samples of all classes, and p(c_{i'}) is the ratio of ω_{c_{i'}} to ω_{C'};
Compute the class-conditional entropy of the training samples:
I(C'|V) = -Σ_{j=1..m} p(v_j) Σ_{i'=1..n} p(c_{i'}|v_j) log2 p(c_{i'}|v_j)
where ω_{v_j} is the sum of the weights of the samples with attribute value v_j, ω_V is the sum of the weights of all samples for attribute V, ω_{c_{i'} v_j} is the sum of the weights of the samples with value v_j that belong to class c_{i'}, p(v_j) is the ratio of ω_{v_j} to ω_V, and p(c_{i'}|v_j) is the ratio of ω_{c_{i'} v_j} to ω_{v_j};
Compute the information gain of attribute V of the training samples:
I(C', V) = I(C') - I(C'|V)
Compute the information entropy (split information) of attribute V of the training samples:
I(V) = -Σ_{j=1..m} p(v_j) log2 p(v_j)
Compute the information gain ratio of attribute V of the training samples:
Gain_ratio(V) = I(C', V)/I(V);
Step 6: build the decision tree from the split attributes, then prune it with the pessimistic pruning method, using sums of sample weights in place of sample counts when computing the error rates of branch nodes and their corresponding leaf nodes during pruning. Finally, use the generated decision tree to predict the potential rock burst danger grade of the region to be predicted.
In this embodiment, ten-fold cross-validation was first used to test the predictive performance of the decision tree model built from the discretized sample data. Because the training set contains few samples, all training samples were used as neighbor candidates during cross-validation; the significance level during C4.5 pruning was set to the common 25%, and the sample distances in the locally weighted learning were computed with the Euclidean distance function. The model built from the discretized sample set achieved 88% accuracy under ten-fold cross-validation, versus 84% for the model built from the raw data of Table 2, showing that the discretized sample data yield a better prediction model.
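The ten-fold cross-validation used above to compare the raw and discretized models can be sketched generically as follows; `train_fn` and `predict_fn` stand in for whatever model-building and prediction routines are plugged in, and the fold-assignment scheme is an illustrative choice.

```python
import random

def ten_fold_accuracy(samples, train_fn, predict_fn, seed=0):
    """Plain 10-fold cross-validation accuracy over (x, y) samples."""
    data = samples[:]
    random.Random(seed).shuffle(data)            # fixed seed: reproducible folds
    folds = [data[i::10] for i in range(10)]
    correct = total = 0
    for i, test in enumerate(folds):
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train_fn(train)                  # fit on the other nine folds
        for x, y in test:
            correct += predict_fn(model, x) == y
            total += 1
    return correct / total
```

Each sample is held out exactly once, so the reported accuracy averages ten disjoint test folds.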
The locally weighted C4.5 algorithm was then used to predict the rock burst danger grades of the discretized region-to-be-predicted data in Table 4. For comparison, this embodiment also built prediction models from the data in Table 2 using the NaiveBayes method, the original C4.5 method and the random forest method, and applied them to the data in Table 4; Table 6 compares their prediction results with those of the method of the present invention:
Table 6: Comparison of rock burst danger grade prediction results
Algorithm Accuracy
NaiveBayes 70%
C4.5 decision trees 80%
Random forest 80%
The method of the present invention 100%
As the table shows, the method of the present invention accurately predicts the rock burst danger grades of the region to be predicted, and its prediction results are better than those of the NaiveBayes method, the original C4.5 method and the random forest method.
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some or all of their technical features replaced by equivalents, without such modifications or replacements departing from the scope of the technical solutions defined by the claims of the present invention.

Claims (4)

1. A rock burst danger grade prediction method based on a locally weighted C4.5 algorithm, characterized by including the following steps:
Step 1: collect rock burst data of known danger grades as sample data. Let T be the collected sample data set, C the set of sample classes, k' the total number of sample classes, and N the number of samples;
Step 2: discretize the continuous attribute data in the sample data of known classes using the Minimum Description Length Principle (MDLP);
Step 3: collect the rock burst attribute data of the region to be predicted and compare its continuous attribute data with the corresponding attributes from step 2. Determine from this comparison the interval in which each continuous attribute value of the region lies, thereby discretizing the continuous attribute data of the region to be predicted;
Step 4: in the discretized data set generated in step 2, use the k-nearest-neighbor algorithm to find the k samples nearest to the sample to be predicted; these k samples form the training set of the C4.5 decision tree, and the weight of each sample in the training set is computed;
Step 5: compute the information gain ratio of every attribute in the training set from the weights of the training samples, and, when generating the root node and each branch node, select in every iteration the attribute with the largest information gain ratio as the split attribute of the root node or branch node of the C4.5 decision tree;
Step 6: build the decision tree from the split attributes, then prune it with the pessimistic pruning method, using sums of sample weights in place of sample counts when computing the error rates of branch nodes and their corresponding leaf nodes during pruning; finally, use the generated decision tree to predict the potential rock burst danger grade of the region to be predicted.
2. the bump danger classes prediction technique according to claim 1 based on local weighted C4.5 algorithms, special Sign is:Described in step 2 carry out discretization specific method be:
Step 2.1:By wait for discretization one group of continuous property and its respective classes according to continuous property from small to large suitable Sequence is ranked up;
Step 2.2: Select as boundary points the values of the sorted continuous attribute at which the corresponding class changes; these boundary points constitute the boundary point set. If different classes correspond to the same attribute value, select the attribute value corresponding to the smallest class as the boundary point;
Step 2.3: Calculate the information gain of every boundary point in the boundary point set and select the boundary point with the minimum information gain; judge whether this boundary point satisfies the minimum description length criterion: if it does, retain the boundary point; otherwise, discard it;
The information gain of a boundary point is calculated as follows:
Gain(a) = H(C) - H(C|a)
where a is a boundary point in the boundary point set, H(C) is the class information entropy, and H(C|a) is the information entropy after boundary point a divides the class set C into two subsets;
Let amin be the boundary point with the minimum information gain, dividing the class set C into the two subsets C1 and C2. Whether amin satisfies the minimum description length criterion is judged by the following formula:
Gain(amin) > log2(N-1)/N + {log2(3^k′ - 2) - [k′·H(C) - k′1·H(C1) - k′2·H(C2)]}/N
where N is the number of samples being divided, k′ is the number of classes contained in C, and k′1 and k′2 are the numbers of classes contained in the subsets C1 and C2, respectively;
Step 2.4: Judge whether the two interval sequences into which the boundary point of step 2.3 divides the original data set contain further boundary points. If so, the boundary points within each interval sequence form a new boundary point set and the procedure returns to step 2.3, where whether each interval sequence retains its boundary point is judged from the number of samples and the class set of that interval sequence; otherwise, execute step 2.5;
Step 2.5: Divide the continuous attribute data into interval sequences according to the finally selected boundary point set. If no boundary point satisfies the minimum description length criterion, all continuous data of the attribute are placed in a single interval sequence; otherwise, the continuous attribute data are divided into different interval sequences by the boundary points, yielding the discretization result for the continuous attribute data;
Step 2.6: Judge whether every continuous attribute in the sample data set has been discretized. If so, execute step 3; otherwise, repeat steps 2.1 to 2.5 until all continuous attributes of the sample data set have been discretized.
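Steps 2.2 to 2.5 describe an entropy-based discretization with a minimum-description-length stopping rule, closely matching the classic Fayyad-Irani MDLP procedure. A minimal sketch is given below; it follows the standard MDLP convention of cutting at the boundary point with the smallest class-conditional entropy (equivalently, the largest information gain), and all function and variable names are illustrative rather than taken from the patent.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdl_accept(labels, i):
    """Minimum-description-length stopping criterion for a cut at index i."""
    n = len(labels)
    left, right = labels[:i], labels[i:]
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    gain = (entropy(labels)
            - (i / n) * entropy(left)
            - ((n - i) / n) * entropy(right))
    delta = math.log2(3 ** k - 2) - (k * entropy(labels)
                                     - k1 * entropy(left)
                                     - k2 * entropy(right))
    return gain > (math.log2(n - 1) + delta) / n

def mdlp_cuts(values, labels):
    """Recursively find accepted cut points for one continuous attribute."""
    pairs = sorted(zip(values, labels))
    vals = [v for v, _ in pairs]
    labs = [c for _, c in pairs]
    n = len(labs)
    best = None  # (conditional entropy, cut index)
    for i in range(1, n):
        # candidate boundary points lie only where the class changes (step 2.2)
        if labs[i] == labs[i - 1] or vals[i] == vals[i - 1]:
            continue
        cond = (i / n) * entropy(labs[:i]) + ((n - i) / n) * entropy(labs[i:])
        if best is None or cond < best[0]:
            best = (cond, i)
    if best is None or not mdl_accept(labs, best[1]):
        return []  # step 2.5: no cut passes, the attribute stays in one interval
    i = best[1]
    cut = (vals[i - 1] + vals[i]) / 2
    # step 2.4: look for further boundary points inside each interval sequence
    return sorted(mdlp_cuts(vals[:i], labs[:i]) + [cut] + mdlp_cuts(vals[i:], labs[i:]))
```

For example, mdlp_cuts([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]) accepts a single cut at 6.5, between the two pure class runs, while interleaved classes such as [0, 1, 0, 1] yield no cut because the MDL criterion rejects it.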
3. The rock burst danger level prediction method based on the local weighted C4.5 algorithm according to claim 1, characterized in that the specific method for determining the weights of the training set samples described in step 4 is:
The weight of each sample in the training set is calculated according to the following formula:
where ωi is the weight of the i-th training set sample adjacent to the sample to be predicted, i = 1, 2, ..., k; di is the distance from the sample to be predicted to the i-th sample data xi, calculated from the attribute data of the samples according to the distance formula; and dmax is the maximum distance from the sample to be predicted to any sample in the training set.
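The weight formula itself appears as an image in the original publication and does not survive in this text, so it is not reproduced above. The sketch below therefore assumes the common linear-decay form ωi = 1 - di/dmax, which is consistent with the quantities the claim defines (the per-sample distance di and the maximum distance dmax over the whole training set); the function and names are illustrative, not the patent's.

```python
import math

def neighbor_weights(query, samples, k):
    """Weights for the k training samples nearest to the sample to be predicted.

    Assumes the linear-decay form w_i = 1 - d_i / d_max (the patent's exact
    formula is not reproduced in this text); d_max is taken over the whole
    training set, as the claim specifies.
    """
    dists = [math.dist(query, x) for x in samples]   # d_i for every training sample
    d_max = max(dists) or 1.0                        # guard against all-zero distances
    nearest = sorted(range(len(samples)), key=dists.__getitem__)[:k]
    return {i: 1.0 - dists[i] / d_max for i in nearest}
```

For instance, neighbor_weights((0.0, 0.0), [(0, 0), (3, 4), (6, 8)], 2) gives the nearest sample weight 1.0 and the next 0.5, so closer neighbors dominate the locally weighted gain-ratio computation.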
4. The rock burst danger level prediction method based on the local weighted C4.5 algorithm according to claim 3, characterized in that the specific method for calculating the information gain ratio of an attribute in the training set described in step 5 is:
Let V be an attribute in the training set and vj the j-th value of attribute V, j = 1, 2, ..., m, where m is the number of distinct values taken by attribute V over the sample data of the training set; let C′ = {c1, c2, ..., cn} be the set of classes corresponding to the sample data in the training set, where ci′ is the i′-th class, i′ = 1, 2, ..., n, and n is the total number of classes of the training set samples. The information gain ratio of an attribute in the training set is calculated as follows:
Calculate the class information entropy of the sample data in the training set, as shown below:
I(C′) = -Σ p(ci′)·log2 p(ci′), the sum running over i′ = 1, 2, ..., n
where ωci′ is the sum of the weights of the training set samples of class ci′, ωC′ is the sum of the weights of the training set samples of all classes, and p(ci′) = ωci′/ωC′ is the ratio of the class weight sum ωci′ to the total weight sum ωC′;
Calculate the class conditional entropy of the sample data in the training set, as shown below:
I(C′|V) = -Σ p(vj)·Σ p(ci′|vj)·log2 p(ci′|vj), the outer sum running over j = 1, 2, ..., m and the inner sum over i′ = 1, 2, ..., n
where ωvj is the sum of the weights of the samples whose attribute value is vj, ωV is the sum of the weights of all samples over attribute V, ωci′vj denotes the sum of the weights of the samples of class ci′ among those whose attribute value is vj, p(vj) = ωvj/ωV is the ratio of the weight sum of the samples with attribute value vj to the weight sum of all samples, and p(ci′|vj) = ωci′vj/ωvj is the ratio of the weight sum of the samples of class ci′ among those with attribute value vj to the weight sum of all samples with attribute value vj;
Calculate the information gain of attribute V of the sample data in the training set, as shown below:
I(C′, V) = I(C′) - I(C′|V)
Calculate the information entropy of attribute V of the sample data in the training set, as shown below:
I(V) = -Σ p(vj)·log2 p(vj), the sum running over j = 1, 2, ..., m
Calculate the information gain ratio of attribute V of the sample data in the training set, as shown below:
Gain_ratio(V) = I(C′, V)/I(V).
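The weighted gain ratio of claim 4 is the ordinary C4.5 gain ratio with sample counts replaced by sums of sample weights. A minimal sketch under that reading (function and variable names are illustrative):

```python
import math
from collections import defaultdict

def weighted_entropy(weight_by_key):
    """Entropy -sum p*log2(p), with p taken from weight sums rather than counts."""
    total = sum(weight_by_key.values())
    return -sum((w / total) * math.log2(w / total)
                for w in weight_by_key.values() if w > 0)

def weighted_gain_ratio(attr_values, labels, weights):
    """Gain_ratio(V) = [I(C') - I(C'|V)] / I(V) with weighted probabilities."""
    w_class = defaultdict(float)                             # omega for each class
    w_value = defaultdict(float)                             # omega for each attribute value
    w_value_class = defaultdict(lambda: defaultdict(float))  # omega per (value, class)
    for v, c, w in zip(attr_values, labels, weights):
        w_class[c] += w
        w_value[v] += w
        w_value_class[v][c] += w
    total = sum(weights)
    base = weighted_entropy(w_class)                         # I(C')
    cond = sum((w_value[v] / total) * weighted_entropy(w_value_class[v])
               for v in w_value)                             # I(C'|V)
    split = weighted_entropy(w_value)                        # I(V)
    return (base - cond) / split if split > 0 else 0.0
```

With equal weights this reduces to the standard C4.5 gain ratio; an attribute that perfectly separates two equally weighted classes scores 1.0, and an attribute with a single value scores 0.0 because its split information I(V) vanishes.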
CN201810058598.8A 2018-01-22 2018-01-22 Rock burst danger level prediction method based on local weighted C4.5 algorithm Active CN108280289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810058598.8A CN108280289B (en) 2018-01-22 2018-01-22 Rock burst danger level prediction method based on local weighted C4.5 algorithm


Publications (2)

Publication Number Publication Date
CN108280289A true CN108280289A (en) 2018-07-13
CN108280289B CN108280289B (en) 2021-10-08

Family

ID=62804465


Country Status (1)

Country Link
CN (1) CN108280289B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1655236A (en) * 2000-04-24 2005-08-17 Qualcomm Inc. Method and apparatus for predictively quantizing voiced speech
CN102473247A (en) * 2009-06-30 2012-05-23 Dow AgroSciences LLC Application of machine learning methods for mining association rules in plant and animal data sets containing molecular genetic markers, followed by classification or prediction utilizing features created from these association rules
WO2013075104A1 (en) * 2011-11-18 2013-05-23 Rutgers, The State University Of New Jersey Method and apparatus for detecting granular slip
CN105373606A (en) * 2015-11-11 2016-03-02 Chongqing University of Posts and Telecommunications Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN106096748A (en) * 2016-04-28 2016-11-09 Wuhan Baosteel Huazhong Trading Co., Ltd. Truck-loading man-hour prediction model based on cluster analysis and decision tree algorithms
CN106250986A (en) * 2015-06-04 2016-12-21 The Boeing Company Advanced analysis framework for machine learning
CN107145998A (en) * 2017-03-31 2017-09-08 China Agricultural University Land pressure calculation method and system based on the Dyna-CLUE model


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HSSINA, B. et al.: "A comparative study of decision tree ID3 and C4.5", International Journal of Advanced Computer Science and Applications *
LIU, YANG et al.: "Prediction of rock burst danger level based on decision tree", Inner Mongolia Coal Economy *
LI, LIN: "Application of decision-tree-based data mining methods in chemical pattern classification", China Master's Theses Full-text Database, Engineering Science and Technology I *
WANG, YANBIN et al.: "Prediction of rock burst danger level using locally weighted random forest", Journal of Liaoning Technical University (Natural Science Edition) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175194A (en) * 2019-04-19 2019-08-27 China University of Mining and Technology Coal mine roadway surrounding rock deformation and fracture identification method based on association rule mining
CN110175194B (en) * 2019-04-19 2021-02-02 China University of Mining and Technology Coal mine roadway surrounding rock deformation and fracture identification method based on association rule mining
CN111764963A (en) * 2020-07-06 2020-10-13 China University of Mining and Technology (Beijing) Rock burst prediction method based on fast-RCNN
CN113901939A (en) * 2021-10-21 2022-01-07 Heilongjiang University of Science and Technology Rock burst danger level prediction method based on fuzzy correction, storage medium and equipment
CN114780443A (en) * 2022-06-23 2022-07-22 State Grid Digital Technology Holding Co., Ltd. Microservice application automatic test method and device, electronic equipment and storage medium
CN117557087A (en) * 2023-09-01 2024-02-13 Guangzhou River Channel Monitoring Center Drainage unit risk prediction model training method and system based on water affairs data

Also Published As

Publication number Publication date
CN108280289B (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN108280289A (en) Rock burst danger level prediction method based on local weighted C4.5 algorithm
CN104732070B (en) Rock burst grade prediction method based on information vector machine
CN108226889A (en) Classifier model training method for radar target recognition
CN103617147A (en) Method for identifying mine water-inrush source
CN106980877A (en) Blasting vibration prediction method based on a support vector machine optimized by particle swarm optimization
CN107122860B (en) Rock burst danger level prediction method based on grid search and extreme learning machine
Lin et al. Machine learning templates for QCD factorization in the search for physics beyond the standard model
CN106897821A (en) Transient assessment feature selection method and device
CN106529667A (en) Logging facies identification and analysis method based on fuzzy deep learning in a big data environment
CN107194524A (en) Coal and gas outburst prediction method based on RBF neural network
Saeidi et al. Prediction of the rock mass diggability index by using fuzzy clustering-based, ANN and multiple regression methods
CN105893876A (en) Chip hardware Trojan detection method and system
CN108267724A (en) Unknown target recognition method for radar target recognition
CN115130375A (en) Rock burst intensity prediction method
Tso et al. An ANN-based multilevel classification approach using decomposed input space for transient stability assessment
CN108268460A (en) Method for automatically selecting an optimal model based on big data
CN102200981A (en) Feature selection method and feature selection device for hierarchical text classification
Nikafshan Rad et al. Modification of rock mass rating system using soft computing techniques
CN117272841B (en) Shale gas sweet spot prediction method based on a hybrid neural network
Berneti Design of fuzzy subtractive clustering model using particle swarm optimization for the permeability prediction of the reservoir
CN117540303A (en) Landslide susceptibility assessment method and system based on a cross semi-supervised machine learning algorithm
CN105550711A (en) Selective ensemble learning method based on the firefly algorithm
CN111985782A (en) Automatic tram driving risk assessment method based on environment perception
CN105512726A (en) Reliability distribution method and apparatus based on immune genetic optimization
CN114462323A (en) Oil reservoir flow field characterization method based on multi-attribute field fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant