CN111624681A

CN111624681A - Hurricane intensity change prediction method based on data mining

Info

Publication number: CN111624681A
Application number: CN202010454683.3A
Authority: CN
Inventors: 杨祺铭
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-05-26
Filing date: 2020-05-26
Publication date: 2020-09-04

Abstract

The invention discloses a hurricane intensity change prediction method based on a data mining model, comprising the following steps. Step one: acquiring hurricane meteorological data and preprocessing it. Step two: finding a suitable classification algorithm and exploring the feasibility of classifying RI-type (rapid intensification) hurricanes. Step three: putting the test data set into a hurricane intensity prediction model for ensemble training. Step four: selecting the optimal hurricane intensity prediction model from the ensemble learning according to an evaluation index system. Step five: carrying out hurricane wind-force level prediction experiments for 6 hours, 12 hours and 18 hours ahead. Using data mining and ensemble learning on the Weka platform, the invention builds a hurricane intensity model that performs better, is less complex, and predicts with reasonable accuracy. The model does not depend on meteorological or dynamical knowledge and is not concerned with the physical structure or formation causes of hurricanes. It leaves time for timely early warning and disaster relief planning, lets people know of an approaching hurricane in advance so that they can prepare, and reduces economic losses.

Description

Hurricane intensity change prediction method based on data mining

Technical Field

The invention relates to the technical field of prediction of hurricane intensity change, in particular to a hurricane intensity change prediction method based on data mining.

Background

Tropical cyclones are cyclonic circulations that form over tropical and subtropical seas; those with central wind speeds of 33 meters per second or more are called typhoons or hurricanes. Although hurricanes bring rainfall to arid areas and help balance heat, they pose significant hazards, such as destroying houses and trees and causing floods, which threaten people's lives and property and place an economic burden on the country.

However, current research has not fully explained the causes of hurricanes, and many factors influence the increase in hurricane intensity, some of which are unknown. Existing hurricane forecasting models fall into three main categories: statistical models, statistical-dynamical models, and numerical models. When these methods are used to calculate the Maximum Potential Intensity (MPI) of different hurricanes, a suitable scheme has to be selected based on expert experience, and different calculation schemes give different intensity predictions for the same hurricane event. There is therefore a need to explore, by scientific means, a general hurricane prediction model that does not depend on meteorological or dynamical knowledge and is not concerned with the physical structure or formation causes of hurricanes, so as to reduce the losses that hurricanes cause.

Amid continuous innovation in information technology, every industry generates large amounts of data of different types and structures, in which much unknown but useful information is hidden. Data mining is the process of using algorithms to find such hidden, useful information in large volumes of noisy, heterogeneous data. With data mining techniques, the ambient airflow data at the time a hurricane forms, some of the hurricane's internal structure, and the ocean data of the underlying surface can be explored to derive the rules by which hurricanes change, and finally to obtain a model that predicts changes in hurricane intensity with reasonable accuracy. Such a model provides reference information for the government, leaves time for timely early warning and disaster relief planning, and lets people know of an approaching hurricane in advance so that they can take precautions and reduce economic losses.

Disclosure of Invention

The technical problem to be solved by the invention is that existing typhoon prediction methods give unstable and incomplete results and their accuracy needs improvement; to address this, the invention provides a hurricane intensity change prediction method based on a data mining model.

The technical scheme provided by the invention is as follows: a hurricane intensity variation prediction method based on a data mining model comprises the following steps:

step 1: acquiring and preprocessing hurricane meteorological data

The data preprocessing comprises five parts of hurricane intensity classification, data cleaning, data format conversion, data segmentation and classification data processing into prediction classification data;

step 2: finding suitable classification algorithms and exploring the possibilities of RI type hurricane classification

By utilizing an RI strategy, setting an attribute of 'whether the data set is RI type' as a classification attribute when the hurricane intensity change is predicted, and then training by utilizing a classification algorithm;

step 3: putting the test data set into a hurricane intensity prediction model for ensemble training

Algorithms are reasonably selected as Bagging base classifiers to learn from the data set, so as to obtain an optimal hurricane intensity change prediction model;

step 4: selecting the optimal hurricane intensity prediction model from the ensemble learning according to an evaluation index system

The 10 prediction functions obtained by the trained system are sorted by classification accuracy, the 5 with the highest accuracy are added to a decision group, and the classification result is selected by voting, taking the various indexes into account;

step 5: carrying out hurricane wind-force level prediction experiments for 6 hours, 12 hours and 18 hours ahead

Based on the best algorithm combination selected in the preceding steps, experiments are conducted to explore the ability to predict the hurricane wind-force level 6 hours, 12 hours and 18 hours into the future.

As an improvement, the specific implementation process of the step 1 comprises the following sub-steps:

step 1.1: dividing hurricane intensity into 12 levels according to the size of the central wind speed, and establishing a new data item VCLASS in a data table;

step 1.2: deleting the attribute columns whose missing-data rate is higher than 1%, and filling in the missing values of the attribute columns whose missing-data rate is lower than 1% using a Weka function;

step 1.3: discretizing the hurricane level data using the weka.filters.unsupervised.attribute.NumericToNominal filter on the Weka platform;

step 1.4: taking the data obtained in step 1.3 as the initial data for the data mining experiment, and separating a training set and a test set from the initial data;

step 1.5: converting the classification data into prediction data: the hurricane data are grouped by hurricane name, giving x records for each hurricane; the values to be predicted are the data 6 hours, 12 hours and 18 hours later, that is, for the i-th record the prediction levels at 6 hours, 12 hours and 18 hours are the VCLASS values of records i+1, i+2 and i+3; the last 3 of the x records of each hurricane are then deleted, finally yielding 3 groups of data sets and training sets, for the predicted level after 6 hours, 12 hours and 18 hours respectively.
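The label shifting described in step 1.5 can be sketched as follows. This is a minimal Python illustration, not the Weka-based implementation; the record layout (dictionary keys "name", "features" and "vclass") is hypothetical, and consecutive records of one hurricane are assumed to be 6 hours apart and already sorted by time.

```python
from collections import defaultdict

# Record offsets for the +6 h, +12 h and +18 h horizons
# (assumes 6-hourly records, which the step implies but does not state).
HORIZONS = (1, 2, 3)


def build_lagged_sets(records):
    """Return {offset: [(features, future_vclass), ...]} for each horizon.

    records: list of dicts with hypothetical keys "name", "features" and
    "vclass", sorted by time within each hurricane.
    """
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec["name"]].append(rec)

    datasets = {h: [] for h in HORIZONS}
    for track in by_name.values():
        # The last 3 records of each hurricane are dropped: they have no
        # record i+3 to take a label from.
        usable = len(track) - max(HORIZONS)
        for i in range(max(usable, 0)):
            for h in HORIZONS:
                datasets[h].append((track[i]["features"], track[i + h]["vclass"]))
    return datasets
```

Each of the three resulting lists pairs the attributes of record i with the wind level observed one, two or three records later, so a hurricane with fewer than four records contributes nothing.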

As an improvement, the specific implementation process of the step 2 comprises the following sub-steps:

step 2.1: 5 algorithms, REPTree, LMT (Logistic model tree), J48(C4.5), IBk (kNN) and MultilayerPerceptron (BP neural network), were selected for classification, finding suitable classification algorithms and exploring the possibility of RI type hurricane classification.

As an improvement, the specific implementation process of the step 3 comprises the following sub-steps:

step 3.1: the experiments are divided into experiment 1 and experiment 2; ten-fold cross-validation is used as the data mining test method for all experimental groups: the system randomly divides the input data set into 10 parts and, in turn, uses 9 parts as training data and the remaining 1 part as test data; for the Bagging framework, 10 base classifiers are set to take part in training in each experimental group, and all experimental groups use the same Bagging settings; the experiments are implemented by secondary development of the Bagging function in Weka;

step 3.1.1: the five classification algorithms in Weka mostly use their default parameter settings; the modifications are that, for the IBk algorithm, the k value is set to 5 and crossValidate is set to True, which allows the program to select by cross-validation, at run time, the best k between 1 and k for classifying unknown points, and distance weighting is set to 1/distance; for MultilayerPerceptron, GUI is set to True;

step 3.2: experiment 1 compares the performance of the 5 algorithms REPTree, LMT (logistic model tree), J48 (C4.5), IBk (kNN) and MultilayerPerceptron (BP neural network) as Bagging base classifiers, and selects the algorithms whose accuracy exceeds 85% for use in experiment 2;

step 3.2.1: the five algorithms REPTree, LMT, J48, IBk and MultilayerPerceptron are used as Bagging base classifiers to carry out ensemble learning on the data set;

step 3.2.2: whether an algorithm can be adopted is judged from the ten-fold cross-validation indexes, and parameters are adjusted to try to improve the classification accuracy; once an algorithm reaches suitable accuracy after adjustment, it is selected as a base classifier for ensemble training;

step 3.2.3: establishing a comprehensive evaluation system which is composed of classification accuracy serving as a main index and F-Measure, average absolute error, root mean square error and AUC value serving as auxiliary reference indexes;

step 3.3.1: experiment 2 takes the suitable algorithms obtained in experiment 1 and uses combinations of two, three and four of them as the Bagging base classifiers for ensemble training;

step 3.3.2: ten-fold cross-validation is used as the data mining test method for evaluation and selection.

As an improvement, the specific implementation process of the step 4 comprises the following sub-steps:

step 4.1: selecting the optimal hurricane intensity change prediction model according to the evaluation index system, comparing the ensemble hurricane intensity prediction models on classification accuracy and the confusion matrix while also considering indexes such as F-Measure, mean absolute error, root mean square error and AUC value.

As an improvement, in step 5, the optimal hurricane intensity change model obtained is the LMT-MultilayerPerceptron model, and the process ends.

Compared with the prior art, the invention has the following advantages: the method uses data mining to analyze a large amount of western Pacific hurricane data; it first identifies suitable classifiers for ensemble training through RI-type classification experiments and hurricane wind-force level classification, and then combines several classical single-classifier algorithms through Bagging ensemble learning to obtain a good hurricane prediction model.

Drawings

FIG. 1 is a general flow chart of the process of the present invention.

FIG. 2 is a comparison of the classification problem approach.

FIG. 3 is a sample preference versus prediction graph.

FIG. 4 is a schematic diagram of a hurricane force prediction model based on Bagging method.

Detailed Description

The following examples are included to provide further detailed description of the present invention and to provide those skilled in the art with a more complete, concise, and exact understanding of the principles and spirit of the invention.

Referring to fig. 1-4, a method for predicting hurricane intensity variations based on a data mining model comprises the following steps:

step 1: acquiring and preprocessing hurricane meteorological data

Because data collection is difficult, some hurricane data were not successfully collected and are missing, and some data items are meaningless, so the hurricane data need to be preprocessed. The data preprocessing comprises five parts: hurricane intensity classification, data cleaning, data format conversion, data segmentation, and conversion of the classification data into prediction classification data.

We used the RI (rapid intensification) strategy proposed by Kaplan and DeMaria et al., setting the "whether RI type" attribute in the data set as the classification attribute when predicting hurricane intensity changes, and then trained using a classification algorithm.

The 10 prediction functions obtained by the trained system are sorted by classification accuracy, the 5 with the highest accuracy are added to a decision group, and the classification result is selected by voting, taking the various indexes into account.
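The selection of a five-member decision group and the vote among its members might look like the following Python sketch. The text does not specify how "various indexes" enter the vote, so a plain majority vote is assumed here, with ties resolved in favour of the label backed by the most accurate member; the function names and the (accuracy, predict function) pair layout are hypothetical.

```python
from collections import Counter


def select_decision_group(models, group_size=5):
    """Keep the most accurate prediction functions.

    models: list of (accuracy, predict_fn) pairs (hypothetical layout).
    """
    ranked = sorted(models, key=lambda m: m[0], reverse=True)
    return ranked[:group_size]


def vote(decision_group, sample):
    """Majority vote over the decision group (an assumed voting rule).

    Ties go to the label predicted by the most accurate member.
    """
    votes = Counter()
    best_acc = {}
    for acc, predict in decision_group:
        label = predict(sample)
        votes[label] += 1
        best_acc[label] = max(best_acc.get(label, 0.0), acc)
    return max(votes, key=lambda lab: (votes[lab], best_acc[lab]))
```

With 10 trained prediction functions, `select_decision_group` discards the 5 least accurate ones before any sample is classified, so a weak model cannot affect the final label.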

The specific implementation process of the step 1 comprises the following substeps:

step 1.1: hurricane intensity is divided into 12 levels according to the size of the central wind speed and a new data item VCLASS is created in the data table.

Step 1.2: Attribute columns whose missing-data rate is higher than 1% are deleted, and the missing values of attribute columns whose missing-data rate is lower than 1% are filled in using a Weka function.

Step 1.3: The hurricane level data are discretized using the weka.filters.unsupervised.attribute.NumericToNominal filter on the Weka platform.

Step 1.4: The data obtained in step 1.3 are taken as the initial data for the data mining experiment, and a training set and a test set are separated from the initial data.
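The cleaning rule of step 1.2 can be illustrated with a short Python sketch. It is not the Weka procedure itself: the table layout (a dictionary of columns with `None` marking a missing value) is hypothetical, and filling with the column mean is an assumption, since the text only says that a Weka function completes the missing values.

```python
def clean_columns(table, max_missing_rate=0.01):
    """Drop sparse columns and fill the remaining gaps.

    table: dict mapping column name -> list of floats or None (missing).
    Mean imputation is an assumed choice, not stated in the source.
    """
    cleaned = {}
    for name, values in table.items():
        missing = sum(v is None for v in values)
        if not values or missing / len(values) > max_missing_rate:
            continue  # missing rate above the threshold: delete the column
        present = [v for v in values if v is not None]
        mean = sum(present) / len(present)
        cleaned[name] = [mean if v is None else v for v in values]
    return cleaned
```

A column with 1 missing value in 200 (0.5%) is kept and filled, while a column with 10 missing values in 200 (5%) is removed.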

The specific implementation process of the step 2 comprises the following substeps:

The specific implementation process of the step 3 comprises the following substeps:

step 3.1: the experiments were divided into experiment 1 and experiment 2. In the selection of the data mining test method, a ten-fold cross validation method is selected for all experimental groups. The system randomly divides the input data set into 10 parts, and selects 9 parts as training data and 1 part as test data in turn to carry out experiments. In the aspect of parameter setting of the Bagging framework, 10 base classifiers are set for each experiment group to participate in training, and all the experiment groups adopt the same Bagging setting. This experiment was carried out by a secondary development of the Bagging function based on Weka.

Step 3.1.1: The five classification algorithms in Weka mostly use their default parameter settings. The modifications are as follows: for the IBk algorithm, the k value is set to 5 and crossValidate is set to True, which allows the program to select by cross-validation, at run time, the best k between 1 and k for classifying unknown points, and distance weighting is set to 1/distance; for MultilayerPerceptron, GUI is set to True.

Step 3.2: Experiment 1 compares the performance of the 5 algorithms REPTree, LMT (logistic model tree), J48 (C4.5), IBk (kNN) and MultilayerPerceptron (BP neural network) as Bagging base classifiers, and selects the algorithms whose accuracy exceeds 85% for use in experiment 2.

Step 3.2.1: The five algorithms REPTree, LMT, J48, IBk and MultilayerPerceptron are used as Bagging base classifiers to carry out ensemble learning on the data set.

Step 3.2.2: Whether an algorithm can be adopted is judged from the ten-fold cross-validation indexes, and parameters are adjusted to try to improve the classification accuracy. Once an algorithm reaches suitable accuracy after adjustment, it is selected as a base classifier for ensemble training.

Step 3.2.3: A comprehensive evaluation system is established, with classification accuracy as the main index and F-Measure, mean absolute error, root mean square error and AUC value as auxiliary reference indexes.

Step 3.3.1: Experiment 2 builds on the suitable algorithms obtained in experiment 1. With the logistic model tree (LMT) algorithm as the main algorithm, the other three algorithms are added to form Bagging base classifier sequences with equal numbers, giving 6 combinations: the LMT-MultilayerPerceptron model, the LMT-J48 model, the LMT-REPTree model, the LMT-MultilayerPerceptron-REPTree model, the LMT-MultilayerPerceptron-J48 model and the LMT-MultilayerPerceptron-REPTree-J48 model. The test set is input into Bagging to train these models, and the training result is a set of prediction models equal in number to the base classifier sequences.
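The ten-fold cross-validation of step 3.1 and the Bagging of several base classifiers can be sketched in plain Python as below. This is only an illustration of the general technique under stated assumptions, not the Weka secondary development the experiments used: the learner interface (`fit(rows)` returning a predict function) and `majority_learner` are hypothetical stand-ins for base classifiers such as LMT or MultilayerPerceptron, and majority voting is assumed as the combination rule.

```python
import random
from collections import Counter


def ten_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each part is the test data once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def bagging_fit(train_rows, learners, seed=0):
    """Fit each base learner on its own bootstrap sample of train_rows.

    learners: list of callables fit(rows) -> predict(features)
    (a hypothetical interface standing in for Weka base classifiers).
    """
    rng = random.Random(seed)
    models = []
    for fit in learners:
        # Bootstrap: draw len(train_rows) rows with replacement.
        sample = [rng.choice(train_rows) for _ in train_rows]
        models.append(fit(sample))
    return models


def bagging_predict(models, features):
    """Combine the base models by majority vote (assumed combination rule)."""
    return Counter(m(features) for m in models).most_common(1)[0][0]


def majority_learner(rows):
    """Toy base learner: always predicts the most common training label."""
    label = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda features: label
```

Passing a list of 10 learners to `bagging_fit` corresponds to the setting of 10 base classifiers per experimental group; mixing different learner functions in that list would mimic a combined sequence such as LMT-MultilayerPerceptron.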

The specific implementation process of the step 4 comprises the following substeps:

Step 4.1.1: Classification accuracy is the percentage of all samples for which the model's prediction is correct, and is used to evaluate the classification model. The accuracy is calculated as shown below, where TP (true positive) is a positive-class sample correctly predicted as positive, TN (true negative) is a negative-class sample correctly predicted as negative, FP (false positive) is a negative-class sample incorrectly predicted as positive, and FN (false negative) is a positive-class sample incorrectly predicted as negative.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Step 4.1.2: Classification accuracy alone is not sufficient to measure a hurricane prediction model. Precision (P) and recall (R) influence each other, and it is difficult to make both high at the same time, so the F-Measure is introduced. The F-Measure is the weighted harmonic mean of precision and recall, as shown below. When α = 1, F-Measure = 2PR/(P + R); the higher the F-Measure, the better the model performs.

$$F = \frac{(\alpha^{2} + 1)\,P\,R}{\alpha^{2} P + R}$$
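As a worked illustration of the two indexes above, the following Python sketch computes accuracy and the F-Measure from the four confusion counts. It assumes the standard weighted harmonic mean form for the F-Measure and a two-class setting; for the 12 wind levels, the counts would presumably be taken per class, which the text does not spell out.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all samples that were predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_measure(tp, fp, fn, alpha=1.0):
    """Weighted harmonic mean of precision and recall.

    With alpha = 1 this reduces to 2PR / (P + R).
    """
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    a2 = alpha * alpha
    return (1 + a2) * precision * recall / (a2 * precision + recall)
```

For example, with TP = 40, TN = 45, FP = 5 and FN = 10, the accuracy is 85/100 = 0.85 and the F-Measure at α = 1 is 80/95, about 0.842.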

Step 4.1.3: The ROC curve is a two-dimensional curve drawn with FPR as the abscissa and TPR as the ordinate, where TPR (true positive rate) is the recall, TP/(TP + FN), and FPR (false positive rate) is FP/(FP + TN). The AUC (Area Under Curve) value is defined as the area enclosed by the ROC curve and the coordinate axes, that is, the area under the ROC curve. Because the ROC curve generally lies above the line y = x, the AUC value generally ranges between 0.5 and 1. The higher the AUC value, that is, the closer the ROC curve is to the upper left corner, the better the classifier's classification performance.
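One way to compute the AUC value without drawing the curve is the rank-based identity: the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The Python sketch below uses that identity for a two-class problem; the function name and the 0/1 label encoding are illustrative assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positive, 0 for negative; scores: higher means
    "more likely positive". Ties between a positive and a negative
    score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A classifier that ranks every positive above every negative scores 1.0, while one that assigns the same score to all samples scores 0.5, matching the range described above.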

Step 4.1.4: The Mean Absolute Error (MAE) is the average of the absolute values of the deviations between the individual predicted values and the corresponding true values. Because it prevents positive and negative errors from cancelling each other out, the mean absolute error better reflects the actual error of the predicted values. With $\hat{y}_i$ the i-th predicted value, $y_i$ the corresponding true value and n the number of predictions, it is formulated as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$

Step 4.1.5: The Root Mean Square Error (RMSE) is the square root of the ratio of the sum of the squared deviations between the predicted values and the true values to the number of predictions n. The root mean square error is very sensitive to especially large or especially small errors in the predictions and is therefore suitable for measuring the accuracy of the model. It is formulated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}$$

Step 5: The optimal hurricane intensity change model obtained is the LMT-MultilayerPerceptron model, and the process ends.

The above examples of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit the embodiments of the present invention. All such modifications and variations are within the spirit and scope of the invention as defined by the appended claims.

Claims

1. A hurricane intensity variation prediction method based on a data mining model is characterized by comprising the following steps:

step 1: acquiring and preprocessing hurricane meteorological data

2. The data mining model-based hurricane force change prediction method of claim 1, wherein the detailed implementation procedure of step 1 comprises the following sub-steps:

3. The data mining model-based hurricane force change prediction method of claim 1, wherein the detailed implementation procedure of step 2 comprises the following sub-steps:

4. The data mining model-based hurricane force change prediction method of claim 1, wherein the detailed implementation procedure of step 3 comprises the following sub-steps:

5. The data mining model-based hurricane force change prediction method of claim 1, wherein the detailed implementation procedure of step 4 comprises the following sub-steps:

6. The data mining model-based hurricane intensity change prediction method of claim 1, wherein in step 5 the optimal hurricane intensity change model obtained is the LMT-MultilayerPerceptron model, and the process ends.