CN109726826A - Training method, device, storage medium and electronic equipment for random forest - Google Patents
- Publication number: CN109726826A
- Application number: CN201811557766.4A
- Authority: CN (China)
- Prior art keywords: tree, prediction result, ballot, data, training
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Classification landscape: Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
This disclosure relates to a training method, device, storage medium and electronic equipment for a random forest. The method comprises: determining n training datasets within first training data; judging the n trees trained on those datasets by means of the description data of the first training data, to obtain n prediction results; deleting trees from the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain m trees; performing a first ballot operation on the m trees according to the ballot weight corresponding to each of the m trees, to obtain a goal tree; synthesizing the prediction results corresponding to the goal tree with the description data into second training data; and taking the second training data as the first training data and cyclically executing the above steps until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain the random forest. The whole of the training data can thus be continuously optimized across the repeated training rounds of the random forest, preventing trees with a single dominant feature from proliferating during training and improving the accuracy of classification prediction.
Description
Technical field
This disclosure relates to the field of machine learning, and in particular to a training method, device, storage medium and electronic equipment for a random forest.
Background technique
A random forest is a classifier comprising multiple decision trees; its output is the mode of the prediction results output by the individual trees. A decision tree is a tree-structured model for supervised learning. In supervised learning, a set of samples is given first, each sample comprising a set of attributes (description data) and a class (prediction result); the classes are predetermined. By learning from this set of samples, a decision tree with classification capability can be obtained, and that decision tree can correctly classify (output a prediction result for) newly appearing objects. In the related art, when a random forest is trained, each decision tree in the forest is usually trained once on a portion of the full training data, and when classifying new data the prediction result with the most votes is selected by voting. This classification approach avoids over-fitting in classification prediction and improves the generalization of the classifier. However, the prediction accuracy of decision trees that have undergone only a single round of training is not high, and the training process cannot cope with unbalanced data features in the training data (some classes having far more data than others), which in turn reduces the accuracy of the overall classification prediction.
Summary of the invention
To overcome the problems in the related art, the purpose of this disclosure is to provide a training method, device, storage medium and electronic equipment for a random forest.
To achieve the above goal, according to the first aspect of the embodiments of the present disclosure, a training method for a random forest is provided, the method comprising:
determining n training datasets within first training data, the first training data comprising description data corresponding to events of the same class as an event to be predicted, together with the prediction results of those same-class events;
judging the n trees trained on the n training datasets by means of the description data, to obtain the n prediction results corresponding to the n trees;
performing a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set, the tree set comprising m trees, where m is less than or equal to n;
performing a first ballot operation on the m trees according to the ballot weight corresponding to each of the m trees, to obtain a goal tree;
synthesizing the prediction results corresponding to the goal tree with the description data into second training data;
taking the second training data as the first training data and cyclically executing the steps from determining n training datasets in the full training data through synthesizing the goal tree's prediction results with the description data into second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain the random forest, the random forest comprising all tree sets obtained after the delete operation is executed in one or more cycles.
Optionally, the method further comprises:
taking the description data corresponding to the event to be predicted as the input of the random forest, to obtain multiple prediction results output by the multiple trees in the random forest;
determining, by a second ballot operation, the prediction result that occurs most often among the multiple prediction results, as the prediction result of the event to be predicted.
Optionally, judging the n trees trained on the n training datasets by means of the description data, to obtain the n prediction results corresponding to the n trees, comprises:
training the n trees on the n training datasets;
taking the description data as the input of each of the n trees, to obtain the n prediction results output by the n trees.
Optionally, performing the delete operation on the n trees according to the accuracy of the n prediction results and the preset threshold, to obtain the tree set, comprises:
when there are u prediction results among the n prediction results whose accuracy is less than the preset threshold, deleting the u trees corresponding to those u prediction results, to obtain a tree set comprising m trees, where m = n - u; or,
when the accuracy of all n prediction results is greater than or equal to the preset threshold, obtaining a tree set comprising m trees, where m = n.
Optionally, performing the first ballot operation on the m trees according to the ballot weight corresponding to each of the m trees, to obtain the goal tree, comprises:
determining the error rate of each tree according to the accuracy of that tree's prediction result;
taking the error rate of each tree as the input of a preset ballot weight calculation formula, to obtain the ballot weight of each tree output by the formula;
dividing the m trees into multiple ballot groups, each ballot group comprising the trees that share an identical prediction result, the number of those trees being the vote count corresponding to that ballot group;
obtaining the product of the vote count and the ballot weight corresponding to any one tree in each ballot group, as that ballot group's vote share;
obtaining any one tree in the ballot group with the highest vote share, as the goal tree.
According to the second aspect of the embodiments of the present disclosure, a training device for a random forest is provided, the device comprising:
a dataset determining module, configured to determine n training datasets within first training data, the first training data comprising description data corresponding to events of the same class as an event to be predicted, together with the prediction results of those same-class events;
a random forest judging module, configured to judge, by means of the description data, the n trees trained on the n training datasets, to obtain the n prediction results corresponding to the n trees;
a random forest deleting module, configured to perform a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set comprising m trees, where m is less than or equal to n;
a goal tree obtaining module, configured to perform a first ballot operation on the m trees according to the ballot weight corresponding to each of the m trees, to obtain a goal tree;
a data synthesizing module, configured to synthesize the prediction results corresponding to the goal tree with the description data into second training data;
a cyclic execution module, configured to take the second training data as the first training data and cyclically execute the steps from determining n training datasets in the full training data through synthesizing the goal tree's prediction results with the description data into second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain the random forest, the random forest comprising all tree sets obtained after the delete operation is executed in one or more cycles.
Optionally, the device further comprises:
a data input module, configured to take the description data corresponding to the event to be predicted as the input of the random forest, to obtain multiple prediction results output by the multiple trees in the random forest;
a result determining module, configured to determine, by a second ballot operation, the prediction result that occurs most often among the multiple prediction results, as the prediction result of the event to be predicted.
Optionally, the random forest judging module comprises:
a random forest training submodule, configured to train the n trees on the n training datasets;
a random forest judging submodule, configured to take the description data as the input of each of the n trees, to obtain the n prediction results output by the n trees.
Optionally, the random forest deleting module is configured to:
when there are u prediction results among the n prediction results whose accuracy is less than the preset threshold, delete the u trees corresponding to those u prediction results, to obtain a tree set comprising m trees, where m = n - u; or,
when the accuracy of all n prediction results is greater than or equal to the preset threshold, obtain a tree set comprising m trees, where m = n.
Optionally, the goal tree obtaining module comprises:
an error rate determining submodule, configured to determine the error rate of each tree according to the accuracy of that tree's prediction result;
a weight calculation submodule, configured to take the error rate of each tree as the input of a preset ballot weight calculation formula, to obtain the ballot weight of each tree output by the formula;
a ballot group dividing submodule, configured to divide the m trees into multiple ballot groups, each ballot group comprising the trees that share an identical prediction result, the number of those trees being the vote count corresponding to that ballot group;
a vote share obtaining submodule, configured to obtain the product of the vote count and the ballot weight corresponding to any one tree in each ballot group, as that ballot group's vote share;
a goal tree obtaining submodule, configured to obtain any one tree in the ballot group with the highest vote share, as the goal tree.
According to the third aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the random forest training method provided in the first aspect of the embodiments of the present disclosure are realized.
According to the fourth aspect of the embodiments of the present disclosure, an electronic equipment is provided, comprising:
a memory, on which a computer program is stored;
a processor, configured to execute the computer program in the memory, so as to realize the steps of the random forest training method provided in the first aspect of the embodiments of the present disclosure.
Through the above technical solutions, the disclosure can determine n training datasets within first training data, the first training data comprising description data corresponding to events of the same class as an event to be predicted, together with the prediction results of those same-class events; judge the n trees trained on the n training datasets by means of the description data, to obtain the n prediction results corresponding to the n trees; perform a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set comprising m trees, where m is less than or equal to n; perform a first ballot operation on the m trees according to the ballot weight corresponding to each of the m trees, to obtain a goal tree; synthesize the prediction results corresponding to the goal tree with the description data into second training data; and take the second training data as the first training data and cyclically execute the steps from determining n training datasets in the full training data through synthesizing the goal tree's prediction results with the description data into second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain the random forest, the random forest comprising all tree sets obtained after the delete operation is executed in one or more cycles. The whole of the training data can thus be continuously optimized across the repeated training rounds of the random forest, and while preventing trees with a single dominant feature from proliferating during training, the accuracy of classification prediction is improved.
Other features and advantages of the disclosure will be described in detail in the following detailed description section.
Brief description of the drawings
The accompanying drawings are provided for a further understanding of the disclosure and constitute part of the specification; together with the following detailed description they serve to explain the disclosure, but do not constitute a limitation of the disclosure. In the drawings:
Fig. 1 is a flow chart of a training method for a random forest according to an exemplary embodiment;
Fig. 2 is a flow chart of another random forest training method according to the embodiment shown in Fig. 1;
Fig. 3 is a flow chart of a random forest judging method according to the embodiment shown in Fig. 2;
Fig. 4 is a flow chart of a goal tree obtaining method according to the embodiment shown in Fig. 2;
Fig. 5 is a block diagram of a training device for a random forest according to an exemplary embodiment;
Fig. 6 is a block diagram of another random forest training device according to the embodiment shown in Fig. 5;
Fig. 7 is a block diagram of a random forest judging module according to the embodiment shown in Fig. 6;
Fig. 8 is a block diagram of a goal tree obtaining module according to the embodiment shown in Fig. 6;
Fig. 9 is a block diagram of an electronic equipment according to an exemplary embodiment.
Detailed description of the embodiments
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this disclosure; on the contrary, they are merely examples of devices and methods consistent with some aspects of the disclosure as detailed in the appended claims.
Fig. 1 is a flow chart of a training method for a random forest according to an exemplary embodiment. As shown in Fig. 1, the method comprises:
Step 101: determine n training datasets within first training data.
Here, the first training data comprises description data corresponding to events of the same class as the event to be predicted, together with the prediction results (classification results) of those same-class events. In principle, the first training data should describe the class of same-class events as thoroughly as possible. The n training datasets are n sets of training data randomly drawn from the first training data; each training dataset may consist of entirely different examples, or the training datasets may partially overlap with one another.
Taking the classification prediction of some fruit as the event to be predicted as an example: the event is a fruit classification prediction event, and the description data and prediction results of classification prediction events (as many examples as possible) for as many kinds of fruit as possible need to be collected, so that the entire class of classification prediction events is described completely. It should be noted that the first training data may be as shown in Table 1 below.
Table 1

| A | B | C | D | E |
| --- | --- | --- | --- | --- |
| Yellow skin | White flesh | Crescent shape | Sweet | Banana |
| Green rind | Red flesh | Spherical | Sweet | Watermelon |
| Red skin | White flesh | Spherical | Sweet and sour | Apple |
| … | … | … | … | … |
Here, each row of Table 1 is one prediction event (or example); the data in columns A, B, C and D are the description-data part, and the data in column E are the prediction-result part. It should be noted that the first training data contains description data and prediction results for a large number of examples (for example, 100,000); Table 1 shows only the description data and prediction results corresponding to the three examples of banana, watermelon and apple. Each of the n training datasets selected from the first training data has the same form as the first training data, that is, it also contains the description-data part and the prediction-result part described above.
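Table 1's split into a description-data part (columns A-D) and a prediction-result part (column E) can be represented directly as records; the variable names below are illustrative only.

```python
# Table 1 as a list of records: columns A-D hold the description
# data, column E holds the prediction result (class label).
TABLE_1 = [
    ("yellow skin", "white flesh", "crescent", "sweet",           "banana"),
    ("green rind",  "red flesh",   "spherical", "sweet",          "watermelon"),
    ("red skin",    "white flesh", "spherical", "sweet and sour", "apple"),
]

descriptions = [row[:4] for row in TABLE_1]  # description-data part (A-D)
labels = [row[4] for row in TABLE_1]         # prediction-result part (E)
```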
Step 102: judge, by means of the description data, the n trees trained on the n training datasets, to obtain the n prediction results corresponding to the n trees.
Here, the trees mentioned above are decision trees (or classification trees); a decision tree is an existing tree-structured machine learning model, and a random forest is composed of multiple decision trees.
Illustratively, in step 102, the n decision trees are first trained on the n training datasets selected in step 101, forming the initial random forest. After the n decision trees are obtained, each tree can be judged by means of the description data of the first training data: all prediction results in the first training data are removed, the remaining description data is fed into each trained decision tree, and each decision tree's new prediction results for that description data are obtained, from which the accuracy corresponding to each decision tree is then derived.
Taking Table 1 as an example, the description data in columns A, B, C and D is used as the input of tree A, producing one set of prediction results. That set is in effect another column containing a large number of prediction-result entries, which can be compared with the data in column E to obtain the accuracy of the prediction results. Taking the 3 examples in Table 1: if the prediction results are banana, watermelon and lychee, then compared against column E, the accuracy of the prediction results is 2/3.
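The 2/3 figure in the example above is just the fraction of predictions matching column E; a minimal sketch (names illustrative):

```python
def accuracy(predictions, labels):
    # Fraction of predictions that match the reference labels (column E).
    hits = sum(p == y for p, y in zip(predictions, labels))
    return hits / len(labels)

# The worked example: the tree answers banana, watermelon, lychee
# against the true labels banana, watermelon, apple.
acc = accuracy(["banana", "watermelon", "lychee"],
               ["banana", "watermelon", "apple"])
# acc == 2/3
```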
Step 103: perform a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set.
Here, the tree set comprises m trees, and m is less than or equal to n.
Illustratively, step 103 may comprise: when there are u prediction results among the n prediction results whose accuracy is less than the preset threshold, deleting the u trees corresponding to those u prediction results, to obtain a tree set comprising m trees, where m = n - u; or, when the accuracy of all n prediction results is greater than or equal to the preset threshold, obtaining a tree set comprising m trees, where m = n. It can be understood that when any prediction result among the n has an accuracy below the preset threshold, the trees whose prediction-result accuracy is below the threshold must be deleted from the n trees, and the remaining m trees are retained as the above tree set; the first ballot operation in step 104 below (a weighted ballot) is then used to select from them the prediction result output by the decision tree with the highest accuracy and strongest generalization. When the accuracy of all n prediction results is greater than or equal to the preset threshold, the purpose of optimizing the decision trees needed to compose the random forest can be considered achieved; the scheme then stops at step 103, and the n trees (together with the one or more tree sets retained before this accuracy judgment step) are obtained as the trained random forest.
Illustratively, step 106 of this scheme in effect executes steps 101 to 105 cyclically. In each cycle, after u prediction results with accuracy below the preset threshold are found among the n prediction results and m trees are consequently retained, those m trees can be regarded as one tree set. When, for example, after 5 cycles it is determined that the accuracy of all n prediction results is greater than or equal to the preset threshold, then besides retaining the n trees corresponding to the current n prediction results, the 5 tree sets retained during those 5 cycles are also obtained. It can be understood that the number of decision trees contained in each tree set may differ. In this way, the random forest finally determined by this scheme is composed of these n trees (in fact also a tree set obtained in step 103) and all the trees in the other 5 tree sets.
Step 104: perform a first ballot operation on the m trees according to the ballot weight corresponding to each of the m trees, to obtain a goal tree.
Illustratively, the ballot weight describes the importance of the votes each tree obtains in the first ballot operation. For example, for the 4 trees A (1), B (2), C (4) and D (10), the number in parentheses is each tree's weight. When the first ballot operation is carried out and the voting result is: tree A gets 5 votes, tree B gets 3 votes, tree C gets 2 votes and tree D gets 1 vote, then although tree A gets the most votes and tree D the fewest, tree D's weight is large, so the "winner" of this ballot is tree D.
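The A/B/C/D example above can be checked by multiplying each tree's raw votes by its weight; a sketch under the assumption that the weighted score is simply that product (function names are illustrative):

```python
def first_ballot_winner(votes, weights):
    # Score = raw votes x ballot weight; the highest weighted score wins.
    scores = {tree: votes[tree] * weights[tree] for tree in votes}
    return max(scores, key=scores.get), scores

winner, scores = first_ballot_winner(
    {"A": 5, "B": 3, "C": 2, "D": 1},   # raw votes
    {"A": 1, "B": 2, "C": 4, "D": 10},  # ballot weights
)
# scores == {"A": 5, "B": 6, "C": 8, "D": 10}; winner == "D"
```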
Step 105: synthesize the prediction results corresponding to the goal tree with the description data into second training data.
Step 106: take the second training data as the first training data and cyclically execute the steps from determining n training datasets in the full training data through synthesizing the goal tree's prediction results with the description data into second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain the random forest.
Here, the random forest comprises all tree sets obtained after the delete operation is executed in one or more cycles.
In conclusion the disclosure can determine n group training dataset in the first training data, the first training data packet
Include the corresponding description data of similar event of event to be predicted and the prediction result of the similar event;Data are described by this
The n tree trained by the n group training dataset is judged, to obtain the corresponding n prediction result of this n tree;According to
The accuracy and preset threshold of the n prediction result execute delete operation to this n tree, to obtain tree set, the tree set packet
It is set containing m, wherein m is less than or equal to n;First is carried out to this m tree according to the corresponding ballot weight of each tree in this m tree
Ballot operation, to obtain goal tree;It is the second training data that the corresponding prediction result of the goal tree, which is described Data Synthesis with this,;
Using second training data as first training data, circulation is executed determines the training of n group from above-mentioned in full dose training data
The corresponding prediction result of the goal tree is described the step of Data Synthesis is the second training data with this to above-mentioned by data set, until
The accuracy of the n prediction result is both greater than or equal to the preset threshold, and to obtain random forest, which is included in one
All tree set got after the delete operation are executed in a or multiple circulation implementation procedures.It can be to the more of random forest
Persistently whole training data is optimized in secondary training process, the tree for having single features in avoiding random forest is excessive,
While guaranteeing the generalization of random forest classification prediction, the accuracy of classification prediction is improved.
Fig. 2 is a flow chart of another random forest training method according to the embodiment shown in Fig. 1. As shown in Fig. 2, after step 106, the method may further comprise:
Step 107: take the description data corresponding to the event to be predicted as the input of the random forest, to obtain multiple prediction results output by the multiple trees in the random forest.
Step 108: determine, by a second ballot operation, the prediction result that occurs most often among the multiple prediction results, as the prediction result of the event to be predicted.
Illustratively, after the random forest is obtained, the description data of an existing event to be predicted can be fed to the random forest for prediction, where every decision tree in the random forest outputs one prediction result. Among these multiple prediction results, the one that occurs most often can be selected by the random forest's second ballot operation (an unweighted vote) as the final prediction result of the event to be predicted.
Continuing the fruit classification prediction example: suppose the random forest contains 30 trees, and the description data corresponding to the event to be predicted is green skin, green flesh, spherical, sweet. The random forest can output 30 prediction results for this description data, of which 25 are grape, 3 are green apple and 2 are kiwi fruit. The grape result, which has the highest vote share (the most occurrences), is then taken as the final prediction result.
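The 25/3/2 tally above is a plain majority vote; a minimal sketch using the standard-library `Counter`:

```python
from collections import Counter

def second_ballot(predictions):
    # Unweighted majority vote over every tree's output.
    return Counter(predictions).most_common(1)[0][0]

# 30 trees vote: 25 grape, 3 green apple, 2 kiwi fruit.
predictions = ["grape"] * 25 + ["green apple"] * 3 + ["kiwi fruit"] * 2
final = second_ballot(predictions)
# final == "grape"
```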
Fig. 3 is a flow chart of a random forest judging method according to the embodiment shown in Fig. 2. As shown in Fig. 3, step 102 may comprise:
Step 1021: train the n trees on the n training datasets.
Illustratively, step 1021 may be called the pre-training step of the random forest; the initial random forest obtained after this step still has certain defects in classification precision. It is therefore necessary, in the following steps, to draw on the idea of the AdaBoost method to train each tree in the random forest repeatedly, continuously optimizing the first training data (i.e., the full training data) during training so as to appropriately strengthen the effect of the key training data and improve the precision of the random forest.
Step 1022: take the description data as the input of each of the n trees, to obtain the n prediction results output by the n trees.
Fig. 4 is a flow chart of a goal tree obtaining method according to the embodiment shown in Fig. 2. As shown in Fig. 4, step 104 may comprise:
Step 1041: determine the error rate of each tree according to the accuracy of that tree's prediction result.
Illustratively, the error rate of a decision tree is 1 minus the accuracy of that decision tree.
Step 1042: take the error rate of each tree as the input of a preset ballot weight calculation formula, to obtain the ballot weight of each tree output by the formula.
Illustratively, the ballot weight calculation formula is as shown in equation (1) below (the equation itself is not reproduced in this text), where W denotes the ballot weight and e_i denotes the error rate of the i-th tree among the m trees.
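Since equation (1) is missing from this text, it cannot be restated exactly; as the description draws on AdaBoost, a common weight formula of that family, given here purely as an assumption and not as the patent's actual formula, is:

```latex
W_i \;=\; \ln\!\left(\frac{1 - e_i}{e_i}\right), \qquad 0 < e_i < 1
```

where \(W_i\) is the ballot weight of the i-th tree and \(e_i\) its error rate; a lower error rate yields a higher weight.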
Step 1043: divide the m trees into multiple ballot groups.
Here, each ballot group comprises the trees that share an identical prediction result, and the number of those trees is the vote count corresponding to that ballot group.
Step 1044: obtain the product of the vote count and the ballot weight corresponding to any one tree in each ballot group, as that ballot group's vote share.
Step 1045: obtain any one tree in the ballot group with the highest vote share, as the goal tree.
Illustratively, in the weighted first ballot operation, the vote share of each ballot group is the product of the ballot weight and the vote count actually obtained. It can also be understood that the multiple prediction results within each ballot group are the same prediction result, and these prediction results correspond to multiple equivalent trees. Therefore, any one tree can be chosen from the ballot group with the highest vote share as the goal tree.
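Steps 1043-1045 can be sketched end to end: group by identical prediction, score each group as its size times the weight of any one member (as the description specifies), and return any tree from the best group. Names are illustrative.

```python
from collections import defaultdict

def first_ballot(trees, predictions, weights):
    # Step 1043: group trees by identical prediction result.
    groups = defaultdict(list)
    for tree, pred in zip(trees, predictions):
        groups[pred].append(tree)
    # Step 1044: vote share = group size x weight of any one member.
    best = max(groups, key=lambda p: len(groups[p]) * weights[groups[p][0]])
    # Step 1045: any tree from the highest-scoring group is the goal tree.
    return groups[best][0]

goal = first_ballot(
    trees=["t1", "t2", "t3", "t4"],
    predictions=["apple", "apple", "pear", "pear"],
    weights={"t1": 1.0, "t2": 1.0, "t3": 3.0, "t4": 3.0},
)
# "pear" group: 2 votes x 3.0 = 6.0 beats "apple" group: 2 x 1.0 = 2.0,
# so goal == "t3"
```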
In conclusion the disclosure can determine n group training dataset in the first training data, the first training data packet
Include the corresponding description data of similar event of event to be predicted and the prediction result of the similar event;Data are described by this
The n tree trained by the n group training dataset is judged, to obtain the corresponding n prediction result of this n tree;According to
The accuracy and preset threshold of the n prediction result execute delete operation to this n tree, to obtain tree set, the tree set packet
It is set containing m, wherein m is less than or equal to n;First is carried out to this m tree according to the corresponding ballot weight of each tree in this m tree
Ballot operation, to obtain goal tree;It is the second training data that the corresponding prediction result of the goal tree, which is described Data Synthesis with this,;
Using second training data as first training data, circulation is executed determines the training of n group from above-mentioned in full dose training data
The corresponding prediction result of the goal tree is described the step of Data Synthesis is the second training data with this to above-mentioned by data set, until
The accuracy of the n prediction result is both greater than or equal to the preset threshold, and to obtain random forest, which is included in one
All tree set got after the delete operation are executed in a or multiple circulation implementation procedures.It can be to the more of random forest
Persistently whole training data is optimized in secondary training process, the tree for having single features in avoiding random forest is excessive,
While guaranteeing the generalization of random forest classification prediction, the accuracy of classification prediction is improved.
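The loop summarized above can be condensed into a short sketch. Every helper passed in below (sampling, tree training, evaluation, voting, synthesis) is a hypothetical placeholder for the corresponding operation of the disclosure:

```python
def train_random_forest(first_training_data, n, threshold,
                        sample_datasets, train_tree, evaluate,
                        first_vote, synthesize):
    """Iterative training loop of the disclosure (sketch).

    Each round: draw n training datasets, train n trees, delete the
    trees whose prediction accuracy is below the threshold, and add
    the survivors to the forest.  While any tree fails the threshold,
    pick a goal tree by the first voting operation and synthesize its
    prediction with the description data into the next round's
    training data.  Assumes at least one tree survives every round.
    """
    forest = []
    data = first_training_data
    while True:
        datasets = sample_datasets(data, n)
        trees = [train_tree(ds) for ds in datasets]
        accuracies = [evaluate(tree, data) for tree in trees]
        kept = [t for t, a in zip(trees, accuracies) if a >= threshold]
        forest.extend(kept)                 # union of per-round tree sets
        if len(kept) == n:                  # all n accuracies meet threshold
            return forest
        goal_tree = first_vote(kept)
        data = synthesize(goal_tree, data)  # second training data
```

The forest returned is the union of every round's surviving tree set, matching the statement that the random forest contains all tree sets obtained after the delete operation across the loop iterations.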
Fig. 5 is a block diagram of a training apparatus for a random forest according to an exemplary embodiment. As shown in Fig. 5, the apparatus 500 includes:
a dataset determining module 510, configured to determine n training datasets in first training data, the first training data including description data corresponding to events similar to an event to be predicted and the prediction results of the similar events;
a random forest evaluation module 520, configured to evaluate, by means of the description data, the n trees trained on the n training datasets, to obtain the n prediction results corresponding to the n trees;
a random forest deletion module 530, configured to perform a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set containing m trees, where m is less than or equal to n;
a goal tree obtaining module 540, configured to perform a first voting operation on the m trees according to the voting weight of each of the m trees, to obtain a goal tree;
a data synthesis module 550, configured to synthesize the prediction result of the goal tree with the description data into second training data; and
a loop execution module 560, configured to take the second training data as the first training data and repeat the steps from determining the n training datasets in the full training data through synthesizing the prediction result of the goal tree with the description data into the second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain a random forest containing all the tree sets obtained after the delete operation in the one or more loop iterations.
Fig. 6 is a block diagram of another training apparatus for a random forest according to the embodiment of Fig. 5. As shown in Fig. 6, the apparatus 500 further includes:
a data input module 570, configured to take the description data corresponding to the event to be predicted as the input of the random forest, to obtain multiple prediction results output by the trees in the random forest; and
a result determining module 580, configured to determine, by a second voting operation, the prediction result that occurs most often among the multiple prediction results, as the prediction result of the event to be predicted.
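The second voting operation of module 580 is a plain majority vote over the forest's outputs; a minimal sketch (names are illustrative):

```python
from collections import Counter

def second_vote(predictions):
    """Return the prediction result occurring most often among the trees."""
    return Counter(predictions).most_common(1)[0][0]
```

Ties are broken by first occurrence, a detail the disclosure leaves open.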
Fig. 7 is a block diagram of the random forest evaluation module in the embodiment of Fig. 6. As shown in Fig. 7, the random forest evaluation module 520 includes:
a random forest training submodule 521, configured to train the n trees on the n training datasets; and
a random forest evaluation submodule 522, configured to take the description data as the input of each of the n trees, to obtain the n prediction results output by the n trees.
Optionally, the random forest deletion module 530 is configured to:
when u of the n prediction results have an accuracy below the preset threshold, delete the u trees corresponding to those u prediction results, to obtain the tree set containing m trees, where m = n - u; or,
when the accuracy of all n prediction results is greater than or equal to the preset threshold, obtain the tree set containing m trees, where m = n.
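The two branches of module 530 reduce to a single filter: keep each tree whose prediction accuracy reaches the preset threshold. A sketch with hypothetical names:

```python
def delete_trees(trees, accuracies, threshold):
    """Delete the u trees whose prediction accuracy is below the threshold.

    Returns the surviving tree set of m = n - u trees; when every
    accuracy meets the threshold, all n trees survive and m = n.
    """
    return [tree for tree, acc in zip(trees, accuracies) if acc >= threshold]
```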
Fig. 8 is a block diagram of the goal tree obtaining module in the embodiment of Fig. 6. As shown in Fig. 8, the goal tree obtaining module 540 includes:
an error rate determining submodule 541, configured to determine the error rate of each tree according to the accuracy of its prediction result;
a weight calculation submodule 542, configured to take the error rate of each tree as the input of a preset voting weight calculation formula, to obtain the voting weight of each tree output by the formula;
a voting group dividing submodule 543, configured to divide the m trees into multiple voting groups, where each voting group contains the trees sharing the same prediction result, and the number of those trees is the vote count of that voting group;
a vote tally obtaining submodule 544, configured to obtain the product of the vote count and the voting weight of any one tree in each voting group, as the vote tally of that voting group; and
a goal tree obtaining submodule 545, configured to obtain any one tree from the voting group with the highest vote tally, as the goal tree.
In conclusion the disclosure can determine n group training dataset in the first training data, the first training data packet
Include the corresponding description data of similar event of event to be predicted and the prediction result of the similar event;Data are described by this
The n tree trained by the n group training dataset is judged, to obtain the corresponding n prediction result of this n tree;According to
The accuracy and preset threshold of the n prediction result execute delete operation to this n tree, to obtain tree set, the tree set packet
It is set containing m, wherein m is less than or equal to n;First is carried out to this m tree according to the corresponding ballot weight of each tree in this m tree
Ballot operation, to obtain goal tree;It is the second training data that the corresponding prediction result of the goal tree, which is described Data Synthesis with this,;
Using second training data as first training data, circulation is executed determines the training of n group from above-mentioned in full dose training data
The corresponding prediction result of the goal tree is described the step of Data Synthesis is the second training data with this to above-mentioned by data set, until
The accuracy of the n prediction result is both greater than or equal to the preset threshold, and to obtain random forest, which is included in one
All tree set got after the delete operation are executed in a or multiple circulation implementation procedures.It can be to the more of random forest
Persistently whole training data is optimized in secondary training process, the tree for having single features in avoiding random forest is excessive,
While guaranteeing the generalization of random forest classification prediction, the accuracy of classification prediction is improved.
With respect to the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiment of the related method, and will not be elaborated here.
Fig. 9 is a block diagram of an electronic device 900 according to an exemplary embodiment. As shown in Fig. 9, the electronic device 900 may include a processor 901, a memory 902, a multimedia component 903, an input/output (I/O) interface 904, and a communication component 905.
The processor 901 is configured to control the overall operation of the electronic device 900, so as to complete all or part of the steps of the above training method for a random forest. The memory 902 is configured to store various types of data to support the operation of the electronic device 900; such data may include, for example, instructions of any application or method operating on the electronic device 900 and application-related data, such as contact data, sent and received messages, pictures, audio, video, and the like. The memory 902 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. The multimedia component 903 may include a screen and an audio component, where the screen may be, for example, a touch screen, and the audio component is configured to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals; a received audio signal may be further stored in the memory 902 or sent through the communication component 905. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 904 provides an interface between the processor 901 and other interface modules, such as a keyboard, a mouse, or buttons; the buttons may be virtual buttons or physical buttons. The communication component 905 is configured for wired or wireless communication between the electronic device 900 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them, so the corresponding communication component 905 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
In an exemplary embodiment, the electronic device 900 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, for executing the above training method for a random forest.
In a further exemplary embodiment, a computer-readable storage medium including program instructions is also provided, for example the memory 902 including program instructions, which are executable by the processor 901 of the electronic device 900 to complete the above training method for a random forest.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings; however, the present disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the present disclosure, other embodiments that readily occur to those skilled in the art after considering the specification and practicing the disclosure also fall within the protection scope of the present disclosure.
It should be further noted that the specific technical features described in the above specific embodiments may be combined in any suitable manner without contradiction. Likewise, any combination may be made between the various different embodiments of the present disclosure, and such combinations should equally be regarded as content disclosed by the present disclosure as long as they do not depart from the idea of the disclosure. The present disclosure is not limited to the precise structures described above, and the scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A training method for a random forest, wherein the method comprises:
determining n training datasets in first training data, the first training data comprising description data corresponding to events similar to an event to be predicted and prediction results of the similar events;
evaluating, by means of the description data, n trees trained on the n training datasets, to obtain n prediction results corresponding to the n trees;
performing a delete operation on the n trees according to an accuracy of the n prediction results and a preset threshold, to obtain a tree set, the tree set comprising m trees, where m is less than or equal to n;
performing a first voting operation on the m trees according to a voting weight of each of the m trees, to obtain a goal tree;
synthesizing the prediction result of the goal tree with the description data into second training data; and
taking the second training data as the first training data, and repeating the steps from determining the n training datasets in the full training data through synthesizing the prediction result of the goal tree with the description data into the second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain a random forest, the random forest comprising all tree sets obtained after the delete operation in one or more loop iterations.
2. The method according to claim 1, wherein the method further comprises:
taking the description data corresponding to the event to be predicted as an input of the random forest, to obtain multiple prediction results output by the trees in the random forest; and
determining, by a second voting operation, the prediction result that occurs most often among the multiple prediction results, as the prediction result of the event to be predicted.
3. The method according to claim 1, wherein the evaluating, by means of the description data, the n trees trained on the n training datasets, to obtain the n prediction results corresponding to the n trees, comprises:
training the n trees on the n training datasets; and
taking the description data as an input of each of the n trees, to obtain the n prediction results output by the n trees.
4. The method according to claim 1, wherein the performing a delete operation on the n trees according to the accuracy of the n prediction results and the preset threshold, to obtain a tree set, comprises:
when u of the n prediction results have an accuracy below the preset threshold, deleting the u trees corresponding to the u prediction results, to obtain the tree set comprising m trees, where m = n - u; or,
when the accuracy of all n prediction results is greater than or equal to the preset threshold, obtaining the tree set comprising m trees, where m = n.
5. The method according to claim 1, wherein the performing a first voting operation on the m trees according to the voting weight of each of the m trees, to obtain a goal tree, comprises:
determining an error rate of each tree according to the accuracy of its prediction result;
taking the error rate of each tree as an input of a preset voting weight calculation formula, to obtain the voting weight of each tree output by the voting weight calculation formula;
dividing the m trees into multiple voting groups, where each voting group contains the trees sharing the same prediction result, and the number of those trees is the vote count of that voting group;
obtaining the product of the vote count and the voting weight of any one tree in each voting group, as the vote tally of that voting group; and
obtaining any one tree from the voting group with the highest vote tally, as the goal tree.
6. A training apparatus for a random forest, wherein the apparatus comprises:
a dataset determining module, configured to determine n training datasets in first training data, the first training data comprising description data corresponding to events similar to an event to be predicted and prediction results of the similar events;
a random forest evaluation module, configured to evaluate, by means of the description data, n trees trained on the n training datasets, to obtain n prediction results corresponding to the n trees;
a random forest deletion module, configured to perform a delete operation on the n trees according to an accuracy of the n prediction results and a preset threshold, to obtain a tree set, the tree set comprising m trees, where m is less than or equal to n;
a goal tree obtaining module, configured to perform a first voting operation on the m trees according to a voting weight of each of the m trees, to obtain a goal tree;
a data synthesis module, configured to synthesize the prediction result of the goal tree with the description data into second training data; and
a loop execution module, configured to take the second training data as the first training data and repeat the steps from determining the n training datasets in the full training data through synthesizing the prediction result of the goal tree with the description data into the second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain a random forest, the random forest comprising all tree sets obtained after the delete operation in one or more loop iterations.
7. The apparatus according to claim 6, wherein the apparatus further comprises:
a data input module, configured to take the description data corresponding to the event to be predicted as an input of the random forest, to obtain multiple prediction results output by the trees in the random forest; and
a result determining module, configured to determine, by a second voting operation, the prediction result that occurs most often among the multiple prediction results, as the prediction result of the event to be predicted.
8. The apparatus according to claim 6, wherein the goal tree obtaining module comprises:
an error rate determining submodule, configured to determine an error rate of each tree according to the accuracy of its prediction result;
a weight calculation submodule, configured to take the error rate of each tree as an input of a preset voting weight calculation formula, to obtain the voting weight of each tree output by the voting weight calculation formula;
a voting group dividing submodule, configured to divide the m trees into multiple voting groups, where each voting group contains the trees sharing the same prediction result, and the number of those trees is the vote count of that voting group;
a vote tally obtaining submodule, configured to obtain the product of the vote count and the voting weight of any one tree in each voting group, as the vote tally of that voting group; and
a goal tree obtaining submodule, configured to obtain any one tree from the voting group with the highest vote tally, as the goal tree.
9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-5.
10. An electronic device, comprising:
a memory on which a computer program is stored; and
a processor, configured to execute the computer program in the memory, to implement the steps of the method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811557766.4A CN109726826B (en) | 2018-12-19 | 2018-12-19 | Training method and device for random forest, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109726826A true CN109726826A (en) | 2019-05-07 |
CN109726826B CN109726826B (en) | 2021-08-13 |
Family
ID=66296251
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811557766.4A Active CN109726826B (en) | 2018-12-19 | 2018-12-19 | Training method and device for random forest, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109726826B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104391970A (en) * | 2014-12-04 | 2015-03-04 | 深圳先进技术研究院 | Attribute subspace weighted random forest data processing method |
WO2015066564A1 (en) * | 2013-10-31 | 2015-05-07 | Cancer Prevention And Cure, Ltd. | Methods of identification and diagnosis of lung diseases using classification systems and kits thereof |
CN105844300A (en) * | 2016-03-24 | 2016-08-10 | 河南师范大学 | Optimized classification method and optimized classification device based on random forest algorithm |
US9519868B2 (en) * | 2012-06-21 | 2016-12-13 | Microsoft Technology Licensing, Llc | Semi-supervised random decision forests for machine learning using mahalanobis distance to identify geodesic paths |
CN108846338A (en) * | 2018-05-29 | 2018-11-20 | 南京林业大学 | Polarization characteristic selection and classification method based on object-oriented random forest |
Non-Patent Citations (2)
Title |
---|
WEISHI MAN et al.: "Image classification based on improved random forest algorithm", 2018 IEEE 3rd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA) * |
FENG Kaiping et al.: "Expression recognition method based on weighted KNN and random forest", Software Guide (《软件导刊》) * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110264342A (en) * | 2019-06-19 | 2019-09-20 | 深圳前海微众银行股份有限公司 | A kind of business audit method and device based on machine learning |
CN111160647A (en) * | 2019-12-30 | 2020-05-15 | 第四范式(北京)技术有限公司 | Money laundering behavior prediction method and device |
CN111160647B (en) * | 2019-12-30 | 2023-08-22 | 第四范式(北京)技术有限公司 | Money laundering behavior prediction method and device |
CN113516173A (en) * | 2021-05-27 | 2021-10-19 | 江西五十铃汽车有限公司 | Evaluation method for static and dynamic interference of whole vehicle based on random forest and decision tree |
Also Published As
Publication number | Publication date |
---|---|
CN109726826B (en) | 2021-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112101190B (en) | Remote sensing image classification method, storage medium and computing device | |
Fuentes et al. | High-performance deep neural network-based tomato plant diseases and pests diagnosis system with refinement filter bank | |
CN105701120B (en) | The method and apparatus for determining semantic matching degree | |
US20180365525A1 (en) | Multi-sampling model training method and device | |
CN108304921A (en) | The training method and image processing method of convolutional neural networks, device | |
CN107844784A (en) | Face identification method, device, computer equipment and readable storage medium storing program for executing | |
WO2020155300A1 (en) | Model prediction method and device | |
CN107230108A (en) | The processing method and processing device of business datum | |
CN109948680B (en) | Classification method and system for medical record data | |
CN111461168A (en) | Training sample expansion method and device, electronic equipment and storage medium | |
CN109726826A (en) | Training method, device, storage medium and the electronic equipment of random forest | |
CN107909141A (en) | A kind of data analysing method and device based on grey wolf optimization algorithm | |
CN112232944B (en) | Method and device for creating scoring card and electronic equipment | |
CN106803039A (en) | The homologous decision method and device of a kind of malicious file | |
Nimmagadda et al. | Cricket score and winning prediction using data mining | |
CN111178656A (en) | Credit model training method, credit scoring device and electronic equipment | |
CN109670623A (en) | Neural net prediction method and device | |
CN109670567A (en) | Neural net prediction method and device | |
CN109829471A (en) | Training method, device, storage medium and the electronic equipment of random forest | |
CN110413682A (en) | A kind of the classification methods of exhibiting and system of data | |
CN107135402A (en) | A kind of method and device for recognizing TV station's icon | |
CN115762530A (en) | Voiceprint model training method and device, computer equipment and storage medium | |
CN114742221A (en) | Deep neural network model pruning method, system, equipment and medium | |
CN107403199A (en) | Data processing method and device | |
Dang-Nhu | Evaluating disentanglement of structured representations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||