CN109726826A - Random forest training method and apparatus, storage medium, and electronic device - Google Patents

Random forest training method and apparatus, storage medium, and electronic device

Info

Publication number
CN109726826A
Authority
CN
China
Prior art keywords
tree
prediction result
voting
data
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811557766.4A
Other languages
Chinese (zh)
Other versions
CN109726826B (en)
Inventor
Gao Rui (高睿)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neusoft Corp
Original Assignee
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neusoft Corp filed Critical Neusoft Corp
Priority to CN201811557766.4A
Publication of CN109726826A
Application granted
Publication of CN109726826B
Legal status: Active (granted)

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure relates to a random forest training method and apparatus, a storage medium, and an electronic device. The method comprises: determining n training data subsets within first training data; evaluating the n trees trained on those subsets against the description data of the first training data, to obtain n prediction results; deleting trees from the n trees according to the accuracy of the n prediction results and a preset threshold, leaving m trees; performing a weighted vote over the m trees according to each tree's voting weight, to obtain a goal tree; combining the goal tree's prediction results with the description data into second training data; and, taking the second training data as the first training data, looping through the above steps until the accuracy of all n prediction results reaches or exceeds the preset threshold, thereby obtaining the random forest. The full training data is continuously optimized across the repeated training passes, the proliferation of trees with a single dominant feature during training is avoided, and the accuracy of classification prediction is improved.

Description

Random forest training method and apparatus, storage medium, and electronic device
Technical field
The present disclosure relates to the field of machine learning, and in particular to a random forest training method and apparatus, a storage medium, and an electronic device.
Background Art
A random forest is a classifier comprising multiple decision trees; its output is the mode of the prediction results output by the individual trees. A decision tree is a tree-structured model for supervised learning. In supervised learning, a set of samples is given, each sample comprising a set of attributes (description data) and a class (prediction result); the classes are predetermined. By learning from this set of samples, a decision tree with classification capability is obtained, which can assign the correct class (output a prediction result) to a newly arriving object. In the related art, when a random forest is trained, each decision tree in the random forest is usually trained once on a portion of the full training data, and when classifying new data the prediction result with the most votes is taken. This classification scheme avoids over-fitting in classification prediction and improves the generalization of the classifier. However, a decision tree that has undergone only a single round of training has low prediction accuracy and cannot cope with unbalanced data features in the training data (where some classes have far more samples than others), which in turn reduces the accuracy of the overall classification prediction.
Summary of the invention
To overcome the problems in the related art, an object of the present disclosure is to provide a random forest training method and apparatus, a storage medium, and an electronic device.
To achieve the above object, according to a first aspect of the embodiments of the present disclosure, there is provided a random forest training method, the method comprising:
determining n training data subsets within first training data, the first training data comprising description data corresponding to events of the same class as an event to be predicted, together with the prediction results of those events;
evaluating, using the description data, the n trees trained on the n training data subsets, to obtain the n prediction results corresponding to the n trees;
performing a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set, the tree set comprising m trees, where m is less than or equal to n;
performing a first voting operation on the m trees according to the voting weight of each of the m trees, to obtain a goal tree;
combining the prediction results corresponding to the goal tree with the description data into second training data; and
taking the second training data as the first training data, and looping through the steps from determining the n training data subsets within the full training data to combining the goal tree's prediction results with the description data into the second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain a random forest, the random forest comprising all tree sets obtained after performing the delete operation during one or more loop iterations.
Optionally, the method further comprises:
taking the description data corresponding to the event to be predicted as input to the random forest, to obtain multiple prediction results output by the trees in the random forest;
determining, by a second voting operation, the most frequently occurring prediction result among the multiple prediction results, as the prediction result of the event to be predicted.
Optionally, evaluating, using the description data, the n trees trained on the n training data subsets, to obtain the n prediction results corresponding to the n trees, comprises:
training the n trees on the n training data subsets;
taking the description data as input to each of the n trees, to obtain the n prediction results output by the n trees.
Optionally, performing the delete operation on the n trees according to the accuracy of the n prediction results and the preset threshold, to obtain the tree set, comprises:
when there are u prediction results among the n prediction results whose accuracy is below the preset threshold, deleting the u trees corresponding to those u prediction results, to obtain a tree set comprising m trees, where m = n − u; or,
when the accuracy of all n prediction results is greater than or equal to the preset threshold, obtaining a tree set comprising m trees, where m = n.
Optionally, performing the first voting operation on the m trees according to the voting weight of each of the m trees, to obtain the goal tree, comprises:
determining the error rate of each tree according to the accuracy of its prediction result;
taking the error rate of each tree as input to a preset voting weight calculation formula, to obtain the voting weight of each tree output by the formula;
dividing the m trees into multiple voting groups, wherein each voting group comprises the trees that share an identical prediction result, the number of those trees being the vote count of the group;
obtaining the product of the vote count and the voting weight of any one tree in each voting group, as the vote share of that group;
selecting any one tree from the voting group with the highest vote share, as the goal tree.
According to a second aspect of the embodiments of the present disclosure, there is provided a random forest training apparatus, the apparatus comprising:
a data set determining module, configured to determine n training data subsets within first training data, the first training data comprising description data corresponding to events of the same class as an event to be predicted, together with the prediction results of those events;
a random forest evaluation module, configured to evaluate, using the description data, the n trees trained on the n training data subsets, to obtain the n prediction results corresponding to the n trees;
a random forest deletion module, configured to perform a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set, the tree set comprising m trees, where m is less than or equal to n;
a goal tree obtaining module, configured to perform a first voting operation on the m trees according to the voting weight of each of the m trees, to obtain a goal tree;
a data combining module, configured to combine the prediction results corresponding to the goal tree with the description data into second training data; and
a loop execution module, configured to take the second training data as the first training data and loop through the steps from determining the n training data subsets within the full training data to combining the goal tree's prediction results with the description data into the second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain a random forest, the random forest comprising all tree sets obtained after performing the delete operation during one or more loop iterations.
Optionally, the apparatus further comprises:
a data input module, configured to take the description data corresponding to the event to be predicted as input to the random forest, to obtain multiple prediction results output by the trees in the random forest;
a result determining module, configured to determine, by a second voting operation, the most frequently occurring prediction result among the multiple prediction results, as the prediction result of the event to be predicted.
Optionally, the random forest evaluation module comprises:
a random forest training submodule, configured to train the n trees on the n training data subsets;
a random forest evaluation submodule, configured to take the description data as input to each of the n trees, to obtain the n prediction results output by the n trees.
Optionally, the random forest deletion module is configured to:
when there are u prediction results among the n prediction results whose accuracy is below the preset threshold, delete the u trees corresponding to those u prediction results, to obtain a tree set comprising m trees, where m = n − u; or,
when the accuracy of all n prediction results is greater than or equal to the preset threshold, obtain a tree set comprising m trees, where m = n.
Optionally, the goal tree obtaining module comprises:
an error rate determining submodule, configured to determine the error rate of each tree according to the accuracy of its prediction result;
a weight calculation submodule, configured to take the error rate of each tree as input to a preset voting weight calculation formula, to obtain the voting weight of each tree output by the formula;
a voting group dividing submodule, configured to divide the m trees into multiple voting groups, wherein each voting group comprises the trees that share an identical prediction result, the number of those trees being the vote count of the group;
a vote share obtaining submodule, configured to obtain the product of the vote count and the voting weight of any one tree in each voting group, as the vote share of that group;
a goal tree obtaining submodule, configured to select any one tree from the voting group with the highest vote share, as the goal tree.
According to a third aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the steps of the random forest training method provided by the first aspect of the embodiments of the present disclosure are implemented.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic device, comprising:
a memory having a computer program stored thereon; and
a processor, configured to execute the computer program in the memory, to implement the steps of the random forest training method provided by the first aspect of the embodiments of the present disclosure.
Through the above technical solution, the present disclosure determines n training data subsets within first training data, the first training data comprising the description data of events of the same class as an event to be predicted together with their prediction results; evaluates, using the description data, the n trees trained on the n training data subsets, to obtain the n prediction results corresponding to the n trees; performs a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set comprising m trees, where m is less than or equal to n; performs a first voting operation on the m trees according to the voting weight of each tree, to obtain a goal tree; combines the goal tree's prediction results with the description data into second training data; and, taking the second training data as the first training data, loops through the above steps until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain a random forest comprising all tree sets obtained after the delete operation during one or more loop iterations. The full training data is continuously optimized across the repeated training passes, an excess of trees with a single dominant feature during training is avoided, and the accuracy of classification prediction is improved.
Other features and advantages of the present disclosure will be described in detail in the following detailed description.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the present disclosure and constitute a part of the specification; together with the following detailed description, they serve to explain the present disclosure but do not limit it. In the drawings:
Fig. 1 is a flowchart of a random forest training method according to an exemplary embodiment;
Fig. 2 is a flowchart of another random forest training method according to the embodiment shown in Fig. 1;
Fig. 3 is a flowchart of a random forest evaluation method according to the embodiment shown in Fig. 2;
Fig. 4 is a flowchart of a goal tree obtaining method according to the embodiment shown in Fig. 2;
Fig. 5 is a block diagram of a random forest training apparatus according to an exemplary embodiment;
Fig. 6 is a block diagram of another random forest training apparatus according to the embodiment shown in Fig. 5;
Fig. 7 is a block diagram of a random forest evaluation module according to the embodiment shown in Fig. 6;
Fig. 8 is a block diagram of a goal tree obtaining module according to the embodiment shown in Fig. 6;
Fig. 9 is a block diagram of an electronic device according to an exemplary embodiment.
Detailed Description
Exemplary embodiments are described in detail here, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as recited in the appended claims.
Fig. 1 is a flowchart of a random forest training method according to an exemplary embodiment. As shown in Fig. 1, the method comprises:
Step 101: determine n training data subsets within first training data.
The first training data comprises the description data corresponding to events of the same class as the event to be predicted, together with the prediction results (classification results) of those events. In principle, the first training data should describe the event class in as much detail and as completely as possible. The n training data subsets are n subsets randomly drawn from the first training data; the subsets may consist of entirely distinct examples, or may partially overlap with one another.
Take the classification prediction of a fruit as the event to be predicted: the event to be predicted is a fruit classification event, so the description data and prediction results of as many kinds of fruit as possible (with as many examples as possible) should be collected, so that the whole classification event is described completely. The first training data may be as shown in Table 1 below.
Table 1
A            B            C                D               E
Yellow skin  White flesh  Crescent-shaped  Sweet           Banana
Green skin   Red flesh    Spherical        Sweet           Watermelon
Red skin     White flesh  Spherical        Sweet and sour  Apple
Each row in Table 1 is one prediction event (or example); the data in columns A, B, C and D form the description data part, and the data in column E form the prediction result part. Note that the first training data contains a large number (e.g., 100,000) of examples with their corresponding description data and prediction results; Table 1 shows only the three examples of banana, watermelon, and apple. Each of the n training data subsets selected from the first training data has the same form as the first training data, i.e., it also contains the above description data part and prediction result part.
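To make the sampling of step 101 concrete, the following is a minimal sketch of drawing n possibly overlapping training data subsets from the first training data (illustrative only; the row encoding, subset size, and function name are assumptions, not part of the patent):

```python
import random

def sample_subsets(first_training_data, n, subset_size):
    """Step 101: randomly draw n training data subsets.

    Each row is (description_data, prediction_result). Rows inside one subset
    are distinct, but different subsets may overlap, matching the
    overlapping-subset behavior described above."""
    return [random.sample(first_training_data, subset_size) for _ in range(n)]

# Toy rows in the shape of Table 1: columns A-D form the description data, E the label.
rows = [
    (("yellow skin", "white flesh", "crescent-shaped", "sweet"), "banana"),
    (("green skin", "red flesh", "spherical", "sweet"), "watermelon"),
    (("red skin", "white flesh", "spherical", "sweet and sour"), "apple"),
]
subsets = sample_subsets(rows, n=2, subset_size=2)
```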
Step 102: evaluate, using the description data, the n trees trained on the n training data subsets, to obtain the n prediction results corresponding to the n trees.
Here, each tree is a decision tree (also called a classification tree), an established tree-structured machine learning model; a random forest is composed of multiple decision trees.
Illustratively, in step 102, n decision trees are first trained on the n training data subsets selected in step 101, forming an initial random forest. After the n decision trees are obtained, each tree can be evaluated against the description data of the first training data; that is, all prediction results in the first training data are removed, the remaining description data are fed into every trained decision tree, each decision tree outputs a new prediction result for the description data, and the accuracy corresponding to each decision tree is thereby obtained.
Taking Table 1 as an example, the description data in columns A, B, C and D are fed into tree A as input, yielding one set of prediction results. That prediction result is in fact itself a column containing many individual predictions, which can be compared with the data in column E to obtain the accuracy of the prediction. With the three examples shown in Table 1: if the prediction results are banana, watermelon, and litchi, then, compared against column E, the accuracy of the prediction is 2/3.
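The training and evaluation of step 102 can be sketched as follows, assuming scikit-learn's DecisionTreeClassifier, numerically encoded description data, and subsets given as (X_sub, y_sub) pairs (the helper name is illustrative, not from the patent):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_and_evaluate(subsets, X_full, y_full):
    """Train one decision tree per subset, then score each tree on the
    description data of the full first training data (step 102)."""
    trees, accuracies = [], []
    for X_sub, y_sub in subsets:
        tree = DecisionTreeClassifier().fit(X_sub, y_sub)
        predictions = tree.predict(X_full)  # the new prediction result column
        # compare against column E to get the tree's accuracy
        accuracies.append(float(np.mean(predictions == np.asarray(y_full))))
        trees.append(tree)
    return trees, accuracies
```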
Step 103: perform a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set.
The tree set comprises m trees, where m is less than or equal to n.
Illustratively, step 103 may comprise: when there are u prediction results among the n prediction results whose accuracy is below the preset threshold, deleting the u trees corresponding to those u prediction results, to obtain a tree set comprising m trees, where m = n − u; or, when the accuracy of all n prediction results is greater than or equal to the preset threshold, obtaining a tree set comprising m trees, where m = n. In other words, whenever the accuracy of any prediction result among the n prediction results is below the preset threshold, the trees whose prediction accuracy is below the threshold are deleted from the n trees, the remaining m trees are retained as the above tree set, and the first voting operation in step 104 below (a vote with attached weights) selects from them the prediction result output by the decision tree with the highest accuracy and the strongest generalization. When the accuracy of all n prediction results is greater than or equal to the preset threshold, the goal of optimizing the decision trees that make up the random forest is considered achieved; the scheme then stops at this step 103, and the n trees (together with the one or more tree sets retained before this accuracy check) are taken as the trained random forest.
Illustratively, step 106 of this scheme in fact loops over steps 101 to 105. During each loop iteration, when there are u prediction results with accuracy below the preset threshold among the n prediction results, and m trees are consequently retained, those m trees can be regarded as one tree set. If, for example, the accuracy of all n prediction results is determined to be greater than or equal to the preset threshold after 5 loop iterations, then, in addition to the n trees corresponding to the current n prediction results, the 5 tree sets retained during those 5 loop iterations are also kept. It will be understood that the number of decision trees contained in each tree set generally differs. The random forest finally determined by this scheme is thus composed of these n trees (in fact also a tree set obtained in this step 103) together with all trees in the other 5 tree sets.
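A minimal sketch of the delete operation of step 103, under the same assumptions as the previous sketch:

```python
def delete_weak_trees(trees, accuracies, threshold):
    """Step 103: keep only the trees whose accuracy meets the preset threshold.

    Returns the surviving tree set of m trees and their accuracies
    (m = n - u, or m = n when no tree falls below the threshold)."""
    kept = [(t, a) for t, a in zip(trees, accuracies) if a >= threshold]
    return [t for t, _ in kept], [a for _, a in kept]
```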
Step 104: perform a first voting operation on the m trees according to the voting weight of each of the m trees, to obtain a goal tree.
Illustratively, the voting weight describes the importance of the votes each tree obtains in the first voting operation. For example, consider four trees A(1), B(2), C(4) and D(10), where the number in parentheses is each tree's weight. Suppose the first voting operation yields the following results: tree A obtains 5 votes, tree B obtains 3 votes, tree C obtains 2 votes, and tree D obtains 1 vote. Although tree A obtains more votes and tree D fewer, tree D's weight is large, so the 'winner' of this vote is tree D.
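Worked out in code, the toy example reads as follows (the weighted score of each tree is simply its voting weight multiplied by its raw votes):

```python
# (voting weight, raw votes) for trees A, B, C and D from the example above
candidates = {"A": (1, 5), "B": (2, 3), "C": (4, 2), "D": (10, 1)}

scores = {name: w * v for name, (w, v) in candidates.items()}
winner = max(scores, key=scores.get)  # "D": 10*1 = 10 beats A's 5, B's 6 and C's 8
```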
Step 105: combine the prediction results corresponding to the goal tree with the description data into second training data.
Step 106: taking the second training data as the first training data, loop through the steps from determining the n training data subsets within the full training data to combining the goal tree's prediction results with the description data into the second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain the random forest.
The random forest comprises all tree sets obtained after performing the delete operation during one or more loop iterations.
In conclusion the disclosure can determine n group training dataset in the first training data, the first training data packet Include the corresponding description data of similar event of event to be predicted and the prediction result of the similar event;Data are described by this The n tree trained by the n group training dataset is judged, to obtain the corresponding n prediction result of this n tree;According to The accuracy and preset threshold of the n prediction result execute delete operation to this n tree, to obtain tree set, the tree set packet It is set containing m, wherein m is less than or equal to n;First is carried out to this m tree according to the corresponding ballot weight of each tree in this m tree Ballot operation, to obtain goal tree;It is the second training data that the corresponding prediction result of the goal tree, which is described Data Synthesis with this,; Using second training data as first training data, circulation is executed determines the training of n group from above-mentioned in full dose training data The corresponding prediction result of the goal tree is described the step of Data Synthesis is the second training data with this to above-mentioned by data set, until The accuracy of the n prediction result is both greater than or equal to the preset threshold, and to obtain random forest, which is included in one All tree set got after the delete operation are executed in a or multiple circulation implementation procedures.It can be to the more of random forest Persistently whole training data is optimized in secondary training process, the tree for having single features in avoiding random forest is excessive, While guaranteeing the generalization of random forest classification prediction, the accuracy of classification prediction is improved.
Fig. 2 is a flowchart of another random forest training method according to the embodiment shown in Fig. 1. As shown in Fig. 2, after step 106 the method may further comprise:
Step 107: take the description data corresponding to the event to be predicted as input to the random forest, to obtain multiple prediction results output by the trees in the random forest.
Step 108: determine, by a second voting operation, the most frequently occurring prediction result among the multiple prediction results, as the prediction result of the event to be predicted.
Illustratively, after the random forest is obtained, the description data of an actual event to be predicted can be fed through it, with every decision tree in the random forest outputting one prediction result. Among these multiple prediction results, the one occurring most frequently is selected by the second voting operation of the random forest (an unweighted vote) as the final prediction result of the event to be predicted.
Continuing the fruit classification example: suppose the random forest contains 30 trees, and the description data corresponding to the event to be predicted are green skin, green flesh, spherical, and sweet. From these description data the random forest outputs 30 prediction results, of which 25 are grape, 3 are green apple, and 2 are kiwi. The prediction with the highest vote share (the most occurrences), grape, is then taken as the final prediction result.
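A sketch of the second voting operation of steps 107 and 108 (reusing a trained forest list from the earlier sketches; the feature encoding is again an assumption):

```python
from collections import Counter

def predict_event(forest, description_data):
    """Steps 107-108: run the description data through every tree and take
    the most frequent prediction (an unweighted second voting operation)."""
    votes = [tree.predict([description_data])[0] for tree in forest]
    return Counter(votes).most_common(1)[0][0]
```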
Fig. 3 is a flowchart of a random forest evaluation method according to the embodiment shown in Fig. 2. As shown in Fig. 3, step 102 above may comprise:
Step 1021: train the n trees on the n training data subsets.
Illustratively, this step may be called the pre-training step of the random forest. The initial random forest obtained after pre-training still has certain shortcomings in classification precision; therefore, in the following steps, drawing on the idea of the AdaBoost method, each tree in the random forest is trained repeatedly, and the first training data (i.e., the full training data) is continuously optimized during training, so as to appropriately reinforce the effect of the key training data and improve the precision of the random forest.
Step 1022: take the description data as input to each of the n trees, to obtain the n prediction results output by the n trees.
Fig. 4 is a flowchart of a goal tree obtaining method according to the embodiment shown in Fig. 2. As shown in Fig. 4, step 104 above may comprise:
Step 1041: determine the error rate of each tree according to the accuracy of its prediction result.
Illustratively, the error rate of a decision tree is 1 minus its accuracy.
Step 1042: take the error rate of each tree as input to a preset voting weight calculation formula, to obtain the voting weight of each tree output by the formula.
Illustratively, the voting weight calculation formula is given as equation (1) of the original publication (rendered there as an image and not reproduced in this text), where W denotes the voting weight and e_i denotes the error rate of the i-th tree among the m trees.
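Since the formula itself is not recoverable from this text and the description invokes the AdaBoost idea, the sketch below substitutes the standard AdaBoost classifier weight purely as a labeled assumption, not as the patented equation (1):

```python
import math

def voting_weight(error_rate):
    """Assumed AdaBoost-style weight W = 0.5 * ln((1 - e_i) / e_i), standing in
    for the patent's equation (1); lower error rates give larger weights."""
    e = min(max(error_rate, 1e-9), 1 - 1e-9)  # guard against log(0) and division by zero
    return 0.5 * math.log((1 - e) / e)
```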
Step 1043: divide the m trees into multiple voting groups.
Each voting group comprises the trees that share an identical prediction result, and the number of those trees is the vote count of the group.
Step 1044: obtain the product of the vote count and the voting weight of any one tree in each voting group, as the vote share of that group.
Step 1045: select any one tree from the voting group with the highest vote share, as the goal tree.
Illustratively, in the weighted first voting operation, the vote share of each voting group is the product of the voting weight and the vote count actually obtained. It will also be understood that the multiple prediction results within one voting group are identical, and these prediction results correspond to multiple identical trees; therefore, any one tree may be selected from the voting group with the highest vote share as the goal tree.
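Steps 1041 to 1045 combine into the goal-tree selection sketched below (same assumptions as the earlier sketches; voting_weight is the assumed formula above, and voting groups are keyed on each tree's full prediction column):

```python
from collections import defaultdict

def select_goal_tree(tree_set, accuracies, X_full):
    """Step 104: group trees by identical prediction results, score each group
    by vote count x voting weight, and return any one tree of the top group."""
    groups = defaultdict(list)
    for tree, acc in zip(tree_set, accuracies):
        key = tuple(tree.predict(X_full))  # trees with identical predictions share a group
        groups[key].append((tree, acc))
    best_share, goal_tree = float("-inf"), None
    for members in groups.values():
        votes = len(members)                      # step 1043: group size = vote count
        _, acc = members[0]                       # any one tree of the group
        share = votes * voting_weight(1.0 - acc)  # steps 1041, 1042 and 1044
        if share > best_share:
            best_share, goal_tree = share, members[0][0]
    return goal_tree                              # step 1045
```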
In conclusion the disclosure can determine n group training dataset in the first training data, the first training data packet Include the corresponding description data of similar event of event to be predicted and the prediction result of the similar event;Data are described by this The n tree trained by the n group training dataset is judged, to obtain the corresponding n prediction result of this n tree;According to The accuracy and preset threshold of the n prediction result execute delete operation to this n tree, to obtain tree set, the tree set packet It is set containing m, wherein m is less than or equal to n;First is carried out to this m tree according to the corresponding ballot weight of each tree in this m tree Ballot operation, to obtain goal tree;It is the second training data that the corresponding prediction result of the goal tree, which is described Data Synthesis with this,; Using second training data as first training data, circulation is executed determines the training of n group from above-mentioned in full dose training data The corresponding prediction result of the goal tree is described the step of Data Synthesis is the second training data with this to above-mentioned by data set, until The accuracy of the n prediction result is both greater than or equal to the preset threshold, and to obtain random forest, which is included in one All tree set got after the delete operation are executed in a or multiple circulation implementation procedures.It can be to the more of random forest Persistently whole training data is optimized in secondary training process, the tree for having single features in avoiding random forest is excessive, While guaranteeing the generalization of random forest classification prediction, the accuracy of classification prediction is improved.
Fig. 5 is a block diagram of a random forest training apparatus according to an exemplary embodiment. As shown in Fig. 5, the apparatus 500 comprises:
a data set determining module 510, configured to determine n training data subsets within first training data, the first training data comprising description data corresponding to events of the same class as an event to be predicted, together with the prediction results of those events;
a random forest evaluation module 520, configured to evaluate, using the description data, the n trees trained on the n training data subsets, to obtain the n prediction results corresponding to the n trees;
a random forest deletion module 530, configured to perform a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set, the tree set comprising m trees, where m is less than or equal to n;
a goal tree obtaining module 540, configured to perform a first voting operation on the m trees according to the voting weight of each of the m trees, to obtain a goal tree;
a data combining module 550, configured to combine the prediction results corresponding to the goal tree with the description data into second training data; and
a loop execution module 560, configured to take the second training data as the first training data and loop through the steps from determining the n training data subsets within the full training data to combining the goal tree's prediction results with the description data into the second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain a random forest, the random forest comprising all tree sets obtained after performing the delete operation during one or more loop iterations.
Fig. 6 is a block diagram of another random forest training apparatus according to the embodiment shown in Fig. 5. As shown in Fig. 6, the apparatus 500 further comprises:
a data input module 570, configured to take the description data corresponding to the event to be predicted as input to the random forest, to obtain multiple prediction results output by the trees in the random forest;
a result determining module 580, configured to determine, by a second voting operation, the most frequently occurring prediction result among the multiple prediction results, as the prediction result of the event to be predicted.
Fig. 7 is a block diagram of a random forest evaluation module according to the embodiment shown in Fig. 6. As shown in Fig. 7, the random forest evaluation module 520 comprises:
a random forest training submodule 521, configured to train the n trees on the n training data subsets;
a random forest evaluation submodule 522, configured to take the description data as input to each of the n trees, to obtain the n prediction results output by the n trees.
Optionally, the random forest deletion module 530 is configured to:
when there are u prediction results among the n prediction results whose accuracy is below the preset threshold, delete the u trees corresponding to those u prediction results, to obtain a tree set comprising m trees, where m = n − u; or,
when the accuracy of all n prediction results is greater than or equal to the preset threshold, obtain a tree set comprising m trees, where m = n.
Fig. 8 is a block diagram of a goal tree obtaining module according to the embodiment shown in Fig. 6. As shown in Fig. 8, the goal tree obtaining module 540 comprises:
an error rate determining submodule 541, configured to determine the error rate of each tree according to the accuracy of its prediction result;
a weight calculation submodule 542, configured to take the error rate of each tree as input to a preset voting weight calculation formula, to obtain the voting weight of each tree output by the formula;
a voting group dividing submodule 543, configured to divide the m trees into multiple voting groups, wherein each voting group comprises the trees that share an identical prediction result, the number of those trees being the vote count of the group;
a vote share obtaining submodule 544, configured to obtain the product of the vote count and the voting weight of any one tree in each voting group, as the vote share of that group;
a goal tree obtaining submodule 545, configured to select any one tree from the voting group with the highest vote share, as the goal tree.
In conclusion the disclosure can determine n group training dataset in the first training data, the first training data packet Include the corresponding description data of similar event of event to be predicted and the prediction result of the similar event;Data are described by this The n tree trained by the n group training dataset is judged, to obtain the corresponding n prediction result of this n tree;According to The accuracy and preset threshold of the n prediction result execute delete operation to this n tree, to obtain tree set, the tree set packet It is set containing m, wherein m is less than or equal to n;First is carried out to this m tree according to the corresponding ballot weight of each tree in this m tree Ballot operation, to obtain goal tree;It is the second training data that the corresponding prediction result of the goal tree, which is described Data Synthesis with this,; Using second training data as first training data, circulation is executed determines the training of n group from above-mentioned in full dose training data The corresponding prediction result of the goal tree is described the step of Data Synthesis is the second training data with this to above-mentioned by data set, until The accuracy of the n prediction result is both greater than or equal to the preset threshold, and to obtain random forest, which is included in one All tree set got after the delete operation are executed in a or multiple circulation implementation procedures.It can be to the more of random forest Persistently whole training data is optimized in secondary training process, the tree for having single features in avoiding random forest is excessive, While guaranteeing the generalization of random forest classification prediction, the accuracy of classification prediction is improved.
With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method and will not be elaborated here.
Fig. 9 is a block diagram of an electronic device 900 according to an exemplary embodiment. As shown in Fig. 9, the electronic device 900 may comprise a processor 901, a memory 902, a multimedia component 903, an input/output (I/O) interface 904, and a communication component 905.
The processor 901 is configured to control the overall operation of the electronic device 900 to perform all or part of the steps of the random forest training method described above. The memory 902 is configured to store various types of data to support operation on the electronic device 900; such data may include, for example, instructions for any application or method operating on the electronic device 900, as well as application-related data such as contact data, sent and received messages, pictures, audio, and video. The memory 902 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia component 903 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is configured to output and/or input audio signals. For example, the audio component may include a microphone configured to receive external audio signals; the received audio signals may be further stored in the memory 902 or sent via the communication component 905. The audio component further includes at least one loudspeaker configured to output audio signals. The I/O interface 904 provides an interface between the processor 901 and other interface modules, such as a keyboard, a mouse, or buttons, which may be virtual buttons or physical buttons. The communication component 905 is configured for wired or wireless communication between the electronic device 900 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; accordingly, the communication component 905 may include a Wi-Fi module, a Bluetooth module, or an NFC module.
In an exemplary embodiment, the electronic device 900 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, for performing the random forest training method described above.
In another exemplary embodiment, there is also provided a computer-readable storage medium including program instructions, for example the memory 902 including program instructions, which are executable by the processor 901 of the electronic device 900 to perform the random forest training method described above.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings; however, the present disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of the present disclosure, other embodiments will be readily apparent to those skilled in the art upon consideration of the specification and practice of the disclosure, and such embodiments fall within the protection scope of the present disclosure.
It should further be noted that the specific technical features described in the above specific embodiments may, where not contradictory, be combined in any suitable manner. Any combination among the various different embodiments of the present disclosure may likewise be made and should equally be regarded as content disclosed by the present disclosure, provided it does not depart from the idea of the present disclosure. The present disclosure is not limited to the precise structures described above, and the scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A random forest training method, characterized in that the method comprises:
determining n training data subsets within first training data, the first training data comprising description data corresponding to events of the same class as an event to be predicted, together with the prediction results of those events;
evaluating, using the description data, the n trees trained on the n training data subsets, to obtain the n prediction results corresponding to the n trees;
performing a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set, the tree set comprising m trees, where m is less than or equal to n;
performing a first voting operation on the m trees according to the voting weight of each of the m trees, to obtain a goal tree;
combining the prediction results corresponding to the goal tree with the description data into second training data; and
taking the second training data as the first training data, and looping through the steps from determining the n training data subsets within the full training data to combining the goal tree's prediction results with the description data into the second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain a random forest, the random forest comprising all tree sets obtained after performing the delete operation during one or more loop iterations.
2. The method according to claim 1, characterized in that the method further comprises:
taking the description data corresponding to the event to be predicted as input to the random forest, to obtain multiple prediction results output by the trees in the random forest;
determining, by a second voting operation, the most frequently occurring prediction result among the multiple prediction results, as the prediction result of the event to be predicted.
3. The method according to claim 1, characterized in that evaluating, using the description data, the n trees trained on the n training data subsets, to obtain the n prediction results corresponding to the n trees, comprises:
training the n trees on the n training data subsets;
taking the description data as input to each of the n trees, to obtain the n prediction results output by the n trees.
4. The method according to claim 1, characterized in that performing the delete operation on the n trees according to the accuracy of the n prediction results and the preset threshold, to obtain the tree set, comprises:
when there are u prediction results among the n prediction results whose accuracy is below the preset threshold, deleting the u trees corresponding to those u prediction results, to obtain a tree set comprising m trees, where m = n − u; or,
when the accuracy of all n prediction results is greater than or equal to the preset threshold, obtaining a tree set comprising m trees, where m = n.
5. The method according to claim 1, characterized in that performing the first voting operation on the m trees according to the voting weight of each of the m trees, to obtain the goal tree, comprises:
determining the error rate of each tree according to the accuracy of its prediction result;
taking the error rate of each tree as input to a preset voting weight calculation formula, to obtain the voting weight of each tree output by the formula;
dividing the m trees into multiple voting groups, wherein each voting group comprises the trees that share an identical prediction result, the number of those trees being the vote count of the group;
obtaining the product of the vote count and the voting weight of any one tree in each voting group, as the vote share of that group;
selecting any one tree from the voting group with the highest vote share, as the goal tree.
6. A random forest training apparatus, characterized in that the apparatus comprises:
a data set determining module, configured to determine n training data subsets within first training data, the first training data comprising description data corresponding to events of the same class as an event to be predicted, together with the prediction results of those events;
a random forest evaluation module, configured to evaluate, using the description data, the n trees trained on the n training data subsets, to obtain the n prediction results corresponding to the n trees;
a random forest deletion module, configured to perform a delete operation on the n trees according to the accuracy of the n prediction results and a preset threshold, to obtain a tree set, the tree set comprising m trees, where m is less than or equal to n;
a goal tree obtaining module, configured to perform a first voting operation on the m trees according to the voting weight of each of the m trees, to obtain a goal tree;
a data combining module, configured to combine the prediction results corresponding to the goal tree with the description data into second training data; and
a loop execution module, configured to take the second training data as the first training data and loop through the steps from determining the n training data subsets within the full training data to combining the goal tree's prediction results with the description data into the second training data, until the accuracy of all n prediction results is greater than or equal to the preset threshold, to obtain a random forest, the random forest comprising all tree sets obtained after performing the delete operation during one or more loop iterations.
7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
a data input module, configured to take the description data corresponding to the event to be predicted as input to the random forest, to obtain multiple prediction results output by the trees in the random forest;
a result determining module, configured to determine, by a second voting operation, the most frequently occurring prediction result among the multiple prediction results, as the prediction result of the event to be predicted.
8. The apparatus according to claim 6, characterized in that the goal tree obtaining module comprises:
an error rate determining submodule, configured to determine the error rate of each tree according to the accuracy of its prediction result;
a weight calculation submodule, configured to take the error rate of each tree as input to a preset voting weight calculation formula, to obtain the voting weight of each tree output by the formula;
a voting group dividing submodule, configured to divide the m trees into multiple voting groups, wherein each voting group comprises the trees that share an identical prediction result, the number of those trees being the vote count of the group;
a vote share obtaining submodule, configured to obtain the product of the vote count and the voting weight of any one tree in each voting group, as the vote share of that group;
a goal tree obtaining submodule, configured to select any one tree from the voting group with the highest vote share, as the goal tree.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
10. An electronic device, characterized by comprising:
a memory having a computer program stored thereon; and
a processor, configured to execute the computer program in the memory, to implement the steps of the method according to any one of claims 1 to 5.
CN201811557766.4A 2018-12-19 2018-12-19 Training method and device for random forest, storage medium and electronic equipment Active CN109726826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811557766.4A CN109726826B (en) 2018-12-19 2018-12-19 Training method and device for random forest, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811557766.4A CN109726826B (en) 2018-12-19 2018-12-19 Training method and device for random forest, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN109726826A 2019-05-07
CN109726826B 2021-08-13

Family

ID=66296251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811557766.4A Active CN109726826B (en) 2018-12-19 2018-12-19 Training method and device for random forest, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN109726826B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264342A * 2019-06-19 2019-09-20 WeBank Co., Ltd. (深圳前海微众银行股份有限公司) Business audit method and device based on machine learning
CN111160647A * 2019-12-30 2020-05-15 4Paradigm (Beijing) Technology Co., Ltd. Money laundering behavior prediction method and device
CN113516173A * 2021-05-27 2021-10-19 Jiangxi Isuzu Motor Co., Ltd. Evaluation method for static and dynamic interference of a whole vehicle based on random forest and decision tree

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391970A * 2014-12-04 2015-03-04 Shenzhen Institutes of Advanced Technology Attribute subspace weighted random forest data processing method
WO2015066564A1 * 2013-10-31 2015-05-07 Cancer Prevention And Cure, Ltd. Methods of identification and diagnosis of lung diseases using classification systems and kits thereof
CN105844300A * 2016-03-24 2016-08-10 Henan Normal University Optimized classification method and optimized classification device based on random forest algorithm
US9519868B2 * 2012-06-21 2016-12-13 Microsoft Technology Licensing, Llc Semi-supervised random decision forests for machine learning using mahalanobis distance to identify geodesic paths
CN108846338A * 2018-05-29 2018-11-20 Nanjing Forestry University Polarization characteristic selection and classification method based on object-oriented random forest

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9519868B2 * 2012-06-21 2016-12-13 Microsoft Technology Licensing, Llc Semi-supervised random decision forests for machine learning using mahalanobis distance to identify geodesic paths
WO2015066564A1 * 2013-10-31 2015-05-07 Cancer Prevention And Cure, Ltd. Methods of identification and diagnosis of lung diseases using classification systems and kits thereof
CN104391970A * 2014-12-04 2015-03-04 Shenzhen Institutes of Advanced Technology Attribute subspace weighted random forest data processing method
CN105844300A * 2016-03-24 2016-08-10 Henan Normal University Optimized classification method and optimized classification device based on random forest algorithm
CN108846338A * 2018-05-29 2018-11-20 Nanjing Forestry University Polarization characteristic selection and classification method based on object-oriented random forest

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEISHI MAN et al.: "Image classification based on improved random forest algorithm", 2018 IEEE 3rd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA) *
FENG Kaiping et al.: "Expression recognition method based on weighted KNN and random forest", Software Guide (软件导刊) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264342A * 2019-06-19 2019-09-20 WeBank Co., Ltd. (深圳前海微众银行股份有限公司) Business audit method and device based on machine learning
CN111160647A * 2019-12-30 2020-05-15 4Paradigm (Beijing) Technology Co., Ltd. Money laundering behavior prediction method and device
CN111160647B * 2019-12-30 2023-08-22 4Paradigm (Beijing) Technology Co., Ltd. Money laundering behavior prediction method and device
CN113516173A * 2021-05-27 2021-10-19 Jiangxi Isuzu Motor Co., Ltd. Evaluation method for static and dynamic interference of a whole vehicle based on random forest and decision tree

Also Published As

Publication number Publication date
CN109726826B (en) 2021-08-13

Similar Documents

Publication Publication Date Title
CN112101190B (en) Remote sensing image classification method, storage medium and computing device
Fuentes et al. High-performance deep neural network-based tomato plant diseases and pests diagnosis system with refinement filter bank
CN105701120B (en) The method and apparatus for determining semantic matching degree
US20180365525A1 (en) Multi-sampling model training method and device
CN108304921A (en) The training method and image processing method of convolutional neural networks, device
CN107844784A (en) Face identification method, device, computer equipment and readable storage medium storing program for executing
WO2020155300A1 (en) Model prediction method and device
CN107230108A (en) The processing method and processing device of business datum
CN109948680B (en) Classification method and system for medical record data
CN111461168A (en) Training sample expansion method and device, electronic equipment and storage medium
CN109726826A (en) Training method, device, storage medium and the electronic equipment of random forest
CN107909141A (en) A kind of data analysing method and device based on grey wolf optimization algorithm
CN112232944B (en) Method and device for creating scoring card and electronic equipment
CN106803039A (en) The homologous decision method and device of a kind of malicious file
Nimmagadda et al. Cricket score and winning prediction using data mining
CN111178656A (en) Credit model training method, credit scoring device and electronic equipment
CN109670623A (en) Neural net prediction method and device
CN109670567A (en) Neural net prediction method and device
CN109829471A (en) Training method, device, storage medium and the electronic equipment of random forest
CN110413682A (en) A kind of the classification methods of exhibiting and system of data
CN107135402A (en) A kind of method and device for recognizing TV station's icon
CN115762530A (en) Voiceprint model training method and device, computer equipment and storage medium
CN114742221A (en) Deep neural network model pruning method, system, equipment and medium
CN107403199A (en) Data processing method and device
Dang-Nhu Evaluating disentanglement of structured representations

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant