CN102402713B - Machine learning method and device - Google Patents

Machine learning method and device

Info

Publication number
CN102402713B
CN102402713B (application CN201010280239.0A)
Authority
CN
China
Prior art keywords
classifier
seed
example set
set
utilize
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010280239.0A
Other languages
Chinese (zh)
Other versions
CN102402713A (en)
Inventor
杨宇航
于浩
孟遥
陆应亮
夏迎炬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201010280239.0A
Publication of CN102402713A
Application granted
Publication of CN102402713B
Expired - Fee Related
Anticipated expiration

Links

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a machine learning method and device. The machine learning method comprises: automatically labeling an unlabeled data set with different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n >= 2; training n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively; for each seed set Si, i = 1, 2, ..., n, among the n automatically labeled seed sets, verifying the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and retraining the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.

Description

Machine learning method and device
Technical field
The present invention relates to the field of machine learning and, more specifically, to a fault-tolerant machine learning method and device.
Background technology
Machine learning studies how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning methods and devices are widely used in tasks in various fields, such as computer vision, natural language processing, and bioinformatics.
Machine learning can be divided into two large classes: supervised learning and unsupervised learning. Generally, an unsupervised learning method trains a classifier with an unlabeled data set. Fig. 1 shows a schematic flowchart of an unsupervised machine learning method in the prior art. In step S110, the unlabeled data set is randomly labeled to obtain a training set. In step S120, a classifier is trained with the training set. In step S130, the example set to be processed is predicted with the trained classifier. An unsupervised learning method requires no large investment of manpower to label the data set, but because the data set is unlabeled, its performance may be far from ideal.
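For concreteness, this baseline can be sketched in a few lines of Python; the use of scikit-learn and of a random-forest classifier are assumptions of this illustration, not details of the prior art being described:

    import random
    from sklearn.ensemble import RandomForestClassifier

    def unsupervised_baseline(unlabeled_X, n_classes=2):
        # S110: randomly label the unlabeled data set to obtain a training set
        y = [random.randrange(n_classes) for _ in unlabeled_X]
        # S120: train a classifier with the randomly labeled training set
        return RandomForestClassifier().fit(unlabeled_X, y)

    # S130: predict the example set to be processed with the trained classifier,
    # e.g. predictions = unsupervised_baseline(train_X).predict(pending_X)
    # (train_X and pending_X are hypothetical variables of this sketch)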
Fig. 2 shows a schematic flowchart of a supervised machine learning method in the prior art. In step S210, a classifier is trained with a manually labeled training set. In step S220, the example set to be processed is predicted with the trained classifier. A supervised learning method uses a large amount of manually proofread data and can therefore achieve good performance. However, such a method is difficult to port to resource-constrained fields or applications.
Machine learning methods therefore often face a dilemma: unsupervised methods may not perform well, while supervised methods consume considerable manpower and resources to prepare a corpus.
To overcome this dilemma, semi-supervised learning methods have appeared. Fig. 3 shows a schematic flowchart of a semi-supervised machine learning method in the prior art. Compared with the unsupervised learning method of Fig. 1, when training the classifier in Fig. 3, a manually labeled training set is used in addition to the training set obtained by randomly labeling the unlabeled data set. Fig. 4 shows a schematic flowchart of another semi-supervised machine learning method in the prior art. In the method of Fig. 4, a seed set is manually labeled and obtained in step S410, and a classifier is trained with this seed set in step S420. In addition, to improve the performance of the classifier, the example set to be processed is predicted with the classifier in step S430; the examples with the highest confidence in the prediction result are added to the seed set in step S440; and the classifier is retrained with the enlarged seed set in step S450. Steps S430 to S450 are repeated until a prescribed repetition end condition is met.
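A minimal sketch of such a self-training loop (steps S420 to S450) follows; the classifier type, the confidence measure, and the fixed number of rounds used as the end condition are assumptions of this illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(seed_X, seed_y, unlabeled_X, m=5, rounds=10):
        """Fig. 4 style bootstrapping: train, predict, absorb confident examples."""
        pool = list(unlabeled_X)
        clf = LogisticRegression()
        for _ in range(rounds):                        # prescribed end condition
            clf.fit(np.array(seed_X), np.array(seed_y))    # S420/S450: (re)train
            if not pool:
                break
            proba = clf.predict_proba(np.array(pool))      # S430: predict examples
            conf = proba.max(axis=1)
            best = list(conf.argsort()[-m:])               # S440: m most confident
            for j in sorted(best, reverse=True):
                seed_X.append(pool[j])                     # grow the seed set
                seed_y.append(int(proba[j].argmax()))
                pool.pop(j)
        return clf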
A semi-supervised method can use both labeled and unlabeled corpora, but it still depends heavily on the scale and quality of the labeled corpus. Seeking a balance between the degree of manual participation and performance remains a major challenge facing the field of machine learning.
Summary of the invention
A brief summary of the present invention is given below in order to provide a basic understanding of some aspects of the present invention. It should be understood that this summary is not an exhaustive overview of the present invention. It is not intended to identify key or critical parts of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description discussed later.
In view of the above situation of the prior art, the present invention aims to provide an efficient, fault-tolerant machine learning method and device.
According to one aspect of the present invention, a machine learning method comprises: automatically labeling an unlabeled data set with different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n >= 2; training n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively; for each seed set Si, i = 1, 2, ..., n, among the n automatically labeled seed sets, verifying the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and retraining the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
According to another aspect of the present invention, a machine learning device comprises: an initialization unit configured to automatically label an unlabeled data set with different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n >= 2, to train n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively, and, for each seed set Si, i = 1, 2, ..., n, among the n automatically labeled seed sets, to verify the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and an optimization and processing unit configured to retrain the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
In the above method and device, the unlabeled data set is automatically labeled by different methods without manual participation, which improves learning efficiency. In addition, by cross-validating the seed sets with the classifiers and retraining the corresponding classifiers with the cross-validated seed sets, the noise introduced by automatic labeling is effectively controlled and fault-tolerant learning is achieved.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention in conjunction with the accompanying drawings.
Brief description of the drawings
Embodiments of the present invention are described below with reference to the accompanying drawings, from which the above and other objects, features and advantages of the present invention can be understood more easily. The components in the drawings are merely intended to illustrate the principles of the present invention. In the drawings, the same or similar technical features or components are denoted by the same or similar reference numerals.
Fig. 1 shows a schematic flowchart of an unsupervised machine learning method in the prior art.
Fig. 2 shows a schematic flowchart of a supervised machine learning method in the prior art.
Fig. 3 shows a schematic flowchart of a semi-supervised machine learning method in the prior art.
Fig. 4 shows a schematic flowchart of another semi-supervised machine learning method in the prior art.
Fig. 5 shows a schematic flowchart of a machine learning method according to an embodiment of the present invention.
Fig. 6 shows a schematic flowchart of a machine learning method using two classifiers according to an embodiment of the present invention.
Fig. 7 shows a schematic flowchart of a machine learning method using three classifiers according to an embodiment of the present invention.
Fig. 8 shows a schematic block diagram of a machine learning device according to an embodiment of the present invention.
Fig. 9 shows a schematic block diagram of a computer that can be used to implement the method and device according to embodiments of the present invention.
Embodiments
Embodiments of the present invention are described below with reference to the accompanying drawings. Elements and features described in one drawing or embodiment of the present invention may be combined with elements and features shown in one or more other drawings or embodiments. It should be noted that, for the sake of clarity, representations and descriptions of components and processes that are unrelated to the present invention or well known to those of ordinary skill in the art are omitted from the drawings and the description.
In view of the prior-art challenge of seeking a balance between the degree of manual participation and performance, the present inventors propose a fault-tolerant learning (Fault-Tolerant Learning) method to overcome this problem.
The concept of fault tolerance was first proposed in computer architecture. It refers to a technology by which, when data or files in a system are corrupted or lost for various reasons, the system automatically restores the corrupted or lost files and data to their state before the accident, so that the system keeps running normally and continuously.
In the fault-tolerant learning method and device according to embodiments of the present invention, learning is performed with an automatically labeled corpus rather than with a manually labeled corpus or prior knowledge; the method is fully automatic and is therefore easily applied to any specific field or task. In addition, the method and device achieve fault tolerance by training different classifiers that are respectively used for verification and further prediction, which guarantees an improvement in performance.
The machine learning method and device according to embodiments of the present invention are described below in conjunction with Figs. 5-8.
Fig. 5 shows a schematic flowchart of the machine learning method according to an embodiment of the present invention. As shown in the figure, in step S510, an unlabeled data set is automatically labeled with different methods to obtain multiple different seed sets. Various automatic methods can be used to label the data set, and those skilled in the art can select suitable ones according to the application scenario. For example, in a terminology extraction scenario, the data set may be labeled with the TF-IDF based terminology extraction method proposed by G. Salton and M. J. McGill in Introduction to Modern Information Retrieval, McGraw-Hill, 1983, or with the indicator-word based terminology extraction method proposed by Yuhang Yang, Qin Lu and Tiejun Zhao in Chinese Term Extraction Using Minimal Resources, Proceedings of the 22nd International Conference on Computational Linguistics, pages 1033-1040, 2008; the resulting seed set comprises the terms and non-terms judged by the automatic method.
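One possible realization of the TF-IDF seed-labeling step is sketched below; the scoring rule (peak TF-IDF per candidate) and the symmetric top-k/bottom-k split are assumptions of this sketch, not details fixed by the method:

    from sklearn.feature_extraction.text import TfidfVectorizer

    def tfidf_seed_set(corpus, k=50):
        """Label the top-k candidates as terms and the bottom-k as non-terms."""
        vec = TfidfVectorizer()
        tfidf = vec.fit_transform(corpus)                # documents x candidates
        scores = tfidf.max(axis=0).toarray().ravel()     # peak TF-IDF per candidate
        order = scores.argsort()
        vocab = vec.get_feature_names_out()
        seeds  = [(vocab[i], 1) for i in order[-k:]]     # automatically labeled terms
        seeds += [(vocab[i], 0) for i in order[:k]]      # automatically labeled non-terms
        return seeds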
Then, in step S520, multiple different classifiers are trained with the automatically labeled seed sets, one classifier per seed set.
Then, in step S530, the trained classifiers are used to cross-validate the different seed sets to obtain verified seed sets. That is, for a given seed set, some or all of the classifiers trained on the other seed sets are used to verify it.
In step S540, the corresponding classifiers are retrained with the multiple verified seed sets. That is, each updated seed set is used to retrain the classifier that was trained on it.
Next, the example set to be processed can be processed with the retrained classifiers. This can be done with reference to prior-art methods and is not shown here.
Preferably, to improve performance further, cross-validation can also be introduced into the processing of the example set. Specifically, in step S550, the example set to be processed is predicted with the retrained classifiers. In step S560, the predicted example sets are cross-validated with the classifiers: similarly to step S530, a predicted example set can be verified with some or all of the classifiers other than the classifier that predicted it. Then, in step S570, the examples in each verified example set are added to the corresponding seed set, i.e., the seed set on which the classifier that predicted those examples was trained, so that the classifier is retrained with the updated seed set. As an example, a predetermined number of the highest-confidence examples in a verified example set can be added to the seed set. Steps S540 to S570 are repeated until a repetition end condition (hereinafter also called the iteration stop criterion) is met. The end condition can be set as required; as an example, the iteration can be terminated when the total number of seeds in all the seed sets reaches the predetermined number of examples that need to be labeled. The whole loop is sketched below.
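The loop of steps S510 to S570 might look as follows for n classifiers; this is a sketch only, and the classifier type, the agreement test used for verification, and the handling of the example pool are assumptions of the illustration (the seed sets are assumed to keep examples of both classes throughout):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit(seed):                                   # S520/S540: train one classifier
        X, y = zip(*seed)
        return LogisticRegression().fit(np.array(X), np.array(y))

    def verify(labeled, judges):                     # S530/S560: keep consistent items
        return [(x, y) for x, y in labeled
                if all(c.predict(np.array([x]))[0] == y for c in judges)]

    def top_m(clf, pool, m):                         # S550: m most confident predictions
        proba = clf.predict_proba(np.array(pool))
        best = proba.max(axis=1).argsort()[-m:]
        return [(pool[i], int(proba[i].argmax())) for i in best]

    def fault_tolerant_learn(seeds, pool, m=10, max_seeds=200):
        clfs = [fit(S) for S in seeds]                          # S520
        seeds = [verify(S, clfs[:i] + clfs[i+1:])               # S530
                 for i, S in enumerate(seeds)]
        while sum(map(len, seeds)) < max_seeds:                 # repetition end condition
            clfs = [fit(S) for S in seeds]                      # S540
            preds = [top_m(c, pool, m) for c in clfs]           # S550
            preds = [verify(L, clfs[:i] + clfs[i+1:])           # S560
                     for i, L in enumerate(preds)]
            if not any(preds):                                  # nothing survived verification
                break
            for S, L in zip(seeds, preds):                      # S570
                S.extend(L)                                     # (pool shrinking omitted)
        return clfs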
In the above method, learning uses an automatically labeled corpus rather than a manually labeled one. Automatically labeled seeds have a higher labeling accuracy than randomly labeled ones, which makes obtaining the seed sets with automatic methods worthwhile. Moreover, the different classifiers can be trained from multiple relatively independent views (such as different seed sets, different feature sets, etc.), which makes the verification process more effective.
Furthermore, because an automatically labeled corpus is used, noise may be present from the beginning and may grow after each iteration. To control the noise effectively and make the results more reliable, multiple classifiers are trained and respectively used to verify the seed sets, and the classifiers are retrained with the verified seed sets, which alleviates the noise and improves performance. Predicting and verifying the example sets with multiple classifiers, and retraining the classifiers with seed sets enlarged by the verified examples, alleviates the noise and improves performance further.
Fig. 6 shows a schematic flowchart of a machine learning method using two classifiers according to an embodiment of the present invention. In Fig. 6, an unlabeled data set D and an example set U to be labeled are given, and the number of examples that need to be labeled is N.
First, a seed set S1 is automatically generated with one method, and a seed set S2 is automatically generated with another method.
Then, a first classifier C1 is trained with the seed set S1, and a second classifier C2 is trained with the seed set S2.
Then, the classifiers C1 and C2 are used to cross-validate the automatically labeled seed sets S1 and S2. Specifically, the classifier C1 labels the seed set S2, and the classifier C2 labels the seed set S1. Seeds whose automatic labeling result is inconsistent with the classifier's labeling result are deleted from S1 and S2, respectively, yielding the verified seed sets S1 and S2.
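For the two-classifier case, this mutual verification might look as follows; a sketch only, in which C1 and C2 are assumed to be already trained scikit-learn style classifiers and the seeds are (feature vector, label) pairs:

    import numpy as np

    def mutual_verify(S1, S2, C1, C2):
        """Drop seeds whose automatic label disagrees with the peer classifier."""
        S1v = [(x, y) for x, y in S1 if C2.predict(np.array([x]))[0] == y]  # C2 checks S1
        S2v = [(x, y) for x, y in S2 if C1.predict(np.array([x]))[0] == y]  # C1 checks S2
        return S1v, S2v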
As shown in block 610 in Fig. 6, the above steps may be collectively referred to as the initialization process.
To improve performance further, cross-validation can also be carried out in the processing of the example set, as follows.
First, the classifier C1 is retrained with the seed set S1, and the classifier C2 is retrained with the seed set S2.
Then, the classifier C1 predicts the examples in the set U. Specifically, the classifier C1 labels the examples in U, and the m examples with the highest confidence in the labeling result are chosen to form a labeled example set L1, i.e., the predicted example set L1.
Likewise, the classifier C2 predicts the examples in the set U. Specifically, the classifier C2 labels the examples in U, and the m examples with the highest confidence in the labeling result are chosen to form a labeled example set L2, i.e., the predicted example set L2.
Then, the classifiers C1 and C2 are used to cross-validate the predicted example sets L1 and L2. Specifically, C2 relabels the examples in L1, and the examples for which the labeling result of C2 is inconsistent with the prediction result of C1 are deleted, yielding the verified example set L1; C1 relabels the examples in L2, and the examples for which the labeling result of C1 is inconsistent with the prediction result of C2 are deleted, yielding the verified example set L2.
Then, the examples in the set L1 are added to the seed set S1 and the examples in the set L2 are added to the seed set S2, which completes one iteration.
Whether the iteration should stop can be judged at the beginning or at the end of an iteration. The iteration stop criterion can be, for example, |S1 ∪ S2| >= N: when it holds, the iteration terminates; otherwise, the iteration continues.
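The stop criterion itself is a one-line check; in this sketch the seeds are assumed hashable (e.g. tuples) so that the union is well defined:

    def should_stop(S1, S2, N):
        """Iteration stop criterion |S1 ∪ S2| >= N."""
        return len(set(S1) | set(S2)) >= N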
As shown in block 620 in Fig. 6, the above steps may be collectively referred to as the iterative process.
Fig. 7 shows a schematic flowchart of a machine learning method using three classifiers according to an embodiment of the present invention. Compared with Fig. 6, the method of Fig. 7 employs three classifiers, but its individual steps are substantially the same as those of Fig. 6 and are not repeated here.
It is worth noting that Fig. 7 shows the classifiers C2 and C3 verifying the automatically labeled seed set S1, the classifiers C1 and C3 verifying the automatically labeled seed set S2, and the classifiers C1 and C2 verifying the automatically labeled seed set S3. Seeds for which the verification result is inconsistent with the automatic labeling result are deleted from S1, S2 and S3, respectively, to obtain the verified seed sets S1, S2 and S3. However, a seed set may also be verified with only a part of the other classifiers; for example, only the classifier C2 may be applied to verify the seed set S1, only the classifier C3 may be applied to verify the seed set S2, and so on. The possibilities are not enumerated here; a sketch of such partial verification follows.
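Partial verification of this kind can reuse the same agreement test with any chosen subset of the remaining classifiers; the helper below is a sketch, with judges standing for that subset:

    import numpy as np

    def verify_with(seed_set, judges):
        """Verify a seed set with any chosen subset of the other classifiers."""
        return [(x, y) for x, y in seed_set
                if all(c.predict(np.array([x]))[0] == y for c in judges)]

    # e.g. verify S1 with C2 only, instead of with both C2 and C3:
    # S1_verified = verify_with(S1, [C2])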
Likewise, although Fig. 7 shows the classifiers C2 and C3 verifying the predicted example set L1, the classifiers C1 and C3 verifying the predicted example set L2, and the classifiers C1 and C2 verifying the predicted example set L3, an example set may also be verified with only a part of the other classifiers; for example, only the classifier C2 may be applied to verify the example set L1, only the classifier C2 may be applied to verify the example set L3, and so on. The possibilities are not enumerated here.
The above shows machine learning method examples using two classifiers and three classifiers, but this is merely for illustration and is not intended to limit the present invention. Those skilled in the art will understand that the machine learning method according to embodiments of the present invention can be used with any other number of classifiers, which is not repeated here.
Fig. 8 shows a schematic block diagram of a machine learning device according to an embodiment of the present invention. As shown in the figure, the machine learning device 800 comprises an initialization unit 810 and an optimization and processing unit 820. According to an embodiment of the present invention, the initialization unit 810 may be configured to: automatically label an unlabeled data set with different methods to obtain multiple different seed sets; train corresponding multiple classifiers with the multiple automatically labeled seed sets, respectively; and, for each seed set among the multiple automatically labeled seed sets, verify that seed set with some or all of the multiple classifiers other than the classifier trained on that seed set. The optimization and processing unit 820 may be configured to retrain the corresponding multiple classifiers with the multiple verified seed sets, respectively.
According to another embodiment of the present invention, the optimization and processing unit 820 is further configured to: predict an example set with the retrained multiple classifiers, respectively, to obtain corresponding multiple predicted example sets; for each predicted example set, verify that example set with some or all of the multiple classifiers other than the classifier used to predict it; add the examples in each verified example set to the corresponding seed set; and repeat the retraining, the predicting of the example set, the verifying of each example set and the adding of the examples in each verified example set to the corresponding seed set, until a repetition end condition is met.
According to another embodiment of the present invention, the repetition end condition is that the total number of seeds in all the seed sets reaches the predetermined number of examples that need to be labeled.
According to another embodiment of the present invention, the optimization and processing unit 820 is further configured to: label the example set with the multiple classifiers, respectively; and choose, for each of the multiple classifiers, a predetermined number of the examples with the highest confidence in that classifier's labeling result to form the corresponding multiple predicted example sets.
According to another embodiment of the present invention, the optimization and processing unit 820 is further configured to verify a predicted example set by: labeling that example set with some or all of the multiple classifiers other than the classifier used to predict it; and deleting from that example set the examples for which the prediction result is inconsistent with the labeling results of the some or all classifiers.
According to another embodiment of the present invention, the initialization unit 810 is further configured to verify an automatically labeled seed set by: labeling that seed set with some or all of the multiple classifiers other than the classifier trained on it; and deleting from that seed set the seeds for which the automatic labeling result is inconsistent with the labeling results of the some or all classifiers.
For further details of the operation of the machine learning device according to embodiments of the present invention, reference may be made to the embodiments of the method described above, which are not described in detail again here.
In the above method and device, the unlabeled data set is labeled by automatic methods without manual participation, which improves learning efficiency. In addition, by cross-validating the seed sets with the classifiers and retraining the corresponding classifiers with the cross-validated seed sets, the noise introduced by automatic labeling is effectively controlled and fault-tolerant learning is achieved.
The method and device according to embodiments of the present invention impose no restrictions on the actual application scenario, nor on the type of classifier used or the way the classifier is trained.
In addition, in the above device, all modules and units may be configured by software, firmware, hardware or a combination thereof; the specific means or manners available for such configuration are well known to those skilled in the art and are not repeated here. In the case of implementation by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, and the computer, with various programs installed, can perform various functions.
Fig. 9 shows a schematic block diagram of a computer that can be used to implement the method and device according to embodiments of the present invention. In Fig. 9, a central processing unit (CPU) 901 performs various processes according to programs stored in a read-only memory (ROM) 902 or programs loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores, as needed, data required when the CPU 901 performs various processes. The CPU 901, the ROM 902 and the RAM 903 are connected to one another via a bus 904. An input/output interface 905 is also connected to the bus 904.
The following components are connected to the input/output interface 905: an input section 906 (comprising a keyboard, a mouse, etc.), an output section 907 (comprising a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker, etc.), a storage section 908 (comprising a hard disk, etc.), and a communication section 909 (comprising a network interface card such as a LAN card, a modem, etc.). The communication section 909 performs communication processes via a network such as the Internet. A drive 910 may also be connected to the input/output interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory may be mounted on the drive 910 as needed, so that a computer program read therefrom is installed into the storage section 908 as needed.
In the case where the above series of processes is implemented by software, a program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 911.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 911 shown in Fig. 9 in which the program is stored and which is distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 911 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 902, a hard disk contained in the storage section 908, or the like, in which the program is stored and which is distributed to the user together with the apparatus containing it.
The present invention also proposes a program product storing machine-readable instruction codes. When the instruction codes are read and executed by a machine, the above method according to embodiments of the present invention can be performed.
Correspondingly, a storage medium for carrying the above program product storing machine-readable instruction codes is also included in the disclosure of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick, and the like.
In the above description of specific embodiments of the present invention, features described and/or illustrated for one embodiment may be used in one or more other embodiments in the same or a similar manner, combined with features of other embodiments, or substituted for features of other embodiments.
It should be emphasized that the term "comprise/include", when used herein, refers to the presence of a feature, element, step or component, without excluding the presence or addition of one or more other features, elements, steps or components.
Furthermore, the methods of the present invention are not limited to being performed in the chronological order described in the specification; they may also be performed in other chronological orders, in parallel, or independently. Therefore, the order of execution described in this specification does not limit the technical scope of the present invention.
Although the present invention has been disclosed above through the description of specific embodiments, it should be understood that all the above embodiments and examples are illustrative and not restrictive. Those skilled in the art may devise various modifications, improvements or equivalents of the present invention within the spirit and scope of the appended claims, and such modifications, improvements or equivalents should also be considered to fall within the protection scope of the present invention.
Remarks
Remark 1. A machine learning method, comprising:
automatically labeling an unlabeled data set with different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n >= 2;
training n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively;
for each seed set Si, i = 1, 2, ..., n, among the n automatically labeled seed sets, verifying the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
retraining the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
Remark 2. The method according to Remark 1, further comprising:
predicting an example set with the n retrained classifiers, respectively, to obtain n corresponding predicted example sets L1, L2, ..., Ln;
for each predicted example set Li, i = 1, 2, ..., n, verifying the example set Li with some or all of the n classifiers other than the classifier Ci used to predict the example set Li;
adding the examples in each verified example set Li to the corresponding seed set Si; and
repeating the retraining, the predicting of the example set, the verifying of each example set and the adding of the examples in each verified example set to the corresponding seed set, until a repetition end condition is met.
Remark 3. The method according to Remark 2, wherein the repetition end condition is:
the total number of seeds in the seed sets S1, S2, ..., Sn reaches the predetermined number of examples that need to be labeled.
Remark 4. The method according to Remark 2, wherein the predicting of the example set comprises:
labeling the example set with the n classifiers, respectively; and
choosing, for each of the n classifiers, a predetermined number of the examples with the highest confidence in that classifier's labeling result to form the n corresponding predicted example sets L1, L2, ..., Ln.
Remark 5. The method according to Remark 2, wherein the verifying of a predicted example set Li comprises:
labeling the example set Li with some or all of the n classifiers other than the classifier Ci used to predict the example set Li; and
deleting from the example set Li the examples for which the prediction result is inconsistent with the labeling results of the some or all classifiers.
Remark 6. The method according to Remark 1, wherein the verifying of an automatically labeled seed set Si comprises:
labeling the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
deleting from the seed set Si the seeds for which the automatic labeling result is inconsistent with the labeling results of the some or all classifiers.
Remark 7. A machine learning device, comprising:
an initialization unit configured to:
automatically label an unlabeled data set with different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n >= 2;
train n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively; and
for each seed set Si, i = 1, 2, ..., n, among the n automatically labeled seed sets, verify the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
an optimization and processing unit configured to:
retrain the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
Remark 8. The device according to Remark 7, wherein the optimization and processing unit is further configured to:
predict an example set with the n retrained classifiers, respectively, to obtain n corresponding predicted example sets L1, L2, ..., Ln;
for each predicted example set Li, i = 1, 2, ..., n, verify the example set Li with some or all of the n classifiers other than the classifier Ci used to predict the example set Li;
add the examples in each verified example set Li to the corresponding seed set Si; and
repeat the retraining, the predicting of the example set, the verifying of each example set and the adding of the examples in each verified example set to the corresponding seed set, until a repetition end condition is met.
Remark 9. The device according to Remark 8, wherein the repetition end condition is:
the total number of seeds in the seed sets S1, S2, ..., Sn reaches the predetermined number of examples that need to be labeled.
Remark 10. The device according to Remark 8, wherein the optimization and processing unit is further configured to:
label the example set with the n classifiers, respectively; and
choose, for each of the n classifiers, a predetermined number of the examples with the highest confidence in that classifier's labeling result to form the n corresponding predicted example sets L1, L2, ..., Ln.
Remark 11. The device according to Remark 8, wherein the optimization and processing unit is further configured to verify a predicted example set Li by:
labeling the example set Li with some or all of the n classifiers other than the classifier Ci used to predict the example set Li; and
deleting from the example set Li the examples for which the prediction result is inconsistent with the labeling results of the some or all classifiers.
Remark 12. The device according to Remark 7, wherein the initialization unit is further configured to verify an automatically labeled seed set Si by:
labeling the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
deleting from the seed set Si the seeds for which the automatic labeling result is inconsistent with the labeling results of the some or all classifiers.

Claims (10)

1. A machine learning method, comprising:
automatically labeling an unlabeled data set with different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n >= 2;
training n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively;
for each seed set Si, i = 1, 2, ..., n, among the n automatically labeled seed sets, verifying the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
retraining the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
2. The method according to claim 1, further comprising:
predicting an example set with the n retrained classifiers, respectively, to obtain n corresponding predicted example sets L1, L2, ..., Ln;
for each predicted example set Li, i = 1, 2, ..., n, verifying the example set Li with some or all of the n classifiers other than the classifier Ci used to predict the example set Li;
adding the examples in each verified example set Li to the corresponding seed set Si; and
repeating the retraining, the predicting of the example set, the verifying of each example set and the adding of the examples in each verified example set to the corresponding seed set, until a repetition end condition is met.
3. The method according to claim 2, wherein the repetition end condition is:
the total number of seeds in the seed sets S1, S2, ..., Sn reaches the predetermined number of examples that need to be labeled.
4. The method according to claim 2, wherein the predicting of the example set comprises:
labeling the example set with the n classifiers, respectively; and
choosing, for each of the n classifiers, a predetermined number of the examples with the highest confidence in that classifier's labeling result to form the n corresponding predicted example sets L1, L2, ..., Ln.
5. The method according to claim 2, wherein the verifying of a predicted example set Li comprises:
labeling the example set Li with some or all of the n classifiers other than the classifier Ci used to predict the example set Li; and
deleting from the example set Li the examples for which the prediction result is inconsistent with the labeling results of the some or all classifiers.
6. The method according to claim 1, wherein the verifying of an automatically labeled seed set Si comprises:
labeling the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
deleting from the seed set Si the seeds for which the automatic labeling result is inconsistent with the labeling results of the some or all classifiers.
7. A machine learning device, comprising:
an initialization unit configured to:
automatically label an unlabeled data set with different methods to obtain n different seed sets S1, S2, ..., Sn, where n is a natural number and n >= 2;
train n corresponding classifiers C1, C2, ..., Cn with the n automatically labeled seed sets S1, S2, ..., Sn, respectively; and
for each seed set Si, i = 1, 2, ..., n, among the n automatically labeled seed sets, verify the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
an optimization and processing unit configured to:
retrain the n corresponding classifiers C1, C2, ..., Cn with the n verified seed sets S1, S2, ..., Sn, respectively.
8. The device according to claim 7, wherein the optimization and processing unit is further configured to:
predict an example set with the n retrained classifiers, respectively, to obtain n corresponding predicted example sets L1, L2, ..., Ln;
for each predicted example set Li, i = 1, 2, ..., n, verify the example set Li with some or all of the n classifiers other than the classifier Ci used to predict the example set Li;
add the examples in each verified example set Li to the corresponding seed set Si; and
repeat the retraining, the predicting of the example set, the verifying of each example set and the adding of the examples in each verified example set to the corresponding seed set, until a repetition end condition is met.
9. The device according to claim 8, wherein the optimization and processing unit is further configured to verify a predicted example set Li by:
labeling the example set Li with some or all of the n classifiers other than the classifier Ci used to predict the example set Li; and
deleting from the example set Li the examples for which the prediction result is inconsistent with the labeling results of the some or all classifiers.
10. The device according to claim 7, wherein the initialization unit is further configured to verify an automatically labeled seed set Si by:
labeling the seed set Si with some or all of the n classifiers other than the classifier Ci trained on the seed set Si; and
deleting from the seed set Si the seeds for which the automatic labeling result is inconsistent with the labeling results of the some or all classifiers.
CN201010280239.0A 2010-09-09 2010-09-09 machine learning method and device Expired - Fee Related CN102402713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010280239.0A CN102402713B (en) 2010-09-09 2010-09-09 machine learning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010280239.0A CN102402713B (en) 2010-09-09 2010-09-09 machine learning method and device

Publications (2)

Publication Number Publication Date
CN102402713A CN102402713A (en) 2012-04-04
CN102402713B true CN102402713B (en) 2015-11-25

Family

ID=45884896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010280239.0A Expired - Fee Related CN102402713B (en) 2010-09-09 2010-09-09 machine learning method and device

Country Status (1)

Country Link
CN (1) CN102402713B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202177B * 2016-06-27 2017-12-15 腾讯科技(深圳)有限公司 Text classification method and device
CN108509969B (en) * 2017-09-06 2021-11-09 腾讯科技(深圳)有限公司 Data labeling method and terminal
CN110147551B (en) * 2019-05-14 2023-07-11 腾讯科技(深圳)有限公司 Multi-category entity recognition model training, entity recognition method, server and terminal
CN112000808B (en) * 2020-09-29 2024-04-16 迪爱斯信息技术股份有限公司 Data processing method and device and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5555345A (en) * 1991-03-25 1996-09-10 Atr Interpreting Telephony Research Laboratories Learning method of neural network
CN1851703A * 2006-05-10 2006-10-25 南京大学 Active semi-supervised relevance feedback method for digital image retrieval
CN101520847A (en) * 2008-02-29 2009-09-02 富士通株式会社 Pattern identification device and method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4034602B2 (en) * 2002-06-17 2008-01-16 富士通株式会社 Data classification device, active learning method of data classification device, and active learning program
EP1817693A1 (en) * 2004-09-29 2007-08-15 Panscient Pty Ltd. Machine learning system
JP2009211648A (en) * 2008-03-06 2009-09-17 Kddi Corp Method for reducing support vector

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5555345A (en) * 1991-03-25 1996-09-10 Atr Interpreting Telephony Research Laboratories Learning method of neural network
CN1851703A * 2006-05-10 2006-10-25 南京大学 Active semi-supervised relevance feedback method for digital image retrieval
CN101520847A (en) * 2008-02-29 2009-09-02 富士通株式会社 Pattern identification device and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Classifier fusion method based on meta-learning strategies and its applications; 王浩畅 et al.; Journal on Communications (通信学报); Oct. 31, 2007; vol. 28, no. 10; pp. 7-13 *
Research on machine learning methods based on small-scale annotated corpora; 李庆中 et al.; Computer Applications (计算机应用); Feb. 29, 2004; vol. 24, no. 2; pp. 56-58 *
Automatic full-text semantic annotation method based on unsupervised machine learning; 卢志茂 et al.; Acta Automatica Sinica (自动化学报); Mar. 31, 2006; vol. 32, no. 2; pp. 228-236 *

Also Published As

Publication number Publication date
CN102402713A (en) 2012-04-04

Similar Documents

Publication Publication Date Title
JP5987088B2 (en) System and method for using multiple in-line heuristics to reduce false positives
CN102591909B (en) Systems and methods for providing increased scalability in deduplication storage systems
CN102057358B (en) Systems and methods for tracking changes to a volume
WO2022048363A1 (en) Website classification method and apparatus, computer device, and storage medium
JP2021006889A (en) Method, apparatus and device for optimizing wake-up model, and storage medium
CN102163203B (en) Method and device for downloading web pages
CN102402713B (en) machine learning method and device
CN102436521B (en) Random verification method and system
CN102291369A (en) Control method and corresponding control device for verifying junk information settings
CN105550443A (en) SystemC based unified stainer array TLM model with accurate cycle
CN105264488A (en) Merging of sorted lists using array pair
CN104090995B (en) The automatic generation method of rebar unit grids in a kind of ABAQUS tire models
CN110674397B (en) Method, device, equipment and readable medium for training age point prediction model
CN104537012B (en) Data processing method and device
CN105740786A (en) Identity identification method and device of writer
US8341538B1 (en) Systems and methods for reducing redundancies in quality-assurance reviews of graphical user interfaces
CN107704341A (en) File access pattern method, apparatus and electronic equipment
CN103140839A (en) Systems and methods for efficient sequential logging on caching-enabled storage devices
CN117056612B (en) Lesson preparation data pushing method and system based on AI assistance
CN108132942A (en) A kind of page generation method and terminal
CN112131587B (en) Intelligent contract pseudo-random number security inspection method, system, medium and device
CN104580109A (en) Method and device for generating click verification code
CN103631714A (en) Method for generating minimum combination testing cases based on matrix multiplicity
CN117744760A (en) Text information identification method and device, storage medium and electronic equipment
CN104850638B (en) ETL concurrent process decision-making technique and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151125

Termination date: 20180909