CN103324632A - Concept identification method and device based on collaborative learning - Google Patents

Concept identification method and device based on collaborative learning

Info

Publication number
CN103324632A
Authority
CN
China
Prior art keywords
concept
subset
training data
sequence sorter
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100779064A
Other languages
Chinese (zh)
Other versions
CN103324632B (en)
Inventor
李建强
陈宽桐
刘春辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC China Co Ltd
Original Assignee
NEC China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC China Co Ltd filed Critical NEC China Co Ltd
Priority to CN201210077906.4A priority Critical patent/CN103324632B/en
Priority to JP2012271100A priority patent/JP5523543B2/en
Publication of CN103324632A publication Critical patent/CN103324632A/en
Application granted granted Critical
Publication of CN103324632B publication Critical patent/CN103324632B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a concept identification method and device based on collaborative learning, which improve the quality of concept identification, in particular when sequence classifiers are built from partially labeled training data. The method comprises the following steps: dividing a training dataset into at least two subsets, the training data in the training dataset being text documents with labeled words; carrying out collaborative learning based on the training data contained in the subsets and on a feature word set extracted from the training dataset, thereby building at least two sequence classifiers; applying each resulting sequence classifier to the current text document to identify concepts; and determining the concepts contained in the current text document from the concepts identified by each sequence classifier.

Description

Concept identification method and device based on collaborative learning
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a concept identification method and device based on collaborative learning.
Background art
With the development of information retrieval (IR) technology, semantic information retrieval (Semantic IR) has shown great development potential compared with traditional keyword-based information retrieval (Keywords-Based IR). Concept identification (Concept Detection) and concept disambiguation (Concept Disambiguation) play an important role in semantic information retrieval. Concept identification refers to finding, in a text, the character strings that represent one or more concepts.
As shown in Fig. 1, the prior-art process for identifying concepts in text documents by machine learning is as follows:
A large number of labeled text documents are stored in the text document storage unit 101; feature selection is performed on these labeled text documents and the selected features are stored in the feature storage unit 102;
A sequence classifier is learned from the labeled text documents stored in the text document storage unit 101 and the features stored in the feature storage unit 102, and the resulting sequence classifier is stored in the sequence classifier storage unit 103;
The sequence classifier stored in the sequence classifier storage unit 103 is applied to the test documents stored in the test document storage unit 104, and the identified concepts are saved to the concept storage unit 105.
Existing machine-learning methods mainly use fully labeled training data to build a concept recognizer (Concept Recognizer), i.e. a sequence classifier (Sequence Classifier), and use it to find candidate concepts (Candidate Concepts) in text documents. Fully labeled training data means that, for every text document in the training dataset, all of the concepts it contains are labeled. In many situations, however, the available training data is only partially labeled (only some of the concepts contained are labeled). If the sequence classifier is then built from partially labeled training data in the same way as from fully labeled training data, the quality of the concepts it identifies is limited.
Summary of the invention
The invention provides a concept identification method and device based on collaborative learning, in order to improve the quality of concept identification, in particular when the sequence classifiers are built from partially labeled training data.
The specific technical solutions provided by embodiments of the invention are as follows:
A concept identification method based on collaborative learning comprises:
dividing a training dataset into at least two subsets, the training data in the training dataset being text documents with labeled words;
carrying out collaborative learning based on the training data contained in the subsets and on a feature word set extracted from the training dataset, to build at least two sequence classifiers;
applying each resulting sequence classifier to the current text document to identify concepts, and determining the concepts contained in the current text document from the concepts identified by each sequence classifier.
A concept identification device based on collaborative learning comprises:
a first processing unit, configured to divide a training dataset into at least two subsets, the training data in the training dataset being text documents with labeled words;
a second processing unit, configured to carry out collaborative learning based on the training data contained in the subsets and on a feature word set extracted from the training dataset, to build at least two sequence classifiers;
a third processing unit, configured to apply each resulting sequence classifier to the current text document to identify concepts, and to determine the concepts contained in the current text document from the concepts identified by each sequence classifier.
Based on the above technical solutions, in embodiments of the invention the training dataset is divided into at least two subsets; collaborative learning is carried out based on the labeled training data contained in the subsets and on the feature word set extracted from the training dataset, so that multiple sequence classifiers are built; each of these sequence classifiers is applied to the current text document; and the concepts contained in the current text document are determined from the concepts identified by each sequence classifier. This overcomes the technical problem of existing concept identification methods, namely low identification quality when the sequence classifier is built from a partially labeled training dataset. Concept identification with sequence classifiers built from partially labeled training data can thus achieve an effect comparable to that obtained with fully labeled training data, which improves the quality of concept identification, in particular when the sequence classifiers are built from partially labeled training data.
Description of drawings
Fig. 1 is a schematic diagram of the concept identification process in the prior art;
Fig. 2 is a schematic diagram of the concept identification process based on collaborative learning in an embodiment of the invention;
Fig. 3 is a schematic diagram of the complete concept identification process in an embodiment of the invention;
Fig. 4 is a schematic diagram of another complete concept identification process in an embodiment of the invention;
Fig. 5 is a structural diagram of the concept identification device based on collaborative learning in an embodiment of the invention.
Detailed description of the embodiments
In order to reduce the requirements that concept identification places on training data while improving the quality of concept identification, in particular when sequence classifiers are built from partially labeled training data, embodiments of the invention provide a concept identification method and device based on collaborative learning.
The collaborative learning approach (Co-learning approach) is a variant of the co-training approach (Co-training approach). Co-training trains two classifiers on two relatively independent feature sets. Collaborative learning does not need to split the features into two relatively independent sets; instead, it divides the training dataset into multiple groups so that multiple classifiers are built from multiple angles.
Preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.
As shown in Fig. 2, in an embodiment of the invention, the detailed flow of concept identification based on collaborative learning is as follows:
Step 201: divide the training dataset into at least two subsets, the training data in the training dataset being text documents with labeled words.
Specifically, when the training dataset is divided into at least two subsets, the amount of training data in each resulting subset is greater than a set threshold.
In practice, the training dataset contains a large number of text documents, each of which contains labeled words. When dividing it into subsets, either the number of text documents in each subset or the number of labeled words in each subset must exceed the set threshold, so that every subset has enough training data to train its corresponding sequence classifier. In a preferred implementation, the amount of training data in each subset is also kept below a certain threshold, so that an overly large subset does not make the running time too long and reduce the efficiency of building the sequence classifiers.
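The size constraints of step 201 can be sketched as follows. This is a minimal illustration, not the patent's procedure: the function name `split_into_subsets`, the inclusive bounds `min_size` and `max_size`, and the round-robin assignment are all assumptions made for the example, and subset size is measured in documents rather than labeled words.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_subsets(docs: Sequence[T], min_size: int, max_size: int) -> List[List[T]]:
    """Partition docs into at least two subsets with sizes in [min_size, max_size].

    Hypothetical helper: the bounds stand in for the lower and upper
    thresholds described in the text.
    """
    if not 0 < min_size <= max_size:
        raise ValueError("need 0 < min_size <= max_size")
    total = len(docs)
    # Fewest subsets that respect the upper bound, but never fewer than two.
    n_subsets = max(2, math.ceil(total / max_size))
    # The smallest subset holds total // n_subsets documents.
    if total // n_subsets < min_size:
        raise ValueError("cannot satisfy both size bounds with at least two subsets")
    # Round-robin assignment keeps subset sizes within one document of each other.
    return [list(docs[i::n_subsets]) for i in range(n_subsets)]
```

Round-robin assignment is only one possible choice; a random shuffle before splitting would serve equally well if the document order carries a bias.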
In embodiments of the invention, the text documents in the training dataset may be partially labeled or fully labeled; the method is particularly suitable for partially labeled text documents.
Step 202: carry out collaborative learning based on the training data contained in the subsets and on the feature word set extracted from the training dataset, to build at least two sequence classifiers.
In embodiments of the invention, a sequence classifier may be built for every subset, or only for some of the subsets. Preferably, collaborative learning is carried out based on the training data contained in every subset and on the feature word set extracted from the training dataset, and a corresponding sequence classifier is built for every subset, i.e. one sequence classifier per subset.
Preferably, the final sequence classifiers are obtained through repeated rounds of building.
In embodiments of the invention, a predetermined algorithm is used to build the sequence classifier corresponding to a subset from the training data contained in that subset and the feature word set. Each sequence classifier is then applied to the subsets to identify concepts, and the subsets are relabeled with the newly identified concepts. Based on the training data contained in the relabeled subsets and the feature word set, the predetermined algorithm is used to rebuild each sequence classifier, and each rebuilt sequence classifier is in turn applied to each relabeled subset to identify concepts.
In one implementation of the invention, a threshold can be preset to reduce computational complexity. Only when the number of newly identified concepts exceeds the set threshold are the subsets relabeled with the new concepts. The sequence classifiers can be rebuilt repeatedly until the number of newly identified concepts no longer exceeds the set threshold; the rebuilding process then ends, and the sequence classifiers obtained at that point are used as the final sequence classifiers to identify concepts in the current text document.
Preferably, when the sequence classifier corresponding to a subset is built for the first time, initialization is carried out first: a dictionary is built from the labeled words contained in the subset; the training data in the subset is labeled according to the labeled words in the dictionary (that is, occurrences in the subset of words that are in the dictionary but not yet labeled are labeled); and the predetermined algorithm is then used to build the sequence classifier corresponding to each subset from the training data in the labeled subset and the feature word set.
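The initialization described above (build a dictionary from the labeled words of a subset, then label every unlabeled occurrence of a dictionary entry) can be sketched as below. The data model is an assumption made for illustration: each document is a pair of a token list and a set of half-open token spans, and the names `build_dictionary` and `label_with_dictionary` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]              # half-open token range [start, end)
Doc = Tuple[List[str], Set[Span]]   # (tokens, spans already labeled as concepts)


def build_dictionary(subset: List[Doc]) -> Set[Tuple[str, ...]]:
    """Collect every labeled word sequence in the subset as a dictionary entry."""
    return {
        tuple(tokens[start:end])
        for tokens, spans in subset
        for start, end in spans
        if end > start  # ignore empty spans
    }


def label_with_dictionary(subset: List[Doc], dictionary: Set[Tuple[str, ...]]) -> List[Doc]:
    """Return a copy of the subset in which every occurrence of a dictionary entry is labeled."""
    labeled = []
    for tokens, spans in subset:
        new_spans = set(spans)
        for term in dictionary:
            n = len(term)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == term:
                    new_spans.add((i, i + n))
        labeled.append((tokens, new_spans))
    return labeled
```

For example, if "heart attack" is labeled in one document of a subset but left unlabeled in another, the second document gains the corresponding span after this pass.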
Preferably, after each current sequence classifier has been applied to each subset, the subsets are relabeled with the newly identified concepts as follows: when the frequency with which a newly identified concept occurs is higher than a set threshold, the new concept is added to the dictionary as a labeled word, and the training data in the subsets is relabeled according to the labeled words in the updated dictionary. Otherwise, the new concept is discarded as a falsely identified concept, which reduces the error rate of the newly identified concepts. Repeatedly rebuilding the sequence classifiers in this way realizes multi-angle collaborative learning. The implicit voting strategy among multiple weak classifiers reduces the error rate of newly identified concepts and thereby safeguards the accuracy of the newly labeled concepts in each learning iteration (the reliability of the knowledge the classifiers acquire). Although each classifier has only local knowledge of the target domain (the range of concepts it can identify), the multiple sequence classifiers finally obtained through the iterative process complement one another when identifying concepts in a text document, which can improve both recognition accuracy and recall.
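The frequency filter described above can be sketched as follows. The function name `filter_new_concepts` and the representation of concepts as plain strings are assumptions for illustration; `candidates` is taken to be the list of all concept occurrences reported by the classifiers, so a concept found several times appears several times.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def filter_new_concepts(
    candidates: Iterable[str], dictionary: Set[str], min_freq: int
) -> Tuple[Set[str], Set[str]]:
    """Keep new concepts seen more than min_freq times; discard the rest as likely errors.

    Returns (updated dictionary, accepted new concepts).
    """
    # Count only concepts that are not already dictionary entries.
    counts = Counter(c for c in candidates if c not in dictionary)
    # "Higher than the set threshold" is read as a strict comparison.
    accepted = {c for c, n in counts.items() if n > min_freq}
    return dictionary | accepted, accepted
```

The updated dictionary would then drive the relabeling pass over each subset before the classifiers are rebuilt.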
In this embodiment, the number of newly identified concepts is used as the condition for deciding whether to rebuild the sequence classifiers. In practice, the condition may instead be the ratio of newly identified concepts to the size of the dictionary: when this ratio exceeds a set threshold, the sequence classifiers are rebuilt; otherwise, the rebuilding process stops.
The predetermined algorithm used to build each sequence classifier may be a hidden Markov model (HMM), a maximum entropy model, a conditional random field (CRF) model, or the like. These are examples only; other algorithms that can be used to build a sequence classifier also fall within the scope of this embodiment.
For example, the process of building the sequence classifiers is illustrated by the following pseudocode:
[Figure BDA0000145950330000061: pseudocode for building the sequence classifiers; image not reproduced in this text]
Here, β can be set to a large value, so that the final sequence classifiers are determined after only one round of building.
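Because the pseudocode figure is not available in this text, the following is only a sketch of the iterative loop reconstructed from the prose above, not the patent's own pseudocode. The names `Subset`, `co_learn`, `train` and `recognize`, the `max_rounds` safety cap, and the in-place relabeling are all assumptions; `beta` plays the role of the threshold β on the number of newly identified concepts. The `train` and `recognize` callables stand in for a real sequence classifier such as an HMM or CRF.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Subset:
    docs: List[str]
    labels: Set[str] = field(default_factory=set)  # concepts labeled so far


def co_learn(
    subsets: List[Subset],
    train: Callable[[Subset], object],
    recognize: Callable[[object, Subset], Set[str]],
    beta: int,
    max_rounds: int = 10,  # safety cap, an assumption not found in the source
) -> List[object]:
    """Iteratively build one classifier per subset until few new concepts appear."""
    classifiers = [train(s) for s in subsets]  # initial build, one per subset
    for _ in range(max_rounds):
        new_per_subset = []
        for s in subsets:
            found: Set[str] = set()
            for clf in classifiers:  # every classifier labels every subset
                found |= recognize(clf, s)
            new_per_subset.append(found - s.labels)
        n_new = len(set().union(*new_per_subset))
        if n_new <= beta:  # not more than beta new concepts: stop rebuilding
            break
        for s, new in zip(subsets, new_per_subset):
            s.labels |= new  # relabel the subset (mutates it in place)
        classifiers = [train(s) for s in subsets]  # rebuild on relabeled data
    return classifiers
```

With a large `beta`, the loop exits on the first check and the initially built classifiers are returned, which matches the remark about β above.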
The sequence classifiers used to identify concepts in the current text document may be obtained through a single round of building, or through two or more rounds of building.
Step 203: apply each resulting sequence classifier to the current text document to identify concepts, and determine the concepts contained in the current text document from the concepts identified by each sequence classifier.
Preferably, when the concepts contained in the current text document are determined from the concepts identified by each sequence classifier: a concept identified by at least one sequence classifier is determined to be a concept contained in the current text document; and/or, when a first concept identified by one sequence classifier is a part of a second concept identified by another sequence classifier, the second concept is determined to be a concept contained in the current text document; and/or, when a part of a first concept identified by one sequence classifier is identical to a part of a second concept identified by another sequence classifier, the combination of the first concept and the second concept is determined to be a concept contained in the current text document.
In embodiments of the invention, the concepts contained in the current text document can be determined from the concepts identified by each sequence classifier in many ways, which are not limited to those listed above.
For example, after concept identification on the current text document, the concept set R1 identified by the first sequence classifier contains a concept C, and the concept set R2 identified by the second sequence classifier also contains the concept C; C is then determined to be a concept contained in the current text document.
As another example, after concept identification on the current text document, the concept set R1 identified by one sequence classifier contains a concept C, and none of the concept sets identified by the other sequence classifiers contains C; C is still determined to be a concept contained in the current text document.
As another example, after concept identification on the current text document, the concept set R1 identified by the first sequence classifier contains a concept C1, the concept set R2 identified by the second sequence classifier contains a concept C2, and C1 is a part of C2; C2 is then determined to be a concept contained in the current text document.
As another example, after concept identification on the current text document, the concept set R1 identified by the first sequence classifier contains a concept C1=AB, the concept set R2 identified by the second sequence classifier contains a concept C2=BC, and C1 and C2 share the overlapping part B; the combination C3=ABC of C1 and C2 is then determined to be a concept contained in the current text document.
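The three merging rules illustrated above (union, containment, overlap) can be sketched with one pass over token spans. This is an illustrative reading, not the patent's stated implementation: concepts are assumed to be half-open token spans `(start, end)` within the same document, and the name `merge_concepts` is hypothetical. Under that assumption, containment and partial overlap both reduce to merging overlapping intervals.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]  # half-open token range [start, end) in one document


def merge_concepts(span_sets: Iterable[Set[Span]]) -> List[Span]:
    """Merge the concept spans reported by several classifiers for one document.

    - A span found by any classifier is kept (union).
    - A span contained in another is absorbed by the larger one.
    - Two spans sharing tokens are combined (AB + BC -> ABC).
    Spans that merely touch, such as (0, 2) and (2, 4), stay separate.
    """
    spans = sorted(set().union(*span_sets))
    merged: List[Span] = []
    for start, end in spans:
        if merged and start < merged[-1][1]:
            # Shares at least one token with the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With tokens A B C at positions 0 to 2, a first classifier reporting AB as `(0, 2)` and a second reporting BC as `(1, 3)` yield the single merged concept ABC as `(0, 3)`.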
The complete concept identification process provided by embodiments of the invention is described below with reference to Fig. 3 and Fig. 4:
First, feature word selection is performed on a large number of labeled text documents T (the training data) to obtain the feature word set. These labeled text documents are then divided into multiple training data subsets, for example n subsets. A repeated learning process is carried out based on each training data subset and the feature word set, and multiple sequence classifiers are built; the number of sequence classifiers equals the number of training data subsets, i.e. one sequence classifier is built for each subset, for example n sequence classifiers corresponding to the n subsets. Each resulting sequence classifier is applied to the test text documents to obtain preliminary concept sets, one per sequence classifier; for example, for any text document in the test text document set D, the n sequence classifiers are applied separately to obtain n concept sets for that text document. Finally, the concepts are merged according to the set merging rules to obtain the final concept set; for example, the n concept sets for a text document are merged into a single concept set for that text document.
Based on the same principle, as shown in Fig. 5, embodiments of the invention also provide a concept identification device based on collaborative learning, which mainly comprises the following processing units:
The first processing unit 501 is configured to divide a training dataset into at least two subsets, the training data in the training dataset being text documents with labeled words;
The second processing unit 502 is configured to carry out collaborative learning based on the training data contained in the subsets and on the feature word set extracted from the training dataset, to build at least two sequence classifiers;
The third processing unit 503 is configured to apply each resulting sequence classifier to the current text document to identify concepts, and to determine the concepts contained in the current text document from the concepts identified by each sequence classifier.
The third processing unit 503 is specifically configured, when determining the concepts contained in the current text document from the concepts identified by each sequence classifier, to determine a concept identified by at least one sequence classifier to be a concept contained in the current text document; and/or, when a first concept identified by one sequence classifier is a part of a second concept identified by another sequence classifier, to determine the second concept to be a concept contained in the current text document; and/or, when a part of a first concept identified by one sequence classifier is identical to a part of a second concept identified by another sequence classifier, to determine the combination of the first concept and the second concept to be a concept contained in the current text document.
The second processing unit 502 is specifically configured to carry out collaborative learning based on the training data contained in every subset and on the feature word set, and to build a sequence classifier for each subset.
The second processing unit 502 is specifically configured to build the sequence classifier corresponding to each subset from the training data contained in that subset and the feature word set; to apply each sequence classifier to the subsets to identify concepts and relabel the subsets with the newly identified concepts; to rebuild each sequence classifier from the training data contained in each relabeled subset and the feature word set; and to apply each rebuilt sequence classifier in turn to each relabeled subset to identify concepts.
In one implementation of the invention, a threshold can be preset to reduce computational complexity. Only when the number of newly identified concepts exceeds the set threshold are the subsets relabeled with the new concepts. The sequence classifiers can be rebuilt repeatedly until the number of newly identified concepts no longer exceeds the set threshold; the rebuilding process then ends, and the sequence classifiers obtained at that point are used as the final sequence classifiers to identify concepts in the current text document.
The second processing unit 502 is specifically configured to build a dictionary from the labeled words contained in a subset, to label the training data in the subset according to the labeled words in the dictionary, and then to build the sequence classifier corresponding to the subset from the training data in the labeled subset and the feature word set.
The second processing unit 502 is further configured, when relabeling a subset with a new concept, to add the new concept to the dictionary when the frequency with which it occurs is higher than a set threshold, and to relabel the training data in the subset according to the updated dictionary.
Preferably, the first processing unit 501 is specifically configured, when dividing the training dataset into subsets, to ensure that the amount of training data in each resulting subset is greater than a set threshold.
Based on the above technical solutions, in embodiments of the invention the training dataset is divided into at least two subsets; collaborative learning is carried out based on the labeled training data contained in the subsets and on the feature word set extracted from the training dataset, so that multiple sequence classifiers are built; each of these sequence classifiers is applied to the current text document; and the concepts contained in the current text document are determined from the concepts identified by each sequence classifier. This overcomes the technical problem of existing concept identification methods, namely low identification quality when the sequence classifier is built from a partially labeled training dataset. Concept identification with sequence classifiers built from partially labeled training data can thus achieve an effect comparable to that obtained with fully labeled training data, which improves the quality of concept identification, in particular when the sequence classifiers are built from partially labeled training data. In addition, using multiple sequence classifiers to determine the concepts contained in the current text document realizes multi-angle concept identification, which improves both recognition accuracy and recall.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If such changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to cover them as well.

Claims (14)

1. the concept identification method based on Cooperative Study is characterized in that, comprising:
Training dataset is divided at least two subsets, and it is the text document of tape label word that described training data is concentrated the training data that comprises;
The training data that comprises based on subset and carry out Cooperative Study according to the Feature Words set that described training dataset extracts makes up at least two sequence sorters;
Adopt each the sequence sorter that obtains respectively the current text document to be carried out concept identification, and determine the concept that described current text document comprises according to the concept that each sequence sorter identifies.
2. such as right 1 described method, it is characterized in that, carry out Cooperative Study based on training data and the set of described Feature Words that subset comprises, make up at least two sequence sorters, comprising:
Training data and the set of described Feature Words based on subset comprises make up sequence sorter corresponding to subset;
Adopt each sequence sorter that subset is carried out respectively concept identification; And adopt the new concept that identifies that subset is carried out again mark;
Training data and the set of described Feature Words based on the subset behind the mark again comprises rebuild each sequence sorter; And adopt successively each the sequence sorter rebuild that each subset behind the mark is again carried out concept identification.
3. method as claimed in claim 2 is characterized in that, training data and the set of described Feature Words based on subset comprises make up sequence sorter corresponding to subset, comprising:
Make up dictionary based on each tagged words that comprises in the subset;
According to each tagged words that comprises in the described dictionary training data that described subset comprises is carried out mark, the training data and the set of described Feature Words that comprise based on the subset behind the mark make up sequence sorter corresponding to subset again.
4. method as claimed in claim 3 is characterized in that, adopts the new concept that identifies that subset is carried out again mark, comprising:
When determining that frequency that described new concept occurs is higher than setting threshold, the concept that this is new increases to described dictionary;
According to the dictionary after upgrading the training data that subset comprises is re-started mark.
5. the method for claim 1 is characterized in that, determines the concept that described current text document comprises according to the concept that each sequence sorter identifies, and comprising:
To be defined as by the concept that at least one sequence sorter identifies the concept that described current text document comprises; And/or,
During the second concept that the first concept that identifies at a sequence sorter identifies for another sequence sorter a part of, described the second concept is defined as the concept that described current text document comprises;
And/or,
When the part of the second concept that the part of the first concept that identifies at a sequence sorter and another sequence sorter identify is identical, will described the first concept and the combination of the second concept after be defined as the concept that described current text document comprises.
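The three combination rules of claim 5 can be illustrated with the sketch below. Several points are assumptions rather than statements of the patent: concepts are represented as token tuples, "a part is identical" is read as the end of one concept matching the start of the other, and the shorter or partial concepts are dropped once a longer or combined one is kept. The names `contains`, `overlap_join` and `merge_concepts` are hypothetical.

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous token run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def overlap_join(a, b):
    """Join a and b if a proper suffix of a equals a proper prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None


def merge_concepts(first, second):
    """Combine the concept sets returned by two sequence sorters."""
    merged = set(first) | set(second)        # rule 1: found by at least one
    for a in first:
        for b in second:
            if a == b:
                continue
            if contains(b, a):               # rule 2: keep the longer concept
                merged.discard(a)
            elif contains(a, b):
                merged.discard(b)
            else:                            # rule 3: combine on overlap
                joined = overlap_join(a, b) or overlap_join(b, a)
                if joined:
                    merged -= {a, b}
                    merged.add(joined)
    return merged
```

With more than two sorters, the same function could be applied pairwise to the running result.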
6. The method of any one of claims 1-5, characterized in that collaborative learning is carried out based on the training data contained in each subset and the feature word set, and a sequence sorter is built for each subset.
7. The method of any one of claims 1-5, characterized in that dividing the training dataset into subsets comprises: dividing such that the amount of training data contained in each resulting subset is greater than a set threshold.
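The size constraint of claim 7 could be enforced as in the sketch below. The round-robin split is an arbitrary choice made for illustration, since the claim only requires that every subset exceed the threshold; `split_dataset`, `num_subsets` and `min_size` are hypothetical names.

```python
def split_dataset(docs, num_subsets, min_size):
    """Split docs round-robin; every subset must hold more than min_size items."""
    if num_subsets < 2:
        raise ValueError("at least two subsets are required")
    subsets = [docs[i::num_subsets] for i in range(num_subsets)]
    if any(len(s) <= min_size for s in subsets):
        raise ValueError("a subset does not exceed the size threshold")
    return subsets
```

Requiring a minimum amount of data per subset helps ensure that each sequence sorter is trained on enough examples to be useful to the others.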
8. A concept identification device based on collaborative learning, characterized by comprising:
A first processing unit, configured to divide a training dataset into at least two subsets, the training data contained in the training dataset being text documents with tagged words;
A second processing unit, configured to carry out collaborative learning based on the training data contained in the subsets and on a feature word set extracted from the training dataset, and to build at least two sequence sorters;
A third processing unit, configured to use each obtained sequence sorter to perform concept identification on a current text document respectively, and to determine the concepts contained in the current text document according to the concepts identified by each sequence sorter.
9. The device of claim 8, characterized in that the second processing unit is specifically configured to: build the sequence sorter corresponding to each subset based on the training data contained in that subset and the feature word set; use each sequence sorter to perform concept identification on the subsets respectively, and re-label the subsets with the newly identified concepts; rebuild each sequence sorter based on the training data contained in each re-labeled subset and the feature word set; and use each rebuilt sequence sorter in turn to perform concept identification on each re-labeled subset.
10. The device of claim 9, characterized in that the second processing unit is specifically configured to: build a dictionary from the tagged words contained in a subset; label the training data contained in the subset according to the tagged words in the dictionary; and then build the sequence sorter corresponding to the subset based on the training data contained in the labeled subset and the feature word set.
11. The device of claim 10, characterized in that the second processing unit is further configured, when re-labeling a subset with the newly identified concepts, to: add a new concept to the dictionary when the frequency with which the new concept occurs is determined to be higher than a set threshold; and re-label the training data contained in the subset according to the updated dictionary.
12. The device of claim 8, characterized in that the third processing unit is specifically configured, when determining the concepts contained in the current text document according to the concepts identified by each sequence sorter, to: determine a concept identified by at least one sequence sorter to be a concept contained in the current text document; and/or, when a first concept identified by one sequence sorter is a part of a second concept identified by another sequence sorter, determine the second concept to be a concept contained in the current text document; and/or, when a part of a first concept identified by one sequence sorter is identical to a part of a second concept identified by another sequence sorter, determine the combination of the first concept and the second concept to be a concept contained in the current text document.
13. The device of any one of claims 8-12, characterized in that the second processing unit is specifically configured to carry out collaborative learning based on the training data contained in each subset and the feature word set, and to build a sequence sorter for each subset.
14. The device of any one of claims 8-12, characterized in that the first processing unit is specifically configured, when dividing the training dataset into subsets, to divide such that the amount of training data contained in each resulting subset is greater than a set threshold.
CN201210077906.4A 2012-03-22 2012-03-22 Concept identification method and device based on collaborative learning Active CN103324632B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201210077906.4A CN103324632B (en) 2012-03-22 2012-03-22 Concept identification method and device based on collaborative learning
JP2012271100A JP5523543B2 (en) 2012-03-22 2012-12-12 Concept recognition method and concept recognition device based on co-learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210077906.4A CN103324632B (en) 2012-03-22 2012-03-22 Concept identification method and device based on collaborative learning

Publications (2)

Publication Number Publication Date
CN103324632A (en) 2013-09-25
CN103324632B (en) 2016-08-03

Family

ID=49193380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210077906.4A Active CN103324632B (en) 2012-03-22 2012-03-22 Concept identification method and device based on collaborative learning

Country Status (2)

Country Link
JP (1) JP5523543B2 (en)
CN (1) CN103324632B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5984153B2 (en) * 2014-09-22 2016-09-06 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Information processing apparatus, program, and information processing method
US11544579B2 (en) 2016-11-23 2023-01-03 Primal Fusion Inc. System and method for generating training data for machine learning classifier
US11568324B2 (en) 2018-12-20 2023-01-31 Samsung Display Co., Ltd. Adversarial training method for noisy labels
JP7102563B2 (en) * 2021-02-03 2022-07-19 プライマル フュージョン インコーポレイテッド Systems and methods for using knowledge representation with machine learning classifiers
KR102591587B1 (en) * 2021-08-26 2023-10-18 서울여자대학교 산학협력단 Medical image segmentation apparatus and method for segmentating medical image

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050131850A1 (en) * 2003-12-10 2005-06-16 Microsoft Corporation Uncertainty reduction in collaborative bootstrapping
CN101561805A (en) * 2008-04-18 2009-10-21 日电(中国)有限公司 Document classifier generation method and system
CN102208037A (en) * 2011-06-10 2011-10-05 西安电子科技大学 Hyper-spectral image classification method based on Gaussian process classifier collaborative training algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009070321A (en) * 2007-09-18 2009-04-02 Fuji Xerox Co Ltd Device and program for classifying document

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANTONIA KYRIAKOPOULOU et al.: "Using Clustering and Co-Training to Boost Classification Performance", 《19TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE》, 31 October 2007 (2007-10-31), pages 325 - 330, XP031440469 *
徐建良 等: "一种基于Co-Training的海洋文献分类方法", 《中国海洋大学学报》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2991004A3 (en) * 2014-08-28 2017-01-25 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for labeling training samples
US9619758B2 (en) 2014-08-28 2017-04-11 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for labeling training samples
US10796244B2 (en) 2014-08-28 2020-10-06 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for labeling training samples
US9659006B2 (en) 2015-06-16 2017-05-23 Cantor Colburn Llp Disambiguation in concept identification
CN106294762A (en) * 2016-08-11 2017-01-04 齐鲁工业大学 Entity recognition method based on learning
CN106294762B (en) * 2016-08-11 2019-12-10 齐鲁工业大学 Entity identification method based on learning

Also Published As

Publication number Publication date
JP5523543B2 (en) 2014-06-18
JP2013196680A (en) 2013-09-30
CN103324632B (en) 2016-08-03

Similar Documents

Publication Publication Date Title
CN103324632A (en) Concept identification method and device based on collaborative learning
CN110334213B (en) Method for identifying time sequence relation of Hanyue news events based on bidirectional cross attention mechanism
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN110532554A (en) A kind of Chinese abstraction generating method, system and storage medium
CN108875816A (en) Merge the Active Learning samples selection strategy of Reliability Code and diversity criterion
CN104516903A (en) Keyword extension method and system and classification corpus labeling method and system
CN110232439B (en) Intention identification method based on deep learning network
CN106202256A (en) Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
CN107004000A (en) A kind of language material generating means and method
CN105808525A (en) Domain concept hypernym-hyponym relation extraction method based on similar concept pairs
CN107577702B (en) Method for distinguishing traffic information in social media
CN104408153A (en) Short text hash learning method based on multi-granularity topic models
CN110415071B (en) Automobile competitive product comparison method based on viewpoint mining analysis
CN103984943A (en) Scene text identification method based on Bayesian probability frame
CN103425757A (en) Cross-medial personage news searching method and system capable of fusing multi-mode information
CN103324929B (en) Based on the handwritten Chinese recognition methods of minor structure study
CN110738033B (en) Report template generation method, device and storage medium
CN108664474A (en) A kind of resume analytic method based on deep learning
CN105718940A (en) Zero-sample image classification method based on multi-group factor analysis
CN106055560A (en) Method for collecting data of word segmentation dictionary based on statistical machine learning method
CN106886565B (en) Automatic polymerization method for foundation house type
CN104462041A (en) Method for completely detecting hot event from beginning to end
CN102360436B (en) Identification method for on-line handwritten Tibetan characters based on components
CN111159332A (en) Text multi-intention identification method based on bert
CN108846033B (en) Method and device for discovering specific domain vocabulary and training classifier

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant