CN103324632A

CN103324632A - Concept identification method and device based on collaborative learning

Info

Publication number: CN103324632A
Application number: CN2012100779064A
Authority: CN
Inventors: 李建强; 陈宽桐; 刘春辰
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd
Priority date: 2012-03-22
Filing date: 2012-03-22
Publication date: 2013-09-25
Anticipated expiration: 2032-03-22
Also published as: JP5523543B2; JP2013196680A; CN103324632B

Abstract

The invention discloses a concept identification method and device based on collaborative learning, which improve concept identification quality, in particular when sequence classifiers are built from partially labeled training data. The method comprises: dividing a training dataset into at least two subsets, where the training data in the training dataset are text documents containing labeled words; performing collaborative learning based on the training data in the subsets and on a feature word set extracted from the training dataset, to build at least two sequence classifiers; applying each resulting sequence classifier to the current text document to identify concepts; and determining the concepts contained in the current text document from the concepts identified by each sequence classifier.

Description

Concept identification method and device based on collaborative learning

Technical field

The present invention relates to the field of artificial intelligence, and in particular to a concept identification method and device based on collaborative learning.

Background technology

With the development of information retrieval (Information Retrieval, IR) technology, semantic information retrieval (Semantic Information Retrieval, Semantic IR) shows great potential compared with traditional keyword-based information retrieval (Keywords-Based IR). Concept identification (Concept Detection) and concept disambiguation (Concept Disambiguation) play an important role in semantic information retrieval. Concept identification refers to finding, in a text, the character strings that represent one or more concepts.

As shown in Fig. 1, the prior-art process of identifying concepts in text documents by machine learning is as follows:

A large number of labeled text documents are stored in the text document storage unit 101; features are selected from these labeled text documents and stored in the feature storage unit 102;

A sequence classifier is learned from the labeled text documents stored in the text document storage unit 101 and the features stored in the feature storage unit 102, and the resulting sequence classifier is stored in the sequence classifier storage unit 103;

The sequence classifier stored in the sequence classifier storage unit 103 performs concept identification on the test documents stored in the test document storage unit 104, and the identified concepts are saved to the concept storage unit 105.

Existing machine-learning methods mainly use fully labeled training data to build a concept recognizer (Concept Recognizer), that is, a sequence classifier (Sequence Classifier), and use the resulting sequence classifier to find the candidate concepts (Candidate Concepts) in a text document. Fully labeled training data means that, for every text document in the training dataset, all the concepts it contains are labeled. In many cases, however, the available training data are only partially labeled, that is, only some of the concepts contained in a document are labeled. If the sequence classifier is then built from the partially labeled training data in the same way as from fully labeled training data, the quality of the concepts it identifies is limited.

Summary of the invention

The invention provides a concept identification method and device based on collaborative learning, to improve concept identification quality, in particular when sequence classifiers are built from partially labeled training data.

The specific technical solutions provided by the embodiments of the invention are as follows:

A concept identification method based on collaborative learning comprises:

dividing a training dataset into at least two subsets, where the training data in the training dataset are text documents containing labeled words;

performing collaborative learning based on the training data in the subsets and on a feature word set extracted from the training dataset, to build at least two sequence classifiers;

applying each resulting sequence classifier to the current text document to identify concepts, and determining the concepts contained in the current text document from the concepts identified by each sequence classifier.

A concept identification device based on collaborative learning comprises:

a first processing unit, configured to divide a training dataset into at least two subsets, where the training data in the training dataset are text documents containing labeled words;

a second processing unit, configured to perform collaborative learning based on the training data in the subsets and on a feature word set extracted from the training dataset, to build at least two sequence classifiers;

a third processing unit, configured to apply each resulting sequence classifier to the current text document to identify concepts, and to determine the concepts contained in the current text document from the concepts identified by each sequence classifier.

Based on the above technical solution, in the embodiments of the invention the training dataset is divided into at least two subsets, and collaborative learning is performed based on the labeled training data in the subsets and on the feature word set extracted from the training dataset, so that at least two (that is, multiple) sequence classifiers are built. Each of these sequence classifiers is then applied to the current text document, and the concepts contained in the current text document are determined from the concepts identified by each sequence classifier. This overcomes the technical problem of existing concept identification methods, namely low identification quality when the sequence classifier is built from a partially labeled training dataset. Sequence classifiers built from partially labeled training data can thus also achieve the effect obtained with fully labeled training data, which improves concept identification quality, especially when the training data are only partially labeled.

Description of drawings

Fig. 1 is a schematic diagram of the concept identification process in the prior art;

Fig. 2 is a schematic diagram of the concept identification process based on collaborative learning in an embodiment of the invention;

Fig. 3 is a schematic diagram of the complete concept identification process in an embodiment of the invention;

Fig. 4 is a schematic diagram of another complete concept identification process in an embodiment of the invention;

Fig. 5 is a structural diagram of the concept identification device based on collaborative learning in an embodiment of the invention.

Embodiment

To lower the requirements that concept identification places on training data while improving concept identification quality, in particular when sequence classifiers are built from partially labeled training data, the embodiments of the invention provide a concept identification method and device based on collaborative learning.

The collaborative learning method (Co-learning approach) is a variant of the co-training method (Co-training approach). Co-training trains two classifiers on two relatively independent feature sets. Collaborative learning does not need to split the features into two relatively independent sets; instead, it divides the training dataset into multiple groups, so that multiple classifiers are built from multiple perspectives.
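The difference between the two approaches can be pictured with a minimal sketch. The function names, the alternating feature split, and the round-robin document split are assumptions made for illustration only; in particular, an alternating split does not by itself guarantee the relative independence that co-training requires.

```python
from typing import List, Sequence, Tuple


def split_features(feature_names: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Co-training style: divide the FEATURES into two views (alternating split, assumed)."""
    names = list(feature_names)
    return names[0::2], names[1::2]


def split_documents(documents: Sequence[str], n_groups: int) -> List[List[str]]:
    """Co-learning style: keep every feature and divide the DOCUMENTS into groups."""
    docs = list(documents)
    return [docs[i::n_groups] for i in range(n_groups)]


# Hypothetical feature names and document identifiers.
features = ["word", "prefix", "suffix", "pos_tag"]
documents = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]

views = split_features(features)         # two feature views, one classifier per view
groups = split_documents(documents, 3)   # three data groups, one classifier per group
```

In the co-learning case every classifier sees the full feature word set but a different part of the training data, which is what allows the number of classifiers to exceed two.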

The preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.

As shown in Fig. 2, in an embodiment of the invention the detailed flow of concept identification based on collaborative learning is as follows:

Step 201: Divide the training dataset into at least two subsets, where the training data in the training dataset are text documents containing labeled words.

Specifically, when the training dataset is divided into at least two subsets, the amount of training data in each resulting subset is greater than a preset threshold.

In practice, the training dataset contains a large number of text documents, each of which contains labeled words. When the dataset is divided into subsets, the number of text documents in each subset must exceed a preset threshold; alternatively, the number of labeled words in each subset must exceed a preset threshold. This ensures that every subset holds enough training data to train the corresponding sequence classifier. In a preferred implementation, the amount of training data in each subset should also stay below a certain threshold, so that an overly large subset does not lead to long running times and reduce the efficiency of building the sequence classifiers.
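Step 201 can be sketched as follows. This is only an illustration under stated assumptions: the round-robin split, the function name `partition`, and the parameter names `min_size` and `max_size` are hypothetical, and the size check counts documents rather than labeled words.

```python
from typing import List, Sequence


def partition(documents: Sequence[str], n_subsets: int,
              min_size: int, max_size: int) -> List[List[str]]:
    """Split documents round-robin into n_subsets and check the size bounds."""
    if n_subsets < 2:
        raise ValueError("at least two subsets are required")
    docs = list(documents)
    subsets = [docs[i::n_subsets] for i in range(n_subsets)]
    for subset in subsets:
        # Each subset must hold more than min_size and fewer than max_size documents.
        if not (min_size < len(subset) < max_size):
            raise ValueError(
                f"subset size {len(subset)} is outside ({min_size}, {max_size})")
    return subsets


# Hypothetical document identifiers and thresholds.
subsets = partition([f"doc{i}" for i in range(10)],
                    n_subsets=2, min_size=3, max_size=8)
```

The lower bound keeps each subset large enough to train a classifier; the upper bound reflects the preferred implementation in which overly large subsets are avoided for efficiency.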

In the embodiments of the invention, the text documents in the training dataset may be partially labeled or fully labeled; the method is particularly suited to partially labeled text documents.

Step 202: Perform collaborative learning based on the training data in the subsets and on the feature word set extracted from the training dataset, to build at least two sequence classifiers.

In the embodiments of the invention, a sequence classifier may be built for every subset, or only for some of the subsets. Preferably, collaborative learning is performed based on the training data in each subset and on the feature word set extracted from the training dataset, and one sequence classifier is built for each subset, that is, each subset corresponds to one sequence classifier.

Preferably, the final sequence classifiers are obtained by repeating the building process several times.

In the embodiments of the invention, a predetermined algorithm builds the sequence classifier corresponding to each subset from the training data in that subset and the feature word set. Each sequence classifier then performs concept identification on the subsets, and the subsets are relabeled with the newly identified concepts. Based on the training data in the relabeled subsets and the feature word set, the predetermined algorithm rebuilds each sequence classifier, and each rebuilt sequence classifier in turn performs concept identification on the relabeled subsets.

In one implementation of the invention, a threshold may be preset to reduce computational complexity. A subset is relabeled with the newly identified concepts only when the number of newly identified concepts exceeds the preset threshold. The sequence classifiers are rebuilt repeatedly until the number of newly identified concepts no longer exceeds the preset threshold; the rebuilding process then ends, and the sequence classifiers obtained at that point serve as the final sequence classifiers for concept identification on the current text document.

Preferably, the first time the predetermined algorithm builds the sequence classifier corresponding to a subset from the training data in that subset and the feature word set, an initialization is performed first. A dictionary is built from the labeled words in the subset, and the training data in the subset are labeled according to the labeled words in this dictionary, that is, every word in the subset that is in the dictionary but not yet labeled is labeled. The predetermined algorithm then builds the sequence classifier corresponding to each subset from the training data in the labeled subset and the feature word set.
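The initialization can be sketched as below. The representation is an assumption made for illustration: each document is a pair of a token list and the set of token positions that are already labeled, and every concept is a single word.

```python
from typing import List, Set, Tuple

# (tokens, positions of the tokens that are labeled); representation assumed.
Document = Tuple[List[str], Set[int]]


def build_dictionary(subset: List[Document]) -> Set[str]:
    """Collect every labeled word that occurs in the subset."""
    return {tokens[i] for tokens, labeled in subset for i in labeled}


def label_with_dictionary(subset: List[Document],
                          dictionary: Set[str]) -> List[Document]:
    """Label every occurrence of a dictionary word that is not yet labeled."""
    return [
        (tokens, labeled | {i for i, tok in enumerate(tokens) if tok in dictionary})
        for tokens, labeled in subset
    ]


# Hypothetical partially labeled subset: only one word per document is labeled.
subset = [
    (["diabetes", "causes", "fatigue"], {0}),
    (["fatigue", "and", "diabetes"], {0}),
]
dictionary = build_dictionary(subset)
relabeled = label_with_dictionary(subset, dictionary)
```

After this pass both documents carry labels on "diabetes" and "fatigue", so the first classifier is trained on more complete labels than the raw subset provided.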

Preferably, after each current sequence classifier has performed concept identification on each subset and the subsets are to be relabeled with the newly identified concepts, a newly identified concept is added to the dictionary as a labeled word only when its frequency of occurrence is higher than a preset threshold, and the training data in the subset are relabeled according to the labeled words in the updated dictionary. Otherwise, the new concept is discarded as a falsely identified concept, which lowers the error rate of the newly identified concepts. Repeatedly rebuilding the sequence classifiers in this way realizes collaborative learning from multiple perspectives. The implicit voting strategy among multiple weak classifiers lowers the error rate of identifying new concepts and thus ensures the accuracy of the newly labeled concepts in each learning iteration (the reliability of the knowledge the classifiers acquire). Although each classifier holds only local knowledge of the target domain (the range of concepts it can identify), the multiple sequence classifiers finally obtained through the repeated iterations, when combined to perform concept identification on a text document, improve both identification accuracy and recall.
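The frequency filter can be sketched as follows. How the frequency is counted is not specified in the text; this sketch assumes it is the number of times a concept appears in the pooled output of all classifiers, and the names `update_dictionary` and `min_frequency` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def update_dictionary(dictionary: Set[str], recognized: Iterable[str],
                      min_frequency: int) -> Set[str]:
    """Add a new concept only if it was recognized more than min_frequency times."""
    counts = Counter(c for c in recognized if c not in dictionary)
    accepted = {concept for concept, n in counts.items() if n > min_frequency}
    # Concepts at or below the threshold are discarded as likely false positives.
    return dictionary | accepted


# Hypothetical pooled output of the classifiers over one subset.
recognized = ["insulin", "insulin", "insulin", "the", "diabetes"]
updated = update_dictionary({"diabetes"}, recognized, min_frequency=2)
```

Because a concept must be found repeatedly before it enters the dictionary, a single classifier's isolated mistake (here "the") does not propagate into the next training round.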

In this embodiment, the number of newly identified concepts serves as the condition for deciding whether to rebuild the sequence classifiers. In practice, the condition may instead be the ratio of the newly identified concepts to the dictionary: when this ratio exceeds a preset threshold, the sequence classifiers are rebuilt; otherwise, the rebuilding process stops.
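The two rebuild conditions can be written as one small check. The function name, the `use_ratio` flag, and the handling of an empty dictionary are assumptions made for this sketch.

```python
def should_rebuild(num_new: int, dictionary_size: int,
                   threshold: float, use_ratio: bool = False) -> bool:
    """Decide whether the sequence classifiers are rebuilt for another round."""
    if use_ratio:
        if dictionary_size == 0:
            # Assumption: with an empty dictionary, any new concept triggers a rebuild.
            return num_new > 0
        return num_new / dictionary_size > threshold
    return num_new > threshold


by_count = should_rebuild(5, 100, threshold=3)                    # 5 > 3
by_ratio = should_rebuild(5, 100, threshold=0.1, use_ratio=True)  # 0.05 > 0.1 is false
```

The ratio form scales the stopping condition with the size of the dictionary, so a large dictionary does not keep iterating for a handful of additions.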

The predetermined algorithm used to build each sequence classifier may be a hidden Markov model (HMM), a maximum entropy model, a conditional random field (CRF) model, or a similar algorithm. These are only examples; any other algorithm that can be used to build a sequence classifier is also covered by this embodiment.

For example, the process of building the sequence classifiers is illustrated by a piece of pseudocode (not reproduced in this text).

Here, β can be set to a relatively large value, so that the final sequence classifiers are determined after a single building pass.

The sequence classifiers used for concept identification on the current text document may be obtained through a single building pass, or through two or more building passes.
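As a rough illustration of the iterative building process, and not a reconstruction of the patent's own pseudocode, the loop might look like the sketch below. Everything here is assumed: `train` is a toy stand-in for an HMM, maximum entropy, or CRF learner (it memorizes the words that precede known concepts), β is taken to be the threshold on the number of new concepts, `max_rounds` is a safety cap added for the sketch, and the frequency filter is omitted for brevity.

```python
from typing import Callable, List, Set

Tokens = List[str]
Classifier = Callable[[Tokens], Set[str]]


def train(subset: List[Tokens], dictionary: Set[str]) -> Classifier:
    """Toy stand-in for a sequence learner: remember words that precede concepts."""
    cues = {doc[i - 1] for doc in subset
            for i in range(1, len(doc)) if doc[i] in dictionary}

    def classify(tokens: Tokens) -> Set[str]:
        found = {t for t in tokens if t in dictionary}
        found |= {tokens[i] for i in range(1, len(tokens)) if tokens[i - 1] in cues}
        return found

    return classify


def co_learn(subsets: List[List[Tokens]], dictionaries: List[Set[str]],
             beta: int, max_rounds: int = 10) -> List[Classifier]:
    """Build one classifier per subset and rebuild while enough new concepts appear."""
    classifiers = [train(s, d) for s, d in zip(subsets, dictionaries)]
    for _ in range(max_rounds):
        new_per_subset = []
        for subset, known in zip(subsets, dictionaries):
            # Every classifier, not only the subset's own, labels every subset.
            recognized = {c for clf in classifiers for doc in subset for c in clf(doc)}
            new_per_subset.append(recognized - known)
        if sum(len(new) for new in new_per_subset) <= beta:
            break  # too few new concepts: keep the current classifiers as final
        dictionaries = [known | new for known, new in zip(dictionaries, new_per_subset)]
        classifiers = [train(s, d) for s, d in zip(subsets, dictionaries)]
    return classifiers


# Hypothetical partially labeled data: one known concept per subset.
subsets = [
    [["patient", "has", "diabetes"], ["patient", "has", "asthma"]],
    [["treated", "with", "insulin"], ["patient", "has", "fever"],
     ["treated", "with", "aspirin"]],
]
final = co_learn(subsets, [{"diabetes"}, {"insulin"}], beta=0)
single_pass = co_learn(subsets, [{"diabetes"}, {"insulin"}], beta=100)
```

With `beta=0` the second classifier picks up the cue word "has" from concepts that the first classifier labeled in its subset, which is the knowledge exchange that collaborative learning relies on. With a large `beta` the loop stops after the first build, matching the remark about setting β to a relatively large value.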

Step 203: Apply each resulting sequence classifier to the current text document to identify concepts, and determine the concepts contained in the current text document from the concepts identified by each sequence classifier.

Preferably, when the concepts contained in the current text document are determined from the concepts identified by each sequence classifier: a concept identified by at least one sequence classifier is determined as a concept contained in the current text document; and/or, when a first concept identified by one sequence classifier is part of a second concept identified by another sequence classifier, the second concept is determined as a concept contained in the current text document; and/or, when part of a first concept identified by one sequence classifier is identical to part of a second concept identified by another sequence classifier, the combination of the first concept and the second concept is determined as a concept contained in the current text document.

In the embodiments of the invention, the concepts contained in the current text document may be determined from the concepts identified by each sequence classifier in many ways, which are not limited to those above.

For example, after concept identification on the current text document, the concept set R1 identified by the first sequence classifier contains a concept C, and the concept set identified by the second sequence classifier also contains the concept C; C is then determined as a concept contained in the current text document.

As another example, after concept identification on the current text document, the concept set R1 identified by one sequence classifier contains a concept C, and none of the concept sets identified by the other sequence classifiers contains C; C is then determined as a concept contained in the current text document.

As another example, after concept identification on the current text document, the concept set R1 identified by the first sequence classifier contains a concept C1, the concept set R2 identified by the second sequence classifier contains a concept C2, and C1 is part of C2; C2 is then determined as a concept contained in the current text document.

As another example, after concept identification on the current text document, the concept set R1 identified by the first sequence classifier contains a concept C1=AB, the concept set R2 identified by the second sequence classifier contains a concept C2=BC, and C1 and C2 share the overlapping part B; the combination C3=ABC of C1 and C2 is then determined as a concept contained in the current text document.
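The merging rules illustrated above can be sketched as follows. Several choices are assumptions of this sketch rather than statements of the text: a concept is represented as a tuple of words, "part of" means a contiguous run of words, a shorter concept is dropped once a longer concept containing it exists, and overlapping concepts are combined in a single pass only.

```python
from typing import Iterable, Optional, Set, Tuple

Concept = Tuple[str, ...]  # a concept as a sequence of words (representation assumed)


def contains(longer: Concept, shorter: Concept) -> bool:
    """True if shorter occurs as a contiguous run of words inside longer."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def combine(first: Concept, second: Concept) -> Optional[Concept]:
    """Join two concepts when the end of first equals the start of second."""
    for k in range(min(len(first), len(second)) - 1, 0, -1):
        if first[-k:] == second[:k]:
            return first + second[k:]
    return None


def merge_concepts(concept_sets: Iterable[Set[Concept]]) -> Set[Concept]:
    # Rule 1: anything identified by at least one classifier is a candidate.
    merged: Set[Concept] = set().union(*concept_sets)
    # Rule 3: combine concepts that overlap, such as AB and BC into ABC.
    candidates = list(merged)
    for a in candidates:
        for b in candidates:
            if a != b:
                joined = combine(a, b)
                if joined is not None:
                    merged.add(joined)
    # Rule 2: prefer the longer concept when one concept is part of another.
    return {c for c in merged
            if not any(c != other and contains(other, c) for other in merged)}


# Hypothetical outputs of two sequence classifiers for the same document.
r1 = {("type", "2"), ("insulin",)}
r2 = {("2", "diabetes"), ("insulin", "therapy")}
result = merge_concepts([r1, r2])
```

Working on word tuples rather than raw character strings avoids accidental joins between unrelated concepts that merely share a letter at their boundary.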

The complete concept identification process provided by the embodiments of the invention is described below with reference to Fig. 3 and Fig. 4:

First, feature words are selected from a large number of labeled text documents T (the training data) to obtain the feature word set. These labeled text documents are then divided into multiple training data subsets, for example n subsets. Next, a repeated learning process is carried out based on each training data subset and the feature word set, building multiple sequence classifiers whose number equals the number of training data subsets; that is, one sequence classifier is built for each subset, for example n sequence classifiers for the n subsets. Each sequence classifier is then applied to the test text documents, yielding the preliminarily identified concept sets, one per sequence classifier; for example, for any text document in the test text document collection D, the n sequence classifiers each perform concept identification, giving n concept sets for that document. Finally, the concepts are merged according to preset merging rules to obtain the final concept set; for example, the n concept sets of one text document are merged into a single concept set for that document.

Based on the same principle, as shown in Fig. 5, the embodiments of the invention also provide a concept identification device based on collaborative learning, which mainly comprises the following processing units:

The first processing unit 501, configured to divide the training dataset into at least two subsets, where the training data in the training dataset are text documents containing labeled words;

The second processing unit 502, configured to perform collaborative learning based on the training data in the subsets and on the feature word set extracted from the training dataset, to build at least two sequence classifiers;

The third processing unit 503, configured to apply each resulting sequence classifier to the current text document to identify concepts, and to determine the concepts contained in the current text document from the concepts identified by each sequence classifier.

Specifically, when determining the concepts contained in the current text document from the concepts identified by each sequence classifier, the third processing unit 503 is configured to: determine a concept identified by at least one sequence classifier as a concept contained in the current text document; and/or, when a first concept identified by one sequence classifier is part of a second concept identified by another sequence classifier, determine the second concept as a concept contained in the current text document; and/or, when part of a first concept identified by one sequence classifier is identical to part of a second concept identified by another sequence classifier, determine the combination of the first concept and the second concept as a concept contained in the current text document.

Specifically, the second processing unit 502 is configured to perform collaborative learning based on the training data in each subset and the feature word set, and to build a sequence classifier for each subset.

Specifically, the second processing unit 502 is configured to build the sequence classifier corresponding to each subset based on the training data in that subset and the feature word set, to perform concept identification on the subsets with each sequence classifier, to relabel the subsets with the newly identified concepts, to rebuild each sequence classifier based on the training data in each relabeled subset and the feature word set, and to perform concept identification on each relabeled subset with each rebuilt sequence classifier in turn.

In one implementation of the invention, a threshold may be preset to reduce computational complexity. A subset is relabeled with the newly identified concepts only when the number of newly identified concepts exceeds the preset threshold. The sequence classifiers are rebuilt repeatedly until the number of newly identified concepts no longer exceeds the preset threshold; the rebuilding process then ends, and the sequence classifiers obtained at that point serve as the final sequence classifiers for concept identification on the current text document.

Specifically, the second processing unit 502 is configured to build a dictionary from the labeled words in a subset, to label the training data in the subset according to the labeled words in this dictionary, and then to build the sequence classifier corresponding to the subset based on the training data in the labeled subset and the feature word set.

When relabeling a subset with new concepts, the second processing unit 502 is further configured to add a new concept to the dictionary when the frequency of occurrence of that concept is higher than a preset threshold, and to relabel the training data in the subset according to the updated dictionary.

Preferably, when dividing the training dataset into subsets, the first processing unit 501 is configured so that the amount of training data in each resulting subset is greater than a preset threshold.

Based on the above technical solution, in the embodiments of the invention the training dataset is divided into at least two subsets, and collaborative learning is performed based on the labeled training data in the subsets and on the feature word set extracted from the training dataset, so that at least two (that is, multiple) sequence classifiers are built. Each of these sequence classifiers is then applied to the current text document, and the concepts contained in the current text document are determined from the concepts identified by each sequence classifier. This overcomes the technical problem of existing concept identification methods, namely low identification quality when the sequence classifier is built from a partially labeled training dataset. Sequence classifiers built from partially labeled training data can thus also achieve the effect obtained with fully labeled training data, which improves concept identification quality, especially when the training data are only partially labeled. In addition, determining the concepts contained in a text document by applying multiple sequence classifiers to it realizes concept identification from multiple perspectives, which improves both identification accuracy and recall.

Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is also intended to cover them.

Claims

1. A concept identification method based on collaborative learning, characterized by comprising:

2. The method of claim 1, characterized in that performing collaborative learning based on the training data in the subsets and the feature word set, to build at least two sequence classifiers, comprises:

building the sequence classifier corresponding to each subset based on the training data in that subset and the feature word set;

performing concept identification on the subsets with each sequence classifier, and relabeling the subsets with the newly identified concepts;

rebuilding each sequence classifier based on the training data in the relabeled subsets and the feature word set, and performing concept identification on each relabeled subset with each rebuilt sequence classifier in turn.

3. The method of claim 2, characterized in that building the sequence classifier corresponding to a subset based on the training data in that subset and the feature word set comprises:

building a dictionary from the labeled words in the subset;

labeling the training data in the subset according to the labeled words in the dictionary, and then building the sequence classifier corresponding to the subset based on the training data in the labeled subset and the feature word set.

4. The method of claim 3, characterized in that relabeling the subsets with the newly identified concepts comprises:

adding a new concept to the dictionary when the frequency of occurrence of that new concept is higher than a preset threshold;

relabeling the training data in the subset according to the updated dictionary.

5. the method for claim 1 is characterized in that, determines the concept that described current text document comprises according to the concept that each sequence sorter identifies, and comprising:

To be defined as by the concept that at least one sequence sorter identifies the concept that described current text document comprises; And/or,

During the second concept that the first concept that identifies at a sequence sorter identifies for another sequence sorter a part of, described the second concept is defined as the concept that described current text document comprises;

And/or,

When the part of the second concept that the part of the first concept that identifies at a sequence sorter and another sequence sorter identify is identical, will described the first concept and the combination of the second concept after be defined as the concept that described current text document comprises.

6. The method of any one of claims 1-5, characterized in that collaborative learning is performed based on the training data in each subset and the feature word set, and a sequence classifier is built for each subset.

7. The method of any one of claims 1-5, characterized in that dividing the training dataset into subsets comprises: making the amount of training data in each resulting subset greater than a preset threshold.

8. A concept identification device based on collaborative learning, characterized by comprising:

9. The device of claim 8, characterized in that the second processing unit is specifically configured to build the sequence classifier corresponding to each subset based on the training data in that subset and the feature word set, to perform concept identification on the subsets with each sequence classifier, to relabel the subsets with the newly identified concepts, to rebuild each sequence classifier based on the training data in each relabeled subset and the feature word set, and to perform concept identification on each relabeled subset with each rebuilt sequence classifier in turn.

10. The device of claim 9, characterized in that the second processing unit is specifically configured to build a dictionary from the labeled words in a subset, to label the training data in the subset according to the labeled words in the dictionary, and then to build the sequence classifier corresponding to the subset based on the training data in the labeled subset and the feature word set.

11. The device of claim 10, characterized in that, when relabeling the subsets with the newly identified concepts, the second processing unit is further configured to add a new concept to the dictionary when the frequency of occurrence of that new concept is higher than a preset threshold, and to relabel the training data in the subset according to the updated dictionary.

12. The device of claim 8, characterized in that, when determining the concepts contained in the current text document from the concepts identified by each sequence classifier, the third processing unit is specifically configured to: determine a concept identified by at least one sequence classifier as a concept contained in the current text document; and/or, when a first concept identified by one sequence classifier is part of a second concept identified by another sequence classifier, determine the second concept as a concept contained in the current text document; and/or, when part of a first concept identified by one sequence classifier is identical to part of a second concept identified by another sequence classifier, determine the combination of the first concept and the second concept as a concept contained in the current text document.

13. The device of any one of claims 8-12, characterized in that the second processing unit is specifically configured to perform collaborative learning based on the training data in each subset and the feature word set, and to build a sequence classifier for each subset.

14. The device of any one of claims 8-12, characterized in that, when dividing the training dataset into subsets, the first processing unit is specifically configured so that the amount of training data in each resulting subset is greater than a preset threshold.