CN105095156A - Model generation method used for data annotation, data annotation method, model generation device used for data annotation and data annotation device - Google Patents


Info

Publication number
CN105095156A
Authority
CN
China
Prior art keywords
feature
model
observed value
restricted
candidates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510428997.5A
Other languages
Chinese (zh)
Other versions
CN105095156B (en)
Inventor
全宗峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510428997.5A priority Critical patent/CN105095156B/en
Publication of CN105095156A publication Critical patent/CN105095156A/en
Application granted granted Critical
Publication of CN105095156B publication Critical patent/CN105095156B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a model generation method for data annotation, a data annotation method, a model generation device for data annotation, and a data annotation device. The model generation method comprises the following steps: acquiring a training corpus and establishing a restricted candidate tag set for each observed value in the corpus; selecting a feature template such that, for each feature function, the number of nonzero coefficients at an observed value equals the number of elements in the restricted candidate tag set corresponding to that observed value; building a lattice according to the restricted candidate tag sets and the feature template; and generating the model for data annotation according to the restricted candidate tag sets and the lattice. The method increases model generation speed and reduces the data volume of the model, thereby providing a basis for fast decoding.

Description

Model generation method for data annotation, data annotation method, and corresponding devices
Technical field
The present invention relates to the field of data processing, and in particular to a model generation method for data annotation, a data annotation method, and corresponding devices.
Background technology
Current data annotation methods commonly use Conditional Random Fields (CRF), whose tag list contains all possible candidate tags. When the candidate tag set has many elements, CRF training and decoding become very slow. For example, when annotating Chinese characters with pinyin in a speech system, there are more than 1000 distinct pinyin readings; with the traditional CRF approach, the candidate tag set of each character therefore contains more than 1000 elements. In that case, model training uses tens of gigabytes of memory, and even annotating a short sentence takes seconds to decode, severely affecting CRF training and decoding speed.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, one object of the present invention is to propose a model generation method for data annotation that can increase model generation speed and reduce the data volume of the model, thereby providing a basis for fast decoding.
Another object of the present invention is to propose a data annotation method that can increase decoding speed.
Another object of the present invention is to propose a model generation device for data annotation.
A further object of the present invention is to propose a data annotation device.
To achieve the above objects, embodiments of the first aspect of the present invention propose a model generation method for data annotation, comprising: acquiring a training corpus, and establishing a restricted candidate tag set for each observed value in the corpus; selecting a feature template such that, for each feature function, the number of nonzero coefficients at an observed value equals the number of elements in the restricted candidate tag set corresponding to that observed value; building a lattice according to the restricted candidate tag sets and the feature template; and generating a model for data annotation according to the restricted candidate tag sets and the lattice.
In the model generation method proposed by embodiments of the first aspect, establishing restricted candidate tag sets limits the number of candidate tags, and selecting a template that meets the above requirement limits the number of nonzero coefficients. Both reduce the amount of computation, which increases model generation speed and reduces the data volume of the model, providing a basis for fast decoding.
To achieve the above objects, embodiments of the second aspect of the present invention propose a data annotation method, comprising: obtaining a pre-saved model generated by the method of any embodiment of the first aspect; obtaining an observation sequence to be annotated; and annotating the observation sequence according to the model.
In the data annotation method proposed by embodiments of the second aspect, fast decoding can be achieved by annotating with the above model.
To achieve the above objects, embodiments of the third aspect of the present invention propose a model generation device for data annotation, comprising: an acquisition module, configured to acquire a training corpus and establish a restricted candidate tag set for each observed value in the corpus; a selection module, configured to select a feature template such that, for each feature function, the number of nonzero coefficients at an observed value equals the number of elements in the restricted candidate tag set corresponding to that observed value; a building module, configured to build a lattice according to the restricted candidate tag sets and the feature template; and a generation module, configured to generate a model for data annotation according to the restricted candidate tag sets and the lattice.
In the model generation device proposed by embodiments of the third aspect, establishing restricted candidate tag sets limits the number of candidate tags, and selecting a template that meets the above requirement limits the number of nonzero coefficients. Both reduce computation, increasing model generation speed and reducing the data volume of the model, thereby providing a basis for fast decoding.
To achieve the above objects, embodiments of the fourth aspect of the present invention propose a data annotation device, comprising: a first acquisition module, configured to obtain a pre-saved model generated by the method of any embodiment of the first aspect; a second acquisition module, configured to obtain an observation sequence to be annotated; and an annotation module, configured to annotate the observation sequence according to the model.
In the data annotation device proposed by embodiments of the fourth aspect, fast decoding can be achieved by annotating with the above model.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from the description, or may be learned by practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow diagram of a model generation method for data annotation proposed by an embodiment of the present invention;
Fig. 2 is a flow diagram of building a lattice in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the composition of the lattice built in an embodiment of the present invention;
Fig. 4 is a flow diagram of generating the model from the restricted candidate tag sets and the lattice in an embodiment of the present invention;
Fig. 5 is a flow diagram of a data annotation method proposed by another embodiment of the present invention;
Fig. 6 is a flow diagram of annotating according to the model in an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a model generation device for data annotation proposed by another embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a data annotation device proposed by another embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the drawings, in which identical or similar reference numerals denote, throughout, identical or similar modules or modules with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, intended only to explain the present invention, and are not to be construed as limiting it. On the contrary, the embodiments of the invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
Fig. 1 is a flow diagram of the model generation method for data annotation proposed by an embodiment of the present invention. The method can be applied in the training stage of data annotation and comprises:
S11: acquire a training corpus, and establish a restricted candidate tag set for each observed value in the corpus.
Sentences that have already been annotated can be collected from existing resources and used as the training corpus.
The method of this embodiment is applicable when the overall candidate tag set is large but each individual observation has few candidate tags, as in pinyin annotation of Chinese characters. The embodiments of the present invention take pinyin annotation as the example scenario, e.g. annotating Chinese characters with pinyin in a speech system.
Suppose that, after preprocessing for the CRF method, the following corpus is obtained (each line is a Chinese character and its pinyin; the sentence is 我们睡着了, "we fell asleep"):
我 wo3
们 men5
睡 shui4
着 zhao2
了 le5
Here wo3 means that wo is pronounced in the third tone, men5 means that men is pronounced in the neutral tone, and so on.
The sentence 我们睡着了 ("we fell asleep") is called the observation sequence, each Chinese character in it an observed value, and the sequence of corresponding pinyins the tag sequence.
A restricted candidate tag set is a set whose number of elements is below a threshold. It contains one or more elements, each of which is a tag. For pinyin annotation of Chinese characters, each observed value in the corpus is a Chinese character, and each tag is a pinyin reading of that character.
When establishing the restricted candidate tag sets, the pinyins of the characters in the corpus can be counted so that all pinyins that actually occur are collected; a Chinese character dictionary can additionally be consulted to detect and fill in missing readings, improving the accuracy of the restricted candidate tag sets.
The traditional CRF algorithm uses unrestricted candidate tag sets: for a given Chinese character, the candidate tag set contains more than 1000 elements.
In this embodiment, by contrast, the candidate tag set is restricted: for a given Chinese character, the elements of its restricted candidate tag set are only the readings that character actually has, which greatly reduces the number of elements. For example, among the more than 20,000 Chinese characters in the GBK encoding standard, fewer than 2,000 are polyphonic; more than 90% of characters have only one candidate pinyin (one element in the restricted candidate tag set), and even for the roughly 2,000 polyphonic characters, taking tone sandhi and neutral-tone rules into account, a single character has no more than 10 candidate pinyins.
Therefore, with restricted candidate tag sets, no set has more than 10 elements. Compared with the more than 1000 elements of the traditional approach, this drastically reduces the number of elements per candidate set and thus the time overhead of model training and decoding.
For the observation sequence 我们睡着了, restricted candidate tag sets are established for each character, for example {men2, men5} for the character 们 (see Fig. 3).
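The set-building step described above can be sketched as follows. This is a minimal illustration assuming the corpus is held as (character, pinyin) pairs and that a small pinyin dictionary is available for gap-filling; the function and data names are hypothetical, not from the patent.

```python
from collections import defaultdict

def build_restricted_tag_sets(corpus, dictionary=None):
    """Collect, per observed value (Chinese character), the set of pinyin
    tags seen in the corpus; optionally merge in dictionary readings to
    detect and fill in missing ones, as the embodiment suggests."""
    tag_sets = defaultdict(set)
    for sentence in corpus:
        for char, pinyin in sentence:
            tag_sets[char].add(pinyin)
    if dictionary:
        for char, readings in dictionary.items():
            if char in tag_sets:
                tag_sets[char].update(readings)
    return dict(tag_sets)

# Toy corpus from the embodiment: the sentence 我们睡着了
corpus = [[("我", "wo3"), ("们", "men5"), ("睡", "shui4"),
           ("着", "zhao2"), ("了", "le5")]]
# Hypothetical dictionary entry filling in the second reading of 们
dictionary = {"们": {"men2", "men5"}}
tag_sets = build_restricted_tag_sets(corpus, dictionary)
# tag_sets["们"] is {"men2", "men5"}; tag_sets["我"] is {"wo3"}
```

Even over a full corpus, each resulting set stays small (at most about 10 readings per character), which is the point of the restriction.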
S12: select a feature template such that, for each feature function, the number of nonzero coefficients at an observed value equals the number of elements in the restricted candidate tag set corresponding to that observed value.
Feature templates also exist in the traditional CRF algorithm: the corpus is processed according to the templates to obtain features, which are then used in subsequent computation.
The traditional CRF algorithm places no restriction on the feature templates, whereas this embodiment selects only templates that meet a certain requirement and excludes those that do not.
In some embodiments, the feature templates include first-order feature templates, and a selected first-order template satisfies the following condition: each feature function obtained from the template has nonzero coefficients only at the elements of the restricted candidate tag set, and coefficient zero at every other tag.
In the CRF algorithm, feature templates include first-order and second-order templates; this embodiment can use the second-order templates of the traditional CRF algorithm unchanged.
The first-order feature templates selected in this embodiment can be written as:
U_ij: %x[i_1,j_1]/%x[i_2,j_2]/.../%x[0,0]/.../%x[i_m,j_m]    (1)
The first-order feature templates excluded in this embodiment can be written as:
U_lk: %x[l_1,k_1]/%x[l_2,k_2]/.../%x[l_m,k_m]    (2)
where l_1, l_2, ..., l_m are not all 0, and i, j, l, k denote the subscripts of the templates. The essential difference is that a template of form (1) contains the current-observation term %x[0,0], while a template of form (2) does not.
Principle: when using CRF for data annotation, let X = x_1, x_2, ..., x_n be the observation sequence, Y = y_1, y_2, ..., y_n the tag sequence, f_k(y_{i-1}, y_i, X, i) the feature functions, and F_k = Σ_i f_k(y_{i-1}, y_i, X, i). Then, given the observation sequence X, the probability that the corresponding tag sequence is Y can be written as:
P(Y|X) = exp(Σ_k λ_k F_k) / Σ_Y exp(Σ_k λ_k F_k)
where λ_k is the coefficient of feature function f_k, i.e. a parameter to be computed by the CRF training process. In data annotation, λ_k can be understood as the coefficient a feature function takes at a candidate tag.
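As a quick numeric illustration of the formula above, the sketch below scores two candidate tag sequences; the λ_k and F_k values are made up for the example and are not from the patent.

```python
import math

# Toy illustration of P(Y|X) = exp(sum_k λ_k F_k) / Z over two candidate
# tag sequences for the observations 我们.
lambdas = [0.5, 1.2]
F = {
    ("wo3", "men2"): [1.0, 0.0],  # feature totals F_k for this labeling
    ("wo3", "men5"): [1.0, 1.0],
}
scores = {y: math.exp(sum(l * f for l, f in zip(lambdas, fk)))
          for y, fk in F.items()}
Z = sum(scores.values())               # partition function over all Y
probs = {y: s / Z for y, s in scores.items()}
# probs sums to 1; the labeling with the larger weighted feature total wins
```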
In step S11, the restricted candidate tag sets of the Chinese characters have already been established.
An illustration: when a feature template of form (1) is selected, each feature function obtained has nonzero coefficients only at the tags belonging to the restricted candidate tag set of the corresponding Chinese character, and value 0 at all other tags.
For example, take the feature template U00:%x[0,0], which is of form (1).
Applying this template to the row 们 men5 of the corpus yields the feature "U00:们".    (3)
Take the feature template U01:%x[-1,0]/%x[0,0], which is also of form (1).
Applying it to the row 们 men5 of the corpus yields the feature "U01:我/们".    (4)
Observing features (3) and (4), one finds that the condition
the observed value of the current row is 们
is a necessary condition for feature (3) or (4) to occur. That is, whenever feature (3) or (4) occurs, the current observed value must be 们; and since the character 们 has been restricted in S11 to the tag set {men2, men5}, the coefficient corresponding to feature (3) or (4) takes a nonzero value only at the tags men2 or men5, and is 0 at all other tags.
It can be proved by contradiction that every feature obtained with a template of form (1) has this property: the tag coefficients of the feature take nonzero values only at the tags in the restricted candidate tag set of the observed value, and 0 at all other tags. This property is here called feature shrinkage.
When all features at an observed value x_i have the shrinkage property and all shrink to the restricted candidate tag set of x_i, the number of nonzero tag coefficients of those features exactly equals the number of elements in the restricted candidate tag set of x_i, far smaller than the total number of candidate tags. This reduces the parameters to be computed and accelerates the computation.
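The shrinkage argument can be illustrated with a small sketch. The template representation here (a list of row offsets into the character column) and all names are illustrative assumptions, not the patent's implementation.

```python
def expand_template(offsets, sequence, i):
    """Instantiate a feature template (a list of row offsets into the
    character column) at position i; returns None when an offset falls
    outside the sequence."""
    parts = []
    for row in offsets:
        j = i + row
        if not 0 <= j < len(sequence):
            return None
        parts.append(sequence[j])
    return "/".join(parts)

tag_sets = {"我": {"wo3"}, "们": {"men2", "men5"}}
seq = ["我", "们"]

# A form-(1) template contains offset 0, so the feature pins down the
# current observation; its nonzero coefficients live only on the
# restricted candidate tag set of that observation.
feature = expand_template([-1, 0], seq, 1)   # feature (4): "我/们"
support = tag_sets[seq[1]]                   # nonzero only on {men2, men5}

# A form-(2) template (no offset 0) leaves the current character free,
# so the same feature string can occur at any character and its
# coefficients are not confined to one restricted set.
feature2 = expand_template([-1], seq, 1)     # feature (5): "我"
```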
The following illustrates that features obtained with a template of form (2) do not have this property.
Take the template Ux0:%x[-1,0], which is of form (2).
Applying it to the row 们 men5 of the corpus yields the feature "Ux0:我".    (5)
Consider an observation sequence 我T, where T is some unknown Chinese character. At T, the template Ux0:%x[-1,0] also produces feature (5). Since T can be any Chinese character (traditional CRF does not restrict T, and even with restricted tag sets T itself is not fixed), the tag coefficient of feature (5) may take a nonzero value on the candidate tag set of every Chinese character, which amounts to all tags. Features obtained with a template of form (2) therefore lack the shrinkage property of features obtained with form (1).
S13: process the corpus with the selected feature templates to obtain the features of the corresponding observed values, and build a lattice according to the restricted candidate tag sets and the features.
After the feature templates are selected, they can be used to obtain the corresponding features. For example, applying the template U00:%x[0,0] to the row 们 men5 of the corpus yields the feature "U00:们", and applying U01:%x[-1,0]/%x[0,0] to the same row yields "U01:我/们".
Through S11 and S12, the obtained features have the following properties:
(1) shrinkage: the coefficient of a feature function takes nonzero values only on the restricted candidate tag set of the observed value;
(2) consistent shrinkage: all feature functions produced at an observed value x_i shrink to the same restricted candidate tag set of x_i, i.e. their coefficients have nonzero values only on that set.
Using these properties, and referring to Fig. 2, the flow of building the lattice can comprise:
S21: for each observed value, establish node data according to its restricted candidate tag set.
Each element of the restricted candidate tag set corresponding to an observed value can serve as a node datum of that observed value.
For example, referring to Fig. 3, the node data of 我 comprise wo3, the node data of 们 comprise men2 and men5, and so on.
S22: for each observed value, apply the feature templates to obtain the corresponding features.
Specifically, the feature templates can be the first-order templates satisfying formula (1); applying them to each row of the corpus yields the corresponding features, which can then be assembled into feature sets.
For example, for the observed value 们, applying a first-order template satisfying formula (1) to the row 们 men5 of the corpus yields at least one feature; these features form a feature set, and each feature set can be associated with the corresponding node data. The remaining node data are handled similarly, as shown by the dashed lines in Fig. 3.
S23: take the node data and the features as parameters of the lattice being built, and obtain the lattice.
Since a lattice also contains other parameters, building it further involves obtaining those remaining parameters.
For example, the feature templates can further include second-order templates (identical to those of traditional CRF), from which the transfer relationships between node data can be obtained, as shown by the solid lines between the node data in Fig. 3.
The remaining steps of building the lattice can be carried out in the traditional CRF manner.
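Steps S21-S23 can be sketched as follows. This is a simplified illustration under the assumption that templates are lists of row offsets and that transitions simply connect adjacent candidates; the real lattice carries further parameters, as noted above.

```python
from itertools import product

def build_lattice(sequence, tag_sets, templates):
    """Minimal sketch of S21-S23: nodes from the restricted candidate
    tag sets, per-position feature sets from first-order templates, and
    transitions between candidates of adjacent positions."""
    nodes = [sorted(tag_sets[obs]) for obs in sequence]          # S21
    features = []                                                # S22
    for i in range(len(sequence)):
        feats = set()
        for name, offsets in templates.items():
            parts = [sequence[i + r] for r in offsets
                     if 0 <= i + r < len(sequence)]
            if len(parts) == len(offsets):       # all offsets in range
                feats.add(name + ":" + "/".join(parts))
        features.append(feats)
    transitions = [list(product(nodes[i], nodes[i + 1]))         # S23
                   for i in range(len(sequence) - 1)]
    return nodes, features, transitions

tag_sets = {"我": {"wo3"}, "们": {"men2", "men5"}}
templates = {"U00": [0], "U01": [-1, 0]}     # form-(1) templates
nodes, feats, trans = build_lattice(["我", "们"], tag_sets, templates)
# nodes mirrors Fig. 3: [["wo3"], ["men2", "men5"]]
```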
S14: generate the model for data annotation according to the restricted candidate tag sets and the lattice.
In some embodiments, referring to Fig. 4, generating the model for data annotation according to the restricted candidate tag sets and the lattice comprises:
S41: take the observed values and their restricted candidate tag sets as parameters of the model, and store them in correspondence.
For example, store the correspondence 们 -> {men2, men5}, and so on.
S42: compute the coefficients of the features on the node data, take the elements of the restricted candidate tag sets as candidate tags, and store the features, candidate tags, and coefficients in correspondence as parameters of the model.
As shown in Fig. 3, the node data are the tags men2, men5, etc., and the traditional CRF algorithm can be used to compute the coefficient λ_k of each feature at each tag.
The features and candidate tags can be obtained from the built lattice; after the coefficients are computed, these parameters can be stored in correspondence.
For example, store: "U00:们" -> men2 -> first coefficient, "U00:们" -> men5 -> second coefficient, and so on.
S43: save the selected feature templates as parameters of the model.
For example, save the feature template U00:%x[0,0], etc.
It is understood that the order in which the above items are stored is not limited.
Of course, the model obtained by the CRF training process may also include other parameters, which can be obtained with the algorithms of the traditional CRF training process.
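One hypothetical way to lay out the stored parameters of S41-S43 is shown below; the names and values are illustrative and are not the patent's actual storage format.

```python
# Hypothetical layout of the stored model parameters named in S41-S43.
model = {
    # S41: observed value -> restricted candidate tag set
    "tag_sets": {"我": ["wo3"], "们": ["men2", "men5"]},
    # S42: (feature, candidate tag) -> coefficient; entries exist only
    # for tags in the restricted set (feature shrinkage)
    "weights": {("U00:们", "men2"): 0.7, ("U00:们", "men5"): 1.3},
    # S43: the selected first-order feature templates as row offsets
    "templates": {"U00": [0], "U01": [-1, 0]},
}

def coefficient(model, feature, tag):
    """A feature's coefficient at a tag; implicitly zero outside the
    restricted candidate tag set."""
    return model["weights"].get((feature, tag), 0.0)
```

Storing only the (feature, tag) pairs inside the restricted sets is what keeps the model's data volume small compared with storing a coefficient for every one of the 1000+ tags.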
It should be noted that other training steps not covered in this embodiment can be realized with the corresponding steps of traditional CRF.
In this embodiment, establishing restricted candidate tag sets limits the number of candidate tags, and selecting templates that meet the above requirement limits the number of nonzero coefficients. Both reduce computation, increasing model generation speed and reducing the data volume of the model, thereby providing a basis for fast decoding.
After the model is obtained by training, it can be used for decoding, i.e. annotating Chinese characters to be labeled with pinyin.
Fig. 5 is a flow diagram of the data annotation method proposed by another embodiment of the present invention. The method comprises:
S51: obtain the pre-saved model.
The model can be generated with the method of the embodiment shown above, which is not repeated here.
S52: obtain the observation sequence to be annotated.
The observation sequence to be annotated can consist of observed values to be annotated. For pinyin annotation, each observed value to be annotated is a Chinese character awaiting phonetic notation, obtained for example in a speech system.
It is understood that the order of S51 and S52 is not limited.
S53: annotate the observation sequence to be annotated according to the model.
When annotating with the model, the feature-determination flow and the lattice-building flow can be adapted from the training flow, and the remaining flows can follow the traditional CRF decoding process.
In some embodiments, referring to Fig. 6, annotating the observation sequence to be annotated according to the model comprises:
S61: for each observed value to be annotated, obtain its restricted candidate tag set according to the correspondence between observed values and restricted candidate tag sets in the model, and establish node data from the set.
For example, denote the observation sequence to be annotated by X' = x'_1, x'_2, ..., x'_n. For each observed value x'_i, the corresponding restricted candidate tag set is obtained from the model, and each element of the set serves as a node datum of the lattice.
It is understood that if no restricted candidate tag set is found in the model for an observed value, subsequent processing can ignore that observed value, which then does not participate in subsequent computation.
S62: for each observed value to be annotated, obtain features by applying the feature templates in the model, and select from them the features that exist in the model.
After the feature templates are obtained from the model, the corresponding features can be obtained for each Chinese character to be annotated; the features that occurred during training are then selected from them and assembled into feature sets.
S63: take the node data as candidate tags, and determine the coefficients of the candidate tags corresponding to the established node data and the selected features, according to the correspondence of candidate tags, features, and coefficients stored in the model.
For example, through S61, node data are established for each observed value x'_i to be annotated, each node datum being a pinyin reading. The features corresponding to each node datum are obtained through S62. Since the model obtained by training records the correspondence among pinyin tags, features, and coefficients, the coefficient of each candidate tag (node datum) can be obtained.
S64: establish the transfer relationships between candidate tags.
For example, besides the above first-order templates, the feature templates saved in the model also include second-order templates, from which the transfer relationships between candidate tags can be established.
S65: annotate the observation sequence to be annotated according to the transfer relationships and the coefficients of the candidate tags.
For example, the transfer relationships between candidate tags form tag combinations; the overall coefficient of each combination can be determined from the coefficients of its candidate tags, and the final annotation can then be completed according to the overall coefficients.
It is understood that the other computations of decoding can be realized with the traditional CRF algorithm.
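The decoding over the restricted lattice can be sketched with a standard Viterbi search. The scoring functions below are toy stand-ins for the feature coefficients and transfer relationships read from the model; the names are illustrative assumptions.

```python
def viterbi(sequence, tag_sets, emit, trans):
    """Viterbi decoding over the restricted lattice: each position only
    scores its restricted candidate tags, so the search is over a
    handful of readings per character rather than 1000+ pinyins."""
    prev = {t: emit(0, t) for t in tag_sets[sequence[0]]}
    back = []
    for i in range(1, len(sequence)):
        cur, ptr = {}, {}
        for t in tag_sets[sequence[i]]:
            best = max(prev, key=lambda p: prev[p] + trans(p, t))
            cur[t] = prev[best] + trans(best, t) + emit(i, t)
            ptr[t] = best
        prev = cur
        back.append(ptr)
    tag = max(prev, key=prev.get)          # best final tag
    path = [tag]
    for ptr in reversed(back):             # follow the back-pointers
        tag = ptr[tag]
        path.append(tag)
    return list(reversed(path))

tag_sets = {"我": {"wo3"}, "们": {"men2", "men5"}}
emit = lambda i, t: 1.0 if t == "men5" else 0.0   # toy unary scores
trans = lambda a, b: 0.0                           # toy transition scores
path = viterbi(["我", "们"], tag_sets, emit, trans)
# path is ["wo3", "men5"]
```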
In this embodiment, annotating with the above model enables fast decoding. In addition, selecting feature templates that meet the requirement reduces the number of features, so training and decoding time both drop significantly and overall data-processing efficiency improves markedly. When a whole sentence is annotated, the probability of the sentence can be obtained, making it convenient to set a threshold and exclude results whose probability falls below it. Because the selected feature templates realize feature shrinkage, a high accuracy rate can be reached with only a small amount of training data.
Fig. 7 is a schematic structural diagram of the model generation device for data annotation proposed by another embodiment of the present invention. The device 70 comprises:
an acquisition module 71, configured to acquire a training corpus and establish a restricted candidate tag set for each observed value in the corpus.
Sentences that have already been annotated can be collected from existing resources and used as the training corpus.
The device of this embodiment is applicable when the overall candidate tag set is large but each individual observation has few candidate tags, as in pinyin annotation of Chinese characters. The embodiments of the present invention take pinyin annotation as the example scenario, e.g. annotating Chinese characters with pinyin in a speech system.
Suppose that, after preprocessing for the CRF method, the following corpus is obtained (each line is a Chinese character and its pinyin; the sentence is 我们睡着了, "we fell asleep"):
我 wo3
们 men5
睡 shui4
着 zhao2
了 le5
Here wo3 means that wo is pronounced in the third tone, men5 means that men is pronounced in the neutral tone, and so on.
The sentence 我们睡着了 is called the observation sequence, each Chinese character in it an observed value, and the sequence of corresponding pinyins the tag sequence.
A restricted candidate tag set is a set whose number of elements is below a threshold. It contains one or more elements, each of which is a tag. For pinyin annotation, each observed value in the corpus is a Chinese character, and each tag is a pinyin reading of that character.
When establishing the restricted candidate tag sets, the pinyins of the characters in the corpus can be counted so that all pinyins that actually occur are collected; a Chinese character dictionary can additionally be consulted to detect and fill in missing readings, improving the accuracy of the restricted candidate tag sets.
The traditional CRF algorithm uses unrestricted candidate tag sets: for a given Chinese character, the candidate tag set contains more than 1000 elements.
In this embodiment, by contrast, the candidate tag set is restricted: for a given Chinese character, the elements of its restricted candidate tag set are only the readings that character actually has, which greatly reduces the number of elements. For example, among the more than 20,000 Chinese characters in the GBK encoding standard, fewer than 2,000 are polyphonic; more than 90% of characters have only one candidate pinyin, and even for the roughly 2,000 polyphonic characters, taking tone sandhi and neutral-tone rules into account, a single character has no more than 10 candidate pinyins.
Therefore, with restricted candidate tag sets, no set has more than 10 elements. Compared with the more than 1000 elements of the traditional approach, this drastically reduces the number of elements per candidate set and thus the time overhead of model training and decoding.
For the observation sequence 我们睡着了, restricted candidate tag sets are established for each character, for example {men2, men5} for the character 们.
Selection module 72, configured to select feature templates such that, for each observed value, the number of nonzero coefficients of the feature functions equals the number of elements in the restricted candidate set corresponding to that observed value.
The feature templates include first-order feature templates, and a selected first-order feature template satisfies the following condition:
the feature function obtained from the template has nonzero coefficients only at the elements of the restricted candidate set, and zero coefficients at all other labels.
Feature templates also exist in the traditional CRF algorithm: the corpus is processed according to the templates to obtain features, which are then used in subsequent computation.
The traditional CRF algorithm places no restriction on the feature templates, whereas in this embodiment only templates meeting the condition above are selected, and unqualified templates are excluded.
In some embodiments the feature templates include first-order feature templates, and the selected first-order templates satisfy the condition above: the resulting feature functions have nonzero coefficients only at the elements of the restricted candidate set, and zero coefficients at all other labels.
In the CRF algorithm the feature templates comprise first-order and second-order templates; the second-order templates of the traditional CRF algorithm can be used unchanged in this embodiment.
The first-order feature templates selected in this embodiment can be written as:
U_ij: %x[i_1,j_1]/%x[i_2,j_2]/.../%x[0,0]/.../%x[i_m,j_m]   (1)
and the first-order feature templates to be excluded in this embodiment as:
U_lk: %x[l_1,k_1]/%x[l_2,k_2]/.../%x[l_m,k_m]   (2)
where l_1, l_2, ..., l_m are all nonzero, i.e., template (2) contains no %x[0,0] term and never references the current observed value, and i, j, l, k denote the template index subscripts.
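Under the reading above (a template qualifies only if it references the current observed value), the selection test can be sketched as follows; `is_selected` is a hypothetical helper, not part of the original method:

```python
import re

def is_selected(template):
    """A first-order template matches formula (1) iff at least one of its
    %x[row,col] terms has row offset 0, i.e. it looks at the current
    observed value; templates whose row offsets are all nonzero match
    formula (2) and are excluded."""
    rows = [int(r) for r, _ in re.findall(r"%x\[(-?\d+),(-?\d+)\]", template)]
    return any(r == 0 for r in rows)

print(is_selected("U01:%x[-1,0]/%x[0,0]"))  # True  -> kept
print(is_selected("U02:%x[-1,0]/%x[1,0]"))  # False -> excluded
```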
For the detailed principle, see the corresponding description of the method embodiments; it is not repeated here.
Building module 73, configured to build a lattice according to the restricted candidate sets and the feature templates.
After a feature template is selected, it can be applied to obtain the corresponding features. For example, applying the template "U00:%x[0,0]" to the "们 men5" row of the corpus yields the feature "U00:们", and applying the template "U01:%x[-1,0]/%x[0,0]" to the same row yields the feature "U01:我/们".
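The expansion in this example follows the CRF++-style "%x[row,col]" convention; a minimal sketch with romanized characters, in which the `expand` helper is an illustrative stand-in for the real feature-extraction step:

```python
import re

def expand(template, rows, t):
    """Expand a CRF++-style unigram template at position t of the
    corpus rows, yielding a feature string such as 'U01:wo/men'."""
    name, body = template.split(":", 1)
    def repl(m):
        pos, col = t + int(m.group(1)), int(m.group(2))
        # Out-of-range positions get a boundary placeholder.
        return rows[pos][col] if 0 <= pos < len(rows) else "_B"
    return name + ":" + re.sub(r"%x\[(-?\d+),(-?\d+)\]", repl, body)

# The five corpus rows, romanized; column 0 holds the character.
rows = [["wo"], ["men"], ["shui"], ["zhao"], ["le"]]
print(expand("U00:%x[0,0]", rows, 1))           # U00:men
print(expand("U01:%x[-1,0]/%x[0,0]", rows, 1))  # U01:wo/men
```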
Through S11 and S12, the obtained features have the following properties:
(1) Shrinkability: the coefficients of a feature function take nonzero values only on the restricted candidate set of the observed value.
(2) Consistent shrinkability: all feature functions produced at an observed value x_i shrink onto the restricted candidate set of x_i, i.e., they have nonzero coefficients only on that set.
In some embodiments, the building module 73 is specifically configured to:
for each observed value, establish node data according to the restricted candidate set corresponding to that observed value;
for each observed value, apply the feature templates to obtain the corresponding features; and
use the node data and the features as parameters of the lattice being built, thereby obtaining the lattice.
Here, each element of the restricted candidate set corresponding to an observed value can serve as node data for that observed value.
For example, referring to Fig. 3, the node data for "我" is "wo3", the node data for "们" are "men2" and "men5", and so on.
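Node construction from the restricted candidate sets can be sketched as follows (romanized characters; `build_nodes` is an illustrative name, not from the original method):

```python
def build_nodes(sequence, restricted_sets):
    """Each observed value contributes one lattice node per element of
    its restricted candidate set (cf. the men2/men5 example above)."""
    return [sorted(restricted_sets.get(x, ())) for x in sequence]

restricted = {"wo": {"wo3"}, "men": {"men2", "men5"}}
print(build_nodes(["wo", "men"], restricted))  # [['wo3'], ['men2', 'men5']]
```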
For example, the feature templates may specifically be the first-order templates satisfying formula (1); applying them to each row of the corpus yields the corresponding features, which can then be collected into feature sets.
For instance, for the observed value "们", applying a first-order template satisfying formula (1) to the "们 men5" row of the corpus yields at least one feature; these features can form a feature set, and the feature set can be associated with each node data of that position. The remaining node data are handled similarly, as shown by the dashed lines in Fig. 3.
Since a lattice also contains other parameters, building the lattice further includes obtaining those remaining parameters.
For example, the feature templates may also include second-order templates (identical to those of traditional CRF), from which the transition relations between node data can be obtained, as shown by the solid lines in Fig. 3.
The remaining steps of building the lattice can be carried out in the traditional CRF manner.
Generation module 74, configured to generate the model for data annotation according to the restricted candidate sets and the lattice.
In some embodiments, the generation module 74 is specifically configured to:
save the observed values and the restricted candidate sets, in correspondence, as parameters of the model;
compute the coefficient of each feature on each node, take the elements of the restricted candidate sets as candidate labels, and save the features, candidate labels, and coefficients, in correspondence, as parameters of the model; and
save the selected feature templates as parameters of the model.
For example, the correspondence "们" -> {men2, men5} is stored, and so on.
Here, as shown in Fig. 3, the node data are the labels men2 and men5, and the coefficient λ_k of each feature on each label can be computed with the traditional CRF algorithm.
The features and candidate labels are available in the lattice that has been built; after the coefficients are computed, these parameters can be stored in correspondence.
For example: "U00:们" - men2 - first coefficient, "U00:们" - men5 - second coefficient, and so on.
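A minimal sketch of how the three kinds of saved parameters might be grouped; the dictionary layout and the coefficient values are illustrative assumptions, not the model's actual storage format:

```python
# Model parameters saved after training (romanized characters):
model = {
    # (a) observed value -> restricted candidate set
    "restricted": {"men": ["men2", "men5"]},
    # (b) (feature, candidate label) -> coefficient (placeholder values)
    "coeff": {("U00:men", "men2"): 0.7, ("U00:men", "men5"): 1.3},
    # (c) the selected feature templates
    "templates": ["U00:%x[0,0]", "U01:%x[-1,0]/%x[0,0]"],
}
print(model["coeff"][("U00:men", "men5")])  # 1.3
```

As the text notes, the order in which these parameters are stored is not limited; only the correspondences matter.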
It is understood that the order in which the above information is stored is not limited.
It is also understood that the model obtained by the CRF training process may contain further parameters, which can be obtained with the algorithms of the traditional CRF training process.
It should be noted that training steps not covered in this embodiment can be realized by the corresponding steps of traditional CRF.
In this embodiment, establishing restricted candidate sets limits the number of candidate labels, and selecting templates that satisfy the above requirement limits the number of nonzero coefficients. Both reduce the amount of computation, which speeds up model generation and reduces the size of the model, providing a basis for fast decoding.
After the model is obtained by training, it can be used for decoding, i.e., annotating Chinese characters to be labeled with pinyin.
Fig. 8 is a schematic structural diagram of a data annotation device proposed by another embodiment of the present invention. The device 80 comprises:
First acquisition module 81, configured to obtain a pre-saved model.
The model can be generated by the method of the embodiments above, which is not repeated here.
Second acquisition module 82, configured to obtain an observation sequence to be annotated.
The observation sequence to be annotated consists of observed values to be annotated; for pinyin annotation, each observed value is a Chinese character whose pinyin is to be determined, e.g., a character obtained in a speech system.
Annotation module 83, configured to annotate the observation sequence according to the model.
When annotating with the model, the feature-determination and lattice-building flows can be adapted from the training flow, and the remaining flows can follow the traditional CRF decoding process.
In some embodiments, the annotation module 83 is specifically configured to:
for each observed value to be annotated, obtain the corresponding restricted candidate set according to the observed-value-to-restricted-set correspondence in the model, and establish node data from the restricted candidate set;
for each observed value to be annotated, compute features with the feature templates in the model, and keep only those features that already exist in the model;
take the node data as candidate labels, and, according to the candidate-label/feature/coefficient correspondences saved in the model, determine the coefficients for the established node data and the selected features;
establish the transition relations between candidate labels; and
annotate the observation sequence according to the transition relations and the coefficients of the candidate labels.
For example, denote the observation sequence to be annotated by X' = x'_1, x'_2, ..., x'_n. For each observed value x'_i, the corresponding restricted candidate set is obtained from the model, and each element of that set becomes a node of the lattice.
It is understood that if no corresponding restricted candidate set is found in the model, subsequent processing can ignore that observed value, which then takes no part in subsequent computation.
After the feature templates are obtained from the model, the corresponding features can be computed for each Chinese character to be annotated; the features that occurred during training are then selected from them and collected into feature sets.
Through the steps above, node data are established for each observed value x'_i to be annotated, each node being one candidate pinyin, and the features corresponding to each node are obtained. Because the model records the correspondences among labels, features, and coefficients, the coefficient of each candidate label (node) can be determined.
The feature templates saved in the model include, besides the first-order templates above, second-order templates, from which the transition relations between candidate labels can be established.
According to the transition relations, label combinations can be formed; the overall score of each combination is determined from the coefficients of its candidate labels, and the final annotation is completed according to this overall score.
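The search over label combinations described above is, in effect, a Viterbi pass over the restricted lattice; a minimal sketch in which `node_score` and `trans_score` are hypothetical stand-ins for the sums of feature coefficients the model would supply:

```python
def viterbi(nodes, node_score, trans_score):
    """Pick the highest-scoring label path through a restricted lattice.
    nodes[t] lists the candidate labels at position t."""
    best = {y: node_score(0, y) for y in nodes[0]}
    back = []
    for t in range(1, len(nodes)):
        new, ptr = {}, {}
        for y in nodes[t]:
            # Best predecessor for label y at position t.
            prev = max(best, key=lambda p: best[p] + trans_score(p, y))
            new[y] = best[prev] + trans_score(prev, y) + node_score(t, y)
            ptr[y] = prev
        best, back = new, back + [ptr]
    y = max(best, key=best.get)
    path = [y]
    for ptr in reversed(back):  # follow back-pointers to recover the path
        y = ptr[y]
        path.append(y)
    return path[::-1]

nodes = [["wo3"], ["men2", "men5"]]
score = {"wo3": 1.0, "men2": 0.2, "men5": 0.9}
print(viterbi(nodes, lambda t, y: score[y], lambda a, b: 0.0))
# ['wo3', 'men5']
```

Because each position has at most about 10 candidates rather than 1000+, this pass is correspondingly cheap, which is the fast-decoding benefit the embodiment claims.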
It is understood that the other computations of decoding can be realized with the traditional CRF algorithm.
In this embodiment, annotating with the above model enables fast decoding. Moreover, selecting feature templates that meet the stated requirement reduces the number of features, so both training and decoding time drop significantly, and overall data-processing efficiency improves markedly. When a whole sentence is annotated, its probability can also be obtained, making it convenient to set a threshold and discard results whose probability falls below it. Selecting the above feature templates realizes feature shrinkage, so a high accuracy rate can be reached with only a small amount of training data.
It should be noted that in the description of the present invention the terms "first", "second", etc. are used only for descriptive purposes and shall not be understood as indicating or implying relative importance. In addition, unless otherwise stated, "a plurality of" means at least two.
Any process or method description in a flowchart, or otherwise described herein, can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing specific logical functions or steps of the process; and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those skilled in the art to which the embodiments of the present invention belong.
It should be understood that the parts of the present invention can be realized in hardware, software, firmware, or a combination thereof. In the embodiments above, a plurality of steps or methods can be realized by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, a hardware realization may, as in another embodiment, use any of the following techniques known in the art, or a combination thereof: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.
Those skilled in the art will appreciate that all or part of the steps of the method embodiments above can be completed by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and when executed it performs one of, or a combination of, the steps of the method embodiments.
In addition, the functional units of the embodiments of the present invention may be integrated into one processing module, may exist physically as separate units, or two or more units may be integrated into one module. The integrated module may be realized in hardware or as a software functional module; if realized as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, reference terms such as "an embodiment", "some embodiments", "an example", "a specific example", or "some examples" mean that specific features, structures, materials, or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. Schematic uses of these terms in this specification do not necessarily refer to the same embodiment or example, and the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been illustrated and described above, it is understood that the embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art can make changes, modifications, substitutions, and variations to the embodiments within the scope of the present invention.

Claims (14)

1. A model generation method for data annotation, characterized by comprising:
obtaining a corpus, and establishing a restricted candidate set for each observed value in the corpus;
selecting feature templates such that, for each observed value, the number of nonzero coefficients of the feature functions equals the number of elements in the restricted candidate set corresponding to that observed value;
building a lattice according to the restricted candidate sets and the feature templates; and
generating a model for data annotation according to the restricted candidate sets and the lattice.
2. The method according to claim 1, characterized in that the observed values in the corpus are Chinese characters, and the elements of the restricted candidate sets are pinyins.
3. The method according to claim 1 or 2, characterized in that the feature templates comprise first-order feature templates, and a selected first-order feature template satisfies the following condition:
the feature function obtained from the template has nonzero coefficients only at the elements of the restricted candidate set, and zero coefficients at all other labels.
4. The method according to claim 1 or 2, characterized in that building a lattice according to the restricted candidate sets and the feature templates comprises:
for each observed value, establishing node data according to the corresponding restricted candidate set;
for each observed value, applying the feature templates to obtain the corresponding features; and
using the node data and the features as parameters of the lattice being built, thereby obtaining the lattice.
5. The method according to claim 1 or 2, characterized in that generating a model for data annotation according to the restricted candidate sets and the lattice comprises:
saving the observed values and the restricted candidate sets, in correspondence, as parameters of the model;
computing the coefficient of each feature on each node, taking the elements of the restricted candidate sets as candidate labels, and saving the features, candidate labels, and coefficients, in correspondence, as parameters of the model; and
saving the selected feature templates as parameters of the model.
6. A data annotation method, characterized by comprising:
obtaining a pre-saved model generated by the method according to any one of claims 1-5;
obtaining an observation sequence to be annotated; and
annotating the observation sequence according to the model.
7. The method according to claim 6, characterized in that annotating the observation sequence according to the model comprises:
for each observed value to be annotated, obtaining the corresponding restricted candidate set according to the observed-value-to-restricted-set correspondence in the model, and establishing node data from the restricted candidate set;
for each observed value to be annotated, computing features with the feature templates in the model, and selecting from them the features existing in the model;
taking the node data as candidate labels, and determining, according to the candidate-label/feature/coefficient correspondences saved in the model, the coefficients for the established node data and the selected features;
establishing the transition relations between candidate labels; and
annotating the observation sequence according to the transition relations and the coefficients of the candidate labels.
8. A model generation device for data annotation, characterized by comprising:
an acquisition module, configured to obtain a corpus and establish a restricted candidate set for each observed value in the corpus;
a selection module, configured to select feature templates such that, for each observed value, the number of nonzero coefficients of the feature functions equals the number of elements in the restricted candidate set corresponding to that observed value;
a building module, configured to build a lattice according to the restricted candidate sets and the feature templates; and
a generation module, configured to generate a model for data annotation according to the restricted candidate sets and the lattice.
9. The device according to claim 8, characterized in that the observed values in the corpus are Chinese characters, and the elements of the restricted candidate sets are pinyins.
10. The device according to claim 8 or 9, characterized in that the feature templates comprise first-order feature templates, and a selected first-order feature template satisfies the following condition:
the feature function obtained from the template has nonzero coefficients only at the elements of the restricted candidate set, and zero coefficients at all other labels.
11. The device according to claim 8 or 9, characterized in that the building module is specifically configured to:
for each observed value, establish node data according to the corresponding restricted candidate set;
for each observed value, apply the feature templates to obtain the corresponding features; and
use the node data and the features as parameters of the lattice being built, thereby obtaining the lattice.
12. The device according to claim 8 or 9, characterized in that the generation module is specifically configured to:
save the observed values and the restricted candidate sets, in correspondence, as parameters of the model;
compute the coefficient of each feature on each node, take the elements of the restricted candidate sets as candidate labels, and save the features, candidate labels, and coefficients, in correspondence, as parameters of the model; and
save the selected feature templates as parameters of the model.
13. A data annotation device, characterized by comprising:
a first acquisition module, configured to obtain a pre-saved model generated by the method according to any one of claims 1-5;
a second acquisition module, configured to obtain an observation sequence to be annotated; and
an annotation module, configured to annotate the observation sequence according to the model.
14. The device according to claim 13, characterized in that the annotation module is specifically configured to:
for each observed value to be annotated, obtain the corresponding restricted candidate set according to the observed-value-to-restricted-set correspondence in the model, and establish node data from the restricted candidate set;
for each observed value to be annotated, compute features with the feature templates in the model, and select from them the features existing in the model;
take the node data as candidate labels, and determine, according to the candidate-label/feature/coefficient correspondences saved in the model, the coefficients for the established node data and the selected features;
establish the transition relations between candidate labels; and
annotate the observation sequence according to the transition relations and the coefficients of the candidate labels.
CN201510428997.5A 2015-07-20 2015-07-20 Model generation method used for data annotation, data annotation method, model generation device used for data annotation and data annotation device Active CN105095156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510428997.5A CN105095156B (en) 2015-07-20 2015-07-20 Model generation method used for data annotation, data annotation method, model generation device used for data annotation and data annotation device


Publications (2)

Publication Number Publication Date
CN105095156A true CN105095156A (en) 2015-11-25
CN105095156B CN105095156B (en) 2017-05-10

Family

ID=54575634


Country Status (1)

Country Link
CN (1) CN105095156B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271687A (en) * 2007-03-20 2008-09-24 株式会社东芝 Method and device for pronunciation conversion estimation and speech synthesis
US20110078554A1 (en) * 2009-09-30 2011-03-31 Microsoft Corporation Webpage entity extraction through joint understanding of page structures and sentences
CN104142909A (en) * 2014-05-07 2014-11-12 腾讯科技(深圳)有限公司 Method and device for phonetic annotation of Chinese characters
CN104142916A (en) * 2014-01-08 2014-11-12 腾讯科技(深圳)有限公司 Method and device for setting CRF (conditional random fields) predicted value




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant