Summary of the invention
The invention provides a Vietnamese part-of-speech tagging method that combines a multi-category word part-of-speech disambiguation model with a dictionary. A multi-category word is a word that can take more than one part of speech. The method specifically accounts for the influence of multi-category words on part-of-speech tagging and effectively improves the accuracy of Vietnamese part-of-speech tagging, addressing the relatively low accuracy of traditional tagging methods.
The technical scheme is as follows: a Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model with a dictionary, the method comprising the following steps:
Step1, first, a Vietnamese dictionary is compiled manually;
Step2, second, a single-category word dictionary and a multi-category word dictionary are derived from the manually compiled Vietnamese dictionary;
Step3, next, a Vietnamese part-of-speech tagging feature set is selected according to the linguistic characteristics of Vietnamese, and the multi-category word part-of-speech disambiguation model is constructed;
Step4, then, the constructed disambiguation model and the single-category word dictionary are used to automatically tag, respectively, the multi-category words and the single-category words in a test corpus obtained from Vietnamese news websites;
Step5, finally, the two sets of tagging results are fused automatically to obtain the final tagging result.
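Step2 can be illustrated with a minimal sketch. It assumes the compiled dictionary is available as a mapping from each word to the set of part-of-speech tags listed for it; the function name `split_dictionary`, the toy entries and the tag labels (N, V, A) are hypothetical and not taken from the source.

```python
from typing import Dict, Set, Tuple


def split_dictionary(
    lexicon: Dict[str, Set[str]]
) -> Tuple[Dict[str, str], Dict[str, Set[str]]]:
    """Split a lexicon into single-category and multi-category dictionaries."""
    single: Dict[str, str] = {}
    multi: Dict[str, Set[str]] = {}
    for word, tags in lexicon.items():
        if len(tags) == 1:
            # Exactly one tag listed: the dictionary alone decides the tag.
            single[word] = next(iter(tags))
        elif len(tags) > 1:
            # Several tags listed: the word needs disambiguation in context.
            multi[word] = set(tags)
    return single, multi


# Toy lexicon with hypothetical tag labels (N = noun, V = verb, A = adjective).
lexicon = {"bàn": {"N", "V"}, "sách": {"N"}, "đẹp": {"A"}}
single, multi = split_dictionary(lexicon)
```

Because every word lands in exactly one of the two dictionaries, the two never share an entry.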
The specific steps of Step3 are:
Step3.1, first, corpora of different types are crawled with a web crawler and preprocessed; preprocessing includes data denoising and word segmentation with a word segmentation tool;
Step3.2, second, a program is written that matches the corpus against the Vietnamese dictionary and automatically identifies the set of multi-category words in the corpus;
Step3.3, then, features of multi-category words are selected according to the characteristics of Vietnamese multi-category words, and the selected features are incorporated into the training corpus;
Step3.4, finally, a maximum entropy model is used for statistical analysis and calculation, combining the multi-category word features from Step3.3 with contextual features, to generate the Vietnamese multi-category word part-of-speech disambiguation model.
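As an illustration of Step3.4, a maximum entropy classifier over indicator features can be sketched in a few lines. The sketch below is an assumption-laden toy: it trains with plain batch gradient ascent on the log-likelihood, which is not necessarily the training procedure used by the inventors, and the feature strings, tag labels and function names are hypothetical.

```python
import math
from collections import defaultdict


def predict_proba(weights, tags, feats):
    """Return p(tag | feats) under the maximum entropy (softmax) model."""
    scores = {t: sum(weights.get((f, t), 0.0) for f in feats) for t in tags}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


def train_maxent(samples, epochs=200, lr=0.5):
    """samples: list of (feature_list, gold_tag). Returns (weights, tags)."""
    tags = sorted({tag for _, tag in samples})
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in samples:
            probs = predict_proba(weights, tags, feats)
            for tag in tags:
                # Log-likelihood gradient: observed count minus expected count.
                delta = (1.0 if tag == gold else 0.0) - probs[tag]
                for f in feats:
                    grad[(f, tag)] += delta
        for key, g in grad.items():
            weights[key] += lr * g / len(samples)
    return weights, tags


# Toy training data with hypothetical feature names and tags.
samples = [
    (["prev_pos=P", "next_pos=N"], "V"),
    (["prev_pos=V", "next_pos=E"], "N"),
    (["prev_pos=P", "next_pos=R"], "V"),
    (["prev_pos=M", "next_pos=V"], "N"),
]
weights, tags = train_maxent(samples)
probs = predict_proba(weights, tags, ["prev_pos=P"])
best = max(probs, key=probs.get)
```

In this toy data the feature `prev_pos=P` only ever co-occurs with the tag V, so the trained model prefers V when it sees that feature.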
The specific steps of Step3.1 are:
Step3.1.1, articles of types including news, entertainment and economy are collected from Vietnamese news websites;
Step3.1.2, first, the articles are cleaned up through operations including sorting and noise removal to form a sentence-level text corpus;
Step3.1.3, second, the sentence-level text corpus is segmented with a Vietnamese word segmentation tool and manually proofread by Vietnamese language experts to form a sentence-level segmented corpus;
Step3.1.4, then, the segmented corpus is manually annotated with parts of speech and chunks;
Step3.1.5, finally, the multi-category word dictionary is obtained by sorting the Vietnamese dictionary; based on this dictionary, a program extracts the Vietnamese multi-category word domain corpus from the part-of-speech tagged corpus that has been built, for use in constructing the multi-category word part-of-speech disambiguation model.
In Step3.3, the features chosen for the multi-category word part-of-speech disambiguation model are mainly: word and word-context features; part-of-speech context features; chunk and chunk-context features; and the sentence-constituent feature of the word within the sentence.
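The four feature groups can be made concrete with a small feature extractor. This is only a sketch under stated assumptions: each token is assumed to be a dictionary with `word`, `pos`, `chunk` and `role` keys, the window is one token to either side, and the feature names, padding symbols and annotation labels are all hypothetical.

```python
from typing import Dict, List


def extract_features(sentence: List[Dict[str, str]], i: int) -> Dict[str, str]:
    """Collect the four feature groups for the token at position i."""

    def get(j: int, key: str) -> str:
        # Pad positions outside the sentence with boundary symbols.
        if j < 0:
            return "<BOS>"
        if j >= len(sentence):
            return "<EOS>"
        return sentence[j][key]

    return {
        # Word and word-context features.
        "w0": get(i, "word"), "w-1": get(i - 1, "word"), "w+1": get(i + 1, "word"),
        # Part-of-speech context features (tags of the neighbouring words).
        "p-1": get(i - 1, "pos"), "p+1": get(i + 1, "pos"),
        # Chunk and chunk-context features.
        "c0": get(i, "chunk"), "c-1": get(i - 1, "chunk"), "c+1": get(i + 1, "chunk"),
        # Sentence-constituent feature of the word itself.
        "role": get(i, "role"),
    }


# Toy annotated sentence; the tag, chunk and role labels are hypothetical.
sentence = [
    {"word": "Tôi", "pos": "P", "chunk": "NP", "role": "subject"},
    {"word": "đọc", "pos": "V", "chunk": "VP", "role": "predicate"},
    {"word": "sách", "pos": "N", "chunk": "NP", "role": "object"},
]
feats = extract_features(sentence, 1)
```

The resulting key-value pairs can be turned into indicator features such as `p-1=P` for a classifier.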
The specific steps of Step4 are:
Step4.1, first, based on the Vietnamese multi-category word dictionary, the multi-category words and single-category words are extracted from the test corpus to be tagged;
Step4.2, then, the multi-category word part-of-speech disambiguation model is used to disambiguate the multi-category words, giving the tagging result for the disambiguated multi-category words;
Step4.3, finally, the extracted single-category words are matched against the single-category word part-of-speech dictionary, giving the tagging result for the single-category words.
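The three sub-steps of Step4 can be sketched as a single pass over a segmented sentence. The sketch assumes the two dictionaries are plain Python mappings and that the disambiguation model is passed in as a callable; `toy_disambiguate` is a hypothetical stand-in for the trained model, and leaving out-of-dictionary tokens untagged is an assumption, since the source does not say how they are handled.

```python
from typing import Callable, Dict, List, Set, Tuple


def tag_by_category(
    tokens: List[str],
    single_dict: Dict[str, str],
    multi_dict: Dict[str, Set[str]],
    disambiguate: Callable[[List[str], int, Set[str]], str],
) -> Tuple[Dict[int, str], Dict[int, str]]:
    """Return (multi_tags, single_tags), each keyed by token position."""
    multi_tags: Dict[int, str] = {}
    single_tags: Dict[int, str] = {}
    for i, tok in enumerate(tokens):
        if tok in multi_dict:
            # Multi-category word: let the model choose among the candidates.
            multi_tags[i] = disambiguate(tokens, i, multi_dict[tok])
        elif tok in single_dict:
            # Single-category word: direct dictionary lookup.
            single_tags[i] = single_dict[tok]
    return multi_tags, single_tags


def toy_disambiguate(tokens: List[str], i: int, candidates: Set[str]) -> str:
    """Hypothetical stand-in for the model: prefer V right after 'tôi'."""
    if i > 0 and tokens[i - 1] == "tôi" and "V" in candidates:
        return "V"
    return sorted(candidates)[0]


tokens = ["tôi", "bàn", "sách"]
single_dict = {"tôi": "P", "sách": "N"}
multi_dict = {"bàn": {"N", "V"}}
multi_tags, single_tags = tag_by_category(
    tokens, single_dict, multi_dict, toy_disambiguate
)
```

Keying both results by token position keeps them aligned with the original sentence for the later fusion step.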
In Step5, once the part-of-speech tags of the multi-category words and of the single-category words have been obtained, the two are combined by direct replacement. Because the multi-category word dictionary and the single-category word dictionary are both derived from the same Vietnamese dictionary, direct replacement cannot cause conflicts.
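Fusion by direct replacement can be sketched as a merge of two position-indexed tag maps. The data layout (dictionaries keyed by token position) and the function name `fuse` are assumptions made for illustration; the explicit overlap check simply makes the no-conflict property visible.

```python
from typing import Dict, List, Optional, Tuple


def fuse(
    tokens: List[str],
    multi_tags: Dict[int, str],
    single_tags: Dict[int, str],
) -> List[Tuple[str, Optional[str]]]:
    """Merge the two partial taggings into one tagged sentence."""
    overlap = multi_tags.keys() & single_tags.keys()
    if overlap:
        # Cannot happen when both dictionaries come from one source dictionary.
        raise ValueError(f"positions tagged twice: {sorted(overlap)}")
    merged = {**single_tags, **multi_tags}
    # Tokens found in neither dictionary keep the tag None.
    return [(tok, merged.get(i)) for i, tok in enumerate(tokens)]


tagged = fuse(["tôi", "bàn", "sách"], {1: "V"}, {0: "P", 2: "N"})
```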
The beneficial effects of the invention are as follows:
In its study of Vietnamese part-of-speech tagging, the invention specifically accounts for the influence of multi-category words. The corpus is divided into multi-category words and single-category words, which are tagged separately, and a single-category word dictionary and a multi-category word dictionary are derived from the Vietnamese dictionary. For single-category words, dictionary-based part-of-speech tagging can achieve close to 100% accuracy, considerably better than statistics-based algorithms; it also avoids the tagging errors that can arise when a corpus is annotated by hand and reduces the annotation workload. For multi-category words, the invention draws on the linguistic characteristics of Vietnamese, builds a multi-category word corpus and selects the multi-category word features described above, effectively improving the accuracy of Vietnamese part-of-speech tagging.
Embodiment 1: As shown in Figures 1-4, a Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model with a dictionary, the method comprising the following steps:
Step1, first, a Vietnamese dictionary is compiled manually; its words can be crawled from the website http://vdict.com/, giving, for example, 30,565 entries;
Step2, second, a single-category word dictionary and a multi-category word dictionary are derived from the manually compiled Vietnamese dictionary; the resulting multi-category word dictionary contains 2,659 entries;
A Vietnamese dictionary is chosen because it is comparatively comprehensive and has wide coverage: it covers the great majority of real corpora, so the multi-category and single-category word dictionaries derived from it also cover the multi-category and single-category words found in real corpora.
Step3, next, a Vietnamese part-of-speech tagging feature set is selected according to the linguistic characteristics of Vietnamese, and the multi-category word part-of-speech disambiguation model is constructed;
Step4, then, the constructed disambiguation model and the single-category word dictionary are used to automatically tag, respectively, the multi-category words and the single-category words in a test corpus obtained from Vietnamese news websites;
Step5, finally, the two sets of tagging results are fused automatically to obtain the final tagging result.
Further, the specific steps of Step3 are:
Step3.1, first, corpora of different types are crawled with a web crawler and preprocessed; preprocessing includes data denoising and word segmentation with a word segmentation tool;
Step3.2, second, a program is written that matches the corpus against the Vietnamese dictionary and automatically identifies the set of multi-category words in the corpus;
Step3.3, then, features of multi-category words are selected according to the characteristics of Vietnamese multi-category words, and the selected features are incorporated into the training corpus;
Step3.4, finally, a maximum entropy model is used for statistical analysis and calculation, combining the multi-category word features from Step3.3 with contextual features, to generate the Vietnamese multi-category word part-of-speech disambiguation model.
Further, the specific steps of Step3.1 are:
Step3.1.1, articles of types including news, entertainment and economy are collected from Vietnamese news websites;
Step3.1.2, first, the articles are cleaned up through operations including sorting and noise removal to form a sentence-level text corpus;
Step3.1.3, second, the sentence-level text corpus is segmented with a Vietnamese word segmentation tool and manually proofread by Vietnamese language experts to form a sentence-level segmented corpus;
Step3.1.4, then, the segmented corpus is manually annotated with parts of speech and chunks;
Step3.1.5, finally, the multi-category word dictionary is obtained by sorting the Vietnamese dictionary; based on this dictionary, a program extracts the Vietnamese multi-category word domain corpus from the part-of-speech tagged corpus that has been built, for use in constructing the multi-category word part-of-speech disambiguation model.
The Vietnamese multi-category word domain corpus is extracted from Vietnamese news websites because such a corpus cannot be obtained elsewhere and no related data are available for reuse; the invention therefore draws it from Vietnamese news websites.
Further, in Step3.3, the features chosen for the multi-category word part-of-speech disambiguation model are mainly: word and word-context features; part-of-speech context features; chunk and chunk-context features; and the sentence-constituent feature of the word within the sentence.
(1) Word and word-context features (the form of a word carries rich morphological information);
These capture regularities between the form of a word and its own part of speech, and between the form of a word and its context. For example, reduplicated forms of the AABB pattern generally describe things or actions, forms of the ABAB pattern generally denote repeated actions, and forms of the ABB pattern generally express the state, quantity or sound of things;
(2) Part-of-speech context features (parts of speech can express the modification relationships between words);
These capture regularities between the parts of speech of the context words and the part of speech of the word in the sentence. For example, a sentence generally contains a verb and a noun; a pronoun is generally followed by a verb, adverb or adjective; and a verb is generally followed by a noun or adverb;
(3) Chunk and chunk-context features (indicating the role the word plays in the sentence, modification relationships and similar information);
These capture part-of-speech regularities within a chunk and between chunks. For example, a noun chunk generally consists of an adjective and a noun, and a noun chunk is generally preceded by a verb chunk;
(4) Sentence-constituent feature of the word within the sentence (subject, predicate, adverbial, etc.).
This captures regularities between the constituent a word fills in the sentence and its part of speech. For example, the predicate is generally a verb, and the subject is generally a pronoun or noun.
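The reduplication patterns named under feature (1) lend themselves to a simple check on the syllables of a word. The sketch below assumes a word is already split into syllables and uses placeholder syllables rather than real Vietnamese examples; the function name and the choice to compare case-insensitively are illustrative assumptions.

```python
from typing import List, Optional


def reduplication_pattern(syllables: List[str]) -> Optional[str]:
    """Return 'AABB', 'ABAB' or 'ABB' if the syllables match, else None."""
    s = [syl.lower() for syl in syllables]
    if len(s) == 4:
        if s[0] == s[1] and s[2] == s[3] and s[0] != s[2]:
            return "AABB"
        if s[0] == s[2] and s[1] == s[3] and s[0] != s[1]:
            return "ABAB"
    if len(s) == 3 and s[1] == s[2] and s[0] != s[1]:
        return "ABB"
    return None


# Placeholder syllables "x" and "y" stand in for real Vietnamese syllables.
patterns = [
    reduplication_pattern(["x", "x", "y", "y"]),
    reduplication_pattern(["x", "y", "x", "y"]),
    reduplication_pattern(["x", "y", "y"]),
    reduplication_pattern(["x", "y"]),
]
```

The returned label can be added to the word-form features as one more indicator.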
Further, the specific steps of Step4 are:
Step4.1, first, based on the Vietnamese multi-category word dictionary, the multi-category words and single-category words are extracted from the test corpus to be tagged;
Step4.2, then, the multi-category word part-of-speech disambiguation model is used to disambiguate the multi-category words, giving the tagging result for the disambiguated multi-category words;
Step4.3, finally, the extracted single-category words are matched against the single-category word part-of-speech dictionary, giving the tagging result for the single-category words.
Further, in Step5, once the part-of-speech tags of the multi-category words and of the single-category words have been obtained, the two are combined by direct replacement. Because the multi-category word dictionary and the single-category word dictionary are both derived from the same Vietnamese dictionary, direct replacement cannot cause conflicts.
In this embodiment, Vietnamese sentences crawled from Vietnamese news websites serve as the training and test corpora. The crawled web pages are turned into a text corpus through steps such as rule-based extraction, deduplication and manual annotation, yielding a multi-category word domain corpus of 27,878 sentences and 396,946 multi-category words, which provides corpus support for the invention;
To evaluate the effectiveness of the invention, a unified evaluation criterion is used: precision serves as the evaluation criterion of the invention and measures its performance.
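The evaluation criterion can be written down as a short function. The sketch assumes that every token receives exactly one predicted tag, in which case the measure reduces to the share of tokens tagged correctly; the source does not give a formula, so this reading and the function name are assumptions.

```python
from typing import Sequence


def precision(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical tags for a four-token sentence: three of four are correct.
score = precision(["N", "V", "N", "A"], ["N", "V", "V", "A"])
```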
To verify the validity of the invention, the following experiments were designed:
Experiment one: to verify that adding multi-category word model disambiguation effectively improves the part-of-speech tagging accuracy of Vietnamese. In this experiment the 27,878 annotated part-of-speech tagged sentences are divided into ten parts, and ten-fold cross-validation is carried out with each of the currently popular MEM, CRF and SVM models and with the part-of-speech dictionary + multi-category word disambiguation model; the average accuracies are then compared. The experimental results are shown in Figure 3.
Table 1: Ten-fold cross-validation experiments
From the experimental data in Table 1, the average accuracies of MEM, CRF++, SVMmulticlass and the part-of-speech dictionary + multi-category word disambiguation model are 91.62%, 93.71%, 94.67% and 95.22% respectively. The SVMmulticlass model is 0.96 percentage points more accurate than CRF++, CRF++ is 2.09 percentage points more accurate than MEM, and the part-of-speech dictionary + multi-category word disambiguation model is 0.55 percentage points more accurate than the SVMmulticlass model. As also shown in Figure 4, this demonstrates that adding multi-category word model disambiguation effectively improves the part-of-speech tagging accuracy of Vietnamese.
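The ten-fold cross-validation protocol of experiment one can be sketched as follows. The source does not say how the sentences are assigned to the ten parts, so the interleaved assignment used here, like the function name, is an assumption made for illustration.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(items: Sequence[T], k: int = 10) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs; each fold serves as the test set once."""
    # Interleaved assignment: item j goes to fold j % k.
    folds = [list(items[i::k]) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Sentence indices stand in for the 27,878 annotated sentences.
splits = list(k_fold_splits(range(27878), k=10))
test_sizes = sorted({len(test) for _, test in splits})
```

Averaging the per-fold scores of a tagger over the ten splits gives the average accuracy compared in the experiment.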
Experiment two: to verify the validity of the present system, the part-of-speech tagging model of the invention is compared experimentally with the existing part-of-speech tagging tool VietTagger and with the SVMmulticlass model. The experimental results are shown in Table 2.
Table 2: Comparison of part-of-speech tagging results
System | Precision
VietTagger | 92.13%
SVMmulticlass | 94.67%
Proposed method | 95.22%
As can be seen from Table 2, the part-of-speech tagging method proposed by the invention, which combines a part-of-speech dictionary with a multi-category word disambiguation model, achieves good tagging results: 3.09 percentage points higher than VietTagger and 0.55 percentage points higher than SVMmulticlass. This demonstrates that the method of the invention is effective and feasible.
The specific embodiments of the invention have been explained in detail above with reference to the accompanying drawings, but the invention is not limited to these embodiments; within the knowledge possessed by a person of ordinary skill in the art, various changes can be made without departing from the spirit of the invention.