CN107894977A

CN107894977A - Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary

Info

Publication number: CN107894977A
Application number: CN201711056063.9A
Authority: CN
Inventors: 郭剑毅; 赵晨; 余正涛; 王红斌; 文永华
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2017-11-01
Filing date: 2017-11-01
Publication date: 2018-04-10

Abstract

The present invention relates to a Vietnamese part-of-speech tagging method that combines a part-of-speech disambiguation model for multi-category words (words that can take more than one part of speech) with a dictionary, and belongs to the technical field of natural language processing. First, a single-category word dictionary and a multi-category word dictionary are obtained by sorting a Vietnamese dictionary. Second, Vietnamese part-of-speech tagging features are selected according to the characteristics of Vietnamese, and a multi-category word part-of-speech disambiguation model is built. Then the multi-category words and the single-category words in the test corpus are tagged separately, using the disambiguation model and the single-category word dictionary respectively. Finally, the two sets of tagging results are merged to obtain the final tagging result. The invention specifically takes into account the influence of multi-category words on part-of-speech tagging and effectively improves the accuracy of Vietnamese part-of-speech tagging.

Description

Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary

Technical field

The present invention relates to a Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary, and belongs to the technical field of natural language processing.

Background art

Part-of-speech tagging is a typical sequence labeling task in natural language processing: it assigns the correct part-of-speech tag to each word in a sentence. It is widely used in many stages of natural language processing, such as chunk parsing, syntactic analysis, named entity recognition, noun phrase recognition, semantic analysis and machine translation, where it plays a very important role. Research on Vietnamese part-of-speech tagging can effectively support subsequent work on Vietnamese language information processing; it can be applied to Vietnamese machine translation, information retrieval, speech recognition and so on, and it is also an indispensable basis for Vietnamese chunkers and syntactic parsers. However, existing tagging methods have low accuracy and do not take the influence of multi-category words into account. It is therefore necessary to provide a Vietnamese part-of-speech tagging method that incorporates multi-category words.

Summary of the invention

The invention provides a Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary. It specifically takes into account the influence of multi-category words on part-of-speech tagging and effectively improves the accuracy of Vietnamese part-of-speech tagging, addressing the relatively low accuracy of traditional tagging methods.

The technical solution of the invention is a Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary. The specific steps of the method are:

Step1, first, manually compile a Vietnamese dictionary;

Step2, second, derive a single-category word dictionary and a multi-category word dictionary from the manually compiled Vietnamese dictionary;

Step3, next, select a Vietnamese part-of-speech tagging feature set according to the linguistic characteristics of Vietnamese and build the multi-category word part-of-speech disambiguation model;

Step4, then, use the constructed multi-category word part-of-speech disambiguation model and the single-category word dictionary to automatically tag, respectively, the multi-category words and the single-category words in a test corpus obtained from Vietnamese news websites;

Step5, finally, automatically merge the two sets of tagging results to obtain the final tagging result.
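Step2 can be pictured as a simple partition of the compiled dictionary. The following Python sketch assumes, purely for illustration, that the dictionary is a mapping from each word to the set of part-of-speech tags listed for it. The function name split_dictionary, the tag labels and the toy entries are hypothetical and are not taken from the patent.

```python
def split_dictionary(dictionary):
    """Partition a word -> set-of-tags mapping into two dictionaries.

    Words listed with exactly one tag go to the single-category dictionary
    (word -> tag); words listed with several tags go to the multi-category
    dictionary (word -> sorted list of candidate tags).
    """
    single, multi = {}, {}
    for word, tags in dictionary.items():
        tags = set(tags)
        if not tags:
            continue  # assumption: entries without any tag are ignored
        if len(tags) == 1:
            single[word] = next(iter(tags))
        else:
            multi[word] = sorted(tags)
    return single, multi


# Toy entries with hypothetical tags (N noun, V verb, A adjective, P pronoun).
toy_dictionary = {
    "tôi": {"P"},
    "đẹp": {"A"},
    "bàn": {"N", "V"},
}
single_dict, multi_dict = split_dictionary(toy_dictionary)
```

Because every word lands in exactly one of the two dictionaries, the two tagging passes in Step4 never compete for the same token.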

The specific steps of Step3 are:

Step3.1, first, crawl corpora of different types with a web crawler and preprocess them; preprocessing includes data denoising and word segmentation with a word segmentation tool;

Step3.2, second, match the corpus against the Vietnamese dictionary, using a program written for this purpose to automatically identify the set of multi-category words in the corpus;

Step3.3, then, select the features of multi-category words according to the characteristics of Vietnamese multi-category words, and incorporate the selected features into the training corpus;

Step3.4, finally, perform statistical analysis and computation with a maximum entropy model, combining the multi-category word features from Step3.3 with contextual features, to generate the Vietnamese multi-category word part-of-speech disambiguation model.
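To make Step3.4 concrete, the sketch below trains a very small maximum entropy classifier over indicator features, which is equivalent to multinomial logistic regression. The patent does not state which toolkit or training algorithm was used; plain stochastic gradient ascent, the function names, the feature strings and the toy training examples are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def predict_proba(weights, labels, features):
    """Return P(label | features) under the maximum entropy model."""
    scores = {y: sum(weights[(f, y)] for f in features) for y in labels}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def train_maxent(examples, epochs=50, lr=0.5):
    """Fit one weight per (feature, label) pair by stochastic gradient ascent."""
    labels = sorted({gold for _, gold in examples})
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, gold in examples:
            probs = predict_proba(weights, labels, features)
            for y in labels:
                # Gradient of the log-likelihood: observed minus expected count.
                delta = (1.0 if y == gold else 0.0) - probs[y]
                for f in features:
                    weights[(f, y)] += lr * delta
    return weights, labels


def classify(weights, labels, features):
    """Pick the most probable tag for one multi-category word occurrence."""
    probs = predict_proba(weights, labels, features)
    return max(labels, key=lambda y: probs[y])


# Hypothetical training data: one ambiguous word "w" seen in two contexts.
toy_examples = [
    (["word=w", "prev_pos=P"], "V"),
    (["word=w", "prev_pos=V"], "N"),
    (["word=w", "prev_pos=P", "next_pos=N"], "V"),
    (["word=w", "prev_pos=V", "next_pos=A"], "N"),
]
weights, labels = train_maxent(toy_examples)
```

In this toy setting the word feature alone cannot separate the two tags, so the model has to rely on the context features, which is the role the contextual features play in the disambiguation model.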

The specific steps of Step3.1 are:

Step3.1.1, collect articles of types including news, entertainment and economy from Vietnamese news websites;

Step3.1.2, first, apply operations including sorting and noise removal to form a sentence-level text corpus;

Step3.1.3, second, segment the sentence-level text corpus with a Vietnamese word segmentation tool and have Vietnamese language experts proofread it manually, forming a sentence-level segmented corpus;

Step3.1.4, then, manually annotate the segmented corpus with part-of-speech tags and chunk tags;

Step3.1.5, finally, obtain the multi-category word dictionary by sorting the Vietnamese dictionary; based on this dictionary, programmatically extract the Vietnamese multi-category word domain corpus from the part-of-speech tagged corpus already built, for use in building the multi-category word part-of-speech disambiguation model.

In Step3.3, the features selected for the multi-category word part-of-speech disambiguation model are mainly: word and word context features; part-of-speech context features; chunk and chunk context features; and the sentence-constituent feature of the word in the sentence.
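The four feature groups can be turned into indicator features for a classifier roughly as follows. This is a minimal sketch: the token layout (a dict with "word", "pos", "chunk" and "role" keys), the one-token context window and the feature names are assumptions, since the patent names the feature groups but not their encoding.

```python
def extract_features(sentence, i):
    """Build indicator features for the token at position i.

    `sentence` is a list of dicts with the keys "word", "pos", "chunk" and
    "role" (sentence constituent). Context values are read from the
    annotated corpus; a one-token window on each side is assumed.
    """
    def get(j, key):
        # Return a boundary marker when the position is outside the sentence.
        return sentence[j][key] if 0 <= j < len(sentence) else "<PAD>"

    return [
        # (1) word and word context
        f"word={get(i, 'word')}",
        f"prev_word={get(i - 1, 'word')}",
        f"next_word={get(i + 1, 'word')}",
        # (2) part-of-speech context
        f"prev_pos={get(i - 1, 'pos')}",
        f"next_pos={get(i + 1, 'pos')}",
        # (3) chunk and chunk context
        f"chunk={get(i, 'chunk')}",
        f"prev_chunk={get(i - 1, 'chunk')}",
        f"next_chunk={get(i + 1, 'chunk')}",
        # (4) sentence constituent of the word
        f"role={get(i, 'role')}",
    ]


# Placeholder tokens with hypothetical annotations.
toy_sentence = [
    {"word": "w1", "pos": "P", "chunk": "NP", "role": "subject"},
    {"word": "w2", "pos": "V", "chunk": "VP", "role": "predicate"},
    {"word": "w3", "pos": "N", "chunk": "NP", "role": "object"},
]
features = extract_features(toy_sentence, 1)
```

The part of speech of the target token itself is deliberately not used as a feature, because it is the value the disambiguation model has to predict.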

The specific steps of Step4 are:

Step4.1, first, based on the Vietnamese multi-category word dictionary, extract the multi-category words and the single-category words from the test corpus to be tagged;

Step4.2, then, disambiguate the multi-category words with the multi-category word part-of-speech disambiguation model to obtain the tagging result for the disambiguated multi-category words;

Step4.3, finally, match the extracted single-category words against the single-category word part-of-speech dictionary to obtain the tagging result for the single-category words.
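The three sub-steps of Step4 can be sketched as one pass over a segmented sentence. The function name tag_step4, the position-to-tag output format and the stand-in disambiguator are hypothetical; in the method described here the disambiguator would be the trained maximum entropy model, and the treatment of words found in neither dictionary is an assumption because the source does not address it.

```python
def tag_step4(tokens, multi_dict, single_dict, disambiguate):
    """Return (multi_tags, single_tags), two position -> tag mappings.

    multi_dict maps a word to its list of candidate tags, single_dict maps
    a word to its only tag, and disambiguate(tokens, i, candidates) picks
    one candidate for the multi-category word at position i.
    """
    multi_tags, single_tags = {}, {}
    for i, word in enumerate(tokens):
        if word in multi_dict:
            # Step4.1 + Step4.2: multi-category word, resolved by the model.
            multi_tags[i] = disambiguate(tokens, i, multi_dict[word])
        elif word in single_dict:
            # Step4.1 + Step4.3: single-category word, resolved by lookup.
            single_tags[i] = single_dict[word]
        # Assumption: words in neither dictionary are left untagged here.
    return multi_tags, single_tags


def first_candidate(tokens, i, candidates):
    """Stand-in for the trained model: always pick the first candidate."""
    return candidates[0]


toy_tokens = ["w1", "w2", "w3", "w4"]
toy_multi = {"w2": ["N", "V"]}
toy_single = {"w1": "P", "w3": "A"}
multi_tags, single_tags = tag_step4(toy_tokens, toy_multi, toy_single, first_candidate)
```

Keeping the two results keyed by token position makes the later merge a matter of filling each position from whichever mapping contains it.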

In Step5, once the part-of-speech tags of the multi-category words and of the single-category words have been obtained, the two are combined by direct replacement. Because the multi-category word dictionary and the single-category word dictionary are both derived from the same Vietnamese dictionary, direct replacement does not cause conflicts.
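A minimal sketch of the direct replacement in Step5, assuming each partial result is a mapping from token position to tag. The function name merge_annotations, the "UNK" placeholder for untagged positions and the explicit overlap check are illustrative assumptions, not details given in the patent.

```python
def merge_annotations(tokens, multi_tags, single_tags, unknown="UNK"):
    """Combine two position -> tag mappings into one tagged sentence."""
    overlap = set(multi_tags) & set(single_tags)
    if overlap:
        # Cannot happen when both dictionaries come from one source
        # dictionary, but checking makes the no-conflict assumption explicit.
        raise ValueError(f"positions tagged twice: {sorted(overlap)}")
    merged = []
    for i, word in enumerate(tokens):
        tag = multi_tags.get(i, single_tags.get(i, unknown))
        merged.append((word, tag))
    return merged


merged = merge_annotations(["w1", "w2", "w3"], {1: "V"}, {0: "P", 2: "A"})
```

The overlap check is redundant under the stated assumption that a word belongs to exactly one of the two dictionaries; it only guards against dictionaries built from different sources.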

The beneficial effects of the invention are:

In the study of Vietnamese part-of-speech tagging, the invention specifically considers the influence of multi-category words: the corpus is divided into multi-category words and single-category words, which are tagged separately, and the single-category word dictionary and the multi-category word dictionary are obtained by sorting a Vietnamese dictionary. For single-category words, tagging based on a part-of-speech dictionary can achieve close to 100% accuracy, which is considerably better than the results of statistical algorithms; it also avoids the tagging errors that can occur when a corpus is annotated by hand and reduces the annotation workload. For multi-category words, the invention builds a multi-category word corpus in light of the linguistic characteristics of Vietnamese and selects the multi-category word features described above, effectively improving the accuracy of Vietnamese part-of-speech tagging.

Brief description of the drawings

Fig. 1 is the overall flow chart of the invention;

Fig. 2 is the flow chart of the construction of the multi-category word disambiguation model in the invention;

Fig. 3 shows the results of the ten-fold cross-validation experiments for the four models in the embodiment of the invention;

Fig. 4 shows the results of the comparison experiment for the three models in the embodiment of the invention.

Embodiment

Embodiment 1: As shown in Figs. 1-4, a Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary. The specific steps of the method are:

Step1, first, manually compile a Vietnamese dictionary; it can be built from words crawled from the website http://vdict.com/, for example 30,565 entries;

Step2, second, derive a single-category word dictionary and a multi-category word dictionary from the manually compiled Vietnamese dictionary; the resulting multi-category word dictionary contains 2,659 entries;

A Vietnamese dictionary is chosen because it is comparatively comprehensive and has wide coverage, covering the vast majority of real corpora; the resulting multi-category word and single-category word dictionaries can therefore also cover the multi-category words and single-category words in real corpora.

Further, the specific steps of Step3 are Step3.1 to Step3.4 as set out above.

Further, the specific steps of Step3.1 are Step3.1.1 to Step3.1.5 as set out above.

The Vietnamese multi-category word domain corpus is extracted from Vietnamese news websites because such a corpus cannot be obtained elsewhere and no related data is available for reuse; the invention therefore draws it from Vietnamese news websites.

Further, in Step3.3, the features selected for the multi-category word part-of-speech disambiguation model are mainly: word and word context features; part-of-speech context features; chunk and chunk context features; and the sentence-constituent feature of the word in the sentence.

(1) Word and word context features (the word form carries rich morphological information);

Regularities between a word's form and its own part of speech, and between the word's form and its context. For example, reduplicated forms of the AABB pattern generally describe things or actions, forms of the ABAB pattern generally express repeated actions, and forms of the ABB pattern generally express the state, quantity or sound of things;

(2) Part-of-speech context features (parts of speech can indicate the modification relations between words);

Regularities between the parts of speech of the context words and the part of speech of the word in the sentence. For example, a sentence generally contains a verb and a noun; in addition, a pronoun is generally followed by a verb, adverb or adjective, and a verb is generally followed by a noun or adverb.

(3) Chunk and chunk context features (indicating the role the word plays in the sentence, modification relations and similar information);

Part-of-speech features within a chunk and between chunks. For example, a noun chunk generally consists of an adjective and a noun; in addition, a noun chunk is generally preceded by a verb chunk.

(4) Sentence-constituent feature of the word in the sentence (subject, predicate, adverbial and so on).

Regularities between the constituent a word fills in the sentence and the word's part of speech. For example, the predicate is generally a verb, and the subject is generally a pronoun or noun.
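The reduplication patterns mentioned under feature (1) can be detected mechanically once a word is available as a list of its syllables. The sketch below is an illustration only: the function name reduplication_pattern, the "NONE" label and the idea of feeding the result to the model as one more indicator feature are assumptions, not details stated in the patent.

```python
def reduplication_pattern(syllables):
    """Classify a word, given as a list of syllables, as AABB, ABAB, ABB or NONE."""
    s = list(syllables)
    if len(s) == 4:
        a, b, c, d = s
        if a == b and c == d and a != c:
            return "AABB"
        if a == c and b == d and a != b:
            return "ABAB"
    if len(s) == 3:
        a, b, c = s
        if b == c and a != b:
            return "ABB"
    return "NONE"


# Placeholder syllables; "x" and "y" stand for any two distinct syllables.
pattern_feature = "redup=" + reduplication_pattern(["x", "y", "x", "y"])
```

A word made of four identical syllables is reported as "NONE" in this sketch, since it does not distinguish A from B.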

Further, the specific steps of Step4 are Step4.1 to Step4.3 as set out above.

Further, in Step5, once the part-of-speech tags of the multi-category words and of the single-category words have been obtained, the two are combined by direct replacement. Because the multi-category word dictionary and the single-category word dictionary are both derived from the same Vietnamese dictionary, direct replacement does not cause conflicts.

This embodiment uses Vietnamese sentences crawled from Vietnamese news websites as the training corpus and the test corpus. The crawled web pages are turned into a text corpus through steps such as rule-based extraction, deduplication and manual annotation, producing a multi-category word domain corpus with a scale of 27,878 sentences and 396,946 entries, which provides the corpus support for the invention;

To verify the effect of the part-of-speech tagging of the invention, a unified evaluation criterion is used: accuracy (Precision) serves as the evaluation criterion of the invention and measures its performance.

To verify the effectiveness of the invention, the following groups of experiments were designed:

Experiment 1: to verify that adding multi-category word model disambiguation effectively improves the accuracy of Vietnamese part-of-speech tagging. In this experiment the 27,878 part-of-speech tagged sentences are divided into ten parts, and ten-fold cross-validation is carried out with each of the currently popular MEM, CRF and SVM models and with the part-of-speech dictionary + multi-category word disambiguation model; the average accuracies are then compared. The experimental results are shown in Fig. 3.

Table 1: Ten-fold cross-validation experiments

From the experimental data in Table 1, the average accuracies of MEM, CRF++, SVMmulticlass and the part-of-speech dictionary + multi-category word disambiguation model are 91.62%, 93.71%, 94.67% and 95.22% respectively. The accuracy of the SVMmulticlass model is 0.96 percentage points higher than that of CRF++, CRF++ is 2.09 percentage points higher than MEM, and the part-of-speech dictionary + multi-category word disambiguation model is 0.55 percentage points higher than the SVMmulticlass model. As also shown in Fig. 4, this verifies that adding multi-category word model disambiguation effectively improves the accuracy of Vietnamese part-of-speech tagging.
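The ten-fold protocol of Experiment 1 can be sketched as follows. The patent states only that the 27,878 tagged sentences are divided into ten parts; the contiguous, unshuffled split and the function names below are assumptions for illustration.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional item each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def average_accuracy(fold_accuracies):
    """Mean of the per-fold accuracies, the figure compared across models."""
    return sum(fold_accuracies) / len(fold_accuracies)


folds = list(k_fold_indices(27878, k=10))
fold_sizes = [len(test) for _, test in folds]
```

Each sentence is used for testing exactly once and for training in the other nine rounds; in practice the sentences would usually be shuffled before splitting so that every fold mixes article types.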

Experiment 2: to verify the effectiveness of the system of the invention, the part-of-speech tagging model of the invention is compared experimentally with the existing part-of-speech tagging tool VietTagger and with the SVMmulticlass model; the experimental results are shown in Table 2.

Table 2: Comparison of part-of-speech tagging results

System	Precision
VietTagger	92.13%
SVMmulticlass	94.67%
Proposed method	95.22%

As can be seen from Table 2, the part-of-speech tagging method proposed by the invention, which combines a part-of-speech dictionary with a multi-category word disambiguation model, achieves good tagging results: it is 3.09 percentage points higher than VietTagger and 0.55 percentage points higher than SVMmulticlass, which demonstrates that the method of the invention is effective and feasible.
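The evaluation measure and the two margins quoted above can be checked with a few lines of Python. The accuracy function is a generic token-level definition assumed for illustration (the patent names the measure but gives no formula); the three percentages are the ones reported in Table 2, and Decimal is used so that the subtraction is exact.

```python
from decimal import Decimal


def accuracy(predicted_tags, gold_tags):
    """Share of tokens whose predicted tag equals the gold tag."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("sequences must have the same length")
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)


# Accuracies reported in Table 2, in percent.
table2 = {
    "VietTagger": Decimal("92.13"),
    "SVMmulticlass": Decimal("94.67"),
    "Proposed method": Decimal("95.22"),
}
gain_over_viettagger = table2["Proposed method"] - table2["VietTagger"]
gain_over_svm = table2["Proposed method"] - table2["SVMmulticlass"]
```

Both gains are differences between percentages, so they are percentage points rather than relative improvements.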

The embodiments of the invention have been explained in detail above with reference to the accompanying drawings, but the invention is not limited to these embodiments; within the knowledge possessed by a person of ordinary skill in the art, various changes can also be made without departing from the concept of the invention.

Claims

1. A Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary, characterized in that:

The specific steps of the method are:

Step1, first, manually compile a Vietnamese dictionary;

Step4, then, use the constructed multi-category word part-of-speech disambiguation model and the single-category word dictionary to automatically tag, respectively, the multi-category words and the single-category words in a test corpus obtained from Vietnamese news websites;

2. The Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary according to claim 1, characterized in that the specific steps of Step3 are:

Step3.1, first, crawl corpora of different types with a web crawler and preprocess them; preprocessing includes data denoising and word segmentation with a word segmentation tool;

Step3.2, second, match the corpus against the Vietnamese dictionary, using a program written for this purpose to automatically identify the set of multi-category words in the corpus;

Step3.3, then, select the features of multi-category words according to the characteristics of Vietnamese multi-category words, and incorporate the selected features into the training corpus;

3. The Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary according to claim 2, characterized in that the specific steps of Step3.1 are:

Step3.1.3, second, segment the sentence-level text corpus with a Vietnamese word segmentation tool and have Vietnamese language experts proofread it manually, forming a sentence-level segmented corpus;

Step3.1.5, finally, obtain the multi-category word dictionary by sorting the Vietnamese dictionary; based on this dictionary, programmatically extract the Vietnamese multi-category word domain corpus from the part-of-speech tagged corpus already built, for use in building the multi-category word part-of-speech disambiguation model.

4. The Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary according to claim 2, characterized in that, in Step3.3, the features selected for the multi-category word part-of-speech disambiguation model are mainly: word and word context features; part-of-speech context features; chunk and chunk context features; and the sentence-constituent feature of the word in the sentence.

5. The Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary according to claim 1, characterized in that the specific steps of Step4 are:

Step4.1, first, based on the Vietnamese multi-category word dictionary, extract the multi-category words and the single-category words from the test corpus to be tagged;

Step4.2, then, disambiguate the multi-category words with the multi-category word part-of-speech disambiguation model to obtain the tagging result for the disambiguated multi-category words;

Step4.3, finally, match the extracted single-category words against the single-category word part-of-speech dictionary to obtain the tagging result for the single-category words.

6. The Vietnamese part-of-speech tagging method combining a multi-category word part-of-speech disambiguation model and a dictionary according to claim 1, characterized in that, in Step5, once the part-of-speech tags of the multi-category words and of the single-category words have been obtained, the two are combined by direct replacement.