CN101446941A

CN101446941A - Natural language level and syntax analytic method based on historical information

Info

Publication number: CN101446941A
Application number: CNA2008102436043A
Authority: CN
Inventors: 朱巧明; 周国栋; 李培峰; 李军辉; 钱龙华; 孔芳; 王红玲; 钱培德
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2008-12-10
Filing date: 2008-12-10
Publication date: 2009-06-03

Abstract

The invention discloses a hierarchical syntactic parsing method for natural language based on history information, characterized in that the method comprises the following steps: for a sentence that has already been word-segmented, each word is first taken as an initial chunk; chunk identification is carried out layer by layer according to context information; chunks that can be combined are merged into new chunks to obtain an intermediate result; and chunk identification and combination are repeated on the intermediate result according to the context information until only one chunk remains. That chunk is the root node of a syntax tree, so that a syntax tree expressing the natural language sentence is obtained. In the processing of each layer, the invention preferentially identifies the chunks that are easy to identify, which provides richer context information for identifying complex chunks and improves the accuracy of decision prediction, thereby improving parsing performance.

Description

A hierarchical syntactic parsing method for natural language based on history information

Technical field

The present invention relates to a method for syntactic parsing of natural language that identifies complex chunks through layer-by-layer analysis, and belongs to the field of natural language processing within computational linguistics.

Background technology

Syntactic parsing is a basic problem of natural language processing and is also generally acknowledged to be a difficult research problem. Its task is to derive the syntactic structure of a sentence automatically according to a given grammar, that is, the syntactic units the sentence contains and the relations between those units. Syntactic parsing has two main purposes: one is to determine the "pedigree" (constituent) structure of the sentence; the other is to determine the relations between the constituents of the sentence. Usually the input is a single sentence, that is, a linear sequence of words, and the output is a nonlinear data structure such as a phrase structure tree (for example, a syntax tree) or a directed acyclic graph (for example, a dependency graph).

The quality of the parsing result directly affects the interpretation and understanding of natural language sentences. In other words, syntactic parsing is a core technology that allows application systems to process natural language at the level of content. As the foundation of many language processing tasks such as machine translation, information retrieval, information extraction, speech recognition, and automatic corpus processing, syntactic parsing occupies a critical position. The techniques used in syntactic parsing can also be applied to problems in bioinformatics that resemble parsing, such as the detection of RNA molecular structure. Furthermore, language is the carrier of thought, so research on natural language parsing helps in studying the nature of human thinking. Research on natural language parsing technology therefore has both theoretical significance and practical value.

At present, the main syntactic parsing models can be grouped into the following three classes:

1. Syntactic parsing models based on probabilistic context-free grammar

Probabilistic context-free grammar (PCFG) is the earliest and also the most commonly used syntactic parsing model. It is a simple context-free grammar (CFG) whose rules carry probabilities that indicate how likely each rewriting rule is. With a PCFG, the probability of a parse tree is computed as the product of the probabilities of the rules used in the tree. PCFG is the simplest and most natural probability model for tree structures, and its mathematical background is easy to understand. Its main limitation is that it rests on context-free independence assumptions that do not hold in practice, and a given grammar cannot cover all language phenomena, so the parsing results are often unsatisfactory. To relax the independence assumptions made by PCFG, many researchers have turned to lexicalized probabilistic context-free grammars.

2. Syntactic parsing models based on lexicalized probabilistic context-free grammar

In a lexicalized PCFG, each nonterminal in the parse tree is associated with a word (called the head word of that nonterminal, possibly together with its part of speech). Collins implemented a head-driven statistical parsing model. Its biggest difference from plain PCFG is that it is head-driven: the right-hand side of each derivation rule (except the bottom-level rules) is divided into three parts, namely the head nonterminal, the left modifiers, and the right modifiers, and the probabilities of the left and right modifiers are conditioned on the head nonterminal. To address data sparseness, the lexicalized context-free rules are decomposed and back-off smoothing is used when computing probabilities, which largely solves the coverage problem of PCFG. However, because the expression of a sentence is in fact context-dependent, the results of this method are still unsatisfactory.

3. Syntactic parsing models based on history information

As the name suggests, a history-based parsing model predicts the next action from the "decisions" already made, that is, from history information. Ratnaparkhi converts the construction of a syntax tree into a sequence of bottom-up decisions. The system consists of three functional modules: a part-of-speech tagging module, a chunk identification module, and a module that builds the syntax tree from the chunking result. Depending on the kind of decision, each module uses contextual information such as the basic information of the two units before and after the current unit and the decisions made for the two preceding units, and the parameters are learned with a maximum entropy model.

At present, for reasons such as a late start and a shortage of treebank resources, Chinese parsing technology lags far behind English parsing. Existing Chinese parsers cannot meet the needs of upper-layer applications, and Chinese parsing has become a bottleneck that restricts the further development of Chinese information processing. Take Chinese semantic role labeling (SRL) as an example: based on correct Chinese parses, the F1 score of Chinese SRL reaches 93%, whereas based on the output of an automatic Chinese parsing model, the Chinese SRL accuracy obtained is 63%, a gap of as much as 30 percentage points.

Therefore, the development of information processing technology calls for a new syntactic parsing method for natural language that is applicable to various natural languages, including Chinese, and that improves parsing accuracy.

Summary of the invention

The object of the invention is to provide a hierarchical syntactic parsing method for natural language based on history information, so as to build the syntax tree more accurately and achieve better parsing performance.

To achieve the above object, the technical solution adopted by the invention is a hierarchical syntactic parsing method for natural language based on history information: for a sentence that has already been word-segmented, each word is first regarded as an initial chunk; chunk identification is carried out layer by layer based on contextual information, and chunks that can be combined form new chunks, giving an intermediate result; chunk identification and combination based on contextual information are repeated on the intermediate result until it contains only one chunk; that chunk is the root node of the syntax tree, and the syntax tree expressing the natural language sentence is thereby obtained.

In the above, when the sentence to be processed is a Chinese sentence, word segmentation is usually carried out first; if the sentence to be processed is in English or another language whose words are separated, or is a Chinese sentence that has already been segmented, syntactic parsing can be carried out directly.

In the above technical solution, chunk identification and combination use the BIESO labeling scheme. For a chunk category X, B-X means beginning a new chunk X, I-X means joining the preceding chunk, E-X means ending the preceding chunk, S-X means forming a chunk X alone, and O means remaining unchanged. Combined with the contextual features and according to the feature templates for parsing, a classifier is used, through a training process and an analysis process, to realize the identification and combination of chunks. Once every chunk in an intermediate result has been labeled, the chunk merging procedure is called to merge the chunks. For example, suppose an intermediate result is "NP (he) VC (is) NP (student) PU (.)" and the labels of the chunks are, in order, "O B-VP E-VP O"; the result after merging is then "NP (he) VP (is a student) PU (.)".
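The merging step in the example above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name merge_chunks and the representation of a chunk as a (label, words) pair are hypothetical, and the sketch keeps only the flat word list of each chunk rather than its subtree.

```python
from typing import List, Tuple

Chunk = Tuple[str, List[str]]  # (label, words); hypothetical representation


def merge_chunks(chunks: List[Chunk], tags: List[str]) -> List[Chunk]:
    """Merge one layer of chunks according to their BIESO tags."""
    if len(chunks) != len(tags):
        raise ValueError("one tag per chunk is required")
    merged: List[Chunk] = []
    open_label, open_words = None, []  # chunk currently being built
    for (label, words), tag in zip(chunks, tags):
        if tag == "O":                      # unchanged: copy through
            merged.append((label, list(words)))
            continue
        kind, new_label = tag.split("-", 1)
        if kind == "S":                     # forms a chunk X on its own
            merged.append((new_label, list(words)))
        elif kind == "B":                   # begins a new chunk X
            open_label, open_words = new_label, list(words)
        elif kind == "I":                   # joins the chunk begun earlier
            open_words.extend(words)
        elif kind == "E":                   # ends the chunk begun earlier
            open_words.extend(words)
            merged.append((open_label, open_words))
            open_label, open_words = None, []
        else:
            raise ValueError(f"unknown tag: {tag}")
    return merged


# The example from the text: "O B-VP E-VP O"
layer = [("NP", ["he"]), ("VC", ["is"]), ("NP", ["a", "student"]), ("PU", ["."])]
result = merge_chunks(layer, ["O", "B-VP", "E-VP", "O"])
```

Here result holds three chunks, NP (he), VP (is a student), and PU (.), matching the merged result given in the text.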

The feature templates for parsing comprise:

cons (n): the combined information of the head word, the constituent label, and the decision label of the n-th tree; when n ≥ 0, the decision label is omitted;

cons (n ^*): the combined information of the part of speech of the head word, the constituent label, and the decision label of the n-th tree; when n ≥ 0, the decision label is omitted;

cons (n ^*): the combined information of the constituent label and the decision label of the n-th tree; when n ≥ 0, the decision label is omitted;

The contextual features comprise the following 5 classes:

Class 1: cons (n), cons (n ^*), cons (n ^*), where -2 ≤ n ≤ 3, 18 features in total;

Class 2: cons (m, n), cons (m ^*, n), cons (m, n ^*), cons (m ^*, n ^*), cons (m ^*, n), cons (m ^*, n ^*), cons (m ^*, n ^*), cons (m, n ^*), cons (m ^*, n ^*), where (m, n) = (1,0) or (0,1), 18 features in total;

Class 3: cons (0, m, n), cons (0, m ^*, n ^*), cons (0, m ^*, n), cons (0, m, n ^*), cons (0 ^*, m ^*, n ^*), where (m, n) = (1,2), (2,-1) or (1,1); and cons (1,2,3), cons (1 ^*, 2 ^*, 3 ^*), cons (1 ^*, 2 ^*, 3 ^*), cons (2 ^*, 3 ^*, 4 ^*), cons (2 ^*, 3 ^*, 4 ^*), 20 features in total;

Class 4: cons (0,1,2,3), cons (0,1 ^*, 2 ^*, 3 ^*), cons (0 ^*, 1 ^*, 2 ^*, 3 ^*), cons (1 ^*, 2 ^*, 3 ^*, 4 ^*), cons (1 ^*, 2 ^*, 3 ^*, 4 ^*), 5 features in total;

Class 5: cons (0 ^*, 1 ^*, 2 ^*, 3 ^*, 4 ^*), cons (0 ^*, 1 ^*, 2 ^*, 3 ^*, 4 ^*), 2 features in total.

In the above technical solution, the training process is as follows: the feature templates and contextual feature information are fed to the classifier to build the classifier used for hierarchical syntactic parsing; a part-of-speech tagging training corpus, a base phrase recognition training corpus, and a hierarchical syntactic parsing training corpus are extracted from a treebank; using a maximum entropy model, training is carried out in turn on the part-of-speech tagging corpus, the base phrase recognition corpus, and the hierarchical parsing corpus, yielding the maximum entropy model files;

The hierarchical parsing training corpus is extracted as follows: the syntax trees in the training set are preprocessed and converted into hierarchical syntax trees; working bottom-up from the base phrase recognition layer of a hierarchical syntax tree, contextual features are extracted for each chunk according to the feature templates, and the class label of the chunk is obtained; the contextual features of a chunk together with its class label form one training example; the process then moves up one layer and likewise extracts contextual features and class labels for each chunk; this continues until the root node is produced;

The analysis process is as follows: for a sentence to be parsed, the trained classifier is called repeatedly, layer by layer, to merge chunks, until the whole syntax tree is output.
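How a chunk's class label might be read off two adjacent layers of a hierarchical syntax tree during corpus extraction can be sketched as follows. This is only an illustration under assumptions: the span representation (label, start, end), the function name layer_tags, and the rule that a chunk carried up unchanged receives O are hypothetical choices made here, not details stated in the text.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (label, start, end), end exclusive; hypothetical


def layer_tags(lower: List[Span], upper: List[Span]) -> List[str]:
    """Derive a BIESO tag for every chunk of `lower` from the layer above it."""
    tags = []
    for label, start, end in lower:
        # the chunk one layer up that covers this chunk
        p_label, p_start, p_end = next(
            u for u in upper if u[1] <= start and end <= u[2]
        )
        if (p_start, p_end) == (start, end):
            # same span: either carried up unchanged or relabeled on its own
            tags.append("O" if p_label == label else f"S-{p_label}")
        elif start == p_start:
            tags.append(f"B-{p_label}")   # first child of the new chunk
        elif end == p_end:
            tags.append(f"E-{p_label}")   # last child of the new chunk
        else:
            tags.append(f"I-{p_label}")   # middle child of the new chunk
    return tags


lower = [("NP", 0, 1), ("VC", 1, 2), ("NP", 2, 3), ("PU", 3, 4)]
upper = [("NP", 0, 1), ("VP", 1, 3), ("PU", 3, 4)]
tags = layer_tags(lower, upper)
```

Each tag, paired with the contextual features of its chunk, would form one training example; applying the same function to each pair of adjacent layers moves the extraction up toward the root.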

In the above technical solution, the layered chunk identification and combination method comprises:

(1) Part-of-speech tagging: the given sentence is tagged with parts of speech, and the N best tag sequences are kept;

(2) Base phrase recognition: base phrase recognition is carried out on each of the N tag sequences from step (1), and the N best base phrase recognition results are kept;

(3) Hierarchical syntactic parsing: taking the N base phrase recognition results from step (2) as input, the best hierarchical syntax tree is output, and the repeated nodes in that hierarchical tree are eliminated to obtain the final syntax tree;

where N is an integer from 10 to 20. If N is too large, too many useless intermediate results are kept during parsing, which increases system overhead; if N is too small, some correct intermediate results may be lost.

In a preferred technical solution, N is 20.

Syntactic parsing is a basic problem of natural language processing. It refers to deriving the syntactic structure of a sentence automatically according to a given grammar, that is, the syntactic units the sentence contains and the relations between them. The foremost problem in parsing is ambiguity: even for a very short sentence, hundreds or thousands of grammatical syntax trees can be constructed, and it is hard to judge which candidate is correct or best. Parsing therefore has two key subproblems to solve: 1) how to represent a tree; 2) how to score each syntax tree.

At present, most statistical parsers represent a syntax tree as a series of decisions, assign a probability or score to each decision, and take the product of the decision scores as the score of the whole syntax tree. For example, in the probabilistic context-free grammar (PCFG) model, a syntax tree is usually represented as a series of context-free productions (grammar rules), and the product of the production probabilities is the probability of the whole syntax tree.
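The PCFG scoring just described can be illustrated with a small sketch. The grammar, its probabilities, and the names RULE_PROB and tree_prob are invented here purely for illustration; only the principle (tree probability equals the product of the probabilities of the productions used) comes from the text.

```python
import math

# A tree is (label, children); a child is either a tree or a word string.
# Toy rule probabilities, invented for this example only.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("he",)): 0.5,
    ("VP", ("V",)): 0.4,
    ("V", ("sleeps",)): 0.25,
}


def tree_prob(tree) -> float:
    """Probability of a tree = product of the probabilities of its productions."""
    label, children = tree
    # right-hand side of the production used at this node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            prob *= tree_prob(child)
    return prob


tree = ("S", (("NP", ("he",)), ("VP", (("V", ("sleeps",)),))))
p = tree_prob(tree)  # 1.0 * 0.5 * 0.4 * 0.25
```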

This scheme solves the two key problems in another way, namely a hierarchical parsing method based on history information: given an intermediate result containing several chunks (initially, the words of the sentence can themselves be regarded as an intermediate result), it decides which chunks can form new chunks, which yields a new intermediate result that usually contains fewer chunks than the previous one; based on the newly generated intermediate result, the above step is repeated until the intermediate result contains only one chunk, which is the root node of the syntax tree. The score of each syntax tree is the product of the decision scores of every layer.
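The layer-by-layer loop and the product-of-layer-scores rule can be sketched as below. This is a greedy, single-best simplification written for illustration: the names parse, merge, and toy_tagger, the initial label "W", and the fixed score of 0.9 are all assumptions, and toy_tagger merely stands in for a trained classifier.

```python
import math
from typing import Callable, List, Tuple

# A chunk is (label, children); a child is a chunk or a word string.


def merge(chunks: list, tags: List[str]) -> list:
    """Apply one layer of BIESO tags, building a new node for each merged group."""
    out, pending = [], None
    for chunk, tag in zip(chunks, tags):
        if tag == "O":
            out.append(chunk)
            continue
        kind, label = tag.split("-", 1)
        if kind == "S":
            out.append((label, (chunk,)))
        elif kind == "B":
            pending = (label, [chunk])
        elif kind == "I":
            pending[1].append(chunk)
        else:  # "E" closes the pending chunk
            pending[1].append(chunk)
            out.append((pending[0], tuple(pending[1])))
            pending = None
    return out


def parse(words: List[str],
          tag_layer: Callable[[list], Tuple[List[str], float]],
          max_layers: int = 100):
    """Repeat tagging and merging until a single chunk (the root) remains."""
    chunks = [("W", (w,)) for w in words]  # each word is an initial chunk
    score = 1.0
    for _ in range(max_layers):
        if len(chunks) == 1:
            return chunks[0], score
        tags, layer_score = tag_layer(chunks)
        chunks = merge(chunks, tags)
        score *= layer_score  # tree score = product of per-layer scores
    raise RuntimeError("no root node produced within max_layers")


def toy_tagger(chunks: list) -> Tuple[List[str], float]:
    """Stand-in for a trained classifier: always merges the first two chunks."""
    return ["B-X", "E-X"] + ["O"] * (len(chunks) - 2), 0.9


tree, score = parse(["a", "b", "c"], toy_tagger)
```

With three words the toy tagger needs two layers, so the score is 0.9 multiplied by 0.9; a real classifier would tag several groups per layer and would usually need fewer layers.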

The above technical solution is history-based: it obtains the available contextual information dynamically and uses a machine learning method (such as maximum entropy or SVM) to predict the next decision correctly and reliably, thereby generating the syntax tree.

By using the above technical solution, the invention has the following advantages over the prior art:

1. The invention proposes a hierarchical parsing method based on history information. The method treats the construction of a syntax tree as a layer-by-layer, progressive process. In the processing of each layer, the chunks that are easiest to identify are identified first, which provides richer contextual information for identifying complex chunks; the chunks that were not merged and the newly identified chunks together form the input of the next step, and the process repeats until the root node is identified. Before a new chunk is generated, all of its child nodes must already have been generated; lower-level chunks are always generated first, which improves the accuracy of decision prediction and thus the performance of parsing.

2. Experiments show that the method of the invention is simple and effective, and outperforms existing history-based methods in parsing performance. In addition, although its performance is slightly below that of the head-driven statistical parsing model, its efficiency is far higher: the time complexities of the two are O(n²) and O(n⁵) respectively, which demonstrates the speed and effectiveness of this scheme.

Description of drawings

Fig. 1 is a schematic flow chart of the execution in embodiment one of the invention;

Fig. 2 illustrates the contextual features obtained during hierarchical parsing in embodiment two of the invention, according to the context of the current chunk and the predefined feature templates;

Fig. 3 is a schematic diagram of the heap array data structure used in embodiment two of the invention.

Embodiment

The invention is further described below with reference to the drawings and embodiments:

Embodiment one: as shown in Fig. 1, in a hierarchical syntactic parsing method based on history information, the following steps are carried out for any sentence to be processed:

1. If the sentence is Chinese and has not been segmented, call the word segmentation module to segment it; if the sentence is English, or is Chinese that has already been segmented, skip this step;

2. Call the part-of-speech tagging module to assign a part of speech to each word in the sentence; keep the K best tagging results;

3. Perform base phrase recognition on each of the K best part-of-speech tagging results; finally keep the K best base phrase recognition results;

4. Store each base phrase recognition result in the array cell that corresponds to its number of chunks; for example, a result containing m chunks is stored in heap[m]. heap[m] is a heap structure that holds the intermediate results of length m (that is, containing m chunks);

5. for i = n to 2 do step 6 /* n is the number of words in the sentence */

6. for j = 1 to |heap[i]| do steps 7 and 8 /* |heap[i]| is the number of intermediate results in heap[i] */

7. Label each chunk in heap[i][j] (with the BIESO labels described above), merge, and keep the K best results;

8. for k = 1 to K do step 9

9. Insert the merged intermediate result into the heap cell that corresponds to its number of chunks; if its length is m, insert it into heap[m]. Clearly m ≤ i here;

10. Return heap[1][1] as the best parse of the sentence.

Wherein, the preferred value of K is 20.
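Steps 4 to 10 can be sketched as a search over an array of bounded, probability-sorted lists. The sketch below is an illustration under assumptions, not the patented implementation: the names insert, beam_parse, and toy_expand are hypothetical, expand stands in for "label every chunk, merge, and return the best candidates", and it is assumed that expand never returns a probability higher than that of the result it expands.

```python
import math
from typing import Callable, List, Tuple

Result = Tuple[float, tuple]  # (probability, chunks)


def insert(cell: List[Result], item: Result, K: int) -> None:
    """Insert into a cell kept in descending probability order, capped at K."""
    pos = len(cell)
    while pos > 0 and cell[pos - 1][0] < item[0]:
        pos -= 1
    cell.insert(pos, item)  # ties go to the right of existing entries
    del cell[K:]


def beam_parse(n: int, seeds: List[Result],
               expand: Callable[[float, tuple], List[Result]], K: int = 20):
    heap: List[List[Result]] = [[] for _ in range(n + 1)]  # heap[m]: m chunks
    for prob, chunks in seeds:                 # step 4: file the seed results
        insert(heap[len(chunks)], (prob, chunks), K)
    for i in range(n, 1, -1):                  # step 5: high to low
        j = 0
        while j < len(heap[i]):                # step 6: left to right
            prob, chunks = heap[i][j]
            for cand in expand(prob, chunks)[:K]:      # steps 7 and 8
                insert(heap[len(cand[1])], cand, K)    # step 9
            j += 1
    return heap[1][0] if heap[1] else None     # step 10


def toy_expand(prob: float, chunks: tuple) -> List[Result]:
    """Stand-in scorer: merge one adjacent pair, favoring pairs further right."""
    cands = []
    for p in range(len(chunks) - 1):
        merged = chunks[:p] + ("(" + chunks[p] + " " + chunks[p + 1] + ")",) + chunks[p + 2:]
        cands.append((prob * (0.5 + 0.1 * p), merged))
    return sorted(cands, key=lambda c: -c[0])


best = beam_parse(3, [(1.0, ("a", "b", "c"))], toy_expand, K=20)
```

The while loop re-reads len(heap[i]) on every pass because a candidate with the same number of chunks is inserted to the right in the heap currently being scanned; the cap of K keeps that loop bounded.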

The invention parses a sentence by following the flow of Fig. 1, which comprises the word segmentation, part-of-speech tagging, base phrase recognition, and hierarchical syntactic parsing modules. The task of each module is relatively independent, and the output of each module serves as the input of the next. To implement the modules, a word segmentation training file, a part-of-speech tagging training file, a base phrase recognition training file, and a hierarchical parsing training file are extracted from a treebank; a machine learning method such as SVM or maximum entropy is then used for training to obtain the model files; finally, a suitable search algorithm is used to implement the function of each module.

Embodiment two: as shown in Fig. 2, after an intermediate-layer analysis result is obtained, contextual features are extracted for each chunk unit in turn from left to right according to the predefined feature templates, and are used to predict its label:

As Fig. 2 shows, the current intermediate result is "(NP (Bush_NR)) (PP (_P yesterday_NT afternoon_NT)) (PP (from_P Nanjing_NR)) (VV (arrive_VV)) (NP (Shanghai_NR)) (PU (._PU))", where the italicized word in each bracket is the head word of that chunk; for example, the head word of the chunk "(PP (from_P Nanjing_NR))" is "from". Decisions have already been made for the first three chunks, "(NP (Bush_NR)) (PP (_P yesterday_NT afternoon_NT)) (PP (from_P Nanjing_NR))", and they are "O", "O", and "O" respectively; the next step is to predict the decision for the fourth chunk, "(VV (arrive_VV))". The prediction is based on the contextual information of the current chunk: according to the feature templates, features are extracted from a window that runs from the second chunk to the left of the current chunk to the third chunk to its right. The features are divided into 1-element, 2-element, 3-element, 4-element, and 5-element features.
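Extraction of the 1-element features over this window can be sketched as follows. The sketch is hypothetical in its details: the Chunk fields, the level names "word", "pos", and "label", the "NONE" padding value, and the feature string format are assumptions made here. The example reuses the short sentence "he is a student" from the labeling example, with head words and the tags PN and NN assumed for illustration.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    label: str     # constituent label, e.g. "NP"
    head: str      # head word of the chunk
    head_pos: str  # part of speech of the head word


def cons(chunks: List[Chunk], decisions: List[str], cur: int, n: int, level: str) -> str:
    """One template value for the chunk at offset n from the current chunk."""
    idx = cur + n
    if idx < 0 or idx >= len(chunks):
        return "NONE"                      # hypothetical padding outside the sentence
    c = chunks[idx]
    if level == "word":                    # head word + constituent label
        parts = [c.head, c.label]
    elif level == "pos":                   # head part of speech + constituent label
        parts = [c.head_pos, c.label]
    else:                                  # constituent label only
        parts = [c.label]
    if n < 0:                              # decisions exist only left of the current chunk
        parts.append(decisions[idx])
    return "|".join(parts)


def unigram_features(chunks: List[Chunk], decisions: List[str], cur: int) -> List[str]:
    """The 18 one-element features: 3 templates over offsets -2..3."""
    return [
        f"cons({n},{level})=" + cons(chunks, decisions, cur, n, level)
        for n in range(-2, 4)
        for level in ("word", "pos", "label")
    ]


chunks = [Chunk("NP", "he", "PN"), Chunk("VC", "is", "VC"),
          Chunk("NP", "student", "NN"), Chunk("PU", ".", "PU")]
feats = unigram_features(chunks, ["O"], cur=1)  # predicting the tag of "is"
```

The multi-element classes would combine several such values into one feature string; they are left out here to keep the sketch short.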

Fig. 3 shows the heap array data structure used in hierarchical parsing; the whole hierarchical parsing process is a process of maintaining and filling this data structure. The size of the heap array is n, the number of words in the sentence, because during parsing the number of chunks in any intermediate result must be less than or equal to n. Heap heap[i] holds the intermediate results of length i, sorted in descending order of probability. As shown in Fig. 1, the K best results output by the base phrase recognition module are added to the corresponding heaps according to the number of chunks they contain: a base phrase recognition result containing i chunks is added to heap[i].

In the hierarchical parsing algorithm, the intermediate results heap[i][j] are processed one by one in order from high to low (i from n down to 2, where n is the number of words in the sentence) and from left to right (j from 1 to k, where k is the number of elements in the heap); that is, a decision is predicted for each chunk in heap[i][j]. For each of the K best predictions, the chunk merging procedure is called, which determines from the predicted chunk decisions which chunks merge into new chunks. Clearly, the number of chunks in a newly obtained intermediate result is less than or equal to that of the original intermediate result, and the probability of the new intermediate result is lower than that of the original, so in the heap array the new intermediate result lies either to the right of the original intermediate result or in a heap for fewer chunks. The algorithm therefore only needs to process every intermediate result in the heap array strictly in order from high to low and from left to right. Heap heap[1] holds the final results, because at that point the chunks have been merged into a single node, the root node of the syntax tree. heap[1][1] is therefore output as the best syntax tree. To improve efficiency and discard low-probability results, this scheme fixes the size of each heap at a constant K.

Claims

1. A hierarchical syntactic parsing method for natural language based on history information, characterized in that: for a sentence that has already been word-segmented, each word is first regarded as an initial chunk; chunk identification is carried out layer by layer based on contextual information, and chunks that can be combined form new chunks, giving an intermediate result; chunk identification and combination based on contextual information are repeated on the intermediate result until it contains only one chunk; that chunk is the root node of the syntax tree, and the syntax tree expressing the natural language sentence is thereby obtained.

2. The hierarchical syntactic parsing method for natural language according to claim 1, characterized in that: chunk identification and combination use the BIESO labeling scheme; for a chunk category X, B-X means beginning a new chunk X, I-X means joining the preceding chunk, E-X means ending the preceding chunk, S-X means forming a chunk X alone, and O means remaining unchanged; combined with the contextual features and according to the feature templates for parsing, a classifier is used, through a training process and an analysis process, to realize the identification and combination of chunks.

3. The hierarchical syntactic parsing method for natural language according to claim 2, characterized in that the feature templates for parsing comprise:

The contextual features comprise the following 5 classes:

4. The hierarchical syntactic parsing method for natural language according to claim 2, characterized in that the training process is as follows: the feature templates and contextual feature information are fed to the classifier to build the classifier used for hierarchical syntactic parsing; a part-of-speech tagging training corpus, a base phrase recognition training corpus, and a hierarchical syntactic parsing training corpus are extracted from a treebank; using a maximum entropy model, training is carried out in turn on the part-of-speech tagging corpus, the base phrase recognition corpus, and the hierarchical parsing corpus, yielding the maximum entropy model files;

5. The hierarchical syntactic parsing method for natural language according to claim 1, characterized in that the layered chunk identification and combination method comprises:

where N is an integer from 10 to 20.

6. The hierarchical syntactic parsing method for natural language according to claim 5, characterized in that N is 20.