CN107168955A - Chinese word segmentation method using word-context-based character embeddings and neural networks - Google Patents

Chinese word segmentation method using word-context-based character embeddings and neural networks Download PDF

Info

Publication number
CN107168955A
CN107168955A
Authority
CN
China
Prior art keywords
word
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710368867.6A
Other languages
Chinese (zh)
Other versions
CN107168955B (en)
Inventor
戴新宇
郁振庭
陈家骏
黄书剑
张建兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201710368867.6A priority Critical patent/CN107168955B/en
Publication of CN107168955A publication Critical patent/CN107168955A/en
Application granted granted Critical
Publication of CN107168955B publication Critical patent/CN107168955B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Machine Translation (AREA)

Abstract

The present invention proposes a Chinese word segmentation method using word-context-based character embeddings and neural networks: character embeddings are learned on large-scale automatically segmented data, and the learned embeddings are used as the input of a neural network segmentation model, which effectively aids model learning. The concrete steps are as follows: character embeddings are learned on large-scale automatically segmented data according to word context and character-position tags, and the character embeddings are used as the input of the neural network segmentation model, effectively improving segmentation performance. Compared with other neural-network-based Chinese word segmenters, the method adopts word-context-based character embeddings and thereby effectively integrates word information into the segmentation model, successfully improving the accuracy of the segmentation task.

Description

Chinese word segmentation method using word-context-based character embeddings and neural networks
Technical field
The present invention relates to a method for performing Chinese word segmentation with a computer, and in particular to an automatic Chinese word segmentation method that combines word-context-based character embeddings with neural networks.
Background art
Chinese word segmentation is a fundamental task of natural language processing, and its wide range of applications has attracted a large amount of related research, promoting the rapid development of the associated techniques. Unlike Western languages, Chinese has no explicit delimiters between the words of a sentence. Since the minimal unit of most natural language processing tasks is the word, the primary problem for Chinese is to recognize the words in a character string. Current approaches to Chinese word segmentation can be roughly divided into two classes: rule-based methods and statistical methods. Dictionary-based rule methods require building a large-scale dictionary; during segmentation, words in the dictionary are matched according to pre-designed matching rules, thereby segmenting the text. In the period when computing resources were relatively limited, machine learning methods required a substantial amount of computation and memory and were impractical, so rule-based methods remained the mainstream approach to Chinese word segmentation for a long time. With the continuous growth of computing resources, machine-learning-based methods have gradually become the main means of solving Chinese word segmentation.
In the first Chinese word segmentation bakeoff held at SIGHAN 2003, the character-based tagging method was proposed for the first time. Although its overall performance was not the highest, its recognition rate on out-of-vocabulary words ranked first. Chinese word segmentation has two major difficulties: ambiguity resolution and out-of-vocabulary word recognition. Experiments show that these two problems are not of equal weight: the impact of out-of-vocabulary words is far greater than that of ambiguity. Character-based sequence labeling methods were gradually accepted and became the mainstream approach to the segmentation problem.
A common way to model the segmentation task is to treat it as a sequence labeling task. The workflow is as follows: for a sentence to be segmented, each character is labeled from left to right. The commonly used tagging scheme is the four-tag set {B, M, E, S}, where B marks the first character of a multi-character word, M marks a middle character of a multi-character word, E marks the last character of a multi-character word, and S marks a single-character word. Once the tag sequence is obtained, it can be converted into the segmentation result. The present invention also models Chinese word segmentation as a sequence labeling task and employs the above tag set.
A neural network is a widely used machine learning method with the ability to automatically learn feature combinations from atomic features. This differs from conventional methods, which require designing a large number of task-specific feature templates according to prior knowledge such as linguistics. Using neural networks saves the work of manually crafting large numbers of combined feature templates, while the strong expressive power of the network learns combinations between features automatically. The present invention uses a bidirectional long short-term memory (Bi-LSTM) neural network to process the character sequence of a sentence, so as to better capture long-distance features.
For neural-network-based models, an important question is how to use character embeddings. Given enough training data, character embeddings can first be randomly initialized and then learned jointly with model training, yielding high-quality embeddings. However, for a task such as word segmentation, the scale of the annotated dataset is very limited, typically on the order of tens of thousands of sentences. It is therefore difficult to train the embeddings well, and because of the limited data scale, out-of-vocabulary items are frequently encountered on test data. One approach is to learn embeddings from unsupervised data; typical methods include word2vec and GloVe, whose basis is the distributional hypothesis: similar words appear in similar contexts, so similar words obtain similar or close embeddings. But what counts as "similar" depends on the specific task; for different tasks, the notion of "similar" differs.
Summary of the invention
Purpose of the invention: aiming at the shortcoming that existing character-based tagging models in current Chinese word segmenters cannot make full use of word information, the present invention proposes a word-context-based character embedding learning method to indirectly integrate word-level information, thereby improving the accuracy of the Chinese word segmentation task.
To solve the above technical problem, the invention discloses a Chinese word segmentation method using word-context-based character embeddings and neural networks, together with additional information on the model parameter training methods used in the analysis process.
The Chinese word segmentation method of the present invention using word-context-based character embeddings and neural networks comprises the following steps:
Step 1: the computer reads large-scale automatically segmented data and obtains character embeddings and bigram embeddings using the word-context-based character embedding learning method;
Step 2: the sentence to be segmented is segmented using the neural-network-based method.
Step 1 comprises the following steps:
Step 1-1: according to the four-tag scheme, a segmented sentence can be represented as a character sequence {c1, c2, …, cn} and a tag sequence {l1, l2, …, ln}, where n is the length of the sentence and li ∈ {B, M, E, S}. The four-tag scheme has four tags B, M, E, S, where B marks the first character of a multi-character word, M a middle character of a multi-character word, E the last character of a multi-character word, and S a single-character word. An example illustrating the meaning of the four-tag scheme follows; first, a segmented sentence:
(1) 自然科学 的 研究 不断 深入 (The research of natural science deepens continuously)
The same sentence labeled with the four-tag scheme takes the following form:
(2) 自/B 然/M 科/M 学/E 的/S 研/B 究/E 不/B 断/E 深/B 入/E
(1) and (2) are two equivalent forms under the four-tag scheme and can be converted into each other; in the tagging method, the form in (2) is obtained first and then converted into the form in (1), which is the segmentation result.
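The conversion from form (2) back to form (1) is mechanical. As a minimal illustration (a Python sketch; the patent itself gives no code at this point), the following function turns a character sequence and its B/M/E/S tags into segmented words:

```python
def tags_to_words(chars, tags):
    """Convert a character sequence and its B/M/E/S tag sequence
    into a list of segmented words."""
    words, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):        # E ends a multi-character word, S is a one-character word
            words.append("".join(current))
            current = []
    if current:                      # flush leftovers of an ill-formed tag sequence
        words.append("".join(current))
    return words

print(tags_to_words("自然科学的研究不断深入", list("BMMESBEBEBE")))
# ['自然科学', '的', '研究', '不断', '深入']
```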
Step 1-2: character embeddings and bigram embeddings are learned on the large-scale automatically segmented data using the word-context-based character embedding learning method.
In step 1-2, all sentences in the whole large-scale automatically segmented dataset are concatenated into one long sequence forming the dataset. The whole dataset is represented as a character sequence {c1, c2, …, cT} and a corresponding tag sequence {l1, l2, …, lT}, where T is the number of characters in the dataset, cT denotes the T-th character, and lT denotes the tag of the T-th character.
Step 1-2 comprises the following steps:
Step 1-2-1: the learning objective of the character embeddings is defined as:

$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}\mid c_t)+\log p(l_{t+j}\mid c_t)\Bigr],$$

where $\log p(c_{t+j}\mid c_t)$ and $\log p(l_{t+j}\mid c_t)$ are computed as follows:

$$\log p(c_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(c_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{uni}^{c}(c_i)^{T}e_{uni}(c_t)\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(l_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\log\sigma\bigl(-e_{uni}^{c}(l)^{T}e_{uni}(c_t)\bigr),$$

where σ denotes the sigmoid function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; e_uni denotes the input character embedding matrix and e^c_uni the output-side character embedding matrix; e_uni(x) denotes the embedding of character x taken from the input embedding matrix and e^c_uni(x) the one taken from the output-side matrix; k denotes the number of negative samples, P_n(c) the sampling distribution, and a the size of the context window;
Step 1-2-2: the character embedding matrix e_uni is learned via stochastic gradient descent.
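A compact sketch of this learning procedure under the objective of step 1-2-1 (Python with numpy; the initialization scheme follows the parameter-training notes later in this description, while the hyperparameter values, the uniform negative-sampling distribution, and the separate tag table are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_char_embeddings(chars, tags, vocab_size, dim=50, a=2, k=5, lr=0.025):
    """chars: character ids; tags: tag ids in 0..3 for B/M/E/S.
    Maximizes log p(c_{t+j}|c_t) + log p(l_{t+j}|c_t) with negative sampling."""
    e_uni = (rng.random((vocab_size, dim)) - 0.5) / dim   # input embeddings, random init
    e_out_c = np.zeros((vocab_size, dim))                 # output char embeddings, zero init
    e_out_l = np.zeros((4, dim))                          # output tag embeddings, zero init

    def update(center, table, target, label):
        """One SGD step on a log-sigmoid term: label 1.0 for a positive
        pair, 0.0 for a negative pair."""
        u = table[target].copy()
        g = lr * (label - sigmoid(u @ e_uni[center]))
        table[target] += g * e_uni[center]
        e_uni[center] += g * u

    T = len(chars)
    for t in range(T):
        for j in range(-a, a + 1):
            if j == 0 or not 0 <= t + j < T:
                continue
            update(chars[t], e_out_c, chars[t + j], 1.0)      # positive context character
            for neg in rng.integers(0, vocab_size, k):        # k negatives drawn from P_n(c)
                update(chars[t], e_out_c, neg, 0.0)
            update(chars[t], e_out_l, tags[t + j], 1.0)       # positive context tag
            for l in range(4):                                # the other three tags as negatives
                if l != tags[t + j]:
                    update(chars[t], e_out_l, l, 0.0)
    return e_uni
```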
Step 1-2-3: the learning objective of the bigram embeddings is defined as:

$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})+\log p(l_{t+j}\mid c_t c_{t+1})\Bigr],$$

where $\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})$ and $\log p(l_{t+j}\mid c_t c_{t+1})$ are computed as follows:

$$\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(c_{t+j}c_{t+j+1})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i c_{i+1}\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{bi}^{c}(c_i c_{i+1})^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(l_{t+j})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\bigl[\log\sigma\bigl(-e_{bi}^{c}(l)^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$

where e_bi denotes the input bigram embedding matrix and e^c_bi the output-side bigram embedding matrix; e_bi(x) denotes the embedding of bigram x taken from the input bigram embedding matrix and e^c_bi(x) the one taken from the output-side matrix; c_t c_{t+1} denotes the bigram obtained by joining the t-th and (t+1)-th characters;
Step 1-2-4: after the learning objective of the bigram embeddings has been defined, the bigram embedding matrix e_bi is learned via stochastic gradient descent.
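The bigram objective of step 1-2-3 has exactly the same shape as the unigram one, with bigrams c_t c_{t+1} in place of characters, so the sketch above can be reused on bigram ids (the bigram vocabulary and its out-of-vocabulary handling below are assumptions for illustration):

```python
def to_bigram_ids(chars, bigram_vocab, unk=0):
    """Map position t to the id of the bigram formed by the t-th and
    (t+1)-th characters; unseen bigrams fall back to an assumed UNK id."""
    return [bigram_vocab.get(chars[t] + chars[t + 1], unk)
            for t in range(len(chars) - 1)]

# e_bi = train_char_embeddings(to_bigram_ids(chars, bigram_vocab),
#                              tags[:-1], vocab_size=len(bigram_vocab) + 1)
```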
In step 2, w1, w2, …, wn denotes the sentence to be segmented, n the length of the sentence, and wn the n-th character. Step 2 comprises the following steps:
Step 2-1: when processing the t*-th character, all tag types are scored using the neural network, where 1 ≤ t* ≤ n;
Step 2-2: step 2-1 is performed iteratively for t* = 1, 2, …, n; following the greedy algorithm, the highest-scoring tag at each step is selected as the current tag, where n is the length of the sentence to be segmented;
Step 2-3: after the character-position tag sequence of the whole sentence is obtained, it is converted into the segmentation result, which is the final result of sentence analysis.
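A sketch of this greedy decoding loop (Python; score_fn stands in for the neural scorer of step 2-1, and tags_to_words is the converter sketched after the tagging example above):

```python
def greedy_decode(sentence, score_fn):
    """At each position, pick the highest-scoring tag among B/M/E/S
    (steps 2-1 and 2-2), then convert the tag sequence into the
    segmentation result (step 2-3)."""
    tags = []
    for t in range(len(sentence)):
        scores = score_fn(sentence, t)   # dict: tag -> real-valued score
        tags.append(max(scores, key=scores.get))
    return tags_to_words(sentence, tags)
```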
Step 2-1 comprises the following steps:
Step 2-1-1: feature vectors are generated; the feature vectors include character features and bigram features;
Step 2-1-2: the neural network computes the scores of all candidate tags from the feature vectors generated in step 2-1-1.
Step 2-1-1 comprises the following steps:
Step 2-1-1-1: from the character embeddings and bigram embeddings learned in step 1-2, the character feature vector e_uni(w_t*) and the bigram feature vector e_bi(w_t*-1 w_t*) are obtained, where the subscript t* denotes the current position;
Step 2-1-1-2: the character feature vector and the bigram feature vector are concatenated to obtain the feature representation g(w_t*) of the current position.
Step 2-1-2 comprises the following steps:
Step 2-1-2-1: the intermediate representation of the current position is computed using the bidirectional long short-term memory (Bi-LSTM) neural network model, whose input is g(w_t*). The forward pass is computed by the forward LSTM unit, where the model parameter matrices are trained values whose elements are real numbers, and this parameter group is independent of t*; h_f(w_t*) and c_f(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th computing unit. Three groups of matrices are the transition matrices applied, respectively, to the hidden-layer vector, the input vector, and the memory cell vector in the gate computations, and one group is the transition matrices applied to the hidden-layer vector and the input vector. tanh is the hyperbolic tangent function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; σ is the sigmoid function, likewise a real-valued function applied element-wise; ⊙ is the pointwise product, i.e., element-wise multiplication of two vectors of identical dimension yielding a result vector of the same dimension.
The backward pass is computed symmetrically by the backward LSTM unit, from the end of the sentence to the beginning, with its own trained parameter matrices, likewise independent of t*; h_b(w_t*) and c_b(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th backward computing unit, and its transition matrices play the same roles as in the forward pass.
h_f(w_t*) and h_b(w_t*) are concatenated to obtain the final intermediate representation f(w_t*).
Step 2-1-2-2: the scores of all tag types are computed using a feedforward network, as follows:
h = tanh(W1 f(w_t*) + b1),
o = W2 h + b2,
where W1, W2, b1, b2 are trained model parameters and h is the hidden-layer vector of the network; o is the computed output, a real-valued vector whose dimension equals the number of character-position tags; its i-th value is the score of tag i at time t*, a real number.
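The scoring network of steps 2-1-2-1 and 2-1-2-2 can be sketched as follows (Python with numpy; the text above names the transition matrices but not the gate equations, so a standard LSTM cell is assumed here, and the stacked-gate parameter layout is an illustrative choice):

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One standard LSTM step (an assumption: the patent text does not
    spell out the gate equations). W and U map the input and previous
    hidden state to the stacked i, f, o, g pre-activations."""
    z = W @ x + U @ h + b
    d = h.shape[0]
    i, f, o = (1.0 / (1.0 + np.exp(-z[k * d:(k + 1) * d])) for k in range(3))
    g = np.tanh(z[3 * d:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def score_positions(feats, params):
    """feats: list of g(w_t) vectors (character + bigram embedding,
    concatenated). Returns one score vector over B/M/E/S per position."""
    (Wf, Uf, bf), (Wb, Ub, bb), (W1, b1), (W2, b2) = params
    d = Uf.shape[1]
    h, c, fwd = np.zeros(d), np.zeros(d), []
    for x in feats:                          # forward pass, left to right
        h, c = lstm_step(x, h, c, Wf, Uf, bf)
        fwd.append(h)
    h, c, bwd = np.zeros(d), np.zeros(d), [None] * len(feats)
    for t in reversed(range(len(feats))):    # backward pass, right to left
        h, c = lstm_step(feats[t], h, c, Wb, Ub, bb)
        bwd[t] = h
    scores = []
    for f_t, b_t in zip(fwd, bwd):
        rep = np.concatenate([f_t, b_t])     # f(w_t): forward + backward halves
        hidden = np.tanh(W1 @ rep + b1)      # h = tanh(W1 f(w_t) + b1)
        scores.append(W2 @ hidden + b2)      # o = W2 h + b2, one value per tag
    return scores
```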
Beneficial effects: aiming at the shortcoming that existing character-based tagging models in current Chinese word segmenters cannot make full use of word information, the present invention proposes a word-context-based character embedding learning method to indirectly integrate word-level information, improving the accuracy of the Chinese word segmentation task without increasing model complexity.
Brief description of the drawings
The present invention is further illustrated below with reference to the accompanying drawings and the detailed embodiments; the above and/or other advantages of the invention will become more apparent.
Fig. 1 is the flow chart of the present invention.
Detailed description of the embodiments
The present invention proposes a Chinese word segmentation method using word-context-based character embeddings and neural networks: character embeddings are learned on large-scale automatically segmented data, and the learned embeddings are used as the input of the neural network segmentation model, effectively aiding model learning.
As shown in Fig. 1, the invention discloses a Chinese word segmentation method using word-context-based character embeddings and neural networks; by introducing character embeddings based on word vectors, it can both exploit character-level features and indirectly integrate word information, improving the accuracy of the Chinese word segmentation task.
The Chinese word segmentation method of the present invention using word-context-based character embeddings and neural networks comprises the following steps:
Step 1: the computer reads large-scale automatically segmented data and obtains character embeddings and bigram embeddings using the word-context-based character embedding learning method;
Step 2: the sentence to be segmented is segmented using the neural-network-based method.
Step 1 comprises the following steps:
Step 1-1: according to the four-tag scheme, a segmented sentence can be represented as a character sequence {c1, c2, …, cn} and a tag sequence {l1, l2, …, ln}, where n is the length of the sentence and li ∈ {B, M, E, S}. The four-tag scheme has four tags B, M, E, S, where B marks the first character of a multi-character word, M a middle character of a multi-character word, E the last character of a multi-character word, and S a single-character word. An example illustrating the meaning of the four-tag scheme follows; first, a segmented sentence:
(1) 自然科学 的 研究 不断 深入 (The research of natural science deepens continuously)
The same sentence labeled with the four-tag scheme takes the following form:
(2) 自/B 然/M 科/M 学/E 的/S 研/B 究/E 不/B 断/E 深/B 入/E
(1) and (2) are two equivalent forms under the four-tag scheme and can be converted into each other; in the tagging method, the form in (2) is obtained first and then converted into the form in (1), which is the segmentation result.
Step 1-2: character embeddings and bigram embeddings are learned on the large-scale automatically segmented data using the word-context-based character embedding learning method.
In step 1-2, all sentences in the whole large-scale automatically segmented dataset are concatenated into one long sequence forming the dataset, represented as a character sequence {c1, c2, …, cT} and a corresponding tag sequence {l1, l2, …, lT}, where T is the number of characters in the dataset, cT denotes the T-th character, and lT denotes its tag.
Step 1-2 comprises the following steps:
Step 1-2-1: the learning objective of the character embeddings is defined as:

$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}\mid c_t)+\log p(l_{t+j}\mid c_t)\Bigr],$$

where $\log p(c_{t+j}\mid c_t)$ and $\log p(l_{t+j}\mid c_t)$ are computed as follows:

$$\log p(c_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(c_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{uni}^{c}(c_i)^{T}e_{uni}(c_t)\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(l_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\log\sigma\bigl(-e_{uni}^{c}(l)^{T}e_{uni}(c_t)\bigr),$$

where σ denotes the sigmoid function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; e_uni denotes the input character embedding matrix and e^c_uni the output-side character embedding matrix; e_uni(x) denotes the embedding of character x taken from the input embedding matrix and e^c_uni(x) the one taken from the output-side matrix; k denotes the number of negative samples, P_n(c) the sampling distribution, and a the size of the context window;
Step 1-2-2: the character embedding matrix e_uni is learned via stochastic gradient descent.
Step 1-2-3: the learning objective of the bigram embeddings is defined as:

$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})+\log p(l_{t+j}\mid c_t c_{t+1})\Bigr],$$

where $\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})$ and $\log p(l_{t+j}\mid c_t c_{t+1})$ are computed as follows:

$$\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(c_{t+j}c_{t+j+1})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i c_{i+1}\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{bi}^{c}(c_i c_{i+1})^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(l_{t+j})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\bigl[\log\sigma\bigl(-e_{bi}^{c}(l)^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$

where e_bi denotes the input bigram embedding matrix and e^c_bi the output-side bigram embedding matrix; e_bi(x) denotes the embedding of bigram x taken from the input bigram embedding matrix and e^c_bi(x) the one taken from the output-side matrix; c_t c_{t+1} denotes the bigram obtained by joining the t-th and (t+1)-th characters;
Step 1-2-4: after the learning objective of the bigram embeddings has been defined, the bigram embedding matrix e_bi is learned via stochastic gradient descent.
In step 2, w1, w2, …, wn denotes the sentence to be segmented, n the length of the sentence, and wn the n-th character. Step 2 comprises the following steps:
Step 2-1: when processing the t*-th character, all tag types are scored using the neural network, where 1 ≤ t* ≤ n;
Step 2-2: step 2-1 is performed iteratively for t* = 1, 2, …, n; following the greedy algorithm, the highest-scoring tag at each step is selected as the current tag, where n is the length of the sentence to be segmented;
Step 2-3: after the character-position tag sequence of the whole sentence is obtained, it is converted into the segmentation result, which is the final result of sentence analysis.
Step 2-1 comprises the following steps:
Step 2-1-1: feature vectors are generated; the feature vectors include character features and bigram features;
Step 2-1-2: the neural network computes the scores of all candidate tags from the feature vectors generated in step 2-1-1.
Step 2-1-1 comprises the following steps:
Step 2-1-1-1: from the character embeddings and bigram embeddings learned in step 1-2, the character feature vector e_uni(w_t*) and the bigram feature vector e_bi(w_t*-1 w_t*) are obtained, where the subscript t* denotes the current position;
Step 2-1-1-2: the character feature vector and the bigram feature vector are concatenated to obtain the feature representation g(w_t*) of the current position.
Step 2-1-2 comprises the following steps:
Step 2-1-2-1: the intermediate representation of the current position is computed using the bidirectional long short-term memory (Bi-LSTM) neural network model, whose input is g(w_t*). The forward pass is computed by the forward LSTM unit, where the model parameter matrices are trained values whose elements are real numbers, and this parameter group is independent of t*; h_f(w_t*) and c_f(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th computing unit. Three groups of matrices are the transition matrices applied, respectively, to the hidden-layer vector, the input vector, and the memory cell vector in the gate computations, and one group is the transition matrices applied to the hidden-layer vector and the input vector. tanh is the hyperbolic tangent function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; σ is the sigmoid function, likewise a real-valued function applied element-wise; ⊙ is the pointwise product, i.e., element-wise multiplication of two vectors of identical dimension yielding a result vector of the same dimension.
The backward pass is computed symmetrically by the backward LSTM unit, from the end of the sentence to the beginning, with its own trained parameter matrices, likewise independent of t*; h_b(w_t*) and c_b(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th backward computing unit, and its transition matrices play the same roles as in the forward pass.
h_f(w_t*) and h_b(w_t*) are concatenated to obtain the final intermediate representation f(w_t*).
Step 2-1-2-2: the scores of all tag types are computed using a feedforward network, as follows:
h = tanh(W1 f(w_t*) + b1),
o = W2 h + b2,
where W1, W2, b1, b2 are trained model parameters and h is the hidden-layer vector of the network; o is the computed output, a real-valued vector whose dimension equals the number of character-position tags; its i-th value is the score of tag i at time t*, a real number.
The additional information on the model parameter training methods used in the analysis process of the present invention is as follows:
From step 1 of the analysis process, the parameters used in the word-context-based character embedding learning described in the present invention include:
1. the parameters for learning the character embeddings;
2. the parameters for learning the bigram embeddings.
In learning the character embeddings, the input parameter e_uni is initialized with random values and the output-side parameter e^c_uni is initialized with zeros. The training target is to minimize the objective function in step 1-2-1; in character embedding learning methods, the parameters are usually solved by stochastic gradient descent, and this embodiment likewise uses stochastic gradient descent to solve for the parameters.
In learning the bigram embeddings, the input parameter e_bi is initialized with random values and the output-side parameter e^c_bi is initialized with zeros. The training target is to minimize the objective function in step 1-2-3; the parameters are likewise usually solved by stochastic gradient descent, which this embodiment also uses.
From step 2 of the analysis process, the parameters used in the segmentation process described in the present invention include the following parts (these parameters are called the model parameter group below):
1. the character embedding e_uni(w_t*) and bigram embedding e_bi(w_t*-1 w_t*) of the input feature vector, where w_t* is a character and w_t*-1 w_t* is a bigram;
2. the neural network parameters used for the forward features in step 2-1-2-1;
3. the neural network parameters used for the backward features in step 2-1-2-1;
4. the network parameters W1, W2, b1, b2 used by the feedforward network in step 2-1-2-2.
The training process uses the correct tag sequences of the training dataset via maximum likelihood and is realized iteratively. Before training starts, the parameters e_uni and e_bi are obtained according to step 1; the parameters in parts 2, 3, and 4 are initialized with random values. The annotated dataset dataset = {sent1, sent2, …, sentD} (where D is the dataset size) is then used to train the parameters. First a training objective is defined over the whole dataset, also called the loss function; it is a function of all parameters in the model parameter group, denoted L(dataset), and for each sentence sentr the loss function is denoted loss(sentr). Both are defined and computed as follows:
According to step 2-2, the score score(sentr, t*, i) of tag i at any time t* in sentr can be obtained. Let the correct tag at time t* be gold; then the loss function on the sentence can be defined as:

$$loss(sent_r)=-\sum_{t^*}\log\frac{e^{score(sent_r,t^*,gold)}}{\sum_{i}e^{score(sent_r,t^*,i)}},$$

where e^x denotes the exponential function and e the base of the natural logarithm.
The loss function for the whole training dataset is defined as:

$$L(dataset)=\sum_{r=1}^{D}loss(sent_r),$$

which is a function of the parameters in the model parameter group.
The target of the whole training process is to minimize the above loss function. There are many methods, well known to practitioners, for minimizing it and solving for the parameters; for example, this embodiment uses stochastic gradient descent to solve for the parameters in parts 2, 3, and 4.
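A sketch of the per-sentence loss under this maximum-likelihood objective (Python with numpy; scores[t] is the output vector o of step 2-1-2-2 at position t, and the max-subtraction stabilization is an implementation detail, not part of the patent text):

```python
import numpy as np

def sentence_loss(scores, gold_tags):
    """Negative log-likelihood of the gold tags under a position-wise
    softmax over the tag scores, i.e. loss(sent_r)."""
    loss = 0.0
    for o, gold in zip(scores, gold_tags):
        z = o - o.max()                       # stabilize the exponentials
        loss -= z[gold] - np.log(np.exp(z).sum())
    return loss
```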
Embodiment 1
First, the annotated data used in this embodiment is the Chinese Penn Treebank CTB 6.0, with 23,401 sentences in the training set, 2,078 in the development set, and 2,795 in the test set. The automatically segmented data is a total of 41,071,242 sentences obtained from Chinese Gigaword (LDC2011T13).
The complete procedure of this embodiment, using the Chinese word segmentation method of the present invention based on word-context character embeddings and neural networks, is as follows:
Step 1-1: determine the tagging scheme of the character tagging model, defining the four types B, M, E, S, whose concrete meanings are given in section 1-1 of the description;
Step 1-2: train on the automatically segmented Chinese Gigaword data to obtain the character embedding matrix e_uni and the bigram embedding matrix e_bi;
Step 2-1: read a Chinese sentence, 你马上过来 ("Come over at once"), and compute the score of each tag at every position:
1. 你 score(B)=1.01 score(M)=0.32 score(E)=0.13 score(S)=2.34
2. 马 score(B)=1.82 score(M)=0.46 score(E)=0.39 score(S)=0.42
3. 上 score(B)=0.25 score(M)=0.23 score(E)=2.26 score(S)=0.47
4. 过 score(B)=2.37 score(M)=0.74 score(E)=0.29 score(S)=0.56
5. 来 score(B)=0.27 score(M)=0.10 score(E)=3.26 score(S)=0.24
Step 2-2: the tag sequence is obtained according to the greedy strategy: 你/S 马/B 上/E 过/B 来/E
Step 2-3: the tag sequence is converted into the segmentation result: 你 马上 过来
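Running the greedy decoder sketched earlier on these hard-coded scores reproduces steps 2-2 and 2-3:

```python
scores = [  # per-position B/M/E/S scores from step 2-1 above
    {"B": 1.01, "M": 0.32, "E": 0.13, "S": 2.34},  # 你
    {"B": 1.82, "M": 0.46, "E": 0.39, "S": 0.42},  # 马
    {"B": 0.25, "M": 0.23, "E": 2.26, "S": 0.47},  # 上
    {"B": 2.37, "M": 0.74, "E": 0.29, "S": 0.56},  # 过
    {"B": 0.27, "M": 0.10, "E": 3.26, "S": 0.24},  # 来
]
print(greedy_decode("你马上过来", lambda s, t: scores[t]))
# ['你', '马上', '过来']
```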
Embodiment 2
The algorithms used in the present invention are all implemented in C++. The machine used for the experiments of this embodiment: Intel(R) Core(TM) i7-4790K processor, 4.0 GHz clock frequency, 24 GB memory. The annotated data used in this embodiment is the Chinese Penn Treebank CTB 6.0, with 23,401 training sentences, 2,078 development sentences, and 2,795 test sentences. The automatically segmented data is a total of 41,071,242 sentences obtained from Chinese Gigaword (LDC2011T13). The model parameters are trained on the Gigaword data and the CTB 6.0 data, and the experimental results are shown in Table 1:
Table 1. Experimental results
Here Xu and Sun (2016) is a segmentation model based on a dependency-based recurrent neural network; Liu (2016) is a segmentation model using segment representations; Zhang (2016) is a transition-based neural network segmentation model; and Zhang (2016) comb is a transition-based segmentation model combining traditional features with neural network features. These models represent the current state of the art among neural-network-based segmentation models. Note that evaluating on this dataset is a usual way of evaluating Chinese word segmentation. It can be seen that the method of the present invention achieves a higher F1-score on this dataset, demonstrating the effectiveness of the method.
The calculation of the F1-score is explained here. Because the test set is annotated data, the correct annotation results are known. Suppose that, over the whole dataset, the set of all correct words is S(gold) with size count(gold); after each sentence in the dataset is segmented in the manner of embodiment 1, the predicted words in all analysis results form the set S(predict) with size count(predict); and the set of identically segmented parts shared by S(gold) and S(predict) is S(correct) with size count(correct). Let the prediction accuracy be denoted precision and the prediction recall rate recall; then each value is calculated as follows:

$$precision=\frac{count(correct)}{count(predict)},\qquad recall=\frac{count(correct)}{count(gold)},\qquad F1\text{-}score=\frac{2\times precision\times recall}{precision+recall}.$$
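A sketch of this F1 computation (Python; words are compared as character-offset spans so that identical strings at different positions are kept distinct, which matches counting "identically segmented parts"):

```python
def f1_score(gold_words, predicted_words):
    """precision = count(correct)/count(predict),
    recall = count(correct)/count(gold), F1 = 2PR/(P+R)."""
    def spans(words):
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out
    gold, pred = spans(gold_words), spans(predicted_words)
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```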
Claims (8)

1. A Chinese word segmentation method using word-context-based character embeddings and neural networks, characterised in that it comprises the following steps:
Step 1: a computer reads large-scale automatically segmented data, and character embeddings and bigram embeddings are obtained using the word-context-based character embedding learning method;
Step 2: the sentence to be segmented is segmented using the neural-network-based method.
2. The method according to claim 1, characterised in that step 1 comprises the following steps:
Step 1-1: according to the four-tag scheme, a segmented sentence is represented as a character sequence {c1, c2, …, cn} and a tag sequence {l1, l2, …, ln}, where n is the length of the sentence, li ∈ {B, M, E, S}, 1 ≤ i ≤ n, cn denotes the n-th character of the sentence, ln denotes the tag of the n-th character, and li denotes the tag of the i-th character; the four-tag scheme has four tags B, M, E, S, where B marks the first character of a multi-character word, M a middle character of a multi-character word, E the last character of a multi-character word, and S a single-character word; Step 1-2: character embeddings and bigram embeddings are learned on the large-scale automatically segmented data using the word-context-based character embedding learning method.
3. The method according to claim 2, characterised in that in step 1-2 all sentences in the whole large-scale automatically segmented dataset are concatenated into one long sequence to form the dataset, represented as a character sequence {c1, c2, …, cT} and a corresponding tag sequence {l1, l2, …, lT}, where T is the number of characters in the dataset, cT denotes the T-th character, and lT denotes the tag of the T-th character.
4. The method according to claim 3, characterised in that step 1-2 comprises the following steps:
Step 1-2-1: the learning objective of the character embeddings is defined as:
$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}\mid c_t)+\log p(l_{t+j}\mid c_t)\Bigr],$$
where $\log p(c_{t+j}\mid c_t)$ and $\log p(l_{t+j}\mid c_t)$ are computed as follows:
$$\log p(c_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(c_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{uni}^{c}(c_i)^{T}e_{uni}(c_t)\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(l_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\bigl[\log\sigma\bigl(-e_{uni}^{c}(l)^{T}e_{uni}(c_t)\bigr)\bigr],$$
where σ denotes the sigmoid function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; e_uni denotes the input character embedding matrix and e^c_uni the output-side character embedding matrix; e_uni(x) denotes the embedding of character x taken from the input embedding matrix and e^c_uni(x) the one taken from the output-side matrix; k denotes the number of negative samples, P_n(c) the sampling distribution, and a the size of the context window;
Step 1-2-2: the character embedding matrix e_uni is learned via stochastic gradient descent;
Step 1-2-3: the learning objective of the bigram embeddings is defined as:
$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})+\log p(l_{t+j}\mid c_t c_{t+1})\Bigr],$$
where $\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})$ and $\log p(l_{t+j}\mid c_t c_{t+1})$ are computed as follows:
$$\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(c_{t+j}c_{t+j+1})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i c_{i+1}\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{bi}^{c}(c_i c_{i+1})^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(l_{t+j})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\bigl[\log\sigma\bigl(-e_{bi}^{c}(l)^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$
where e_bi denotes the input bigram embedding matrix and e^c_bi the output-side bigram embedding matrix; e_bi(x) denotes the embedding of bigram x taken from the input bigram embedding matrix and e^c_bi(x) the one taken from the output-side matrix; c_t c_{t+1} denotes the bigram obtained by joining the t-th and (t+1)-th characters;
Step 1-2-4: after the learning objective of the bigram embeddings has been defined, the bigram embedding matrix e_bi is learned via stochastic gradient descent.
5. The method according to claim 4, characterised in that in step 2 all characters of the sentence to be segmented are denoted from left to right as w1, w2, …, wn, where n is the length of the sentence and wn denotes the n-th character; step 2 comprises the following steps:
Step 2-1: when processing the t*-th character, all tag types are scored using the neural network, where 1 ≤ t* ≤ n;
Step 2-2: step 2-1 is performed iteratively for t* = 1, 2, …, n; following the greedy algorithm, the highest-scoring tag at each step is selected as the current tag, where n is the length of the sentence to be segmented;
Step 2-3: after the character-position tag sequence of the whole sentence is obtained, it is converted into the segmentation result, which is the final result of sentence analysis.
6. The method according to claim 5, characterised in that step 2-1 comprises the following steps:
Step 2-1-1: feature vectors are generated; the feature vectors include character features and bigram features;
Step 2-1-2: the neural network computes the scores of all candidate tags from the feature vectors generated in step 2-1-1.
7. The method according to claim 6, characterised in that step 2-1-1 comprises the following steps:
Step 2-1-1-1: from the character embeddings and bigram embeddings learned in step 1-2, the character feature vector e_uni(w_t*) and the bigram feature vector e_bi(w_t*-1 w_t*) are obtained, where the subscript t* denotes the current position;
Step 2-1-1-2: the character feature vector and the bigram feature vector are concatenated to obtain the feature representation g(w_t*) of the current position.
8. The method according to claim 7, characterised in that step 2-1-2 comprises the following steps:
Step 2-1-2-1: the intermediate representation of the current position is computed using the bidirectional long short-term memory neural network model, where the input of the network is g(w_t*) and the forward pass is computed by the forward LSTM unit;
the model parameter matrices are trained values whose elements are real numbers, and this parameter group is independent of t*; h_f(w_t*) and c_f(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th computing unit; three groups of matrices are the transition matrices applied, respectively, to the hidden-layer vector, the input vector, and the memory cell vector in the gate computations, and one group is the transition matrices applied to the hidden-layer vector and the input vector; tanh is the hyperbolic tangent function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; σ is the sigmoid function, likewise a real-valued function applied element-wise; ⊙ is the pointwise product, i.e., element-wise multiplication of two vectors of identical dimension yielding a result vector of the same dimension;
the backward pass is computed symmetrically by the backward LSTM unit, from the end of the sentence to the beginning, with its own trained parameter matrices, likewise independent of t*; h_b(w_t*) and c_b(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th backward computing unit, and its transition matrices play the same roles as in the forward pass;
h_f(w_t*) and h_b(w_t*) are concatenated to obtain the final intermediate representation f(w_t*);
Step 2-1-2-2: the scores of all tag types are computed using a feedforward network, as follows:
h = tanh(W1 f(w_t*) + b1),
o = W2 h + b2,
where W1, W2, b1, b2 are trained model parameters and h is the hidden-layer vector of the network; o is the computed output, a real-valued vector whose dimension equals the number of character-position tags; its i-th value is the score of tag i at time t*, a real number.
CN201710368867.6A 2017-05-23 2017-05-23 Chinese word segmentation method using word-context-based character embeddings and neural networks Active CN107168955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710368867.6A CN107168955B (en) 2017-05-23 2017-05-23 Chinese word segmentation method using word-context-based character embeddings and neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710368867.6A CN107168955B (en) 2017-05-23 2017-05-23 Chinese word segmentation method using word-context-based character embeddings and neural networks

Publications (2)

Publication Number Publication Date
CN107168955A true CN107168955A (en) 2017-09-15
CN107168955B CN107168955B (en) 2019-06-04

Family

ID=59820524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710368867.6A Active CN107168955B (en) 2017-05-23 2017-05-23 Chinese word segmentation method using word-context-based character embeddings and neural networks

Country Status (1)

Country Link
CN (1) CN107168955B (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657313A * 2017-09-26 2018-02-02 上海数眼科技发展有限公司 Transfer learning system and method for natural language processing tasks based on domain adaptation
CN107832301A * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Word segmentation processing method and device, mobile terminal and computer-readable storage medium
CN107832302A * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Word segmentation processing method and device, mobile terminal and computer-readable storage medium
CN107918605A * 2017-11-22 2018-04-17 北京百度网讯科技有限公司 Word segmentation processing method and device, mobile terminal and computer-readable storage medium
CN108062305A * 2017-12-29 2018-05-22 北京时空迅致科技有限公司 An iteration-based three-wavelength unsupervised Chinese word segmentation method
CN109325103A * 2018-10-19 2019-02-12 北京大学 Dynamic identifier representation method, apparatus and system for sequence learning
CN109543836A * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method and device, and related product
CN109741806A * 2019-01-07 2019-05-10 北京推想科技有限公司 Auxiliary generation method and device for medical imaging diagnostic reports
CN109960782A * 2018-12-27 2019-07-02 同济大学 Tibetan word segmentation method and device based on deep neural networks
CN110021373A * 2017-09-19 2019-07-16 上海交通大学 A legitimacy prediction method for chemical reactions
CN110263325A * 2019-05-17 2019-09-20 交通银行股份有限公司太平洋***中心 Automatic Chinese word segmenter
CN110491453A * 2018-04-27 2019-11-22 上海交通大学 A yield prediction method for chemical reactions
CN110728141A * 2018-07-16 2020-01-24 中移(苏州)软件技术有限公司 Word segmentation method and device, electronic equipment and storage medium
CN110750992A * 2019-10-09 2020-02-04 吉林大学 Named entity recognition method and device, electronic equipment and medium
CN111160003A * 2018-11-07 2020-05-15 北京猎户星空科技有限公司 Sentence-breaking method and device
CN111428499A * 2020-04-27 2020-07-17 南京大学 Compressed idiom representation method fusing synonym information for automatic question-answering systems
US10817665B1 * 2020-05-08 2020-10-27 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
CN112784547A * 2021-02-23 2021-05-11 南方电网调峰调频发电有限公司信息通信分公司 Word segmentation method, device, equipment and medium based on model training
CN113010676A * 2021-03-15 2021-06-22 北京语言大学 Text knowledge extraction method and device and natural language inference system
US11170249B2 2019-08-29 2021-11-09 Abbyy Production Llc Identification of fields in documents with neural networks using global document context
US11861925B2 2020-12-17 2024-01-02 Abbyy Development Inc. Methods and systems of field detection in a document

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2721189C1 (en) 2019-08-29 2020-05-18 Общество с ограниченной ответственностью "Аби Продакшн" Detecting sections of tables in documents by neural networks using global document context


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021227A (en) * 2016-05-16 2016-10-12 南京大学 State transition and neural network-based Chinese chunk parsing method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Li Xuelian et al., "Chinese word segmentation method based on GRU neural networks", Journal of Xiamen University (Natural Science Edition) *
Lai Siwei et al., "Exploring Chinese word segmentation algorithms based on representation learning", Journal of Chinese Information Processing *
Hu Jie et al., "A bidirectional recurrent network model for Chinese word segmentation", Journal of Chinese Computer Systems *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110021373A (en) * 2017-09-19 2019-07-16 上海交通大学 A kind of legitimacy prediction technique of chemical reaction
CN107657313A (en) * 2017-09-26 2018-02-02 上海数眼科技发展有限公司 The transfer learning system and method for the natural language processing task adapted to based on field
CN107657313B (en) * 2017-09-26 2021-05-18 上海数眼科技发展有限公司 System and method for transfer learning of natural language processing task based on field adaptation
CN107832302A (en) * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN107918605A (en) * 2017-11-22 2018-04-17 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN107832302B (en) * 2017-11-22 2021-09-17 北京百度网讯科技有限公司 Word segmentation processing method and device, mobile terminal and computer readable storage medium
CN107918605B (en) * 2017-11-22 2021-08-20 北京百度网讯科技有限公司 Word segmentation processing method and device, mobile terminal and computer readable storage medium
CN107832301A (en) * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN108062305A (en) * 2017-12-29 2018-05-22 北京时空迅致科技有限公司 A kind of unsupervised Chinese word cutting method of three-wave-length based on iteration
CN110491453A (en) * 2018-04-27 2019-11-22 上海交通大学 A kind of yield prediction method of chemical reaction
CN110728141A (en) * 2018-07-16 2020-01-24 中移(苏州)软件技术有限公司 Word segmentation method and device, electronic equipment and storage medium
CN110728141B (en) * 2018-07-16 2023-09-19 中移(苏州)软件技术有限公司 Word segmentation method and device, electronic equipment and storage medium
CN109325103A (en) * 2018-10-19 2019-02-12 北京大学 A kind of dynamic identifier representation method, the apparatus and system of Sequence Learning
CN109325103B (en) * 2018-10-19 2020-12-04 北京大学 Dynamic identifier representation method, device and system for sequence learning
CN111160003B (en) * 2018-11-07 2023-12-08 北京猎户星空科技有限公司 Sentence breaking method and sentence breaking device
CN111160003A (en) * 2018-11-07 2020-05-15 北京猎户星空科技有限公司 Sentence-breaking method and device
CN109543836A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109960782A (en) * 2018-12-27 2019-07-02 同济大学 A kind of Tibetan language segmenting method and device based on deep neural network
CN109741806A (en) * 2019-01-07 2019-05-10 北京推想科技有限公司 A kind of Medical imaging diagnostic reports auxiliary generating method and its device
CN110263325A (en) * 2019-05-17 2019-09-20 交通银行股份有限公司太平洋***中心 Chinese automatic word-cut
CN110263325B (en) * 2019-05-17 2023-05-12 交通银行股份有限公司太平洋***中心 Chinese word segmentation system
US11170249B2 (en) 2019-08-29 2021-11-09 Abbyy Production Llc Identification of fields in documents with neural networks using global document context
CN110750992A (en) * 2019-10-09 2020-02-04 吉林大学 Named entity recognition method, device, electronic equipment and medium
CN111428499B (en) * 2020-04-27 2021-10-26 南京大学 Idiom compression representation method fusing synonym information for automatic question-answering systems
CN111428499A * 2020-04-27 2020-07-17 南京大学 Idiom compression representation method fusing synonym information for automatic question-answering systems
US11113468B1 (en) * 2020-05-08 2021-09-07 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
TWI771841B (en) * 2020-05-08 2022-07-21 南韓商韓領有限公司 Systems and methods for word segmentation based on a competing neural character language model
US10817665B1 (en) * 2020-05-08 2020-10-27 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
US11861925B2 (en) 2020-12-17 2024-01-02 Abbyy Development Inc. Methods and systems of field detection in a document
CN112784547A (en) * 2021-02-23 2021-05-11 南方电网调峰调频发电有限公司信息通信分公司 Word segmentation method, device, equipment and medium based on model training
CN113010676A (en) * 2021-03-15 2021-06-22 北京语言大学 Text knowledge extraction method and device and natural language inference system
CN113010676B (en) * 2021-03-15 2023-12-08 北京语言大学 Text knowledge extraction method, device and natural language inference system

Also Published As

Publication number Publication date
CN107168955B (en) 2019-06-04

Similar Documents

Publication Publication Date Title
CN107168955A (en) Chinese word segmentation method using word-context-based character embeddings and a neural network
CN112214610B (en) Entity relationship joint extraction method based on span and knowledge enhancement
CN110807328B (en) Named entity identification method and system for legal document multi-strategy fusion
CN107168945B (en) Fine-grained opinion mining method using bidirectional recurrent neural networks integrating multiple features
CN109284400B (en) Named entity identification method based on Lattice LSTM and language model
CN109359291A (en) Named entity recognition method
CN110245229A (en) Deep learning topic sentiment classification method based on data augmentation
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN108829801A (en) Event trigger word extraction method based on a document-level attention mechanism
CN108628823A (en) Named entity recognition method combining attention mechanism and multi-task joint training
El Abed et al. ICDAR 2009 online Arabic handwriting recognition competition
CN108460089A (en) Chinese text classification with multi-feature fusion based on attention neural networks
CN109710919A (en) Neural network event extraction method incorporating an attention mechanism
CN107729309A (en) Method and device for Chinese semantic analysis based on deep learning
CN106980608A (en) Chinese electronic health record word segmentation and named entity recognition method and system
CN106980609A (en) Named entity recognition method using conditional random fields based on word vector representations
CN111966917A (en) Event detection and summarization method based on a pre-trained language model
CN110287323B (en) Target-oriented emotion classification method
CN106557462A (en) Named entity recognition method and system
CN105868184A (en) Chinese name recognition method based on recurrent neural network
CN105975555A (en) Enterprise abbreviation extraction method based on bidirectional recurrent neural network
CN110532563A (en) Method and device for detecting key paragraphs in text
CN108874896B (en) Humor identification method based on neural network and humor characteristics
CN103020167B (en) Computer-based Chinese text classification method
CN109657039B (en) Work history information extraction method based on a two-layer BiLSTM-CRF

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant