CN107168955A - Chinese word segmentation method using word-context-based character embeddings and neural networks - Google Patents

Chinese word segmentation method using word-context-based character embeddings and neural networks Download PDF

Info

Publication number
CN107168955A
CN107168955A
Authority
CN
China
Prior art keywords
word
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710368867.6A
Other languages
Chinese (zh)
Other versions
CN107168955B (en)
Inventor
戴新宇
郁振庭
陈家骏
黄书剑
张建兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201710368867.6A priority Critical patent/CN107168955B/en
Publication of CN107168955A publication Critical patent/CN107168955A/en
Application granted granted Critical
Publication of CN107168955B publication Critical patent/CN107168955B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Machine Translation (AREA)

Abstract

The present invention proposes a Chinese word segmentation method using word-context-based character embeddings and neural networks: character embeddings are learned on large-scale automatically segmented data, and the learned embeddings are used as the input of a neural network segmentation model, which effectively aids model learning. The concrete steps are as follows: character embeddings are learned on large-scale automatically segmented data according to word context and character-position tags, and the character embeddings are used as the input of the neural network segmentation model, effectively improving segmentation performance. Compared with other neural-network-based Chinese word segmenters, the method adopts word-context-based character embeddings and thereby effectively integrates word information into the segmentation model, successfully improving the accuracy of the segmentation task.

Description

Chinese word segmentation method using word-context-based character embeddings and neural networks
Technical field
The present invention relates to a method for performing Chinese word segmentation with a computer, and in particular to an automatic Chinese word segmentation method that combines word-context-based character embeddings with neural networks.
Background art
Chinese word segmentation is a fundamental task of natural language processing, and its wide range of applications has attracted a large amount of related research, promoting the rapid development of the associated techniques. Unlike Western languages, Chinese has no explicit delimiters between the words of a sentence. Since the minimal unit of most natural language processing tasks is the word, the primary problem for Chinese is to recognize the words in a character string. Current approaches to Chinese word segmentation can be roughly divided into two classes: rule-based methods and statistical methods. Dictionary-based rule methods require building a large-scale dictionary; during segmentation, words in the dictionary are matched according to pre-designed matching rules, thereby segmenting the text. In the period when computing resources were relatively limited, machine learning methods required a substantial amount of computation and memory and were impractical, so rule-based methods remained the mainstream approach to Chinese word segmentation for a long time. With the continuous growth of computing resources, machine-learning-based methods have gradually become the main means of solving Chinese word segmentation.
In the first Chinese word segmentation bakeoff held at SIGHAN 2003, the character-based tagging method was proposed for the first time. Although its overall performance was not the highest, its recognition rate on out-of-vocabulary words ranked first. Chinese word segmentation has two major difficulties: ambiguity resolution and out-of-vocabulary word recognition. Experiments show that these two problems are not of equal weight: the impact of out-of-vocabulary words is far greater than that of ambiguity. Character-based sequence labeling methods were gradually accepted and became the mainstream approach to the segmentation problem.
A common way to model the segmentation task is to treat it as a sequence labeling task. The workflow is as follows: for a sentence to be segmented, each character is labeled from left to right. The commonly used tagging scheme is the four-tag set {B, M, E, S}, where B marks the first character of a multi-character word, M marks a middle character of a multi-character word, E marks the last character of a multi-character word, and S marks a single-character word. Once the tag sequence is obtained, it can be converted into the segmentation result. The present invention also models Chinese word segmentation as a sequence labeling task and employs the above tag set.
A neural network is a widely used machine learning method with the ability to automatically learn feature combinations from atomic features. This differs from conventional methods, which require designing a large number of task-specific feature templates according to prior knowledge such as linguistics. Using neural networks saves the work of manually crafting large numbers of combined feature templates, while the strong expressive power of the network learns combinations between features automatically. The present invention uses a bidirectional long short-term memory (Bi-LSTM) neural network to process the character sequence of a sentence, so as to better capture long-distance features.
For neural-network-based models, an important question is how to use character embeddings. Given enough training data, character embeddings can first be randomly initialized and then learned jointly with model training, yielding high-quality embeddings. However, for a task such as word segmentation, the scale of the annotated dataset is very limited, typically on the order of tens of thousands of sentences. It is therefore difficult to train the embeddings well, and because of the limited data scale, out-of-vocabulary items are frequently encountered on test data. One approach is to learn embeddings from unsupervised data; typical methods include word2vec and GloVe, whose basis is the distributional hypothesis: similar words appear in similar contexts, so similar words obtain similar or close embeddings. But what counts as "similar" depends on the specific task; for different tasks, the notion of "similar" differs.
Summary of the invention
Purpose of the invention: aiming at the shortcoming that existing character-based tagging models in current Chinese word segmenters cannot make full use of word information, the present invention proposes a word-context-based character embedding learning method to indirectly integrate word-level information, thereby improving the accuracy of the Chinese word segmentation task.
To solve the above technical problem, the invention discloses a Chinese word segmentation method using word-context-based character embeddings and neural networks, together with additional information on the model parameter training methods used in the analysis process.
The Chinese word segmentation method of the present invention using word-context-based character embeddings and neural networks comprises the following steps:
Step 1: the computer reads large-scale automatically segmented data and obtains character embeddings and bigram embeddings using the word-context-based character embedding learning method;
Step 2: the sentence to be segmented is segmented using the neural-network-based method.
Step 1 comprises the following steps:
Step 1-1: according to the four-tag scheme, a segmented sentence can be represented as a character sequence {c1, c2, …, cn} and a tag sequence {l1, l2, …, ln}, where n is the length of the sentence and li ∈ {B, M, E, S}. The four-tag scheme has four tags B, M, E, S, where B marks the first character of a multi-character word, M a middle character of a multi-character word, E the last character of a multi-character word, and S a single-character word. An example illustrating the meaning of the four-tag scheme follows; first, a segmented sentence:
(1) 自然科学 的 研究 不断 深入 (The research of natural science deepens continuously)
The same sentence labeled with the four-tag scheme takes the following form:
(2) 自/B 然/M 科/M 学/E 的/S 研/B 究/E 不/B 断/E 深/B 入/E
(1) and (2) are two equivalent forms under the four-tag scheme and can be converted into each other; in the tagging method, the form in (2) is obtained first and then converted into the form in (1), which is the segmentation result.
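The conversion from form (2) back to form (1) is mechanical. As a minimal illustration (a Python sketch; the patent itself gives no code at this point), the following function turns a character sequence and its B/M/E/S tags into segmented words:

```python
def tags_to_words(chars, tags):
    """Convert a character sequence and its B/M/E/S tag sequence
    into a list of segmented words."""
    words, current = [], []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag in ("E", "S"):        # E ends a multi-character word, S is a one-character word
            words.append("".join(current))
            current = []
    if current:                      # flush leftovers of an ill-formed tag sequence
        words.append("".join(current))
    return words

print(tags_to_words("自然科学的研究不断深入", list("BMMESBEBEBE")))
# ['自然科学', '的', '研究', '不断', '深入']
```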
Step 1-2: character embeddings and bigram embeddings are learned on the large-scale automatically segmented data using the word-context-based character embedding learning method.
In step 1-2, all sentences in the whole large-scale automatically segmented dataset are concatenated into one long sequence forming the dataset. The whole dataset is represented as a character sequence {c1, c2, …, cT} and a corresponding tag sequence {l1, l2, …, lT}, where T is the number of characters in the dataset, cT denotes the T-th character, and lT denotes the tag of the T-th character.
Step 1-2 comprises the following steps:
Step 1-2-1: the learning objective of the character embeddings is defined as:

$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}\mid c_t)+\log p(l_{t+j}\mid c_t)\Bigr],$$

where $\log p(c_{t+j}\mid c_t)$ and $\log p(l_{t+j}\mid c_t)$ are computed as follows:

$$\log p(c_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(c_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{uni}^{c}(c_i)^{T}e_{uni}(c_t)\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(l_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\log\sigma\bigl(-e_{uni}^{c}(l)^{T}e_{uni}(c_t)\bigr),$$

where σ denotes the sigmoid function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; e_uni denotes the input character embedding matrix and e^c_uni the output-side character embedding matrix; e_uni(x) denotes the embedding of character x taken from the input embedding matrix and e^c_uni(x) the one taken from the output-side matrix; k denotes the number of negative samples, P_n(c) the sampling distribution, and a the size of the context window;
Step 1-2-2: the character embedding matrix e_uni is learned via stochastic gradient descent.
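A compact sketch of this learning procedure under the objective of step 1-2-1 (Python with numpy; the initialization scheme follows the parameter-training notes later in this description, while the hyperparameter values, the uniform negative-sampling distribution, and the separate tag table are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_char_embeddings(chars, tags, vocab_size, dim=50, a=2, k=5, lr=0.025):
    """chars: character ids; tags: tag ids in 0..3 for B/M/E/S.
    Maximizes log p(c_{t+j}|c_t) + log p(l_{t+j}|c_t) with negative sampling."""
    e_uni = (rng.random((vocab_size, dim)) - 0.5) / dim   # input embeddings, random init
    e_out_c = np.zeros((vocab_size, dim))                 # output char embeddings, zero init
    e_out_l = np.zeros((4, dim))                          # output tag embeddings, zero init

    def update(center, table, target, label):
        """One SGD step on a log-sigmoid term: label 1.0 for a positive
        pair, 0.0 for a negative pair."""
        u = table[target].copy()
        g = lr * (label - sigmoid(u @ e_uni[center]))
        table[target] += g * e_uni[center]
        e_uni[center] += g * u

    T = len(chars)
    for t in range(T):
        for j in range(-a, a + 1):
            if j == 0 or not 0 <= t + j < T:
                continue
            update(chars[t], e_out_c, chars[t + j], 1.0)      # positive context character
            for neg in rng.integers(0, vocab_size, k):        # k negatives drawn from P_n(c)
                update(chars[t], e_out_c, neg, 0.0)
            update(chars[t], e_out_l, tags[t + j], 1.0)       # positive context tag
            for l in range(4):                                # the other three tags as negatives
                if l != tags[t + j]:
                    update(chars[t], e_out_l, l, 0.0)
    return e_uni
```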
Step 1-2-3: the learning objective of the bigram embeddings is defined as:

$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})+\log p(l_{t+j}\mid c_t c_{t+1})\Bigr],$$

where $\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})$ and $\log p(l_{t+j}\mid c_t c_{t+1})$ are computed as follows:

$$\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(c_{t+j}c_{t+j+1})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i c_{i+1}\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{bi}^{c}(c_i c_{i+1})^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(l_{t+j})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\bigl[\log\sigma\bigl(-e_{bi}^{c}(l)^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$

where e_bi denotes the input bigram embedding matrix and e^c_bi the output-side bigram embedding matrix; e_bi(x) denotes the embedding of bigram x taken from the input bigram embedding matrix and e^c_bi(x) the one taken from the output-side matrix; c_t c_{t+1} denotes the bigram obtained by joining the t-th and (t+1)-th characters;
Step 1-2-4: after the learning objective of the bigram embeddings has been defined, the bigram embedding matrix e_bi is learned via stochastic gradient descent.
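The bigram objective of step 1-2-3 has exactly the same shape as the unigram one, with bigrams c_t c_{t+1} in place of characters, so the sketch above can be reused on bigram ids (the bigram vocabulary and its out-of-vocabulary handling below are assumptions for illustration):

```python
def to_bigram_ids(chars, bigram_vocab, unk=0):
    """Map position t to the id of the bigram formed by the t-th and
    (t+1)-th characters; unseen bigrams fall back to an assumed UNK id."""
    return [bigram_vocab.get(chars[t] + chars[t + 1], unk)
            for t in range(len(chars) - 1)]

# e_bi = train_char_embeddings(to_bigram_ids(chars, bigram_vocab),
#                              tags[:-1], vocab_size=len(bigram_vocab) + 1)
```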
In step 2, w1, w2, …, wn denotes the sentence to be segmented, n the length of the sentence, and wn the n-th character. Step 2 comprises the following steps:
Step 2-1: when processing the t*-th character, all tag types are scored using the neural network, where 1 ≤ t* ≤ n;
Step 2-2: step 2-1 is performed iteratively for t* = 1, 2, …, n; following the greedy algorithm, the highest-scoring tag at each step is selected as the current tag, where n is the length of the sentence to be segmented;
Step 2-3: after the character-position tag sequence of the whole sentence is obtained, it is converted into the segmentation result, which is the final result of sentence analysis.
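A sketch of this greedy decoding loop (Python; score_fn stands in for the neural scorer of step 2-1, and tags_to_words is the converter sketched after the tagging example above):

```python
def greedy_decode(sentence, score_fn):
    """At each position, pick the highest-scoring tag among B/M/E/S
    (steps 2-1 and 2-2), then convert the tag sequence into the
    segmentation result (step 2-3)."""
    tags = []
    for t in range(len(sentence)):
        scores = score_fn(sentence, t)   # dict: tag -> real-valued score
        tags.append(max(scores, key=scores.get))
    return tags_to_words(sentence, tags)
```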
Step 2-1 comprises the following steps:
Step 2-1-1: feature vectors are generated; the feature vectors include character features and bigram features;
Step 2-1-2: the neural network computes the scores of all candidate tags from the feature vectors generated in step 2-1-1.
Step 2-1-1 comprises the following steps:
Step 2-1-1-1: from the character embeddings and bigram embeddings learned in step 1-2, the character feature vector e_uni(w_t*) and the bigram feature vector e_bi(w_t*-1 w_t*) are obtained, where the subscript t* denotes the current position;
Step 2-1-1-2: the character feature vector and the bigram feature vector are concatenated to obtain the feature representation g(w_t*) of the current position.
Step 2-1-2 comprises the following steps:
Step 2-1-2-1: the intermediate representation of the current position is computed using the bidirectional long short-term memory (Bi-LSTM) neural network model, whose input is g(w_t*). The forward pass is computed by the forward LSTM unit, where the model parameter matrices are trained values whose elements are real numbers, and this parameter group is independent of t*; h_f(w_t*) and c_f(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th computing unit. Three groups of matrices are the transition matrices applied, respectively, to the hidden-layer vector, the input vector, and the memory cell vector in the gate computations, and one group is the transition matrices applied to the hidden-layer vector and the input vector. tanh is the hyperbolic tangent function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; σ is the sigmoid function, likewise a real-valued function applied element-wise; ⊙ is the pointwise product, i.e., element-wise multiplication of two vectors of identical dimension yielding a result vector of the same dimension.
The backward pass is computed symmetrically by the backward LSTM unit, from the end of the sentence to the beginning, with its own trained parameter matrices, likewise independent of t*; h_b(w_t*) and c_b(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th backward computing unit, and its transition matrices play the same roles as in the forward pass.
h_f(w_t*) and h_b(w_t*) are concatenated to obtain the final intermediate representation f(w_t*).
Step 2-1-2-2: the scores of all tag types are computed using a feedforward network, as follows:
h = tanh(W1 f(w_t*) + b1),
o = W2 h + b2,
where W1, W2, b1, b2 are trained model parameters and h is the hidden-layer vector of the network; o is the computed output, a real-valued vector whose dimension equals the number of character-position tags; its i-th value is the score of tag i at time t*, a real number.
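The scoring network of steps 2-1-2-1 and 2-1-2-2 can be sketched as follows (Python with numpy; the text above names the transition matrices but not the gate equations, so a standard LSTM cell is assumed here, and the stacked-gate parameter layout is an illustrative choice):

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One standard LSTM step (an assumption: the patent text does not
    spell out the gate equations). W and U map the input and previous
    hidden state to the stacked i, f, o, g pre-activations."""
    z = W @ x + U @ h + b
    d = h.shape[0]
    i, f, o = (1.0 / (1.0 + np.exp(-z[k * d:(k + 1) * d])) for k in range(3))
    g = np.tanh(z[3 * d:])
    c_new = f * c + i * g
    return o * np.tanh(c_new), c_new

def score_positions(feats, params):
    """feats: list of g(w_t) vectors (character + bigram embedding,
    concatenated). Returns one score vector over B/M/E/S per position."""
    (Wf, Uf, bf), (Wb, Ub, bb), (W1, b1), (W2, b2) = params
    d = Uf.shape[1]
    h, c, fwd = np.zeros(d), np.zeros(d), []
    for x in feats:                          # forward pass, left to right
        h, c = lstm_step(x, h, c, Wf, Uf, bf)
        fwd.append(h)
    h, c, bwd = np.zeros(d), np.zeros(d), [None] * len(feats)
    for t in reversed(range(len(feats))):    # backward pass, right to left
        h, c = lstm_step(feats[t], h, c, Wb, Ub, bb)
        bwd[t] = h
    scores = []
    for f_t, b_t in zip(fwd, bwd):
        rep = np.concatenate([f_t, b_t])     # f(w_t): forward + backward halves
        hidden = np.tanh(W1 @ rep + b1)      # h = tanh(W1 f(w_t) + b1)
        scores.append(W2 @ hidden + b2)      # o = W2 h + b2, one value per tag
    return scores
```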
Beneficial effects: aiming at the shortcoming that existing character-based tagging models in current Chinese word segmenters cannot make full use of word information, the present invention proposes a word-context-based character embedding learning method to indirectly integrate word-level information, improving the accuracy of the Chinese word segmentation task without increasing model complexity.
Brief description of the drawings
The present invention is further illustrated below with reference to the accompanying drawings and the detailed embodiments; the above and/or other advantages of the invention will become more apparent.
Fig. 1 is the flow chart of the present invention.
Detailed description of the embodiments
The present invention proposes a Chinese word segmentation method using word-context-based character embeddings and neural networks: character embeddings are learned on large-scale automatically segmented data, and the learned embeddings are used as the input of the neural network segmentation model, effectively aiding model learning.
As shown in Fig. 1, the invention discloses a Chinese word segmentation method using word-context-based character embeddings and neural networks; by introducing character embeddings based on word vectors, it can both exploit character-level features and indirectly integrate word information, improving the accuracy of the Chinese word segmentation task.
The Chinese word segmentation method of the present invention using word-context-based character embeddings and neural networks comprises the following steps:
Step 1: the computer reads large-scale automatically segmented data and obtains character embeddings and bigram embeddings using the word-context-based character embedding learning method;
Step 2: the sentence to be segmented is segmented using the neural-network-based method.
Step 1 comprises the following steps:
Step 1-1: according to the four-tag scheme, a segmented sentence can be represented as a character sequence {c1, c2, …, cn} and a tag sequence {l1, l2, …, ln}, where n is the length of the sentence and li ∈ {B, M, E, S}. The four-tag scheme has four tags B, M, E, S, where B marks the first character of a multi-character word, M a middle character of a multi-character word, E the last character of a multi-character word, and S a single-character word. An example illustrating the meaning of the four-tag scheme follows; first, a segmented sentence:
(1) 自然科学 的 研究 不断 深入 (The research of natural science deepens continuously)
The same sentence labeled with the four-tag scheme takes the following form:
(2) 自/B 然/M 科/M 学/E 的/S 研/B 究/E 不/B 断/E 深/B 入/E
(1) and (2) are two equivalent forms under the four-tag scheme and can be converted into each other; in the tagging method, the form in (2) is obtained first and then converted into the form in (1), which is the segmentation result.
Step 1-2: character embeddings and bigram embeddings are learned on the large-scale automatically segmented data using the word-context-based character embedding learning method.
In step 1-2, all sentences in the whole large-scale automatically segmented dataset are concatenated into one long sequence forming the dataset, represented as a character sequence {c1, c2, …, cT} and a corresponding tag sequence {l1, l2, …, lT}, where T is the number of characters in the dataset, cT denotes the T-th character, and lT denotes its tag.
Step 1-2 comprises the following steps:
Step 1-2-1: the learning objective of the character embeddings is defined as:

$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}\mid c_t)+\log p(l_{t+j}\mid c_t)\Bigr],$$

where $\log p(c_{t+j}\mid c_t)$ and $\log p(l_{t+j}\mid c_t)$ are computed as follows:

$$\log p(c_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(c_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{uni}^{c}(c_i)^{T}e_{uni}(c_t)\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(l_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\log\sigma\bigl(-e_{uni}^{c}(l)^{T}e_{uni}(c_t)\bigr),$$

where σ denotes the sigmoid function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; e_uni denotes the input character embedding matrix and e^c_uni the output-side character embedding matrix; e_uni(x) denotes the embedding of character x taken from the input embedding matrix and e^c_uni(x) the one taken from the output-side matrix; k denotes the number of negative samples, P_n(c) the sampling distribution, and a the size of the context window;
Step 1-2-2: the character embedding matrix e_uni is learned via stochastic gradient descent.
Step 1-2-3: the learning objective of the bigram embeddings is defined as:

$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})+\log p(l_{t+j}\mid c_t c_{t+1})\Bigr],$$

where $\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})$ and $\log p(l_{t+j}\mid c_t c_{t+1})$ are computed as follows:

$$\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(c_{t+j}c_{t+j+1})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i c_{i+1}\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{bi}^{c}(c_i c_{i+1})^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(l_{t+j})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\bigl[\log\sigma\bigl(-e_{bi}^{c}(l)^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$

where e_bi denotes the input bigram embedding matrix and e^c_bi the output-side bigram embedding matrix; e_bi(x) denotes the embedding of bigram x taken from the input bigram embedding matrix and e^c_bi(x) the one taken from the output-side matrix; c_t c_{t+1} denotes the bigram obtained by joining the t-th and (t+1)-th characters;
Step 1-2-4: after the learning objective of the bigram embeddings has been defined, the bigram embedding matrix e_bi is learned via stochastic gradient descent.
In step 2, w1, w2, …, wn denotes the sentence to be segmented, n the length of the sentence, and wn the n-th character. Step 2 comprises the following steps:
Step 2-1: when processing the t*-th character, all tag types are scored using the neural network, where 1 ≤ t* ≤ n;
Step 2-2: step 2-1 is performed iteratively for t* = 1, 2, …, n; following the greedy algorithm, the highest-scoring tag at each step is selected as the current tag, where n is the length of the sentence to be segmented;
Step 2-3: after the character-position tag sequence of the whole sentence is obtained, it is converted into the segmentation result, which is the final result of sentence analysis.
Step 2-1 comprises the following steps:
Step 2-1-1: feature vectors are generated; the feature vectors include character features and bigram features;
Step 2-1-2: the neural network computes the scores of all candidate tags from the feature vectors generated in step 2-1-1.
Step 2-1-1 comprises the following steps:
Step 2-1-1-1: from the character embeddings and bigram embeddings learned in step 1-2, the character feature vector e_uni(w_t*) and the bigram feature vector e_bi(w_t*-1 w_t*) are obtained, where the subscript t* denotes the current position;
Step 2-1-1-2: the character feature vector and the bigram feature vector are concatenated to obtain the feature representation g(w_t*) of the current position.
Step 2-1-2 comprises the following steps:
Step 2-1-2-1: the intermediate representation of the current position is computed using the bidirectional long short-term memory (Bi-LSTM) neural network model, whose input is g(w_t*). The forward pass is computed by the forward LSTM unit, where the model parameter matrices are trained values whose elements are real numbers, and this parameter group is independent of t*; h_f(w_t*) and c_f(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th computing unit. Three groups of matrices are the transition matrices applied, respectively, to the hidden-layer vector, the input vector, and the memory cell vector in the gate computations, and one group is the transition matrices applied to the hidden-layer vector and the input vector. tanh is the hyperbolic tangent function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; σ is the sigmoid function, likewise a real-valued function applied element-wise; ⊙ is the pointwise product, i.e., element-wise multiplication of two vectors of identical dimension yielding a result vector of the same dimension.
The backward pass is computed symmetrically by the backward LSTM unit, from the end of the sentence to the beginning, with its own trained parameter matrices, likewise independent of t*; h_b(w_t*) and c_b(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th backward computing unit, and its transition matrices play the same roles as in the forward pass.
h_f(w_t*) and h_b(w_t*) are concatenated to obtain the final intermediate representation f(w_t*).
Step 2-1-2-2: the scores of all tag types are computed using a feedforward network, as follows:
h = tanh(W1 f(w_t*) + b1),
o = W2 h + b2,
where W1, W2, b1, b2 are trained model parameters and h is the hidden-layer vector of the network; o is the computed output, a real-valued vector whose dimension equals the number of character-position tags; its i-th value is the score of tag i at time t*, a real number.
The additional information on the model parameter training methods used in the analysis process of the present invention is as follows:
From step 1 of the analysis process, the parameters used in the word-context-based character embedding learning described in the present invention include:
1. the parameters for learning the character embeddings;
2. the parameters for learning the bigram embeddings.
In learning the character embeddings, the input parameter e_uni is initialized with random values and the output-side parameter e^c_uni is initialized with zeros. The training target is to minimize the objective function in step 1-2-1; in character embedding learning methods, the parameters are usually solved by stochastic gradient descent, and this embodiment likewise uses stochastic gradient descent to solve for the parameters.
In learning the bigram embeddings, the input parameter e_bi is initialized with random values and the output-side parameter e^c_bi is initialized with zeros. The training target is to minimize the objective function in step 1-2-3; the parameters are likewise usually solved by stochastic gradient descent, which this embodiment also uses.
From step 2 of the analysis process, the parameters used in the segmentation process described in the present invention include the following parts (these parameters are called the model parameter group below):
1. the character embedding e_uni(w_t*) and bigram embedding e_bi(w_t*-1 w_t*) of the input feature vector, where w_t* is a character and w_t*-1 w_t* is a bigram;
2. the neural network parameters used for the forward features in step 2-1-2-1;
3. the neural network parameters used for the backward features in step 2-1-2-1;
4. the network parameters W1, W2, b1, b2 used by the feedforward network in step 2-1-2-2.
The training process uses the correct tag sequences of the training dataset via maximum likelihood and is realized iteratively. Before training starts, the parameters e_uni and e_bi are obtained according to step 1; the parameters in parts 2, 3, and 4 are initialized with random values. The annotated dataset dataset = {sent1, sent2, …, sentD} (where D is the dataset size) is then used to train the parameters. First a training objective is defined over the whole dataset, also called the loss function; it is a function of all parameters in the model parameter group, denoted L(dataset), and for each sentence sentr the loss function is denoted loss(sentr). Both are defined and computed as follows:
According to step 2-2, the score score(sentr, t*, i) of tag i at any time t* in sentr can be obtained. Let the correct tag at time t* be gold; then the loss function on the sentence can be defined as:

$$loss(sent_r)=-\sum_{t^*}\log\frac{e^{score(sent_r,t^*,gold)}}{\sum_{i}e^{score(sent_r,t^*,i)}},$$

where e^x denotes the exponential function and e the base of the natural logarithm.
The loss function for the whole training dataset is defined as:

$$L(dataset)=\sum_{r=1}^{D}loss(sent_r),$$

which is a function of the parameters in the model parameter group.
The target of the whole training process is to minimize the above loss function. There are many methods, well known to practitioners, for minimizing it and solving for the parameters; for example, this embodiment uses stochastic gradient descent to solve for the parameters in parts 2, 3, and 4.
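A sketch of the per-sentence loss under this maximum-likelihood objective (Python with numpy; scores[t] is the output vector o of step 2-1-2-2 at position t, and the max-subtraction stabilization is an implementation detail, not part of the patent text):

```python
import numpy as np

def sentence_loss(scores, gold_tags):
    """Negative log-likelihood of the gold tags under a position-wise
    softmax over the tag scores, i.e. loss(sent_r)."""
    loss = 0.0
    for o, gold in zip(scores, gold_tags):
        z = o - o.max()                       # stabilize the exponentials
        loss -= z[gold] - np.log(np.exp(z).sum())
    return loss
```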
Embodiment 1
First, the annotated data used in this embodiment is the Chinese Penn Treebank CTB 6.0, with 23,401 sentences in the training set, 2,078 in the development set, and 2,795 in the test set. The automatically segmented data is a total of 41,071,242 sentences obtained from Chinese Gigaword (LDC2011T13).
The complete procedure of this embodiment, using the Chinese word segmentation method of the present invention based on word-context character embeddings and neural networks, is as follows:
Step 1-1: determine the tagging scheme of the character tagging model, defining the four types B, M, E, S, whose concrete meanings are given in section 1-1 of the description;
Step 1-2: train on the automatically segmented Chinese Gigaword data to obtain the character embedding matrix e_uni and the bigram embedding matrix e_bi;
Step 2-1: read a Chinese sentence, 你马上过来 ("Come over at once"), and compute the score of each tag at every position:
1. 你 score(B)=1.01 score(M)=0.32 score(E)=0.13 score(S)=2.34
2. 马 score(B)=1.82 score(M)=0.46 score(E)=0.39 score(S)=0.42
3. 上 score(B)=0.25 score(M)=0.23 score(E)=2.26 score(S)=0.47
4. 过 score(B)=2.37 score(M)=0.74 score(E)=0.29 score(S)=0.56
5. 来 score(B)=0.27 score(M)=0.10 score(E)=3.26 score(S)=0.24
Step 2-2: the tag sequence is obtained according to the greedy strategy: 你/S 马/B 上/E 过/B 来/E
Step 2-3: the tag sequence is converted into the segmentation result: 你 马上 过来
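Running the greedy decoder sketched earlier on these hard-coded scores reproduces steps 2-2 and 2-3:

```python
scores = [  # per-position B/M/E/S scores from step 2-1 above
    {"B": 1.01, "M": 0.32, "E": 0.13, "S": 2.34},  # 你
    {"B": 1.82, "M": 0.46, "E": 0.39, "S": 0.42},  # 马
    {"B": 0.25, "M": 0.23, "E": 2.26, "S": 0.47},  # 上
    {"B": 2.37, "M": 0.74, "E": 0.29, "S": 0.56},  # 过
    {"B": 0.27, "M": 0.10, "E": 3.26, "S": 0.24},  # 来
]
print(greedy_decode("你马上过来", lambda s, t: scores[t]))
# ['你', '马上', '过来']
```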
Embodiment 2
The algorithms used in the present invention are all implemented in C++. The machine used for the experiments of this embodiment: Intel(R) Core(TM) i7-4790K processor, 4.0 GHz clock frequency, 24 GB memory. The annotated data used in this embodiment is the Chinese Penn Treebank CTB 6.0, with 23,401 training sentences, 2,078 development sentences, and 2,795 test sentences. The automatically segmented data is a total of 41,071,242 sentences obtained from Chinese Gigaword (LDC2011T13). The model parameters are trained on the Gigaword data and the CTB 6.0 data, and the experimental results are shown in Table 1:
Table 1. Experimental results
Here Xu and Sun (2016) is a segmentation model based on a dependency-based recurrent neural network; Liu (2016) is a segmentation model using segment representations; Zhang (2016) is a transition-based neural network segmentation model; and Zhang (2016) comb is a transition-based segmentation model combining traditional features with neural network features. These models represent the current state of the art among neural-network-based segmentation models. Note that evaluating on this dataset is a usual way of evaluating Chinese word segmentation. It can be seen that the method of the present invention achieves a higher F1-score on this dataset, demonstrating the effectiveness of the method.
The calculation of the F1-score is explained here. Because the test set is annotated data, the correct annotation results are known. Suppose that, over the whole dataset, the set of all correct words is S(gold) with size count(gold); after each sentence in the dataset is segmented in the manner of embodiment 1, the predicted words in all analysis results form the set S(predict) with size count(predict); and the set of identically segmented parts shared by S(gold) and S(predict) is S(correct) with size count(correct). Let the prediction accuracy be denoted precision and the prediction recall rate recall; then each value is calculated as follows:

$$precision=\frac{count(correct)}{count(predict)},\qquad recall=\frac{count(correct)}{count(gold)},\qquad F1\text{-}score=\frac{2\times precision\times recall}{precision+recall}.$$
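A sketch of this F1 computation (Python; words are compared as character-offset spans so that identical strings at different positions are kept distinct, which matches counting "identically segmented parts"):

```python
def f1_score(gold_words, predicted_words):
    """precision = count(correct)/count(predict),
    recall = count(correct)/count(gold), F1 = 2PR/(P+R)."""
    def spans(words):
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))
            pos += len(w)
        return out
    gold, pred = spans(gold_words), spans(predicted_words)
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```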
Claims (8)

1. A Chinese word segmentation method using word-context-based character embeddings and neural networks, characterised in that it comprises the following steps:
Step 1: a computer reads large-scale automatically segmented data, and character embeddings and bigram embeddings are obtained using the word-context-based character embedding learning method;
Step 2: the sentence to be segmented is segmented using the neural-network-based method.
2. The method according to claim 1, characterised in that step 1 comprises the following steps:
Step 1-1: according to the four-tag scheme, a segmented sentence is represented as a character sequence {c1, c2, …, cn} and a tag sequence {l1, l2, …, ln}, where n is the length of the sentence, li ∈ {B, M, E, S}, 1 ≤ i ≤ n, cn denotes the n-th character of the sentence, ln denotes the tag of the n-th character, and li denotes the tag of the i-th character; the four-tag scheme has four tags B, M, E, S, where B marks the first character of a multi-character word, M a middle character of a multi-character word, E the last character of a multi-character word, and S a single-character word; Step 1-2: character embeddings and bigram embeddings are learned on the large-scale automatically segmented data using the word-context-based character embedding learning method.
3. The method according to claim 2, characterised in that in step 1-2 all sentences in the whole large-scale automatically segmented dataset are concatenated into one long sequence to form the dataset, represented as a character sequence {c1, c2, …, cT} and a corresponding tag sequence {l1, l2, …, lT}, where T is the number of characters in the dataset, cT denotes the T-th character, and lT denotes the tag of the T-th character.
4. The method according to claim 3, characterised in that step 1-2 comprises the following steps:
Step 1-2-1: the learning objective of the character embeddings is defined as:
$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}\mid c_t)+\log p(l_{t+j}\mid c_t)\Bigr],$$
where $\log p(c_{t+j}\mid c_t)$ and $\log p(l_{t+j}\mid c_t)$ are computed as follows:
$$\log p(c_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(c_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{uni}^{c}(c_i)^{T}e_{uni}(c_t)\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t)=\log\sigma\bigl(e_{uni}^{c}(l_{t+j})^{T}e_{uni}(c_t)\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\bigl[\log\sigma\bigl(-e_{uni}^{c}(l)^{T}e_{uni}(c_t)\bigr)\bigr],$$
where σ denotes the sigmoid function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; e_uni denotes the input character embedding matrix and e^c_uni the output-side character embedding matrix; e_uni(x) denotes the embedding of character x taken from the input embedding matrix and e^c_uni(x) the one taken from the output-side matrix; k denotes the number of negative samples, P_n(c) the sampling distribution, and a the size of the context window;
Step 1-2-2: the character embedding matrix e_uni is learned via stochastic gradient descent;
Step 1-2-3: the learning objective of the bigram embeddings is defined as:
$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-a\le j\le a\\ j\ne 0}}\Bigl[\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})+\log p(l_{t+j}\mid c_t c_{t+1})\Bigr],$$
where $\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})$ and $\log p(l_{t+j}\mid c_t c_{t+1})$ are computed as follows:
$$\log p(c_{t+j}c_{t+j+1}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(c_{t+j}c_{t+j+1})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{i=1}^{k}\mathbb{E}_{c_i c_{i+1}\sim P_n(c)}\bigl[\log\sigma\bigl(-e_{bi}^{c}(c_i c_{i+1})^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$

$$\log p(l_{t+j}\mid c_t c_{t+1})=\log\sigma\bigl(e_{bi}^{c}(l_{t+j})^{T}e_{bi}(c_t c_{t+1})\bigr)+\sum_{\substack{l\in\{B,M,E,S\}\\ l\ne l_t}}\bigl[\log\sigma\bigl(-e_{bi}^{c}(l)^{T}e_{bi}(c_t c_{t+1})\bigr)\bigr],$$
where e_bi denotes the input bigram embedding matrix and e^c_bi the output-side bigram embedding matrix; e_bi(x) denotes the embedding of bigram x taken from the input bigram embedding matrix and e^c_bi(x) the one taken from the output-side matrix; c_t c_{t+1} denotes the bigram obtained by joining the t-th and (t+1)-th characters;
Step 1-2-4: after the learning objective of the bigram embeddings has been defined, the bigram embedding matrix e_bi is learned via stochastic gradient descent.
5. The method according to claim 4, characterised in that in step 2 all characters of the sentence to be segmented are denoted from left to right as w1, w2, …, wn, where n is the length of the sentence and wn denotes the n-th character; step 2 comprises the following steps:
Step 2-1: when processing the t*-th character, all tag types are scored using the neural network, where 1 ≤ t* ≤ n;
Step 2-2: step 2-1 is performed iteratively for t* = 1, 2, …, n; following the greedy algorithm, the highest-scoring tag at each step is selected as the current tag, where n is the length of the sentence to be segmented;
Step 2-3: after the character-position tag sequence of the whole sentence is obtained, it is converted into the segmentation result, which is the final result of sentence analysis.
6. The method according to claim 5, characterised in that step 2-1 comprises the following steps:
Step 2-1-1: feature vectors are generated; the feature vectors include character features and bigram features;
Step 2-1-2: the neural network computes the scores of all candidate tags from the feature vectors generated in step 2-1-1.
7. The method according to claim 6, characterised in that step 2-1-1 comprises the following steps:
Step 2-1-1-1: from the character embeddings and bigram embeddings learned in step 1-2, the character feature vector e_uni(w_t*) and the bigram feature vector e_bi(w_t*-1 w_t*) are obtained, where the subscript t* denotes the current position;
Step 2-1-1-2: the character feature vector and the bigram feature vector are concatenated to obtain the feature representation g(w_t*) of the current position.
8. The method according to claim 7, characterised in that step 2-1-2 comprises the following steps:
Step 2-1-2-1: the intermediate representation of the current position is computed using the bidirectional long short-term memory neural network model, where the input of the network is g(w_t*) and the forward pass is computed by the forward LSTM unit;
the model parameter matrices are trained values whose elements are real numbers, and this parameter group is independent of t*; h_f(w_t*) and c_f(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th computing unit; three groups of matrices are the transition matrices applied, respectively, to the hidden-layer vector, the input vector, and the memory cell vector in the gate computations, and one group is the transition matrices applied to the hidden-layer vector and the input vector; tanh is the hyperbolic tangent function, a real-valued function applied element-wise to a vector, yielding a target vector of the same dimension as the input; σ is the sigmoid function, likewise a real-valued function applied element-wise; ⊙ is the pointwise product, i.e., element-wise multiplication of two vectors of identical dimension yielding a result vector of the same dimension;
the backward pass is computed symmetrically by the backward LSTM unit, from the end of the sentence to the beginning, with its own trained parameter matrices, likewise independent of t*; h_b(w_t*) and c_b(w_t*) are the hidden-layer vector and memory cell vector output by the t*-th backward computing unit, and its transition matrices play the same roles as in the forward pass;
h_f(w_t*) and h_b(w_t*) are concatenated to obtain the final intermediate representation f(w_t*);
Step 2-1-2-2: the scores of all tag types are computed using a feedforward network, as follows:
h = tanh(W1 f(w_t*) + b1),
o = W2 h + b2,
where W1, W2, b1, b2 are trained model parameters and h is the hidden-layer vector of the network; o is the computed output, a real-valued vector whose dimension equals the number of character-position tags; its i-th value is the score of tag i at time t*, a real number.
CN201710368867.6A 2017-05-23 2017-05-23 Chinese word segmentation method using word-context-based character embeddings and neural networks Active CN107168955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710368867.6A CN107168955B (en) 2017-05-23 2017-05-23 Chinese word segmentation method using word-context-based character embeddings and neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710368867.6A CN107168955B (en) 2017-05-23 2017-05-23 Chinese word segmentation method using word-context-based character embeddings and neural networks

Publications (2)

Publication Number Publication Date
CN107168955A true CN107168955A (en) 2017-09-15
CN107168955B CN107168955B (en) 2019-06-04

Family

ID=59820524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710368867.6A Active CN107168955B (en) 2017-05-23 2017-05-23 Chinese word segmentation method using word-context-based character embeddings and neural networks

Country Status (1)

Country Link
CN (1) CN107168955B (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657313A * 2017-09-26 2018-02-02 上海数眼科技发展有限公司 Transfer learning system and method for natural language processing tasks based on domain adaptation
CN107832301A * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Word segmentation processing method and device, mobile terminal and computer-readable storage medium
CN107832302A * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Word segmentation processing method and device, mobile terminal and computer-readable storage medium
CN107918605A * 2017-11-22 2018-04-17 北京百度网讯科技有限公司 Word segmentation processing method and device, mobile terminal and computer-readable storage medium
CN108062305A * 2017-12-29 2018-05-22 北京时空迅致科技有限公司 An iteration-based three-wavelength unsupervised Chinese word segmentation method
CN109325103A * 2018-10-19 2019-02-12 北京大学 Dynamic identifier representation method, apparatus and system for sequence learning
CN109543836A * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method and device, and related product
CN109741806A * 2019-01-07 2019-05-10 北京推想科技有限公司 Auxiliary generation method and device for medical imaging diagnostic reports
CN109960782A * 2018-12-27 2019-07-02 同济大学 Tibetan word segmentation method and device based on deep neural networks
CN110021373A * 2017-09-19 2019-07-16 上海交通大学 A legitimacy prediction method for chemical reactions
CN110263325A * 2019-05-17 2019-09-20 交通银行股份有限公司太平洋***中心 Automatic Chinese word segmenter
CN110491453A * 2018-04-27 2019-11-22 上海交通大学 A yield prediction method for chemical reactions
CN110728141A * 2018-07-16 2020-01-24 中移(苏州)软件技术有限公司 Word segmentation method and device, electronic equipment and storage medium
CN110750992A * 2019-10-09 2020-02-04 吉林大学 Named entity recognition method and device, electronic equipment and medium
CN111160003A * 2018-11-07 2020-05-15 北京猎户星空科技有限公司 Sentence-breaking method and device
CN111428499A * 2020-04-27 2020-07-17 南京大学 Compressed idiom representation method fusing synonym information for automatic question-answering systems
US10817665B1 * 2020-05-08 2020-10-27 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
CN112784547A * 2021-02-23 2021-05-11 南方电网调峰调频发电有限公司信息通信分公司 Word segmentation method, device, equipment and medium based on model training
CN113010676A * 2021-03-15 2021-06-22 北京语言大学 Text knowledge extraction method and device and natural language inference system
US11170249B2 2019-08-29 2021-11-09 Abbyy Production Llc Identification of fields in documents with neural networks using global document context
US11861925B2 2020-12-17 2024-01-02 Abbyy Development Inc. Methods and systems of field detection in a document

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2721189C1 (en) 2019-08-29 2020-05-18 Общество с ограниченной ответственностью "Аби Продакшн" Detecting sections of tables in documents by neural networks using global document context


Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021227A (en) * 2016-05-16 2016-10-12 南京大学 State transition and neural network-based Chinese chunk parsing method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Li Xuelian et al., "Chinese word segmentation method based on GRU neural networks", Journal of Xiamen University (Natural Science Edition) *
Lai Siwei et al., "Exploring Chinese word segmentation algorithms based on representation learning", Journal of Chinese Information Processing *
Hu Jie et al., "A bidirectional recurrent network model for Chinese word segmentation", Journal of Chinese Computer Systems *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110021373A (en) * 2017-09-19 2019-07-16 上海交通大学 A kind of legitimacy prediction technique of chemical reaction
CN107657313A (en) * 2017-09-26 2018-02-02 上海数眼科技发展有限公司 The transfer learning system and method for the natural language processing task adapted to based on field
CN107657313B (en) * 2017-09-26 2021-05-18 上海数眼科技发展有限公司 System and method for transfer learning of natural language processing task based on field adaptation
CN107832302A (en) * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN107918605A (en) * 2017-11-22 2018-04-17 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN107832302B (en) * 2017-11-22 2021-09-17 北京百度网讯科技有限公司 Word segmentation processing method and device, mobile terminal and computer readable storage medium
CN107918605B (en) * 2017-11-22 2021-08-20 北京百度网讯科技有限公司 Word segmentation processing method and device, mobile terminal and computer readable storage medium
CN107832301A (en) * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN108062305A (en) * 2017-12-29 2018-05-22 北京时空迅致科技有限公司 A kind of unsupervised Chinese word cutting method of three-wave-length based on iteration
CN110491453A (en) * 2018-04-27 2019-11-22 上海交通大学 A kind of yield prediction method of chemical reaction
CN110728141A (en) * 2018-07-16 2020-01-24 中移(苏州)软件技术有限公司 Word segmentation method and device, electronic equipment and storage medium
CN110728141B (en) * 2018-07-16 2023-09-19 中移(苏州)软件技术有限公司 Word segmentation method and device, electronic equipment and storage medium
CN109325103A (en) * 2018-10-19 2019-02-12 北京大学 A kind of dynamic identifier representation method, the apparatus and system of Sequence Learning
CN109325103B (en) * 2018-10-19 2020-12-04 北京大学 Dynamic identifier representation method, device and system for sequence learning
CN111160003B (en) * 2018-11-07 2023-12-08 北京猎户星空科技有限公司 Sentence breaking method and sentence breaking device
CN111160003A (en) * 2018-11-07 2020-05-15 北京猎户星空科技有限公司 Sentence-breaking method and device
CN109543836A (en) * 2018-11-30 2019-03-29 上海寒武纪信息科技有限公司 Operation method, device and Related product
CN109960782A (en) * 2018-12-27 2019-07-02 同济大学 A kind of Tibetan language segmenting method and device based on deep neural network
CN109741806A (en) * 2019-01-07 2019-05-10 北京推想科技有限公司 A kind of Medical imaging diagnostic reports auxiliary generating method and its device
CN110263325A (en) * 2019-05-17 2019-09-20 交通银行股份有限公司太平洋***中心 Chinese automatic word-cut
CN110263325B (en) * 2019-05-17 2023-05-12 交通银行股份有限公司太平洋***中心 Chinese word segmentation system
US11170249B2 (en) 2019-08-29 2021-11-09 Abbyy Production Llc Identification of fields in documents with neural networks using global document context
CN110750992A (en) * 2019-10-09 2020-02-04 吉林大学 Named entity recognition method, device, electronic equipment and medium
CN111428499B (en) * 2020-04-27 2021-10-26 南京大学 Idiom compression representation method fusing synonym information for automatic question-answering systems
CN111428499A * 2020-04-27 2020-07-17 南京大学 Idiom compression representation method fusing synonym information for automatic question-answering systems
US11113468B1 (en) * 2020-05-08 2021-09-07 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
TWI771841B (en) * 2020-05-08 2022-07-21 南韓商韓領有限公司 Systems and methods for word segmentation based on a competing neural character language model
US10817665B1 (en) * 2020-05-08 2020-10-27 Coupang Corp. Systems and methods for word segmentation based on a competing neural character language model
US11861925B2 (en) 2020-12-17 2024-01-02 Abbyy Development Inc. Methods and systems of field detection in a document
CN112784547A (en) * 2021-02-23 2021-05-11 南方电网调峰调频发电有限公司信息通信分公司 Word segmentation method, device, equipment and medium based on model training
CN113010676A (en) * 2021-03-15 2021-06-22 北京语言大学 Text knowledge extraction method and device and natural language inference system
CN113010676B (en) * 2021-03-15 2023-12-08 北京语言大学 Text knowledge extraction method, device and natural language inference system

Also Published As

Publication number Publication date
CN107168955B (en) 2019-06-04

Similar Documents

Publication Publication Date Title
CN107168955A (en) Chinese word segmentation method using word-context-based character embeddings and a neural network
CN112214610B (en) Entity relationship joint extraction method based on span and knowledge enhancement
CN110807328B (en) Named entity identification method and system for legal document multi-strategy fusion
CN107168945B (en) Fine-grained opinion mining method using bidirectional recurrent neural networks integrating multiple features
CN109284400B (en) Named entity identification method based on Lattice LSTM and language model
CN109359291A (en) Named entity recognition method
CN110245229A (en) Deep learning topic sentiment classification method based on data augmentation
CN109543181B (en) Named entity model and system based on combination of active learning and deep learning
CN108829801A (en) Event trigger word extraction method based on a document-level attention mechanism
CN108628823A (en) Named entity recognition method combining attention mechanism and multi-task joint training
El Abed et al. ICDAR 2009 online Arabic handwriting recognition competition
CN108460089A (en) Chinese text classification with multi-feature fusion based on attention neural networks
CN109710919A (en) Neural network event extraction method incorporating an attention mechanism
CN107729309A (en) Method and device for Chinese semantic analysis based on deep learning
CN106980608A (en) Chinese electronic health record word segmentation and named entity recognition method and system
CN106980609A (en) Named entity recognition method using conditional random fields based on word vector representations
CN111966917A (en) Event detection and summarization method based on a pre-trained language model
CN110287323B (en) Target-oriented emotion classification method
CN106557462A (en) Named entity recognition method and system
CN105868184A (en) Chinese name recognition method based on recurrent neural network
CN105975555A (en) Enterprise abbreviation extraction method based on bidirectional recurrent neural network
CN110532563A (en) Method and device for detecting key paragraphs in text
CN108874896B (en) Humor identification method based on neural network and humor characteristics
CN103020167B (en) Computer-based Chinese text classification method
CN109657039B (en) Work history information extraction method based on a two-layer BiLSTM-CRF

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant