CN107239444A - Term vector training method and system fusing part of speech and position information - Google Patents


Info

Publication number
CN107239444A
CN107239444A (application CN201710384135.6A)
Authority
CN
China
Prior art keywords
speech
word
matrix
term vector
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710384135.6A
Other languages
Chinese (zh)
Other versions
CN107239444B (en)
Inventor
文坤梅
李瑞轩
刘其磊
李玉华
辜希武
昝杰
杨琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201710384135.6A
Publication of CN107239444A
Application granted
Publication of CN107239444B
Legal status: Active
Anticipated expiration: not listed

Classifications

    • G Physics
    • G06 Computing; Calculating or Counting
    • G06F Electric Digital Data Processing
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G Physics
    • G06 Computing; Calculating or Counting
    • G06F Electric Digital Data Processing
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a term vector training method and system that fuse part-of-speech (POS) and position information. The method comprises: preprocessing the data to obtain a target text; performing word segmentation and POS tagging on the target text; modeling the POS information and the position information; and fusing the POS and position information into a skip-gram model with a negative-sampling strategy to learn the target term vectors, which are then evaluated on word-analogy and word-similarity tasks. The invention takes the POS and position information of words into account; on the basis of modeling them, it makes full use of the POS information of words and the position relations between parts of speech to assist the training of the term vectors, and the parameter updates during training are also more reasonable.

Description

Term vector training method and system fusing part of speech and position information
Technical field
The invention belongs to the field of natural language processing, and more particularly relates to a term vector training method and system that fuse part-of-speech and position information.
Background art
In recent years, the rapid development of mobile Internet technology has caused the scale of data on the Internet to grow rapidly and its complexity to increase sharply. Processing and analyzing this massive amount of unstructured, unlabeled data has therefore become a major challenge.
Traditional machine learning methods use feature engineering to build symbolic representations of the data, commonly bag-of-words representations such as one-hot vectors, for modeling and solving. However, as the complexity of the data grows, the feature dimensionality in feature engineering also increases sharply, leading to the curse of dimensionality, and one-hot representations additionally suffer from the semantic gap. With the proposal of the distributional hypothesis ("if two words have similar contexts, then their meanings are also similar"), distributed word representations based on it have been proposed continually, chiefly matrix-based, cluster-based, and term-vector-based distributed representations. Matrix-based and cluster-based representations can express simple contextual information when the feature dimensionality is small, but when the dimensionality is high they are powerless to express context, especially complex context. Term-vector-based representations, by contrast, avoid the curse of dimensionality both when representing individual words and when representing a word's context as a linear combination; and because the distance between words can be measured by the cosine or Euclidean distance between their term vectors, they also largely eliminate the semantic gap of traditional bag-of-words models.
However, most existing term vector research focuses on reducing model complexity by simplifying the structure of the neural network. Some work has fused information such as sentiment or topic, but work fusing part-of-speech information is rare; the POS granularity in that small body of work is coarse, the POS information is exploited very insufficiently, and its update rule is not very reasonable.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the object of the present invention is to provide a term vector training method and system fusing part of speech and position information, thereby solving the technical problems of the prior art that work fusing part-of-speech (POS) information uses a coarse POS granularity, exploits the POS information very insufficiently, and updates it in a not very reasonable way.
To achieve the above object, according to one aspect of the present invention, there is provided a term vector training method fusing part of speech and position information, comprising the following steps:
S1, preprocessing the original text to obtain a target text;
S2, performing POS tagging on the words in the target text according to their contextual information, using the parts of speech in a POS tag set;
S3, modeling the tagged POS information to build a POS association weight matrix M, and modeling, for each POS pair, the relative position i of the corresponding word pair to build position-specific POS association weight matrices M'_i, wherein the row and column dimensions of M equal the number of POS categories in the tag set, the element of M indexed by a row and a column is the co-occurrence probability of the row's POS and the column's POS, the dimensions of M'_i equal those of M, and the element of M'_i indexed by a row and a column is the co-occurrence probability of the row's POS and the column's POS at relative position i;
S4, fusing the modeled matrices M and M'_i into a skip-gram term vector model to build a target model, and performing term vector learning with the target model to obtain target term vectors, wherein the target term vectors are used for word-analogy and word-similarity tasks.
Preferably, step S2 specifically comprises the following sub-steps:
S2.1, segmenting the target text into words, so as to distinguish all the words in the target text;
S2.2, for each sentence in the target text, performing POS tagging on each word according to its contextual information in the sentence, using the parts of speech in the POS tag set.
Preferably, step S3 specifically comprises the following sub-steps:
S3.1, for each word in the target text, generating the word-POS pair formed by the word and its tagged part of speech, and building the POS association weight matrix M from these pairs, wherein the row and column dimensions of M equal the number of POS categories in the tag set and the element of M indexed by a row and a column is the co-occurrence probability of the row's POS and the column's POS;
S3.2, modeling, for each POS pair, the relative position i of the corresponding word pair, and building the position-specific POS association weight matrices M'_i, wherein the dimensions of M'_i equal those of M and the element of M'_i indexed by a row and a column is the co-occurrence probability of the row's POS and the column's POS at relative position i.
Preferably, step S4 specifically comprises the following sub-steps:
S4.1, building the initial objective function L = Σ_{w∈C} Σ_{u∈Context(w)} log p(u|w), wherein C denotes the vocabulary of the whole training corpus, Context(w) denotes the set of context words formed by the c words on each side of the target word w, and c denotes the window size;
S4.2, fusing the modeled matrices M and M'_i into the skip-gram term vector model based on negative sampling to build the target model, and building the new objective function of the target model from the initial objective function: L = Σ_{w∈C} Σ_{w̃∈Context(w)} Σ_{u∈{w}∪NEG(w)} M'_i(T_u, T_w̃)·{L^w(u)·log σ(v(w̃)^T θ^u) + [1 − L^w(u)]·log σ(−v(w̃)^T θ^u)}, wherein NEG(w) is the set of negative samples drawn for the target word w; L^w(u) is the label of sample u, 1 for a positive sample and 0 for a negative sample; θ^u is the auxiliary vector used for the sample word u during model training; v(w̃)^T is the transpose of the term vector v(w̃) of the context word w̃; and M'_i(T_u, T_w̃) is the co-occurrence probability of the two parts of speech T_u and T_w̃ when their relative position is i;
S4.3, optimizing the new objective function so that its value is maximized, performing gradient computation on and updating the parameters θ^u, v(w̃) and M'_i(T_u, T_w̃), and obtaining the target term vectors once the traversal of the whole training corpus is complete.
According to another aspect of the present invention, there is provided a term vector training system fusing part of speech and position information, comprising:
a preprocessing module, for preprocessing the original text to obtain a target text;
a POS tagging module, for performing POS tagging on the words in the target text according to their contextual information, using the parts of speech in the POS tag set;
a position-POS fusion module, for modeling the tagged POS information to build a POS association weight matrix M, and modeling, for each POS pair, the relative position i of the corresponding word pair to build position-specific POS association weight matrices M'_i, wherein the row and column dimensions of M equal the number of POS categories in the tag set, the element of M indexed by a row and a column is the co-occurrence probability of the row's POS and the column's POS, the dimensions of M'_i equal those of M, and the element of M'_i indexed by a row and a column is the co-occurrence probability of the row's POS and the column's POS at relative position i;
a term vector learning module, for fusing the modeled matrices M and M'_i into a skip-gram term vector model to build a target model, and performing term vector learning with the target model to obtain target term vectors, wherein the target term vectors are used for word-analogy and word-similarity tasks.
Preferably, the POS tagging module comprises:
a word segmentation sub-module, for segmenting the target text into words, so as to distinguish all the words in the target text;
a POS tagging sub-module, for performing, for each sentence in the target text, POS tagging on each word according to its contextual information in the sentence, using the parts of speech in the POS tag set.
Preferably, the position-POS fusion module comprises:
a POS information modeling module, for generating, for each word in the target text, the word-POS pair formed by the word and its tagged part of speech, and building the POS association weight matrix M from these pairs, wherein the row and column dimensions of M equal the number of POS categories in the tag set and the element of M indexed by a row and a column is the co-occurrence probability of the row's POS and the column's POS;
a position information modeling module, for modeling, for each POS pair, the relative position i of the corresponding word pair, and building the position-specific POS association weight matrices M'_i, wherein the dimensions of M'_i equal those of M and the element of M'_i indexed by a row and a column is the co-occurrence probability of the row's POS and the column's POS at relative position i.
Preferably, the term vector learning module comprises:
an initial objective function building module, for building the initial objective function L = Σ_{w∈C} Σ_{u∈Context(w)} log p(u|w), wherein C denotes the vocabulary of the whole training corpus, Context(w) denotes the set of context words formed by the c words on each side of the target word w, and c denotes the window size;
a new objective function building module, for fusing the modeled matrices M and M'_i into the skip-gram term vector model based on negative sampling to build the target model, and building the new objective function of the target model from the initial objective function: L = Σ_{w∈C} Σ_{w̃∈Context(w)} Σ_{u∈{w}∪NEG(w)} M'_i(T_u, T_w̃)·{L^w(u)·log σ(v(w̃)^T θ^u) + [1 − L^w(u)]·log σ(−v(w̃)^T θ^u)}, wherein NEG(w) is the set of negative samples drawn for the target word w; L^w(u) is the label of sample u, 1 for a positive sample and 0 for a negative sample; θ^u is the auxiliary vector used for the sample word u during model training; v(w̃)^T is the transpose of the term vector v(w̃) of the context word w̃; and M'_i(T_u, T_w̃) is the co-occurrence probability of the two parts of speech T_u and T_w̃ when their relative position is i;
a term vector learning sub-module, for optimizing the new objective function so that its value is maximized, performing gradient computation on and updating the parameters θ^u, v(w̃) and M'_i(T_u, T_w̃), and obtaining the target term vectors once the traversal of the whole training corpus is complete.
In general, compared with the prior art, the method of the present invention can achieve the following beneficial effects:
(1) By building association matrices based on the POS association relations and the position association relations, the POS and position information between words can be modeled well.
(2) By fusing the modeled association matrices based on POS and position information into the skip-gram term vector learning model based on negative sampling, better term vector results can be obtained on the one hand, and on the other hand the association weights between parts of speech in the corpus used for model training can also be obtained.
(3) Because the model adopts the negative-sampling optimization strategy, its training speed is also quite fast.
Brief description of the drawings
Fig. 1 is a schematic flow chart of a term vector training method fusing part of speech and position information disclosed in an embodiment of the present invention;
Fig. 2 is a modeling diagram of part-of-speech and position information disclosed in an embodiment of the present invention;
Fig. 3 is a simplified overall flow chart disclosed in an embodiment of the present invention;
Fig. 4 is a schematic flow chart of another term vector training method fusing part of speech and position information disclosed in an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only serve to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the invention described below can be combined with one another as long as they do not conflict.
Because existing term vector learning methods ignore parts of speech and their importance in natural language, the present invention provides a term vector learning method fusing part of speech and position information. On the basis of the original skip-gram model, the method considers the POS association relations and position relations between words, so that the model can train term vectors that fuse more information, and the learned term vectors better complete word-analogy and word-similarity tasks.
Fig. 1 is a schematic flow chart of a term vector learning method fusing part of speech and position information disclosed in an embodiment of the present invention; the method shown in Fig. 1 comprises the following steps:
S1, preprocessing the original text to obtain a target text;
The acquired original text contains a large amount of useless information such as XML tags, web page links, image links and symbols like "[", "@", "&" and "#". This useless information is not only unhelpful for term vector training but may even become noise data that disturbs the learning of the term vectors, so it must be filtered out; a Perl script, for example, can be used to do so.
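The filtering described above can be sketched in a few lines. The patent mentions Perl scripts, so the following Python equivalent is purely illustrative; the tag, link and symbol patterns are assumptions, not the inventors' actual rules:

```python
import re

def clean_text(raw: str) -> str:
    """Remove XML/HTML tags, web and image links, and noise symbols."""
    text = re.sub(r"<[^>]+>", " ", raw)        # XML/HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # web page / image links
    text = re.sub(r"[\[\]@&#]", " ", text)     # symbols like [, ], @, &, #
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(clean_text("<doc>Apple announced http://apple.com today # news</doc>"))
# -> Apple announced today news
```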
S2, performing POS tagging on the words in the target text according to their contextual information, using the parts of speech in the POS tag set;
Because the method proposed in the present invention uses the POS information of words, a POS tagging tool must be used to tag the text. A word can have multiple parts of speech depending on the context in which it occurs; to resolve this, the text is tagged in advance, and the tag of each word is determined from its contextual information. Step S2 specifically comprises the following sub-steps:
S2.1, segmenting the target text into words, so as to distinguish all the words in the target text;
For example, the text can be segmented with the tokenize tool in openNLP. In "I buy an apple.", without segmentation the common word "apple" would become the non-existent token "apple.", which would disturb the learning of the term vectors.
S2.2, for each sentence in the target text, performing POS tagging on each word according to its contextual information in the sentence, using the parts of speech in the POS tag set.
Here a whole sentence is tagged at once, so the multiple parts of speech that the same word can take in different contexts can be distinguished. The parts of speech assigned to the words belong to the Penn Treebank POS tag set.
Such as " i love you. " and " two after she give her son too much love. " progress word marks Sentence just turns into:
I_PRP (pronoun) love_VBP (verb) you_PRP (pronoun) ._.;
She_PRP (pronoun) give_VB (verb) her_PRP $ (pronoun) son_NN (noun) too_RB (adverbial word) much_ JJ (adjective) love_NN (noun) ._..
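The later steps consume such annotations as word-tag pairs. A minimal parser for the word_TAG format shown above (the function name is ours, not from the patent):

```python
def parse_tagged(sentence: str):
    """Split a 'word_TAG' annotated sentence (Penn Treebank tags, as in the
    example above) into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        word, _, tag = token.rpartition("_")   # split on the LAST underscore
        pairs.append((word, tag))
    return pairs

print(parse_tagged("she_PRP give_VB her_PRP$ son_NN ._."))
# -> [('she', 'PRP'), ('give', 'VB'), ('her', 'PRP$'), ('son', 'NN'), ('.', '.')]
```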
S3, modeling the tagged POS information to build a POS association weight matrix M, and modeling, for each POS pair, the relative position i of the corresponding word pair to build position-specific POS association weight matrices M'_i, wherein the row and column dimensions of M equal the number of POS categories in the tag set, the element of M indexed by a row and a column is the co-occurrence probability of the row's POS and the column's POS, the dimensions of M'_i equal those of M, and the element of M'_i indexed by a row and a column is the co-occurrence probability of the row's POS and the column's POS at relative position i. Fig. 2 is a modeling diagram of part-of-speech and position information disclosed in an embodiment of the present invention, in which T_0 to T_N in the rows and columns denote parts of speech, and M'_i(T_t, T_{t-2}) denotes the co-occurrence probability of POS T_t and POS T_{t-2} at relative position i.
After the parts of speech of the words are obtained, the POS information must first be modeled so that it can participate in the term vector learning model and the new model can be solved. The goal of the modeling is to build a POS association matrix whose row and column dimensions both equal the number of POS categories in the tag set and whose elements are the probabilities that two parts of speech co-occur. Besides this, the position relation must also be modeled, because the relative position of two co-occurring parts of speech is also very important. Step S3 specifically comprises the following sub-steps:
S3.1, for each word in the target text, generating the word-POS pair formed by the word and its tagged part of speech, and building the POS association weight matrix M from these pairs, wherein the row and column dimensions of M equal the number of POS categories in the tag set and the element of M indexed by a row and a column is the co-occurrence probability of the row's POS and the column's POS;
For example, for the word "son" in "she give her son too much love.", its POS is NN and the POS of the word "her" is PRP; the element of the matrix at the row corresponding to PRP and the column corresponding to NN is then the co-occurrence probability (i.e. the weight) of the two parts of speech.
S3.2, modeling, for each POS pair, the relative position i of the corresponding word pair, and building the position-specific POS association weight matrices M'_i, wherein the dimensions of M'_i equal those of M and the element of M'_i indexed by a row and a column is the co-occurrence probability (weight) of the row's POS and the column's POS at relative position i.
For example, if the window size is 2c, then i ∈ [−c, c]. When the window size is 6, the six matrices M'_{-3}, M'_{-2}, M'_{-1}, M'_{1}, M'_{2}, M'_{3} are built.
For example, for "son" and "her" in "she give her son too much love.", when "son" is the target word, the POS-and-position association weight for the two words is M'_{-1}(PRP, NN).
S4, fusing the modeled matrices M and M'_i into a skip-gram term vector model to build a target model, and performing term vector learning with the target model to obtain target term vectors, wherein the target term vectors are used for word-analogy and word-similarity tasks.
Step S4 specifically comprises the following sub-steps:
S4.1, building the initial objective function L = Σ_{w∈C} Σ_{u∈Context(w)} log p(u|w), wherein C denotes the vocabulary of the whole training corpus, Context(w) denotes the set of context words formed by the c words on each side of the target word w, and c denotes the window size;
As in the skip-gram model, the target word w_t is used to predict each word w_{t+i} in its context, where i denotes the relative position between w_{t+i} and w_t. Take a sample (Context(w_t), w_t) as an example, where |Context(w_t)| = 2c and Context(w_t) consists of the c words on each side of w_t. The final optimization target of the target model is still, over the whole training corpus, to maximize the probability of predicting the context words from the target word, namely to maximize the initial objective function.
For example, in the sample "she give her son too much love.", with "son" as the target word w_t and c = 3, Context(w_t) = {she, give, her, too, much, love}.
S4.2, fusing the modeled matrices M and M'_i into the skip-gram term vector model based on negative sampling to build the target model, and building the new objective function of the target model from the initial objective function: L = Σ_{w∈C} Σ_{w̃∈Context(w)} Σ_{u∈{w}∪NEG(w)} M'_i(T_u, T_w̃)·{L^w(u)·log σ(v(w̃)^T θ^u) + [1 − L^w(u)]·log σ(−v(w̃)^T θ^u)}, wherein NEG(w) is the set of negative samples drawn for the target word w; L^w(u) is the label of sample u, 1 for a positive sample and 0 for a negative sample; θ^u is the auxiliary vector used for the sample word u during model training; v(w̃)^T is the transpose of the term vector v(w̃) of the context word w̃; and M'_i(T_u, T_w̃) is the co-occurrence probability of the two parts of speech T_u and T_w̃ when their relative position is i;
For example, in the sample "she give her son too much love.", the word "son" is a positive sample with label 1; any other word, such as "dog" or "flower", is a negative sample with label 0.
Fig. 3 is a simplified overall flow chart disclosed in an embodiment of the present invention. The built target model has three layers: an input layer, a projection layer and an output layer. Wherein:
the input of the input layer is the center word w(t), and the output is the term vector corresponding to the center word w(t);
the projection layer mainly projects the output of the input layer; in this model, both the input and the output of the projection layer are the term vector of the center word w(t);
the output layer mainly uses the center word w(t) to predict the term vectors of the context words such as w(t-2), w(t-1), w(t+1) and w(t+2).
The main intent of the present invention is to consider the POS and position relations between the center word and its context words while using the center word w(t) to predict its context words.
S4.3, optimizing the new objective function so that its value is maximized, performing gradient computation on and updating the parameters θ^u, v(w̃) and M'_i(T_u, T_w̃), and obtaining the target term vectors once the traversal of the whole training corpus is complete.
For example, stochastic gradient ascent (SGA) can be used to optimize the new objective function so that its value is maximized, with gradient computation and updates applied to the parameters θ^u, v(w̃) and M'_i(T_u, T_w̃); the target term vectors are then obtained once the whole training corpus has been traversed.
Optionally, the updates and gradient computation can be performed as follows to obtain the target term vectors:
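Under the standard stochastic-gradient-ascent derivation for negative sampling, with the position-POS weight M'_i treated as a per-sample coefficient and η as the learning rate, the updates would take roughly the following form (a reconstruction under those assumptions, not the patent's verbatim equations):

```latex
% For each pair (target w, context word \tilde{w}) and sample u \in \{w\} \cup NEG(w):
e = L^{w}(u) - \sigma\!\big(v(\tilde{w})^{\top}\theta^{u}\big) \quad \text{(prediction error)}
\theta^{u} \leftarrow \theta^{u} + \eta \, e \, M'_{i}(T_{u}, T_{\tilde{w}}) \, v(\tilde{w})
v(\tilde{w}) \leftarrow v(\tilde{w}) + \eta \sum_{u \in \{w\} \cup NEG(w)} e \, M'_{i}(T_{u}, T_{\tilde{w}}) \, \theta^{u}
M'_{i}(T_{u}, T_{\tilde{w}}) \leftarrow M'_{i}(T_{u}, T_{\tilde{w}})
  + \eta \Big[ L^{w}(u) \log \sigma\!\big(v(\tilde{w})^{\top}\theta^{u}\big)
  + \big(1 - L^{w}(u)\big) \log \sigma\!\big(-v(\tilde{w})^{\top}\theta^{u}\big) \Big]
```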
Fig. 4 is a schematic flow chart of another term vector training method fusing part of speech and position information provided in an embodiment of the present invention. The method shown in Fig. 4 comprises five steps: data preprocessing, segmentation and POS tagging, POS and position information modeling, term vector training, and task evaluation. The data preprocessing, segmentation and POS tagging, POS and position information modeling, and term vector training steps are as described in Embodiment 1. For task evaluation, once the target term vectors fusing POS and position information have been learned, they can be used in tasks such as word analogy and word similarity, mainly comprising the following two steps:
performing the word-analogy task with the learned target term vectors. For example, for the two word pairs <king, queen> and <man, woman>, computing with the corresponding term vectors reveals a relation of the form v(king) − v(queen) = v(man) − v(woman);
performing the word-similarity task with the learned target term vectors. For example, given a word such as "dog", computing the cosine or Euclidean distance between the other words and "dog" yields the top N words most closely related to "dog", such as "puppy" and "cat".
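Both evaluations reduce to vector arithmetic plus cosine similarity. A self-contained sketch with toy vectors (the vectors below are illustrative, not trained output of the model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def analogy(vecs, a, b, c):
    """Find the word whose vector is closest to v(a) - v(b) + v(c),
    excluding the three query words."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    cands = {w: cosine(target, v) for w, v in vecs.items() if w not in (a, b, c)}
    return max(cands, key=cands.get)

vecs = {
    "king":  (0.9, 0.8, 0.1),
    "queen": (0.9, 0.1, 0.8),
    "man":   (0.1, 0.9, 0.1),
    "woman": (0.1, 0.2, 0.8),
}
print(analogy(vecs, "king", "queen", "woman"))  # v(king)-v(queen)+v(woman) ~ v(man)
# -> man
```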
It will be readily understood by those skilled in the art that the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit it; any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention shall all be included within its scope of protection.

Claims (8)

1. A word vector training method fusing part-of-speech and position information, characterized by comprising the following steps:
S1, preprocessing an original text to obtain a target text;
S2, performing part-of-speech tagging on the words in the target text according to the contextual information of each word, using the parts of speech in a part-of-speech tag set;
S3, modeling the tagged part-of-speech information to build a part-of-speech association weight matrix M, and modeling the relative position i of the word pairs corresponding to each part-of-speech pair to build a position-specific part-of-speech association weight matrix M'_i for each position, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the tag set, each element of M is the co-occurrence probability of the part of speech corresponding to that element's row and the part of speech corresponding to that element's column, the row and column dimensions of the matrix M'_i are identical to those of M, and each element of M'_i is the co-occurrence probability of the part of speech corresponding to that element's row and the part of speech corresponding to that element's column at relative position i;
S4, fusing the modeled matrices M and M'_i into a skip-gram word vector model to build a target model, and performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
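One plausible way to estimate a part-of-speech association weight matrix M of the kind required by step S3 is to count how often two parts of speech co-occur within a context window and normalise the counts into probabilities. The tag set, window size, and normalisation below are illustrative assumptions, not the patent's exact procedure.

```python
def build_pos_matrix(tagged_sentences, tagset, window=2):
    """Build a part-of-speech association weight matrix M whose (r, c) entry
    estimates the co-occurrence probability of POS r and POS c within `window`
    words of each other. `tagged_sentences` is a list of sentences, each a
    list of (word, pos) pairs."""
    index = {pos: k for k, pos in enumerate(tagset)}
    counts = [[0] * len(tagset) for _ in tagset]
    total = 0
    for sent in tagged_sentences:
        for j, (_, pos_w) in enumerate(sent):
            for k in range(max(0, j - window), min(len(sent), j + window + 1)):
                if k == j:
                    continue
                _, pos_c = sent[k]
                counts[index[pos_w]][index[pos_c]] += 1
                total += 1
    # Normalise raw counts into co-occurrence probabilities.
    return [[c / total if total else 0.0 for c in row] for row in counts]

# Toy tagged corpus: one sentence, Penn-style tags assumed for illustration.
tagged = [[("the", "DT"), ("dog", "NN"), ("barks", "VB")]]
M = build_pos_matrix(tagged, ["DT", "NN", "VB"], window=1)
# M[DT][NN] == 0.25: (DT, NN) is one of four ordered POS pairs in the window.
```

The matrix is square with one row and one column per tag in the tag set, matching the dimension constraint stated in the claim.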
2. The method according to claim 1, wherein step S2 comprises the following sub-steps:
S2.1, segmenting the target text into words, so as to distinguish all the words in the target text;
S2.2, for each sentence in the target text, performing part-of-speech tagging on each word using the parts of speech in the tag set, according to the contextual information of the word within the sentence.
3. The method according to claim 1 or 2, wherein step S3 comprises the following sub-steps:
S3.1, for each word in the target text, generating a word-part-of-speech pair consisting of the word and its corresponding part of speech, and building the part-of-speech association weight matrix M from the word-part-of-speech pairs, wherein the row and column dimensions of M equal the number of part-of-speech categories in the tag set, and each element of M is the co-occurrence probability of the part of speech corresponding to that element's row and the part of speech corresponding to that element's column;
S3.2, modeling the relative position i of the word pairs corresponding to each part-of-speech pair, and building the position-specific part-of-speech association weight matrix M'_i, wherein the row and column dimensions of M'_i are identical to those of M, and each element of M'_i is the co-occurrence probability of the part of speech corresponding to that element's row and the part of speech corresponding to that element's column at relative position i.
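Sub-step S3.2 can be pictured as building one part-of-speech co-occurrence matrix per signed relative position i within the window. The sketch below is a hypothetical implementation under that reading; the tag set, window size, and per-position normalisation are illustrative assumptions.

```python
def build_position_pos_matrices(tagged_sentences, tagset, window=2):
    """Build one matrix M'_i per relative position i in [-window, window], i != 0.
    M'_i[r][c] estimates the probability that POS c occurs at offset i from POS r."""
    index = {pos: k for k, pos in enumerate(tagset)}
    n = len(tagset)
    mats = {i: [[0] * n for _ in range(n)]
            for i in range(-window, window + 1) if i != 0}
    totals = {i: 0 for i in mats}
    for sent in tagged_sentences:
        for j, (_, pos_w) in enumerate(sent):
            for i in mats:
                k = j + i
                if 0 <= k < len(sent):
                    _, pos_c = sent[k]
                    mats[i][index[pos_w]][index[pos_c]] += 1
                    totals[i] += 1
    # Normalise each position's counts into co-occurrence probabilities.
    return {i: [[c / totals[i] if totals[i] else 0.0 for c in row] for row in m]
            for i, m in mats.items()}

tagged = [[("the", "DT"), ("dog", "NN"), ("barks", "VB")]]
mats = build_position_pos_matrices(tagged, ["DT", "NN", "VB"], window=1)
```

Each M'_i has the same dimensions as M, one row and column per tag, as the claim requires.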
4. The method according to claim 3, wherein step S4 comprises the following sub-steps:
S4.1, building the initial objective function
L = Σ_{w∈C} Σ_{u∈Context(w)} log p(u|w),
where C denotes the vocabulary of the entire training corpus, Context(w) denotes the set of context words consisting of the c words before and after the target word w, and c denotes the window size;
S4.2, fusing the modeled matrices M and M'_i into a negative-sampling-based skip-gram word vector model to build the target model, and building the new objective function of the target model from the initial objective function:
L = Σ_{w∈C} Σ_{w̃∈Context(w)} Σ_{u∈{w}∪NEG(w)} M'_i(T_u, T_w̃) · { L^w(u) · log σ(v(w̃)^T θ^u) + [1 − L^w(u)] · log(1 − σ(v(w̃)^T θ^u)) },
where NEG(w) is the set of negative samples drawn for the target word w, L^w(u) is the label of sample u (1 for a positive sample, 0 for a negative sample), θ^u is the auxiliary vector of the sample word used during model training, v(w̃)^T is the transpose of the word vector v(w̃) of the context word w̃, and M'_i(T_u, T_w̃) is the co-occurrence probability of the two parts of speech T_u and T_w̃ when their relative position is i;
S4.3, optimizing the new objective function so as to maximize its value, performing gradient computation and updates on the parameters θ^u and v(w̃), and obtaining the target word vectors after the entire training corpus has been traversed.
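Sub-steps S4.2 and S4.3 follow the usual negative-sampling update scheme for skip-gram models. The sketch below shows one such gradient step for a single context word, with a scalar `weight` standing in for the part-of-speech/position factor M'_i(T_u, T_w̃). The variable names, learning rate, and placement of the weight are illustrative assumptions, not the patent's exact formulation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgns_update(v_ctx, samples, lr=0.025, weight=1.0):
    """One negative-sampling update for a context word vector v_ctx (in place).
    samples: list of (theta_u, label) pairs, where label is 1 for the positive
    target word and 0 for each negative sample; theta_u is that sample word's
    auxiliary vector. `weight` stands in for the M'_i part-of-speech factor."""
    e = [0.0] * len(v_ctx)
    for theta, label in samples:
        q = sigmoid(sum(a * b for a, b in zip(v_ctx, theta)))
        g = lr * weight * (label - q)        # scaled gradient of the log-loss
        for d in range(len(v_ctx)):
            e[d] += g * theta[d]             # accumulate the update for v_ctx
            theta[d] += g * v_ctx[d]         # update auxiliary vector theta_u
    for d in range(len(v_ctx)):              # apply accumulated update to v_ctx
        v_ctx[d] += e[d]
    return v_ctx
```

A positive sample pushes the dot product v(w̃)·θ^u upward (toward σ ≈ 1) and a negative sample pushes it downward, which is exactly the maximisation of the bracketed term in the objective; looping this update over every (target, context) pair in the corpus corresponds to sub-step S4.3.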
5. A word vector training system fusing part-of-speech and position information, characterized by comprising:
a preprocessing module, for preprocessing an original text to obtain a target text;
a part-of-speech tagging module, for performing part-of-speech tagging on the words in the target text according to the contextual information of each word, using the parts of speech in a part-of-speech tag set;
a position part-of-speech fusion module, for modeling the tagged part-of-speech information to build a part-of-speech association weight matrix M, and for modeling the relative position i of the word pairs corresponding to each part-of-speech pair to build a position-specific part-of-speech association weight matrix M'_i, wherein the row and column dimensions of the matrix M equal the number of part-of-speech categories in the tag set, each element of M is the co-occurrence probability of the part of speech corresponding to that element's row and the part of speech corresponding to that element's column, the row and column dimensions of M'_i are identical to those of M, and each element of M'_i is the co-occurrence probability of the part of speech corresponding to that element's row and the part of speech corresponding to that element's column at relative position i;
a word vector learning module, for fusing the modeled matrices M and M'_i into a skip-gram word vector model to build a target model, and for performing word vector learning with the target model to obtain target word vectors, wherein the target word vectors are used for word analogy tasks and word similarity tasks.
6. The system according to claim 5, wherein the part-of-speech tagging module comprises:
a word segmentation module, for segmenting the target text into words, so as to distinguish all the words in the target text;
a part-of-speech tagging submodule, for performing, for each sentence in the target text, part-of-speech tagging on each word using the parts of speech in the tag set, according to the contextual information of the word within the sentence.
7. The system according to claim 5 or 6, wherein the position part-of-speech fusion module comprises:
a part-of-speech information modeling module, for generating, for each word in the target text, a word-part-of-speech pair consisting of the word and its corresponding part of speech, and for building the part-of-speech association weight matrix M from the word-part-of-speech pairs, wherein the row and column dimensions of M equal the number of part-of-speech categories in the tag set, and each element of M is the co-occurrence probability of the part of speech corresponding to that element's row and the part of speech corresponding to that element's column;
a position information modeling module, for modeling the relative position i of the word pairs corresponding to each part-of-speech pair and building the position-specific part-of-speech association weight matrix M'_i, wherein the row and column dimensions of M'_i are identical to those of M, and each element of M'_i is the co-occurrence probability of the part of speech corresponding to that element's row and the part of speech corresponding to that element's column at relative position i.
8. The system according to claim 7, wherein the word vector learning module comprises:
an initial objective function building module, for building the initial objective function
L = Σ_{w∈C} Σ_{u∈Context(w)} log p(u|w),
where C denotes the vocabulary of the entire training corpus, Context(w) denotes the set of context words consisting of the c words before and after the target word w, and c denotes the window size;
a new objective function building module, for fusing the modeled matrices M and M'_i into a negative-sampling-based skip-gram word vector model to build the target model, and for building the new objective function of the target model from the initial objective function:
L = Σ_{w∈C} Σ_{w̃∈Context(w)} Σ_{u∈{w}∪NEG(w)} M'_i(T_u, T_w̃) · { L^w(u) · log σ(v(w̃)^T θ^u) + [1 − L^w(u)] · log(1 − σ(v(w̃)^T θ^u)) },
where NEG(w) is the set of negative samples drawn for the target word w, L^w(u) is the label of sample u (1 for a positive sample, 0 for a negative sample), θ^u is the auxiliary vector of the sample word used during model training, v(w̃)^T is the transpose of the word vector v(w̃) of the context word w̃, and M'_i(T_u, T_w̃) is the co-occurrence probability of the two parts of speech T_u and T_w̃ when their relative position is i;
a word vector learning submodule, for optimizing the new objective function so as to maximize its value, performing gradient computation and updates on the parameters θ^u and v(w̃), and obtaining the target word vectors after the entire training corpus has been traversed.
CN201710384135.6A 2017-05-26 2017-05-26 A word vector training method and system fusing part-of-speech and position information Active CN107239444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710384135.6A CN107239444B (en) A word vector training method and system fusing part-of-speech and position information


Publications (2)

Publication Number Publication Date
CN107239444A true CN107239444A (en) 2017-10-10
CN107239444B CN107239444B (en) 2019-10-08

Family

ID=59985183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710384135.6A Active CN107239444B (en) A word vector training method and system fusing part-of-speech and position information

Country Status (1)

Country Link
CN (1) CN107239444B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866337A (en) * 2009-04-14 2010-10-20 日电(中国)有限公司 Part-or-speech tagging system, and device and method thereof for training part-or-speech tagging model
CN105243129A (en) * 2015-09-30 2016-01-13 清华大学深圳研究生院 Commodity property characteristic word clustering method
CN106649275A (en) * 2016-12-28 2017-05-10 成都数联铭品科技有限公司 Relation extraction method based on part-of-speech information and convolutional neural network


Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305612A (en) * 2017-11-21 2018-07-20 腾讯科技(深圳)有限公司 Text-processing, model training method, device, storage medium and computer equipment
CN108305612B (en) * 2017-11-21 2020-07-31 腾讯科技(深圳)有限公司 Text processing method, text processing device, model training method, model training device, storage medium and computer equipment
CN108229818B (en) * 2017-12-29 2021-07-13 网智天元科技集团股份有限公司 Method and device for establishing talent value measuring and calculating coordinate system
CN108229818A (en) * 2017-12-29 2018-06-29 网智天元科技集团股份有限公司 Talent Value calculates the method for building up and device of coordinate system
CN110348001A (en) * 2018-04-04 2019-10-18 腾讯科技(深圳)有限公司 A kind of term vector training method and server
CN110348001B (en) * 2018-04-04 2022-11-25 腾讯科技(深圳)有限公司 Word vector training method and server
CN108628834A (en) * 2018-05-14 2018-10-09 国家计算机网络与信息安全管理中心 A kind of word lists dendrography learning method based on syntax dependence
CN108628834B (en) * 2018-05-14 2022-04-15 国家计算机网络与信息安全管理中心 Word expression learning method based on syntactic dependency relationship
CN108733653A (en) * 2018-05-18 2018-11-02 华中科技大学 A kind of sentiment analysis method of the Skip-gram models based on fusion part of speech and semantic information
CN108733653B (en) * 2018-05-18 2020-07-10 华中科技大学 Sentiment analysis method of Skip-gram model based on fusion of part-of-speech and semantic information
CN108875810A (en) * 2018-06-01 2018-11-23 阿里巴巴集团控股有限公司 The method and device of negative example sampling is carried out from word frequency list for training corpus
CN109308353A (en) * 2018-09-17 2019-02-05 北京神州泰岳软件股份有限公司 The training method and device of word incorporation model
CN109308353B (en) * 2018-09-17 2023-08-15 鼎富智能科技有限公司 Training method and device for word embedding model
CN109190126A (en) * 2018-09-17 2019-01-11 北京神州泰岳软件股份有限公司 The training method and device of word incorporation model
CN109271636B (en) * 2018-09-17 2023-08-11 鼎富智能科技有限公司 Training method and device for word embedding model
CN109271636A (en) * 2018-09-17 2019-01-25 北京神州泰岳软件股份有限公司 The training method and device of word incorporation model
CN109190126B (en) * 2018-09-17 2023-08-15 北京神州泰岳软件股份有限公司 Training method and device for word embedding model
CN109271422A (en) * 2018-09-20 2019-01-25 华中科技大学 A kind of social networks subject matter expert's lookup method driven by not firm information
CN109344403A (en) * 2018-09-20 2019-02-15 中南大学 A kind of document representation method of enhancing semantic feature insertion
CN109271422B (en) * 2018-09-20 2021-10-08 华中科技大学 Social network subject matter expert searching method driven by unreal information
CN109344403B (en) * 2018-09-20 2020-11-06 中南大学 Text representation method for enhancing semantic feature embedding
CN109325231A (en) * 2018-09-21 2019-02-12 中山大学 A kind of method that multi task model generates term vector
CN109639452A (en) * 2018-10-31 2019-04-16 深圳大学 Social modeling training method, device, server and storage medium
CN109858024A (en) * 2019-01-04 2019-06-07 中山大学 A kind of source of houses term vector training method and device based on word2vec
US20200265297A1 (en) * 2019-02-14 2020-08-20 Beijing Xiaomi Intelligent Technology Co., Ltd. Method and apparatus based on neural network modeland storage medium
CN109858031A (en) * 2019-02-14 2019-06-07 北京小米智能科技有限公司 Neural network model training, context-prediction method and device
EP3696710A1 (en) * 2019-02-14 2020-08-19 Beijing Xiaomi Intelligent Technology Co., Ltd. Method and apparatus based on neural network model and storage medium
CN109858031B (en) * 2019-02-14 2023-05-23 北京小米智能科技有限公司 Neural network model training and context prediction method and device
US11615294B2 (en) * 2019-02-14 2023-03-28 Beijing Xiaomi Intelligent Technology Co., Ltd. Method and apparatus based on position relation-based skip-gram model and storage medium
CN110276052A (en) * 2019-06-10 2019-09-24 北京科技大学 A kind of archaic Chinese automatic word segmentation and part-of-speech tagging integral method and device
CN110287236B (en) * 2019-06-25 2024-03-19 平安科技(深圳)有限公司 Data mining method, system and terminal equipment based on interview information
CN110287236A (en) * 2019-06-25 2019-09-27 平安科技(深圳)有限公司 A kind of data digging method based on interview information, system and terminal device
CN110534087B (en) * 2019-09-04 2022-02-15 清华大学深圳研究生院 Text prosody hierarchical structure prediction method, device, equipment and storage medium
CN110534087A (en) * 2019-09-04 2019-12-03 清华大学深圳研究生院 A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium
CN111506726A (en) * 2020-03-18 2020-08-07 大箴(杭州)科技有限公司 Short text clustering method and device based on part-of-speech coding and computer equipment
CN111506726B (en) * 2020-03-18 2023-09-22 大箴(杭州)科技有限公司 Short text clustering method and device based on part-of-speech coding and computer equipment
CN111859910B (en) * 2020-07-15 2022-03-18 山西大学 Word feature representation method for semantic role recognition and fusing position information
CN111859910A (en) * 2020-07-15 2020-10-30 山西大学 Word feature representation method for semantic role recognition and fusing position information
CN111832282A (en) * 2020-07-16 2020-10-27 平安科技(深圳)有限公司 External knowledge fused BERT model fine adjustment method and device and computer equipment
CN113010670A (en) * 2021-02-22 2021-06-22 腾讯科技(深圳)有限公司 Account information clustering method, account information detection method, account information clustering device and account information detection device, and storage medium
CN113010670B (en) * 2021-02-22 2023-09-19 腾讯科技(深圳)有限公司 Account information clustering method, detection method, device and storage medium

Also Published As

Publication number Publication date
CN107239444B (en) 2019-10-08

Similar Documents

Publication Publication Date Title
CN107239444B (en) A word vector training method and system fusing part-of-speech and position information
Mathews et al. Semstyle: Learning to generate stylised image captions using unaligned text
CN107133211B (en) Composition scoring method based on attention mechanism
CN108519890A (en) A kind of robustness code abstraction generating method based on from attention mechanism
Zaman et al. HTSS: A novel hybrid text summarisation and simplification architecture
CN110750959A (en) Text information processing method, model training method and related device
CN108733653A (en) A kind of sentiment analysis method of the Skip-gram models based on fusion part of speech and semantic information
CN107729309A (en) A kind of method and device of the Chinese semantic analysis based on deep learning
CN110807328A (en) Named entity identification method and system oriented to multi-strategy fusion of legal documents
CN106980608A (en) A kind of Chinese electronic health record participle and name entity recognition method and system
CN112417880A (en) Court electronic file oriented case information automatic extraction method
CN108388560A (en) GRU-CRF meeting title recognition methods based on language model
CN106844345B (en) A kind of multitask segmenting method based on parameter linear restriction
Hu et al. PLANET: Dynamic content planning in autoregressive transformers for long-form text generation
CN111966820B (en) Method and system for constructing and extracting generative abstract model
CN111710428B (en) Biomedical text representation method for modeling global and local context interaction
CN111651973A (en) Text matching method based on syntax perception
CN114756681A (en) Evaluation text fine-grained suggestion mining method based on multi-attention fusion
Johns Distributional social semantics: Inferring word meanings from communication patterns
CN114254645A (en) Artificial intelligence auxiliary writing system
CN109033073A (en) Text contains recognition methods and device
CN112069827A (en) Data-to-text generation method based on fine-grained subject modeling
CN115906816A (en) Text emotion analysis method of two-channel Attention model based on Bert
Zhao Research and design of automatic scoring algorithm for English composition based on machine learning
Advaith et al. Parts of Speech Tagging for Kannada and Hindi Languages using ML and DL models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant