CN107193806A - Method and device for automatic prediction of word sememes - Google Patents

Method and device for automatic prediction of word sememes

Info

Publication number
CN107193806A
Authority
CN
China
Prior art keywords
sememe
word
vector
unknown
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710429027.6A
Other languages
Chinese (zh)
Other versions
CN107193806B (en)
Inventor
孙茂松
谢若冰
袁星驰
刘知远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201710429027.6A
Publication of CN107193806A
Application granted
Publication of CN107193806B
Legal status: Active

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the invention discloses a method and device for automatically predicting the sememes of words. The method includes: calculating, from the word vector of each preset word, the vector distance between each word with unknown sememes and each word with known sememes; selecting, according to the vector distances and a distance threshold, at least one target word with known sememes to form the candidate sememe set of each word with unknown sememes; calculating, from the sememe vectors of the target words in the candidate sememe set, a score for each sememe of each word with unknown sememes; and obtaining, according to a score threshold and the sememe scores, a first sememe vector for each word with unknown sememes. Because the candidate sememe set is determined by vector distance, and the first sememe vector is derived from the scores of the sememes in that set, sememes can be predicted automatically and accurately for words with unknown sememes. This reduces the burden of manual annotation and the deviations that arise when different people annotate.

Description

Method and device for automatic prediction of word sememes
Technical field
Embodiments of the present invention relate to the field of computer technology, and in particular to a method and device for automatically predicting the sememes of words.
Background
A sentence is made up of individual words that express different meanings. Each word has its own particularities, but words also share similarities, and HowNet is used to describe these characteristics. HowNet is built by manual annotation: most common words are labelled with their sememes. The set of sememes is much smaller than the vocabulary, and each sememe expresses a more basic unit of meaning. Different combinations of sememes can represent different words. For example, the sememes of "curio shop" include: place, commerce, buy, sell, jewellery and past. The definition of "curio shop" can then be described with these sememes: a commercial place that buys and sells jewellery from the past is a curio shop. Sememes have many good properties. For example, the similarity of two words can be judged from the intersection of their sememes, and sememes can be used to generate better word vectors for further tasks in natural language processing.
Although sememes have many good properties, annotating them is very time-consuming and laborious. HowNet has existed for more than ten years and was initially annotated by a team of language experts. With the rapid development of information technology, however, the number of words is growing explosively, and annotating the sememes of these new words efficiently, quickly and accurately has become a problem that must be solved. An automatic sememe construction model is urgently needed to replace manual annotation, one that both ensures the predicted sememes have consistent characteristics and avoids the deviations introduced by human annotators.
Summary of the invention
In view of the above problems in the prior art, embodiments of the present invention propose a method and device for automatically predicting the sememes of words.
In a first aspect, an embodiment of the present invention proposes a method for automatically predicting the sememes of words, including:
calculating, according to the word vector of each preset word, the vector distance between each word with unknown sememes and each word with known sememes;
selecting, according to each vector distance and a distance threshold, at least one target word with known sememes to form the candidate sememe set of each word with unknown sememes;
calculating, according to the sememe vector of each target word with known sememes in the candidate sememe set, the score of each sememe for each word with unknown sememes;
obtaining, according to a score threshold and the sememe scores, the first sememe vector of each word with unknown sememes;
wherein the preset words include words with known sememes and words with unknown sememes.
Optionally, the method further includes:
obtaining preset sememes, and calculating the word vector of each preset word according to stochastic gradient descent and the preset sememes.
Optionally, after obtaining the sememe vector of each word with unknown sememes according to the score threshold and the sememe scores, the method further includes:
obtaining a sememe-word matrix according to the preset sememe vectors and the vectors of the words with unknown sememes;
calculating the co-occurrence matrix of the sememe-word matrix according to the sememe-word matrix;
decomposing the sememe-word matrix and the co-occurrence matrix separately by stochastic gradient descent to obtain a second sememe vector;
calculating a target value according to the vectors of the words with unknown sememes and the second sememe vector;
calculating a target sememe vector according to the target value and the first sememe vector;
wherein the sememe-word matrix consists of 0s and 1s: 1 indicates that the corresponding word includes the corresponding sememe, and 0 indicates that it does not.
Optionally, decomposing the sememe-word matrix and the co-occurrence matrix separately by stochastic gradient descent to obtain the second sememe vector specifically includes:
decomposing the sememe-word matrix and the co-occurrence matrix separately according to stochastic gradient descent and a loss function to obtain the second sememe vector;
wherein the loss function is:
$$L = \sum_{w \in W,\, s \in S} \left( w \cdot (s + s') + b_w + b_s - M_{ws} \right)^2 + \lambda \sum_{s,\, t \in S} \left( s \cdot t - C_{st} \right)^2$$
where W denotes the vectors of the words with unknown sememes; S and S' denote the first and second preset sememe vectors respectively; λ is a preset coefficient; M_{ws}, C_{st}, w and s are elements of the sememe-word matrix, the co-occurrence matrix, W and S respectively, with s' the element of S' corresponding to s; b_w is the bias of the word vector; and b_s is the bias of the first preset sememe vector.
In a second aspect, an embodiment of the present invention further proposes a device for automatically predicting the sememes of words, including:
a distance calculation module, configured to calculate, according to the word vector of each preset word, the vector distance between each word with unknown sememes and each word with known sememes;
a sememe set determination module, configured to select, according to each vector distance and a distance threshold, at least one target word with known sememes to form the candidate sememe set of each word with unknown sememes;
a sememe score calculation module, configured to calculate, according to the sememe vector of each target word with known sememes in the candidate sememe set, the score of each sememe for each word with unknown sememes;
a sememe vector determination module, configured to obtain, according to a score threshold and the sememe scores, the first sememe vector of each word with unknown sememes;
wherein the preset words include words with known sememes and words with unknown sememes.
Optionally, the device further includes:
a word vector calculation module, configured to obtain preset sememes and to calculate the word vector of each preset word according to stochastic gradient descent and the preset sememes.
Optionally, the device further includes:
a sememe-word matrix acquisition module, configured to obtain a sememe-word matrix according to the preset sememe vectors and the vectors of the words with unknown sememes;
a co-occurrence matrix calculation module, configured to calculate the co-occurrence matrix of the sememe-word matrix according to the sememe-word matrix;
a matrix decomposition module, configured to decompose the sememe-word matrix and the co-occurrence matrix separately by stochastic gradient descent to obtain a second sememe vector;
a target value calculation module, configured to calculate a target value according to the vectors of the words with unknown sememes and the second sememe vector;
a target sememe vector calculation module, configured to calculate a target sememe vector according to the target value and the first sememe vector;
wherein the sememe-word matrix consists of 0s and 1s: 1 indicates that the corresponding word includes the corresponding sememe, and 0 indicates that it does not.
Optionally, the matrix decomposition module is specifically configured to decompose the sememe-word matrix and the co-occurrence matrix separately according to stochastic gradient descent and a loss function to obtain the second sememe vector;
wherein the loss function is:
$$L = \sum_{w \in W,\, s \in S} \left( w \cdot (s + s') + b_w + b_s - M_{ws} \right)^2 + \lambda \sum_{s,\, t \in S} \left( s \cdot t - C_{st} \right)^2$$
where W denotes the vectors of the words with unknown sememes; S and S' denote the first and second preset sememe vectors respectively; λ is a preset coefficient; M_{ws}, C_{st}, w and s are elements of the sememe-word matrix, the co-occurrence matrix, W and S respectively, with s' the element of S' corresponding to s; b_w is the bias of the word vector; and b_s is the bias of the first preset sememe vector.
As can be seen from the above technical solutions, embodiments of the present invention determine the candidate sememe set of each word with unknown sememes from the vector distances between that word and each word with known sememes, then calculate the score of each sememe in the candidate sememe set and thereby obtain the first sememe vector of each word with unknown sememes. Sememes can thus be predicted automatically and accurately for words with unknown sememes, which reduces the burden of manual annotation and the deviations that arise when different people annotate.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for automatically predicting word sememes according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the sememes of the word "curio shop" according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the sememes of the word "apple" according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of selecting the candidate sememe set according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a device for automatically predicting word sememes according to an embodiment of the present invention.
Detailed description
Embodiments of the present invention are further described below with reference to the drawings. The following embodiments serve only to illustrate the technical solutions of the invention more clearly and do not limit its scope of protection.
Fig. 1 shows a schematic flowchart of the method for automatically predicting word sememes provided by this embodiment, including:
S101: calculating, according to the word vector of each preset word, the vector distance between each word with unknown sememes and each word with known sememes.
The preset words include words with known sememes and words with unknown sememes.
Specifically, the frequency of each word and the contextual relations between different words are first counted over a large corpus; the resulting co-occurrence matrix is then decomposed by stochastic gradient descent to obtain the word vectors.
The co-occurrence matrix contains rich textual information and the correlations between words. Reducing its dimensionality by matrix decomposition yields low-dimensional word representations that still reflect the correlations between words well.
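The decomposition described above could be sketched as follows. This is a minimal illustration only: the squared reconstruction error, the function name `factorize_cooccurrence` and the parameters `dim`, `lr` and `epochs` are assumptions, since the description does not state the exact objective used to train the word vectors.

```python
import numpy as np

def factorize_cooccurrence(C, dim=2, lr=0.05, epochs=200, seed=0):
    """Learn low-dimensional vectors so that U[i] . V[j] approximates C[i, j].

    Assumed objective: squared reconstruction error, minimised cell by cell
    with stochastic gradient descent.
    """
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    U = rng.normal(scale=0.1, size=(n, dim))  # word vectors
    V = rng.normal(scale=0.1, size=(n, dim))  # context vectors
    cells = [(i, j) for i in range(n) for j in range(n)]
    for _ in range(epochs):
        # Visit the matrix cells in a fresh random order each epoch.
        for k in rng.permutation(len(cells)):
            i, j = cells[k]
            err = U[i] @ V[j] - C[i, j]
            grad_u = err * V[j]
            grad_v = err * U[i]
            U[i] -= lr * grad_u
            V[j] -= lr * grad_v
    return U, V
```

The rows of `U` then play the role of the low-dimensional word vectors used in the following steps.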
The content of the sememe vector of each preset word is illustrated in Figs. 2 and 3: Fig. 2 shows the sememes of the Chinese word "curio shop", and Fig. 3 shows the sememes of the English word "apple".
S102: selecting, according to each vector distance and a distance threshold, at least one target word with known sememes to form the candidate sememe set of each word with unknown sememes.
The selection of the candidate sememe set is shown in Fig. 4: for a word with unknown sememes, the several nearest words with known sememes are found in the vector space, and their sememes form the candidate sememe set. These sememes are then scored according to the distances between the words.
Specifically, for each new word, its distance to every word with a known sememe vector is first calculated, and the several nearest words are selected. For the selected nearest words, the weight of their sememes with respect to the new word is then calculated.
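A minimal sketch of this neighbour selection is given below. It assumes cosine similarity as the distance measure and combines a similarity threshold with a cap on the number of neighbours; the names `select_candidates`, `min_similarity` and `top_k` are hypothetical and do not come from the description.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_candidates(new_vec, known_vecs, min_similarity=0.5, top_k=3):
    """Return (word, similarity) pairs for the nearest annotated words.

    known_vecs maps each word with known sememes to its word vector.
    Words below the similarity threshold are discarded, and at most
    top_k of the remaining words are kept, nearest first.
    """
    scored = [(word, cosine_similarity(new_vec, vec))
              for word, vec in known_vecs.items()]
    scored = [pair for pair in scored if pair[1] >= min_similarity]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The sememes of the returned words would then make up the candidate sememe set of the new word.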
If a word is close to the new word, the sememes of that word are more likely to be sememes of the new word. The score that any sememe can obtain for a new word can therefore be expressed as follows, consistent with the first term of the final scoring function given later in this description:
$$\Pr(s \mid w) = \sum_{v \in W} \cos(v, w)\, M_{vs}$$
where w denotes the new word, s denotes a sememe, W is the set of all words with known sememes, and M_{vs} indicates whether word v has sememe s (1 if it does, 0 otherwise). The higher Pr(s | w), the more likely s is a sememe of w.
In practice the computation is somewhat more involved. Because the similarities between normalized vectors all lie in (-1, 1), they do not separate different sememes well. The sememes of nearer words are therefore given larger weight: a hyperparameter p is introduced, and the contribution of the k-th nearest word is multiplied by p^k, which increases the discrimination between the sememes of different words. Multiplying by an exponentially decaying coefficient also keeps Pr(s | w) within a bounded range so that it does not diverge.
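The rank-decayed scoring could look like the sketch below. It assumes that the neighbours are already sorted from nearest to farthest, that similarity means cosine similarity, and that the hyperparameter p (called `decay` here) lies between 0 and 1; the function and argument names are illustrative only.

```python
def sememe_scores(neighbours, annotations, decay=0.8):
    """Score candidate sememes for a new word.

    neighbours:  list of (word, similarity) pairs, nearest first.
    annotations: dict mapping each known word to its set of sememes.
    decay:       the hyperparameter p; the k-th nearest word is weighted
                 by similarity * p**k (k starts at 1).
    """
    scores = {}
    for rank, (word, similarity) in enumerate(neighbours, start=1):
        weight = similarity * decay ** rank
        for sememe in annotations[word]:
            scores[sememe] = scores.get(sememe, 0.0) + weight
    return scores
```

Sememes shared by several near neighbours accumulate weight, while sememes that appear only on distant neighbours receive small scores.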
S103: calculating, according to the sememe vector of each target word with known sememes in the candidate sememe set, the score of each sememe for each word with unknown sememes.
S104: obtaining, according to a score threshold and the sememe scores, the first sememe vector of each word with unknown sememes.
Specifically, a word often shares sememes with words similar to it; for example, "China" and "the United States" share sememes such as "proper name" and "country". However, many words also have sememes of their own. The method provided by this embodiment can therefore learn sememes from similar words and can also learn distinctive sememes.
This embodiment determines the candidate sememe set of each word with unknown sememes from the vector distances between that word and each word with known sememes, then calculates the score of each sememe in the candidate sememe set and thereby obtains the first sememe vector of each word with unknown sememes. Sememes can thus be predicted automatically and accurately for words with unknown sememes, which reduces the burden of manual annotation and the deviations that arise when different people annotate.
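The thresholding in step S104 might be sketched as follows, assuming the first sememe vector is a 0/1 vector over a fixed sememe inventory. The inventory order, the function name and the threshold value in the usage note are hypothetical.

```python
def first_sememe_vector(scores, sememe_inventory, score_threshold):
    """Turn sememe scores into a 0/1 vector over a fixed sememe inventory.

    A position is 1 when the score of that sememe reaches the threshold,
    and 0 otherwise (including sememes that received no score at all).
    """
    return [1 if scores.get(sememe, 0.0) >= score_threshold else 0
            for sememe in sememe_inventory]
```

For example, with scores of 0.625 for "place", 0.5 for "commerce" and 0.1 for "jewellery", and a threshold of 0.3, only the positions for "place" and "commerce" are set to 1.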
Further, on the basis of the above method embodiment, the method further includes:
S100: obtaining preset sememes, and calculating the word vector of each preset word according to stochastic gradient descent and the preset sememes.
The preset sememes include 1,400 common sememes.
Further, on the basis of the above method embodiment, after obtaining the sememe vector of each word with unknown sememes according to the score threshold and the sememe scores, the method further includes:
S105: obtaining a sememe-word matrix according to the preset sememe vectors and the vectors of the words with unknown sememes;
wherein the sememe-word matrix consists of 0s and 1s: 1 indicates that the corresponding word includes the corresponding sememe, and 0 indicates that it does not.
S106: calculating the co-occurrence matrix of the sememe-word matrix according to the sememe-word matrix;
The sememe-sememe co-occurrence matrix contains rich information about the relations between sememes. Just as a word co-occurrence matrix can be used to generate word vectors, the sememe co-occurrence matrix can help generate better sememe vectors.
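Steps S105 and S106 could be sketched as below. The description does not say how co-occurrence is weighted, so this sketch assumes a raw count: the number of words annotated with both sememes. The helper names and the example annotations are hypothetical.

```python
import numpy as np

def sememe_word_matrix(annotations, words, sememes):
    """Build the 0/1 matrix M with M[i, j] = 1 if words[i] has sememes[j]."""
    M = np.zeros((len(words), len(sememes)), dtype=int)
    for i, word in enumerate(words):
        for j, sememe in enumerate(sememes):
            if sememe in annotations[word]:
                M[i, j] = 1
    return M

def sememe_cooccurrence(M):
    """Count, for each pair of sememes, the words annotated with both.

    Assumption: raw counts are used. The diagonal then holds the number
    of words annotated with each sememe.
    """
    return M.T @ M
```

A weighted variant (for example, a normalised count) could be substituted without changing the later factorization step.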
S107: decomposing the sememe-word matrix and the co-occurrence matrix separately by stochastic gradient descent to obtain the second sememe vector;
Specifically, the 0-1 matrix of words and sememes is calculated first; the co-occurrence matrix between sememes is then calculated; finally, both matrices are decomposed by stochastic gradient descent to obtain the sememe vectors.
S108: calculating a target value according to the vectors of the words with unknown sememes and the second sememe vector;
S109: calculating a target sememe vector according to the target value and the first sememe vector;
Further, on the basis of the above method embodiment, S107 specifically includes:
decomposing the sememe-word matrix and the co-occurrence matrix separately according to stochastic gradient descent and a loss function to obtain the second sememe vector;
wherein the loss function is:
$$L = \sum_{w \in W,\, s \in S} \left( w \cdot (s + s') + b_w + b_s - M_{ws} \right)^2 + \lambda \sum_{s,\, t \in S} \left( s \cdot t - C_{st} \right)^2$$
where W denotes the vectors of the words with unknown sememes; S and S' denote the first and second preset sememe vectors respectively; λ is a preset coefficient; M_{ws}, C_{st}, w and s are elements of the sememe-word matrix, the co-occurrence matrix, W and S respectively, with s' the element of S' corresponding to s; b_w is the bias of the word vector; and b_s is the bias of the first preset sememe vector.
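To make the two terms of the loss concrete, the sketch below evaluates it in matrix form. It assumes that the rows of `W_vecs` are word vectors, that the rows of `S` and `S_prime` are the first and second sememe vectors, and that `b_w` and `b_s` are bias vectors; these array layouts and names are illustrative choices, not part of the description.

```python
import numpy as np

def factorization_loss(W_vecs, S, S_prime, b_w, b_s, M, C, lam):
    """Evaluate the loss for given parameters.

    First term:  squared error between w . (s + s') + b_w + b_s and M[w, s].
    Second term: squared error between s . t and C[s, t], weighted by lam.
    """
    pred = W_vecs @ (S + S_prime).T + b_w[:, None] + b_s[None, :]
    word_term = np.sum((pred - M) ** 2)
    sememe_term = np.sum((S @ S.T - C) ** 2)
    return float(word_term + lam * sememe_term)
```

Stochastic gradient descent would repeatedly sample entries of M and C and adjust the corresponding vectors and biases so that this value decreases.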
Reducing L by gradient descent yields a good vector representation of the sememes. Finally, the following function can be used to calculate how likely a sememe is to belong to a new word:
$$\Pr(s \mid w) = \sum_{v \in W} \cos(v, w)\, M_{vs} + \lambda \cos(w, s)$$
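The final scoring function can be illustrated with a short sketch. It assumes that the word vectors and the learned sememe vector live in the same space, so that cos(w, s) is defined; the names `combined_score`, `known_vecs` and `has_sememe` are hypothetical.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def combined_score(w, sememe_vec, known_vecs, has_sememe, lam):
    """Score one sememe s for a new word w.

    known_vecs: word vectors v of the words with known sememes.
    has_sememe: the entries M_vs (1 if word v has sememe s, else 0),
                in the same order as known_vecs.
    The first term sums cos(v, w) * M_vs over the known words; the second
    adds lam * cos(w, s) using the learned sememe vector.
    """
    neighbour_term = sum(cos(v, w) * m for v, m in zip(known_vecs, has_sememe))
    return neighbour_term + lam * cos(w, sememe_vec)
```

Ranking all sememes by this score and keeping those above a threshold would give the predicted sememes of the new word.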
At present there is no model for automatic sememe prediction. Sememes are annotated by hand, which is time-consuming and laborious, and the quality of annotation varies from person to person, which considerably affects the accuracy of the sememes. This embodiment can use existing annotated data to predict sememes automatically. In a test using part of the HowNet data as the test set, the results of this embodiment overlapped substantially with the manual annotations, and accuracy was high. The model of this embodiment can also discover recommended sememes that are not annotated in HowNet, and these newly discovered candidate sememes are also correct to a considerable degree.
Fig. 5 shows a schematic structural diagram of the device for automatically predicting word sememes provided by this embodiment. The device includes a distance calculation module 501, a sememe set determination module 502, a sememe score calculation module 503 and a sememe vector determination module 504, wherein:
the distance calculation module 501 is configured to calculate, according to the word vector of each preset word, the vector distance between each word with unknown sememes and each word with known sememes;
the sememe set determination module 502 is configured to select, according to each vector distance and a distance threshold, at least one target word with known sememes to form the candidate sememe set of each word with unknown sememes;
the sememe score calculation module 503 is configured to calculate, according to the sememe vector of each target word with known sememes in the candidate sememe set, the score of each sememe for each word with unknown sememes;
the sememe vector determination module 504 is configured to obtain, according to a score threshold and the sememe scores, the first sememe vector of each word with unknown sememes;
wherein the preset words include words with known sememes and words with unknown sememes.
Specifically, the distance calculation module 501 calculates, according to the word vector of each preset word, the vector distance between each word with unknown sememes and each word with known sememes; the sememe set determination module 502 selects, according to each vector distance and a distance threshold, at least one target word with known sememes to form the candidate sememe set of each word with unknown sememes; the sememe score calculation module 503 calculates, according to the sememe vector of each target word with known sememes in the candidate sememe set, the score of each sememe for each word with unknown sememes; and the sememe vector determination module 504 obtains, according to a score threshold and the sememe scores, the first sememe vector of each word with unknown sememes.
This embodiment determines the candidate sememe set of each word with unknown sememes from the vector distances between that word and each word with known sememes, then calculates the score of each sememe in the candidate sememe set and thereby obtains the first sememe vector of each word with unknown sememes. Sememes can thus be predicted automatically and accurately for words with unknown sememes, which reduces the burden of manual annotation and the deviations that arise when different people annotate.
Further, on the basis of the above device embodiment, the device further includes:
a word vector calculation module, configured to obtain preset sememes and to calculate the word vector of each preset word according to stochastic gradient descent and the preset sememes.
Further, on the basis of the above device embodiment, the device further includes:
a sememe-word matrix acquisition module, configured to obtain a sememe-word matrix according to the preset sememe vectors and the vectors of the words with unknown sememes;
a co-occurrence matrix calculation module, configured to calculate the co-occurrence matrix of the sememe-word matrix according to the sememe-word matrix;
a matrix decomposition module, configured to decompose the sememe-word matrix and the co-occurrence matrix separately by stochastic gradient descent to obtain a second sememe vector;
a target value calculation module, configured to calculate a target value according to the vectors of the words with unknown sememes and the second sememe vector;
a target sememe vector calculation module, configured to calculate a target sememe vector according to the target value and the first sememe vector;
wherein the sememe-word matrix consists of 0s and 1s: 1 indicates that the corresponding word includes the corresponding sememe, and 0 indicates that it does not.
Further, on the basis of the above device embodiment, the matrix decomposition module is specifically configured to decompose the sememe-word matrix and the co-occurrence matrix separately according to stochastic gradient descent and a loss function to obtain the second sememe vector;
wherein the loss function is:
$$L = \sum_{w \in W,\, s \in S} \left( w \cdot (s + s') + b_w + b_s - M_{ws} \right)^2 + \lambda \sum_{s,\, t \in S} \left( s \cdot t - C_{st} \right)^2$$
where W denotes the vectors of the words with unknown sememes; S and S' denote the first and second preset sememe vectors respectively; λ is a preset coefficient; M_{ws}, C_{st}, w and s are elements of the sememe-word matrix, the co-occurrence matrix, W and S respectively, with s' the element of S' corresponding to s; b_w is the bias of the word vector; and b_s is the bias of the first preset sememe vector.
The device for automatically predicting word sememes described in this embodiment can be used to perform the above method embodiments. Its principle and technical effects are similar and are not repeated here.
The device embodiment described above is merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.
It should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for automatically predicting the sememes (义原) of words, characterized by comprising:
calculating, according to the word vector of each preset word, the vector distance between each unknown-sememe word and each known-sememe word;
selecting, according to each vector distance and a distance threshold, at least one target known-sememe word as the candidate sememe set of each unknown-sememe word;
calculating, according to the sememe vector of each target known-sememe word in the candidate sememe set, a score for each sememe of each unknown-sememe word;
obtaining, according to a score threshold and the sememe scores, a first sememe vector of each unknown-sememe word;
wherein the preset words comprise the known-sememe words and the unknown-sememe words.
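As an illustration only, the four steps of claim 1 can be sketched in Python as follows. The claim does not fix the distance measure or the scoring rule, so the cosine distance, the similarity-weighted vote, the function name `predict_sememes`, and the default threshold values are all assumptions made for this sketch rather than the claimed implementation.

```python
import numpy as np

def predict_sememes(word_vecs, known_sememes, unknown_words,
                    num_sememes, dist_threshold=0.5, score_threshold=0.8):
    """Return {word: 0/1 sememe vector} for each unknown-sememe word.

    word_vecs:      {word: 1-D numpy array} for all preset words
    known_sememes:  {word: 0/1 numpy array of length num_sememes}
    """
    # Unit-normalise so that cosine distance is 1 - dot product.
    unit = {w: v / np.linalg.norm(v) for w, v in word_vecs.items()}
    predictions = {}
    for u in unknown_words:
        scores = np.zeros(num_sememes)
        for k, sememe_vec in known_sememes.items():
            dist = 1.0 - float(unit[u] @ unit[k])   # step 1: vector distance
            if dist <= dist_threshold:              # step 2: candidate set
                # Step 3: similarity-weighted vote (assumed scoring rule).
                scores += (1.0 - dist) * sememe_vec
        # Step 4: keep the sememes whose score reaches the score threshold.
        predictions[u] = (scores >= score_threshold).astype(int)
    return predictions
```

With two-dimensional toy vectors in which "pear" lies close to "apple" and "truck" lies close to "car", the unknown words inherit the sememes of their nearest known neighbour, while distant known words are excluded by the distance threshold before scoring.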
2. The method according to claim 1, characterized in that the method further comprises:
obtaining preset sememes, and calculating the word vector of each preset word according to stochastic gradient descent and the preset sememes.
3. The method according to claim 1, characterized in that, after obtaining the sememe vector of each unknown-sememe word according to the score threshold and the sememe scores, the method further comprises:
obtaining a sememe-word matrix according to the preset sememe vectors and the unknown-sememe word vectors;
calculating, according to the sememe-word matrix, a co-occurrence matrix of the sememe-word matrix;
decomposing the sememe-word matrix and the co-occurrence matrix respectively by stochastic gradient descent to obtain a second sememe vector;
calculating a target value according to the unknown-sememe word vectors and the second sememe vector;
calculating a target sememe vector according to the target value and the first sememe vector;
wherein the sememe-word matrix consists of 0s and 1s, 1 indicating that the corresponding word includes the corresponding sememe and 0 indicating that the corresponding word does not include the corresponding sememe.
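A minimal sketch of the two matrices named in claim 3 is given below. The claim only states that the sememe-word matrix is binary and that a co-occurrence matrix is computed from it. Building the matrix from a word-to-sememe annotation dictionary and counting co-occurrences as `M.T @ M` are assumptions of this sketch, as are the function names.

```python
import numpy as np

def build_sememe_word_matrix(word_sememes, words, sememes):
    """M[i, j] = 1 if words[i] is annotated with sememes[j], else 0."""
    col = {s: j for j, s in enumerate(sememes)}
    M = np.zeros((len(words), len(sememes)), dtype=int)
    for i, w in enumerate(words):
        for s in word_sememes.get(w, ()):
            M[i, col[s]] = 1
    return M

def sememe_cooccurrence(M):
    """C[j, k] = number of words annotated with both sememe j and sememe k.

    Raw counts are an assumed choice; a normalised or PMI-style statistic
    could be substituted without changing the rest of the pipeline.
    """
    return M.T @ M
```

Because `M` is binary, the diagonal of `M.T @ M` gives how many words carry each sememe, and each off-diagonal entry counts the words that carry both sememes of the pair.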
4. The method according to claim 3, characterized in that decomposing the sememe-word matrix and the co-occurrence matrix respectively by stochastic gradient descent to obtain the second sememe vector specifically comprises:
decomposing the sememe-word matrix and the co-occurrence matrix respectively according to stochastic gradient descent and a loss function to obtain the second sememe vector;
wherein the loss function is:
$$L = \sum_{w \in W,\, s \in S} \left( w \cdot (s + s') + b_w + b_s - M_{ws} \right)^2 + \lambda \sum_{s,\, t \in S} \left( s \cdot t - C_{st} \right)^2$$
where W is the unknown-sememe word vectors; S and S' are the first preset sememe vectors and the second preset sememe vectors, respectively; λ is a preset coefficient; M_ws, C_st, w, and s are elements of the sememe-word matrix, the co-occurrence matrix, the unknown-sememe word vectors, and the first preset sememe vectors, respectively, with s' the element of S' corresponding to s; b_w is the bias of the unknown-sememe word vector; and b_s is the bias of the first preset sememe vector.
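The decomposition in claim 4 can be illustrated with a small stochastic gradient descent loop over the two terms of the loss. This is a sketch under stated assumptions: the names `factorize` and `factorization_loss`, the embedding size `dim`, the learning rate `lr`, the number of epochs, and the random initialisation are not given in the claim and are chosen here only to make the example runnable. `S2` stands for S'.

```python
import numpy as np

def factorization_loss(M, C, W, S, S2, bw, bs, lam=0.5):
    """Loss of claim 4 evaluated over all (w, s) and (s, t) pairs."""
    pred = W @ (S + S2).T + bw[:, None] + bs[None, :]
    return float(((pred - M) ** 2).sum() + lam * ((S @ S.T - C) ** 2).sum())

def factorize(M, C, dim=8, lam=0.5, lr=0.02, epochs=300, seed=0):
    """Fit W, S, S', b_w, b_s by per-term gradient steps (hyperparameters assumed)."""
    rng = np.random.default_rng(seed)
    n_words, n_sem = M.shape
    W = rng.normal(scale=0.1, size=(n_words, dim))
    S = rng.normal(scale=0.1, size=(n_sem, dim))
    S2 = rng.normal(scale=0.1, size=(n_sem, dim))
    bw, bs = np.zeros(n_words), np.zeros(n_sem)
    for _ in range(epochs):
        # First term: fit the sememe-word matrix M.
        for i in rng.permutation(n_words):
            for j in range(n_sem):
                err = W[i] @ (S[j] + S2[j]) + bw[i] + bs[j] - M[i, j]
                grad_w = 2 * err * (S[j] + S2[j])
                grad_s = 2 * err * W[i]          # same gradient for S and S'
                W[i] -= lr * grad_w
                S[j] -= lr * grad_s
                S2[j] -= lr * grad_s
                bw[i] -= lr * 2 * err
                bs[j] -= lr * 2 * err
        # Second term: fit the co-occurrence matrix C with S alone.
        for j in range(n_sem):
            for k in range(n_sem):
                err = S[j] @ S[k] - C[j, k]
                grad_j = 2 * lam * err * S[k]
                grad_k = 2 * lam * err * S[j]
                S[j] -= lr * grad_j
                S[k] -= lr * grad_k
    return W, S, S2, bw, bs
```

Calling `factorize(M, C, epochs=0)` returns the random initialisation, so comparing `factorization_loss` before and after training gives a quick check that the updates reduce the objective on a given pair of matrices.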
5. A device for automatically predicting the sememes (义原) of words, characterized by comprising:
a distance calculation module, configured to calculate, according to the word vector of each preset word, the vector distance between each unknown-sememe word and each known-sememe word;
a sememe set determination module, configured to select, according to each vector distance and a distance threshold, at least one target known-sememe word as the candidate sememe set of each unknown-sememe word;
a sememe score calculation module, configured to calculate, according to the sememe vector of each target known-sememe word in the candidate sememe set, a score for each sememe of each unknown-sememe word;
a sememe vector determination module, configured to obtain, according to a score threshold and the sememe scores, a first sememe vector of each unknown-sememe word;
wherein the preset words comprise the known-sememe words and the unknown-sememe words.
6. The device according to claim 5, characterized in that the device further comprises:
a word vector calculation module, configured to obtain preset sememes and to calculate the word vector of each preset word according to stochastic gradient descent and the preset sememes.
7. The device according to claim 5, characterized in that the device further comprises:
a sememe-word matrix acquisition module, configured to obtain a sememe-word matrix according to the preset sememe vectors and the unknown-sememe word vectors;
a co-occurrence matrix calculation module, configured to calculate, according to the sememe-word matrix, a co-occurrence matrix of the sememe-word matrix;
a matrix decomposition module, configured to decompose the sememe-word matrix and the co-occurrence matrix respectively by stochastic gradient descent to obtain a second sememe vector;
a target value calculation module, configured to calculate a target value according to the unknown-sememe word vectors and the second sememe vector;
a target sememe vector calculation module, configured to calculate a target sememe vector according to the target value and the first sememe vector;
wherein the sememe-word matrix consists of 0s and 1s, 1 indicating that the corresponding word includes the corresponding sememe and 0 indicating that the corresponding word does not include the corresponding sememe.
8. The device according to claim 7, characterized in that the matrix decomposition module is specifically configured to decompose the sememe-word matrix and the co-occurrence matrix respectively according to stochastic gradient descent and a loss function to obtain the second sememe vector;
wherein the loss function is:
$$L = \sum_{w \in W,\, s \in S} \left( w \cdot (s + s') + b_w + b_s - M_{ws} \right)^2 + \lambda \sum_{s,\, t \in S} \left( s \cdot t - C_{st} \right)^2$$
where W is the unknown-sememe word vectors; S and S' are the first preset sememe vectors and the second preset sememe vectors, respectively; λ is a preset coefficient; M_ws, C_st, w, and s are elements of the sememe-word matrix, the co-occurrence matrix, the unknown-sememe word vectors, and the first preset sememe vectors, respectively, with s' the element of S' corresponding to s; b_w is the bias of the unknown-sememe word vector; and b_s is the bias of the first preset sememe vector.
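For reference, the per-term gradients that a stochastic gradient descent step on this loss would use can be written out as below. The residual symbols e_ws and g_st and the learning rate η are notation introduced only for this derivation and do not appear in the claims; the case s ≠ t is shown for the second term.

```latex
% Residuals of one (w, s) term and one (s, t) term of the loss
e_{ws} = w \cdot (s + s') + b_w + b_s - M_{ws}, \qquad
g_{st} = s \cdot t - C_{st}

% Gradients of the first term (e_{ws})^2
\frac{\partial e_{ws}^2}{\partial w} = 2\, e_{ws}\,(s + s'), \quad
\frac{\partial e_{ws}^2}{\partial s} = \frac{\partial e_{ws}^2}{\partial s'} = 2\, e_{ws}\, w, \quad
\frac{\partial e_{ws}^2}{\partial b_w} = \frac{\partial e_{ws}^2}{\partial b_s} = 2\, e_{ws}

% Gradients of the second term \lambda (g_{st})^2, for s \neq t
\frac{\partial\, \lambda g_{st}^2}{\partial s} = 2 \lambda\, g_{st}\, t, \qquad
\frac{\partial\, \lambda g_{st}^2}{\partial t} = 2 \lambda\, g_{st}\, s

% Generic update for any parameter \theta with assumed learning rate \eta
\theta \leftarrow \theta - \eta\, \frac{\partial L_{\text{term}}}{\partial \theta}
```

Since s and s' enter the first term only through their sum, they receive the same gradient from it; only the co-occurrence term distinguishes s from s'.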
CN201710429027.6A 2017-06-08 2017-06-08 Method and device for automatic prediction of word sememes Active CN107193806B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710429027.6A CN107193806B (en) 2017-06-08 2017-06-08 A kind of automatic prediction method and device that vocabulary justice is former

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710429027.6A CN107193806B (en) 2017-06-08 2017-06-08 A kind of automatic prediction method and device that vocabulary justice is former

Publications (2)

Publication Number Publication Date
CN107193806A (en) 2017-09-22
CN107193806B (en) 2019-11-22

Family

ID=59877677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710429027.6A Active CN107193806B (en) 2017-06-08 2017-06-08 Method and device for automatic prediction of word sememes

Country Status (1)

Country Link
CN (1) CN107193806B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186647A (en) * 2011-12-31 2013-07-03 北京金山软件有限公司 Method and device for sequencing according to contribution degree
CN103150388A (en) * 2013-03-21 2013-06-12 天脉聚源(北京)传媒科技有限公司 Method and device for extracting key words
CN104699819A (en) * 2015-03-26 2015-06-10 浪潮集团有限公司 Sememe classification method and device
CN106610949A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 Text feature extraction method based on semantic analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHUAN-JIE ET AL.: "Dimentional Sentiment Analysis by Synsets and Sense Definitions", 《2016 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP)》 *
YAN WANG ET AL.: "Incorporating Linguistic Knowledge for Learning Distributed Word Representations", 《PLOS ONE》 *
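SUN MAOSONG ET AL.: "借重于人工知识库的词和义项的向量表示_以HowNet为例" (Vector representations of words and senses drawing on a manually built knowledge base: the case of HowNet), 《中文信息学报》 (Journal of Chinese Information Processing) *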

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984533A (en) * 2018-08-03 2018-12-11 清华大学 Method and device for predicting word sememes
CN109271633A (en) * 2018-09-17 2019-01-25 北京神州泰岳软件股份有限公司 Word vector training method and device for single semantic supervision
CN109299459A (en) * 2018-09-17 2019-02-01 北京神州泰岳软件股份有限公司 Word vector training method and device for single semantic supervision
CN109271633B (en) * 2018-09-17 2023-08-18 鼎富智能科技有限公司 Word vector training method and device for single semantic supervision
CN109299459B (en) * 2018-09-17 2023-08-22 北京神州泰岳软件股份有限公司 Word vector training method and device for single semantic supervision
CN109446518A (en) * 2018-10-09 2019-03-08 清华大学 Decoding method and decoder for language model
CN109446518B (en) * 2018-10-09 2020-06-02 清华大学 Decoding method and decoder for language model
CN109597988A (en) * 2018-10-31 2019-04-09 清华大学 Cross-language word sememe prediction method, device and electronic equipment
CN109597988B (en) * 2018-10-31 2020-04-28 清华大学 Cross-language vocabulary semantic prediction method and device and electronic equipment

Also Published As

Publication number Publication date
CN107193806B (en) 2019-11-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant