CN107066443A - Linear regression-based sentence similarity acquisition method and system applicable to multiple languages - Google Patents
Linear regression-based sentence similarity acquisition method and system applicable to multiple languages
- Publication number
- CN107066443A (Application No. CN201710187215.2A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- similar features
- value
- feature value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a linear regression-based sentence similarity acquisition method applicable to multiple languages, comprising the following steps: obtaining two or more similarity feature values of two sentences; selecting the feature weight corresponding to each similarity feature value according to the language and application field of the two sentences; and combining the two or more similarity feature values linearly according to their corresponding feature weights to obtain the composite similarity value of the two sentences. The invention also discloses a linear regression-based sentence similarity acquisition system applicable to multiple languages, comprising an acquiring unit, a selecting unit, and a linear regression unit. By applying weighted linear regression to the different similarity feature values of sentences, the method and system of the invention are applicable to multiple languages and a variety of usage contexts.
Description
Technical field
The present invention relates to the technical field of machine translation, and in particular to a linear regression-based sentence similarity acquisition method and system applicable to multiple languages.
Background technology
With the rapid development of economic globalization and the Internet, the translation of natural language plays an increasingly important role in promoting political, economic, and cultural exchange. In the past, translating written and spoken language for international exchange required human translators, which was time-consuming and laborious. With the rapid development of computer hardware, machine translation and computer-aided translation have come into increasingly wide use.
Comparing sentence similarity is an important topic in machine translation and computer-aided translation research. Traditional comparison methods mostly operate on a single level, such as the vocabulary overlap rate, the language-model overlap rate, or vectorizing the vocabulary and comparing distances in a semantic space with the cosine measure. Such single-level sentence similarity comparison methods cannot make a suitable comparison for the characteristics of different languages: a comparison method suited to English, for example, is not suited to comparing Chinese. Likewise, single-level methods cannot make a suitable comparison for the characteristics of different usage contexts: a comparison method suited to news English, for example, is not suited to comparing spoken English.
The State Patent Office invention patent No. CN201110303522.5 discloses a method and apparatus for computing sentence similarity together with a machine translation method and apparatus. That patent compares sentence similarity by lexical difference; the method works for languages and contexts where lexical differences are pronounced, but compares poorly for languages and contexts where lexical differences are slight and sentence-pattern differences are pronounced.
Summary of the invention
The technical problem to be solved by the invention is that existing sentence similarity comparison methods are not applicable to multiple languages and multiple usage contexts. The purpose of the invention is to provide a linear regression-based sentence similarity acquisition method and system applicable to multiple languages, solving the above problem.
The present invention is achieved through the following technical solution:
A linear regression-based sentence similarity acquisition method applicable to multiple languages comprises the following steps. S1: obtain two or more similarity feature values fi of two sentences, fi including f1, f2, f3, ..., fn. S2: select the feature weight ωi corresponding to each similarity feature value according to the language and application field of the two sentences, ωi including ω1, ω2, ω3, ..., ωn. S3: combine the two or more similarity feature values linearly according to their corresponding feature weights, obtaining the composite similarity value fs of the two sentences. The linear regression formula is:

fs = ω1·f1 + ω2·f2 + ... + ωn·fn

where fi is a similarity feature value, ωi is the feature weight corresponding to fi, and fs is the composite similarity value.
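Steps S1 to S3 amount to a weighted sum of the per-feature similarity values. A minimal sketch in Python (the language and the function name are our choice, not the patent's); the numeric check reuses the drama-English values and weights that appear in embodiment 7 later in the description:

```python
def composite_similarity(features, weights):
    """S3 of the method: f_s = w_1*f_1 + w_2*f_2 + ... + w_n*f_n."""
    if len(features) != len(weights):
        raise ValueError("one weight is needed per similarity feature value")
    return sum(w * f for f, w in zip(features, weights))

# Embodiment 7's values: f1 = 1.0, f2 = 0.5, f3 = 0.63 with
# weights 0.5, 0.1, 0.4 give a composite value of about 0.80.
print(round(composite_similarity([1.0, 0.5, 0.63], [0.5, 0.1, 0.4]), 2))  # → 0.8
```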
In the prior art, sentence similarity comparison mostly operates on a single level, and such single-level comparison methods cannot make a suitable comparison for different language characteristics and different usage contexts: a method suited to English is not suited to comparing Chinese, and a method suited to news English is not suited to comparing spoken English. When the invention is applied, the two or more similarity feature values fi of the two sentences are first obtained, fi including f1, f2, f3, ..., fn; the feature weight ωi corresponding to each similarity feature value is then selected according to the language and application field of the two sentences, ωi including ω1, ω2, ω3, ..., ωn; finally the similarity feature values are combined linearly according to their corresponding feature weights, yielding the composite similarity value fs = ω1·f1 + ω2·f2 + ... + ωn·fn, where fi is a similarity feature value, ωi is the feature weight corresponding to fi, and fs is the composite similarity value. As the similarity feature values of the two sentences one may choose, without being limited to, the structure similarity value, the part-of-speech similarity value, or the vocabulary similarity value; these three similarity feature values serve very well for comparing sentence similarity in mainstream languages and mainstream usage contexts. For a special language or usage context, further feature values can be added according to its characteristics; for Turkish, for example, a word-root similarity value can be added.
The feature weight chosen for each similarity feature value differs by language and by application field. Comparing English with German, for example: because German expresses semantics completely through case inflection, syntactic constituents such as the object and the adverbial can occupy very flexible positions in a German sentence, so the feature weight of the part-of-speech similarity value is lowered and the feature weight of the structure similarity value is raised. Comparing news language with drama language, for another example: news consists mostly of long sentences with rigorous syntax, so all features must be weighted evenly, whereas drama sentences tend to be short and show the ungrammaticality of speech, so the feature weight of the part-of-speech similarity value is lowered for drama language. By applying weighted linear regression to the different similarity feature values of sentences, the method of the invention is applicable to multiple languages and multiple usage contexts.
Further, the two or more similarity feature values include the structure similarity value f1 obtained by computation over the two sentences. The calculation steps are as follows. S111: parse the two sentences and obtain the two syntax trees corresponding to them. S112: derive the structure detection values TP, FP, and FN between the two syntax trees. S113: from TP, FP, and FN, compute the structure similarity value f1 of the two sentences by the formulas:

P = TP / (TP + FP), R = TP / (TP + FN), f1 = 2PR / (P + R)

where TP is the structure true-positive count, FP is the structure false-positive count, FN is the structure false-negative count, R is the structure recall, P is the structure precision, and f1 is the structure similarity value.
When the invention is applied, the two sentences to be compared are first parsed to obtain the two syntax trees corresponding to them; the structure detection values TP, FP, and FN between the two syntax trees are then derived; and the structure similarity value is computed as P = TP/(TP + FP), R = TP/(TP + FN), f1 = 2PR/(P + R), where TP is the structure true-positive count, FP the structure false-positive count, FN the structure false-negative count, R the structure recall, and P the structure precision. This algorithm for the structure similarity value reflects the structural characteristics of the sentences well.
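The F1-style computation above can be sketched as follows. The patent does not spell out how TP, FP, and FN are derived from the two trees; treating the trees' constituent-label multisets as the compared sets is our assumption, though it does reproduce the counts of embodiment 2 (TP = 7, FP = FN = 0, f1 = 1):

```python
from collections import Counter

def f1_from_counts(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), f = 2PR/(P+R)."""
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

def structure_similarity(labels_a, labels_b):
    """Structure similarity f_1 over two trees' constituent-label multisets.

    Assumption: TP = labels common to both trees (with multiplicity),
    FP = labels only in tree a, FN = labels only in tree b.
    """
    ca, cb = Counter(labels_a), Counter(labels_b)
    tp = sum((ca & cb).values())
    fp = sum((ca - cb).values())
    fn = sum((cb - ca).values())
    return f1_from_counts(tp, fp, fn)

# Embodiment 2: both trees carry the same seven labels, so f1 = 1
print(structure_similarity("S PP NP VP NP PP NP".split(),
                           "S NP VP PP PP NP NP".split()))  # → 1.0
```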
Further, the two or more similarity feature values include the part-of-speech similarity value f2 computed from the parts of speech of the two sentences. The calculation steps are as follows. S121: parse the two sentences and obtain the two syntax trees corresponding to them. S122: divide the two sentences into a reference sentence and a source sentence; the reference sentence is used only in this computation of the part-of-speech similarity value f2, while the source sentence also needs to have its part-of-speech similarity value computed against other sentences. From the part-of-speech distributions of the two syntax trees, derive the minimum number of steps W required to transform one sentence into the other. S123: compute the part-of-speech similarity value of the two sentences by the formula:

f2 = 1 - W / L

where W is the minimum number of steps required to transform one sentence into the other, L is the length of the reference sentence, and f2 is the part-of-speech similarity value.
Further, the minimum number of steps W required to transform one sentence into the other is computed as the Levenshtein distance.
When the invention is applied, the two sentences are first parsed to obtain the two syntax trees corresponding to them, and the two sentences are divided into a reference sentence and a source sentence; the reference sentence is used only in this computation of f2, while the source sentence also needs to have its part-of-speech similarity value computed against other sentences. From the part-of-speech distributions of the two syntax trees, the minimum number of steps W required to transform one sentence into the other is derived as the Levenshtein distance, and the part-of-speech similarity value is then computed as f2 = 1 - W / L, where L is the length of the reference sentence. In sentence similarity comparison, one sentence usually has to be compared with several sentences; that one sentence is called the source sentence and the several sentences are called reference sentences. This algorithm for the part-of-speech similarity value reflects the part-of-speech characteristics of the sentences well.
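A minimal sketch of the Levenshtein-based part-of-speech similarity f2 = 1 − W/L (the single-row dynamic-programming layout is an implementation choice of ours, not the patent's):

```python
def levenshtein(a, b):
    """Minimum insert/delete/substitute steps to turn sequence a into b."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))          # dp[j] = distance(a[:i], b[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds the diagonal d[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # delete a[i-1]
                        dp[j - 1] + 1,                   # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))   # substitute
            prev = cur
    return dp[n]

def pos_similarity(pos_ref, pos_src):
    """f_2 = 1 - W / L, with L the length of the reference POS sequence."""
    return 1 - levenshtein(pos_ref, pos_src) / len(pos_ref)
```

On the POS sequences of embodiment 3 (IN DT NN PRP VBZ IN NN versus PRP VBD IN NN IN DT NN) this gives W = 6 and, with L = 7, f2 ≈ 0.143, matching the figures reported there.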
Further, the two or more similarity feature values include the vocabulary similarity value f3 obtained by computation over the vocabulary of the two sentences. The calculation steps are as follows. S131: parse the two sentences and obtain the two syntax trees corresponding to them. S132: derive the vocabulary detection values TP', FP', and FN' between the two syntax trees. S133: from TP', FP', and FN', compute the vocabulary similarity value f3 of the two sentences by the formulas:

P' = TP' / (TP' + FP'), R' = TP' / (TP' + FN'), f3 = 2P'R' / (P' + R')

where TP' is the vocabulary true-positive count, FP' is the vocabulary false-positive count, FN' is the vocabulary false-negative count, R' is the vocabulary recall, P' is the vocabulary precision, and f3 is the vocabulary similarity value.
When the invention is applied, the two sentences are first parsed to obtain the two syntax trees corresponding to them; the vocabulary detection values TP', FP', and FN' between the two syntax trees are then derived; and the vocabulary similarity value is computed as P' = TP'/(TP' + FP'), R' = TP'/(TP' + FN'), f3 = 2P'R'/(P' + R'), where TP' is the vocabulary true-positive count, FP' the vocabulary false-positive count, FN' the vocabulary false-negative count, R' the vocabulary recall, and P' the vocabulary precision. This algorithm for the vocabulary similarity value reflects the lexical characteristics of the sentences well.
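The vocabulary computation mirrors the structure one. Again the patent does not define TP', FP', and FN' precisely; counting shared tokens as a multiset is our assumption, though it reproduces embodiment 4 (TP' = 5, FP' = FN' = 2, f3 ≈ 0.714):

```python
from collections import Counter

def lexical_similarity(tokens_a, tokens_b):
    """Vocabulary similarity f_3 = 2P'R'/(P'+R') over token multisets."""
    ca = Counter(t.lower() for t in tokens_a)
    cb = Counter(t.lower() for t in tokens_b)
    tp = sum((ca & cb).values())          # tokens present in both sentences
    fp = sum((ca - cb).values())          # tokens only in sentence a
    fn = sum((cb - ca).values())          # tokens only in sentence b
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

print(round(lexical_similarity(
    "For the moment she works as doctor".split(),
    "She worked as doctor for that moment".split()), 3))  # → 0.714
```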
A linear regression-based sentence similarity acquisition system applicable to multiple languages comprises: an acquiring unit for obtaining two or more similarity feature values of two sentences; a selecting unit for selecting the feature weights according to the language and application field of the two sentences; and a linear regression unit for combining the two or more similarity feature values linearly according to their corresponding feature weights to obtain the composite similarity value of the two sentences.

Further, the two or more similarity feature values obtained by the acquiring unit include the structure similarity value, the part-of-speech similarity value, or the vocabulary similarity value.
In the prior art, sentence similarity comparison mostly operates on a single level, and such single-level comparison methods cannot make a suitable comparison for different language characteristics and different usage contexts: a method suited to English is not suited to comparing Chinese, and a method suited to news English is not suited to comparing spoken English. When the invention is applied, the acquiring unit first obtains the two or more similarity feature values of the two sentences; the selecting unit selects the feature weights according to the language and application field of the two sentences; and the linear regression unit combines the two or more similarity feature values linearly according to their corresponding feature weights, yielding the composite similarity value of the two sentences. As the similarity feature values of the two sentences one may choose, without being limited to, the structure similarity value, the part-of-speech similarity value, or the vocabulary similarity value; these three similarity feature values serve very well for mainstream languages and mainstream usage contexts, while for a special language or usage context, such as Turkish, a word-root similarity value can also be added according to its characteristics. By applying weighted linear regression to the different similarity feature values of sentences, the system of the invention is applicable to multiple languages and multiple usage contexts.
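A sketch of how the three units might be organized in code. Only the division into acquiring, selecting, and linear regression units comes from the description above; the class layout, the weight table, and its (language, field) keys are illustrative assumptions. The weights for ("English", "news") follow embodiment 6 and those for ("English", "drama") follow embodiment 7:

```python
class SimilaritySystem:
    """Sketch of the acquiring / selecting / linear regression units."""

    # Selecting unit: weight tables keyed by (language, application field).
    WEIGHTS = {
        ("English", "news"): [0.35, 0.35, 0.3],
        ("English", "drama"): [0.5, 0.1, 0.4],
    }

    def __init__(self, extractors):
        # Acquiring unit: a list of functions, one per similarity feature f_i.
        self.extractors = extractors

    def acquire(self, sent_a, sent_b):
        return [f(sent_a, sent_b) for f in self.extractors]

    def select(self, language, field):
        return self.WEIGHTS[(language, field)]

    def regress(self, features, weights):
        # Linear regression unit: f_s = sum_i w_i * f_i.
        return sum(w * f for f, w in zip(features, weights))

    def similarity(self, sent_a, sent_b, language, field):
        feats = self.acquire(sent_a, sent_b)
        return self.regress(feats, self.select(language, field))
```

With stub extractors returning embodiment 6's feature values (1.0, 0.96, 0.91), `similarity(..., "English", "news")` reproduces that embodiment's composite value of about 0.96.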
Compared with the prior art, the present invention has the following advantages:
1. The method of the invention applies weighted linear regression to the different similarity feature values of sentences, making it applicable to multiple languages and multiple usage contexts;
2. The algorithm used for the structure similarity value reflects the structural characteristics of the sentences well;
3. The algorithm used for the part-of-speech similarity value reflects the part-of-speech characteristics of the sentences well;
4. The algorithm used for the vocabulary similarity value reflects the lexical characteristics of the sentences well;
5. The system of the invention applies weighted linear regression to the different similarity feature values of sentences, making it applicable to multiple languages and multiple usage contexts.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the embodiments of the invention and constitute a part of the application; they do not limit the embodiments of the invention. In the drawings:
Fig. 1 is a schematic diagram of the method of the invention;
Fig. 2 is a schematic diagram of the system of the invention;
Fig. 3 is a schematic diagram of the calculation steps for the structure similarity value of the invention;
Fig. 4 is a schematic diagram of the calculation steps for the part-of-speech similarity value of the invention;
Fig. 5 is a schematic diagram of the calculation steps for the vocabulary similarity value of the invention;
Fig. 6 is a syntax tree schematic diagram for embodiments 2, 3, and 4 of the invention;
Fig. 7 is a syntax tree schematic diagram for embodiments 2, 3, and 4 of the invention;
Fig. 8 is a syntax tree schematic diagram for embodiment 8 of the invention;
Fig. 9 is a syntax tree schematic diagram for embodiment 8 of the invention;
Fig. 10 is a syntax tree schematic diagram for embodiment 9 of the invention;
Fig. 11 is a syntax tree schematic diagram for embodiment 9 of the invention.
Embodiments
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings. The exemplary embodiments of the invention and their description serve only to explain the invention and are not a limitation of it.
Embodiment 1
As shown in Fig. 1, the linear regression-based sentence similarity acquisition method of the invention applicable to multiple languages comprises the following steps. S1: obtain two or more similarity feature values fi of two sentences, fi including f1, f2, f3, ..., fn. S2: select the feature weight ωi corresponding to each similarity feature value according to the language and application field of the two sentences, ωi including ω1, ω2, ω3, ..., ωn. S3: combine the two or more similarity feature values linearly according to their corresponding feature weights, obtaining the composite similarity value fs of the two sentences by the linear regression formula fs = ω1·f1 + ω2·f2 + ... + ωn·fn, where fi is a similarity feature value, ωi is the feature weight corresponding to fi, and fs is the composite similarity value.
When this embodiment is implemented, the two or more similarity feature values fi of the two sentences are first obtained; the feature weight ωi corresponding to each similarity feature value is then selected according to the language and application field of the two sentences; and finally the similarity feature values are combined linearly according to their corresponding feature weights, yielding the composite similarity value fs = ω1·f1 + ω2·f2 + ... + ωn·fn.
Embodiment 2
As shown in Fig. 3, Fig. 6, and Fig. 7, this embodiment builds on embodiment 1; the two or more similarity feature values include the structure similarity value f1 of the two sentences. The calculation steps are: parse the two sentences and obtain the two syntax trees corresponding to them; derive the structure detection values TP, FP, and FN between the two syntax trees; and from TP, FP, and FN compute the structure similarity value f1 by the formulas P = TP/(TP + FP), R = TP/(TP + FN), f1 = 2PR/(P + R), where TP is the structure true-positive count, FP the structure false-positive count, FN the structure false-negative count, R the structure recall, and P the structure precision.
When this embodiment is implemented, the language is English and the two sentences are:
<a>: For the moment she works as doctor.
<b>: She worked as doctor for that moment.
The two sentences are first parsed to obtain two syntax trees, shown in Fig. 6 and Fig. 7, where:
<a> structural features: S, PP, NP, VP, NP, PP, NP
<b> structural features: S, NP, VP, PP, PP, NP, NP
The structure detection values between the two syntax trees are then derived: TP = 7, FP = 0, FN = 0.
Substituting TP, FP, and FN into P = TP/(TP + FP) and R = TP/(TP + FN) gives P = 1 and R = 1;
substituting P and R into f1 = 2PR/(P + R) gives f1 = 1.
Embodiment 3
As shown in Fig. 4, Fig. 6, and Fig. 7, this embodiment builds on embodiment 1; the two or more similarity feature values include the part-of-speech similarity value f2 computed from the parts of speech of the two sentences. The calculation steps are: parse the two sentences and obtain the two syntax trees corresponding to them; divide the two sentences into a reference sentence, used only in this computation of f2, and a source sentence, which also needs to have its part-of-speech similarity value computed against other sentences; from the part-of-speech distributions of the two syntax trees, derive the minimum number of steps W required to transform one sentence into the other; and compute the part-of-speech similarity value by the formula f2 = 1 - W / L, where W is the minimum number of steps required to transform one sentence into the other, L is the length of the reference sentence, and f2 is the part-of-speech similarity value. The minimum number of steps W is computed as the Levenshtein distance.
When this embodiment is implemented, the language is English and the two sentences are:
<a>: For the moment she works as doctor.
<b>: She worked as doctor for that moment.
The two sentences are first parsed to obtain two syntax trees, shown in Fig. 6 and Fig. 7, where:
<a> part-of-speech features: IN, DT, NN, PRP, VBZ, IN, NN
<b> part-of-speech features: PRP, VBD, IN, NN, IN, DT, NN
The minimum number of steps required to transform one sentence into the other, computed as the Levenshtein distance, is W = 6.
With <a> as the reference sentence and <b> as the source sentence, the reference sentence length is L = 7.
Substituting L and W into f2 = 1 - W / L gives f2 = 0.143.
Embodiment 4
As shown in Fig. 5, Fig. 6, and Fig. 7, this embodiment builds on embodiment 1; the two or more similarity feature values include the vocabulary similarity value f3 computed from the vocabulary of the two sentences. The calculation steps are: parse the two sentences and obtain the two syntax trees corresponding to them; derive the vocabulary detection values TP', FP', and FN' between the two syntax trees; and from TP', FP', and FN' compute the vocabulary similarity value f3 by the formulas P' = TP'/(TP' + FP'), R' = TP'/(TP' + FN'), f3 = 2P'R'/(P' + R'), where TP' is the vocabulary true-positive count, FP' the vocabulary false-positive count, FN' the vocabulary false-negative count, R' the vocabulary recall, and P' the vocabulary precision.
When this embodiment is implemented, the language is English and the two sentences are:
<a>: For the moment she works as doctor.
<b>: She worked as doctor for that moment.
The two sentences are first parsed to obtain two syntax trees, shown in Fig. 6 and Fig. 7, where:
<a> lexical features: for, the, moment, she, works, as, doctor
<b> lexical features: she, worked, as, doctor, for, that, moment
The vocabulary detection values between the two syntax trees are then derived: TP' = 5, FP' = 2, FN' = 2.
Substituting TP', FP', and FN' into P' = TP'/(TP' + FP') and R' = TP'/(TP' + FN') gives P' = 5/7 and R' = 5/7;
substituting P' and R' into f3 = 2P'R'/(P' + R') gives f3 = 0.714.
Embodiment 5
As shown in Fig. 1, Fig. 6, and Fig. 7, this embodiment builds on embodiments 1 to 4. Based on the characteristics of the English language, the feature weight ω1 = 0.5 is chosen for f1, ω2 = 0.1 for f2, and ω3 = 0.4 for f3, with structure similarity value f1 = 1, part-of-speech similarity value f2 = 0.143, and vocabulary similarity value f3 = 0.714.
When this embodiment is implemented, substituting f1, f2, f3, ω1, ω2, and ω3 into the formula gives fs = ω1·f1 + ω2·f2 + ω3·f3 = 1 × 0.5 + 0.143 × 0.1 + 0.714 × 0.4 ≈ 0.80.
Embodiment 6
As shown in Fig. 1, this embodiment builds on embodiments 1 to 4 and analyzes an English news text. Text of the news type consists mostly of long sentences with rigorous syntax, so all features need to be weighted evenly.
When this embodiment is implemented, the two sentences are:
<c>: The stock posted a multi-year low of under $10 in early 2009 before roaring to a recent high of nearly $75 in April 2014.
<d>: The stock will post a multi-year low of under $10 in early 2016 before roaring to a recent high of nearly $75 in April 2014.
The two sentences are first parsed to obtain two syntax trees, where:
<c> structural features: ROOT, IP, NP, DP, NP, VP, IP, NP, QP, NP, VP, PP, PP, IP, NP, VP, NP, PP, NP, NP, QP, NP, PP, NP, QP, NP, VP, PP, QP, NP, QP, PP, QP, NP, QP
<d> structural features: ROOT, IP, NP, DP, NP, VP, IP, NP, QP, NP, VP, PP, PP, IP, NP, VP, NP, PP, NP, NP, QP, NP, PP, NP, QP, NP, VP, PP, QP, NP, QP, PP, QP, NP, QP
<c> part-of-speech features: DT, NN, VV, CD, NN, VV, P, NN, NN, NT, P, NR, CD, NN, NN, P, CD, NN, VA, P, NR, NN, CD, P, NR, CD, PU
<d> part-of-speech features: DT, NN, NN, VV, CD, NN, VV, P, NN, NN, NT, P, NR, CD, NN, NN, P, CD, NN, VA, P, NR, NN, CD, P, NR, CD, PU
<c> lexical features: the, stock, posted, a, multi-year, low, of, under, $, 10, in, early, 2009, before, roaring, to, a, recent, high, of, nearly, $, 75, in, April, 2014
<d> lexical features: the, stock, will, post, a, multi-year, low, of, under, $, 10, in, early, 2016, before, roaring, to, a, recent, high, of, nearly, $, 75, in, April, 2014
Following the calculations of embodiments 2 to 4: f1 = 1.0, f2 = 0.96, f3 = 0.91.
Because news text requires all features to be considered evenly, ω1 = 0.35 is chosen for f1, ω2 = 0.35 for f2, and ω3 = 0.3 for f3.
By the formula of embodiment 1, fs = ω1·f1 + ω2·f2 + ω3·f3 = 1.0 × 0.35 + 0.96 × 0.35 + 0.91 × 0.3 = 0.96.
Embodiment 7
As shown in Fig. 1, this embodiment builds on embodiments 1 to 4 and analyzes an English drama text. Drama sentences tend to be short and show the ungrammaticality of speech, so the weight of the part-of-speech feature needs to be reduced.
When this embodiment is implemented, the two sentences are:
<e>: Louis, you're a good writer.
<f>: You're a good investigator. Maybe...
The two sentences are first parsed to obtain two syntax trees, where:
<e> structural features: ROOT, IP, NP, NP, VP, NP, QP, NP
<f> structural features: ROOT, IP, IP, NP, VP, NP, QP, NP, IP, VP, ADVP, VP
<e> part-of-speech features: NR, PU, PN, VV, CD, NN, NN, PU
<f> part-of-speech features: PN, VV, CD, NN, NN, PU, AD, VV
<e> lexical features: Louis, you, 're, a, good, writer
<f> lexical features: you, 're, a, good, investigator, maybe
Following the calculations of embodiments 2 to 4: f1 = 1.0, f2 = 0.5, f3 = 0.63.
Because drama text requires the weight of the part-of-speech feature to be reduced, ω1 = 0.5 is chosen for f1, ω2 = 0.1 for f2, and ω3 = 0.4 for f3.
By the formula of embodiment 1, fs = ω1·f1 + ω2·f2 + ω3·f3 = 1.0 × 0.5 + 0.5 × 0.1 + 0.63 × 0.4 ≈ 0.80.
Embodiment 8
As shown in Fig. 1, Fig. 8 and Fig. 9, the present embodiment is analyzed German on the basis of embodiment 1 to 4, because moral
Language vocabulary lattice change it is complete express semanteme, the position of the syntactic structure in sentence such as object adverbial modifier can be very clever
It is living, need to reduce the weight of part of speech feature in this case, and increase the weight of architectural feature;
When the present embodiment is implemented, two sentences are respectively:
<g>Er kaufte ein Buch in einer Buchhandlung
<h>Er in einer Buchhandlung ein Buch kaufte
First, the two sentences are parsed to obtain two syntax trees, where:
<g> structural features: ROOT, IP, NP, VP, PP, NP, VP
<h> structural features: ROOT, IP, NP, VP, PP, NP, VP, NP
<g> part-of-speech features: NR, NN, NN, NN, P, NN, VV
<h> part-of-speech features: NR, P, NN, NN, VV, NN, NN
<g> lexical features: Er, kaufte, ein, buch, in, einer, buchhandlung
<h> lexical features: Er, in, einer, buchhandlung, ein, buch, kaufte
Following the calculations of embodiments 2 to 4:
f1 = 0.93, f2 = 0.57, f3 = 1;
Given the characteristics of German, the weight of the part-of-speech feature must be reduced and the weight of the structural feature increased, so the feature weight ω1 = 0.5 is chosen for f1, ω2 = 0.1 for f2, and ω3 = 0.4 for f3;
Substituting into the linear-regression formula of embodiment 1 gives:
f_s = f1ω1 + f2ω2 + f3ω3 = 0.93*0.5 + 0.57*0.1 + 1.0*0.4 = 0.922.
Embodiment 9
As shown in Fig. 1, Fig. 10 and Fig. 11, this embodiment analyzes Chinese on the basis of embodiments 1 to 4. Because the meaning of a Chinese sentence is expressed not through inflection but through word order, and the position of each word carries grammatical meaning, the weight of the part-of-speech feature needs to be increased for this language.
In this embodiment, the two sentences are:
<i>Get dressed
<j>Wear the clothes well
First, the two sentences are parsed to obtain two syntax trees, where:
<i> structural features: ROOT, IP, VP, VV, NP, ADJP, JJ, NP, NN
<j> structural features: ROOT, IP, VP, NP, ADJP, NP
<i> part-of-speech features: VV, JJ, NN
<j> part-of-speech features: AD, VV, NN
<i> lexical features: Wear, good, clothes
<j> lexical features: It is good, wear, clothes
Following the calculations of embodiments 2 to 4:
f1 = 0.83, f2 = 0.33, f3 = 1;
Given the characteristics of Chinese, the weight of the part-of-speech feature must be increased, so the feature weight ω1 = 0.3 is chosen for f1, ω2 = 0.4 for f2, and ω3 = 0.3 for f3;
Substituting into the linear-regression formula of embodiment 1 gives:
f_s = f1ω1 + f2ω2 + f3ω3 = 0.83*0.3 + 0.33*0.4 + 1.0*0.3 ≈ 0.68.
Embodiment 10
As shown in Fig. 2, the system of the present invention for obtaining multilingual sentence similarity based on linear regression includes: an acquiring unit for obtaining two or more similar-feature values of two sentences; a selecting unit for choosing feature weights according to the language and application field of the two sentences; and a linear-regression unit for combining the two or more similar-feature values with their corresponding feature weights by linear regression to obtain the composite similar-feature value of the two sentences.
When this embodiment is implemented, the acquiring unit first obtains the two or more similar-feature values of the two sentences, the selecting unit then chooses the feature weights according to the language and application field of the two sentences, and the linear-regression unit combines the similar-feature values with their corresponding feature weights by linear regression to produce the composite similar-feature value of the two sentences.
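The three units described above can be sketched as follows. This is an illustrative decomposition; the class and method names (`SimilarityAcquirer`, `WeightSelector`, `LinearRegressionUnit`) are assumptions, not names from the patent.

```python
class SimilarityAcquirer:
    """Acquiring unit: obtains two or more similar-feature values for a sentence pair."""
    def acquire(self, feature_fns, sent_a, sent_b):
        # Each feature function maps a sentence pair to one similar-feature value f_i.
        return [fn(sent_a, sent_b) for fn in feature_fns]

class WeightSelector:
    """Selecting unit: chooses feature weights by language and application field."""
    def __init__(self, table):
        self.table = table  # e.g. {("de", "general"): [0.5, 0.1, 0.4]}
    def select(self, language, field):
        return self.table[(language, field)]

class LinearRegressionUnit:
    """Linear-regression unit: combines feature values with their weights."""
    def combine(self, features, weights):
        return sum(f * w for f, w in zip(features, weights))
```

With the German feature values of embodiment 8 and weights [0.5, 0.1, 0.4], `combine` reproduces the composite value 0.922.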
The above embodiments describe the purpose, technical solution, and beneficial effects of the present invention in further detail. It should be understood that the foregoing are merely embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (7)
1. A method for obtaining multilingual sentence similarity based on linear regression, characterized by comprising the following steps:
S1: obtaining two or more similar-feature values f_i of two sentences, where f_i includes f_1, f_2, f_3, ..., f_n;
S2: choosing the feature weight ω_i corresponding to each similar-feature value according to the language and application field of the two sentences, where ω_i includes ω_1, ω_2, ω_3, ..., ω_n;
S3: combining the two or more similar-feature values with their corresponding feature weights by linear regression to obtain the composite similar-feature value f_s of the two sentences;
the linear-regression formula being:
f_s = f_1ω_1 + f_2ω_2 + ... + f_nω_n;
where f_i is a similar-feature value, ω_i is the feature weight corresponding to f_i, and f_s is the composite similar-feature value.
2. The method for obtaining multilingual sentence similarity based on linear regression according to claim 1, characterized in that the two or more similar-feature values include the structural similar-feature value f_i of the two sentences, calculated as follows:
S111: parsing the two sentences and obtaining the two syntax trees corresponding to them;
S112: deriving the structure detection values TP, FP and FN between the two syntax trees;
S113: calculating the structural similar-feature value f_i of the two sentences from the structure detection values TP, FP and FN by the following formulas:
R = TP / (TP + FN);
P = TP / (TP + FP);
f_i = 2PR / (P + R);
where TP is the structure true-positive count, FP is the structure false-positive count, FN is the structure false-negative count, R is the structure recall, P is the structure precision, and f_i is the structural similar-feature value.
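The precision, recall, and F1 combination in claim 2 can be sketched directly. This is a minimal sketch; how TP, FP and FN are counted between the two syntax trees is defined elsewhere in the patent, so the counts are taken as inputs here, and the function name `structural_similarity` is illustrative.

```python
def structural_similarity(tp, fp, fn):
    """F1 over structure detection values: R = TP/(TP+FN), P = TP/(TP+FP), f = 2PR/(P+R)."""
    if tp == 0:
        return 0.0  # no shared structure: similarity is zero (also avoids division by zero)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * precision * recall / (precision + recall)
```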
3. The method for obtaining multilingual sentence similarity based on linear regression according to claim 1, characterized in that the two or more similar-feature values include the part-of-speech similar-feature value f_i of the two sentences, calculated as follows:
S121: parsing the two sentences and obtaining the two syntax trees corresponding to them;
S122: dividing the two sentences into a reference sentence and a source sentence, the reference sentence being used only when calculating this part-of-speech similar-feature value f_i, and the source sentence additionally being used to calculate part-of-speech similar-feature values f_i against other sentences; deriving, from the part-of-speech distributions of the two syntax trees, the minimum number of edit steps W required to transform one sentence into the other;
S123: calculating the part-of-speech similar-feature value f_i of the two sentences by the following formula:
f_i = 1 - W / L;
where W is the minimum number of edit steps required to transform one sentence into the other, L is the length of the reference sentence, and f_i is the part-of-speech similar-feature value.
4. The method for obtaining multilingual sentence similarity based on linear regression according to claim 3, characterized in that the minimum number of edit steps W required to transform one sentence into the other is computed using the Levenshtein distance.
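Claims 3 and 4 can be sketched together: the Levenshtein distance over the two part-of-speech tag sequences, normalized by the reference-sentence length. This is a minimal sketch under the assumption that the edit steps are counted over POS tags; the function names are illustrative, and the example uses the tag sequences of embodiment 9.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (x != y)))  # substitution (0 if tags match)
        prev = cur
    return prev[-1]

def pos_similarity(ref_tags, src_tags):
    """Claim 3: f_i = 1 - W / L, with L the length of the reference sentence."""
    w = levenshtein(ref_tags, src_tags)
    return 1 - w / len(ref_tags)

# POS sequences of <i> and <j> from embodiment 9
print(round(pos_similarity(["VV", "JJ", "NN"], ["AD", "VV", "NN"]), 2))  # 0.33
```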
5. The method for obtaining multilingual sentence similarity based on linear regression according to claim 1, characterized in that the two or more similar-feature values include the lexical similar-feature value f_i of the two sentences, calculated as follows:
S131: parsing the two sentences and obtaining the two syntax trees corresponding to them;
S132: deriving the lexical detection values TP', FP' and FN' between the two syntax trees;
S133: calculating the lexical similar-feature value f_i of the two sentences from the lexical detection values TP', FP' and FN' by the following formulas:
R' = TP' / (TP' + FN');
P' = TP' / (TP' + FP');
f_i = 2P'R' / (P' + R');
where TP' is the lexical true-positive count, FP' is the lexical false-positive count, FN' is the lexical false-negative count, R' is the lexical recall, P' is the lexical precision, and f_i is the lexical similar-feature value.
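Claim 5 mirrors claim 2 at the word level. One plausible way to obtain TP', FP' and FN' is from the overlap of the two sentences' word multisets; this counting is an assumption for illustration, as the patent defines the lexical detection values over the syntax trees.

```python
from collections import Counter

def lexical_similarity(words_a, words_b):
    """F1 over word overlap: TP' = shared words, FP'/FN' = words unique to one side."""
    ca = Counter(w.lower() for w in words_a)
    cb = Counter(w.lower() for w in words_b)
    tp = sum((ca & cb).values())   # multiset intersection: shared words
    if tp == 0:
        return 0.0
    fp = sum((cb - ca).values())   # words only in sentence b
    fn = sum((ca - cb).values())   # words only in sentence a
    r = tp / (tp + fn)
    p = tp / (tp + fp)
    return 2 * p * r / (p + r)
```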
6. A system for obtaining multilingual sentence similarity based on linear regression using the method of any one of claims 1 to 5, characterized by comprising:
an acquiring unit for obtaining two or more similar-feature values of two sentences;
a selecting unit for choosing feature weights according to the language and application field of the two sentences;
a linear-regression unit for combining the two or more similar-feature values with their corresponding feature weights by linear regression to obtain the composite similar-feature value of the two sentences.
7. The system for obtaining multilingual sentence similarity based on linear regression according to claim 6, characterized in that the two or more similar-feature values obtained by the acquiring unit include the structural similar-feature value, the part-of-speech similar-feature value, or the lexical similar-feature value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710187215.2A CN107066443A (en) | 2017-03-27 | 2017-03-27 | Multilingual sentence similarity acquisition methods and system are applied to based on linear regression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107066443A true CN107066443A (en) | 2017-08-18 |
Family
ID=59620196
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710187215.2A Pending CN107066443A (en) | 2017-03-27 | 2017-03-27 | Multilingual sentence similarity acquisition methods and system are applied to based on linear regression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107066443A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6810376B1 (en) * | 2000-07-11 | 2004-10-26 | Nusuara Technologies Sdn Bhd | System and methods for determining semantic similarity of sentences |
CN104424279A (en) * | 2013-08-30 | 2015-03-18 | 腾讯科技(深圳)有限公司 | Text relevance calculating method and device |
Non-Patent Citations (2)
Title |
---|
BING LIU (translated by Yu Yong et al.): "Web Data Mining", Tsinghua University Press, 30 April 2009 *
LI Qiuming et al.: "A similarity calculation model based on multiple sentence features", Software Guide *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108334493A (en) * | 2018-01-07 | 2018-07-27 | 深圳前海易维教育科技有限公司 | A kind of topic knowledge point extraction method based on neural network |
CN108334493B (en) * | 2018-01-07 | 2021-04-09 | 深圳前海易维教育科技有限公司 | Question knowledge point automatic extraction method based on neural network |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170818 |