CN107066443A - Linear-regression-based method and system for acquiring multilingual sentence similarity - Google Patents

Linear-regression-based method and system for acquiring multilingual sentence similarity

Info

Publication number
CN107066443A
CN107066443A (application CN201710187215.2A)
Authority
CN
China
Prior art keywords
sentence
similar features
value
features value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710187215.2A
Other languages
Chinese (zh)
Inventor
海同舟
李明
王兴强
彭成超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Excellent Translation Information Technology Co., Ltd.
Original Assignee
Chengdu Excellent Translation Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Excellent Translation Information Technology Co., Ltd.
Priority to CN201710187215.2A priority Critical patent/CN107066443A/en
Publication of CN107066443A publication Critical patent/CN107066443A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a linear-regression-based method for acquiring multilingual sentence similarity, comprising the following steps: obtaining two or more similarity feature values of two sentences; selecting the feature weight corresponding to each similarity feature value according to the language and application domain of the two sentences; and linearly combining the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences. The invention also discloses a linear-regression-based system for acquiring multilingual sentence similarity, comprising an acquiring unit, a selecting unit, and a linear regression unit. By applying weighted linear regression to the different similarity feature values of a sentence pair, the method and system of the invention are applicable to multiple languages and a variety of registers.

Description

Linear-regression-based method and system for acquiring multilingual sentence similarity
Technical field
The present invention relates to the technical field of machine translation, and in particular to a linear-regression-based method and system for acquiring multilingual sentence similarity.
Background technology
With the rapid development of economic globalization and the Internet, the translation of natural language plays an increasingly important role in political, economic, and cultural exchange. In the past, translating text between languages for international communication required human translators, which was time-consuming and labor-intensive. With the rapid advance of computer hardware, machine translation and computer-aided translation have become more and more widely used.
Comparing sentence similarity is an important topic in machine translation and computer-aided translation research. Traditional comparison methods mostly operate on a single level, such as vocabulary overlap, language-model overlap, or vectorizing the vocabulary and comparing cosine distances in a semantic space. Single-level sentence similarity comparison, however, cannot adapt to the characteristics of different languages; for example, a method suited to English is not suited to comparing Chinese. Likewise, single-level comparison cannot adapt to the characteristics of different registers; for example, a method suited to news English is not suited to comparing spoken English.
The invention patent CN201110303522.5, published by the State Intellectual Property Office, discloses a method and apparatus for computing sentence similarity and a machine translation method and apparatus. That patent compares sentence similarity by measuring vocabulary differences. The method works for languages and registers with pronounced lexical differences, but compares poorly for languages and registers where lexical differences are small and sentence-pattern differences are pronounced.
The content of the invention
The technical problem to be solved by the invention is that existing sentence similarity comparison methods are not suited to multiple languages and multiple registers. The purpose of the invention is to provide a linear-regression-based method and system for acquiring multilingual sentence similarity that solves the above problem.
The present invention is achieved through the following technical solutions:
The linear-regression-based method for acquiring multilingual sentence similarity comprises the following steps. S1: Obtain two or more similarity feature values f_i of two sentences, f_i including f_1, f_2, f_3, ..., f_n. S2: Select the feature weight ω_i corresponding to each similarity feature value according to the language and application domain of the two sentences, ω_i including ω_1, ω_2, ω_3, ..., ω_n. S3: Linearly combine the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value f_s of the two sentences. The linear regression formula is f_s = ω_1·f_1 + ω_2·f_2 + ... + ω_n·f_n, where f_i is a similarity feature value, ω_i is the feature weight corresponding to f_i, and f_s is the composite similarity feature value.
In the prior art, sentence similarity comparison uses a single level of comparison, and such single-level comparison methods cannot be adapted to the characteristics of different languages and registers; for example, a method suited to English is not suited to comparing Chinese, and a method suited to news English is not suited to comparing spoken English. When the invention is applied, two or more similarity feature values f_i (f_1, f_2, f_3, ..., f_n) of the two sentences are first obtained; the feature weight ω_i (ω_1, ω_2, ω_3, ..., ω_n) corresponding to each similarity feature value is then selected according to the language and application domain of the two sentences; and the two or more similarity feature values are linearly combined with their corresponding feature weights to obtain the composite similarity feature value f_s of the two sentences, using the formula f_s = ω_1·f_1 + ω_2·f_2 + ... + ω_n·f_n, where f_i is a similarity feature value, ω_i is the feature weight corresponding to f_i, and f_s is the composite similarity feature value. For the similarity feature values of the two sentences one may choose, but is not limited to, the structure similarity feature value, the part-of-speech similarity feature value, or the lexical similarity feature value; these three similarity feature values work well for comparing sentences in mainstream languages and registers. For a special language or register, such as Turkish, a root similarity feature value can additionally be included according to its characteristics.
The feature weight corresponding to each similarity feature value differs with the language and the application domain. For example, compared with English, German expresses semantics largely through case inflection, so syntactic constituents such as objects and adverbials can occupy very flexible positions within a sentence; the feature weight of the part-of-speech similarity feature value is therefore reduced and the feature weight of the structure similarity feature value increased. Likewise, compared with drama dialogue, news language mostly consists of long sentences with rigorous syntax, so every feature needs to be considered in a balanced way, whereas drama sentences tend to be short and colloquially ungrammatical, so the part-of-speech similarity feature weight is reduced for drama text. By applying weighted linear regression to the different similarity feature values of a sentence pair, the method of the invention is applicable to multiple languages and a variety of registers, as the sketch below illustrates.
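A minimal sketch of the weighted combination in step S3, assuming the feature values and their weights are supplied as plain Python lists; the function name composite_similarity is illustrative and not taken from the patent.

def composite_similarity(feature_values, feature_weights):
    """Weighted linear combination f_s = sum_i(w_i * f_i) of similarity features.

    feature_values  -- [f_1, ..., f_n], each similarity feature in [0, 1]
    feature_weights -- [w_1, ..., w_n], chosen per language and application domain
    """
    if len(feature_values) != len(feature_weights):
        raise ValueError("need exactly one weight per feature value")
    return sum(w * f for w, f in zip(feature_weights, feature_values))

# Example with the drama-style English weights of Embodiment 7:
# f_s = 0.5*1.0 + 0.1*0.5 + 0.4*0.63 = 0.802
print(composite_similarity([1.0, 0.5, 0.63], [0.5, 0.1, 0.4]))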
Further, the two or more similarity feature values include a structure similarity feature value f_i calculated for the two sentences. Its calculation steps are as follows. S111: Parse the two sentences and obtain the two syntax trees corresponding to them. S112: From the two syntax trees, derive the structure detection values TP, FP, and FN between them. S113: From TP, FP, and FN, calculate the structure similarity feature value f_i of the two sentences by the formulas R = TP / (TP + FN), P = TP / (TP + FP), and f_i = 2·P·R / (P + R), where TP is the structure true-positive count, FP is the structure false-positive count, FN is the structure false-negative count, R is the structure recall, P is the structure precision, and f_i is the structure similarity feature value.
When the invention is applied, the two sentences to be compared are first parsed to obtain their two corresponding syntax trees; the structure detection values TP, FP, and FN between the two syntax trees are then derived; and the structure similarity feature value f_i of the two sentences is calculated from TP, FP, and FN by the formulas above. This algorithm for the structure similarity feature value captures the structural characteristics of a sentence well; a sketch follows.
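The patent does not spell out how TP, FP, and FN are matched between the two trees. The worked example in Embodiment 2 is consistent with counting overlapping constituent labels as multisets, which is the assumption made in the sketch below; the function name structure_f1 and the multiset matching rule are therefore illustrative rather than prescribed by the text.

from collections import Counter

def structure_f1(labels_ref, labels_cand):
    """F1-style structural similarity between two constituent-label sequences."""
    ref, cand = Counter(labels_ref), Counter(labels_cand)
    tp = sum((ref & cand).values())   # labels present in both trees
    fp = sum((cand - ref).values())   # extra labels in the candidate tree
    fn = sum((ref - cand).values())   # labels missing from the candidate tree
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)                # structure precision
    r = tp / (tp + fn)                # structure recall
    return 2 * p * r / (p + r)

# Embodiment 2: both trees carry the labels {S, PP, NP, VP, NP, PP, NP},
# so TP = 7, FP = FN = 0 and the structure feature value is 1.0.
print(structure_f1("S PP NP VP NP PP NP".split(), "S NP VP PP PP NP NP".split()))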
Further, the two or more similarity feature values include a part-of-speech similarity feature value f_i calculated from the parts of speech of the two sentences. Its calculation steps are as follows. S121: Parse the two sentences and obtain the two syntax trees corresponding to them. S122: Divide the two sentences into a reference sentence and an original sentence; the reference sentence is used only in this calculation of the part-of-speech similarity feature value f_i, while the original sentence still has to have its part-of-speech similarity feature value f_i calculated against other sentences as well. From the part-of-speech distributions of the two syntax trees, derive the minimum number of edit steps W required to transform one sentence into the other. S123: Calculate the part-of-speech similarity feature value f_i of the two sentences by the formula f_i = 1 - W / L, where W is the minimum number of edit steps required to transform one sentence into the other, L is the length of the reference sentence, and f_i is the part-of-speech similarity feature value.
Further, the minimum number of edit steps W required to transform one sentence into the other is the Levenshtein distance.
When the invention is applied, the two sentences are first parsed to obtain their two corresponding syntax trees; the two sentences are divided into a reference sentence and an original sentence (the reference sentence is used only in this calculation of the part-of-speech similarity feature value f_i, while the original sentence still has to be compared against other sentences); the minimum number of edit steps W required to transform one sentence into the other is derived from the part-of-speech distributions of the two syntax trees; and the part-of-speech similarity feature value f_i of the two sentences is calculated by the formula f_i = 1 - W / L, where L is the length of the reference sentence. W is computed as the Levenshtein distance. In sentence similarity comparison it is usually necessary to compare one sentence against several others; that sentence is called the original sentence, and the several sentences are called reference sentences. This algorithm for the part-of-speech similarity feature value captures the part-of-speech characteristics of a sentence well; a sketch follows.
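A sketch of the part-of-speech feature under the assumption that W is the ordinary Levenshtein distance over the two part-of-speech tag sequences and L is the tag count of the reference sentence; the function names are illustrative. On the tag sequences of Embodiment 3 it reproduces W = 6 and f ≈ 0.143.

def levenshtein(seq_a, seq_b):
    """Minimum number of insertions, deletions, and substitutions turning seq_a into seq_b."""
    prev = list(range(len(seq_b) + 1))
    for i, x in enumerate(seq_a, 1):
        curr = [i]
        for j, y in enumerate(seq_b, 1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute (free on a match)
        prev = curr
    return prev[-1]

def pos_similarity(pos_ref, pos_orig):
    """f = 1 - W / L, with W the edit distance and L the reference-sentence length."""
    return 1 - levenshtein(pos_ref, pos_orig) / len(pos_ref)

# Embodiment 3: W = 6 between the two tag sequences, L = 7, so f = 1 - 6/7 ≈ 0.143
ref_tags  = "IN DT NN PRP VBZ IN NN".split()
orig_tags = "PRP VBD IN NN IN DT NN".split()
print(pos_similarity(ref_tags, orig_tags))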
Further, the two or more similarity feature values include a lexical similarity feature value f_i calculated from the vocabulary of the two sentences. Its calculation steps are as follows. S131: Parse the two sentences and obtain the two syntax trees corresponding to them. S132: From the two syntax trees, derive the lexical detection values TP', FP', and FN' between them. S133: From TP', FP', and FN', calculate the lexical similarity feature value f_i of the two sentences by the formulas R' = TP' / (TP' + FN'), P' = TP' / (TP' + FP'), and f_i = 2·P'·R' / (P' + R'), where TP' is the lexical true-positive count, FP' is the lexical false-positive count, FN' is the lexical false-negative count, R' is the lexical recall, P' is the lexical precision, and f_i is the lexical similarity feature value.
When the invention is applied, the two sentences are first parsed to obtain their two corresponding syntax trees; the lexical detection values TP', FP', and FN' between the two syntax trees are then derived; and the lexical similarity feature value f_i of the two sentences is calculated from TP', FP', and FN' by the formulas above. This algorithm for the lexical similarity feature value captures the lexical characteristics of a sentence well; a sketch follows.
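A sketch of the lexical feature, assuming TP', FP', and FN' are multiset overlap counts over case-folded word tokens (the patent leaves the exact matching rule implicit); with that assumption it reproduces the numbers of Embodiment 4. The function name lexical_f1 is illustrative.

from collections import Counter

def lexical_f1(tokens_a, tokens_b):
    """F1-style lexical similarity over case-folded word multisets."""
    a = Counter(t.lower() for t in tokens_a)
    b = Counter(t.lower() for t in tokens_b)
    tp = sum((a & b).values())   # word occurrences shared by both sentences
    fp = sum((b - a).values())   # occurrences only in sentence b
    fn = sum((a - b).values())   # occurrences only in sentence a
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# Embodiment 4: TP' = 5, FP' = FN' = 2, so P' = R' = 5/7 and f ≈ 0.714
sent_a = "For the moment she works as doctor".split()
sent_b = "She worked as doctor for that moment".split()
print(round(lexical_f1(sent_a, sent_b), 3))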
The linear-regression-based system for acquiring multilingual sentence similarity comprises: an acquiring unit for obtaining two or more similarity feature values of two sentences; a selecting unit for selecting feature weights according to the language and application domain of the two sentences; and a linear regression unit for linearly combining the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences.
Further, the two or more similarity feature values obtained by the acquiring unit include a structure similarity feature value, a part-of-speech similarity feature value, or a lexical similarity feature value.
In the prior art, sentence similarity comparison uses a single level of comparison, and such single-level comparison methods cannot be adapted to the characteristics of different languages and registers; for example, a method suited to English is not suited to comparing Chinese, and a method suited to news English is not suited to comparing spoken English. When the invention is applied, the acquiring unit first obtains two or more similarity feature values of the two sentences, the selecting unit selects the feature weights according to the language and application domain of the two sentences, and the linear regression unit linearly combines the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences. For the similarity feature values of the two sentences one may choose, but is not limited to, the structure, part-of-speech, or lexical similarity feature value; these three work well for mainstream languages and registers, and for a special language or register, such as Turkish, a root similarity feature value can additionally be included according to its characteristics. By applying weighted linear regression to the different similarity feature values of a sentence pair, the system of the invention is applicable to multiple languages and a variety of registers. An illustrative sketch of the three units follows.
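An illustrative sketch of the three units as Python classes; the class and method names, and the idea of a (language, domain) weight table, are assumptions made for illustration rather than details taken from the patent.

class AcquiringUnit:
    """Obtains two or more similarity feature values for a sentence pair."""
    def __init__(self, feature_fns):
        self.feature_fns = feature_fns            # e.g. structure, POS, and lexical scorers
    def acquire(self, sent_a, sent_b):
        return [fn(sent_a, sent_b) for fn in self.feature_fns]

class SelectingUnit:
    """Selects feature weights according to language and application domain."""
    def __init__(self, weight_table):
        self.weight_table = weight_table          # maps (language, domain) -> weight vector
    def select(self, language, domain):
        return self.weight_table[(language, domain)]

class LinearRegressionUnit:
    """Linearly combines feature values with their weights into a composite similarity."""
    @staticmethod
    def combine(values, weights):
        return sum(w * f for w, f in zip(weights, values))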
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The linear-regression-based method for acquiring multilingual sentence similarity applies weighted linear regression to the different similarity feature values of a sentence pair, making the method applicable to multiple languages and a variety of registers;
2. The method uses an algorithm for the structure similarity feature value that captures the structural characteristics of a sentence well;
3. The method uses an algorithm for the part-of-speech similarity feature value that captures the part-of-speech characteristics of a sentence well;
4. The method uses an algorithm for the lexical similarity feature value that captures the lexical characteristics of a sentence well;
5. The linear-regression-based system for acquiring multilingual sentence similarity applies weighted linear regression to the different similarity feature values of a sentence pair, making the system applicable to multiple languages and a variety of registers.
Brief description of the drawings
The accompanying drawings described herein are provided for a further understanding of the embodiments of the present invention and constitute a part of the application; they do not limit the embodiments of the present invention. In the accompanying drawings:
Fig. 1 is a schematic diagram of the method of the invention;
Fig. 2 is a schematic diagram of the system of the invention;
Fig. 3 is a schematic diagram of the calculation steps for the structure similarity feature value of the invention;
Fig. 4 is a schematic diagram of the calculation steps for the part-of-speech similarity feature value of the invention;
Fig. 5 is a schematic diagram of the calculation steps for the lexical similarity feature value of the invention;
Fig. 6 is a syntax tree schematic diagram for Embodiments 2, 3, and 4 of the invention;
Fig. 7 is a syntax tree schematic diagram for Embodiments 2, 3, and 4 of the invention;
Fig. 8 is a syntax tree schematic diagram for Embodiment 8 of the invention;
Fig. 9 is a syntax tree schematic diagram for Embodiment 8 of the invention;
Fig. 10 is a syntax tree schematic diagram for Embodiment 9 of the invention;
Fig. 11 is a syntax tree schematic diagram for Embodiment 9 of the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and accompanying drawings. The exemplary embodiments of the invention and their description are only used to explain the present invention and are not intended as limitations of the invention.
Embodiment 1
As shown in Fig. 1, the linear-regression-based method of the invention for acquiring multilingual sentence similarity is characterized in that it comprises the following steps. S1: Obtain two or more similarity feature values f_i of two sentences, f_i including f_1, f_2, f_3, ..., f_n. S2: Select the feature weight ω_i corresponding to each similarity feature value according to the language and application domain of the two sentences, ω_i including ω_1, ω_2, ω_3, ..., ω_n. S3: Linearly combine the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value f_s of the two sentences. The linear regression formula is f_s = ω_1·f_1 + ω_2·f_2 + ... + ω_n·f_n, where f_i is a similarity feature value, ω_i is the feature weight corresponding to f_i, and f_s is the composite similarity feature value.
When this embodiment is implemented, two or more similarity feature values f_i (f_1, f_2, f_3, ..., f_n) of the two sentences are first obtained; the feature weight ω_i (ω_1, ω_2, ω_3, ..., ω_n) corresponding to each similarity feature value is then selected according to the language and application domain of the two sentences; and the two or more similarity feature values are linearly combined with their corresponding feature weights by the formula f_s = ω_1·f_1 + ω_2·f_2 + ... + ω_n·f_n to obtain the composite similarity feature value f_s of the two sentences.
Embodiment 2
As shown in Fig. 3, Fig. 6, and Fig. 7, this embodiment builds on Embodiment 1. The two or more similarity feature values include the structure similarity feature value f_1 calculated for the two sentences. Its calculation steps are as follows: parse the two sentences and obtain the two syntax trees corresponding to them; derive the structure detection values TP, FP, and FN between the two syntax trees; and calculate the structure similarity feature value f_1 of the two sentences from TP, FP, and FN by the formulas R = TP / (TP + FN), P = TP / (TP + FP), and f_1 = 2·P·R / (P + R), where TP is the structure true-positive count, FP is the structure false-positive count, FN is the structure false-negative count, R is the structure recall, P is the structure precision, and f_1 is the structure similarity feature value.
When this embodiment is implemented, the language is English and the two sentences are:
<a>:For the moment she works as doctor.
<b>:She worked as doctor for that moment.
First, the two sentences are parsed to obtain two syntax trees, shown in Fig. 6 and Fig. 7, where:
Structural features of <a>: S, PP, NP, VP, NP, PP, NP
Structural features of <b>: S, NP, VP, PP, PP, NP, NP
The structure detection values TP, FP, and FN between the two syntax trees are then derived:
TP = 7, FP = 0, FN = 0;
Substituting TP, FP, and FN into R = TP / (TP + FN) and P = TP / (TP + FP)
gives P = 1 and R = 1;
Substituting P and R into f_1 = 2·P·R / (P + R)
gives f_1 = 1.
Embodiment 3
As shown in Fig. 4, Fig. 6, and Fig. 7, this embodiment builds on Embodiment 1. The two or more similarity feature values include the part-of-speech similarity feature value f_2 calculated from the parts of speech of the two sentences. Its calculation steps are as follows: parse the two sentences and obtain the two syntax trees corresponding to them; divide the two sentences into a reference sentence and an original sentence, where the reference sentence is used only in this calculation of the part-of-speech similarity feature value f_2 and the original sentence still has to have its part-of-speech similarity feature value f_2 calculated against other sentences; from the part-of-speech distributions of the two syntax trees, derive the minimum number of edit steps W required to transform one sentence into the other; and calculate the part-of-speech similarity feature value f_2 of the two sentences by the formula f_2 = 1 - W / L, where W is the minimum number of edit steps required to transform one sentence into the other and L is the length of the reference sentence. W is computed as the Levenshtein distance.
When this embodiment is implemented, the language is English and the two sentences are:
<a>:For the moment she works as doctor.
<b>:She worked as doctor for that moment.
First, the two sentences are parsed to obtain two syntax trees, shown in Fig. 6 and Fig. 7, where:
Part-of-speech features of <a>: IN, DT, NN, PRP, VBZ, IN, NN
Part-of-speech features of <b>: PRP, VBD, IN, NN, IN, DT, NN
The minimum number of edit steps W required to transform one sentence into the other is computed as the Levenshtein distance: W = 6;
Taking <a> as the reference sentence and <b> as the original sentence, the reference sentence length is L = 7;
Substituting L and W into f_2 = 1 - W / L
gives f_2 = 0.143.
Embodiment 4
As shown in Fig. 5, Fig. 6, and Fig. 7, this embodiment builds on Embodiment 1. The two or more similarity feature values include the lexical similarity feature value f_3 calculated from the vocabulary of the two sentences. Its calculation steps are as follows: parse the two sentences and obtain the two syntax trees corresponding to them; derive the lexical detection values TP', FP', and FN' between the two syntax trees; and calculate the lexical similarity feature value f_3 of the two sentences from TP', FP', and FN' by the formulas R' = TP' / (TP' + FN'), P' = TP' / (TP' + FP'), and f_3 = 2·P'·R' / (P' + R'), where TP' is the lexical true-positive count, FP' is the lexical false-positive count, FN' is the lexical false-negative count, R' is the lexical recall, P' is the lexical precision, and f_3 is the lexical similarity feature value.
When this embodiment is implemented, the language is English and the two sentences are:
<a>:For the moment she works as doctor.
<b>:She worked as doctor for that moment.
First, the two sentences are parsed to obtain two syntax trees, shown in Fig. 6 and Fig. 7, where:
Lexical features of <a>: For, the, moment, she, works, as, doctor
Lexical features of <b>: She, worked, as, doctor, for, that, moment
The lexical detection values TP', FP', and FN' between the two syntax trees are then derived:
TP' = 5, FP' = 2, FN' = 2;
Substituting TP', FP', and FN' into R' = TP' / (TP' + FN') and P' = TP' / (TP' + FP')
gives P' = 5/7 and R' = 5/7;
Substituting P' and R' into f_3 = 2·P'·R' / (P' + R')
gives f_3 = 0.714.
Embodiment 5
As shown in Fig. 1, Fig. 6, and Fig. 7, this embodiment builds on Embodiments 1 to 4. Based on the characteristics of English, the feature weight corresponding to f_1 is chosen as ω_1 = 0.5, the feature weight corresponding to f_2 as ω_2 = 0.1, and the feature weight corresponding to f_3 as ω_3 = 0.4; the structure similarity feature value is f_1 = 1, the part-of-speech similarity feature value f_2 = 0.143, and the lexical similarity feature value f_3 = 0.714;
When this embodiment is implemented, f_1, f_2, f_3, ω_1, ω_2, and ω_3 are substituted into the formula f_s = ω_1·f_1 + ω_2·f_2 + ω_3·f_3:
f_s = f_1·ω_1 + f_2·ω_2 + f_3·ω_3 = 1 × 0.5 + 0.143 × 0.1 + 0.714 × 0.4 ≈ 0.80.
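For reference, a quick arithmetic check of this combination with the feature values from Embodiments 2 to 4 and the weights chosen above; the dictionary keys are illustrative labels, not terms from the patent.

# Weighted combination of the three feature values from Embodiments 2 to 4
# with the English weights chosen in Embodiment 5.
weights = {"structure": 0.5, "pos": 0.1, "lexical": 0.4}
values  = {"structure": 1.0, "pos": 0.143, "lexical": 0.714}

f_s = sum(weights[k] * values[k] for k in weights)
print(round(f_s, 2))   # 0.8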
Embodiment 6
As shown in Fig. 1, this embodiment builds on Embodiments 1 to 4 and analyzes English news text. Text of the news type mostly consists of long sentences with rigorous syntactic structure, so every feature needs to be considered in a balanced way.
When this embodiment is implemented, the two sentences are:
<c>The stock posted a multi-year low of under $10 in early 2009 before roaring to a recent high of nearly $75 in April 2014.
<d>The stock will post a multi-year low of under $10 in early 2016 before roaring to a recent high of nearly $75 in April 2014.
First, the two sentences are parsed to obtain two syntax trees, where:
Structural features of <c>: ROOT, IP, NP, DP, NP, VP, IP, NP, QP, NP, VP, PP, PP, IP, NP, VP, NP, PP, NP, NP, QP, NP, PP, NP, QP, NP, VP, PP, QP, NP, QP, PP, QP, NP, QP
Structural features of <d>: ROOT, IP, NP, DP, NP, VP, IP, NP, QP, NP, VP, PP, PP, IP, NP, VP, NP, PP, NP, NP, QP, NP, PP, NP, QP, NP, VP, PP, QP, NP, QP, PP, QP, NP, QP
Part-of-speech features of <c>: DT, NN, VV, CD, NN, VV, P, NN, NN, NT, P, NR, CD, NN, NN, P, CD, NN, VA, P, NR, NN, CD, P, NR, CD, PU
Part-of-speech features of <d>: DT, NN, NN, VV, CD, NN, VV, P, NN, NN, NT, P, NR, CD, NN, NN, P, CD, NN, VA, P, NR, NN, CD, P, NR, CD, PU
Lexical features of <c>: The, stock, posted, a, multi-year, low, of, under, $, 10, in, early, 2009, before, roaring, to, a, recent, high, of, nearly, $, 75, in, April, 2014
Lexical features of <d>: The, stock, will, post, a, multi-year, low, of, under, $, 10, in, early, 2016, before, roaring, to, a, recent, high, of, nearly, $, 75, in, April, 2014
Following the calculations of Embodiments 2 to 4, we obtain:
f_1 = 1.0, f_2 = 0.96, f_3 = 0.91;
Given the characteristics of news text, every feature needs to be considered in a balanced way, so the feature weight corresponding to f_1 is chosen as ω_1 = 0.35, that corresponding to f_2 as ω_2 = 0.35, and that corresponding to f_3 as ω_3 = 0.3;
Substituting into the formula of Embodiment 1, f_s = ω_1·f_1 + ω_2·f_2 + ω_3·f_3, gives:
f_s = f_1·ω_1 + f_2·ω_2 + f_3·ω_3 = 1.0 × 0.35 + 0.96 × 0.35 + 0.91 × 0.3 ≈ 0.96.
Embodiment 7
As shown in Fig. 1, this embodiment builds on Embodiments 1 to 4 and analyzes English drama text. Drama sentences tend to be short and colloquially ungrammatical, so the weight of the part-of-speech feature needs to be reduced;
When this embodiment is implemented, the two sentences are:
<e>:Louis,you're a good writer.
<f>:You're a good investigator.maybe...
First, the two sentences are parsed to obtain two syntax trees, where:
Structural features of <e>: ROOT, IP, NP, NP, VP, NP, QP, NP
Structural features of <f>: ROOT, IP, IP, NP, VP, NP, QP, NP, IP, VP, ADVP, VP
Part-of-speech features of <e>: NR, PU, PN, VV, CD, NN, NN, PU
Part-of-speech features of <f>: PN, VV, CD, NN, NN, PU, AD, VV
Lexical features of <e>: Louis, you, 're, a, good, writer
Lexical features of <f>: You, 're, a, good, investigator, maybe
Following the calculations of Embodiments 2 to 4, we obtain:
f_1 = 1.0, f_2 = 0.5, f_3 = 0.63;
Given the characteristics of drama text, the weight of the part-of-speech feature is reduced, so the feature weight corresponding to f_1 is chosen as ω_1 = 0.5, that corresponding to f_2 as ω_2 = 0.1, and that corresponding to f_3 as ω_3 = 0.4;
Substituting into the formula of Embodiment 1, f_s = ω_1·f_1 + ω_2·f_2 + ω_3·f_3, gives:
f_s = f_1·ω_1 + f_2·ω_2 + f_3·ω_3 = 1.0 × 0.5 + 0.5 × 0.1 + 0.63 × 0.4 ≈ 0.80.
Embodiment 8
As shown in Fig. 1, Fig. 8, and Fig. 9, this embodiment builds on Embodiments 1 to 4 and analyzes German. Because German expresses semantics completely through case inflection, syntactic constituents such as objects and adverbials can occupy very flexible positions within a sentence; in this case the weight of the part-of-speech feature needs to be reduced and the weight of the structural feature increased;
When this embodiment is implemented, the two sentences are:
<g>Er kaufte ein Buch in einer Buchhandlung
<h>Er in einer Buchhandlung ein Buch kaufte
First, the two sentences are parsed to obtain two syntax trees, where:
Structural features of <g>: ROOT, IP, NP, VP, PP, NP, VP
Structural features of <h>: ROOT, IP, NP, VP, PP, NP, VP, NP
Part-of-speech features of <g>: NR, NN, NN, NN, P, NN, VV
Part-of-speech features of <h>: NR, P, NN, NN, VV, NN, NN
Lexical features of <g>: Er, kaufte, ein, buch, in, einer, buchhandlung
Lexical features of <h>: Er, in, einer, buchhandlung, ein, buch, kaufte
Following the calculations of Embodiments 2 to 4, we obtain:
f_1 = 0.93, f_2 = 0.57, f_3 = 1;
Given the characteristics of German, the weight of the part-of-speech feature is reduced and the weight of the structural feature increased, so the feature weight corresponding to f_1 is chosen as ω_1 = 0.5, that corresponding to f_2 as ω_2 = 0.1, and that corresponding to f_3 as ω_3 = 0.4;
Substituting into the formula of Embodiment 1, f_s = ω_1·f_1 + ω_2·f_2 + ω_3·f_3, gives:
f_s = f_1·ω_1 + f_2·ω_2 + f_3·ω_3 = 0.93 × 0.5 + 0.57 × 0.1 + 1 × 0.4 = 0.922.
Embodiment 9
As shown in Fig. 1, Fig. 10, and Fig. 11, this embodiment builds on Embodiments 1 to 4 and analyzes Chinese. Because the meaning of a Chinese sentence is expressed not through inflection but through word order, and the position in which a word is placed carries grammatical meaning, the weight of the part-of-speech feature needs to be increased for this language.
When this embodiment is implemented, the two sentences are:
<i>Get dressed
<j>Wear the clothes well
First, the two sentences are parsed to obtain two syntax trees, where:
Structural features of <i>: ROOT, IP, VP, VV, NP, ADJP, JJ, NP, NN
Structural features of <j>: ROOT, IP, VP, NP, ADJP, NP
Part-of-speech features of <i>: VV, JJ, NN
Part-of-speech features of <j>: AD, VV, NN
Lexical features of <i>: wear, good, clothes
Lexical features of <j>: good, wear, clothes
Following the calculations of Embodiments 2 to 4, we obtain:
f_1 = 0.83, f_2 = 0.33, f_3 = 1;
Given the characteristics of Chinese, the weight of the part-of-speech feature is increased, so the feature weight corresponding to f_1 is chosen as ω_1 = 0.3, that corresponding to f_2 as ω_2 = 0.4, and that corresponding to f_3 as ω_3 = 0.3;
Substituting into the formula of Embodiment 1, f_s = ω_1·f_1 + ω_2·f_2 + ω_3·f_3, gives:
f_s = f_1·ω_1 + f_2·ω_2 + f_3·ω_3 = 0.83 × 0.3 + 0.33 × 0.4 + 1.0 × 0.3 ≈ 0.68.
Embodiment 10
As shown in Fig. 2, the linear-regression-based system of the invention for acquiring multilingual sentence similarity comprises: an acquiring unit for obtaining two or more similarity feature values of two sentences; a selecting unit for selecting feature weights according to the language and application domain of the two sentences; and a linear regression unit for linearly combining the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences.
When this embodiment is implemented, the acquiring unit first obtains two or more similarity feature values of the two sentences, the selecting unit selects the feature weights according to the language and application domain of the two sentences, and the linear regression unit linearly combines the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences.
The above embodiments further describe the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the invention shall be included within the scope of protection of the present invention.

Claims (7)

1. A linear-regression-based method for acquiring multilingual sentence similarity, characterized in that it comprises the following steps:
S1: obtaining two or more similarity feature values f_i of two sentences, the f_i including f_1, f_2, f_3, ..., f_n;
S2: selecting the feature weight ω_i corresponding to each similarity feature value according to the language and application domain of the two sentences, the ω_i including ω_1, ω_2, ω_3, ..., ω_n;
S3: linearly combining the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value f_s of the two sentences;
the linear regression formula being as follows:
f_s = ω_1·f_1 + ω_2·f_2 + ... + ω_n·f_n;
where f_i is a similarity feature value, ω_i is the feature weight corresponding to f_i, and f_s is the composite similarity feature value.
2. The linear-regression-based method for acquiring multilingual sentence similarity according to claim 1, characterized in that the two or more similarity feature values include a structure similarity feature value f_i calculated for the two sentences, its calculation steps being as follows:
S111: parsing the two sentences and obtaining the two syntax trees corresponding to them;
S112: deriving the structure detection values TP, FP, and FN between the two syntax trees from the two syntax trees;
S113: calculating the structure similarity feature value f_i of the two sentences from the structure detection values TP, FP, and FN by the following formulas:
R = TP / (TP + FN); P = TP / (TP + FP); f_i = 2·P·R / (P + R);
where TP is the structure true-positive count, FP is the structure false-positive count, FN is the structure false-negative count, R is the structure recall, P is the structure precision, and f_i is the structure similarity feature value.
3. The linear-regression-based method for acquiring multilingual sentence similarity according to claim 1, characterized in that the two or more similarity feature values include a part-of-speech similarity feature value f_i calculated from the parts of speech of the two sentences, its calculation steps being as follows:
S121: parsing the two sentences and obtaining the two syntax trees corresponding to them;
S122: dividing the two sentences into a reference sentence and an original sentence, the reference sentence being the sentence used only in this calculation of the part-of-speech similarity feature value f_i and the original sentence being the sentence that, in addition, still has to have its part-of-speech similarity feature value f_i calculated against other sentences; and deriving, from the part-of-speech distributions of the two syntax trees, the minimum number of edit steps W required to transform one sentence into the other;
S123: calculating the part-of-speech similarity feature value f_i of the two sentences by the following formula:
f_i = 1 - W / L;
where W is the minimum number of edit steps required to transform one sentence into the other, L is the length of the reference sentence, and f_i is the part-of-speech similarity feature value.
4. The linear-regression-based method for acquiring multilingual sentence similarity according to claim 3, characterized in that the minimum number of edit steps W required to transform one sentence into the other is the Levenshtein distance.
5. The linear-regression-based method for acquiring multilingual sentence similarity according to claim 1, characterized in that the two or more similarity feature values include a lexical similarity feature value f_i calculated for the two sentences, its calculation steps being as follows:
S131: parsing the two sentences and obtaining the two syntax trees corresponding to them;
S132: deriving the lexical detection values TP', FP', and FN' between the two syntax trees from the two syntax trees;
S133: calculating the lexical similarity feature value f_i of the two sentences from the lexical detection values TP', FP', and FN' by the following formulas:
R' = TP' / (TP' + FN'); P' = TP' / (TP' + FP'); f_i = 2·P'·R' / (P' + R');
where TP' is the lexical true-positive count, FP' is the lexical false-positive count, FN' is the lexical false-negative count, R' is the lexical recall, P' is the lexical precision, and f_i is the lexical similarity feature value.
6. A linear-regression-based system for acquiring multilingual sentence similarity using the method of any one of claims 1 to 5, characterized in that it comprises:
an acquiring unit for obtaining two or more similarity feature values of two sentences;
a selecting unit for selecting feature weights according to the language and application domain of the two sentences;
a linear regression unit for linearly combining the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences.
7. The linear-regression-based system for acquiring multilingual sentence similarity according to claim 6, characterized in that the two or more similarity feature values obtained by the acquiring unit include a structure similarity feature value, a part-of-speech similarity feature value, or a lexical similarity feature value.
CN201710187215.2A 2017-03-27 2017-03-27 Linear-regression-based method and system for acquiring multilingual sentence similarity Pending CN107066443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710187215.2A CN107066443A (en) 2017-03-27 2017-03-27 Linear-regression-based method and system for acquiring multilingual sentence similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710187215.2A CN107066443A (en) 2017-03-27 2017-03-27 Linear-regression-based method and system for acquiring multilingual sentence similarity

Publications (1)

Publication Number Publication Date
CN107066443A true CN107066443A (en) 2017-08-18

Family

ID=59620196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710187215.2A Pending CN107066443A (en) 2017-03-27 2017-03-27 Linear-regression-based method and system for acquiring multilingual sentence similarity

Country Status (1)

Country Link
CN (1) CN107066443A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334493A (en) * 2018-01-07 2018-07-27 深圳前海易维教育科技有限公司 A kind of topic knowledge point extraction method based on neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6810376B1 (en) * 2000-07-11 2004-10-26 Nusuara Technologies Sdn Bhd System and methods for determining semantic similarity of sentences
CN104424279A (en) * 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance calculating method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6810376B1 (en) * 2000-07-11 2004-10-26 Nusuara Technologies Sdn Bhd System and methods for determining semantic similarity of sentences
CN104424279A (en) * 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance calculating method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BING LIU (author), 俞勇 et al. (translators): "Web Data Mining" (《Web数据挖掘》), Tsinghua University Press (清华大学出版社), 30 April 2009 *
李秋明 et al.: "A sentence similarity computation model based on multiple sentence features" (基于句子多种特征的相似度计算模型), Software Guide (《软件导刊》) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334493A (en) * 2018-01-07 2018-07-27 深圳前海易维教育科技有限公司 A kind of topic knowledge point extraction method based on neural network
CN108334493B (en) * 2018-01-07 2021-04-09 深圳前海易维教育科技有限公司 Question knowledge point automatic extraction method based on neural network

Similar Documents

Publication Publication Date Title
Taraldsen The scope of wh movement in Norwegian
CN104933027B (en) A kind of open Chinese entity relation extraction method of utilization dependency analysis
CN109146610A (en) It is a kind of intelligently to insure recommended method, device and intelligence insurance robot device
CN104881402A (en) Method and device for analyzing semantic orientation of Chinese network topic comment text
Snover et al. Language and translation model adaptation using comparable corpora
DE112013005742T5 (en) Intention estimation device and intention estimation method
CN105573994A (en) Statistic machine translation system based on syntax framework
CN106502987B (en) A kind of method and apparatus that the sentence template based on seed sentence is recalled
CN103020045A (en) Statistical machine translation method based on predicate argument structure (PAS)
CN103336803B (en) A kind of computer generating method of embedding name new Year scroll
CN109145286A (en) Based on BiLSTM-CRF neural network model and merge the Noun Phrase Recognition Methods of Vietnamese language feature
CN101763403A (en) Query translation method facing multi-lingual information retrieval system
CN107038155A (en) The extracting method of text feature is realized based on improved small-world network model
CN107066443A (en) Linear-regression-based method and system for acquiring multilingual sentence similarity
Dragomirescu et al. Syntactic archaisms preserved in a contemporary romance variety: Interpolation and scrambling in old Romanian and Istro-Romanian
Galinsky et al. Improving neural network models for natural language processing in russian with synonyms
Bungum et al. A survey of domain adaptation in machine translation: Towards a refinement of domain space
Chambers et al. Stochastic language generation in a dialogue system: Toward a domain independent generator
Muntarina et al. Tense based english to bangla translation using mt system
Vogel Remarks on the architecture of optimality theoretic syntax grammars
Grandy Some remarks about logical form
CN103294662B (en) Match judging apparatus and consistance determination methods
Griscom The diachronic developments of KI constructions in the Luo and Koman families
Declerck et al. How to semantically relate dialectal Dictionaries in the Linked Data Framework
Yemelianova et al. Variability of phraseological units in the British and American English and methods of their translation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20170818