CN107066443A - Linear-regression-based method and system for acquiring multilingual sentence similarity - Google Patents

Linear-regression-based method and system for acquiring multilingual sentence similarity

Info

Publication number
CN107066443A
CN107066443A (application CN201710187215.2A)
Authority
CN
China
Prior art keywords
sentence
similar features
value
features value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710187215.2A
Other languages
Chinese (zh)
Inventor
海同舟
李明
王兴强
彭成超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Excellent Translation Information Technology Co., Ltd.
Original Assignee
Chengdu Excellent Translation Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Excellent Translation Information Technology Co., Ltd.
Priority to CN201710187215.2A priority Critical patent/CN107066443A/en
Publication of CN107066443A publication Critical patent/CN107066443A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a linear-regression-based method for acquiring multilingual sentence similarity, comprising the following steps: obtaining two or more similarity feature values of two sentences; selecting the feature weight corresponding to each similarity feature value according to the language and application domain of the two sentences; and linearly combining the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences. The invention also discloses a linear-regression-based system for acquiring multilingual sentence similarity, comprising an acquiring unit, a selecting unit, and a linear regression unit. By applying weighted linear regression to the different similarity feature values of a sentence pair, the method and system of the invention are applicable to multiple languages and a variety of registers.

Description

Linear-regression-based method and system for acquiring multilingual sentence similarity
Technical field
The present invention relates to the technical field of machine translation, and in particular to a linear-regression-based method and system for acquiring multilingual sentence similarity.
Background technology
With the rapid development of economic globalization and the Internet, the translation of natural language plays an increasingly important role in political, economic, and cultural exchange. In the past, translating text between languages for international communication required human translators, which was time-consuming and labor-intensive. With the rapid advance of computer hardware, machine translation and computer-aided translation have become more and more widely used.
Comparing sentence similarity is an important topic in machine translation and computer-aided translation research. Traditional comparison methods mostly operate on a single level, such as vocabulary overlap, language-model overlap, or vectorizing the vocabulary and comparing cosine distances in a semantic space. Single-level sentence similarity comparison, however, cannot adapt to the characteristics of different languages; for example, a method suited to English is not suited to comparing Chinese. Likewise, single-level comparison cannot adapt to the characteristics of different registers; for example, a method suited to news English is not suited to comparing spoken English.
The invention patent CN201110303522.5, published by the State Intellectual Property Office, discloses a method and apparatus for computing sentence similarity and a machine translation method and apparatus. That patent compares sentence similarity by measuring vocabulary differences. The method works for languages and registers with pronounced lexical differences, but compares poorly for languages and registers where lexical differences are small and sentence-pattern differences are pronounced.
The content of the invention
The technical problem to be solved by the invention is that existing sentence similarity comparison methods are not suited to multiple languages and multiple registers. The purpose of the invention is to provide a linear-regression-based method and system for acquiring multilingual sentence similarity that solves the above problem.
The present invention is achieved through the following technical solutions:
The linear-regression-based method for acquiring multilingual sentence similarity comprises the following steps. S1: Obtain two or more similarity feature values f_i of two sentences, f_i including f_1, f_2, f_3, ..., f_n. S2: Select the feature weight ω_i corresponding to each similarity feature value according to the language and application domain of the two sentences, ω_i including ω_1, ω_2, ω_3, ..., ω_n. S3: Linearly combine the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value f_s of the two sentences. The linear regression formula is f_s = ω_1·f_1 + ω_2·f_2 + ... + ω_n·f_n, where f_i is a similarity feature value, ω_i is the feature weight corresponding to f_i, and f_s is the composite similarity feature value.
In the prior art, sentence similarity comparison uses a single level of comparison, and such single-level comparison methods cannot be adapted to the characteristics of different languages and registers; for example, a method suited to English is not suited to comparing Chinese, and a method suited to news English is not suited to comparing spoken English. When the invention is applied, two or more similarity feature values f_i (f_1, f_2, f_3, ..., f_n) of the two sentences are first obtained; the feature weight ω_i (ω_1, ω_2, ω_3, ..., ω_n) corresponding to each similarity feature value is then selected according to the language and application domain of the two sentences; and the two or more similarity feature values are linearly combined with their corresponding feature weights to obtain the composite similarity feature value f_s of the two sentences, using the formula f_s = ω_1·f_1 + ω_2·f_2 + ... + ω_n·f_n, where f_i is a similarity feature value, ω_i is the feature weight corresponding to f_i, and f_s is the composite similarity feature value. For the similarity feature values of the two sentences one may choose, but is not limited to, the structure similarity feature value, the part-of-speech similarity feature value, or the lexical similarity feature value; these three similarity feature values work well for comparing sentences in mainstream languages and registers. For a special language or register, such as Turkish, a root similarity feature value can additionally be included according to its characteristics.
The feature weight corresponding to each similarity feature value differs with the language and the application domain. For example, compared with English, German expresses semantics largely through case inflection, so syntactic constituents such as objects and adverbials can occupy very flexible positions within a sentence; the feature weight of the part-of-speech similarity feature value is therefore reduced and the feature weight of the structure similarity feature value increased. Likewise, compared with drama dialogue, news language mostly consists of long sentences with rigorous syntax, so every feature needs to be considered in a balanced way, whereas drama sentences tend to be short and colloquially ungrammatical, so the part-of-speech similarity feature weight is reduced for drama text. By applying weighted linear regression to the different similarity feature values of a sentence pair, the method of the invention is applicable to multiple languages and a variety of registers, as the sketch below illustrates.
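A minimal sketch of the weighted combination in step S3, assuming the feature values and their weights are supplied as plain Python lists; the function name composite_similarity is illustrative and not taken from the patent.

def composite_similarity(feature_values, feature_weights):
    """Weighted linear combination f_s = sum_i(w_i * f_i) of similarity features.

    feature_values  -- [f_1, ..., f_n], each similarity feature in [0, 1]
    feature_weights -- [w_1, ..., w_n], chosen per language and application domain
    """
    if len(feature_values) != len(feature_weights):
        raise ValueError("need exactly one weight per feature value")
    return sum(w * f for w, f in zip(feature_weights, feature_values))

# Example with the drama-style English weights of Embodiment 7:
# f_s = 0.5*1.0 + 0.1*0.5 + 0.4*0.63 = 0.802
print(composite_similarity([1.0, 0.5, 0.63], [0.5, 0.1, 0.4]))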
Further, the two or more similarity feature values include a structure similarity feature value f_i calculated for the two sentences. Its calculation steps are as follows. S111: Parse the two sentences and obtain the two syntax trees corresponding to them. S112: From the two syntax trees, derive the structure detection values TP, FP, and FN between them. S113: From TP, FP, and FN, calculate the structure similarity feature value f_i of the two sentences by the formulas R = TP / (TP + FN), P = TP / (TP + FP), and f_i = 2·P·R / (P + R), where TP is the structure true-positive count, FP is the structure false-positive count, FN is the structure false-negative count, R is the structure recall, P is the structure precision, and f_i is the structure similarity feature value.
When the invention is applied, the two sentences to be compared are first parsed to obtain their two corresponding syntax trees; the structure detection values TP, FP, and FN between the two syntax trees are then derived; and the structure similarity feature value f_i of the two sentences is calculated from TP, FP, and FN by the formulas above. This algorithm for the structure similarity feature value captures the structural characteristics of a sentence well; a sketch follows.
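The patent does not spell out how TP, FP, and FN are matched between the two trees. The worked example in Embodiment 2 is consistent with counting overlapping constituent labels as multisets, which is the assumption made in the sketch below; the function name structure_f1 and the multiset matching rule are therefore illustrative rather than prescribed by the text.

from collections import Counter

def structure_f1(labels_ref, labels_cand):
    """F1-style structural similarity between two constituent-label sequences."""
    ref, cand = Counter(labels_ref), Counter(labels_cand)
    tp = sum((ref & cand).values())   # labels present in both trees
    fp = sum((cand - ref).values())   # extra labels in the candidate tree
    fn = sum((ref - cand).values())   # labels missing from the candidate tree
    if tp == 0:
        return 0.0
    p = tp / (tp + fp)                # structure precision
    r = tp / (tp + fn)                # structure recall
    return 2 * p * r / (p + r)

# Embodiment 2: both trees carry the labels {S, PP, NP, VP, NP, PP, NP},
# so TP = 7, FP = FN = 0 and the structure feature value is 1.0.
print(structure_f1("S PP NP VP NP PP NP".split(), "S NP VP PP PP NP NP".split()))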
Further, the two or more similarity feature values include a part-of-speech similarity feature value f_i calculated from the parts of speech of the two sentences. Its calculation steps are as follows. S121: Parse the two sentences and obtain the two syntax trees corresponding to them. S122: Divide the two sentences into a reference sentence and an original sentence; the reference sentence is used only in this calculation of the part-of-speech similarity feature value f_i, while the original sentence still has to have its part-of-speech similarity feature value f_i calculated against other sentences as well. From the part-of-speech distributions of the two syntax trees, derive the minimum number of edit steps W required to transform one sentence into the other. S123: Calculate the part-of-speech similarity feature value f_i of the two sentences by the formula f_i = 1 - W / L, where W is the minimum number of edit steps required to transform one sentence into the other, L is the length of the reference sentence, and f_i is the part-of-speech similarity feature value.
Further, the minimum number of edit steps W required to transform one sentence into the other is the Levenshtein distance.
When the invention is applied, the two sentences are first parsed to obtain their two corresponding syntax trees; the two sentences are divided into a reference sentence and an original sentence (the reference sentence is used only in this calculation of the part-of-speech similarity feature value f_i, while the original sentence still has to be compared against other sentences); the minimum number of edit steps W required to transform one sentence into the other is derived from the part-of-speech distributions of the two syntax trees; and the part-of-speech similarity feature value f_i of the two sentences is calculated by the formula f_i = 1 - W / L, where L is the length of the reference sentence. W is computed as the Levenshtein distance. In sentence similarity comparison it is usually necessary to compare one sentence against several others; that sentence is called the original sentence, and the several sentences are called reference sentences. This algorithm for the part-of-speech similarity feature value captures the part-of-speech characteristics of a sentence well; a sketch follows.
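A sketch of the part-of-speech feature under the assumption that W is the ordinary Levenshtein distance over the two part-of-speech tag sequences and L is the tag count of the reference sentence; the function names are illustrative. On the tag sequences of Embodiment 3 it reproduces W = 6 and f ≈ 0.143.

def levenshtein(seq_a, seq_b):
    """Minimum number of insertions, deletions, and substitutions turning seq_a into seq_b."""
    prev = list(range(len(seq_b) + 1))
    for i, x in enumerate(seq_a, 1):
        curr = [i]
        for j, y in enumerate(seq_b, 1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute (free on a match)
        prev = curr
    return prev[-1]

def pos_similarity(pos_ref, pos_orig):
    """f = 1 - W / L, with W the edit distance and L the reference-sentence length."""
    return 1 - levenshtein(pos_ref, pos_orig) / len(pos_ref)

# Embodiment 3: W = 6 between the two tag sequences, L = 7, so f = 1 - 6/7 ≈ 0.143
ref_tags  = "IN DT NN PRP VBZ IN NN".split()
orig_tags = "PRP VBD IN NN IN DT NN".split()
print(pos_similarity(ref_tags, orig_tags))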
Further, the two or more similarity feature values include a lexical similarity feature value f_i calculated from the vocabulary of the two sentences. Its calculation steps are as follows. S131: Parse the two sentences and obtain the two syntax trees corresponding to them. S132: From the two syntax trees, derive the lexical detection values TP', FP', and FN' between them. S133: From TP', FP', and FN', calculate the lexical similarity feature value f_i of the two sentences by the formulas R' = TP' / (TP' + FN'), P' = TP' / (TP' + FP'), and f_i = 2·P'·R' / (P' + R'), where TP' is the lexical true-positive count, FP' is the lexical false-positive count, FN' is the lexical false-negative count, R' is the lexical recall, P' is the lexical precision, and f_i is the lexical similarity feature value.
When the invention is applied, the two sentences are first parsed to obtain their two corresponding syntax trees; the lexical detection values TP', FP', and FN' between the two syntax trees are then derived; and the lexical similarity feature value f_i of the two sentences is calculated from TP', FP', and FN' by the formulas above. This algorithm for the lexical similarity feature value captures the lexical characteristics of a sentence well; a sketch follows.
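A sketch of the lexical feature, assuming TP', FP', and FN' are multiset overlap counts over case-folded word tokens (the patent leaves the exact matching rule implicit); with that assumption it reproduces the numbers of Embodiment 4. The function name lexical_f1 is illustrative.

from collections import Counter

def lexical_f1(tokens_a, tokens_b):
    """F1-style lexical similarity over case-folded word multisets."""
    a = Counter(t.lower() for t in tokens_a)
    b = Counter(t.lower() for t in tokens_b)
    tp = sum((a & b).values())   # word occurrences shared by both sentences
    fp = sum((b - a).values())   # occurrences only in sentence b
    fn = sum((a - b).values())   # occurrences only in sentence a
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# Embodiment 4: TP' = 5, FP' = FN' = 2, so P' = R' = 5/7 and f ≈ 0.714
sent_a = "For the moment she works as doctor".split()
sent_b = "She worked as doctor for that moment".split()
print(round(lexical_f1(sent_a, sent_b), 3))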
The linear-regression-based system for acquiring multilingual sentence similarity comprises: an acquiring unit for obtaining two or more similarity feature values of two sentences; a selecting unit for selecting feature weights according to the language and application domain of the two sentences; and a linear regression unit for linearly combining the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences.
Further, the two or more similarity feature values obtained by the acquiring unit include a structure similarity feature value, a part-of-speech similarity feature value, or a lexical similarity feature value.
In the prior art, sentence similarity comparison uses a single level of comparison, and such single-level comparison methods cannot be adapted to the characteristics of different languages and registers; for example, a method suited to English is not suited to comparing Chinese, and a method suited to news English is not suited to comparing spoken English. When the invention is applied, the acquiring unit first obtains two or more similarity feature values of the two sentences, the selecting unit selects the feature weights according to the language and application domain of the two sentences, and the linear regression unit linearly combines the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences. For the similarity feature values of the two sentences one may choose, but is not limited to, the structure, part-of-speech, or lexical similarity feature value; these three work well for mainstream languages and registers, and for a special language or register, such as Turkish, a root similarity feature value can additionally be included according to its characteristics. By applying weighted linear regression to the different similarity feature values of a sentence pair, the system of the invention is applicable to multiple languages and a variety of registers. An illustrative sketch of the three units follows.
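An illustrative sketch of the three units as Python classes; the class and method names, and the idea of a (language, domain) weight table, are assumptions made for illustration rather than details taken from the patent.

class AcquiringUnit:
    """Obtains two or more similarity feature values for a sentence pair."""
    def __init__(self, feature_fns):
        self.feature_fns = feature_fns            # e.g. structure, POS, and lexical scorers
    def acquire(self, sent_a, sent_b):
        return [fn(sent_a, sent_b) for fn in self.feature_fns]

class SelectingUnit:
    """Selects feature weights according to language and application domain."""
    def __init__(self, weight_table):
        self.weight_table = weight_table          # maps (language, domain) -> weight vector
    def select(self, language, domain):
        return self.weight_table[(language, domain)]

class LinearRegressionUnit:
    """Linearly combines feature values with their weights into a composite similarity."""
    @staticmethod
    def combine(values, weights):
        return sum(w * f for w, f in zip(weights, values))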
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The linear-regression-based method for acquiring multilingual sentence similarity applies weighted linear regression to the different similarity feature values of a sentence pair, making the method applicable to multiple languages and a variety of registers;
2. The method uses an algorithm for the structure similarity feature value that captures the structural characteristics of a sentence well;
3. The method uses an algorithm for the part-of-speech similarity feature value that captures the part-of-speech characteristics of a sentence well;
4. The method uses an algorithm for the lexical similarity feature value that captures the lexical characteristics of a sentence well;
5. The linear-regression-based system for acquiring multilingual sentence similarity applies weighted linear regression to the different similarity feature values of a sentence pair, making the system applicable to multiple languages and a variety of registers.
Brief description of the drawings
The accompanying drawings described herein are provided for a further understanding of the embodiments of the present invention and constitute a part of the application; they do not limit the embodiments of the present invention. In the accompanying drawings:
Fig. 1 is a schematic diagram of the method of the invention;
Fig. 2 is a schematic diagram of the system of the invention;
Fig. 3 is a schematic diagram of the calculation steps for the structure similarity feature value of the invention;
Fig. 4 is a schematic diagram of the calculation steps for the part-of-speech similarity feature value of the invention;
Fig. 5 is a schematic diagram of the calculation steps for the lexical similarity feature value of the invention;
Fig. 6 is a syntax tree schematic diagram for Embodiments 2, 3, and 4 of the invention;
Fig. 7 is a syntax tree schematic diagram for Embodiments 2, 3, and 4 of the invention;
Fig. 8 is a syntax tree schematic diagram for Embodiment 8 of the invention;
Fig. 9 is a syntax tree schematic diagram for Embodiment 8 of the invention;
Fig. 10 is a syntax tree schematic diagram for Embodiment 9 of the invention;
Fig. 11 is a syntax tree schematic diagram for Embodiment 9 of the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments and accompanying drawings. The exemplary embodiments of the invention and their description are only used to explain the present invention and are not intended as limitations of the invention.
Embodiment 1
As shown in Fig. 1, the linear-regression-based method of the invention for acquiring multilingual sentence similarity is characterized in that it comprises the following steps. S1: Obtain two or more similarity feature values f_i of two sentences, f_i including f_1, f_2, f_3, ..., f_n. S2: Select the feature weight ω_i corresponding to each similarity feature value according to the language and application domain of the two sentences, ω_i including ω_1, ω_2, ω_3, ..., ω_n. S3: Linearly combine the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value f_s of the two sentences. The linear regression formula is f_s = ω_1·f_1 + ω_2·f_2 + ... + ω_n·f_n, where f_i is a similarity feature value, ω_i is the feature weight corresponding to f_i, and f_s is the composite similarity feature value.
When this embodiment is implemented, two or more similarity feature values f_i (f_1, f_2, f_3, ..., f_n) of the two sentences are first obtained; the feature weight ω_i (ω_1, ω_2, ω_3, ..., ω_n) corresponding to each similarity feature value is then selected according to the language and application domain of the two sentences; and the two or more similarity feature values are linearly combined with their corresponding feature weights by the formula f_s = ω_1·f_1 + ω_2·f_2 + ... + ω_n·f_n to obtain the composite similarity feature value f_s of the two sentences.
Embodiment 2
As shown in Fig. 3, Fig. 6, and Fig. 7, this embodiment builds on Embodiment 1. The two or more similarity feature values include the structure similarity feature value f_1 calculated for the two sentences. Its calculation steps are as follows: parse the two sentences and obtain the two syntax trees corresponding to them; derive the structure detection values TP, FP, and FN between the two syntax trees; and calculate the structure similarity feature value f_1 of the two sentences from TP, FP, and FN by the formulas R = TP / (TP + FN), P = TP / (TP + FP), and f_1 = 2·P·R / (P + R), where TP is the structure true-positive count, FP is the structure false-positive count, FN is the structure false-negative count, R is the structure recall, P is the structure precision, and f_1 is the structure similarity feature value.
When this embodiment is implemented, the language is English and the two sentences are:
<a>:For the moment she works as doctor.
<b>:She worked as doctor for that moment.
First, the two sentences are parsed to obtain two syntax trees, shown in Fig. 6 and Fig. 7, where:
Structural features of <a>: S, PP, NP, VP, NP, PP, NP
Structural features of <b>: S, NP, VP, PP, PP, NP, NP
The structure detection values TP, FP, and FN between the two syntax trees are then derived:
TP = 7, FP = 0, FN = 0;
Substituting TP, FP, and FN into R = TP / (TP + FN) and P = TP / (TP + FP)
gives P = 1 and R = 1;
Substituting P and R into f_1 = 2·P·R / (P + R)
gives f_1 = 1.
Embodiment 3
As shown in Fig. 4, Fig. 6, and Fig. 7, this embodiment builds on Embodiment 1. The two or more similarity feature values include the part-of-speech similarity feature value f_2 calculated from the parts of speech of the two sentences. Its calculation steps are as follows: parse the two sentences and obtain the two syntax trees corresponding to them; divide the two sentences into a reference sentence and an original sentence, where the reference sentence is used only in this calculation of the part-of-speech similarity feature value f_2 and the original sentence still has to have its part-of-speech similarity feature value f_2 calculated against other sentences; from the part-of-speech distributions of the two syntax trees, derive the minimum number of edit steps W required to transform one sentence into the other; and calculate the part-of-speech similarity feature value f_2 of the two sentences by the formula f_2 = 1 - W / L, where W is the minimum number of edit steps required to transform one sentence into the other and L is the length of the reference sentence. W is computed as the Levenshtein distance.
When this embodiment is implemented, the language is English and the two sentences are:
<a>:For the moment she works as doctor.
<b>:She worked as doctor for that moment.
First, the two sentences are parsed to obtain two syntax trees, shown in Fig. 6 and Fig. 7, where:
Part-of-speech features of <a>: IN, DT, NN, PRP, VBZ, IN, NN
Part-of-speech features of <b>: PRP, VBD, IN, NN, IN, DT, NN
The minimum number of edit steps W required to transform one sentence into the other is computed as the Levenshtein distance: W = 6;
Taking <a> as the reference sentence and <b> as the original sentence, the reference sentence length is L = 7;
Substituting L and W into f_2 = 1 - W / L
gives f_2 = 0.143.
Embodiment 4
As shown in Fig. 5, Fig. 6, and Fig. 7, this embodiment builds on Embodiment 1. The two or more similarity feature values include the lexical similarity feature value f_3 calculated from the vocabulary of the two sentences. Its calculation steps are as follows: parse the two sentences and obtain the two syntax trees corresponding to them; derive the lexical detection values TP', FP', and FN' between the two syntax trees; and calculate the lexical similarity feature value f_3 of the two sentences from TP', FP', and FN' by the formulas R' = TP' / (TP' + FN'), P' = TP' / (TP' + FP'), and f_3 = 2·P'·R' / (P' + R'), where TP' is the lexical true-positive count, FP' is the lexical false-positive count, FN' is the lexical false-negative count, R' is the lexical recall, P' is the lexical precision, and f_3 is the lexical similarity feature value.
When this embodiment is implemented, the language is English and the two sentences are:
<a>:For the moment she works as doctor.
<b>:She worked as doctor for that moment.
First, the two sentences are parsed to obtain two syntax trees, shown in Fig. 6 and Fig. 7, where:
Lexical features of <a>: For, the, moment, she, works, as, doctor
Lexical features of <b>: She, worked, as, doctor, for, that, moment
The lexical detection values TP', FP', and FN' between the two syntax trees are then derived:
TP' = 5, FP' = 2, FN' = 2;
Substituting TP', FP', and FN' into R' = TP' / (TP' + FN') and P' = TP' / (TP' + FP')
gives P' = 5/7 and R' = 5/7;
Substituting P' and R' into f_3 = 2·P'·R' / (P' + R')
gives f_3 = 0.714.
Embodiment 5
As shown in Fig. 1, Fig. 6, and Fig. 7, this embodiment builds on Embodiments 1 to 4. Based on the characteristics of English, the feature weight corresponding to f_1 is chosen as ω_1 = 0.5, the feature weight corresponding to f_2 as ω_2 = 0.1, and the feature weight corresponding to f_3 as ω_3 = 0.4; the structure similarity feature value is f_1 = 1, the part-of-speech similarity feature value f_2 = 0.143, and the lexical similarity feature value f_3 = 0.714;
When this embodiment is implemented, f_1, f_2, f_3, ω_1, ω_2, and ω_3 are substituted into the formula f_s = ω_1·f_1 + ω_2·f_2 + ω_3·f_3:
f_s = f_1·ω_1 + f_2·ω_2 + f_3·ω_3 = 1 × 0.5 + 0.143 × 0.1 + 0.714 × 0.4 ≈ 0.80.
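For reference, a quick arithmetic check of this combination with the feature values from Embodiments 2 to 4 and the weights chosen above; the dictionary keys are illustrative labels, not terms from the patent.

# Weighted combination of the three feature values from Embodiments 2 to 4
# with the English weights chosen in Embodiment 5.
weights = {"structure": 0.5, "pos": 0.1, "lexical": 0.4}
values  = {"structure": 1.0, "pos": 0.143, "lexical": 0.714}

f_s = sum(weights[k] * values[k] for k in weights)
print(round(f_s, 2))   # 0.8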
Embodiment 6
As shown in Fig. 1, this embodiment builds on Embodiments 1 to 4 and analyzes English news text. Text of the news type mostly consists of long sentences with rigorous syntactic structure, so every feature needs to be considered in a balanced way.
When this embodiment is implemented, the two sentences are:
<c>The stock posted a multi-year low of under $10 in early 2009 before roaring to a recent high of nearly $75 in April 2014.
<d>The stock will post a multi-year low of under $10 in early 2016 before roaring to a recent high of nearly $75 in April 2014.
First, the two sentences are parsed to obtain two syntax trees, where:
Structural features of <c>: ROOT, IP, NP, DP, NP, VP, IP, NP, QP, NP, VP, PP, PP, IP, NP, VP, NP, PP, NP, NP, QP, NP, PP, NP, QP, NP, VP, PP, QP, NP, QP, PP, QP, NP, QP
Structural features of <d>: ROOT, IP, NP, DP, NP, VP, IP, NP, QP, NP, VP, PP, PP, IP, NP, VP, NP, PP, NP, NP, QP, NP, PP, NP, QP, NP, VP, PP, QP, NP, QP, PP, QP, NP, QP
Part-of-speech features of <c>: DT, NN, VV, CD, NN, VV, P, NN, NN, NT, P, NR, CD, NN, NN, P, CD, NN, VA, P, NR, NN, CD, P, NR, CD, PU
Part-of-speech features of <d>: DT, NN, NN, VV, CD, NN, VV, P, NN, NN, NT, P, NR, CD, NN, NN, P, CD, NN, VA, P, NR, NN, CD, P, NR, CD, PU
Lexical features of <c>: The, stock, posted, a, multi-year, low, of, under, $, 10, in, early, 2009, before, roaring, to, a, recent, high, of, nearly, $, 75, in, April, 2014
Lexical features of <d>: The, stock, will, post, a, multi-year, low, of, under, $, 10, in, early, 2016, before, roaring, to, a, recent, high, of, nearly, $, 75, in, April, 2014
Following the calculations of Embodiments 2 to 4, we obtain:
f_1 = 1.0, f_2 = 0.96, f_3 = 0.91;
Given the characteristics of news text, every feature needs to be considered in a balanced way, so the feature weight corresponding to f_1 is chosen as ω_1 = 0.35, that corresponding to f_2 as ω_2 = 0.35, and that corresponding to f_3 as ω_3 = 0.3;
Substituting into the formula of Embodiment 1, f_s = ω_1·f_1 + ω_2·f_2 + ω_3·f_3, gives:
f_s = f_1·ω_1 + f_2·ω_2 + f_3·ω_3 = 1.0 × 0.35 + 0.96 × 0.35 + 0.91 × 0.3 ≈ 0.96.
Embodiment 7
As shown in Fig. 1, this embodiment builds on Embodiments 1 to 4 and analyzes English drama text. Drama sentences tend to be short and colloquially ungrammatical, so the weight of the part-of-speech feature needs to be reduced;
When this embodiment is implemented, the two sentences are:
<e>:Louis,you're a good writer.
<f>:You're a good investigator.maybe...
First, the two sentences are parsed to obtain two syntax trees, where:
Structural features of <e>: ROOT, IP, NP, NP, VP, NP, QP, NP
Structural features of <f>: ROOT, IP, IP, NP, VP, NP, QP, NP, IP, VP, ADVP, VP
Part-of-speech features of <e>: NR, PU, PN, VV, CD, NN, NN, PU
Part-of-speech features of <f>: PN, VV, CD, NN, NN, PU, AD, VV
Lexical features of <e>: Louis, you, 're, a, good, writer
Lexical features of <f>: You, 're, a, good, investigator, maybe
Following the calculations of Embodiments 2 to 4, we obtain:
f_1 = 1.0, f_2 = 0.5, f_3 = 0.63;
Given the characteristics of drama text, the weight of the part-of-speech feature is reduced, so the feature weight corresponding to f_1 is chosen as ω_1 = 0.5, that corresponding to f_2 as ω_2 = 0.1, and that corresponding to f_3 as ω_3 = 0.4;
Substituting into the formula of Embodiment 1, f_s = ω_1·f_1 + ω_2·f_2 + ω_3·f_3, gives:
f_s = f_1·ω_1 + f_2·ω_2 + f_3·ω_3 = 1.0 × 0.5 + 0.5 × 0.1 + 0.63 × 0.4 ≈ 0.80.
Embodiment 8
As shown in Fig. 1, Fig. 8, and Fig. 9, this embodiment builds on Embodiments 1 to 4 and analyzes German. Because German expresses semantics completely through case inflection, syntactic constituents such as objects and adverbials can occupy very flexible positions within a sentence; in this case the weight of the part-of-speech feature needs to be reduced and the weight of the structural feature increased;
When this embodiment is implemented, the two sentences are:
<g>Er kaufte ein Buch in einer Buchhandlung
<h>Er in einer Buchhandlung ein Buch kaufte
First, the two sentences are parsed to obtain two syntax trees, where:
Structural features of <g>: ROOT, IP, NP, VP, PP, NP, VP
Structural features of <h>: ROOT, IP, NP, VP, PP, NP, VP, NP
Part-of-speech features of <g>: NR, NN, NN, NN, P, NN, VV
Part-of-speech features of <h>: NR, P, NN, NN, VV, NN, NN
Lexical features of <g>: Er, kaufte, ein, buch, in, einer, buchhandlung
Lexical features of <h>: Er, in, einer, buchhandlung, ein, buch, kaufte
Following the calculations of Embodiments 2 to 4, we obtain:
f_1 = 0.93, f_2 = 0.57, f_3 = 1;
Given the characteristics of German, the weight of the part-of-speech feature is reduced and the weight of the structural feature increased, so the feature weight corresponding to f_1 is chosen as ω_1 = 0.5, that corresponding to f_2 as ω_2 = 0.1, and that corresponding to f_3 as ω_3 = 0.4;
Substituting into the formula of Embodiment 1, f_s = ω_1·f_1 + ω_2·f_2 + ω_3·f_3, gives:
f_s = f_1·ω_1 + f_2·ω_2 + f_3·ω_3 = 0.93 × 0.5 + 0.57 × 0.1 + 1 × 0.4 = 0.922.
Embodiment 9
As shown in Fig. 1, Fig. 10, and Fig. 11, this embodiment builds on Embodiments 1 to 4 and analyzes Chinese. Because the meaning of a Chinese sentence is expressed not through inflection but through word order, and the position in which a word is placed carries grammatical meaning, the weight of the part-of-speech feature needs to be increased for this language.
When this embodiment is implemented, the two sentences are:
<i>Get dressed
<j>Wear the clothes well
First, the two sentences are parsed to obtain two syntax trees, where:
Structural features of <i>: ROOT, IP, VP, VV, NP, ADJP, JJ, NP, NN
Structural features of <j>: ROOT, IP, VP, NP, ADJP, NP
Part-of-speech features of <i>: VV, JJ, NN
Part-of-speech features of <j>: AD, VV, NN
Lexical features of <i>: wear, good, clothes
Lexical features of <j>: good, wear, clothes
Following the calculations of Embodiments 2 to 4, we obtain:
f_1 = 0.83, f_2 = 0.33, f_3 = 1;
Given the characteristics of Chinese, the weight of the part-of-speech feature is increased, so the feature weight corresponding to f_1 is chosen as ω_1 = 0.3, that corresponding to f_2 as ω_2 = 0.4, and that corresponding to f_3 as ω_3 = 0.3;
Substituting into the formula of Embodiment 1, f_s = ω_1·f_1 + ω_2·f_2 + ω_3·f_3, gives:
f_s = f_1·ω_1 + f_2·ω_2 + f_3·ω_3 = 0.83 × 0.3 + 0.33 × 0.4 + 1.0 × 0.3 ≈ 0.68.
Embodiment 10
As shown in Fig. 2, the linear-regression-based system of the invention for acquiring multilingual sentence similarity comprises: an acquiring unit for obtaining two or more similarity feature values of two sentences; a selecting unit for selecting feature weights according to the language and application domain of the two sentences; and a linear regression unit for linearly combining the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences.
When this embodiment is implemented, the acquiring unit first obtains two or more similarity feature values of the two sentences, the selecting unit selects the feature weights according to the language and application domain of the two sentences, and the linear regression unit linearly combines the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences.
The above embodiments further describe the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the invention shall be included within the scope of protection of the present invention.

Claims (7)

1. A linear-regression-based method for acquiring multilingual sentence similarity, characterized in that it comprises the following steps:
S1: obtaining two or more similarity feature values f_i of two sentences, the f_i including f_1, f_2, f_3, ..., f_n;
S2: selecting the feature weight ω_i corresponding to each similarity feature value according to the language and application domain of the two sentences, the ω_i including ω_1, ω_2, ω_3, ..., ω_n;
S3: linearly combining the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value f_s of the two sentences;
the linear regression formula being as follows:
f_s = ω_1·f_1 + ω_2·f_2 + ... + ω_n·f_n;
where f_i is a similarity feature value, ω_i is the feature weight corresponding to f_i, and f_s is the composite similarity feature value.
2. The linear-regression-based method for acquiring multilingual sentence similarity according to claim 1, characterized in that the two or more similarity feature values include a structure similarity feature value f_i calculated for the two sentences, its calculation steps being as follows:
S111: parsing the two sentences and obtaining the two syntax trees corresponding to them;
S112: deriving the structure detection values TP, FP, and FN between the two syntax trees from the two syntax trees;
S113: calculating the structure similarity feature value f_i of the two sentences from the structure detection values TP, FP, and FN by the following formulas:
R = TP / (TP + FN); P = TP / (TP + FP); f_i = 2·P·R / (P + R);
where TP is the structure true-positive count, FP is the structure false-positive count, FN is the structure false-negative count, R is the structure recall, P is the structure precision, and f_i is the structure similarity feature value.
3. The linear-regression-based method for acquiring multilingual sentence similarity according to claim 1, characterized in that the two or more similarity feature values include a part-of-speech similarity feature value f_i calculated from the parts of speech of the two sentences, its calculation steps being as follows:
S121: parsing the two sentences and obtaining the two syntax trees corresponding to them;
S122: dividing the two sentences into a reference sentence and an original sentence, the reference sentence being the sentence used only in this calculation of the part-of-speech similarity feature value f_i and the original sentence being the sentence that, in addition, still has to have its part-of-speech similarity feature value f_i calculated against other sentences; and deriving, from the part-of-speech distributions of the two syntax trees, the minimum number of edit steps W required to transform one sentence into the other;
S123: calculating the part-of-speech similarity feature value f_i of the two sentences by the following formula:
f_i = 1 - W / L;
where W is the minimum number of edit steps required to transform one sentence into the other, L is the length of the reference sentence, and f_i is the part-of-speech similarity feature value.
4. The linear-regression-based method for acquiring multilingual sentence similarity according to claim 3, characterized in that the minimum number of edit steps W required to transform one sentence into the other is the Levenshtein distance.
5. The linear-regression-based method for acquiring multilingual sentence similarity according to claim 1, characterized in that the two or more similarity feature values include a lexical similarity feature value f_i calculated for the two sentences, its calculation steps being as follows:
S131: parsing the two sentences and obtaining the two syntax trees corresponding to them;
S132: deriving the lexical detection values TP', FP', and FN' between the two syntax trees from the two syntax trees;
S133: calculating the lexical similarity feature value f_i of the two sentences from the lexical detection values TP', FP', and FN' by the following formulas:
R' = TP' / (TP' + FN'); P' = TP' / (TP' + FP'); f_i = 2·P'·R' / (P' + R');
where TP' is the lexical true-positive count, FP' is the lexical false-positive count, FN' is the lexical false-negative count, R' is the lexical recall, P' is the lexical precision, and f_i is the lexical similarity feature value.
6. A linear-regression-based system for acquiring multilingual sentence similarity using the method of any one of claims 1 to 5, characterized in that it comprises:
an acquiring unit for obtaining two or more similarity feature values of two sentences;
a selecting unit for selecting feature weights according to the language and application domain of the two sentences;
a linear regression unit for linearly combining the two or more similarity feature values with their corresponding feature weights to obtain the composite similarity feature value of the two sentences.
7. The linear-regression-based system for acquiring multilingual sentence similarity according to claim 6, characterized in that the two or more similarity feature values obtained by the acquiring unit include a structure similarity feature value, a part-of-speech similarity feature value, or a lexical similarity feature value.
CN201710187215.2A 2017-03-27 2017-03-27 Linear-regression-based method and system for acquiring multilingual sentence similarity Pending CN107066443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710187215.2A CN107066443A (en) 2017-03-27 2017-03-27 Linear-regression-based method and system for acquiring multilingual sentence similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710187215.2A CN107066443A (en) 2017-03-27 2017-03-27 Linear-regression-based method and system for acquiring multilingual sentence similarity

Publications (1)

Publication Number Publication Date
CN107066443A true CN107066443A (en) 2017-08-18

Family

ID=59620196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710187215.2A Pending CN107066443A (en) 2017-03-27 2017-03-27 Linear-regression-based method and system for acquiring multilingual sentence similarity

Country Status (1)

Country Link
CN (1) CN107066443A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334493A (en) * 2018-01-07 2018-07-27 深圳前海易维教育科技有限公司 A kind of topic knowledge point extraction method based on neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6810376B1 (en) * 2000-07-11 2004-10-26 Nusuara Technologies Sdn Bhd System and methods for determining semantic similarity of sentences
CN104424279A (en) * 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance calculating method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6810376B1 (en) * 2000-07-11 2004-10-26 Nusuara Technologies Sdn Bhd System and methods for determining semantic similarity of sentences
CN104424279A (en) * 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance calculating method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BING LIU (author), 俞勇 et al. (translators): "Web Data Mining" (《Web数据挖掘》), Tsinghua University Press (清华大学出版社), 30 April 2009 *
李秋明 et al.: "A sentence similarity computation model based on multiple sentence features" (基于句子多种特征的相似度计算模型), Software Guide (《软件导刊》) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334493A (en) * 2018-01-07 2018-07-27 深圳前海易维教育科技有限公司 A kind of topic knowledge point extraction method based on neural network
CN108334493B (en) * 2018-01-07 2021-04-09 深圳前海易维教育科技有限公司 Question knowledge point automatic extraction method based on neural network

Similar Documents

Publication Publication Date Title
Taraldsen The scope of wh movement in Norwegian
CN104933027B (en) A kind of open Chinese entity relation extraction method of utilization dependency analysis
CN109146610A (en) It is a kind of intelligently to insure recommended method, device and intelligence insurance robot device
CN104881402A (en) Method and device for analyzing semantic orientation of Chinese network topic comment text
Snover et al. Language and translation model adaptation using comparable corpora
DE112013005742T5 (en) Intention estimation device and intention estimation method
CN105573994A (en) Statistic machine translation system based on syntax framework
CN106502987B (en) A kind of method and apparatus that the sentence template based on seed sentence is recalled
CN103020045A (en) Statistical machine translation method based on predicate argument structure (PAS)
CN103336803B (en) A kind of computer generating method of embedding name new Year scroll
CN109145286A (en) Based on BiLSTM-CRF neural network model and merge the Noun Phrase Recognition Methods of Vietnamese language feature
CN101763403A (en) Query translation method facing multi-lingual information retrieval system
CN107038155A (en) The extracting method of text feature is realized based on improved small-world network model
CN107066443A (en) Linear-regression-based method and system for acquiring multilingual sentence similarity
Dragomirescu et al. Syntactic archaisms preserved in a contemporary romance variety: Interpolation and scrambling in old Romanian and Istro-Romanian
Galinsky et al. Improving neural network models for natural language processing in russian with synonyms
Bungum et al. A survey of domain adaptation in machine translation: Towards a refinement of domain space
Chambers et al. Stochastic language generation in a dialogue system: Toward a domain independent generator
Muntarina et al. Tense based english to bangla translation using mt system
Vogel Remarks on the architecture of optimality theoretic syntax grammars
Grandy Some remarks about logical form
CN103294662B (en) Match judging apparatus and consistance determination methods
Griscom The diachronic developments of KI constructions in the Luo and Koman families
Declerck et al. How to semantically relate dialectal Dictionaries in the Linked Data Framework
Yemelianova et al. Variability of phraseological units in the British and American English and methods of their translation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20170818