CN108334495A

CN108334495A - Short text similarity calculation method and system

Info

Publication number: CN108334495A
Application number: CN201810090296.9A
Authority: CN
Inventors: 王慧; 汪立东; 王博; 刘春阳; 张旭; 王萌; 李雄
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2018-01-30
Filing date: 2018-01-30
Publication date: 2018-07-27

Abstract

The present invention provides a short text similarity calculation method comprising the following steps: S1, segmenting a training corpus into words, obtaining the word vector of each word with the word2vec algorithm, and combining the word vectors into a word vector set; S2, segmenting each short text to be compared, looking up the word vector of each of its words in the word vector set, and combining them into a short text vector set; S3, calculating the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set, and combining the maximum similarity value obtained for each word vector in the word vector set into a short text sentence vector; S4, calculating the similarity between the two short text sentence vectors, which gives the similarity between the two short texts. The present invention also provides a short text similarity calculation system. By representing each short text sentence as a sentence vector, the similarity algorithm of the invention effectively characterizes the semantic similarity between short text sentences with high accuracy.

Description

Short text similarity calculation method and system

Technical field

The invention belongs to the technical field of short text similarity, and in particular relates to a short text similarity calculation method and system.

Background art

With the rapid development of computer science and the Internet, more and more data appears on the Internet in the form of short texts, such as Twitter messages, news headlines, and mhkc posts. Applying machine learning techniques such as classification and clustering to Internet short text data, in order to mine valuable information and serve needs in different areas of people's lives, has become a very popular topic in current big data applications. However, Chinese short texts are characterized by sparse vocabulary, scattered semantics, and arbitrary wording, which makes research on them highly challenging. Mining short text data and accurately understanding its underlying meaning has therefore become an urgent task of considerable theoretical significance.

Current methods for calculating Chinese short text similarity mainly represent text with the vector space model (VSM) and compute similarity on that representation. In this model, the content of a text is formalized as a point in a multidimensional space and given as a vector, so that processing text content reduces to vector operations in the vector space, which lowers the complexity of the problem. Analyzing short texts in this way has two main problems. First, because feature words in short texts are sparse, the text vectors are too sparse when algorithms designed for ordinary text are applied, so clustering performs poorly and cannot match the results achieved on long texts. Second, the vector space model considers only the statistical properties of words in context and assumes that keywords are linearly independent; it ignores the semantic information of the words themselves, and therefore has significant limitations and cannot accurately express the semantic meaning of a sentence.
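The sparsity problem can be illustrated with a minimal sketch. The count-based vectorizer and the two token lists below are hypothetical examples chosen for illustration; they are not taken from the patent.

```python
import math
from collections import Counter


def bow_vector(tokens, vocabulary):
    """Count-based vector of a token list over a fixed vocabulary (plain VSM)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two hypothetical short texts with nearly the same meaning
# but no surface words in common.
text_a = ["film", "great"]
text_b = ["movie", "excellent"]
vocabulary = sorted(set(text_a) | set(text_b))

vec_a = bow_vector(text_a, vocabulary)
vec_b = bow_vector(text_b, vocabulary)
vsm_similarity = cosine(vec_a, vec_b)
```

Because the two texts share no words, their VSM vectors have no overlapping non-zero component and the similarity is zero, even though a human reader would judge them close in meaning.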

Summary of the invention

An object of the present invention is to solve the above problems and to provide at least the advantages described below.

A further object of the present invention is to provide a short text similarity calculation method that trains the word vector of each word in a training corpus with the deep learning word2vec algorithm, calculates the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set, obtains the maximum similarity value for each word vector in the word vector set, combines these maximum values into a short text sentence vector, and then calculates the similarity between short text sentence vectors with the cosine similarity algorithm, thereby effectively characterizing the semantic similarity between short text sentences.

To achieve these objects and other advantages, the present invention provides a short text similarity calculation method comprising the following steps:

S1, obtaining a training corpus, segmenting the training corpus into words, training on the training corpus with the deep learning word2vec algorithm to obtain the word vector $(a_{1i}, a_{2i}, a_{3i}, \dots)$ of each word in the training corpus, and then combining the word vectors into a word vector set $S$;

$S = ((a_{11}, a_{21}, a_{31}, \dots), (a_{12}, a_{22}, a_{32}, \dots), (a_{13}, a_{23}, a_{33}, \dots), \dots, (a_{1i}, a_{2i}, a_{3i}, \dots), \dots, (a_{1N}, a_{2N}, a_{3N}, \dots))$

S2, segmenting each short text to be compared, looking up in the word vector set the word vector $word_i$ corresponding to each word of the segmented short text, and combining these word vectors into a short text vector set $sen$;

$sen = (word_1, word_2, word_3, \dots, word_i, \dots, word_M)$

S3, using the cosine similarity formula to calculate the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set, obtaining the maximum similarity value $max_i$ for each word vector in the word vector set, and combining the values $max_i$ into a short text sentence vector $senVec$;

$senVec = (max_1, max_2, max_3, \dots, max_i, \dots, max_N)$;

S4, using the cosine similarity formula to calculate the similarity between the two short text sentence vectors, which gives the similarity between the two short texts.
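Written out, steps S3 and S4 amount to the following. The symbol $s_i$ for the $i$-th word vector in the word vector set, the name $sim$, and the superscripts $(1)$ and $(2)$ distinguishing the two short texts are notation introduced here for illustration; the cosine formula is the standard one.

```latex
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}

max_i = \max_{1 \le j \le M} \cos(s_i, word_j), \qquad i = 1, \dots, N

sim = \cos\bigl(senVec^{(1)}, senVec^{(2)}\bigr)
    = \frac{\sum_{i=1}^{N} max_i^{(1)} \, max_i^{(2)}}
           {\sqrt{\sum_{i=1}^{N} \bigl(max_i^{(1)}\bigr)^2} \; \sqrt{\sum_{i=1}^{N} \bigl(max_i^{(2)}\bigr)^2}}
```

Both sentence vectors have length $N$, the size of the word vector set, regardless of how many words $M$ each short text contains, which is what makes the comparison in S4 possible.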

Preferably, in the short text similarity calculation method, the training corpus is obtained in S1 as follows: corpus data is obtained, and non-text data is removed from it to give the training corpus; after the word vector of each word in the training corpus has been obtained, the word vectors of stop words and of words whose frequency is below a preset threshold are removed, and the word vectors of the remaining words are combined into the word vector set $S$; the preset threshold is between 5 and 10.
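A minimal sketch of this filtering step follows. The function name `build_vector_set`, its parameters, and the toy corpus and vectors are assumptions made for illustration; in practice the vectors would come from word2vec training.

```python
from collections import Counter


def build_vector_set(segmented_corpus, word_vectors, stop_words, min_count=5):
    """Keep vectors only for words that are not stop words and whose
    frequency in the segmented corpus is at least min_count."""
    counts = Counter(word for sentence in segmented_corpus for word in sentence)
    return {
        word: vector
        for word, vector in word_vectors.items()
        if word not in stop_words and counts[word] >= min_count
    }


# Toy data: "china", "is", "big" occur 6 times each; "rare", "word" occur once.
corpus = [["china", "is", "big"]] * 6 + [["rare", "word"]]
vectors = {
    "china": [0.1, 0.9],
    "is": [0.5, 0.5],
    "big": [0.8, 0.2],
    "rare": [0.3, 0.3],
    "word": [0.4, 0.6],
}
vector_set = build_vector_set(corpus, vectors, stop_words={"is"}, min_count=5)
```

Here "is" is dropped as a stop word and "rare" and "word" are dropped for falling below the threshold, leaving only "china" and "big" in the word vector set.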

Preferably, in the short text similarity calculation method, the short texts to be compared and the training corpus are segmented in S2 with an HMM model and the Viterbi algorithm.

Preferably, in the short text similarity calculation method, the corpus data is obtained with web crawler technology.

Preferably, in the short text similarity calculation method, when the similarity value in S4 is greater than 0.7, the two short text sentences are considered semantically similar.

The present invention also provides a short text similarity calculation system, comprising:

a training corpus word segmentation module, used to obtain a training corpus and segment it into words;

a word vector training module, connected to the training corpus word segmentation module and used to train on the training corpus to obtain the word vector of each word in the training corpus, and then to combine the word vectors into a word vector set;

a short text word segmentation module, used to segment the short texts to be compared into words;

a short text vector generation module, connected to the word vector training module and used to look up in the word vector set the word vector corresponding to each word after segmentation, and to combine these word vectors into a short text vector set;

a first similarity calculation module, connected to the word vector training module and the short text vector generation module and used to calculate the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set;

a comparison module, connected to the word vector training module and used to compare the similarity values of each word vector of the word vector set and obtain the maximum similarity value for each word vector in the word vector set;

a short text sentence vector generation module, connected to the comparison module and used to combine the maximum similarity values of the word vectors in the word vector set into a short text sentence vector;

a second similarity calculation module, connected to the short text sentence vector generation module and used to calculate the similarity between short text sentence vectors with the cosine similarity formula.

Preferably, in the short text similarity calculation system, the training corpus word segmentation module comprises:

an acquiring unit, used to obtain the training corpus;

a word segmentation unit, connected to the acquiring unit and used to segment the training corpus.

Preferably, in the short text similarity calculation system, the word vector training module comprises:

a word vector training unit, used to train on the training corpus to obtain the word vector of each word in the training corpus;

a word vector combining unit, connected to the word vector training unit and used to combine the word vectors into the word vector set.

Preferably, in the short text similarity calculation system, the short text vector generation module comprises:

a lookup unit, used to look up in the word vector set the word vector corresponding to each word after segmentation;

a short text vector combining unit, connected to the lookup unit and used to combine the word vectors of the words after segmentation into the short text vector set.

The present invention has at least the following beneficial effects:

1. The present invention trains on the training corpus with the deep learning word2vec algorithm and obtains the word vector of each word in the training corpus. Representing words as word vectors effectively expresses their true underlying meaning. The word vectors are then combined into a word vector set, each short text to be compared is segmented, and the cosine similarity formula is used to calculate the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set. The maximum similarity value of each word vector in the word vector set is obtained, and these maximum values are combined into a short text sentence vector. A short text sentence is thus represented as a real-valued vector, that is, the algorithm converts the short text into a mathematical vector representation, and a sentence vector built in this way takes full account of the underlying semantic meaning of the words in the sentence. The cosine similarity algorithm is then used to calculate the similarity between short text sentence vectors, which effectively characterizes the semantic similarity between short text sentences with high accuracy and provides strong technical support for subsequent natural language processing tasks such as short text clustering and classification.

Other advantages, objects, and features of the present invention will in part be apparent from the following description, and in part will be understood by those skilled in the art through study and practice of the invention.

Description of the drawings

Fig. 1 is a flow diagram of the short text similarity calculation method of the present invention;

Fig. 2 is a schematic diagram of the composition of the short text similarity calculation system of the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings, so that those skilled in the art can implement it by referring to the description.

It should be understood that terms such as "having", "comprising", and "including" as used herein do not preclude the presence or addition of one or more other elements or combinations thereof.

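As shown in Fig. 1, a short text similarity calculation method comprises steps S1 to S4 as set out above, in which the short text vector set and the short text sentence vector are:

$sen = (word_1, word_2, word_3, \dots, word_i, \dots, word_M)$

$senVec = (max_1, max_2, max_3, \dots, max_i, \dots, max_N)$;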

In the short text similarity calculation method of the present invention, a training corpus is first obtained from the Internet and segmented with a word segmentation tool, so that the training corpus becomes a set containing a large number of words; in practice the number of words exceeds 10,000. For ease of explanation, suppose the training corpus contains only seven words: "I", "come from", "U.S.", "Europe", "person", "am", and "China". The word vector of each word in the training corpus is then calculated with the deep learning word2vec algorithm: the word vector of "I" is $(a_{11}, a_{21}, a_{31})$, that of "come from" is $(a_{12}, a_{22}, a_{32})$, that of "U.S." is $(a_{13}, a_{23}, a_{33})$, that of "Europe" is $(a_{14}, a_{24}, a_{34})$, that of "person" is $(a_{15}, a_{25}, a_{35})$, that of "am" is $(a_{16}, a_{26}, a_{36})$, and that of "China" is $(a_{17}, a_{27}, a_{37})$. The word vectors here are three-dimensional; in practice they may have more dimensions. These word vectors are combined into the word vector set.

Suppose two short text sentences are to be compared, "I am Chinese" and "I come from China". Each sentence is first segmented with the same segmentation method as the training corpus: "I am Chinese" is segmented into "I", "am", "China", "person", and "I come from China" into "I", "come from", "China". Every word after segmentation can be found in the training corpus, so the word vector corresponding to each word can be looked up in the word vector set and the vectors combined into a short text vector set. The cosine similarity between each word vector of the word vector set and each word vector in the short text vector set is then calculated.

Take the short text sentence "I am Chinese" as an example. The word vector of "I" in the word vector set is compared by cosine similarity with the word vectors of the four words "I", "am", "China", "person" in the sentence, giving four values; the largest of these is denoted $a_1$. Next, the word vector of "come from" in the word vector set is compared in the same way with the word vectors of the four words of the sentence, giving four values, the largest of which is denoted $a_2$. Proceeding likewise, the maximum cosine similarity values for the word vectors of "I", "come from", "U.S.", "Europe", "person", "am", "China" in the word vector set are $a_1, a_2, a_3, a_4, a_5, a_6, a_7$. Combining these maximum values gives the short text sentence vector of "I am Chinese", denoted $senVec1 = (a_1, a_2, a_3, a_4, a_5, a_6, a_7)$.

The short text sentence vector of "I come from China" is calculated similarly. The word vector of "I" in the word vector set is compared by cosine similarity with the word vectors of the three words "I", "come from", "China" in the sentence, giving three values, the largest of which is denoted $b_1$. The word vector of "come from" in the word vector set is then compared with the word vectors of the same three words, giving three values, the largest of which is denoted $b_2$. Proceeding likewise, the maximum cosine similarity values for the word vectors of "I", "come from", "U.S.", "Europe", "person", "am", "China" in the word vector set are $b_1, b_2, b_3, b_4, b_5, b_6, b_7$, which gives the short text sentence vector of "I come from China", denoted $senVec2 = (b_1, b_2, b_3, b_4, b_5, b_6, b_7)$.

The similarity between "I am Chinese" and "I come from China" is then calculated with the cosine similarity formula:

$$sim(senVec1, senVec2) = \frac{\sum_{i=1}^{7} a_i b_i}{\sqrt{\sum_{i=1}^{7} a_i^2} \, \sqrt{\sum_{i=1}^{7} b_i^2}}$$

The similarity value lies between 0 and 1; the closer it is to 1, the more similar the two short text sentences.
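The seven-word example can be traced in a short Python sketch. The three-dimensional vectors below are made-up placeholders, not trained word2vec output, and the function and variable names are hypothetical.

```python
import math


def cosine(u, v):
    """Standard cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


# Hypothetical 3-dimensional word vectors for the seven-word training corpus.
word_vector_set = {
    "I": [0.9, 0.1, 0.2],
    "come from": [0.2, 0.8, 0.3],
    "U.S.": [0.1, 0.3, 0.9],
    "Europe": [0.2, 0.4, 0.8],
    "person": [0.7, 0.3, 0.4],
    "am": [0.3, 0.9, 0.1],
    "China": [0.1, 0.2, 0.95],
}


def sentence_vector(words, vector_set):
    """S2 and S3: look up the sentence's word vectors, then take, for every
    vector in the set, its maximum cosine similarity to any sentence word."""
    sentence_word_vectors = [vector_set[w] for w in words]
    return [
        max(cosine(set_vector, wv) for wv in sentence_word_vectors)
        for set_vector in vector_set.values()
    ]


sen_vec1 = sentence_vector(["I", "am", "China", "person"], word_vector_set)
sen_vec2 = sentence_vector(["I", "come from", "China"], word_vector_set)

# S4: similarity between the two sentence vectors.
similarity = cosine(sen_vec1, sen_vec2)
```

Each sentence vector has seven components, one per word in the word vector set. A component is 1 (up to rounding) when the corresponding word occurs in the sentence, and otherwise reflects how close that word is to the nearest word of the sentence.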

In another technical solution, in the short text similarity calculation method, the training corpus is obtained in S1 as follows: corpus data is obtained, and non-text data is removed from it to give the training corpus; after the word vector of each word in the training corpus has been obtained, the word vectors of stop words and of words whose frequency is below a preset threshold are removed, and the word vectors of the remaining words are combined into the word vector set $S$; the preset threshold is between 5 and 10. In this technical solution, corpus data is obtained from the Internet, including articles from mhkc, forums, comments, professional journals and magazines, and so on; non-text information in the corpus data, such as links and emoticons, is removed to give the training corpus. Before the training corpus is trained with the word2vec algorithm, its words are classified, and the frequency of each word and the stop words are counted. Stop words include modal particles, adverbs, prepositions, conjunctions, and the like; these words have no specific meaning of their own and only play a role when placed in a complete sentence. After training yields the word vector of each word in the training corpus, the word vectors of stop words and of words whose frequency is below the preset threshold are removed. A threshold can be set in advance; for example, if the threshold is 8, the word vectors of all words with a frequency below 8 are removed. A word with too low a frequency appears in very few sentences and can essentially be ignored, so removing it reduces the number of word vectors in the word vector set and speeds up calculation. If a short text sentence to be compared contains a removed word, for example a sentence segmented into the four words $A_1$, $A_2$, $A_3$, $A_4$ where $A_3$ is a removed low-frequency word, then in S2 only the word vectors of $A_1$, $A_2$, $A_4$ are looked up in the word vector set and combined into the short text vector set, after which steps S3 and S4 are carried out to calculate the similarity of the short text sentence.
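The handling of a removed word in S2 can be sketched as a lookup that simply skips words absent from the word vector set. The function name `lookup_vectors` and the two-dimensional vectors are assumptions for illustration only.

```python
def lookup_vectors(words, vector_set):
    """Return the vectors of the words found in the word vector set,
    skipping words (such as removed low-frequency words) that are absent."""
    return [vector_set[word] for word in words if word in vector_set]


# A3 was removed from the word vector set as a low-frequency word.
vector_set = {"A1": [1.0, 0.0], "A2": [0.0, 1.0], "A4": [0.5, 0.5]}
sen = lookup_vectors(["A1", "A2", "A3", "A4"], vector_set)
```

The resulting short text vector set contains three vectors, for A1, A2, and A4, and steps S3 and S4 proceed on those alone.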

In another technical solution, in the short text similarity calculation method, the short texts to be compared and the training corpus are segmented in S2 with an HMM model and the Viterbi algorithm. Segmenting the short texts and the training corpus with the same method ensures that every word of a segmented short text can be found in the training corpus.
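The patent does not specify the HMM's states or parameters. The sketch below assumes the common character-tagging formulation, in which each character is tagged B (begin), M (middle), E (end), or S (single-character word), and Viterbi decoding finds the most probable tag sequence. All probabilities are hand-picked toy values, not trained parameters.

```python
import math

STATES = "BMES"
NEG_INF = float("-inf")

# Toy log-probabilities (assumed values for illustration).
START = {"B": math.log(0.5), "M": NEG_INF, "E": NEG_INF, "S": math.log(0.5)}
TRANS = {
    "B": {"M": math.log(0.2), "E": math.log(0.8)},
    "M": {"M": math.log(0.3), "E": math.log(0.7)},
    "E": {"B": math.log(0.5), "S": math.log(0.5)},
    "S": {"B": math.log(0.5), "S": math.log(0.5)},
}
EMIT = {
    "B": {"来": math.log(0.4), "中": math.log(0.4)},
    "M": {},
    "E": {"自": math.log(0.4), "国": math.log(0.4)},
    "S": {"我": math.log(0.6)},
}
UNSEEN = math.log(1e-6)  # floor for characters a state has never emitted


def viterbi_segment(text):
    """Segment text by decoding the most probable B/M/E/S tag sequence."""
    if not text:
        return []
    # score[t][s]: best log-probability of a tag path ending in state s at t.
    score = [{s: START[s] + EMIT[s].get(text[0], UNSEEN) for s in STATES}]
    back = [{}]
    for t in range(1, len(text)):
        score.append({})
        back.append({})
        for s in STATES:
            best_prev = max(
                STATES, key=lambda p: score[t - 1][p] + TRANS[p].get(s, NEG_INF)
            )
            score[t][s] = (
                score[t - 1][best_prev]
                + TRANS[best_prev].get(s, NEG_INF)
                + EMIT[s].get(text[t], UNSEEN)
            )
            back[t][s] = best_prev
    # A complete segmentation must close its last word, so end in E or S.
    last = max("ES", key=lambda s: score[-1][s])
    tags = [last]
    for t in range(len(text) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    tags.reverse()
    # Cut the text after every E or S tag.
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


segments = viterbi_segment("我来自中国")
```

With these toy parameters the sentence "我来自中国" ("I come from China") decodes to the tags S B E B E and is cut into three words. Running the same decoder over both the training corpus and the short texts is what keeps their vocabularies aligned.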

In another technical solution, in the short text similarity calculation method, the corpus data is obtained with web crawler technology.

In another technical solution, in the short text similarity calculation method, when the similarity value in S4 is greater than 0.7, the two short text sentences are considered semantically similar. The larger the similarity value, the closer the two short text sentences are; when the similarity value is greater than 0.7, the two short text sentences are regarded as having the same meaning.
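The decision rule is a strict comparison against the threshold. The constant and function names below are hypothetical.

```python
SIMILARITY_THRESHOLD = 0.7  # value stated in this technical solution


def is_semantically_similar(similarity, threshold=SIMILARITY_THRESHOLD):
    """True when the S4 similarity value is strictly greater than the threshold."""
    return similarity > threshold
```

A similarity of exactly 0.7 does not count as similar under this reading, since the text requires the value to be greater than 0.7.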

As shown in Fig. 2, the present invention also provides a short text similarity calculation system.

In the short text similarity calculation system of the present invention, the training corpus word segmentation module obtains the training corpus and segments it into words. The word vector training module then trains on the training corpus to obtain the word vector of each word in the training corpus, and combines the word vectors into a word vector set. The short text word segmentation module segments the short texts to be compared into words. The short text vector generation module looks up in the word vector set the word vector corresponding to each word after segmentation, and combines these word vectors into a short text vector set. The first similarity calculation module then calculates the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set. The comparison module compares the similarity values of each word vector of the word vector set and obtains the maximum similarity value for each word vector in the word vector set. The short text sentence vector generation module combines the maximum similarity values of the word vectors in the word vector set into a short text sentence vector. Finally, the second similarity calculation module calculates the similarity between short text sentence vectors with the cosine similarity formula.
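One way to picture how the modules connect is a single class whose methods stand in for them. This is a sketch under stated assumptions: the class name, the injected tokenizer, and the pre-built word vector set (which replaces the corpus segmentation and word vector training modules) are all hypothetical, and the whitespace tokenizer in the usage lines merely stands in for real word segmentation.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


class ShortTextSimilaritySystem:
    def __init__(self, tokenizer: Callable[[str], List[str]],
                 vector_set: Dict[str, Vector]):
        self.tokenizer = tokenizer    # short text word segmentation module
        self.vector_set = vector_set  # output of the word vector training module

    def short_text_vectors(self, text: str) -> List[Vector]:
        """Short text vector generation module: look up each segmented word."""
        return [self.vector_set[w] for w in self.tokenizer(text)
                if w in self.vector_set]

    def sentence_vector(self, text: str) -> List[float]:
        """First similarity calculation, comparison, and sentence vector
        generation modules, combined into one step."""
        sen = self.short_text_vectors(text)
        if not sen:
            raise ValueError("no word of the short text is in the word vector set")
        return [max(cosine(set_vector, wv) for wv in sen)
                for set_vector in self.vector_set.values()]

    def similarity(self, text_a: str, text_b: str) -> float:
        """Second similarity calculation module."""
        return cosine(self.sentence_vector(text_a), self.sentence_vector(text_b))


# Toy usage with made-up 2-dimensional vectors.
system = ShortTextSimilaritySystem(
    tokenizer=str.split,
    vector_set={"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]},
)
close_pair = system.similarity("cat", "dog")
distant_pair = system.similarity("cat", "car")
```

With these toy vectors, "cat" and "dog" point in nearly the same direction, so their sentence vectors are far more similar than those of "cat" and "car".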

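In another technical solution, in the short text similarity calculation system, the training corpus word segmentation module comprises an acquiring unit, used to obtain the training corpus, and a word segmentation unit, connected to the acquiring unit and used to segment the training corpus.

In another technical solution, in the short text similarity calculation system, the word vector training module comprises a word vector training unit, used to train on the training corpus to obtain the word vector of each word in the training corpus, and a word vector combining unit, connected to the word vector training unit and used to combine the word vectors into the word vector set.

In another technical solution, in the short text similarity calculation system, the short text vector generation module comprises a lookup unit, used to look up in the word vector set the word vector corresponding to each word after segmentation, and a short text vector combining unit, connected to the lookup unit and used to combine the word vectors of the words after segmentation into the short text vector set.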

Although embodiments of the present invention have been disclosed above, the invention is not limited to the applications listed in the description and the embodiments. It can be applied to any field to which it is suited, and further modifications can readily be made by those skilled in the art. Therefore, without departing from the general concept defined by the claims and their equivalents, the invention is not limited to the specific details and the illustrations shown and described herein.

Claims

1. A short text similarity calculation method, characterized by comprising the following steps:

S1, obtaining a training corpus, segmenting the training corpus into words, training on the training corpus with the deep learning word2vec algorithm to obtain the word vector $(a_{1i}, a_{2i}, a_{3i}, \dots)$ of each word in the training corpus, and then combining the word vectors into a word vector set $S$;

$S = ((a_{11}, a_{21}, a_{31}, \dots), (a_{12}, a_{22}, a_{32}, \dots), (a_{13}, a_{23}, a_{33}, \dots), \dots, (a_{1i}, a_{2i}, a_{3i}, \dots), \dots, (a_{1N}, a_{2N}, a_{3N}, \dots))$

S2, segmenting each short text to be compared, looking up in the word vector set the word vector $word_i$ corresponding to each word of the segmented short text, and combining these word vectors into a short text vector set $sen$;

$sen = (word_1, word_2, word_3, \dots, word_i, \dots, word_M)$

S3, using the cosine similarity formula to calculate the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set, obtaining the maximum similarity value $max_i$ for each word vector in the word vector set, and combining the values $max_i$ into a short text sentence vector $senVec$;

$senVec = (max_1, max_2, max_3, \dots, max_i, \dots, max_N)$;

S4, using the cosine similarity formula to calculate the similarity between the two short text sentence vectors, which gives the similarity between the two short texts.

2. The short text similarity calculation method as claimed in claim 1, characterized in that the training corpus is obtained in S1 as follows: corpus data is obtained, and non-text data is removed from it to give the training corpus; after the word vector of each word in the training corpus has been obtained, the word vectors of stop words and of words whose frequency is below a preset threshold are removed, and the word vectors of the remaining words are combined into the word vector set $S$; the preset threshold is between 5 and 10.

3. The short text similarity calculation method as claimed in claim 1, characterized in that in S2 the short texts to be compared and the training corpus are segmented with an HMM model and the Viterbi algorithm.

4. The short text similarity calculation method as claimed in claim 2, characterized in that the corpus data is obtained with web crawler technology.

5. The short text similarity calculation method as claimed in claim 1, characterized in that when the similarity value in S4 is greater than 0.7, the two short text sentences are considered semantically similar.

6. A short text similarity calculation system as claimed in claim 1, characterized by comprising:

a word vector training module, connected to the training corpus word segmentation module and used to train on the training corpus to obtain the word vector of each word in the training corpus, and then to combine the word vectors into a word vector set;

a short text vector generation module, connected to the word vector training module and used to look up in the word vector set the word vector corresponding to each word after segmentation, and to combine these word vectors into a short text vector set;

a first similarity calculation module, connected to the word vector training module and the short text vector generation module and used to calculate the cosine similarity between each word vector in the word vector set and each word vector in the short text vector set;

a comparison module, connected to the word vector training module and used to compare the similarity values of each word vector of the word vector set and obtain the maximum similarity value for each word vector in the word vector set;

a short text sentence vector generation module, connected to the comparison module and used to combine the maximum similarity values of the word vectors in the word vector set into a short text sentence vector;

a second similarity calculation module, connected to the short text sentence vector generation module and used to calculate the similarity between short text sentence vectors with the cosine similarity formula.

7. The short text similarity calculation system as claimed in claim 6, characterized in that the training corpus word segmentation module comprises:

an acquiring unit, used to obtain the training corpus;

8. The short text similarity calculation system as claimed in claim 6, characterized in that the word vector training module comprises:

a word vector training unit, used to train on the training corpus to obtain the word vector of each word in the training corpus;

a word vector combining unit, connected to the word vector training unit and used to combine the word vectors into the word vector set.

9. The short text similarity calculation system as claimed in claim 6, characterized in that the short text vector generation module comprises:

a short text vector combining unit, connected to the lookup unit and used to combine the word vectors of the words after segmentation into the short text vector set.