CN110413956B - Text similarity calculation method based on bootstrapping - Google Patents


Info

Publication number: CN110413956B
Application number: CN201810400574.6A
Authority: CN (China)
Prior art keywords: word, text, refers, words, weight
Other languages: Chinese (zh)
Other versions: CN110413956A
Inventors: 王清琛, 杜振东
Current assignee: Nanjing Yunwen Network Technology Co ltd
Original assignee: Nanjing Yunwen Network Technology Co ltd
Application filed by Nanjing Yunwen Network Technology Co ltd
Priority to CN201810400574.6A
Publication of CN110413956A
Application granted
Publication of CN110413956B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text similarity calculation method based on bootstrapping, comprising the following steps: calculating the inverse document frequency of each word as the initial value of its word weight; selecting an initial core vocabulary according to the inverse document frequency; calculating the co-occurrence matrix of the words in the text; calculating, according to the bootstrapping algorithm, the correlation between candidate words and the initial core words as the coefficient for updating the weights; and calculating sentence vectors from the word vectors V, the word weights W and the part-of-speech weights F. By adopting the technical scheme of the invention, the accuracy of short-text similarity calculation can be significantly improved.

Description

Text similarity calculation method based on bootstrapping
Technical Field
The invention relates to a word weight calculation method, in particular to a text similarity calculation method based on bootstrapping.
Background
In today's internet information age, a large amount of text information must be processed before it can be used effectively, so the field of natural language processing is continuously evolving. In natural language processing, a common approach is to segment the text into words, express each word by a weight, and generate a vector space model. Many effective word weight calculation methods have been proposed; tf-idf is one of the most commonly used.
The bootstrapping algorithm is a statistical process that repeatedly resamples a limited sample: each iteration extracts a new sample similar to the original one.
A word vector maps each segmented word in the corpus into a multidimensional vector space rich in contextual information. The dimension of the word vector can be set according to the specific task. Word vectors conveniently convert text information into computable numerical information and play an important role in natural language processing.
Disclosure of Invention
To remedy the defect that traditional idf judges the weight of a word only from word frequency and neglects the correlation between words, the invention provides a text similarity calculation method based on bootstrapping, which optimizes the idf word weights in order to improve text similarity calculation.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a text similarity calculation method based on bootstrapping comprises the following steps:
step one, calculating the inverse document frequency of each word as the initial value of its word weight;
step two, selecting an initial core vocabulary according to the inverse document frequency;
step three, calculating the co-occurrence matrix of the words in the text;
step four, calculating, according to the bootstrapping algorithm, the correlation between candidate words and the initial core words as the coefficient for updating the weights;
step five, calculating sentence vectors from the word vectors V, the word weights W and the part-of-speech weights F.
Further, the word weight in step one is a numerical value representing the word in the text and is used to generate a real-valued vector for the text: the better a word represents the subject of the text, the higher its weight.
Further, the correlation in step four is calculated as follows:

T(S_i, R_j) = log2(F(S_i, R_j)) × F(S_i, R_j) / F(R_j)

wherein S_i refers to the i-th word in the initial core vocabulary S, R_j refers to the j-th candidate word, F(S_i, R_j) refers to the co-occurrence frequency of the initial core word S_i and the candidate word R_j, and F(R_j) refers to the frequency of documents containing the candidate word R_j.
Further, the coefficient for updating the weight in step four is calculated as follows:

K(R_j) = max_i T(S_i, R_j) × ( max_i T(S_i, R_j) − (1/|S|) Σ_{i=1..|S|} T(S_i, R_j) )

wherein max_i T(S_i, R_j) refers to the maximum correlation and |S| refers to the number of words in the core vocabulary S.
Further, in step four the coefficient is updated once per iteration, and the updated weight is calculated as follows:
wherein j refers to the j-th iteration, n refers to the total number of iterations, and K_j(C_i) refers to the update coefficient of the weight of word C_i in the j-th iteration.
Further, the sentence vector in step five is calculated as follows:
wherein D_ij refers to the j-th word of text D_i, |D_i| refers to the number of words in text D_i, and the parameter α is searched iteratively over [0, 1] for its optimal value.
The beneficial effects are that:
1. The invention uses the correlation in the bootstrapping algorithm to update the word weights.
2. The invention calculates the correlation between a candidate word w and each of the initial core words s, and uses maximum value × (maximum value − average value) as the basis for deciding whether the candidate word is added to the new initial core vocabulary.
3. Sentence vectors are calculated from the word vectors V, the word weights W and the part-of-speech weights F.
4. By adopting the technical scheme of the invention, the accuracy of short-text similarity calculation can be significantly improved.
Drawings
FIG. 1 is a flow chart of text similarity calculation according to an embodiment of the present invention;
FIG. 2 is a flow chart of updating word segmentation weights according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated below with reference to FIGS. 1 and 2 and to the examples.
In this embodiment, the user questions from the customer service question-answering system of an e-commerce platform and the standard questions in its knowledge base are used as the corpus D.
As shown in FIGS. 1 and 2, after word segmentation, stop-word removal and part-of-speech tagging are performed on the corpus D, the word2vec algorithm is used to generate and store the word vectors V.
In this embodiment, each user question or standard question in the corpus D is treated as a document, and the inverse document frequency of every word is calculated. For example, the inverse document frequency idf(C_i) of word C_i is calculated as:

idf(C_i) = log( |D| / |{d ∈ D : C_i ∈ d}| )

where |D| refers to the total number of documents in the corpus D and |{d ∈ D : C_i ∈ d}| refers to the number of documents containing the word C_i. The top 20 ranked words are selected as the initial core vocabulary S, and the initial core words are removed from the total vocabulary C to generate the candidate vocabulary R. The initial core vocabulary S is shown in Table 1 below.
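The idf computation above can be sketched in Python; the toy corpus below is an illustrative assumption (the embodiment uses segmented user questions, which are not reproduced here):

```python
import math

def inverse_document_frequency(corpus):
    """idf(C_i) = log(|D| / number of documents containing C_i)."""
    n_docs = len(corpus)
    doc_freq = {}
    for doc in corpus:
        for word in set(doc):          # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

# Toy corpus: each "document" is one segmented question.
corpus = [["how", "is", "shipping"],
          ["when", "is", "shipping"],
          ["how", "to", "vacation"]]
idf = inverse_document_frequency(corpus)
# "vacation" appears in 1 of 3 documents, so idf = log(3)
```

Ranking all words by idf and taking the top 20 would give the initial core vocabulary S; the remaining words form the candidate vocabulary R.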
TABLE 1 initial core vocabulary S
Taking the inverse document frequency of every word in the corpus D as its initial weight w, the correlation T is calculated with the bootstrapping algorithm. For example, the correlation between the initial core word S_i and the candidate word R_j is calculated as:

T(S_i, R_j) = log2(F(S_i, R_j)) × F(S_i, R_j) / F(R_j)

wherein S_i refers to the i-th word in the initial core vocabulary S, R_j refers to the j-th candidate word, F(S_i, R_j) refers to the co-occurrence frequency of the initial core word S_i and the candidate word R_j, and F(R_j) refers to the frequency of documents containing the candidate word R_j. Taking "shipping" and "vacation" as candidate words and "me" as the initial core word, for example, T(me, shipping) = log2(13098) × 13098/284020 ≈ 0.63 and T(me, vacation) = log2(28) × 28/896 ≈ 0.15. New initial core words are extracted according to a correlation threshold p, added to the initial core vocabulary S and removed from the candidate vocabulary R, and the next iteration is performed until no new initial core word appears. The setting of the correlation threshold p depends on the calculation of the correlation T; in order to control the number of new initial core words generated in each iteration, the value of p is kept within [0.8, 1].
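The correlation step can be sketched in Python. The formula is the one reconstructed from the worked example (log2 of the co-occurrence frequency, times the ratio of co-occurrence frequency to the candidate word's document frequency), using the counts quoted in the text:

```python
import math

def correlation(cooc_freq, cand_doc_freq):
    """T(S_i, R_j) = log2(F(S_i, R_j)) * F(S_i, R_j) / F(R_j)."""
    if cooc_freq == 0:
        return 0.0
    return math.log2(cooc_freq) * cooc_freq / cand_doc_freq

# Counts from the embodiment: core word "me" vs. two candidates.
t_shipping = correlation(13098, 284020)   # ≈ 0.63
t_vacation = correlation(28, 896)         # ≈ 0.15
```

A candidate whose correlation exceeds the threshold p ∈ [0.8, 1] would be promoted into the core vocabulary for the next iteration.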
In the iterative process, the update coefficient K(R_j) of the weight of candidate word R_j is obtained from its correlation T with each initial core word S_i, calculated as:

K(R_j) = max_i T(S_i, R_j) × ( max_i T(S_i, R_j) − (1/|S|) Σ_{i=1..|S|} T(S_i, R_j) )

wherein max_i T(S_i, R_j) refers to the maximum correlation and |S| refers to the number of words in the initial core vocabulary S. Taking "shipping" and "vacation" as examples, K(shipping) = 1.878 × (1.878 − 1.218) ≈ 1.24 and K(vacation) = 1.167 × (1.167 − 0.992) ≈ 0.2. The update coefficient K of a candidate word is kept within the range [1, 2], and the update coefficient of an initial core word defaults to 2.
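The update coefficient can be sketched as below. The correlation lists are illustrative: they are chosen so that their maximum and mean match the "shipping" (max 1.878, mean 1.218) and "vacation" (max 1.167, mean 0.992) examples in the text:

```python
def update_coefficient(correlations):
    """K(R_j) = maxT * (maxT - average T over the initial core vocabulary).
    The embodiment then controls candidate coefficients to [1, 2]."""
    max_t = max(correlations)
    avg_t = sum(correlations) / len(correlations)
    return max_t * (max_t - avg_t)

# Illustrative correlation lists whose max/mean match the text's examples.
k_shipping = update_coefficient([1.878, 0.558])   # 1.878 * (1.878 - 1.218) ≈ 1.24
k_vacation = update_coefficient([1.167, 0.817])   # 1.167 * (1.167 - 0.992) ≈ 0.20
```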
The word weights are then updated using idf and K; for example, the weight of word C_i is updated as follows:
wherein j refers to the j-th iteration, n refers to the total number of iterations, and K_j(C_i) refers to the update coefficient of the weight of word C_i in the j-th iteration. Taking "shipping" and "vacation" as examples, W(shipping) ≈ 0.339 and W(vacation) ≈ 1.235.
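The exact weight-update formula sits in an unrecovered figure; one plausible reading, offered purely as an assumption, is that the idf initial weight is multiplied by the update coefficient produced in each iteration:

```python
def updated_weight(idf_value, coefficients):
    """Hypothetical reading of the update: W(C_i) = idf(C_i) multiplied by
    K_j(C_i) for each iteration j = 1..n. Not confirmed by the source text."""
    w = idf_value
    for k in coefficients:
        w *= k
    return w

# Purely illustrative numbers: an idf of 0.5 scaled over three iterations.
w = updated_weight(0.5, [1.24, 1.1, 1.05])
```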
Part-of-speech weights are assigned in increments of 0.1 within the range [0.2, 2] according to the part of speech, yielding the part-of-speech weight vocabulary F, as shown in Table 2 below.
TABLE 2 part-of-speech weight vocabulary
The vector M of each text is calculated, and similarity is compared using cosine similarity. For example, the sentence vector of text D_i is calculated as follows:
wherein D_ij refers to the j-th word of text D_i, |D_i| refers to the number of words in text D_i, and the parameter α is searched iteratively over [0, 1] for its optimal value. The test results are shown in Table 3 below:
table 3 comparison of test results
The similarity between text vectors is calculated using cosine similarity, and the standard question with the highest similarity is taken as the correct target result.
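The final matching step can be sketched as below. The sentence-vector construction here (word vectors scaled by W × F, then averaged) is an assumption, since the patent's exact combination, including the α parameter, is in an unrecovered formula; the cosine-similarity ranking itself follows the text:

```python
import math

def sentence_vector(words, vec, w, f):
    """Average of word vectors, each scaled by its word weight W and
    part-of-speech weight F (assumed combination)."""
    dim = len(next(iter(vec.values())))
    total = [0.0] * dim
    for word in words:
        for d in range(dim):
            total[d] += w[word] * f[word] * vec[word][d]
    return [x / len(words) for x in total]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy data: pick the standard question whose vector is closest to the query.
vec = {"shipping": [1.0, 0.0], "vacation": [0.0, 1.0]}
w = {"shipping": 1.24, "vacation": 0.2}
f = {"shipping": 1.0, "vacation": 1.0}
query = sentence_vector(["shipping"], vec, w, f)
candidates = {"q1": ["shipping"], "q2": ["vacation"]}
best = max(candidates,
           key=lambda q: cosine(query, sentence_vector(candidates[q], vec, w, f)))
```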
It should be apparent to those skilled in the art that various modifications or variations can be made to the present invention, based on its technical solutions, without any inventive effort.

Claims (3)

1. A text similarity calculation method based on bootstrapping, characterized by comprising the following steps:
step one, calculating the inverse document frequency of each word as the initial value of its word weight;
step two, selecting an initial core vocabulary according to the inverse document frequency;
step three, calculating the co-occurrence matrix of the words in the text;
step four, calculating, according to the bootstrapping algorithm, the correlation between candidate words and the initial core words as the coefficient for updating the weights;
step five, calculating sentence vectors from the word vectors V, the word weights W and the part-of-speech weights F;
the correlation in step four is calculated as follows:

T(S_i, R_j) = log2(F(S_i, R_j)) × F(S_i, R_j) / F(R_j)

wherein T(S_i, R_j) refers to the correlation between the initial core word S_i and the candidate word R_j, S_i refers to the i-th word in the initial core vocabulary S, R_j refers to the j-th candidate word, F(S_i, R_j) refers to the co-occurrence frequency of the initial core word S_i and the candidate word R_j, and F(R_j) refers to the frequency of documents containing the candidate word R_j;
the coefficient for updating the weight in step four is calculated as follows:

K(R_j) = max_i T(S_i, R_j) × ( max_i T(S_i, R_j) − (1/|S|) Σ_{i=1..|S|} T(S_i, R_j) )

wherein K(R_j) refers to the update coefficient of the weight of candidate word R_j, max_i T(S_i, R_j) refers to the maximum correlation, and |S| refers to the number of words in the core vocabulary S;
and in step four, the update coefficient is updated once per iteration, and the word weight is updated according to the following formula:
wherein W(C_i) refers to the weight of word C_i, j refers to the j-th iteration, n refers to the total number of iterations, and K_j(C_i) refers to the update coefficient of the weight of word C_i in the j-th iteration.
2. The text similarity calculation method based on bootstrapping according to claim 1, wherein the word weight in step one is a numerical value representing the word in the text, and a real-valued vector of the text is generated.
3. The text similarity calculation method based on bootstrapping according to claim 1, wherein the sentence vector in step five is calculated as follows:
wherein D_ij refers to the j-th word of text D_i, |D_i| refers to the number of words in text D_i, and the parameter α is searched iteratively over [0, 1] for its optimal value.
CN201810400574.6A 2018-04-28 2018-04-28 Text similarity calculation method based on bootstrapping Active CN110413956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810400574.6A CN110413956B (en) 2018-04-28 2018-04-28 Text similarity calculation method based on bootstrapping


Publications (2)

Publication Number Publication Date
CN110413956A CN110413956A (en) 2019-11-05
CN110413956B true CN110413956B (en) 2023-08-01

Family

ID=68357008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810400574.6A Active CN110413956B (en) 2018-04-28 2018-04-28 Text similarity calculation method based on bootstrapping

Country Status (1)

Country Link
CN (1) CN110413956B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111159404B (en) * 2019-12-27 2023-09-19 海尔优家智能科技(北京)有限公司 Text classification method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010198141A (en) * 2009-02-23 2010-09-09 Rakuten Inc Device, method and program for preparing database in which phrase included in document classified by category
CN106156196A (en) * 2015-04-22 2016-11-23 富士通株式会社 Extract the apparatus and method of text feature
WO2017084267A1 (en) * 2015-11-18 2017-05-26 乐视控股(北京)有限公司 Method and device for keyphrase extraction
CN107153658A (en) * 2016-03-03 2017-09-12 常州普适信息科技有限公司 A kind of public sentiment hot word based on weighted keyword algorithm finds method
CN107220237A (en) * 2017-05-24 2017-09-29 南京大学 A kind of method of business entity's Relation extraction based on convolutional neural networks
CN107862620A (en) * 2017-12-11 2018-03-30 四川新网银行股份有限公司 A kind of similar users method for digging based on social data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170139899A1 (en) * 2015-11-18 2017-05-18 Le Holdings (Beijing) Co., Ltd. Keyword extraction method and electronic device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Ontology Learning and Its Application in Semantic Retrieval; Liu Ting; China Masters' Theses Full-text Database, Information Science and Technology; 2012-04-15; pp. 1-55 *

Also Published As

Publication number Publication date
CN110413956A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN106997376B (en) Question and answer sentence similarity calculation method based on multi-level features
CN109408526B (en) SQL sentence generation method, device, computer equipment and storage medium
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN109670191B (en) Calibration optimization method and device for machine translation and electronic equipment
CN111427995B (en) Semantic matching method, device and storage medium based on internal countermeasure mechanism
CN111159359B (en) Document retrieval method, device and computer readable storage medium
CN110929038B (en) Knowledge graph-based entity linking method, device, equipment and storage medium
US20100153315A1 (en) Boosting algorithm for ranking model adaptation
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
US11461613B2 (en) Method and apparatus for multi-document question answering
CN108475262A (en) Electronic equipment and method for text-processing
CN108021555A (en) A kind of Question sentence parsing measure based on depth convolutional neural networks
CN110929498B (en) Method and device for calculating similarity of short text and readable storage medium
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN109766547B (en) Sentence similarity calculation method
CN112434134B (en) Search model training method, device, terminal equipment and storage medium
CN113220864B (en) Intelligent question-answering data processing system
CN111966810A (en) Question-answer pair ordering method for question-answer system
CN110347833B (en) Classification method for multi-round conversations
WO2020100738A1 (en) Processing device, processing method, and processing program
CN110728135A (en) Text theme indexing method and device, electronic equipment and computer storage medium
CN110413956B (en) Text similarity calculation method based on bootstrapping
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN116910599A (en) Data clustering method, system, electronic equipment and storage medium
CN110442681A (en) A kind of machine reads method, electronic equipment and the readable storage medium storing program for executing of understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant