CN110399458B - Text similarity calculation method based on latent semantic analysis and random projection - Google Patents
- Publication number
- CN110399458B (application number CN201910598004.7A)
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text similarity calculation method based on latent semantic analysis (LSA) and random projection (RP), which is suitable for general unsupervised text clustering problems. The method first converts the tag text to be processed into a bag-of-words model and weights it with the TF-IDF algorithm to obtain a set of weight vectors. It then processes the weight vector set with the LSA algorithm to obtain an LSA index library, and with the random projection algorithm to obtain an RP index library. Finally, after TF-IDF processing, the corpus to be calculated is processed with the LSA and RP algorithms respectively and compared against the contents of the index libraries to obtain the text similarity. The invention can calculate an effective similarity of text content and recommend related content through texts with higher similarity.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a text similarity calculation method based on latent semantic analysis and random projection.
Background
In conventional text recommendation algorithms, researchers typically perform similarity calculation with a plain vector space model or the like. By exploring the semantic associations among texts, the invention combines latent semantic analysis with random projection to provide a reliable text similarity calculation method for related systems.
Latent semantic analysis:
latent semantic analysis is an algebraic model of information retrieval and a computational theory and method for knowledge acquisition and representation. It analyzes a large text collection with statistical methods to extract the latent semantic structure among words, and uses this structure to represent words and texts, thereby eliminating the correlation among words and simplifying the text vectors to achieve dimensionality reduction.
Random projection:
random projection is a simple and effective dimensionality reduction method that differs from feature extraction methods based on the data's appearance: the projection is completely independent of the original sample dataset and does not cause significant distortion of the data. The reduced data still retains the important feature information contained in the original high-dimensional data, and the transformation matrix is obtained without any matrix decomposition, which greatly improves the ability to process data in real time.
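As a minimal numeric illustration of this property, the following Python sketch (dimensions, seed, and data are made-up toy values, not taken from the invention) projects a few high-dimensional vectors through a Gaussian random matrix and checks that a pairwise distance is roughly preserved, with no matrix decomposition involved:

```python
import numpy as np

# Toy random projection: reduce 1000-dim vectors to 100 dims with a Gaussian
# random matrix R; pairwise distances are approximately preserved and no
# matrix decomposition is required (all values here are illustrative).
rng = np.random.default_rng(42)
d, k, n = 1000, 100, 5
X = rng.normal(size=(n, d))                # original high-dimensional data
R = rng.normal(size=(d, k)) / np.sqrt(k)   # random transformation matrix
Y = X @ R                                  # reduced representation

orig = np.linalg.norm(X[0] - X[1])         # distance before projection
proj = np.linalg.norm(Y[0] - Y[1])         # distance after projection
print(Y.shape, round(proj / orig, 2))
```

The distance ratio concentrates around 1 as the target dimension grows, which is the Johnson-Lindenstrauss behavior the passage alludes to.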
For the traditional text similarity comparison problem, existing papers mainly use simple common-word counts for similarity comparison, but this approach performs poorly on the semantic and topical associations among documents.
Disclosure of Invention
The invention aims to: in view of the above problems, the invention provides a text similarity calculation method based on latent semantic analysis and random projection, which overcomes the limitations of traditional calculation methods, reduces the dimensionality of the vector space in several ways, and effectively improves the accuracy and reliability of the results.
The technical scheme is as follows: the invention provides a text similarity calculation method based on latent semantic analysis and random projection, comprising the following steps:
(1) vectorize the tags to obtain the entry tag vector set V1, and apply the TF-IDF algorithm to obtain the tag weight vector set V2;
(2) apply the LSA algorithm to V2 to obtain the LSA model M1 and the index library I1;
(3) apply the random projection algorithm to V2 to obtain the RP model M2 and the index library I2;
(4) process the corpus to be processed with TF-IDF, then apply LSA and RP processing to obtain the final recommendation set.
Further, the specific steps for obtaining the tag weight vector set V2 in step (1) are as follows:
(1.1) define D1 as the encyclopedia entry dataset, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 represent the number, title, paragraph, picture link, web page link and entry tag, respectively;
(1.2) apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry dataset, A ∈ [1, n];
(1.3) apply the Dictionary method to T1 to obtain the dictionary Dict1;
(1.4) save the dictionary Dict1 locally;
(1.5) apply the Doc2Bow method to T1 to obtain the entry tag vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry tag vector of V1, A ∈ [1, n];
(1.6) apply the TF-IDF method to V1 to obtain the entry tag weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry tag weight vector of V2, A ∈ [1, n];
(1.7) save the tag weight vector set V2 locally.
Further, the specific steps of obtaining the LSA model M1 and the index library I1 by applying the LSA algorithm to V2 in step (2) are as follows:
(2.1) load the tag weight vector set V3 from local, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry tag weight vector set V3, B ∈ [1, n];
(2.2) load the dictionary Dict2 from local;
(2.3) define id2word = Dict2 and the topic number num_topics = 300;
(2.4) train V3 with the LSA method, passing in the parameters id2word and num_topics to obtain the model M1;
(2.5) process V3 through the model M1 to obtain the packaged corpus C1;
(2.6) build an index library over C1 to obtain the index library I1;
(2.7) save the model M1 and the index library I1.
Further, the specific steps of obtaining the RP model M2 and the index library I2 by applying the random projection algorithm to V2 in step (3) are as follows:
(3.1) load the tag weight vector set V4 from local, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry tag weight vector set V4, C ∈ [1, n];
(3.2) define the topic number num_topics = 500;
(3.3) train V4 with the RP method, passing in the parameter num_topics to obtain the model M2;
(3.4) process V4 through the model M2 to obtain the packaged corpus C2;
(3.5) build an index library over C2 to obtain the index library I2;
(3.6) save the model M2 and the index library I2.
Further, the specific steps for obtaining the final recommendation set in step (4) are as follows:
(4.1) define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 represent the number, title, paragraph, picture link, web page link and entry tag, respectively;
(4.2) taking title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry dataset, D ∈ [1, n];
(4.3) apply the Doc2Bow method to T2 to obtain the entry tag vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry tag vector of V5, E ∈ [1, n];
(4.4) apply the TF-IDF method to V5 to obtain the entry tag weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry tag weight vector of V6, F ∈ [1, n];
(4.5) define the variable k = 1 as a loop variable for traversing V6;
(4.6) define the sets R1, R2 and R3, R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 is the empty set; sim_iG and sim_jG denote the G-th similarity sets in R1 and R2, respectively, with initial value null, G ∈ [1, n];
(4.7) import the LSA model M3 and the random projection model M4, and import the LSA index library I3 and the random projection index library I4;
(4.8) if k <= n, go to step (4.9); otherwise go to step (4.14);
(4.9) package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
(4.10) retrieve the index library I3 with vec1_k, compute the cosine similarity between vec1_k and the elements of I3, and store the result in sim_ik; retrieve the index library I4 with vec2_k, compute the cosine similarity between vec2_k and the elements of I4, and store the result in sim_jk;
(4.11) add sim_ik and sim_jk element-wise and take the average to obtain sim_lk;
(4.12) insert sim_lk into R3;
(4.13) k = k + 1, go to step (4.8);
(4.14) take the 8 elements with the highest similarity from each set in R3 to form a set and store it in the result set R4; each element of R4 is a recommendation set.
The invention adopts the above technical scheme and has the following beneficial effects:
The method performs text similarity calculation on entries obtained from Baidu Encyclopedia using latent semantic analysis and random projection; it overcomes the limitations of traditional calculation methods, reduces the dimensionality of the vector space in several ways, and effectively improves the accuracy and reliability of the results.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a specific flow chart of the TF-IDF algorithm in FIG. 1;
FIG. 3 is a specific flow chart of the LSA algorithm in FIG. 1;
FIG. 4 is a specific flow chart of the random projection algorithm in FIG. 1;
FIG. 5 is a specific flow chart of the similarity recommendation function in FIG. 1.
Detailed Description
The present invention is further illustrated below in conjunction with specific embodiments. It should be understood that these embodiments are merely illustrative of the invention and do not limit its scope; modifications of various equivalent forms made by those skilled in the art after reading the invention fall within the scope defined by the claims appended hereto.
As shown in fig. 1 to 5, the text similarity calculation method based on latent semantic analysis and random projection according to the present invention includes the following steps:
step 1: the tag is vectorized to obtain an entry tag vector set V1, and a TF-IDF algorithm is used for the entry tag vector set to obtain a tag weight vector set V2, and the specific method is as follows:
step 1.1: define D1 as an encyclopedic entry dataset, d1= { id1, title1, parameter 1, image1, url1, tag1}, where id1, title1, parameter 1, image1, url1, tag1 represent numbers, titles, paragraphs, picture links, web links, and entry tags, respectively;
step 1.2: t1= { w was obtained by using split method on tag1 i1 ,w i2 ,…,w in },w iA Is the A-th entry tag set of encyclopedic entry data set, wherein the variable A epsilon [1, n];
Step 1.3: dictionary 1 is obtained by using Dictionary method on T1;
step 1.4: storing dictionary Dict1 to local;
step 1.5: obtaining an entry tag vector set V1= { V by using a Doc2Bow method on T1 i1 ,v i2 ,…,v in },v iA Is the A-th entry tag vector of the entry tag vector set V1, where the variables A ε [1, n];
Step 1.6: obtaining a term tag weight vector set V2= { V by performing TF-IDF method on V1 j1 ,v j2 ,…,v jn },v jA Is the A-th term tag weight vector of the term tag weight vector set V2, wherein the variables A E [1, n ]];
Step 1.7: the set of tag weight vectors V2 is saved locally.
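Steps 1.1 to 1.7 can be sketched in Python. The wording of the steps (Dictionary, Doc2Bow) suggests the gensim library, but the sketch below is a dependency-free approximation using only the standard library; the sample tags and intermediate names are illustrative assumptions, not data from the patent:

```python
import math
from collections import Counter

# Dependency-free sketch of Steps 1.1-1.7; the tag strings below are
# made-up examples, not entries from the patent's encyclopedia dataset.
tags = ["qin dynasty history", "han dynasty history", "song poetry"]
T1 = [t.split() for t in tags]                 # Step 1.2: split method

# Step 1.3: Dictionary method - map each word to an integer id
Dict1 = {w: i for i, w in enumerate(sorted({w for doc in T1 for w in doc}))}

# Step 1.5: Doc2Bow method - each document becomes sparse (word id, count) pairs
V1 = [sorted(Counter(Dict1[w] for w in doc).items()) for doc in T1]

# Step 1.6: TF-IDF method - re-weight counts by inverse document frequency
n_docs = len(V1)
df = Counter(wid for doc in V1 for wid, _ in doc)   # document frequency
V2 = [[(wid, tf * math.log(n_docs / df[wid])) for wid, tf in doc] for doc in V1]
print(V2[2])  # the rare words "song" and "poetry" receive the largest weights
```

Saving Dict1 and V2 locally (Steps 1.4 and 1.7) would simply serialize these structures, e.g. with the standard pickle module.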
Step 2: apply the LSA algorithm to V2 to obtain the LSA model M1 and the index library I1. The specific method is as follows:
Step 2.1: load the tag weight vector set V3 from local, V3 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry tag weight vector set V3, B ∈ [1, n];
Step 2.2: load the dictionary Dict2 from local;
Step 2.3: define id2word = Dict2 and the topic number num_topics = 300;
Step 2.4: train V3 with the LSA method, passing in the parameters id2word and num_topics to obtain the model M1;
Step 2.5: process V3 through the model M1 to obtain the packaged corpus C1;
Step 2.6: build an index library over C1 to obtain the index library I1;
Step 2.7: save the model M1 and the index library I1.
Step 3: apply the random projection algorithm to V2 to obtain the RP model M2 and the index library I2. The specific method is as follows:
Step 3.1: load the tag weight vector set V4 from local, V4 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry tag weight vector set V4, C ∈ [1, n];
Step 3.2: define the topic number num_topics = 500;
Step 3.3: train V4 with the RP method, passing in the parameter num_topics to obtain the model M2;
Step 3.4: process V4 through the model M2 to obtain the packaged corpus C2;
Step 3.5: build an index library over C2 to obtain the index library I2;
Step 3.6: save the model M2 and the index library I2.
Step 4: process the corpus to be processed with TF-IDF, then apply LSA and RP to obtain the final recommendation set. The specific method is as follows:
Step 4.1: define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 represent the number, title, paragraph, picture link, web page link and entry tag, respectively;
Step 4.2: taking title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry dataset, D ∈ [1, n];
Step 4.3: apply the Doc2Bow method to T2 to obtain the entry tag vector set V5 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry tag vector of V5, E ∈ [1, n];
Step 4.4: apply the TF-IDF method to V5 to obtain the entry tag weight vector set V6 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry tag weight vector of V6, F ∈ [1, n];
Step 4.5: define the variable k = 1 as a loop variable for traversing V6;
Step 4.6: define the sets R1, R2 and R3, R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 is the empty set; sim_iG and sim_jG denote the G-th similarity sets in R1 and R2, respectively, with initial value null, G ∈ [1, n];
Step 4.7: import the LSA model M3 and the random projection model M4, and import the LSA index library I3 and the random projection index library I4;
Step 4.8: if k <= n, go to step 4.9; otherwise go to step 4.14;
Step 4.9: package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
Step 4.10: retrieve the index library I3 with vec1_k, compute the cosine similarity between vec1_k and the elements of I3, and store the result in sim_ik; retrieve the index library I4 with vec2_k, compute the cosine similarity between vec2_k and the elements of I4, and store the result in sim_jk;
Step 4.11: add sim_ik and sim_jk element-wise and take the average to obtain sim_lk;
Step 4.12: insert sim_lk into R3;
Step 4.13: k = k + 1, go to step 4.8;
Step 4.14: take the 8 elements with the highest similarity from each set in R3 to form a set and store it in the result set R4; each element of R4 is a recommendation set.
The following is a table of variables according to the present invention:
table 1 global variable table
Table 2 step 1 variable table
Variable | Definition |
---|---|
D1 | Encyclopedia entry dataset |
id1 | Entry dataset number |
title1 | Entry dataset title |
paragraph1 | Entry dataset paragraph |
image1 | Entry dataset picture link |
url1 | Entry dataset web page link |
tag1 | Entry tag |
T1 | Collection of entry tag sets |
w_iA | The A-th entry tag set in the collection of entry tag sets |
Dict1 | Dictionary |
v_iA | The A-th entry tag vector of V1 |
v_jA | The A-th entry tag weight vector of V2 |
TABLE 3 step 2 variable table
Variable | Definition |
---|---|
V3 | Tag weight vector set |
v_kB | The B-th weight vector of V3 |
Dict2 | Dictionary |
id2word | Parameter 1 of the LSA method (the dictionary) |
num_topics | Parameter 2 of the LSA method |
C1 | Corpus obtained by LSA packaging of V3 |
Table 4 step 3 variable table
Variable | Definition |
---|---|
V4 | Tag weight vector set |
v_lC | The C-th weight vector of V4 |
num_topics | Parameter of the RP method |
C2 | Corpus obtained by RP packaging of V4 |
Table 5 Step 4 variable table
Variable | Definition |
---|---|
D2 | Encyclopedia entry test set |
T2 | Collection of entry tag sets |
V5 | Entry tag vector set of the test set |
V6 | Entry tag weight vector set of the test set |
k | Loop variable for traversing V6 |
R1, R2 | LSA and RP similarity sets |
R3 | Averaged similarity set |
R4 | Result set of recommendation sets |
vec1_k | LSA-packaged vector of v_ok |
vec2_k | RP-packaged vector of v_ok |
To illustrate the effectiveness of the method, 46,935 history entries were processed; the entry tags were projected into topic spaces with latent semantic analysis and random projection, and the results were calculated with cosine similarity. The method effectively overcomes the limitations of traditional text similarity calculation and optimizes it through the combination of the two algorithms.
The invention can be combined with a computer system to complete the text similarity calculation and output a recommendation set.
The invention creatively combines the latent semantic analysis method with the random projection method to project the text and obtain an optimal recommendation result.
The foregoing is merely an embodiment of the present invention and is not intended to limit it. All equivalents and alternatives falling within the spirit of the invention are intended to be included within its scope. Anything not described in detail in the invention should be considered prior art known to those skilled in the art.
Claims (2)
1. The text similarity calculation method based on latent semantic analysis and random projection is characterized by comprising the following steps:
(1) vectorize the tags to obtain the entry tag vector set V1, and apply the TF-IDF algorithm to obtain the entry tag weight vector set V2;
(2) apply the LSA algorithm to V2 to obtain the LSA model M1 and the index library I1;
(3) apply the random projection algorithm to V2 to obtain the RP model M2 and the index library I2;
(4) process the corpus to be processed with TF-IDF, then apply LSA and RP processing to obtain the final recommendation set;
the specific steps of obtaining the LSA model M1 and the index library I1 by applying the LSA algorithm to V2 in step (2) are as follows:
(2.1) load the entry tag weight vector set V2 from local, V2 = {v_k1, v_k2, …, v_kn}, where v_kB is the B-th weight vector of the entry tag weight vector set V2, B ∈ [1, n];
(2.2) load the dictionary Dict2 from local;
(2.3) define id2word = Dict2 and the topic number num_topics = 300;
(2.4) train V2 with the LSA method, passing in the parameters id2word and num_topics to obtain the model M1;
(2.5) process V2 through the model M1 to obtain the packaged corpus C1;
(2.6) build an index library over C1 to obtain the index library I1;
(2.7) save the model M1 and the index library I1;
the specific steps of obtaining the RP model M2 and the index library I2 by applying the random projection algorithm to V2 in step (3) are as follows:
(3.1) load the entry tag weight vector set V2 from local, V2 = {v_l1, v_l2, …, v_ln}, where v_lC is the C-th weight vector of the entry tag weight vector set V2, C ∈ [1, n];
(3.2) define the topic number num_topics = 500;
(3.3) train V2 with the RP method, passing in the parameter num_topics to obtain the model M2;
(3.4) process V2 through the model M2 to obtain the packaged corpus C2;
(3.5) build an index library over C2 to obtain the index library I2;
(3.6) save the model M2 and the index library I2;
the specific steps for obtaining the final recommendation set in step (4) are as follows:
(4.1) define D2 as the encyclopedia entry test set, D2 = {id2, title2, paragraph2, image2, url2, tag2}, where id2, title2, paragraph2, image2, url2 and tag2 represent the number, title, paragraph, picture link, web page link and entry tag, respectively;
(4.2) taking title2 as input, apply the split method to tag2 to obtain T2 = {w_j1, w_j2, …, w_jn}, where w_jD is the D-th entry tag set of the encyclopedia entry dataset, D ∈ [1, n];
(4.3) apply the Doc2Bow method to T2 to obtain the entry tag vector set V1 = {v_m1, v_m2, …, v_mn}, where v_mE is the E-th entry tag vector of the entry tag vector set V1, E ∈ [1, n];
(4.4) apply the TF-IDF method to V1 to obtain the entry tag weight vector set V2 = {v_o1, v_o2, …, v_on}, where v_oF is the F-th entry tag weight vector of the entry tag weight vector set V2, F ∈ [1, n];
(4.5) define the variable k = 1 as a loop variable for traversing V2;
(4.6) define the sets R1, R2 and R3, R1 = {sim_i1, sim_i2, …, sim_in}, R2 = {sim_j1, sim_j2, …, sim_jn}, and R3 is the empty set; sim_iG and sim_jG denote the G-th similarity sets in R1 and R2, respectively, with initial value null, G ∈ [1, n];
(4.7) import the LSA model M1 and the random projection model M2, and import the LSA index library I1 and the random projection index library I2;
(4.8) if k <= n, go to step (4.9); otherwise go to step (4.14);
(4.9) package v_ok with the LSA method to obtain vec1_k, and package v_ok with the random projection method to obtain vec2_k;
(4.10) retrieve the index library I1 with vec1_k, compute the cosine similarity between vec1_k and the elements of I1, and store the result in sim_ik; retrieve the index library I2 with vec2_k, compute the cosine similarity between vec2_k and the elements of I2, and store the result in sim_jk;
(4.11) add sim_ik and sim_jk element-wise and take the average to obtain sim_lk;
(4.12) insert sim_lk into R3;
(4.13) k = k + 1, go to step (4.8);
(4.14) take the 8 elements with the highest similarity from each set in R3 to form a set and store it in the result set R4; each element of R4 is a recommendation set.
2. The text similarity calculation method based on latent semantic analysis and random projection according to claim 1, wherein the specific steps of obtaining the entry tag weight vector set V2 in step (1) are as follows:
(1.1) define D1 as the encyclopedia entry dataset, D1 = {id1, title1, paragraph1, image1, url1, tag1}, where id1, title1, paragraph1, image1, url1 and tag1 represent the number, title, paragraph, picture link, web page link and entry tag, respectively;
(1.2) apply the split method to tag1 to obtain T1 = {w_i1, w_i2, …, w_in}, where w_iA is the A-th entry tag set of the encyclopedia entry dataset, A ∈ [1, n];
(1.3) apply the Dictionary method to T1 to obtain the dictionary Dict1;
(1.4) save the dictionary Dict1 locally;
(1.5) apply the Doc2Bow method to T1 to obtain the entry tag vector set V1 = {v_i1, v_i2, …, v_in}, where v_iA is the A-th entry tag vector of the entry tag vector set V1, A ∈ [1, n];
(1.6) apply the TF-IDF method to V1 to obtain the entry tag weight vector set V2 = {v_j1, v_j2, …, v_jn}, where v_jA is the A-th entry tag weight vector of the entry tag weight vector set V2, A ∈ [1, n];
(1.7) save the entry tag weight vector set V2 locally.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910598004.7A CN110399458B (en) | 2019-07-04 | 2019-07-04 | Text similarity calculation method based on latent semantic analysis and random projection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110399458A CN110399458A (en) | 2019-11-01 |
CN110399458B true CN110399458B (en) | 2023-05-26 |
Family
ID=68323669
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910598004.7A Active CN110399458B (en) | 2019-07-04 | 2019-07-04 | Text similarity calculation method based on latent semantic analysis and random projection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110399458B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111581378B (en) * | 2020-04-28 | 2024-04-26 | 中国工商银行股份有限公司 | Method and device for establishing user consumption label system based on transaction data |
CN112884053B (en) * | 2021-02-28 | 2022-04-15 | 江苏匠算天诚信息科技有限公司 | Website classification method, system, equipment and medium based on image-text mixed characteristics |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107273474A (en) * | 2017-06-08 | 2017-10-20 | 成都数联铭品科技有限公司 | Autoabstract abstracting method and system based on latent semantic analysis |
CN109165382A (en) * | 2018-08-03 | 2019-01-08 | 南京工业大学 | A kind of similar defect report recommended method that weighted words vector sum latent semantic analysis combines |
Non-Patent Citations (2)
Title |
---|
Dimensionality Reduction by Random Projection and Latent Semantic Indexing; Jessica Lin et al.; CiteSeer; 2003-12-31; pp. 1-10 *
Program code similarity detection based on latent semantic analysis; Wang Jin; Technology Innovation and Application; 2012-12-31; p. 60 *
Also Published As
Publication number | Publication date |
---|---|
CN110399458A (en) | 2019-11-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111914558B (en) | Course knowledge relation extraction method and system based on sentence bag attention remote supervision | |
CN109885692B (en) | Knowledge data storage method, apparatus, computer device and storage medium | |
CN107133213B (en) | Method and system for automatically extracting text abstract based on algorithm | |
CN106844658B (en) | Automatic construction method and system of Chinese text knowledge graph | |
CN106446148A (en) | Cluster-based text duplicate checking method | |
CN106126619A (en) | A kind of video retrieval method based on video content and system | |
CN105469096A (en) | Feature bag image retrieval method based on Hash binary code | |
CN113569050B (en) | Method and device for automatically constructing government affair field knowledge map based on deep learning | |
US9146988B2 (en) | Hierarchal clustering method for large XML data | |
CN113962293B (en) | LightGBM classification and representation learning-based name disambiguation method and system | |
Zhang et al. | Continuous word embeddings for detecting local text reuses at the semantic level | |
CN110399458B (en) | Text similarity calculation method based on latent semantic analysis and random projection | |
Odeh et al. | Arabic text categorization algorithm using vector evaluation method | |
CN112257386B (en) | Method for generating scene space relation information layout in text-to-scene conversion | |
Dawar et al. | Comparing topic modeling and named entity recognition techniques for the semantic indexing of a landscape architecture textbook | |
Ghosh | Sentiment analysis of IMDb movie reviews: A comparative study on performance of hyperparameter-tuned classification algorithms | |
Dhillon et al. | Semi-supervised multi-task learning of structured prediction models for web information extraction | |
CN105426490A (en) | Tree structure based indexing method | |
Pu et al. | A vision-based approach for deep web form extraction | |
CN114116953A (en) | Efficient semantic expansion retrieval method and device based on word vectors and storage medium | |
KR101240330B1 (en) | System and method for mutidimensional document classification | |
CN112836014A (en) | Multi-field interdisciplinary-oriented expert selection method | |
Wang et al. | Summarizing the differences from microblogs | |
Bakr et al. | Efficient incremental phrase-based document clustering | |
Kutty et al. | XML Documents Clustering Using Tensor Space Model--A Preliminary Study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 20191101; Assignee: Fanyun software (Nanjing) Co.,Ltd.; Assignor: HUAIYIN INSTITUTE OF TECHNOLOGY; Contract record no.: X2023980052895; Denomination of invention: A Text Similarity Calculation Method Based on Latent Semantic Analysis and Random Projection; Granted publication date: 20230526; License type: Common License; Record date: 20231219 |